Exploring and Manipulating Data with SAP HANA DataFrames

Objective

After completing this lesson, you will be able to demonstrate how to access and visualize data using SAP HANA DataFrames.

Loading the Employee Churn Dataset

The Employee Churn dataset is freely available and can be uploaded from [1]. When it is loaded, it can be explored using SAP HANA DataFrames, which provide a high performance, columnar, in-memory data structure for efficient data manipulation and analysis within SAP HANA Cloud.

To begin, the dataset is read into a Pandas DataFrame using the read_csv() function, and the first five observations are displayed to get an initial overview:

Code Snippet
12
df_data = pd.read_csv("./Emp_Churn_Train.csv", sep = ',') df_data.head(5)

Overview

ITEM _NUMBEREMPLOYEE_IDAGEAGE_GROUP10AGE_GROUP5GENERATIONCRITICAL_JOB_ROLERISK_OF_LOSSIMPACT_OF_LOSSFUTURE_LEADERGENDER ...CURRENT_COUNTRYCURCOUNTRYLATCURCOUNTRYLONPROMOTION_WITHIN_LAST_3_YEARSCHANGED_POSITION_WITHIN_LAST_2_YEARSCHANGE_IN_PERFORMANCE_RATINGFUNCTIONALAREACHANGETYPE

JOBLEVELCHANGETYPE

HEADSFLIGHT_RISK
01003233(25-35](30-35]Generation YCriticalHighHighNo Future LeaderFemale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1No
11003343(35-45](40-45]Generation XCriticalLowHighNo Future LeaderFemale ...Germany51.0834210.423447No PromotionNo Change0 - Not availableNo changeNo change1No
21003433(25-35](30-35]Generation YCriticalMediumHighNo Future LeaderFemale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableExternal HireExternal Hire1No
31003533(25-35](30-35]Generation YCriticalHighHighNo Future LeaderMale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1No
41003633(25-35](30-35]Generation YCriticalLowLowNo Future LeaderMale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1Yes

5 rows x 43 columns

‎‎Additional materials related to the Employee Churn scenario can be found in the same folder [2].

Creating a SAP HANA DataFrame

To leverage SAP HANA’s in-memory processing capabilities, the dataset must be converted into a SAP HANA DataFrame. This can be done using the hana_ml.dataframe.create_dataframe_from_pandas() function [1], which creates a table in SAP HANA:

Code Snippet
123456
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, pandas_df = df_data, table_name = 'EMPLOYEE_CHURN_DATA', force = True, replace = False) display(df_remote.head(5).collect())
100%|██████████| 1/1 [00:00<00:00, 1.32it/s]
ITEM_NUMBEREMPLOYEE_IDAGEAGE_GROUP10AGE_GROUP5GENERATIONCRITICAL_JOB_ROLERISK_OF_LOSSIMPACT_OF_LOSSFUTURE_LEADERGENDER ...CURRENT_COUNTRYCURCOUNTRYLATCURCOUNTRYLONPROMOTION_WITHIN_LAST_3_YEARSCHANGED_POSITION_WITHIN_LAST_2_YEARSCHANGE_IN_PERFORMANCE_RATINGFUNCTIONALAREACHANGETYPEJOBLEVELCHANGETYPEHEADSFLIGHT_RISK
01003233(25-35](30-35]Generation YCriticalHighHighNo Future LeaderFemale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1No
11003343(35-45](40-45]Generation XCriticalLowHighNo Future LeaderFemale ...Germany51.0834210.423447No PromotionNo Change0 - Not availableNo changeNo change1No
21003433(25-35](30-35]Generation YCriticalMediumHighNo Future LeaderFemale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableExternal HireExternal Hire1No
31003533(25-35](30-35]Generation YCriticalHighHighNo Future LeaderMale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1No
41003633(25-35](30-35]Generation YCriticalLowLowNo Future LeaderMale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1Yes
51003733(25-35](30-35]Generation YNon-CriticalLowLowNo Future LeaderMale ...Mexico19.432601-99.133342No PromotionNo Change0 - Not availableNo changeNo change1Yes

5 rows × 43 columns

‏‏‎ ‎Additional materials related SAP HANA DataFrame can be found in the reference [1].

Loading Data into SAP HANA DataFrame

Alternatively, you can explicitly load the Employee Churn data into a SAP HANA DataFrame. The conn.table() method allows you to access the SAP HANA table:

Code Snippet
123
hdf_employeechurn = conn.table('EMPLOYEE_CHURN_DATA') display(hdf_employeechurn.select_statement) display(hdf_employeechurn.head(6).collect())
'SELECT * FROM "EMPLOYEE_CHURN_DATA"'
ITEM_NUMBEREMPLOYEE_IDAGEAGE_GROUP10AGE_GROUP5GENERATIONCRITICAL_JOB_ROLERISK_OF_LOSSIMPACT_OF_LOSSFUTURE_LEADERGENDER ...CURRENT_COUNTRYCURCOUNTRYLATCURCOUNTRYLONPROMOTION_WITHIN_LAST_3_YEARSCHANGED_POSITION_WITHIN_LAST_2_YEARSCHANGE_IN_PERFORMANCE_RATINGFUNCTIONALAREACHANGETYPEJOBLEVELCHANGETYPEHEADSFLIGHT_RISK
01003233(25-35](30-35]Generation YCriticalHighHighNo Future LeaderFemale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1No
11003343(35-45](40-45]Generation XCriticalLowHighNo Future LeaderFemale ...Germany51.0834210.423447No PromotionNo Change0 - Not availableNo changeNo change1No
21003433(25-35](30-35]Generation YCriticalMediumHighNo Future LeaderFemale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableExternal HireExternal Hire1No
31003533(25-35](30-35]Generation YCriticalHighHighNo Future LeaderMale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1No
41003633(25-35](30-35]Generation YCriticalLowLowNo Future LeaderMale ...USA39.78373-100.445882No PromotionNo Change0 - Not availableNo changeNo change1Yes
51003733(25-35](30-35]Generation YNon-CriticalLowLowNo Future LeaderMale ...Mexico19.432601-99.133342No PromotionNo Change0 - Not availableNo changeNo change1Yes

6 rows × 43 columns

Counting Rows in the Dataset

To compute the number of rows in the SAP HANA DataFrame, the DataFrame.count()[1] method is used. This method provides the total row count for the dataset, as displayed below:

Code Snippet
1
display(hdf_employeechurn.count())

19115

Aggregating Data by Grouping and Counting

Below, a SAP HANA DataFrame is returned with the group_by column along with the aggregates. This method supports all aggregation functions in the SAP HANA database instance, such as 'max', 'min', 'count', 'avg', 'sum', 'median', 'stddev', 'var'.

The name of the column in the returned DataFrame is the same as the original column, that is, FLIGHT_RISK. The FLIGHT_RISK column indicates whether an employee left the company in the last 12 months or not.

More information about the Employee Churn dataset can be explored in the folder Employee Chum dataset description [1] .

To count the number of employees based on 'FLIGHT_RISK', the following code can be used:

Code Snippet
1
hdf_employeechurn.agg([('count', 'EMPLOYEE_ID', 'N')], group_by='FLIGHT_RISK').collect()
ITEM_NUMBERFLIGHT_RISKN
0No17015
1Yes2100

References

Understanding Descriptive Statistics of Data Attributes/Columns

Next, the SQL query backing the returned SAP HANA DataFrame (with the group_by column) is displayed.

Code Snippet
1
hdf_employeechurn.agg([('count', 'EMPLOYEE_ID', 'N')], group_by='FLIGHT_RISK').select_statement

'SELECT "FLIGHT_RISK", count("EMPLOYEE_ID") AS "N" FROM (SELECT * FROM "EMPLOYEE_CHURN_DATA") AS "DT_6" GROUP BY "FLIGHT_RISK"'

Data manipulation and analysis are vital in Data Science. The SAP class method DataFrame.describe() returns a DataFrame containing various statistics of data attributes/columns [1]. These descriptive statistics summarize central tendency, dispersion, and the shape of a dataset's distribution, helping organize and present data meaningfully.

For example, row number 3 (shown below) displays the statistics for the 'SALARY' column, which represents an employee's salary in € (Euros). This column has '19,115' observations, with no null values, and its mean value is '€62,003.15', ranging from a minimum of '€4,900.00' to a maximum of '€120,000.00'. The median is '€63,000.00' [2], which splits the dataset into its higher and lower halves.

Code Snippet
1
hdf_employeechurn.describe().head(10).collect()
ITEM_NUMBERcolumncountuniquenullsmeanstdminmaxmedian25_percent_cont25_percent_disc50_percent_cont50_percent_disc75_percent_cont75_percent_disc
0EMPLOYEE_ID1911519115019589.0000005518.16953310032.00000029146.00000019589.0014810.50000014810.00000019589.00000019589.00000024367.50000024368.000000
1AGE1911540039.0749159.18886318.00000057.00000039.00000032.00000032.00000039.00000039.00000046.00000046.000000
2TENURE_MONTHS191154720140.363484103.1637661.000000486.000000120.00000055.00000055.000000120.000000120.000000209.000000209.000000
3SALARY1911514062003.15982224708.7918384900.000000120000.00000063000.00000040000.00000040000.00000063000.00000063000.00000070000.00000070000.000000
4PREVCOUNTRYLAT1259455652138.76848618.956135-41.50008364.68631440.03326535.00007435.00007440.03326540.03326551.08342051.083420
5PREVCOUNTRYLON125945565215.96006871.502329-107.991707172.83440810.423447-7.979460-7.97946010.42344710.42344778.66774378.667743
6TIMEINPREVPOSITIONMONTH1911558019.88590119.8645300.000000396.0000009.0000003.0000003.0000009.0000009.00000036.00000036.000000
7CURCOUNTRYLAT1911555039.74910918.354538-41.50008364.68631446.60335436.57484436.57484446.60335446.60335451.08342051.083420
8CURCOUNTRYLON191155504.28276369.918683-107.991707172.83440810.423447-7.979460-7.97946010.42344710.42344725.92091625.920916
9HEADS19115101.0000000.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.000000

Based on the dataset's statistical summary, the columns 'PREVCOUNTRYLAT' and 'PREVCOUNTRYLON' contain a significant number of null values—approximately 51% in each. Given this high proportion of missing data, these columns are dropped from the dataset for further analysis.

References

Log in to track your progress & complete quizzes