Handling Data for Model Training

Objectives

After completing this lesson, you will be able to:
  • Utilize SAP HANA DataFrames for data handling.
  • Visualize the attributes of the Employee Churn dataset.

Preparing Data for Employee Churn Prediction

The Python machine learning client for SAP HANA (hana-ml) provides an interface to leverage Predictive Analysis Library (PAL) and Automated Predictive Library (APL) functions within Python. It enables seamless execution of machine learning and predictive analytics workflows on SAP HANA dataframes, allowing for training, evaluation, and deployment of models directly within the SAP HANA environment.
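A typical hana-ml workflow starts by opening a connection to SAP HANA and wrapping a database table in a SAP HANA DataFrame, so that all subsequent operations are pushed down to the database. The following sketch is for illustration only; the connection parameters and the table name EMPLOYEE_CHURN_DATA are placeholders and must be replaced with the values of your own environment.

from hana_ml.dataframe import ConnectionContext

# Connect to the SAP HANA database (address, port, user, and password are placeholders)
conn = ConnectionContext(address='<hana_host>', port=443, user='<user>', password='<password>')

# Wrap an existing table in a SAP HANA DataFrame; no data is transferred to the client yet
hdf_employeechurn = conn.table('EMPLOYEE_CHURN_DATA')

# Inspect the column names and a small sample of rows
print(hdf_employeechurn.columns)
print(hdf_employeechurn.head(5).collect())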

This lesson focuses on preparing the Employee Churn dataset for model training by partitioning the data and structuring it for classification. This ensures the dataset is optimized for training a robust predictive model using PAL classification algorithms.

Partitioning the Input Data

After preparing the dataset, the next step is to partition the data into training and testing subsets so that the machine learning model can be evaluated effectively. Now, let's explore the partitioning process in detail.

To build and evaluate a predictive model effectively, the input dataset is randomly partitioned into two disjoint subsets: training and testing. If the 'ID' column is not explicitly specified, the first column of the dataframe is assumed to contain the 'ID'. For more details, refer to [1].

There is no universally optimal partition percentage for the subsets mentioned above. The choice of partitioning should align with the specific objectives of the predictive project.

Common partition percentages include:

  • Training: 80% / Testing: 20%
  • Training: 70% / Testing: 30%
  • Training: 60% / Testing: 40%

In this case, the goal is to maximize the data available for training while ensuring enough data points remain for a robust model evaluation. Therefore, the following data split has been chosen:

Training: 85% / Testing: 15%

Additionally, the training subset is further summarized by categorizing employees into two groups:

  • Employees who remained with the company in the last 12 months.
  • Employees who left within the same period.

The 'FLIGHT_RISK' column serves as an indicator, marking whether an employee has left the company in the past 12 months. The 'N' column represents the count of employees in each category.

Code Snippet
from hana_ml.algorithms.pal.partition import train_test_val_split

# Split the employee churn dataframe into training and test subsets,
# stratified on the target column 'FLIGHT_RISK'
df_train, df_test, df_val = train_test_val_split(data=hdf_employeechurn,
                                                 id_column='EMPLOYEE_ID',
                                                 random_seed=1234,
                                                 partition_method='stratified',
                                                 stratified_column='FLIGHT_RISK',
                                                 training_percentage=0.85,
                                                 testing_percentage=0.15,
                                                 validation_percentage=0.00)

#df_train.describe().collect()

# Count the employees in the training subset per FLIGHT_RISK category
df_train.agg([('count', 'EMPLOYEE_ID', 'N')], group_by='FLIGHT_RISK').collect()
ITEM_NUMBER   FLIGHT_RISK       N
0             No            14463
1             Yes            1785
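
As an optional check, assuming the dataframes created in the code snippet above, the same aggregation can be applied to the test subset and the row counts of the subsets can be compared to confirm that the stratified 85% / 15% split behaved as expected.

# Class distribution in the test subset
df_test.agg([('count', 'EMPLOYEE_ID', 'N')], group_by='FLIGHT_RISK').collect()

# Row counts per subset (roughly 85% / 15% / 0% of the input data)
print(df_train.count(), df_test.count(), df_val.count())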
