After preparing the dataset, the next step is to partition the data into training and testing subsets so that the machine learning model can be evaluated effectively. Let's explore the partitioning process in detail.
To build and evaluate a predictive model effectively, the input dataset is randomly partitioned into two disjoint subsets: training and testing. If an 'ID' column is not explicitly specified, the first column of the dataframe is assumed to contain it. For more details, refer to [1].
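For illustration, a basic random split that relies on this default ID detection might look like the following minimal sketch (the stratified split actually used in this project appears further below):

```python
from hana_ml.algorithms.pal.partition import train_test_val_split

# Random 70/30 split; since id_column is not given, the first column
# of hdf_employeechurn is assumed to contain the ID
train, test, val = train_test_val_split(data=hdf_employeechurn,
                                        partition_method='random',
                                        training_percentage=0.70,
                                        testing_percentage=0.30,
                                        validation_percentage=0.00)
```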
There is no universally optimal partition percentage for the subsets mentioned above. The choice of partitioning should align with the specific objectives of the predictive project.
Common partition percentages include:
- Training: 80% / Testing: 20%
- Training: 70% / Testing: 30%
- Training: 60% / Testing: 40%
In this case, the goal is to maximize the data available for training while ensuring enough data points remain for a robust model evaluation. Therefore, the following data split has been chosen:
Training: 85% / Testing: 15%

Additionally, the training subset is further summarized by categorizing employees into two groups:
- Employees who remained with the company in the last 12 months.
- Employees who left within the same period.
The 'FLIGHT_RISK' column indicates whether an employee has left the company in the past 12 months. Because the partition is stratified on this column, both subsets preserve the same proportion of leavers and stayers. In the aggregation output below, the 'N' column holds the count of employees in each category.
```python
from hana_ml.algorithms.pal.partition import train_test_val_split

# Split the employee churn dataframe into training and test subsets,
# stratified on FLIGHT_RISK so both subsets keep the same churn ratio
df_train, df_test, df_val = train_test_val_split(data=hdf_employeechurn,
                                                 id_column='EMPLOYEE_ID',
                                                 random_seed=1234,
                                                 partition_method='stratified',
                                                 stratified_column='FLIGHT_RISK',
                                                 training_percentage=0.85,
                                                 testing_percentage=0.15,
                                                 validation_percentage=0.00)

# Count employees per FLIGHT_RISK category in the training subset
df_train.agg([('count', 'EMPLOYEE_ID', 'N')], group_by='FLIGHT_RISK').collect()
```
| ITEM_NUMBER | FLIGHT_RISK | N |
|---|---|---|
| 0 | No | 14463 |
| 1 | Yes | 1785 |
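To verify that the stratified split preserved the class balance, the same aggregation can be run on the testing subset. A minimal sketch, assuming the df_test dataframe returned above:

```python
# Count employees per FLIGHT_RISK category in the testing subset;
# with a stratified partition, the Yes/No ratio should closely match
# the training subset (roughly 11% 'Yes')
df_test.agg([('count', 'EMPLOYEE_ID', 'N')], group_by='FLIGHT_RISK').collect()
```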