This section presents the predicted results generated by the trained model on a subset of 1000 rows from the testing dataset. The predictions are based on the model’s learned patterns and include key metrics that help assess its reliability.
Step 1: Selecting a Subset of Data
The 'FLIGHT_RISK' target column is dropped from the test dataset so that the model does not see the true labels, and the first 1000 rows are then selected.
# Drop the target column, then keep the first 1000 rows
hdf_new = df_test.drop('FLIGHT_RISK').head(1000)
display(hdf_new.collect())
ITEM_NUMBER | EMPLOYEE_ID | AGE | AGE_GROUP10 | AGE_GROUPS | GENERATION | CRITICAL_JOB_ROLE | RISK_OF_LOSS | IMPACT_OF_LOSS | FUTURE_LEADER | GENDER ... | CURRENT_REGION | CURRENT_COUNTRY | CURCOUNTRYLAT | CURCOUNTRYLON | PROMOTION_WITHIN_LAST_3_YEARS | CHANGED_POSITION_WITHIN_LAST_2_YEARS | CHANGE_IN_PERFORMANCE_RATING | FUNCTIONALAREACHANGETYPE | JOBLEVELCHANGETYPE | HEADS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10037 | 33 | (25-35] | (30-35] | Generation Y | Non-Critical | Low | Low | No Future Leader | Male ... | Americas | Mexico | 19.432601 | -99.133342 | No Promotion | No Change | 0 - Not available | No change | No change | 1 |
1 | 10045 | 33 | (25-35] | (30-35] | Generation Y | Critical | Medium | Medium | No Future Leader | Female ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1 |
2 | 10082 | 33 | (25-35] | (30-35] | Generation Y | Critical | Low | Medium | No Future Leader | Male ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1 |
3 | 10086 | 33 | (25-35] | (30-35] | Generation Y | Non-Critical | Medium | Low | Future Leader | Male ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1 |
4 | 10092 | 33 | (25-35] | (30-35] | Generation Y | Non-Critical | Low | Low | Future Leader | Male ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 16699 | 36 | (35-45] | (35-40] | Generation Y | Critical | Medium | Low | No Future Leader | Male ... | EMEA | Ireland | 52.865196 | -7.979460 | Promotion | No Change | 3 - Decreasing | Cross-Functional Move | Promotion | 1 |
996 | 16702 | 36 | (35-45] | (35-40] | Generation Y | Critical | High | Medium | No Future Leader | Female ... | EMEA | Denmark | 55.670249 | 10.333328 | No Promotion | Change | 1 - Increasing | Cross-Functional Move | Same Level | 1 |
997 | 16704 | 36 | (35-45] | (35-40] | Generation Y | Non-Critical | Medium | High | Future Leader | Male ... | APJ | China | 35.000074 | 104.999927 | No Promotion | No Change | 3 - Decreasing | Cross-Functional Move | Same Level | 1 |
998 | 16710 | 45 | (45-55] | (40-45] | Generation X | Critical | Medium | Medium | No Future Leader | Male ... | EMEA | Italy | 42.638426 | 12.674297 | Promotion | Change | 1 - Increasing | Cross-Functional Move | Promotion | 1 |
999 | 16731 | 50 | (45-55] | (45-50] | Generation X | Critical | High | High | No Future Leader | Male ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | Change | 2 - Constant | Cross-Functional Move | Same Level | 1 |
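For readers without a HANA connection, the drop-then-head selection in Step 1 can be sketched locally with pandas. This is only an analogy: `df_test_local` is a hypothetical stand-in for `df_test`, its 'FLIGHT_RISK' labels are invented for illustration, and `n` is set to 3 instead of 1000 to keep the toy data small.

```python
import pandas as pd

# Toy stand-in for df_test. The IDs, ages and genders mirror the first rows of
# the table above; the FLIGHT_RISK labels are invented for illustration.
df_test_local = pd.DataFrame({
    "EMPLOYEE_ID": [10037, 10045, 10082, 10086, 10092],
    "AGE": [33, 33, 33, 33, 33],
    "GENDER": ["Male", "Female", "Male", "Male", "Male"],
    "FLIGHT_RISK": ["No", "Yes", "No", "No", "Yes"],
})

# Drop the target column first, then keep the first n rows (n = 1000 in the text).
n = 3
subset_local = df_test_local.drop(columns=["FLIGHT_RISK"]).head(n)

print(subset_local.columns.tolist())
print(len(subset_local))
```

The order matters only for readability here: the label column is removed before any rows are passed to the model, so the prediction input cannot contain the true outcome.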
Step 2: Running Predictions
Predictions are generated using the trained model:
predicted_classification = hgbc.predict(hdf_new, key='EMPLOYEE_ID', attribution_method='tree-shap',
                                        missing_replacement='feature_marginalized')
Step 3: Filtering and Displaying Results
The table below shows a few example predictions. It includes the columns 'EMPLOYEE_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE', 'TOP 1', and 'PCT 1'.
The 'SCORE' [1] column holds the predicted class, while the 'CONFIDENCE' column gives the model's probability or confidence level for that prediction.
Additionally, the 'REASON_CODE' column shows which features contributed most to each individual prediction (that is, local feature importance or explainability) [2].
For instance, row 0 of the first table represents an employee classified as 'Yes' (indicating employee churn). The most influential feature for this classification is 'FUNCTIONALAREACHANGETYPE' (listed in the 'TOP 1' column), with a contribution of about 29.6% (shown in the 'PCT 1' column).
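import pandas as pd

# Show full column contents so the REASON_CODE JSON is not truncated
pd.set_option('display.max_colwidth', None)
# Keep only predicted churners and extract the top contributing feature and its share
display(predicted_classification.filter('"SCORE" = \'Yes\'').select(
    'EMPLOYEE_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE',
    ('json_query("REASON_CODE", \'$[0].attr\')', 'TOP 1'),
    ('json_query("REASON_CODE", \'$[0].pct\')', 'PCT 1')).head(3).collect())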
ITEM_NUMBER | EMPLOYEE_ID | SCORE | CONFIDENCE | REASON_CODE | TOP 1 | PCT 1 |
---|---|---|---|---|---|---|
0 | 10772 | Yes | 0.9805 | ... | "FUNCTIONALAREACHANGETYPE" | 29.6167 |
1 | 12996 | Yes | 0.6626 | ... | "FUNCTIONALAREACHANGETYPE" | 26.1833 |
2 | 16484 | Yes | 0.5256 | ... | "FUNCTIONALAREACHANGETYPE" | 36.9316 |
ITEM_NUMBER | EMPLOYEE_ID | SCORE | CONFIDENCE | REASON_CODE | TOP 1 | PCT 1 |
---|---|---|---|---|---|---|
0 | 27221 | Yes | 0.522728 | ... | "TIMEINPREVPOSITIONMONTH" | 34.21097120019775 |
1 | 27858 | Yes | 0.990345 | ... | "TIMEINPREVPOSITIONMONTH" | 41.194646979837177 |
2 | 28272 | Yes | 0.703382 | ... | "TIMEINPREVPOSITIONMONTH" | 30.169212529463996 |
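The JSON paths `$[0].attr` and `$[0].pct` in the query above imply that 'REASON_CODE' holds a JSON array of objects, with the strongest contributor in first position. A minimal client-side sketch of the same extraction follows. The sample string is hypothetical: only the `attr` and `pct` keys come from the query, the first entry reuses values from the first results table, and the second entry is invented.

```python
import json

# Hypothetical REASON_CODE value. Only the "attr" and "pct" keys are implied by
# the JSON paths used in the query; the second entry and its number are invented.
reason_code = json.dumps([
    {"attr": "FUNCTIONALAREACHANGETYPE", "pct": 29.6167},
    {"attr": "AGE", "pct": 12.5},
])


def top_reason(reason_code_json, rank=0):
    """Return (attribute, percentage) for the entry at position `rank`."""
    entries = json.loads(reason_code_json)
    entry = entries[rank]
    return entry["attr"], entry["pct"]


attr, pct = top_reason(reason_code)
print(f"TOP 1: {attr}, PCT 1: {pct:.1f}%")
# prints: TOP 1: FUNCTIONALAREACHANGETYPE, PCT 1: 29.6%
```

Passing `rank=1` would return the second-listed contributor, which may be useful when more than one driver per employee is of interest.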
Summarizing the Results
The output table includes the following key columns:
- 'EMPLOYEE_ID': A unique identifier for each employee.
- 'SCORE': The model's prediction indicating whether an employee is likely to leave ('Yes' for churn, 'No' otherwise).
- 'CONFIDENCE': A measure of certainty in the prediction, representing how strongly the model associates the input data with the predicted outcome.
A 'SCORE' of 'Yes' marks an employee as a predicted churner, while a higher 'CONFIDENCE' value suggests that the prediction is more reliable.
These insights can be used to improve targeted employee retention strategies, leveraging data-driven analysis to reduce turnover.
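One way such targeting could look in practice is to keep only the predicted churners whose confidence clears a chosen cut-off and rank them. The sketch below assumes the prediction result has already been collected into a pandas DataFrame; the first three rows reuse values from the first results table, while the 'No' row and the `MIN_CONFIDENCE` threshold are hypothetical.

```python
import pandas as pd

# First three rows copied from the first results table above; the 'No' row is
# invented to show that non-churn predictions are filtered out.
predictions = pd.DataFrame({
    "EMPLOYEE_ID": [10772, 12996, 16484, 99999],
    "SCORE": ["Yes", "Yes", "Yes", "No"],
    "CONFIDENCE": [0.9805, 0.6626, 0.5256, 0.9100],
})

MIN_CONFIDENCE = 0.60  # hypothetical cut-off, not a recommended value

# Keep confident churn predictions and list the most certain ones first.
high_risk = (
    predictions[(predictions["SCORE"] == "Yes")
                & (predictions["CONFIDENCE"] >= MIN_CONFIDENCE)]
    .sort_values("CONFIDENCE", ascending=False)
    .reset_index(drop=True)
)
print(high_risk)
```

With these values, employees 10772 and 12996 remain, while 16484 falls below the cut-off. The appropriate threshold depends on how costly a missed churner is relative to an unnecessary intervention.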
References
Conclusion
The AUC value obtained during model evaluation is above 0.90, which indicates a well-performing classifier.
In reviewing the model training results, we examined feature importance, that is, the relative importance of each attribute in explaining and contributing to the model's global classification performance.
In the model prediction section, we highlighted that a higher 'CONFIDENCE' value indicates greater reliability in the model's prediction. In addition, we showed the relevance of 'explainability' (for example, the output delivered in the 'REASON_CODE' column) as an effective tool for understanding predictive modeling [1].
We have showcased how SAP HANA PAL integrates 'explainability' into classification; the same capability also extends to various regression algorithms and to time series analysis. ML explainability is integral to achieving SAP's ethical AI goals, ensuring fairness, transparency, and trustworthiness in AI systems.
In summary, these insights can be leveraged to proactively shape employee retention strategies based on model-driven analysis.