Generalizing the Model Using Test Data

Objective

After completing this lesson, you will be able to evaluate which features in the dataset have the most influence on the model’s predictions.

Predicting with the Model

This section presents the predicted results generated by the trained model on a subset of 1000 rows from the testing dataset. The predictions are based on the model’s learned patterns and include key metrics that help assess its reliability.

Step 1: Selecting a Subset of Data

A subset of 1000 employees is selected from the test dataset. The 'FLIGHT_RISK' target column is dropped first, so that the model sees only the input features and not the known outcome.

Code Snippet
# Drop the target column, then keep the first 1000 rows of the test set
hdf_new = df_test.drop('FLIGHT_RISK').head(1000)
display(hdf_new.collect())
ITEM_NUMBER | EMPLOYEE_ID | AGE | AGE_GROUP10 | AGE_GROUPS | GENERATION | CRITICAL_JOB_ROLE | RISK_OF_LOSS | IMPACT_OF_LOSS | FUTURE_LEADER | GENDER | ... | CURRENT_REGION | CURRENT_COUNTRY | CURCOUNTRYLAT | CURCOUNTRYLON | PROMOTION_WITHIN_LAST_3_YEARS | CHANGED_POSITION_WITHIN_LAST_2_YEARS | CHANGE_IN_PERFORMANCE_RATING | FUNCTIONALAREACHANGETYPE | JOBLEVELCHANGETYPE | HEADS
0 | 10037 | 33 | (25-35] | (30-35] | Generation Y | Non-Critical | Low | Low | No Future Leader | Male | ... | Americas | Mexico | 19.432601 | -99.133342 | No Promotion | No Change | 0 - Not available | No change | No change | 1
1 | 10045 | 33 | (25-35] | (30-35] | Generation Y | Critical | Medium | Medium | No Future Leader | Female | ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1
2 | 10082 | 33 | (25-35] | (30-35] | Generation Y | Critical | Low | Medium | No Future Leader | Male | ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1
3 | 10086 | 33 | (25-35] | (30-35] | Generation Y | Non-Critical | Medium | Low | Future Leader | Male | ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1
4 | 10092 | 33 | (25-35] | (30-35] | Generation Y | Non-Critical | Low | Low | Future Leader | Male | ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | No Change | 0 - Not available | No change | No change | 1
...
995 | 16699 | 36 | (35-45] | (35-40] | Generation Y | Critical | Medium | Low | No Future Leader | Male | ... | EMEA | Ireland | 52.865196 | -7.979460 | Promotion | No Change | 3 - Decreasing | Cross-Functional Move | Promotion | 1
996 | 16702 | 36 | (35-45] | (35-40] | Generation Y | Critical | High | Medium | No Future Leader | Female | ... | EMEA | Denmark | 55.670249 | 10.333328 | No Promotion | Change | 1 - Increasing | Cross-Functional Move | Same Level | 1
997 | 16704 | 36 | (35-45] | (35-40] | Generation Y | Non-Critical | Medium | High | Future Leader | Male | ... | APJ | China | 35.000074 | 104.999927 | No Promotion | No Change | 3 - Decreasing | Cross-Functional Move | Same Level | 1
998 | 16710 | 45 | (45-55] | (40-45] | Generation X | Critical | Medium | Medium | No Future Leader | Male | ... | EMEA | Italy | 42.638426 | 12.674297 | Promotion | Change | 1 - Increasing | Cross-Functional Move | Promotion | 1
999 | 16731 | 50 | (45-55] | (45-50] | Generation X | Critical | High | High | No Future Leader | Male | ... | Americas | USA | 39.783730 | -100.445882 | No Promotion | Change | 2 - Constant | Cross-Functional Move | Same Level | 1
1000 rows × 40 columns
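The idea behind this step can be sketched locally without a database connection. The example below is a minimal illustration in plain Python, not the hana_ml implementation: the helper name split_features_and_labels is hypothetical, and the 'FLIGHT_RISK' values are made up for the example. It removes the target from the rows passed on for prediction and keeps the known labels aside, keyed by 'EMPLOYEE_ID', so they could be compared with the predictions later.

```python
# Toy records. EMPLOYEE_ID and AGE follow the table above;
# the FLIGHT_RISK values are made up for illustration only.
records = [
    {"EMPLOYEE_ID": 10037, "AGE": 33, "FLIGHT_RISK": "No"},
    {"EMPLOYEE_ID": 10045, "AGE": 33, "FLIGHT_RISK": "Yes"},
    {"EMPLOYEE_ID": 10082, "AGE": 33, "FLIGHT_RISK": "No"},
]


def split_features_and_labels(rows, target, key, limit):
    """Take the first `limit` rows and return (rows without `target`, {key: label})."""
    subset = rows[:limit]
    features = [{k: v for k, v in row.items() if k != target} for row in subset]
    labels = {row[key]: row[target] for row in subset}
    return features, labels


features, labels = split_features_and_labels(
    records, target="FLIGHT_RISK", key="EMPLOYEE_ID", limit=2
)
print(features)  # rows that would be passed to the model
print(labels)    # known outcomes, held back for later comparison
```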

Step 2: Running Predictions

Predictions are generated using the trained model:

Code Snippet
predicted_classification = hgbc.predict(
    hdf_new, key='EMPLOYEE_ID',
    attribution_method='tree-shap', missing_replacement='feature_marginalized')

Step 3: Filtering and Displaying Results

A few prediction examples are displayed in the table below. The table includes the following columns: 'EMPLOYEE_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE', 'TOP 1', and 'PCT 1'.

The 'SCORE' column [1] contains the predicted class label, while the 'CONFIDENCE' column indicates the probability or confidence level of the classification prediction made by the model.

Additionally, the 'REASON_CODE' column provides insight into the feature importance related to the predicted classification (i.e., local feature importance or explainability) [2].

For instance, row 0 in the table represents an employee's record classified as 'Yes' (indicating employee churn). The most influential feature for this classification is 'FUNCTIONALAREACHANGETYPE' (listed in the 'TOP 1' column), with a share of about 29.6% (shown in the 'PCT 1' column).
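To make the percentage figure more concrete, the sketch below uses three 'REASON_CODE' entries copied from the first 'Yes' row in this lesson's output. It checks an observation about that sample: each 'pct' value appears to be the entry's absolute 'val' expressed as a share of one common total. This is inferred from the numbers shown here and is an assumption, not documented PAL behavior.

```python
import math

# Three of the ten REASON_CODE entries for EMPLOYEE_ID 10772,
# copied from the output shown in this lesson.
entries = [
    {"attr": "FUNCTIONALAREACHANGETYPE", "val": 3.1659914617057178, "pct": 29.616754904973626},
    {"attr": "JOBLEVELCHANGETYPE", "val": 2.9616090395572964, "pct": 27.70482804829365},
    {"attr": "TIMEINPREVPOSITIONMONTH", "val": -0.20690779780323047, "pct": 1.9355508723212456},
]

# Assumption being checked: pct = 100 * |val| / total.
# If that holds, 100 * |val| / pct gives the same total for every entry.
implied_totals = [100 * abs(e["val"]) / e["pct"] for e in entries]

for entry, total in zip(entries, implied_totals):
    sign = "positive" if entry["val"] > 0 else "negative"
    print(f'{entry["attr"]}: {sign} val, implied total {total:.4f}')

consistent = all(
    math.isclose(total, implied_totals[0], rel_tol=1e-6) for total in implied_totals
)
print("pct matches share of absolute val:", consistent)
```

Under this reading, the sign of 'val' carries the direction of a feature's contribution, while 'pct' reports only its relative magnitude.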

Code Snippet
import pandas as pd

# Show the full REASON_CODE text instead of a truncated cell
pd.set_option('max_colwidth', None)
display(predicted_classification.filter('"SCORE" = \'Yes\'').select(
    'EMPLOYEE_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE',
    ('json_query("REASON_CODE", \'$[0].attr\')', 'TOP 1'),
    ('json_query("REASON_CODE", \'$[0].pct\')', 'PCT 1')
).head(3).collect())
ITEM_NUMBER | EMPLOYEE_ID | SCORE | CONFIDENCE | REASON_CODE | TOP 1 | PCT 1
0 | 10772 | Yes | 0.9805 | [{"attr":"FUNCTIONALAREACHANGETYPE","val":3.1659914617057178,"pct":29.616754904973626},{"attr":"JOBLEVELCHANGETYPE","val":2.9616090395572964,"pct":27.70482804829365},{"attr":"PREVIOUS_COUNTRY","val":1.1766521751513067,"pct":11.007174056333435},{"attr":"EMPLOYMENT_TYPE_2","val":1.1184850498463573,"pct":10.463040721003644},{"attr":"PROMOTION_WITHIN_LAST_3_YEARS","val":0.9856507423595541,"pct":9.220421726166244},{"attr":"PREVIOUS_JOB_LEVEL","val":0.20846321891052906,"pct":1.9501012987093403},{"attr":"TIMEINPREVPOSITIONMONTH","val":-0.20690779780323047,"pct":1.9355508723212456},{"attr":"RISK_OF_LOSS","val":-0.17780521059984248,"pct":1.6633062365637659},{"attr":"PREVIOUS_FUNCTIONAL_AREA","val":-0.10049090995097196,"pct":0.9400576995214759},{"attr":"IMPACT_OF_LOSS","val":-0.09804435450407491,"pct":0.9171710196587514}] | "FUNCTIONALAREACHANGETYPE" | 29.6167
1 | 12996 | Yes | 0.6626 | [{"attr":"FUNCTIONALAREACHANGETYPE","val":1.779508727486707,"pct":26.18333928119176},{"attr":"PREVIOUS_COUNTRY","val":1.7020204158662773,"pct":25.04319159765106},{"attr":"EMPLOYMENT_TYPE_2","val":0.7729003104726951,"pct":11.372302224236322},{"attr":"PREVIOUS_JOB_LEVEL","val":0.32942920001859057,"pct":4.847156060538495},{"attr":"PREVIOUS_REGION","val":-0.3104891995443306,"pct":4.568476641469905},{"attr":"EMPLOYMENT_TYPE","val":0.20022089183811793,"pct":2.94601058213658},{"attr":"PREVIOUS_PERFORMANCE_RATING","val":0.19403810814327597,"pct":2.8550383263202758},{"attr":"PROMOTION_WITHIN_LAST_3_YEARS","val":0.17748349440876774,"pct":2.6114570157122087},{"attr":"IMPACT_OF_LOSS","val":0.15021065632142925,"pct":2.2101698729341319},{"attr":"PREVIOUS_FUNCTIONAL_AREA","val":0.14947864972308989,"pct":2.199399272698043}] | "FUNCTIONALAREACHANGETYPE" | 26.1833
2 | 16484 | Yes | 0.5256 | [{"attr":"FUNCTIONALAREACHANGETYPE","val":2.362132834166609,"pct":36.93164842346884},{"attr":"EMPLOYMENT_TYPE_2","val":0.8705457026314862,"pct":13.61087207336961},{"attr":"PREVIOUS_COUNTRY","val":0.5551789479896835,"pct":8.68015270889719},{"attr":"PROMOTION_WITHIN_LAST_3_YEARS","val":0.5522170115232447,"pct":8.633843206464077},{"attr":"PREVIOUS_REGION","val":0.361208302694118,"pct":5.647446176515984},{"attr":"CHANGE_IN_PERFORMANCE_RATING","val":0.28138966471557627,"pct":4.39949185624014},{"attr":"PREVIOUS_PERFORMANCE_RATING","val":0.17816522478609154,"pct":2.785590779618334},{"attr":"TIMEINPREVPOSITIONMONTH","val":-0.17551115090711354,"pct":2.744094669843919},{"attr":"AGE","val":-0.11292548102606302,"pct":1.765575628509087},{"attr":"CURRENT_REGION","val":0.10269902670112985,"pct":1.605686307178712}] | "FUNCTIONALAREACHANGETYPE" | 36.9316

ITEM_NUMBER | EMPLOYEE_ID | SCORE | CONFIDENCE | REASON_CODE | TOP 1 | PCT 1
0 | 27221 | Yes | 0.522728 | [{"attr":"TIMEINPREVPOSITIONMONTH","val":2.3799464640557694,"pct":34.21097120019775},{"attr":"EMPLOYMENT_TYPE_2","val":1.2305701014297253,"pct":17.689052647047377},{"attr":"FUNCTIONALAREACHANGETYPE","val":0.6205941708448476,"pct":8.920843231743456},{"attr":"PREVIOUS_COUNTRY","val":0.5274604833035208,"pct":7.5820761836751509},{"attr":"SALARY","val":0.49680440144873647,"pct":7.141404786530497},{"attr":"PREVIOUS_REGION","val":0.4019386053051078,"pct":5.7777392298596139},{"attr":"PREVIOUS_JOB_LEVEL","val":0.18958353451210387,"pct":2.7252028300554658},{"attr":"PROMOTION_WITHIN_LAST_3_YEARS","val":0.17505423756532619,"pct":2.516348821399208},{"attr":"AGE","val":-0.16361173162882435,"pct":2.3518664488064009},{"attr":"CHANGE_IN_PERFORMANCE_RATING","val":-0.13605646817959733,"pct":1.9557683270575257}] | "TIMEINPREVPOSITIONMONTH" | 34.21097120019775
1 | 27858 | Yes | 0.990345 | [{"attr":"TIMEINPREVPOSITIONMONTH","val":5.075260131915508,"pct":41.194646979837177},{"attr":"EMPLOYMENT_TYPE_2","val":2.0529135214516618,"pct":16.66299767858746},{"attr":"SALARY","val":0.9937257943954739,"pct":8.065829579346043},{"attr":"PREVIOUS_COUNTRY","val":-0.9562490170007192,"pct":7.761639730039505},{"attr":"FUNCTIONALAREACHANGETYPE","val":0.6624436830095447,"pct":5.3768935889601139},{"attr":"PREVIOUS_JOB_LEVEL","val":0.4505844835984997,"pct":3.6572842088838648},{"attr":"JOBLEVELCHANGETYPE","val":0.3425983543256225,"pct":2.780787170605691},{"attr":"PREVIOUS_FUNCTIONAL_AREA","val":0.33966466357291666,"pct":2.7569751192497895},{"attr":"PROMOTION_WITHIN_LAST_3_YEARS","val":0.25199722374185459,"pct":2.0453999208168196},{"attr":"AGE","val":0.2398786666673686,"pct":1.9470365527109877}] | "TIMEINPREVPOSITIONMONTH" | 41.194646979837177
2 | 28272 | Yes | 0.703382 | [{"attr":"TIMEINPREVPOSITIONMONTH","val":2.2255270451611879,"pct":30.169212529463996},{"attr":"FUNCTIONALAREACHANGETYPE","val":1.1938265379189222,"pct":16.183495331633496},{"attr":"PREVIOUS_COUNTRY","val":0.7302494941417288,"pct":9.899251611520969},{"attr":"CHANGE_IN_PERFORMANCE_RATING","val":0.5236095049193351,"pct":7.098042897616069},{"attr":"PROMOTION_WITHIN_LAST_3_YEARS","val":0.3902212107921695,"pct":5.289833106045707},{"attr":"PREVIOUS_FUNCTIONAL_AREA","val":0.30448378867275469,"pct":4.127577848230294},{"attr":"EMPLOYMENT_TYPE_2","val":-0.26312981446335928,"pct":3.5669839702211125},{"attr":"CURRENT_REGION","val":0.25556687764890786,"pct":3.464460907830933},{"attr":"PREVIOUS_PERFORMANCE_RATING","val":0.19837898315664025,"pct":2.689222634811058},{"attr":"SALARY","val":0.17664378482008814,"pct":2.3945805996081117}] | "TIMEINPREVPOSITIONMONTH" | 30.169212529463996
The rows above show the filtered output, that is, the employees the model predicts will leave.
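The 'TOP 1' and 'PCT 1' columns are read from the first element of the 'REASON_CODE' JSON array. The same lookup can be sketched on the client side with Python's standard json module, for example after the results have been collected. The helper name top_reasons is hypothetical, the sample string is shortened to the first three entries of the row for 'EMPLOYEE_ID' 10772, and the sketch assumes the array is ordered by 'pct' in descending order, as it is in the rows shown here.

```python
import json

# REASON_CODE for EMPLOYEE_ID 10772, shortened to its first three entries.
reason_code = (
    '[{"attr":"FUNCTIONALAREACHANGETYPE","val":3.1659914617057178,"pct":29.616754904973626},'
    '{"attr":"JOBLEVELCHANGETYPE","val":2.9616090395572964,"pct":27.70482804829365},'
    '{"attr":"PREVIOUS_COUNTRY","val":1.1766521751513067,"pct":11.007174056333435}]'
)


def top_reasons(reason_code_json, n=1):
    """Return the first n (attr, pct) pairs of a REASON_CODE JSON array."""
    entries = json.loads(reason_code_json)
    return [(entry["attr"], entry["pct"]) for entry in entries[:n]]


# Equivalent of the '$[0].attr' and '$[0].pct' lookups in the query.
top_attr, top_pct = top_reasons(reason_code)[0]
print(top_attr, round(top_pct, 1))

# Combined share of the three strongest attributes in this shortened sample.
top3 = top_reasons(reason_code, n=3)
cumulative_pct = sum(pct for _, pct in top3)
print(f"Top 3 attributes account for {cumulative_pct:.1f}% of the attribution")
```

Parsing on the client also yields the attribute name as a plain string, whereas the 'TOP 1' column in the table keeps the surrounding JSON quotes.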

Summarizing the Results

The output table includes the following key columns:

  • 'EMPLOYEE_ID': A unique identifier for each employee.
  • 'SCORE': The model's prediction indicating whether an employee is likely to leave ('Yes' for churn, 'No' otherwise).
  • 'CONFIDENCE': A measure of certainty in the prediction, representing how strongly the model associates the input data with the predicted outcome.

A 'SCORE' of 'Yes' means the model predicts churn, while a higher 'CONFIDENCE' value means the model is more certain of its prediction.

These insights can be used to improve targeted employee retention strategies, leveraging data-driven analysis to reduce turnover.
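One way to turn these columns into a retention shortlist is to keep the predicted leavers and rank them by 'CONFIDENCE'. The sketch below uses the first three rows of the output shown earlier. The helper name shortlist and the 0.6 cut-off are hypothetical choices for illustration, not values recommended by the lesson.

```python
# (EMPLOYEE_ID, SCORE, CONFIDENCE) from the first three rows of the output.
predictions = [
    (10772, "Yes", 0.9805),
    (12996, "Yes", 0.6626),
    (16484, "Yes", 0.5256),
]

MIN_CONFIDENCE = 0.6  # hypothetical cut-off, chosen only for this example


def shortlist(rows, min_confidence):
    """Keep predicted leavers at or above the cut-off, most confident first."""
    leavers = [row for row in rows if row[1] == "Yes" and row[2] >= min_confidence]
    return sorted(leavers, key=lambda row: row[2], reverse=True)


for employee_id, score, confidence in shortlist(predictions, MIN_CONFIDENCE):
    print(employee_id, score, confidence)
```

In practice the cut-off would depend on how many employees a retention program can reach and on the cost of acting on a wrong prediction.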

References

Conclusion

The AUC value obtained during model evaluation is above 0.90, which indicates a well-performing classifier.
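As a reminder of what this metric measures, AUC can be read as the share of (leaver, non-leaver) pairs in which the model gives the leaver the higher score. The sketch below computes it that way in plain Python. The labels and scores are made up for illustration and are unrelated to the lesson's dataset, and the function is a simple pairwise version for small samples rather than the routine used by PAL.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = churn) and predicted churn probabilities.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(round(roc_auc(toy_labels, toy_scores), 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a value above 0.90 is read as strong separation between the two classes.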

In reviewing the model training results, we examined feature importance, that is, the relative importance of each attribute in explaining and contributing to the model's global classification performance.

In the model prediction section, we highlighted that a higher 'CONFIDENCE' value indicates greater reliability of the model's prediction. We also showed the value of explainability (for example, the output delivered in the 'REASON_CODE' column) as an effective tool for understanding predictive models [1].

We have shown how SAP HANA PAL integrates explainability into classification. The same capability extends to various regression algorithms and to time series analysis. ML explainability is integral to achieving SAP's ethical AI goals, ensuring fairness, transparency, and trustworthiness in AI systems.

In summary, these insights can be leveraged to proactively address employee retention strategies based on model-driven analysis.

References
