Training the Classification Model

Objectives

After completing this lesson, you will be able to:
  • Train a classification model using SAP HANA’s PAL functions.
  • Fine-tune the model using different parameter settings.

Training of Hybrid Gradient Boosting Tree (HGBT) Classification Model for Feature/Data Column 'FlightRisk'

The SAP HANA PAL Unified Classification function [1] provides an efficient way to train classification models with enhanced features such as:

  • Seamless switching between classification algorithms
  • Automatic dataset partitioning
  • Built-in model evaluation procedures
  • Support for additional evaluation metrics

For this task, the Hybrid Gradient Boosting Tree (HGBT) algorithm is selected by setting the 'func' parameter to 'HybridGradientBoostingTree'. Finally, the training time, that is, the time taken to fit the model to the training dataset, is displayed.

Code Snippet
# Train the classifier model using PAL HybridGradientBoostingTree
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

# Initialize the model object
hgbc = UnifiedClassification(func='HybridGradientBoostingTree',
                             n_estimators=101,
                             split_threshold=0.1,
                             learning_rate=0.1,
                             max_depth=6,
                             split_method='histogram',
                             max_bin_num=256,
                             feature_grouping=True,
                             tolerant_iter_num=5,
                             resampling_method='cv',
                             fold_num=5,
                             ref_metric=['auc'],
                             evaluation_metric='error_rate')

# Execute the training of the model
# key= 'EMPLOYEE_ID',
hgbc.fit(data=df_train.drop('EMPLOYEE_ID'),
         label='FLIGHT_RISK',
         partition_method='stratified',
         stratified_column='FLIGHT_RISK',
         training_percent=0.8,
         ntiles=20,
         build_report=True)

display(hgbc.runtime)

1.82820272445678
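The fit call above delegates data partitioning to PAL via partition_method='stratified' and training_percent=0.8. As a rough illustration of what stratified partitioning means (not PAL's actual implementation), the following pure-Python sketch splits rows 80/20 while preserving the label proportions in each split; the toy data and labels are hypothetical:

# A simplified, illustrative analogue of stratified 80/20 partitioning
import random
from collections import defaultdict

def stratified_split(rows, label_idx, training_percent=0.8, seed=42):
    """Split rows into train/test, preserving label proportions per class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_idx]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = int(round(len(group) * training_percent))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical toy data: (employee_id, flight_risk)
rows = [(i, 'Yes' if i % 5 == 0 else 'No') for i in range(100)]
train, test = stratified_split(rows, label_idx=1)
print(len(train), len(test))  # 80 20

Because the split is done per class, the 20% 'Yes' rate in the toy data is preserved in both the training and test portions.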

Reviewing Model Training Results

The PAL/APL report generator is used to evaluate model performance. Currently, it supports only UnifiedClassification and UnifiedRegression models [1].

A variety of classification performance metrics exist, with AUC-ROC (Area Under the Receiver Operating Characteristic Curve) [2] being one of the most widely used. This metric, ranging from 0 to 1, assesses a binary classification model’s ability to distinguish between positive and negative classes. The closer the AUC value is to 1, the better the model performs in separating the classes.
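To make the metric concrete (this is an illustration, not how PAL computes it internally), AUC-ROC can be expressed via the rank-based Mann-Whitney formulation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half:

def auc_roc(labels, scores):
    """AUC as P(score_pos > score_neg) over all positive/negative pairs.
    O(n_pos * n_neg) pairwise comparison; fine for illustration."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: the model ranks positives mostly, but not always, higher
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auc_roc(labels, scores))  # 0.888... (8 of 9 pairs ranked correctly)

A perfect ranking gives 1.0, a random one about 0.5, which is why values close to 1 indicate good class separation.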

In the report shown below, under the Statistics tab in the Stats Table, an AUC value of 0.95 is observed. Additionally, the AUC-ROC plot is available for visualization under the 'Scoring Metrics' tab.

The Variable Importance tab presents a feature importance pie chart, which assigns scores to input features based on their contribution to predicting the target variable [3]. Higher scores indicate a greater influence on the model’s predictions, with importance values ranging between 0 and 1.

From the results, 'PREVIOUS_COUNTRY' has the highest importance score of 0.23, followed by 'FUNCTIONALAREACHANGETYPE' at 0.21.
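HGBT derives its importance scores internally from the model's split statistics. A model-agnostic way to build the same intuition is permutation importance: shuffle one feature's column and measure how much accuracy drops. The sketch below is illustrative only, with a hypothetical one-feature model; it is not how PAL computes the scores shown in the report:

# Permutation importance sketch (model-agnostic; HGBT's built-in
# scores are computed differently, from split gains)
import random

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Importance of feature j = average accuracy drop after shuffling column j."""
    rng = random.Random(seed)
    def accuracy(rows):
        return sum(model(r) == label for r, label in zip(rows, y)) / len(y)
    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            Xp = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - accuracy(Xp))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model that only looks at feature 0
model = lambda row: 1 if row[0] > 0.5 else 0
X = [(i / 10, i % 2 / 1.0) for i in range(10)]
y = [1 if row[0] > 0.5 else 0 for row in X]
imp = permutation_importance(model, X, y)
print(imp)  # feature 0 gets a large score; feature 1 gets exactly 0.0

Shuffling a feature the model ignores leaves predictions unchanged (importance 0), while shuffling an influential feature degrades accuracy, mirroring how higher scores in the report indicate greater influence on predictions.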

Code Snippet
# Build Model Report
from hana_ml.visualizers.unified_report import UnifiedReport

UnifiedReport(hgbc).build().display()

Unified Classification Model Report: Statistic - Stats Table

STAT NAME    STAT VALUE    CLASS
AUC          0.9509        None
ACCURACY     0.9193        None
KAPPA        0.4390        None
MCC          0.4951        None
Figure: Pie chart of Variable Importance
Figure: Bar chart of Variable Importance
Figure: ROC Curve

AUC Value

We can also retrieve the AUC value directly with a single line of code:

hgbc.get_performance_metrics()['AUC']

0.9509
