Training a Regression Model with SAP HANA PAL

Objective

After completing this lesson, you will be able to explore the capabilities of the Hybrid Gradient Boosting Tree (HGBT) algorithm for regression.

Train a PAL Regression Model for House Prices

The Python machine learning client for SAP HANA (hana-ml) provides access to:

  • All functions of the Predictive Analysis Library (PAL)
  • Automated Predictive Library (APL) functions in Python

These functions can be used with SAP HANA DataFrames as input data.

Preparing the input dataset

The partitioning algorithm randomly splits an input dataset into three disjoint subsets: training, testing, and validation. These subsets are central to the machine learning workflow.

Note that the input dataset must include an 'ID' column for creating these partitions, as required by the partitioning algorithm [1]. If the 'ID' column is not explicitly specified, the algorithm assumes that the first column of the DataFrame contains the 'ID'. For additional details, please refer to [1].

Here’s how to insert the 'ID' column into the dataset:

Python

hdf_input = hdf.add_id(id_col='ID')
hdf_input.head(5).collect()
Output:
   ID  MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0   1    1.24      52.0      2.92       0.91       396.0      4.65     37.80    -122.27    5.00
1   2    1.16      52.0      2.43       0.94      1349.0      5.39     37.87    -122.25    5.00
2   3    7.85      52.0      7.79       1.05       517.0      2.41     37.86    -122.24    5.00
3   4    9.39      52.0      7.51       0.95      1366.0      2.75     37.85    -122.24    5.00
4   5    7.87      52.0      8.28       1.04       947.0      2.62     37.83    -122.23    5.00

Partitioning the input dataset

There is no single optimal partition percentage for these subsets. You need to choose a split that meets the predictive project's objectives.

For instance, common partition percentages include:

  • Training: 80% / Testing: 20%
  • Training: 70% / Testing: 30%
  • Training: 60% / Testing: 40%

In this particular case, you want to maximize the amount of data available for training the model while still leaving enough data points for a robust model evaluation. Therefore, the suggested data split is the following:

Training: 70% / Testing: 30%
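To build intuition for what a seeded random 70/30 partition does, here is a minimal, stand-alone Python sketch. Note that PAL's `train_test_val_split` performs the partitioning server-side on SAP HANA; the row IDs and sizes below are invented purely for illustration:

```python
import random

# Sketch of a seeded 70/30 random partition over row IDs, mimicking
# partition_method='random' with a fixed random_seed (illustration only;
# the real split runs inside SAP HANA via PAL).
ids = list(range(1, 101))        # 100 hypothetical row IDs
rng = random.Random(2)           # fixed seed -> reproducible split
shuffled = ids[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.7)   # 70% of rows go to training
train_ids = sorted(shuffled[:cut])
test_ids = sorted(shuffled[cut:])
print(len(train_ids), len(test_ids))  # prints: 70 30
```

Because the seed is fixed, rerunning the sketch always produces the same disjoint train/test subsets, which is the same reproducibility property the `random_seed` parameter provides in PAL.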

The input parameters for the partitioning algorithm are provided below:

Python

# Partitioning the input data into train, test, validation subsets
from hana_ml.algorithms.pal.partition import train_test_val_split

regressdata_hdf = hdf_input.select('ID', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                                   'Population', 'AveOccup', 'Latitude', 'Longitude', 'Target')

train_hdf, test_hdf, val_hdf = train_test_val_split(data=regressdata_hdf,
                                                    id_column='ID',
                                                    random_seed=2,
                                                    partition_method='random',
                                                    training_percentage=0.7,
                                                    testing_percentage=0.3,
                                                    validation_percentage=0.0)

print(regressdata_hdf.select_statement)

Output:

SELECT "ID", "MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude", "Target" FROM (SELECT CAST(ROW_NUMBER() OVER() AS INTEGER) + 0 AS "ID", * FROM (SELECT * FROM "ML_DEMO"."california_housing")) AS "DT_269"

Train a Hybrid Gradient Boosting Tree (HGBT) Model

The PAL Hybrid Gradient Boosting Tree (HGBT) algorithm [1] is an SAP HANA-optimized gradient boosting tree implementation that supports mixed feature types (continuous and categorical) as input. It supports both regression and classification scenarios.
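To build intuition for the training parameters used in this lesson (`n_estimators`, `learning_rate`, `max_depth`), here is a minimal, pure-Python sketch of gradient boosting for regression with squared loss and depth-1 stumps. This is purely illustrative and is not the PAL implementation; the helper names and toy data are invented:

```python
# Each boosting round fits a simple "stump" (one threshold split) to the
# current residuals and adds it to the ensemble, scaled by the learning rate.

def fit_stump(x, residuals):
    """Find the threshold split on x minimizing squared error of residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_estimators=20, learning_rate=0.3):
    base = sum(y) / len(y)  # initial prediction: the mean target
    stumps = []
    for _ in range(n_estimators):
        pred = [base + sum(learning_rate * s(xi) for s in stumps) for xi in x]
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stumps.append(fit_stump(x, residuals))
    return lambda xi: base + sum(learning_rate * s(xi) for s in stumps)

# Toy data: the target is a step function of a single feature.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = boost(x, y)
print([round(model(xi), 2) for xi in x])  # close to [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
```

With a learning rate of 0.3, each round removes roughly 30% of the remaining residual, so after 20 rounds the toy model fits the step function almost exactly; more estimators or a higher learning rate would fit faster but, on real data, risk overfitting.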

The Hybrid Gradient Boosting model for regression is trained on the training subset, as shown below:

Code Snippet

%%time
# Train the HGBT model for regression on the training subset
from hana_ml.algorithms.pal.trees import HybridGradientBoostingRegressor

hgr = HybridGradientBoostingRegressor(n_estimators=20,
                                      split_threshold=0.75,
                                      split_method='exact',
                                      learning_rate=0.3,
                                      max_depth=2,
                                      resampling_method='cv',
                                      fold_num=5,
                                      evaluation_metric='rmse',
                                      ref_metric=['mae'])

hgr.fit(train_hdf,
        features=['ID', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                  'Population', 'AveOccup', 'Latitude', 'Longitude'],
        label='Target')

CPU times: user 4.81 ms, sys: 408 μs, total: 5.22 ms

Wall time: 633 ms

Output:

<hana_ml.algorithms.pal.trees.HybridGradientBoostingRegressor at 0x7fe3333e9510>

References

[1] SAP algorithm hana_ml.algorithms.pal package: HybridGradientBoostingRegressor

Feature importance

Feature importance assigns a score to each input feature based on its contribution to predicting the response, or dependent, variable [1].

The higher the score for a feature, the larger its influence on the model's prediction of the target variable. The importance scores are in the range [0, 1].
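The reported scores are normalized: they sum to 1, so each value can be read as the fraction of the model's total attributed influence. This can be checked directly with the values from the feature-importance table in this lesson:

```python
# Importance values copied from the feature-importance output in this lesson.
importances = {
    'MedInc': 0.513025, 'ID': 0.230721, 'Latitude': 0.097631,
    'AveOccup': 0.085743, 'Longitude': 0.057718, 'AveRooms': 0.011069,
    'HouseAge': 0.004093, 'AveBedrms': 0.000000, 'Population': 0.000000,
}
total = sum(importances.values())
top = max(importances, key=importances.get)
print(round(total, 6), top)  # prints: 1.0 MedInc
```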

Here's how the model's feature importance can be checked:

Python

hgr.feature_importances_.sort('IMPORTANCE', desc=True).collect()

Output:

   VARIABLE_NAME  IMPORTANCE
0         MedInc    0.513025
1             ID    0.230721
2       Latitude    0.097631
3       AveOccup    0.085743
4      Longitude    0.057718
5       AveRooms    0.011069
6       HouseAge    0.004093
7      AveBedrms    0.000000
8     Population    0.000000

It is observed that the most influential feature is 'MedInc', with an importance value of about 0.51. This feature represents the median income in a block group. "A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data" [2].

Moreover, in positions three and five we find the features 'Latitude' and 'Longitude', with importance values of about 0.10 and 0.06, respectively.

References

[1] SAP algorithm hana_ml.algorithms.pal package: HybridGradientBoostingRegressor

[2] California Housing dataset description
