The Python machine learning client for SAP HANA (hana-ml) provides access to:
- All functions of the Predictive Analysis Library (PAL)
- Automated Predictive Library (APL) functions in Python
These functions can be used with SAP HANA DataFrames as input data.
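For context, a typical way to obtain such a DataFrame is to open a connection and reference an existing table. The following is a minimal sketch; the host, port, and credentials are placeholders, and the schema and table names match the "ML_DEMO"."california_housing" table used later in this post:
from hana_ml.dataframe import ConnectionContext

# Connect to the SAP HANA instance (placeholder address, port, and credentials)
conn = ConnectionContext(address='<hana_host>', port=443, user='<user>', password='<password>')

# Create an SAP HANA DataFrame pointing to the source table;
# the data stays in SAP HANA, nothing is transferred to the client yet
hdf = conn.table('california_housing', schema='ML_DEMO')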
Preparing the input dataset
The algorithm randomly partitions an input dataset into three disjoint subsets: training, testing, and validation. These subsets support the model training, evaluation, and validation steps of the machine learning workflow.
Note that the input dataset must include an 'ID' column for creating these partitions, as required by the partitioning algorithm [1]. If the 'ID' column is not explicitly specified, the algorithm assumes that the first column of the DataFrame contains the 'ID'. For additional details, please refer to [1].
Here’s how to insert the 'ID' column into the dataset:
hdf_input = hdf.add_id(id_col='ID')
hdf_input.head(5).collect()
|   | ID | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target |
|---|----|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 1 | 1.24 | 52.0 | 2.92 | 0.91 | 396.0 | 4.65 | 37.80 | -122.27 | 5.00 |
| 1 | 2 | 1.16 | 52.0 | 2.43 | 0.94 | 1349.0 | 5.39 | 37.87 | -122.25 | 5.00 |
| 2 | 3 | 7.85 | 52.0 | 7.79 | 1.05 | 517.0 | 2.41 | 37.86 | -122.24 | 5.00 |
| 3 | 4 | 9.39 | 52.0 | 7.51 | 0.95 | 1366.0 | 2.75 | 37.85 | -122.24 | 5.00 |
| 4 | 5 | 7.87 | 52.0 | 8.28 | 1.04 | 947.0 | 2.62 | 37.83 | -122.23 | 5.00 |
Partitioning the input dataset
There is no universally optimal split between these subsets; choose partition percentages that meet the objectives of your predictive project.
For instance, common partition percentages include:
- Training: 80% / Testing: 20%
- Training: 70% / Testing: 30%
- Training: 60% / Testing: 40%
In this particular case, you will need to maximize the amount of data available for training the model while still leaving enough data points for a robust model evaluation. Therefore, the suggested data split is the following:
Training: 70% / Testing: 30%
The input parameters for the partitioning algorithm are provided below:
# Partitioning the input data into train, test, validation subsets
from hana_ml.algorithms.pal.partition import train_test_val_split

regressdata_hdf = hdf_input.select('ID', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                                   'Population', 'AveOccup', 'Latitude', 'Longitude', 'Target')
train_hdf, test_hdf, val_hdf = train_test_val_split(
    data=regressdata_hdf, id_column='ID', random_seed=2, partition_method='random',
    training_percentage=0.7, testing_percentage=0.3, validation_percentage=0.0)
print(regressdata_hdf.select_statement)
Output:
SELECT "ID", "MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude", "Target" FROM (SELECT CAST(ROW_NUMBER() OVER() AS INTEGER) + 0 AS "ID", * FROM (SELECT * FROM "ML_DEMO"."california_housing")) AS "DT_269"
Train a Hybrid Gradient Boosting Tree (HGBT) Model
The PAL hybrid gradient boosting tree (HGBT) algorithm [1] is an SAP HANA-optimized gradient boosting tree implementation that supports mixed feature types (continuous and categorical) as input. It covers both regression and classification scenarios.
The hybrid gradient boosting model for regression is trained on the training subset, as shown below:
%%time
# Train the HGBT regressor with 5-fold cross-validation on the training subset
from hana_ml.algorithms.pal.trees import HybridGradientBoostingRegressor

hgr = HybridGradientBoostingRegressor(
    n_estimators=20, split_threshold=0.75,
    split_method='exact', learning_rate=0.3,
    max_depth=2,
    resampling_method='cv', fold_num=5,
    evaluation_metric='rmse', ref_metric=['mae'])

hgr.fit(train_hdf,
        features=['ID', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                  'Population', 'AveOccup', 'Latitude', 'Longitude'],
        label='Target')
Output:
CPU times: user 4.81 ms, sys: 408 μs, total: 5.22 ms
Wall time: 633 ms
<hana_ml.algorithms.pal.trees.HybridGradientBoostingRegressor at 0x7fe3333e9510>
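Because cross-validation was requested through resampling_method='cv', the fitted model also exposes a statistics table that can be inspected after training. A minimal sketch, assuming the stats_ attribute of the fitted HybridGradientBoostingRegressor:
# Cross-validation statistics gathered during fit
# (evaluation_metric='rmse' and ref_metric=['mae'] were requested above)
print(hgr.stats_.collect())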
References
[1] SAP hana_ml.algorithms.pal package: HybridGradientBoostingRegressor
Feature importance
Feature importance assigns a score to each input feature based on its contribution to predicting the response, or dependent, variable [1].
The higher the score, the greater the feature's influence on the model's prediction of the target variable. The importance scores lie in the range [0, 1].
The model's feature importance can be checked as shown below. The most influential feature is 'MedInc', with an importance value of about 0.51; it represents the median income in a block group. "A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data" [2].
The features 'Latitude' and 'Longitude' appear in positions three and five, with importance values of roughly 0.098 and 0.058, respectively.
hgr.feature_importances_.sort('IMPORTANCE', desc=True).collect()
|   | VARIABLE_NAME | IMPORTANCE |
|---|---------------|------------|
| 0 | MedInc | 0.513025 |
| 1 | ID | 0.230721 |
| 2 | Latitude | 0.097631 |
| 3 | AveOccup | 0.085743 |
| 4 | Longitude | 0.057718 |
| 5 | AveRooms | 0.011069 |
| 6 | HouseAge | 0.004093 |
| 7 | AveBedrms | 0.000000 |
| 8 | Population | 0.000000 |
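For a visual summary, the importance scores can be pulled to the client and plotted with matplotlib (not part of hana-ml); a minimal sketch:
import matplotlib.pyplot as plt

# Bring the importance scores to the client as a pandas DataFrame
fi_df = hgr.feature_importances_.sort('IMPORTANCE', desc=True).collect()

# Horizontal bar chart with the most important feature on top
fi_df.plot.barh(x='VARIABLE_NAME', y='IMPORTANCE', legend=False)
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.tight_layout()
plt.show()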
References
[1] SAP hana_ml.algorithms.pal package: HybridGradientBoostingRegressor