The model performance metric used to evaluate the regression model is the coefficient of determination, commonly known as "R squared" and denoted R2 [1, 2].
R2 is the proportion of the variation in the response variable that is predictable from the independent variables: the larger the value of R2, the more of the variability in the data is explained by the model.
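Formally, with $y_i$ denoting the observed values, $\hat{y}_i$ the model predictions, and $\bar{y}$ the mean of the observed values:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A model that always predicts the mean scores $R^2 = 0$, and a model that fits worse than the mean yields a negative value.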
Typically, R2 ranges from 0 to 1, although negative values can occur in certain cases; refer to [1] for further details.
In this case, an R2 score of approximately 0.75 is achieved:
```python
# Compute the R2 score on the test data
R2 = hgr.score(test_hdf, key='ID',
               features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                         'Population', 'AveOccup', 'Latitude', 'Longitude'],
               label='Target')
R2
```
Output:

```
0.7535287346593224
```
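As a sanity check, the same value can be recomputed on the client side from the model's predictions. This is a minimal sketch, assuming the predict() result exposes the prediction in a SCORE column (as PAL typically does) and that test_hdf is the test HANA DataFrame used above:

```python
# Sanity check: recompute R2 from predictions on the client side.
# Assumption: the predict() result has 'ID' and 'SCORE' columns.
pred = hgr.predict(test_hdf, key='ID',
                   features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                             'Population', 'AveOccup', 'Latitude', 'Longitude']).collect()
actual = test_hdf.select('ID', 'Target').collect()
merged = actual.merge(pred[['ID', 'SCORE']], on='ID')
y = merged['Target'].astype(float)
y_hat = merged['SCORE'].astype(float)
r2_manual = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(r2_manual)  # should agree with hgr.score() up to numerical precision
```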
Improve the HGBT model with optimal parameter search
To improve the HGBT model, a search for the optimal parameter values of the HGBT estimator object is initiated. ParamSearchCV performs an exhaustive or random search over specified parameter values with cross-validation (CV) [1].
References
[1] SAP hana_ml documentation, hana_ml.algorithms.pal package: ParamSearchCV
```python
from hana_ml.algorithms.pal.model_selection import ParamSearchCV
from hana_ml.algorithms.pal.trees import HybridGradientBoostingRegressor

# Base HGBT estimator; the remaining parameters are tuned via grid search
hgbr = HybridGradientBoostingRegressor(n_estimators=50, subsample=0.8,
                                       col_subsample_tree=0.7)

ps_hgr3 = ParamSearchCV(estimator=hgbr, search_strategy='grid',
                        param_grid={'learning_rate': [0.05, 0.1, 0.025, 0.04, 0.01],
                                    'max_depth': [4, 5, 6, 7, 8, 10],
                                    'split_threshold': [0.1, 0.4, 0.7, 1],
                                    'min_samples_leaf': [2, 3, 4, 5, 6],
                                    'col_subsample_split': [0.2, 0.4, 0.6, 0.8]},
                        train_control={"fold_num": 10, "evaluation_metric": 'rmse'},
                        scoring='mae')
ps_hgr3.set_scoring_metric('mae')
ps_hgr3.set_resampling_method('cv')
ps_hgr3.fit(data=train_hdf,
            features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                      'Population', 'AveOccup', 'Latitude', 'Longitude'],
            label='Target', key='ID')
```
Inspecting optimized parameters
The mapping between the optimized parameter names (on the left) and the HybridGradientBoostingRegressor parameter names (on the right) is provided below:
- MAX_DEPTH = max_depth
- ETA = learning_rate
- COL_SAMPLE_RATE_BYSPLIT = col_subsample_split
- NODE_SIZE = min_samples_leaf
- GAMMA = split_threshold
```python
# Inspect the parameter values selected by the search
ps_hgr3.estimator.selected_param_.collect()
```
|   | PARAM_NAME | INT_VALUE | DOUBLE_VALUE | STRING_VALUE |
|---|------------|-----------|--------------|--------------|
| 0 | MAX_DEPTH | 10.0 | NaN | None |
| 1 | ETA | NaN | 0.1 | None |
| 2 | COL_SAMPLE_RATE_BYSPLIT | NaN | 0.6 | None |
| 3 | NODE_SIZE | 6.0 | NaN | None |
| 4 | GAMMA | NaN | 0.1 | None |
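Rather than reading the values off the table manually, they can be mapped back to hana_ml keyword arguments programmatically. A minimal sketch, assuming the name mapping listed above (the pal_to_hana dict and best_params variable are illustrative helpers, not part of the library):

```python
import pandas as pd

# Map PAL parameter names back to HybridGradientBoostingRegressor keywords
pal_to_hana = {'MAX_DEPTH': 'max_depth',
               'ETA': 'learning_rate',
               'COL_SAMPLE_RATE_BYSPLIT': 'col_subsample_split',
               'NODE_SIZE': 'min_samples_leaf',
               'GAMMA': 'split_threshold'}

selected = ps_hgr3.estimator.selected_param_.collect()
best_params = {}
for _, row in selected.iterrows():
    # PAL stores each value in either INT_VALUE or DOUBLE_VALUE
    value = row['INT_VALUE'] if pd.notna(row['INT_VALUE']) else row['DOUBLE_VALUE']
    name = pal_to_hana[row['PARAM_NAME']]
    # Integer-typed parameters come back as floats (e.g. 10.0), so cast them
    best_params[name] = int(value) if name in ('max_depth', 'min_samples_leaf') else float(value)

print(best_params)
# e.g. {'max_depth': 10, 'learning_rate': 0.1, 'col_subsample_split': 0.6,
#       'min_samples_leaf': 6, 'split_threshold': 0.1}
```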
```python
# Optimal parameter values selected
hgbt_params = dict(n_estimators=50, subsample=0.8, col_subsample_tree=0.7,
                   split_method='exact', fold_num=10, resampling_method='cv',
                   evaluation_metric='rmse', ref_metric=['mae'],
                   max_depth=10, learning_rate=0.1, col_subsample_split=0.6,
                   min_samples_leaf=6, split_threshold=0.1)
```
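With these values in hand, a tuned model can be trained and re-scored directly. A minimal sketch, assuming hgbt_params unpacks cleanly into the constructor on the hana_ml version in use and that train_hdf/test_hdf are the HANA DataFrames from above:

```python
# Retrain the HGBT regressor with the tuned parameter values
tuned_hgbr = HybridGradientBoostingRegressor(**hgbt_params)
tuned_hgbr.fit(data=train_hdf, key='ID',
               features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                         'Population', 'AveOccup', 'Latitude', 'Longitude'],
               label='Target')

# Re-evaluate on the held-out test data for comparison with the earlier R2
tuned_hgbr.score(test_hdf, key='ID',
                 features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                           'Population', 'AveOccup', 'Latitude', 'Longitude'],
                 label='Target')
```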