Understanding Model Evaluation and Optimization

Objectives

After completing this lesson, you will be able to:
  • Evaluate the linear regression model's performance using the R-squared metric
  • Evaluate changes in feature importance in the re-trained model

Model Evaluation

The model performance metric used to evaluate the regression model is the coefficient of determination. It is known as "R squared" and denoted "R2" [1, 2].

R2 is the proportion of the variation in the response variable that is predictable from the independent variables. That is, the larger the value of R2, the more variability is explained by the model.

Typically, R2 ranges from 0 to 1; however, there are cases where negative values can be obtained; refer to [1] for further details.
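The calculation behind this metric can be illustrated in plain Python (a sketch only; the `score` method used below delegates the actual computation to SAP HANA PAL):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variation around the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# A perfect prediction yields R2 = 1; always predicting the mean yields R2 = 0.
print(r2_score([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0
print(r2_score([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0
```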

In this case, an R2 score of '0.75' is achieved.

Python

# Computes the R2 score
R2 = hgr.score(test_hdf,
               key='ID',
               features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                         'Population', 'AveOccup', 'Latitude', 'Longitude'],
               label='Target')
R2

Output:

0.7535287346593224

References

[1] Coefficient of determination

[2] Coefficient of Determination (R-Squared)

Improve the HGBT model with optimal parameter search

To improve the HGBT model, a search for the optimal parameter values of the HGBT regressor object is initiated. ParamSearchCV performs an exhaustive or random search over specified parameter values with cross-validation (CV) [1].
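The idea behind an exhaustive grid search is simply to enumerate the Cartesian product of the candidate parameter values and keep the best-scoring combination. A minimal pure-Python sketch of that loop (the `evaluate` function and the tiny grid are illustrative placeholders, not part of the hana_ml API; in ParamSearchCV the score comes from cross-validation):

```python
from itertools import product

param_grid = {
    'learning_rate': [0.05, 0.1],
    'max_depth': [4, 6],
}

def evaluate(params):
    # Placeholder for k-fold cross-validation of a model trained with
    # `params`; a toy score is returned here so the sketch is runnable.
    return -abs(params['learning_rate'] - 0.1) - abs(params['max_depth'] - 6)

best_score, best_params = float('-inf'), None
keys = list(param_grid)
for values in product(*(param_grid[k] for k in keys)):  # every combination
    params = dict(zip(keys, values))
    score = evaluate(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params)  # {'learning_rate': 0.1, 'max_depth': 6}
```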

References

[1] SAP algorithm hana_ml.algorithms.pal package: ParamSearchCV

Python

from hana_ml.algorithms.pal.model_selection import ParamSearchCV

hgbr = HybridGradientBoostingRegressor(n_estimators=50,
                                       subsample=0.8,
                                       col_subsample_tree=0.7)

ps_hgr3 = ParamSearchCV(estimator=hgbr,
                        search_strategy='grid',
                        param_grid={'learning_rate': [0.05, 0.1, 0.025, 0.04, 0.01],
                                    'max_depth': [4, 5, 6, 7, 8, 10],
                                    'split_threshold': [0.1, 0.4, 0.7, 1],
                                    'min_samples_leaf': [2, 3, 4, 5, 6],
                                    'col_subsample_split': [0.2, 0.4, 0.6, 0.8]},
                        train_control={"fold_num": 10, "evaluation_metric": 'rmse'},
                        scoring='mae')

ps_hgr3.set_scoring_metric('mae')
ps_hgr3.set_resampling_method('cv')

ps_hgr3.fit(data=train_hdf,
            features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                      'Population', 'AveOccup', 'Latitude', 'Longitude'],
            label='Target',
            key='ID')

Inspecting optimized parameters

The mapping between the optimized parameter names (left) and the HybridGradientBoostingRegressor parameter names (right) is provided below:

  • MAX_DEPTH = max_depth
  • ETA = learning_rate
  • COL_SAMPLE_RATE_BYSPLIT = col_subsample_split
  • NODE_SIZE = min_samples_leaf
  • GAMMA = split_threshold
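This mapping can also be applied programmatically to turn the `selected_param_` rows into constructor keyword arguments. A hedged sketch (the `to_kwargs` helper and the hard-coded `rows` tuples are illustrative stand-ins for the records returned by `selected_param_.collect()`):

```python
import math

# PAL parameter name -> HybridGradientBoostingRegressor keyword
PARAM_MAP = {
    'MAX_DEPTH': 'max_depth',
    'ETA': 'learning_rate',
    'COL_SAMPLE_RATE_BYSPLIT': 'col_subsample_split',
    'NODE_SIZE': 'min_samples_leaf',
    'GAMMA': 'split_threshold',
}

def to_kwargs(rows):
    """rows: (PARAM_NAME, INT_VALUE, DOUBLE_VALUE) tuples."""
    kwargs = {}
    for name, int_val, dbl_val in rows:
        # Each row carries its value in either the INT or the DOUBLE column
        value = int(int_val) if not math.isnan(int_val) else dbl_val
        kwargs[PARAM_MAP[name]] = value
    return kwargs

rows = [('MAX_DEPTH', 10.0, float('nan')),
        ('ETA', float('nan'), 0.1),
        ('NODE_SIZE', 6.0, float('nan'))]
print(to_kwargs(rows))  # {'max_depth': 10, 'learning_rate': 0.1, 'min_samples_leaf': 6}
```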
Python

ps_hgr3.estimator.selected_param_.collect()
   PARAM_NAME               INT_VALUE  DOUBLE_VALUE  STRING_VALUE
0  MAX_DEPTH                10.0       NaN           None
1  ETA                      NaN        0.1           None
2  COL_SAMPLE_RATE_BYSPLIT  NaN        0.6           None
3  NODE_SIZE                6.0        NaN           None
4  GAMMA                    NaN        0.1           None
Python

# Optimal parameter values selected
hgbt_params = dict(n_estimators=50,
                   subsample=0.8,
                   col_subsample_tree=0.7,
                   split_method='exact',
                   fold_num=10,
                   resampling_method='cv',
                   evaluation_metric='rmse',
                   ref_metric=['mae'],
                   max_depth=10,
                   learning_rate=0.1,
                   col_subsample_split=0.6,
                   min_samples_leaf=6,
                   split_threshold=0.1)

Model Re-Training

Model re-training is needed to incorporate the optimized algorithm parameters obtained in the previous section, with the aim of maximizing model performance.

Python

%%time
# Retrain the model with the optimal parameters
hgr_optimized = HybridGradientBoostingRegressor(**hgbt_params)
hgr_optimized.fit(train_hdf,
                  features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                            'Population', 'AveOccup', 'Latitude', 'Longitude'],
                  label='Target')

CPU times: user 4.18 ms, sys: 2.36 ms, total: 6.55 ms

Wall time: 2.19 s

Output:

<hana_ml.algorithms.pal.trees.HybridGradientBoostingRegressor at 0x7fe333c01e10>

Model re-trained: Feature importance

Feature importance assigns a score to each input feature based on its contribution to predicting the response, or dependent, variable [1].

The higher the score for a feature, the larger its influence on the model's prediction of the target variable. Importance scores lie in the range [0, 1].
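Gradient-boosting implementations typically derive these scores from the split gain accumulated by each feature across all trees, then normalize so the scores sum to 1. A toy sketch of that normalization step (the raw gain values are invented for illustration; they are not taken from this model):

```python
# Hypothetical accumulated split gain per feature across all trees
raw_gain = {'MedInc': 870.0, 'Longitude': 553.0, 'Latitude': 392.0}

# Normalize so each score lies in [0, 1] and the scores sum to 1
total = sum(raw_gain.values())
importance = {feature: gain / total for feature, gain in raw_gain.items()}

print(importance)
print(sum(importance.values()))
```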

Model re-trained: Feature importance assessment

By comparing against the initial "Model Trained", you can observe that the most influential feature remains 'MedInc', although its importance value decreased from 0.51 to 0.34. This feature represents the median income in a block group [2].

Interestingly, the features 'Longitude' and 'Latitude' have become the second and third most influential, up from the fifth and third positions, respectively, in the initial "Model Trained". In fact, each became roughly twice as influential, crossing the 0.10 mark.

The enhanced model may be highlighting more accurately that geographical location has a stronger influence on the target variable than previously thought. The target variable is the 'Median House Value' for California districts [2].
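A before/after comparison like the one described can be assembled with pandas once both models' `feature_importances_` tables have been collected. A sketch with illustrative numbers (only the 'MedInc' before/after pair matches figures quoted above; the remaining 'before' values are placeholders, since the initial model's full table is not reproduced here):

```python
import pandas as pd

# Importance per feature before and after re-training; 'before' values
# other than MedInc are placeholders for illustration.
before = pd.DataFrame({'VARIABLE_NAME': ['MedInc', 'Latitude', 'Longitude'],
                       'IMPORTANCE': [0.51, 0.09, 0.07]})
after = pd.DataFrame({'VARIABLE_NAME': ['MedInc', 'Latitude', 'Longitude'],
                      'IMPORTANCE': [0.34, 0.15, 0.22]})

cmp = before.merge(after, on='VARIABLE_NAME', suffixes=('_before', '_after'))
cmp['DELTA'] = cmp['IMPORTANCE_after'] - cmp['IMPORTANCE_before']
print(cmp.sort_values('DELTA', ascending=False))
```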

Python

hgr_optimized.feature_importances_.sort('IMPORTANCE', desc=True).collect()
   VARIABLE_NAME  IMPORTANCE
0  MedInc         0.340796
1  Longitude      0.216604
2  Latitude       0.153428
3  AveOccup       0.116022
4  AveRooms       0.097980
5  HouseAge       0.036293
6  Population     0.019592
7  AveBedrms      0.019286

References

[1] SAP algorithm hana_ml.pal package: HybridGradientBoostingRegressor

[2] California Housing dataset description

Model Re-Trained: Model Evaluation

To evaluate the performance of the re-trained model, the coefficient of determination, commonly known as R squared and denoted R2, is used, as outlined in the "Model Evaluation" section [1, 2].

R2 measures the proportion of the variation in the response variable that is explained by the independent variables. A higher R2 value indicates that the model explains a greater proportion of the variability.

While the value of R2 typically ranges from 0 to 1, negative values can occur in some cases - refer to [1] for more details.

In this case, the enhanced model achieves an R2 score of 0.829, an improvement from the initial score of 0.75, representing an increase of 0.079.

Python

# Computes the R2 score
R2 = hgr_optimized.score(test_hdf,
                         key='ID',
                         features=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                                   'Population', 'AveOccup', 'Latitude', 'Longitude'],
                         label='Target')
R2

Output:

0.829

References

[1] Coefficient of determination

[2] Coefficient of Determination (R-Squared)
