Input dataset recap for apply regression scenarios
The application dataset must have the same structure as the corresponding training dataset:
- The same number of variables (additional columns will be ignored)
- The same variable names as the corresponding training dataset
- The same order of presentation of the variables
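The three requirements above can be checked before a dataset is selected. The following is a minimal sketch in plain Python, not part of Smart Predict: the function name `check_application_schema` and the column names are hypothetical, and the check only compares column names and their order.

```python
def check_application_schema(training_columns, application_columns):
    """Compare the columns of an application dataset with the training dataset.

    Hypothetical helper for illustration only. Extra columns are reported
    but do not make the dataset incompatible, since they would be ignored.
    """
    training = list(training_columns)
    application = list(application_columns)

    # Columns the model was trained on but that the application dataset lacks.
    missing = [c for c in training if c not in application]
    # Columns unknown to the training dataset.
    extra = [c for c in application if c not in training]

    # Compare the order of the columns that both datasets share.
    shared_in_application_order = [c for c in application if c in training]
    shared_in_training_order = [c for c in training if c in application]
    same_order = shared_in_application_order == shared_in_training_order

    return {
        "missing": missing,
        "extra": extra,
        "same_order": same_order,
        "compatible": not missing and same_order,
    }


# Hypothetical column names.
training_cols = ["customer_id", "region", "age", "deal_value"]
application_cols = ["customer_id", "region", "age", "deal_value", "notes"]
report = check_application_schema(training_cols, application_cols)
print(report)
```

In this sketch a dataset is reported as compatible when no training column is missing and the shared columns appear in the same order.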
By analyzing the training dataset, Smart Predict generates a regression model that explains and predicts the target variable, based on the variables identified as influencers.
Once the regression model is trained, it can be applied to an application dataset. This generates the predicted values of the target in the output dataset.
Apply your predictive model
Open the relevant predictive model and select the Apply Predictive Model icon to open the Apply Predictive Model dialog.
Apply to population
In the Data Source field, select the new dataset (application dataset) onto which you want to apply your predictive model.
In this section, you have a number of options to select the additional columns you want to include in your output dataset.
Replicated columns: Select the variables from the dataset that you used to train the model that should be part of the output dataset. The application process does not take into account any columns of the application dataset that do not belong to the training dataset.
Statistics & Predictions: The Statistics & Predictions dropdown (shown below) lists the statistics and predictions that can be included in the output dataset. If you do not select any of them, only the target variable and the key variable(s) are included.
The Statistics & Predictions options include:
- Apply Date: The start date of the predictive model application. The type of the column is TIMESTAMP.
- Train Date: The start date of the predictive model training. The type of the column is TIMESTAMP.
- Assigned Bin: When you apply a regression predictive model to an input dataset, you can include the bin assigned to each observation in the output. During the training step, Smart Predict uses past observations in a training dataset to create a predictive model. In the application step, it associates each observation with a predicted value.
Based on this value, Smart Predict ranks the observations from the highest to the lowest predicted value and groups them into 10 bins. Each bin represents 10% of the observations, and the observations in a bin have the same predicted value or fall within the same range of values.
To assign the observations of the input dataset, Smart Predict refers to the bins defined in the training step. It compares each value obtained by the predictive model with the limits of each bin defined during training, then assigns the observation to the relevant bin.
In the example below, a regression model is used to predict the deal values for the next quarter. The dataset contains observations on 3,000 customers, and the assigned bins are used to monitor the population structure. Each bin should contain approximately 10% of the observations. If this share increases or decreases for one or several bins, the population is changing and the predictive model might need to be retrained with more recent data.
- On the left, the distribution per bin in the output dataset is close to the distribution in the training dataset.
- On the right, in the application dataset, 14% of customers fall in the top bin, which is clearly more than the 10% expected from the training dataset.
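The bin assignment and the monitoring check described above can be sketched as follows. This is an illustration under stated assumptions, not Smart Predict's actual implementation: the function names, the way bin limits are derived (the lowest training prediction in each bin), and the `tolerance` threshold are all hypothetical.

```python
from collections import Counter


def training_bin_limits(training_predictions, n_bins=10):
    """Return the lower limits of bins 1..n_bins-1, highest bin first.

    Assumption: the limit of a bin is the smallest training prediction
    that falls in it when observations are ranked from high to low.
    """
    ranked = sorted(training_predictions, reverse=True)
    size = len(ranked)
    if size < n_bins:
        raise ValueError("need at least one training observation per bin")
    return [ranked[(k * size) // n_bins - 1] for k in range(1, n_bins)]


def assign_bin(prediction, limits):
    """Bin 1 holds the highest predictions, the last bin the lowest."""
    # Every limit the prediction falls below pushes it one bin further down.
    return 1 + sum(1 for limit in limits if prediction < limit)


def bin_shares(predictions, limits):
    """Share of observations that land in each bin."""
    counts = Counter(assign_bin(p, limits) for p in predictions)
    total = len(predictions)
    return {b: counts.get(b, 0) / total for b in range(1, len(limits) + 2)}


def drifting_bins(training_shares, application_shares, tolerance=0.03):
    """Bins whose share moved by more than a hypothetical tolerance."""
    return [
        b
        for b in sorted(training_shares)
        if abs(application_shares[b] - training_shares[b]) > tolerance
    ]


# Synthetic predictions for illustration only.
training_predictions = list(range(1, 101))
limits = training_bin_limits(training_predictions)
training_shares = bin_shares(training_predictions, limits)

# An application population with more high predictions than in training.
application_predictions = list(range(1, 87)) + [95] * 14
application_shares = bin_shares(application_predictions, limits)

print(application_shares[1])
print(drifting_bins(training_shares, application_shares))
```

Because the limits come from the training step only, a shift in the application population shows up directly as bin shares that move away from 10%.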
- Outlier Indicator: For each row in the application dataset, the outlier indicator is 1 if the row is an outlier with respect to the target, otherwise it is 0. An observation is considered an outlier when the prediction error is greater than 3 times the average prediction error found on similar observations.
- Predicted Value: Selecting this option adds the value predicted by the regression model to the output dataset.
- Prediction Explanation: Displays the reasons why Smart Predict generated a specific prediction for a specific entity of the application dataset.
An explanation (or reason) is a combination of a variable and its value, for example: age, 35. It corresponds to the value assigned to a given variable in order to produce a specific prediction.
The strength tells how much this value impacts the prediction and the direction of this impact. In a regression model, a positive strength increases the predicted value, while a negative strength decreases the predicted value.
Smart Predict generates at most 10 explanations. When the predictive model uses more than 10 influencers to generate the predictions, Smart Predict aggregates the explanations with the lowest absolute strength (the least contributing influencers) into two groups:
- Positive Others: aggregates the smallest positive influencers
- Negative Others: aggregates the smallest negative influencers
The strength associated with Positive Others and Negative Others is the sum of the strengths of the aggregated explanations. When the predictive model uses 10 or fewer influencers to generate the predictions, the Others groups are not generated, because the provided list of explanations is already complete.
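The aggregation into Positive Others and Negative Others can be sketched as follows. This is a hedged illustration, not Smart Predict's code: the function name `summarize_explanations`, the tuple layout, and the assumption that the strongest `max_individual` explanations are kept individually while the rest are aggregated are all hypothetical.

```python
def summarize_explanations(explanations, max_individual=10):
    """Keep the strongest explanations and aggregate the remainder.

    `explanations` is a list of (variable, value, strength) tuples.
    Assumption: the `max_individual` explanations with the highest
    absolute strength are kept as they are; the others are summed
    into "Positive Others" and "Negative Others".
    """
    ranked = sorted(explanations, key=lambda e: abs(e[2]), reverse=True)
    kept, rest = ranked[:max_individual], ranked[max_individual:]
    if not rest:
        # The list is already complete, so no Others group is needed.
        return kept

    positive = [strength for _, _, strength in rest if strength > 0]
    negative = [strength for _, _, strength in rest if strength < 0]

    summary = list(kept)
    if positive:
        summary.append(("Positive Others", None, sum(positive)))
    if negative:
        summary.append(("Negative Others", None, sum(negative)))
    return summary


# Hypothetical explanations; max_individual is lowered to 2 for brevity.
example = [
    ("age", 35, 5.0),
    ("region", "North", -4.0),
    ("tenure", 3, 1.0),
    ("segment", "B", 0.5),
    ("channel", "web", -0.25),
]
summary = summarize_explanations(example, max_individual=2)
for variable, value, strength in summary:
    print(variable, value, strength)
```

With the default `max_individual=10`, a list of 10 or fewer explanations is returned unchanged, which matches the case where no Others group is generated.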
Output As: Give a name to your generated dataset.