Input Data Set Recap for Apply Regression Scenarios
The application data set must contain the same information structure as the corresponding training data set as follows:
- The same number of variables (extra columns are ignored).
- The same variable names as the corresponding training data set.
- The same order of presentation of the variables.
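The three requirements above can be checked before the model is applied. The following is a minimal sketch in plain Python, not part of Smart Predict: the function name `check_application_schema` and the column names in the usage example are hypothetical, and the check assumes that each data set is described by its list of column names.

```python
def check_application_schema(training_columns, application_columns):
    """Return a list of problems; an empty list means the layouts are compatible."""
    problems = []

    # Every variable of the training data set must exist in the application data set.
    missing = [c for c in training_columns if c not in application_columns]
    if missing:
        problems.append(f"missing variables: {missing}")

    # Extra columns are ignored, so only the shared columns are compared for order.
    shared = [c for c in application_columns if c in training_columns]
    expected = [c for c in training_columns if c in application_columns]
    if shared != expected:
        problems.append(
            "variables are not in the same order as in the training data set"
        )
    return problems
```

For example, `check_application_schema(["age", "income"], ["income", "age"])` reports an ordering problem, while an application data set that only adds an extra column passes the check.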
By analyzing the training data set, Smart Predict generates a regression model that explains and predicts the target variable, based on the variables identified as influencers.
Once the regression model is trained, it can be applied on an application data set, generating the predicted values of the target in the output data set.
Apply Your Predictive Model
Open the relevant predictive model and select the Apply Predictive Model icon to open the Apply Predictive Model dialog.
Apply to Population
In the Data Source field, select the new data set (application data set) onto which you want to apply your predictive model.
Generated Data Set
In this section, you have several options to select the additional columns you want to include in your output data set.
Replicated columns: Select the variables of the training data set that you want to include in the output data set. The application process ignores any columns of the application data set that do not belong to the training data set.
Statistics & Predictions: In the Statistics & Predictions dropdown list, select the statistics and predictions that you want to include in the output data set. If you do not select any statistics or predictions, only the target variable and the key variables are included.
The Statistics & Predictions options include:
- Apply Date: The apply date is the start date of the predictive model application. The type of the column is TIMESTAMP.
- Train Date: The train date is the start date of the predictive model training. The type of the column is TIMESTAMP.
- Assigned Bin: When you apply a regression predictive model to an input data set, you can include the assigned bin of each observation in the output. During the training step, Smart Predict uses past observations in a training data set to create a predictive model. In the application step, Smart Predict associates each observation with a predicted value.
Based on this value, it sorts the observations from the highest to the lowest predicted value and groups them into 10 bins. Each bin represents 10% of the observations, and the observations in a bin have the same predicted value or range of predicted values.
The limits of each bin are defined in the training step. In the application step, Smart Predict compares each predicted value with these limits and assigns the observation from the input data set to the relevant bin.
In the following example, a regression model is used to predict the deal values for the next quarter. The data set contains observations on 3,000 customers. The assigned bins are used to monitor the population structure. Each bin is expected to contain approximately 10% of the observations. Therefore, if this share increases or decreases for one or several bins, it indicates that the population is changing, and the predictive model may need to be retrained with more recent data.
- On the left, the distribution per bin in the output data set is similar to the distribution in the training data set.
- On the right, in the apply data set, 14% of customers are in the top bin. If you check the build data set, you see that this is more than the 10% of customers expected.
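The monitoring idea in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Smart Predict's implementation: the function names are hypothetical, the bin limits are taken as simple cut points in the training predictions sorted from highest to lowest, and a predicted value equal to a limit is placed in the higher bin.

```python
from collections import Counter


def bin_limits(training_predictions, n_bins=10):
    """Lower limits of bins 1..n_bins-1, derived from the training predictions."""
    if len(training_predictions) < n_bins:
        raise ValueError("need at least one training prediction per bin")
    ordered = sorted(training_predictions, reverse=True)
    size = len(ordered) / n_bins
    # The lower limit of bin k is the smallest training prediction inside bin k.
    return [ordered[int(size * k) - 1] for k in range(1, n_bins)]


def assign_bin(predicted_value, limits):
    """Return the 1-based bin number; bin 1 holds the highest predicted values."""
    for index, lower_limit in enumerate(limits, start=1):
        if predicted_value >= lower_limit:
            return index
    return len(limits) + 1


def bin_shares(predictions, limits):
    """Share of observations per bin, using limits fixed at training time."""
    counts = Counter(assign_bin(p, limits) for p in predictions)
    total = len(predictions)
    return {b: counts.get(b, 0) / total for b in range(1, len(limits) + 2)}
```

Computing `bin_shares` on the application predictions with the limits from training gives the per-bin percentages to compare against the expected 10%. A top-bin share of 0.14, as in the example, would be the signal that the population has shifted.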
- Outlier Indicator: For each row in the application data set, the outlier indicator is 1 if the row is an outlier with respect to the target; otherwise it is 0. An observation is considered an outlier when its prediction error is greater than three times the average prediction error found on similar observations.
- Predicted Value: Select this option to include the value predicted by the regression model in the output table.
- Prediction Explanation: Select this option to display the reasons why Smart Predict generated a specific prediction for a specific entity of the application data set.
An explanation (or reason) is a combination of a variable and its value, for example: age, 35. It corresponds to the value assigned for a given variable to produce a specific prediction.
The strength of an explanation indicates how much this value impacts the prediction and in which direction. In a regression model, a positive strength increases the predicted value, while a negative strength decreases the predicted value.
Smart Predict generates a maximum of 10 explanations. When the predictive model uses more than 10 influencers to generate the predictions, Smart Predict aggregates the explanations with the lowest absolute strength (the least contributing influencers) into two groups:
- Positive Others: aggregates the smallest positive influencers.
- Negative Others: aggregates the smallest negative influencers.
The strength associated with Positive Others and Negative Others is the sum of the strengths of the aggregated explanations. When the predictive model uses 10 or fewer influencers to generate the predictions, the Others groups are not generated because the provided list of explanations is already complete.
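The aggregation rule described above can be illustrated with a short Python sketch. The function name and the tuple layout `(variable, value, strength)` are hypothetical. The text does not state how many individual explanations remain next to the two groups, so the `keep` parameter (8 by default, so that the result has 10 entries) is an assumption rather than documented behavior.

```python
def aggregate_explanations(explanations, limit=10, keep=8):
    """Collapse the weakest explanations into Positive Others / Negative Others.

    explanations: list of (variable, value, strength) tuples.
    keep: number of individual explanations retained (assumed, not documented).
    """
    # With `limit` or fewer explanations the list is complete; no groups are needed.
    if len(explanations) <= limit:
        return list(explanations)

    # Rank by absolute strength so the least contributing explanations come last.
    ranked = sorted(explanations, key=lambda e: abs(e[2]), reverse=True)
    kept, rest = ranked[:keep], ranked[keep:]

    # Each group's strength is the sum of the strengths it aggregates.
    positive = sum(s for _, _, s in rest if s > 0)
    negative = sum(s for _, _, s in rest if s < 0)
    return kept + [
        ("Positive Others", None, positive),
        ("Negative Others", None, negative),
    ]
```

With 12 explanations whose four weakest strengths are 4, -3, 2 and -1, the sketch returns eight individual explanations followed by Positive Others with strength 6 and Negative Others with strength -4.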
Output As: Give a name to your generated data set.