Explaining predictive modeling concepts

Objectives
After completing this lesson, you will be able to:

  • Create predictive scenarios in SAP Analytics Cloud
  • Define data sources and variables used in predictive modeling
  • Define hold out samples
  • Explain the correct data structure required for each of the different predictive models used in Smart Predict
  • Access data for a predictive model
  • Define trained and not trained models

Create predictive scenarios

What is a predictive scenario?

A predictive scenario is a pre-configured workspace that you can use to create predictive models and reports to address a business question requiring the prediction of future events or trends.

You can create one or several predictive models within a predictive scenario. Each predictive model produces intuitive visualizations of the results making it easy to interpret its findings.

Types of predictive scenarios

Smart Predict offers three types of predictive scenarios: classification, regression, and time series forecasting.

Choose the predictive scenario that provides the best answers to your business question.

Predictive modeling overview

The two phases of predictive modeling

The learning phase

In the learning phase, the model is trained on historical data with a known target outcome.

The model is trained to identify patterns in past data in order to predict the target in a later period. In this example, the Target data time frame (April) occurs AFTER the historical data time frame (January to March).

A Reference date of April 1st is set and the model is trained on 3 months of historical data to predict what happens in the following month (April).

The learning phase. Data for January to March is used to predict churn for April.

The applying phase

In the applying phase, the model is applied on current data where the outcome is unknown.

The model predicts the outcome probability for each client ID. In this example, the model is applied on the latest 3 months of data (April – June) providing the prediction of the probability of churn in the future month (July).

The applying phase. Data for April to June is used to predict churn for July.

A Reference date of July 1st is set and the model is applied to 3 months of current data to predict what happens in the following month (July).
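The relationship between the reference date, the historical window, and the target window in the two phases can be sketched in a few lines of Python. This is only an illustration of the date arithmetic described above, not how Smart Predict works internally. The function names and the year 2024 are assumptions made for the example.

```python
from datetime import date


def add_months(first_of_month: date, months: int) -> date:
    """Shift a first-of-month date by a number of months (may be negative)."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def windows(reference: date, history_months: int = 3) -> dict:
    """Return the historical and target windows around a reference date.

    Each window is a (start, end) pair where the end date is exclusive.
    """
    return {
        "history": (add_months(reference, -history_months), reference),
        "target": (reference, add_months(reference, 1)),
    }


# Learning phase: reference date April 1st (the year is an assumption).
learning = windows(date(2024, 4, 1))
# Applying phase: reference date July 1st.
applying = windows(date(2024, 7, 1))

print(learning["history"])  # January 1st up to (not including) April 1st
print(applying["target"])   # July 1st up to (not including) August 1st
```

In the learning phase the target window holds known outcomes. In the applying phase the same window lies in the future, so its outcomes are what the model predicts.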

Training data sources

The model settings pane open showing the Training Data Source field.

Datasets

  • In the case of a classification or regression predictive model, the Data Source input dataset can either be a training or an application dataset.
  • In the case of a time series forecasting model, only one dataset is used for training and application.

Training dataset:

The training dataset contains the past observations that will be used to generate the predictive model.

  • In this training dataset, the values of the target variable, which is the variable corresponding to the business issue, are known.
  • By analyzing the training dataset, Smart Predict generates a predictive model that explains and predicts the target variable, based on the variables identified as influencers.

Application dataset:

Once the predictive model is built (trained), it is then applied on an application dataset.

This application dataset must contain the same information structure as the corresponding training dataset as follows:

  • The same number of variables (additional columns will be ignored)
  • The same variable names as the corresponding training dataset
  • The same order of presentation of these variables

Once the predictive model is applied, the predicted values of the target are calculated in the output dataset.
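Before applying a model, it can help to check the application dataset against the structure rules listed above. The following is a minimal sketch of such a check done outside SAP Analytics Cloud. The function name and the column names are hypothetical, and the check only covers column names and their order.

```python
def compare_structure(training_columns, application_columns):
    """Compare column names and order of two datasets.

    Returns (problems, extra): a list of structural problems and a list of
    additional columns that are present only in the application dataset.
    """
    problems = []
    missing = [c for c in training_columns if c not in application_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    extra = [c for c in application_columns if c not in training_columns]
    shared = [c for c in application_columns if c in training_columns]
    if not missing and shared != list(training_columns):
        problems.append("columns are in a different order")
    return problems, extra


# Hypothetical column names for illustration only.
training = ["CustomerID", "Age", "Region", "Bought"]

ok = compare_structure(training, ["CustomerID", "Age", "Region", "Bought", "Notes"])
reordered = compare_structure(training, ["Age", "CustomerID", "Region", "Bought"])
incomplete = compare_structure(training, ["CustomerID", "Age", "Bought"])

print(ok)          # no problems; "Notes" is reported as an additional column
print(reordered)   # order problem
print(incomplete)  # missing column problem
```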

Output dataset:

An output dataset contains the result of applying the predictive model to the application dataset and any additional information requested.

Once the predictive model is applied, the predicted values of the target are created in the output dataset.

The output datasets are saved by default in the folder: Main Menu/Browse/Files, but you can choose another directory if required, or save the output in SAP HANA (if you are connecting to SAP HANA).

Predictive goal

Predictive goal variables

To build a predictive model, the following variable roles are defined:

  1. Target
  2. Influencers

The settings for a predictive model, which include the Target and Influencer variables.

Target variables

The target is the variable that is to be explained or predicted. For example:

  1. A bank wants to predict whether a customer will respond to a marketing communication. The training dataset includes the customer information and contains the target variable <responded to mailing>. This target variable can take the values <Yes> or <No>.
  2. A company wants to predict the number of complaints that customer support will receive this week. The target variable is <Number of customer complaints> and it takes <numerical> values.

IMPORTANT: If the value <Yes> is the least frequent value, the application automatically considers that value to be the target category of the target variable.
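To see which value would be treated as the target category under the rule above, you can count the values of the target column yourself. The sketch below is a plain Python illustration of the "least frequent value" rule, not the application's own code. The function name and sample values are assumptions.

```python
from collections import Counter


def least_frequent_value(values):
    """Return the value that occurs least often in a list of target values."""
    counts = Counter(values)
    return min(counts, key=counts.get)


# Hypothetical target column: most customers did not respond.
responded = ["No", "No", "Yes", "No", "No", "Yes", "No"]
target_category = least_frequent_value(responded)
print(target_category)  # Yes
```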

Influencers

The influencers are variables that describe the data and explain a target variable.

  • An influencer variable corresponds to a column in the dataset.
  • The observations relating to each influencer correspond to the rows in the dataset.

Exclude as influencers:

During predictive model creation (the learning phase), influencer variables can be excluded from the training process. Excluded variables are not used to compute the predictive model, are not included in its statistics, are not retrieved from the data source, and are not needed when the predictive model is applied to an application dataset.

Exclude variables that are directly related to the target, especially variables that contain the target variable indirectly. They are known as the leakers or leak variables:

  • These variables are in some way causally related to the target variable. However, instead of being a cause for whatever your target variable represents, they are the result.
  • These leakers produce an incorrect model, often with a very high predictive power indicator (because of their high correlation with the target variable).
  • To prevent data leakage, any variable created or updated after the target value reference date should be excluded because when you use the model to make new predictions, that data won't be available.
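The last rule above lends itself to a simple check, provided you know when each variable was created or last updated. The sketch below assumes that this information is available as a mapping from variable name to date. That mapping, the function name, and the variable names are all hypothetical. Smart Predict does not supply them, and deciding which variables to exclude remains a manual choice.

```python
from datetime import date


def split_leak_candidates(last_updated, reference_date):
    """Separate variables into those safe to keep and those to exclude.

    A variable is a leak candidate if it was created or updated after the
    reference date, because that data will not exist at prediction time.
    """
    keep = [name for name, updated in last_updated.items()
            if updated <= reference_date]
    exclude = [name for name, updated in last_updated.items()
               if updated > reference_date]
    return keep, exclude


# Hypothetical metadata: when each variable was last written.
last_updated = {
    "tenure_months": date(2024, 3, 31),
    "contract_type": date(2024, 2, 15),
    "cancellation_reason": date(2024, 4, 20),  # filled in after the churn event
}

keep, exclude = split_leak_candidates(last_updated, date(2024, 4, 1))
print(keep)     # variables known before the reference date
print(exclude)  # leak candidates
```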

Limit the number of influencers

During training, Smart Predict chooses an optimized number of influencers to include in the predictive model, so the toggle is turned off by default.

If there is a reason for overriding the Smart Predict default setup, for example, when it is necessary to focus only on a few influencers that have the most influence on the target, then switch on the toggle and set the maximum number of influencers to be kept in the model.

Hold out samples

When you are building your model in SAP Analytics Cloud, a hold out sample is automatically created. This is a sample of observations withheld from the model learning, so that the model's ability to predict future probabilities can be estimated by its ability to predict the data in the hold out sample.

Data is partitioned and split into:

  1. a training subset to train the models
  2. a validation subset (the hold out sample) to test the model's performance and choose the best performing model from a range of candidate models

Hold out samples

  1. The analytical dataset is partitioned into training and validation subsets.
  2. The training subset is used to train candidate models.
  3. The validation subset is used to evaluate the candidate models, choose the best one, and assess its performance.

Analytical dataset feeds into (1) Partition data. This data goes to training subset and validation subset. The training subset is used to (2) produce models, which also feeds into validation subset. (3) The candidate models are evaluated to choose the best one.

Classification and regression models

For classification and regression models, the data is automatically randomly split:

  • 75% into the training subset
  • 25% into the validation subset
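A random 75/25 partition of this kind can be sketched as follows. Smart Predict performs the split automatically, so this is only an illustration of the idea. The function name and the fixed seed are assumptions made so that the example is repeatable.

```python
import random


def random_split(rows, training_share=0.75, seed=0):
    """Shuffle the rows and split them into training and validation subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_share)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 observations
training_subset, validation_subset = random_split(rows)
print(len(training_subset), len(validation_subset))  # 75 25
```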

Time series models

In time series models, it is important to preserve the data sequence relative to date/time, so the historical data is automatically split sequentially:

  • 75% into the training subset
  • 25% into the validation subset
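The sequential split differs from a random split only in that the rows are ordered by date and cut at a single point, so the validation subset always contains the most recent observations. A minimal sketch, with a hypothetical function name and sample data:

```python
def sequential_split(rows, date_key, training_share=0.75):
    """Order rows by date and cut once, keeping the oldest rows for training."""
    ordered = sorted(rows, key=date_key)
    cut = int(len(ordered) * training_share)
    return ordered[:cut], ordered[cut:]


# Hypothetical monthly sales history (ISO year-month strings sort correctly).
history = [
    {"date": f"2024-{month:02d}", "sales": 100 + month} for month in range(1, 9)
]

train_part, validation_part = sequential_split(history, lambda r: r["date"])
print([r["date"] for r in train_part])       # 2024-01 through 2024-06
print([r["date"] for r in validation_part])  # 2024-07 and 2024-08
```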

Data structure required for classification models

Datasets used for a classification predictive scenario

In the example below, you are building a classification model to predict if a customer will buy a product or not.

To do this, you must prepare a training dataset containing the historical data on customers that have previously bought a similar product. In the training dataset, the actual values of the target variable (Bought) are known.

Training, application, and output datasets for a classification model.

Once the predictive model is built, it is applied to an application dataset to predict if other customers will buy the product. This dataset contains the same information about the customers as the training dataset. However, the target variable (Will buy?) is an unknown value (empty) because this is to be predicted.

Smart Predict will use the predictive model to calculate the probability that each customer in the application dataset will buy the product. The target variable column is now added to the output dataset with the predicted outcome (will they buy yes/no).

Data structure required for regression models

In the example below, you are building a regression model to predict the number of complaints that customer support will receive next week.

To do this, you must prepare a training dataset containing historical values for several previous weeks. In this dataset, the historical values, or actual values, of the target variable (how many complaints per week) are known.

Training, application, and output datasets for a regression model

Once the regression predictive model is built, it is applied to a new application dataset that contains the same influencers. The values of the target variable for next week (number of complaints next week) are unknown because they are in the future.

The output dataset contains the prediction of the number of complaints that can be expected next week.

Data structure required for time series models

In the example below, you are building a time series model to forecast the product sales for the next 3 months.

The forecasting model is trained on the training dataset and automatically creates forecasts of the signal for N periods into the future. So, in this sense, the training dataset and application dataset are the same dataset. In this example, product sales forecasts for the next 3 months are generated in an output dataset.

Training, application, and output datasets for a time series model

You can optionally include additional variables in the data model to refine and improve the forecasts (for example, Super Promo, a flag that indicates when a sales promotion occurred). The future values for these additional variables are required inputs to generate the forecasts.

Data structure required for segmented time series models

In the example below, you are building a time series model using a single training/application dataset to forecast product sales for multiple products.

To do this, you must add the product-related information to the dataset. This is called the entity. In this example, you specify that the entity variable Product Name will be used to segment the forecast model.

Training, application, and output datasets for a segmented time series model

In the output dataset, the observations are divided into Product Name and forecasts are created per product entity.
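The effect of an entity variable on the data structure can be illustrated with a short sketch: rows are grouped by the entity and each group is forecast separately. The forecast used here (repeating the last observed value) is a deliberate placeholder and is not the algorithm Smart Predict uses. The function name, column names, and values are assumptions.

```python
from collections import defaultdict


def forecast_per_entity(rows, horizon=3):
    """Group sales by product and produce one forecast list per product."""
    series = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["date"]):
        series[row["product"]].append(row["sales"])
    forecasts = {}
    for product, values in series.items():
        # Placeholder forecast: repeat the last observed value.
        forecasts[product] = [values[-1]] * horizon
    return forecasts


# Hypothetical dataset: one row per date and per product (the entity).
rows = [
    {"date": "2024-01", "product": "A", "sales": 10},
    {"date": "2024-01", "product": "B", "sales": 5},
    {"date": "2024-02", "product": "A", "sales": 12},
    {"date": "2024-02", "product": "B", "sales": 7},
]

forecasts = forecast_per_entity(rows)
print(forecasts)  # one list of 3 forecast values per product
```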

Access data for a predictive model

Scenario

You are ready to build a predictive model. Before you can build and train the model, you must first access the data source for the model.

In this example, we will demonstrate accessing data using a classification model. However, the same steps apply for regression and time series models as well.

What skills will you develop in this practice exercise?

In this practice exercise, you will be able to perform the following tasks in SAP Analytics Cloud:

  1. Access a data source for a predictive model
  2. Retrieve the dataset once it has been created

Training status

Once you have entered the settings of your predictive model, you need to train the predictive model before you can evaluate its accuracy.

If you save your predictive model before you train it, the predictive model is saved with the status Not Trained in the predictive model list in the predictive scenario.

An untrained predictive model with the status set as Not Trained.

When you "train" a predictive model, Smart Predict explores the relationships between the different influencer variables contained in your dataset to find the best combination to predict the target.

Once training is complete, the status of your predictive model is updated to Trained in the predictive model list.

A trained predictive model with the status set to Trained. The creation date, predictive power and prediction confidence fields are filled.
