Explaining Predictive Modeling Concepts

Objectives

After completing this lesson, you will be able to:

  • Create predictive scenarios in SAP Analytics Cloud
  • Define data sources and variables used in predictive modeling
  • Define hold out samples
  • Explain the correct data structure required for each of the different predictive models used in Smart Predict
  • Access data for a predictive model
  • Define trained and not trained models

Create Predictive Scenarios

What Is a Predictive Scenario?

A predictive scenario is a preconfigured workspace. You can use predictive scenarios to create predictive models and reports to address a business question that requires the prediction of future events or trends.

You can create one or several predictive models within a predictive scenario. Each predictive model produces intuitive visualizations of its results, making the findings easy to interpret.

Types of Predictive Scenarios

There are three types of predictive scenarios available in SAP Analytics Cloud, and each scenario is covered in a separate unit in this course. You can choose the one that best fits your business question.

  1. Classification Predictive Scenario: What is the likelihood that a future event occurs? This event is observed at an individual level (customer, asset, product, etc.) and at a certain horizon (in the year, before the end of the week, in a month after customer contact, etc.). For example:
    • Who is likely to buy your new product?
    • Which client is or is not a candidate for churn?
  2. Regression Predictive Scenario: What could be the prediction of a business value, taking into account the context of its occurrence? For example:
    • What will be the revenue generated by the product line, based on planned transport charges and tax duties?
    • What will be the estimated price of the home based on location and square footage?
  3. Time Series Forecast Predictive Scenario: What are the future values of a business value over time, at a certain granularity/place? For example:
    • How much ice cream will I sell daily over the next 12 months?
    • I have my historical daily sales information, but how does a vacation month impact my sales?

Phases of Predictive Modeling

The Learning Phase

In the learning phase, the model is trained on historical data with a known target outcome.

The model is trained to identify patterns in the data from the past to predict the target in the following, or later, months. In this example, the Target data time frame (April) occurs AFTER the historical data time frame (January to March).

A Reference date of April 1 is set and the model is trained on three months of historical data to predict what happens in the following month (April).

The Applying Phase

In the applying phase, the model is applied on current data where the outcome is unknown.

The model predicts the outcome probability for each client ID. In this example, the model is applied on the latest three months of data (April–June) providing the prediction of the probability of churn in the future month (July).

A Reference date of July 1 is set and the model is applied to three months of current data to predict what happens in the following month (July).
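The relationship between the reference date, the historical window, and the target window can be sketched in a few lines of Python. This is only an illustration of the date arithmetic in the two examples above, not something Smart Predict asks you to write. The function names and the year 2024 are assumptions; the lesson gives only the months.

```python
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date by a number of months (may be negative)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)

def windows(reference: date, history_months: int = 3):
    """Return (history_start, history_end, target_start, target_end).

    End dates are exclusive. The history window ends where the target
    window begins, which is the reference date.
    """
    return (
        add_months(reference, -history_months),
        reference,
        reference,
        add_months(reference, 1),
    )

# Learning phase: history January to March, target April (year assumed).
learning = windows(date(2024, 4, 1))
# Applying phase: history April to June, target July (year assumed).
applying = windows(date(2024, 7, 1))
```

In both phases the window lengths are identical; only the reference date moves forward.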

Training Data Sources

Training Data Sources Settings Pane

Data Sets

  • For a classification or regression predictive model, the Data Source input data set can either be a training or an application data set.
  • For a time series forecasting model, only one data set is used for training and application.

Training data set:

The training data set contains the past observations that are used to generate the predictive model.

  • In this training data set, the values of the target variable, which is the variable corresponding to the business issue, are known.
  • By analyzing the training data set, Smart Predict generates a predictive model that explains and predicts the target variable, based on the variables identified as influencers.

Application data set:

Once the predictive model is built (trained), it is applied on an application data set.

This application data set must contain the same information structure as the corresponding training data set as follows:

  • The same number of variables (extra columns are ignored).
  • The same variable names as the corresponding training data set.
  • The same order of presentation of these variables.
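Before applying a model, you could check these structural requirements yourself. The following is a minimal sketch in plain Python under one interpretation of the rules above: every training variable must be present, in the same relative order, and extra columns are ignored. The function name and the column names are hypothetical.

```python
def structure_problems(training_cols, application_cols):
    """List the ways an application data set deviates from the training layout."""
    problems = []

    # Every training variable must exist under the same name.
    missing = [c for c in training_cols if c not in application_cols]
    if missing:
        problems.append(f"missing variables: {missing}")

    # Extra columns are ignored, so compare order only over shared variables.
    shared = [c for c in application_cols if c in training_cols]
    expected = [c for c in training_cols if c in application_cols]
    if shared != expected:
        problems.append("variables appear in a different order")

    return problems

training = ["customer_id", "age", "region", "bought"]
ok = structure_problems(training, ["customer_id", "age", "region", "bought", "notes"])
bad = structure_problems(training, ["age", "customer_id", "bought"])
```

Here `ok` is empty because the only difference is an extra column, while `bad` reports both a missing variable and a changed order.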

Once the predictive model is applied, the predicted values of the target are calculated in the output data set.

Output data set:

An output data set contains the result of applying the predictive model to the application data set and any additional information requested.

The output data sets are saved by default in the folder: Main Menu/Browse/Files, but you can choose another directory if necessary, or save the output in SAP HANA (if you are connecting to SAP HANA).

Predictive Goal

Predictive Goal Variables

To build a predictive model, the following variable roles are defined:

  1. Target
  2. Influencers

Target Variables

The target is the variable that is to be explained or predicted. For example:

  1. A bank wants to predict if a customer answers a marketing communication or not. The training data set includes the customer information and contains the target variable <responded to mailing>. This target variable may take the values <Yes> or <No>.
  2. A company wants to predict the number of complaints that a customer support receives this week. The target variable is <Number of customer complaints> and takes <numerical> values.

IMPORTANT: If the value <Yes> is the least frequent value, the application automatically considers that value to be the target category of the target variable.
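The rule in the note can be illustrated with a short Python sketch that picks the least frequent value of a target column. It only mirrors the rule as stated; it is not Smart Predict's implementation, and the function name and sample values are made up.

```python
from collections import Counter

def target_category(values):
    """Return the least frequent value of the target variable."""
    counts = Counter(values)
    # On a tie, min() returns the value that was seen first (an arbitrary choice).
    return min(counts, key=counts.get)

responses = ["No", "No", "Yes", "No", "No", "Yes", "No"]
category = target_category(responses)
```

With two <Yes> values against five <No> values, `category` is `"Yes"`.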

Influencers

The influencers are variables that describe the data and explain a target variable.

  • An influencer variable corresponds to a column in the data set.
  • The observations relating to each influencer correspond to the rows in the data set.

Exclude as Influencers:

During the predictive model creation (learning phase), influencer variables can be excluded from the training process. These exclusions are:

  • not considered for computing the predictive model
  • not included in the statistics for the predictive model
  • not retrieved from the data source
  • not needed when the predictive model is applied on an application data set

Exclude variables that are directly related to the target, especially variables that contain the target variable indirectly. These are known as leakers or leak variables:

  • These variables are in some way causally related to the target variable. However, instead of being a cause for whatever your target variable represents, they are the result.
  • These leakers produce an incorrect model, often with a high predictive power indicator (because of their high correlation with the target variable).
  • To prevent data leakage, variables created or updated after the target value reference date must be excluded because when you use the model to make new predictions, that data won't be available.
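One way to apply the last rule is to compare the date each variable was last created or updated against the reference date. The sketch below assumes you have that metadata available as a dictionary; the variable names, dates, and function name are hypothetical. Treating a variable updated on the reference date itself as too late is also an assumption.

```python
from datetime import date

def split_leakers(variable_dates, reference_date):
    """Separate usable influencers from potential leak variables.

    variable_dates maps each variable name to the date it was last
    created or updated.
    """
    kept, excluded = [], []
    for name, updated in variable_dates.items():
        # Assumption: data dated on or after the reference date is unavailable
        # at prediction time, so the variable is excluded.
        (excluded if updated >= reference_date else kept).append(name)
    return kept, excluded

variables = {
    "age": date(2024, 1, 15),
    "contract_type": date(2024, 3, 20),
    "cancellation_reason": date(2024, 4, 12),
}
kept, excluded = split_leakers(variables, date(2024, 4, 1))
```

In this example `cancellation_reason` is excluded because it is recorded only after the outcome is known.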

Limit the Number of Influencers

During training, Smart Predict chooses an optimized number of influencers to include in the predictive model, so the toggle is turned off by default.

You can override the Smart Predict default setup, for example, to focus only on the few influencers that have the most influence on the target. To do so, switch on the toggle and set the maximum number of influencers to keep in the model.

Hold Out Samples

When you are building your model in SAP Analytics Cloud, a hold out sample is automatically created. The hold out is a sample of observations withheld from the model learning. The model's ability to predict future probabilities is estimated by its ability to predict the data in the hold out sample.

Data is partitioned and split as follows:

  1. A training subset to train the models.
  2. A validation subset (the hold out sample) to test the model’s performance and choose the best performing model from a range of candidate models.

Hold Out Samples Process

  1. The analytical data set feeds into the partition data where it is split into training and validation subsets.
  2. The training subset is used to train candidate models.
  3. The validation subset is used to evaluate candidate models to choose the best one and evaluate model performance.

Classification and Regression Models

For classification and regression models, the data is automatically, randomly split as follows:

  • 75% into the training subset
  • 25% into the validation subset
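A random 75/25 partition can be sketched as follows. Smart Predict performs this split automatically; the code only illustrates the idea. The function name, the fixed seed, and the sample of 100 rows are assumptions.

```python
import random

def random_split(rows, train_share=0.75, seed=0):
    """Shuffle the rows and cut them into training and validation subsets."""
    shuffled = list(rows)
    # A fixed seed keeps the illustration reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

training, validation = random_split(range(100))
```

Every row lands in exactly one subset, so the validation rows are never seen during training.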

Time Series Models

In time series models, it is important to preserve the data sequence relative to date and time. The historical data is automatically split sequentially as follows:

  • 75% into the training subset
  • 25% into the validation subset
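For time series, the same proportions apply, but the cut follows the time order instead of a shuffle. The sketch below is an illustration only; the function name and the twelve monthly observations are hypothetical.

```python
from datetime import date

def sequential_split(observations, train_share=0.75):
    """Split (date, value) pairs by time: oldest part trains, newest validates."""
    ordered = sorted(observations, key=lambda obs: obs[0])
    cut = int(len(ordered) * train_share)
    return ordered[:cut], ordered[cut:]

# Twelve monthly observations with made-up sales values.
history = [(date(2023, month, 1), 100 + month) for month in range(1, 13)]
training, validation = sequential_split(history)
```

The first nine months form the training subset and the last three form the validation subset, so the model is always evaluated on data that is later than anything it learned from.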

Data Structure Required for Classification Models

Data Sets Used for a Classification Predictive Scenario

In the example below, you build a classification model to predict if a customer will buy a product or not.

You must prepare a training data set containing the historical data on customers who have previously bought a similar product. In the training data set, the actual values of the target variable (Bought) are known.

Once the predictive model is built, it is applied to an application data set to predict if other customers will buy the product. This data set contains the same information about the customers as the training data set. However, the target variable (Will buy?) is an unknown value (empty) because this is to be predicted.

Smart Predict uses the predictive model to calculate the probability that each customer in the application data set will buy the product. The target variable column is now added to the output data set with the predicted outcome (will they buy, yes or no).

Data Structure Required for Regression Models

Data Sets Used for a Regression Predictive Scenario

In the example below, you build a regression model to predict the number of complaints that customer support will receive next week.

Prepare a training data set containing historical values for several previous weeks. In this data set, the historical values, or actual values, of the target variable (how many complaints per week) are known.

Once the regression predictive model is built, it is applied to a new application data set, which contains the same influencers. The values of the target variable for next week (the number of complaints next week) are unknown as they are in the future.

The output data set contains the prediction of the number of complaints that can be expected next week.

Data Structure Required for Time Series Models

Data Sets Used for a Time Series Predictive Scenario

In the example below, you build a time series model to forecast the product sales for the next three months.

The forecasting model is trained on the training data set and automatically creates forecasts of the signal for N periods into the future. Therefore, the training data set and application data set are the same data set. In this example, product sales forecasts for the next three months are generated in an output data set.

You can optionally include more variables in the data model to refine and improve the forecasts. For example, a flag (Super Promo) that indicates when a sales promotion occurred. The future values for these additional variables are required inputs to generate the forecasts.
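Because future values of the additional variables are required, it can help to check for gaps before forecasting. The following sketch assumes periods are identified by simple year-month strings and that the known future values of the flag are held in a dictionary; these representations and all names are hypothetical.

```python
def missing_future_values(forecast_periods, known_future_values):
    """Return the forecast periods that have no value for the extra variable."""
    return [p for p in forecast_periods if p not in known_future_values]

forecast_periods = ["2024-07", "2024-08", "2024-09"]
# Planned promotions: 1 means a promotion is scheduled, 0 means none.
super_promo = {"2024-07": 0, "2024-08": 1}
gaps = missing_future_values(forecast_periods, super_promo)
```

Here `gaps` contains `"2024-09"`, the one forecast period for which the promotion flag has not been supplied yet.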

Data Sets Used for a Segmented Time Series Model

In the example below, you build a time series model using a training/application data set to forecast product sales for multiple products.

Add the product-related information to the data set. This information is called the entity. In this example, you specify that the entity variable Product Name is used to segment the forecast model.

In the output data set, the observations are divided into Product Name and forecasts are created per product entity.
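Segmenting by entity amounts to grouping the observations so that each product has its own series. The sketch below shows only that grouping step, not the forecasting itself. The Date and Sales column names and the sample rows are assumptions; Product Name is the entity variable from the example.

```python
from collections import defaultdict

def segment_by_entity(rows, entity_key="Product Name"):
    """Group rows into one chronologically sorted series per entity."""
    segments = defaultdict(list)
    for row in rows:
        segments[row[entity_key]].append((row["Date"], row["Sales"]))
    # ISO-formatted date strings sort in chronological order.
    return {name: sorted(series) for name, series in segments.items()}

rows = [
    {"Product Name": "Vanilla", "Date": "2024-02-01", "Sales": 120},
    {"Product Name": "Mango", "Date": "2024-01-01", "Sales": 80},
    {"Product Name": "Vanilla", "Date": "2024-01-01", "Sales": 110},
]
series = segment_by_entity(rows)
```

Each resulting series would then be forecast separately, which is why the output data set contains forecasts per product entity.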

Access Data for a Predictive Model

Scenario

You are ready to build a predictive model. Before you can build and train the model, you must first access the data source for the model.

In this example, we demonstrate accessing data by using a classification model. However, the same steps apply for regression and time series.

What skills do you develop in this practice exercise?

In this practice exercise, you perform the following tasks in SAP Analytics Cloud:

  1. Access a data source for a predictive model.
  2. Retrieve the data set once it has been created.

Training Status

Once you have entered the settings of your predictive model, you must train the predictive model before you can evaluate its accuracy.

Untrained Predictive Model

If you save your predictive model before you train it, the predictive model is saved with the status Not Trained in the predictive model list in the predictive scenario.

Trained Predictive Model

When you "train" a predictive model, Smart Predict explores the relationships between the different influencer variables in your data set and finds the best combination to predict the target.

The status of your predictive model is updated in the predictive model list as Trained.
