Defining automated data encoding

Objectives
After completing this lesson, you will be able to:

  • Describe the automated data encoding process in SAP Analytics Cloud

Missing values

A missing value is an empty cell in your dataset. A value can be missing because of a data collection error or because the information is simply not available.

It is important to understand why data is missing. Consider investigating the cause, especially if some influencers have a high percentage of missing values.

Figure: The table on the left has missing values in some cells. The table on the right has the empty cells filled with the word Missing.

Smart Predict handles missing values automatically:

  • Missing values are not excluded. They are replaced with a constant called Missing, which the model then treats like any other category.
  • You can assess the influence of the missing values after you have built the model and debriefed its output.
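The replacement described above can be illustrated with a minimal Python sketch. This is not Smart Predict's implementation, which the lesson does not describe. The function name `encode_missing`, the sample `region` column, and the choice to treat both `None` and blank strings as empty cells are assumptions made for illustration.

```python
from collections import Counter

# Constant label for empty cells; the name mirrors the lesson text.
MISSING = "Missing"


def encode_missing(values):
    """Replace empty cells (None or blank strings) with the Missing category.

    Hypothetical helper: treating blank strings as empty is an assumption.
    """
    encoded = []
    for value in values:
        if value is None or (isinstance(value, str) and value.strip() == ""):
            encoded.append(MISSING)
        else:
            encoded.append(value)
    return encoded


# Hypothetical influencer column with two empty cells.
region = ["North", None, "South", "", "North"]
encoded_region = encode_missing(region)

# Missing is now counted like any other category.
category_counts = Counter(encoded_region)
```

Because no rows are dropped, the share of the Missing category in `category_counts` gives a quick indication of how much of the influencer was empty before encoding.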

Outliers

For a continuous variable, an outlier is a single or low-frequency value that is far from both the mean and the majority of the other values for that variable.

For a categorical variable (nominal or ordinal), an outlier is a single or very low-frequency occurrence of a category of a variable.

An example using a continuous variable (binning)

Figure: The bar chart on the left has an outlier at 16 million. The chart on the right groups salaries into bins to handle the outlier.

The influence of outliers on a predictive model can lead to inaccurate predictions, so Smart Predict handles outliers automatically:

  • For nominal/ordinal variables, outliers are grouped into a dedicated noise category called Other, together with other infrequent or non-robust categories.
  • For continuous variables, the impact of outliers is reduced by grouping them into the bin for the smallest or largest values of the encoded variable.
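Both kinds of grouping can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Smart Predict's actual algorithm: the lesson does not say how frequency thresholds or bin boundaries are chosen, so the `min_count` threshold, the bin edges, the sample data, and the function names are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def group_rare_categories(values, min_count=2, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label.

    The min_count threshold is an assumption chosen for this example.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def assign_bin(value, edges):
    """Return the index of the bin that contains value.

    edges must be sorted ascending. Bin 0 is open-ended below edges[0] and
    the last bin is open-ended above edges[-1], so extreme values fall into
    the lowest or highest bin instead of stretching the scale.
    """
    return bisect_right(edges, value)


# Nominal variable: "Astronaut" occurs only once, so it moves to Other.
jobs = ["Clerk", "Clerk", "Manager", "Manager", "Astronaut"]
grouped_jobs = group_rare_categories(jobs)

# Continuous variable: the 16 million salary shares the top bin with 95,000.
salary_edges = [30_000, 60_000, 90_000]  # hypothetical bin boundaries
salaries = [25_000, 45_000, 75_000, 95_000, 16_000_000]
salary_bins = [assign_bin(s, salary_edges) for s in salaries]
```

After encoding, the model sees only the category or bin index, so the 16 million salary carries no more weight than any other value in the top bin.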
