Defining Automated Data Encoding

Objectives

After completing this lesson, you will be able to:

  • Describe the automated data encoding process in SAP Analytics Cloud

Missing Values

A missing value is an empty cell in your data set. A value can be missing because of a data collection error or because the value is not available.

Understanding why data is missing is important. Investigate the cause, especially if some influencers have a high percentage of missing values.

Figure: a table with empty cells (missing values), and the table on the right with those empty cells filled with the word Missing.

Smart Predict handles missing values automatically:

  • Missing values are not excluded. They are replaced with a constant called Missing, which the model then treats like any other category.
  • You can assess the influence of the missing values after you have built the model and debriefed the model output.
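
The replacement described above can be illustrated with a minimal Python sketch. It is not Smart Predict's implementation: the function names `encode_missing` and `missing_share`, the sample column, and the rule that a blank string counts as an empty cell are all assumptions made for illustration.

```python
from collections import Counter

MISSING = "Missing"  # placeholder label, mirroring the constant named in the lesson


def encode_missing(values, placeholder=MISSING):
    """Replace empty cells (None or blank strings) with a placeholder category."""
    encoded = []
    for value in values:
        # Assumption: a blank or whitespace-only string counts as an empty cell.
        is_empty = value is None or (isinstance(value, str) and value.strip() == "")
        encoded.append(placeholder if is_empty else value)
    return encoded


def missing_share(values, placeholder=MISSING):
    """Fraction of cells that carry the placeholder after encoding."""
    if not values:
        return 0.0
    return Counter(values)[placeholder] / len(values)


# Hypothetical influencer column with three empty cells out of six.
marital_status = ["Married", None, "Single", "", "Married", None]
encoded = encode_missing(marital_status)
print(encoded)
print(missing_share(encoded))
```

Because the placeholder is an ordinary category after encoding, the rows with missing values stay in the data set, and the share of the placeholder category gives a quick indication of whether the missing data deserves investigation.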

Outliers

For a continuous variable, an outlier is a value that occurs once or with low frequency and lies far from the mean and from the majority of the variable's other values.

For a categorical variable (nominal or ordinal), an outlier is a single or very low-frequency occurrence of a category of a variable.

An example using a continuous variable (binning)

Figure: the bar chart on the left has an outlier at 16 million. The chart on the right groups salaries into bins to handle the outlier.

Outliers can distort a predictive model and lead to inaccurate predictions, so Smart Predict handles outliers automatically:

  • For nominal/ordinal variables, outliers are grouped into a dedicated noise category called Other, which collects the categories with infrequent or non-robust values.
  • For continuous variables, the impact of outliers is reduced by grouping them into the bin for the smallest or largest values of the encoded variable.
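
Both ideas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Smart Predict's actual algorithm: the lesson does not say how the frequency threshold or the bin boundaries are chosen, so `min_count`, `edges`, the function names, and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def group_rare_categories(values, min_count=2, other_label="Other"):
    """Move categories seen fewer than min_count times into one noise category."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def bin_with_open_ends(values, edges):
    """Return a bin index per value for sorted interior cut points `edges`.

    Bin 0 holds everything below edges[0] and the last bin holds everything
    at or above edges[-1], so extreme values land in the first or last bin.
    """
    return [bisect_right(edges, v) for v in values]


# Hypothetical nominal variable: two categories occur only once.
jobs = ["Sales", "Sales", "Tech", "Tech", "Astronaut", "Tech", "Juggler"]
grouped_jobs = group_rare_categories(jobs)
print(grouped_jobs)

# Hypothetical salaries, including an outlier of 16 million.
salaries = [28_000, 45_000, 61_000, 95_000, 16_000_000]
salary_bins = bin_with_open_ends(salaries, edges=[30_000, 60_000, 90_000])
print(salary_bins)
```

In this sketch the salary of 16 million shares the top bin with 95,000, so the model sees it as "a high salary" rather than as a value thousands of times larger than the rest.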
