Applying Classification, Regression, and Time Series Analysis

Objective

After completing this lesson, you will be able to identify use cases for applying classification, regression, and time series analysis techniques.

Classic Machine Learning Scenarios

This lesson covers three key machine learning methods for SAP HANA: regression, classification, and time series analysis.

These methods can be utilized in various scenarios, such as:

  • Regression: Effective for predicting car prices based on model characteristics and market trends.
  • Classification: Useful for predicting customer behaviors, including churn, fraud detection, and purchasing patterns.
  • Time series analysis: Ideal for forecasting future sales, demand, costs, and other metrics based on historical data.

Linear Regression: House Price as a Function of the House's Living Area (Size)

Linear regression is one of the most widely known statistical techniques. Understanding it is crucial for appreciating the development of various linear-based regression models, such as 'Generalized Linear Models'.

Linear regression helps to uncover linear relationships, i.e., straight-line relationships, between input and output numerical variables, making it a popular technique for predictive and statistical modeling.

In the figure shown, variable 'x' represents the set of values of houses' living areas, whereas variable 'y' represents the prices of the houses.

The continuous values of variable 'y' are predicted by 'h': the function that maps the values of 'x' to 'y'.

For simplicity, the figure only illustrates one data attribute of the houses, which is the living area. In this case, the predictor variable 'x' is continuous.

The process of predicting house prices based on a training dataset given as an input to the learning algorithm.

Regression: Training a Simple Model from Data

Similar to the previous figure, the following plot shows the input variable (houses' living areas) on the x-axis and the output variable (houses' prices) on the y-axis.

A line plotted along the x and the y axis showing a growing house price as the living area in square feet increases

The goal is to build a model that takes the living area of a house as input and predicts the house price. The data points in the plot represent observations from the dataset. Fitting a line to these data points creates the model. The equation of this line is y = mx + c, where 'm' is the slope of the line and 'c' is the y-intercept, i.e., the point where the line crosses the y-axis.
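The line-fitting step can be sketched in plain Python. This is a minimal illustration that assumes ordinary least squares as the fitting method (the lesson does not name one); the function name `fit_line` and the data values are hypothetical and were chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical training data: living area in square feet and house price.
living_area = [1000, 1500, 2000, 2500]
price = [200000, 300000, 400000, 500000]

m, c = fit_line(living_area, price)

# Use the fitted line to predict the price of an unseen 1800 sq ft house.
predicted_price = m * 1800 + c
print(f"m = {m}, c = {c}, predicted price = {predicted_price}")
```

With real, noisy data the points would not fall exactly on the line, which is why a measure of fit is needed.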

Regression - Model's Performance Measurement: Mean Square Error (MSE) or L2 Loss

Next, it is essential to evaluate how well the model fits the dataset, which is determined by the concept of loss. Loss measures the difference between the predicted value and the true value (ground truth).

The Mean Squared Error (MSE), or L2 loss, is a loss function that computes the average of the squared differences between the predicted values 'ŷi' and the observed sample values 'yi'.

The equation for the Mean Squared Error (MSE) is:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

MSE = Mean Squared Error

N = Number of Data Points

yi = Observed Values

ŷi = Predicted Values
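Using the definitions above, the MSE calculation can be sketched in a few lines of Python. The function name `mean_squared_error` and the sample values are hypothetical and only serve to illustrate the arithmetic.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n


# Hypothetical observed values (yi) and model predictions (ŷi).
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Squared differences are 1, 0, and 4, so the MSE is 5 / 3.
mse = mean_squared_error(observed, predicted)
print(mse)
```

Because the differences are squared, large errors weigh more heavily than small ones, and a lower MSE indicates a better fit.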

What is Classification?

Classification is a fundamental machine learning technique aimed at organizing input data into distinct classes. For example, the figure below illustrates how a classification model determines whether an incoming message falls under the category of SPAM or Inbox (non-SPAM).

A classification model determining whether an incoming message falls under the category of SPAM or Inbox (non-SPAM).

During the classification process, the model is trained on the training subset of the data and then evaluated on the test subset.

A notable distinction between classification and regression tasks lies in their output variables. While classification deals with discrete target variables, regression tasks involve continuous output variables, as introduced in previous sections.
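The train-then-evaluate workflow can be sketched with a tiny nearest-neighbor classifier. This is an illustration under stated assumptions, not how SAP HANA implements classification: the two message features (number of links, number of exclamation marks), the data values, and the function name `nearest_neighbor_predict` are all hypothetical.

```python
def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to `point`."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label


# Hypothetical features per message: (number of links, number of exclamation marks).
train_subset = [
    ((0, 0), "Inbox"),
    ((1, 0), "Inbox"),
    ((0, 1), "Inbox"),
    ((5, 4), "SPAM"),
    ((6, 7), "SPAM"),
    ((4, 6), "SPAM"),
]
test_subset = [
    ((1, 1), "Inbox"),
    ((5, 5), "SPAM"),
    ((0, 2), "Inbox"),
]

# Evaluate on messages the model has not seen during training.
correct = sum(
    nearest_neighbor_predict(train_subset, features) == label
    for features, label in test_subset
)
accuracy = correct / len(test_subset)
print(f"accuracy on test subset: {accuracy}")
```

Note that the output is a discrete label ("SPAM" or "Inbox") rather than a continuous number, which is what separates this task from regression.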

Time Series Analysis

Time series data is data collected about a subject at different points in time. Examples include the exports of a country by year, the sales of a particular company over a period of time, or a person's blood pressure taken every minute. Any data captured repeatedly at successive time intervals is a type of time series data.

For example, the figure below shows the United Kingdom's annual mean temperatures, measured in degrees Celsius, from 1800 up to recent years. The data is sourced from the Met Office (Meteorological Office) HadUK-Grid gridded dataset. The Met Office is the UK's national weather and climate service.

The United Kingdom's annual mean temperatures, measured in degrees Celsius, from 1800 up to recent years.

According to the blog of the Met Office news team (https://blog.metoffice.gov.uk/2023/07/14/how-have-daily-temperatures-shifted-in-the-uks-changing-climate/), 30-year meteorological averaging periods reveal an increase of almost 1 degree Celsius in the UK's average annual mean temperature in the latest period (1991-2020) compared with the preceding period (1961-1990). This historical time series data uncovers a long-term warming trend.
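Comparing averaging periods is a simple calculation once the series is available. The sketch below uses a synthetic series, not Met Office data: the values are generated from an assumed steady rise of 0.03 degrees per year purely to show the mechanics, and the function name `period_mean` is hypothetical.

```python
def period_mean(series, start_year, end_year):
    """Mean of the values in `series` (a dict of year -> value) over an inclusive range."""
    values = [value for year, value in series.items() if start_year <= year <= end_year]
    if not values:
        raise ValueError("no data in the requested period")
    return sum(values) / len(values)


# Synthetic annual mean temperatures (NOT real measurements):
# 8.0 degrees in 1961, rising by an assumed 0.03 degrees per year.
synthetic_series = {year: 8.0 + 0.03 * (year - 1961) for year in range(1961, 2021)}

earlier = period_mean(synthetic_series, 1961, 1990)
later = period_mean(synthetic_series, 1991, 2020)

# A positive shift between consecutive 30-year periods indicates a warming trend.
shift = later - earlier
print(f"shift between periods: {shift:.2f}")
```

Averaging over long periods smooths out year-to-year variation, so the difference between two period means reflects the long-term trend rather than individual warm or cold years.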

Overview of Algorithms in SAP HANA

SAP HANA implements a wide variety of algorithms for classification, regression, time series analysis, and other categories.

Typical algorithms for classification include Decision Tree Analysis (CART, C4.5, CHAID), Logistic Regression, Support Vector Machine, K-Nearest Neighbor, Naïve Bayes, and Online Multi-Class Logistic Regression, among others. Evaluation functions such as Confusion Matrix and AUC are also available for assessing classification models.

For regression, typical algorithms include Multiple Linear Regression and Online Linear Regression, among others.

The most popular time series analysis algorithms include (Auto) Exponential Smoothing, Unified Exponential Smoothing, Linear Regression (damped trend, seasonal adjustment), and Hierarchical Forecasting, among others.
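To give a feel for what exponential smoothing does, here is a minimal textbook sketch of simple (single) exponential smoothing in plain Python. It is not the SAP HANA implementation; the function name, the smoothing factor `alpha`, and the sales figures are hypothetical.

```python
def simple_exponential_smoothing(values, alpha):
    """Smooth a series: s[0] = x[0], s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical sales figures for four consecutive periods.
sales = [10.0, 12.0, 14.0, 16.0]
smoothed = simple_exponential_smoothing(sales, alpha=0.5)

# In simple exponential smoothing, the one-step-ahead forecast
# is the last smoothed value.
forecast = smoothed[-1]
print(smoothed, forecast)
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller `alpha` gives more weight to the history. Automated variants search for suitable parameter values instead of requiring them up front.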

For more details, see the Machine Learning Information Map on the SAP Help Portal.

SAP HANA offers a range of algorithms for classification, regression and time series analysis. Examples include Decision Trees, Logistic Regression, Support Vector Machines, and Exponential Smoothing. For details, please see Machine Learning Information Map.
