Understanding Machine Learning for Time Series Forecasting

Setting Up the Environment and Preparing Data Using SAP HANA DataFrames

Training an SAP HANA PAL Univariate Time Series Model

Forecasting with Multiple Time Series

Exploring and Manipulating Data with SAP HANA Dataframes

Objective

After completing this lesson, you will be able to demonstrate how to format, aggregate, and prepare the dataset for time series forecasting using SAP HANA DataFrames.

Preparing Data for Time Series Forecasting

Before building forecasting models, it's essential to prepare and shape the data in a way that aligns with time series modeling requirements. This includes loading the dataset into SAP HANA, transforming fields into time-aware formats, and performing basic aggregation - all using the SAP HANA DataFrame API, which enables SQL-like operations in Python.

Loading the Time Series Dataset

The overnight stays dataset is freely available and can be uploaded from [1] as shown below. Using the following code, we display the first five observations of the newly created Pandas DataFrame.

More material related to the overnight stays scenario can be found in the very same folder [2].

Code Snippet

123
df_data= pd.read_csv('./OVERNIGHTSTAYS.csv', sep = ',')
df_data.MONTH = pd.to_datetime(df_data.MONTH, format='%d/%m/%Y')
df_data.head(5)

	MONTH	REGION	COUNTRYOFRESIDENCE	OVERNIGHTSTAYS
0	2022-01-01	Graubünden	Switzerland	392805
1	2022-01-01	Graubünden	Baltic States	0
2	2022-01-01	Graubünden	Germany	80648
3	2022-01-01	Graubünden	France	4229
4	2022-01-01	Graubünden	Italy	10083

Creating an SAP HANA DataFrame

The SAP function hana_ml.dataframe.create_dataframe_from_pandas() [3] creates an SAP HANA DataFrame from a Pandas DataFrame and creates a table in SAP HANA, that is, table_name = 'OVERNIGHTSTAYS'.

Then, we display the first five observations of the SAP HANA DataFrame, and then DataFrame.collect() copies the current DataFrame to a new Python Pandas DataFrame [4].

Code Snippet

1234567

df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn,
                                                   pandas_df = df_data,
                                                   table_name = 'OVERNIGHTSTAYS',
                                                   force = True,
                                                   replace = False)
display(df_remote.head(5).collect())

100%|██████████| 1/1 [00:01<00:00, 1.44s/it]

	MONTH	REGION	COUNTRYOFRESIDENCE	OVERNIGHTSTAYS
0	2022-01-01	Graubünden	Switzerland	392805
1	2022-01-01	Graubünden	Baltic States	0
2	2022-01-01	Graubünden	Germany	80648
3	2022-01-01	Graubünden	France	4229
4	2022-01-01	Graubünden	Italy	10083

We then show the structure of the table that has been created as follows:

_Output:

[('MONTH', 'TIMESTAMP', 27, 27, 27, 0), ('REGION', 'NVARCHAR', 5000, 5000, 5000, 0), ('COUNTRYOFRESIDENCE', 'NVARCHAR', 5000, 5000, 5000, 0), ('OVERNIGHTSTAYS', 'INT', 10, 10, 10, 0)]

Loading Data into SAP HANA DataFrames

We can also explicitly load the overnight stays data into the SAP HANA DataFrame as shown below.

Then, we display the SQL query backing the DataFrame.

Finally, we display the first six observations of the DataFrame.

Code Snippet

123
hdf_overnightstays = conn.table('OVERNIGHTSTAYS')
display(hdf_overnightstays.select_statement)
display(hdf_overnightstays.head(6).collect())

'SELECT * FROM ''OVERNIGHTSTAYS'' '

	MONTH	REGION	COUNTRYOFRESIDENCE
0	2022-01-01	Graubünden	Baltic States
1	2022-01-01	Graubünden	Australia, New Zealand, Oceania
2	2022-01-01	Eastern Switzerland	Baltic States
3	2022-01-01	Eastern Switzerland	Australia, New Zealand, Oceania
4	2022-01-01	Zurich Region	Baltic States
5	2022-01-01	Zurich Region	Australia, New Zealand, Oceania

Counting the Rows in the Dataset

Next, DataFrame.count()[5] computes the number of rows in the SAP HANA DataFrame and displays it.

Code Snippet

1
display(hdf_overnightstays.count())

29029

Analysis of the DataFrame

Data manipulation and analysis are vital in the field of Data Science. The SAP class method DataFrame.describe() returns a DataFrame that contains various statistics of data attributes/columns [6].

Such descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution. This helps to organize and present data in a meaningful manner.

For example,in the following figure, the row labeled 3 displays the statistics for the COUNTRYOFRESIDENCE column. This column represents the visitor's country of residence. We notice that it has 29029 observations as the rest of the dataset's columns. It has no null values, and we may remember it is a character-based data type; however, a numerical attribute is the first column, that is the OVERNIGHTSTAYS column, for which several statistics are computed, namely minimum, maximum, mean, median, and so on. We may remember the median is the value splitting the higher half from the lower half of the dataset [7].

Code Snippet

1
hdf_overnightstays.describe().head(10).collect()

	column	count	unique	mean	std	min	max	median	25_percent_cont	25_percent_disc	50_percent_cont	50_percent_disc	75_percent_cont	75_percent_disc
0	OVERNIGHTSTAYS	29029	5596	3307.070585	19415.55791	0.0	583762.0	269.0	58.0	58.0	269.0	269.0	1066.0	1066.0
1	MONTH	29029	29	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	REGION	29029	13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	COUNTRYOFRESIDENCE	29029	77	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

4 rows × 15 columns

We can explore the data by looking at a graphical report as shown below:

Code Snippet

1
UnifiedReport(hdf_overnightstays).build().display()

WARNING:hana_ml.visualizers.unified_report: key has not been set. The first column has been treated as the index column. The Overview screen of the Dataset Report

The Overview screen of the Dataset Report

The following image shows the Variable Type Distribution graph.

The following image shows the High Cardinality Variables chart.

The following image shows the Highly Skewed Variables chart.

Further Analysis with Data Aggregation

In order to perform further data analysis, we need to aggregate the overnight stays by month as shown below:

Code Snippet

123
hdf_overnightstays_agg = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH')
hdf_overnightstays_agg = hdf_overnightstays_agg.sort('MONTH')
hdf_overnightstays_agg.head(5).collect()

	MONTH	OVERNIGHTSTAYS_SUM
0	2022-01-01	2204984
1	2022-02-01	2892697
2	2022-03-01	3053960
3	2022-04-01	2523861
4	2022-05-01	2820085

We proceed to plot the monthly data as follows:

Code Snippet

12345
df_data = hdf_overnightstays_agg.collect()

df_data.plot(x='MONTH')
plt.xticks( rotation='vertical')
plt.show()

Monthly data for overnight stays shown in a line graph

References

[1]Overnight Stays Dataset

[2]Overnight Stays Scenario

[3]SAP function hana_ml.dataframe.create_dataframe_from_pandas()

[4]SAP hana_ml.dataframe-DataFrame.collect()

[5]SAP hana_ml.dataframe-DataFrame.count()

[6]SAP hana.ml.dataframe-DataFrame.describe()

[7]Statistics: Median

Continue to quiz