Understanding Machine Learning for Time Series Forecasting

Setting Up the Environment and Preparing Data Using SAP HANA DataFrames

Training an SAP HANA PAL Univariate Time Series Model

Forecasting with Multiple Time Series

Preparing Data for Multiple Time Series Forecasting

Objective

After completing this lesson, you will be able to explain how to organize and group the dataset to support multiple time series forecasting.

Structuring Data for Multiple Forecasts

In univariate forecasting, we modeled all overnight stays as one combined time series. However, real-world scenarios often require separate forecasts for each subgroup — such as regions, products, or in this case, countries of residence. In this lesson, you’ll learn how to prepare the dataset to support grouped time series forecasting by filtering and aggregating data by country.

Preparing the Dataset for the Multivariate Scenario

In the previous section, we focused on the univariate time-series modelling scenario where we aggregated all overnights stays into monthly values, and forecasted the following 12 months.

In this section, we focus on the multivariate time-series modelling scenario where we create individual forecasts for the countries from which most visitors are coming.

Therefore, let's discover the top 10 countries with the most overnight stays.

Code Snippet

12
hdf_overnightstays = conn.table('OVERNIGHTSTAYS')
hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10).collect()

	COUNTRYOFRESIDENCE	TOTAL
0	Switzerland	50087144
1	Germany	8885041
2	United States	6365130
3	United Kingdom	3726785
4	France	3279915
5	Italy	2027511
6	Netherlands	1676202
7	Belgium	1282186
8	India	1210124
9	Spain	1056512

We need to use the list of the aforementioned countries as a filter on the whole table.

A way to accomplish it is by creating a comma-separated list of values for a "where clause" as shown below:

Code Snippet

12345678
countries = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE') \
     .sort('TOTAL', desc=True).select('COUNTRYOFRESIDENCE').head(10).collect() \
     .iloc[:,0].tolist()

countries = str(countries)
countries = countries.replace('[', '(')
countries = countries.replace(']', ')')
countries

"('Switzerland', 'Germany', 'United States', 'United Kingdom', 'France', 'Italy', 'Netherlands', 'Belgium', 'India', 'Spain')"

At the moment, the SAP HANA DataFrame only holds information about the aforementioned top 10 countries.

Next, we aggregate the dataset again but this time by both month and country. Therefore, we are in the position to create country-specific forecasts.

Code Snippet

12
hdf_overnightstays = hdf_overnightstays.filter(f'''"COUNTRYOFRESIDENCE" IN {countries}''')
hdf_overnightstays.head(10).collect()

	MONTH	REGION	COUNTRYOFRESIDENCE	OVERNIGHTSTAYS
0	2022-01-01	Fribourg Region	India	0
1	2022-01-01	Graubünden	Italy	10083
2	2022-01-01	Graubünden	Spain	1586
3	2022-01-01	Eastern Switzerland	Italy	1167
4	2022-01-01	Eastern Switzerland	Spain	212
5	2022-01-01	Zurich Region	Spain	2949
6	2022-01-01	Lucerne / Lake Lucerne	Spain	523
7	2022-01-01	Basel Region	Italy	1356
8	2022-01-01	Basel Region	Spain	672
9	2022-01-01	Bern Region	Italy	2724

Now, we aggregate the overnight stays by month and country as shown below:

Code Snippet

123
hdf_overnightstays_agg = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE'])
hdf_overnightstays_agg = hdf_overnightstays_agg.sort('MONTH')
hdf_overnightstays_agg.head(5).collect()

	MONTH	COUNTRYOFRESIDENCE	OVERNIGHTSTAYS_SUM
0	2022-01-01	India	4701
1	2022-01-01	France	63084
2	2022-01-01	Italy	43065
3	2022-01-01	Germany	206477
4	2022-01-01	Belgium	26172

Afterwards, let's prepare the dataset for plotting the overnight stays over time by each country as follows:

Code Snippet

12
hdf_overnightstays_plot = hdf_overnightstays_agg.pivot_table(values='OVERNIGHTSTAYS_SUM', index='MONTH', columns='COUNTRYOFRESIDENCE', aggfunc='sum').collect()
hdf_overnightstays.collect()

	MONTH	REGION	COUNTRYOFRESIDENCE	OVERNIGHTSTAYS
0	2022-01-01	Fribourg Region	India	0
1	2022-01-01	Graubünden	Italy	10083
2	2022-01-01	Graubünden	Spain	1586
3	2022-01-01	Eastern Switzerland	Italy	1167
4	2022-01-01	Eastern Switzerland	Spain	212
...	...	...	...	...
3765	2024-05-01	Geneva	India	6887
3766	2024-05-01	Valais	India	4866
3767	2024-05-01	Ticino	India	2215
3768	2024-05-01	Fribourg Region	India	157
3769	2024-05-01	Aargau and Solothurn Region	India	3954

3770 rows × 4 columns

We generate the plot as follows:

Code Snippet

12345
hdf_overnightstays_plot.plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));

plt.xticks(rotation='vertical')
plt.grid()
plt.show()

A line graph showing the data related to overnight stays according to time for each of the 10 countries

The previous plot is densely populated. Thus, we decide to focus on the top two most popular foreign countries.

Next, we create a plot only for Germany and the United States.

Code Snippet

12345
hdf_overnightstays_plot[['MONTH', 'Germany', 'United States']].plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));

plt.xticks(rotation='vertical')
plt.grid()
plt.show()

A line graph showing the data related to overnight stays according to time for Germany and the United States

By looking at the plot above, we identify similarities in the overnight stays for both countries, where the summer season is more evident.

Individual forecasts for each country can capture such individual patterns.

Next lesson