Preparing Data for Multiple Time Series Forecasting

Objective

After completing this lesson, you will be able to explain how to organize and group the dataset to support multiple time series forecasting.

Structuring Data for Multiple Forecasts

In univariate forecasting, we modeled all overnight stays as one combined time series. However, real-world scenarios often require separate forecasts for each subgroup — such as regions, products, or in this case, countries of residence. In this lesson, you’ll learn how to prepare the dataset to support grouped time series forecasting by filtering and aggregating data by country.

Preparing the Dataset for the Multivariate Scenario

In the previous section, we focused on the univariate time-series modelling scenario where we aggregated all overnights stays into monthly values, and forecasted the following 12 months.

In this section, we focus on the multivariate time-series modelling scenario where we create individual forecasts for the countries from which most visitors are coming.

Therefore, let's discover the top 10 countries with the most overnight stays.

Code Snippet
12
hdf_overnightstays = conn.table('OVERNIGHTSTAYS') hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10).collect()
 COUNTRYOFRESIDENCETOTAL
0Switzerland50087144
1Germany8885041
2United States6365130
3United Kingdom3726785
4France3279915
5Italy2027511
6Netherlands1676202
7Belgium1282186
8India1210124
9Spain1056512

We need to use the list of the aforementioned countries as a filter on the whole table.

A way to accomplish it is by creating a comma-separated list of values for a "where clause" as shown below:

Code Snippet
12345678
countries = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE') \ .sort('TOTAL', desc=True).select('COUNTRYOFRESIDENCE').head(10).collect() \ .iloc[:,0].tolist() countries = str(countries) countries = countries.replace('[', '(') countries = countries.replace(']', ')') countries

"('Switzerland', 'Germany', 'United States', 'United Kingdom', 'France', 'Italy', 'Netherlands', 'Belgium', 'India', 'Spain')"

At the moment, the SAP HANA DataFrame only holds information about the aforementioned top 10 countries.

Next, we aggregate the dataset again but this time by both month and country. Therefore, we are in the position to create country-specific forecasts.

Code Snippet
12
hdf_overnightstays = hdf_overnightstays.filter(f'''"COUNTRYOFRESIDENCE" IN {countries}''') hdf_overnightstays.head(10).collect()
 MONTHREGIONCOUNTRYOFRESIDENCEOVERNIGHTSTAYS
02022-01-01Fribourg RegionIndia0
12022-01-01GraubündenItaly10083
22022-01-01GraubündenSpain1586
32022-01-01Eastern SwitzerlandItaly1167
42022-01-01Eastern SwitzerlandSpain212
52022-01-01Zurich RegionSpain2949
62022-01-01Lucerne / Lake LucerneSpain523
72022-01-01Basel RegionItaly1356
82022-01-01Basel RegionSpain672
92022-01-01Bern RegionItaly2724

Now, we aggregate the overnight stays by month and country as shown below:

Code Snippet
123
hdf_overnightstays_agg = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE']) hdf_overnightstays_agg = hdf_overnightstays_agg.sort('MONTH') hdf_overnightstays_agg.head(5).collect()
 MONTHCOUNTRYOFRESIDENCEOVERNIGHTSTAYS_SUM
02022-01-01India4701
12022-01-01France63084
22022-01-01Italy43065
32022-01-01Germany206477
42022-01-01Belgium26172

Afterwards, let's prepare the dataset for plotting the overnight stays over time by each country as follows:

Code Snippet
12
hdf_overnightstays_plot = hdf_overnightstays_agg.pivot_table(values='OVERNIGHTSTAYS_SUM', index='MONTH', columns='COUNTRYOFRESIDENCE', aggfunc='sum').collect() hdf_overnightstays.collect()
 MONTHREGIONCOUNTRYOFRESIDENCEOVERNIGHTSTAYS
02022-01-01Fribourg RegionIndia0
12022-01-01GraubündenItaly10083
22022-01-01GraubündenSpain1586
32022-01-01Eastern SwitzerlandItaly1167
42022-01-01Eastern SwitzerlandSpain212
...............
37652024-05-01GenevaIndia6887
37662024-05-01ValaisIndia4866
37672024-05-01TicinoIndia2215
37682024-05-01Fribourg RegionIndia157
37692024-05-01Aargau and Solothurn RegionIndia3954

3770 rows × 4 columns

We generate the plot as follows:

Code Snippet
12345
hdf_overnightstays_plot.plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5)); plt.xticks(rotation='vertical') plt.grid() plt.show()
A line graph showing the data related to overnight stays according to time for each of the 10 countries

The previous plot is densely populated. Thus, we decide to focus on the top two most popular foreign countries.

Next, we create a plot only for Germany and the United States.

Code Snippet
12345
hdf_overnightstays_plot[['MONTH', 'Germany', 'United States']].plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5)); plt.xticks(rotation='vertical') plt.grid() plt.show()
A line graph showing the data related to overnight stays according to time for Germany and the United States

By looking at the plot above, we identify similarities in the overnight stays for both countries, where the summer season is more evident.

Individual forecasts for each country can capture such individual patterns.