In univariate forecasting, we modeled all overnight stays as one combined time series. However, real-world scenarios often require separate forecasts for each subgroup — such as regions, products, or in this case, countries of residence. In this lesson, you’ll learn how to prepare the dataset to support grouped time series forecasting by filtering and aggregating data by country.
Preparing the Dataset for the Multivariate Scenario
In the previous section, we focused on the univariate time-series modelling scenario where we aggregated all overnights stays into monthly values, and forecasted the following 12 months.
In this section, we focus on the multivariate time-series modelling scenario where we create individual forecasts for the countries from which most visitors are coming.
Therefore, let's discover the top 10 countries with the most overnight stays.
12hdf_overnightstays = conn.table('OVERNIGHTSTAYS')
hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE' ).sort('TOTAL', desc=True).head(10).collect()| COUNTRYOFRESIDENCE | TOTAL | |
|---|---|---|
| 0 | Switzerland | 50087144 |
| 1 | Germany | 8885041 |
| 2 | United States | 6365130 |
| 3 | United Kingdom | 3726785 |
| 4 | France | 3279915 |
| 5 | Italy | 2027511 |
| 6 | Netherlands | 1676202 |
| 7 | Belgium | 1282186 |
| 8 | India | 1210124 |
| 9 | Spain | 1056512 |
We need to use the list of the aforementioned countries as a filter on the whole table.
A way to accomplish it is by creating a comma-separated list of values for a "where clause" as shown below:
12345678countries = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'TOTAL')], group_by='COUNTRYOFRESIDENCE') \
.sort('TOTAL', desc=True).select('COUNTRYOFRESIDENCE').head(10).collect() \
.iloc[:,0].tolist()
countries = str(countries)
countries = countries.replace('[', '(')
countries = countries.replace(']', ')')
countries"('Switzerland', 'Germany', 'United States', 'United Kingdom', 'France', 'Italy', 'Netherlands', 'Belgium', 'India', 'Spain')"
At the moment, the SAP HANA DataFrame only holds information about the aforementioned top 10 countries.
Next, we aggregate the dataset again but this time by both month and country. Therefore, we are in the position to create country-specific forecasts.
12hdf_overnightstays = hdf_overnightstays.filter(f'''"COUNTRYOFRESIDENCE" IN {countries}''')
hdf_overnightstays.head(10).collect()| MONTH | REGION | COUNTRYOFRESIDENCE | OVERNIGHTSTAYS | |
|---|---|---|---|---|
| 0 | 2022-01-01 | Fribourg Region | India | 0 |
| 1 | 2022-01-01 | Graubünden | Italy | 10083 |
| 2 | 2022-01-01 | Graubünden | Spain | 1586 |
| 3 | 2022-01-01 | Eastern Switzerland | Italy | 1167 |
| 4 | 2022-01-01 | Eastern Switzerland | Spain | 212 |
| 5 | 2022-01-01 | Zurich Region | Spain | 2949 |
| 6 | 2022-01-01 | Lucerne / Lake Lucerne | Spain | 523 |
| 7 | 2022-01-01 | Basel Region | Italy | 1356 |
| 8 | 2022-01-01 | Basel Region | Spain | 672 |
| 9 | 2022-01-01 | Bern Region | Italy | 2724 |
Now, we aggregate the overnight stays by month and country as shown below:
123hdf_overnightstays_agg = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by=['MONTH', 'COUNTRYOFRESIDENCE'])
hdf_overnightstays_agg = hdf_overnightstays_agg.sort('MONTH')
hdf_overnightstays_agg.head(5).collect()| MONTH | COUNTRYOFRESIDENCE | OVERNIGHTSTAYS_SUM | |
|---|---|---|---|
| 0 | 2022-01-01 | India | 4701 |
| 1 | 2022-01-01 | France | 63084 |
| 2 | 2022-01-01 | Italy | 43065 |
| 3 | 2022-01-01 | Germany | 206477 |
| 4 | 2022-01-01 | Belgium | 26172 |
Afterwards, let's prepare the dataset for plotting the overnight stays over time by each country as follows:
12hdf_overnightstays_plot = hdf_overnightstays_agg.pivot_table(values='OVERNIGHTSTAYS_SUM', index='MONTH', columns='COUNTRYOFRESIDENCE', aggfunc='sum').collect()
hdf_overnightstays.collect()| MONTH | REGION | COUNTRYOFRESIDENCE | OVERNIGHTSTAYS | |
|---|---|---|---|---|
| 0 | 2022-01-01 | Fribourg Region | India | 0 |
| 1 | 2022-01-01 | Graubünden | Italy | 10083 |
| 2 | 2022-01-01 | Graubünden | Spain | 1586 |
| 3 | 2022-01-01 | Eastern Switzerland | Italy | 1167 |
| 4 | 2022-01-01 | Eastern Switzerland | Spain | 212 |
| ... | ... | ... | ... | ... |
| 3765 | 2024-05-01 | Geneva | India | 6887 |
| 3766 | 2024-05-01 | Valais | India | 4866 |
| 3767 | 2024-05-01 | Ticino | India | 2215 |
| 3768 | 2024-05-01 | Fribourg Region | India | 157 |
| 3769 | 2024-05-01 | Aargau and Solothurn Region | India | 3954 |
3770 rows × 4 columns
We generate the plot as follows:
12345hdf_overnightstays_plot.plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));
plt.xticks(rotation='vertical')
plt.grid()
plt.show()
The previous plot is densely populated. Thus, we decide to focus on the top two most popular foreign countries.
Next, we create a plot only for Germany and the United States.
12345hdf_overnightstays_plot[['MONTH', 'Germany', 'United States']].plot(x='MONTH').legend(loc='center left',bbox_to_anchor=(1.0, 0.5));
plt.xticks(rotation='vertical')
plt.grid()
plt.show()
By looking at the plot above, we identify similarities in the overnight stays for both countries, where the summer season is more evident.
Individual forecasts for each country can capture such individual patterns.