Before building forecasting models, it's essential to prepare and shape the data in a way that aligns with time series modeling requirements. This includes loading the dataset into SAP HANA, transforming fields into time-aware formats, and performing basic aggregation - all using the SAP HANA DataFrame API, which enables SQL-like operations in Python.
Loading the Time Series Dataset
The overnight stays dataset is freely available and can be uploaded from [1] as shown below. Using the following code, we display the first five observations of the newly created Pandas DataFrame.
More material related to the overnight stays scenario can be found in the very same folder [2].
123df_data= pd.read_csv('./OVERNIGHTSTAYS.csv', sep = ',')
df_data.MONTH = pd.to_datetime(df_data.MONTH, format='%d/%m/%Y')
df_data.head(5)| MONTH | REGION | COUNTRYOFRESIDENCE | OVERNIGHTSTAYS | |
|---|---|---|---|---|
| 0 | 2022-01-01 | Graubünden | Switzerland | 392805 |
| 1 | 2022-01-01 | Graubünden | Baltic States | 0 |
| 2 | 2022-01-01 | Graubünden | Germany | 80648 |
| 3 | 2022-01-01 | Graubünden | France | 4229 |
| 4 | 2022-01-01 | Graubünden | Italy | 10083 |
Creating an SAP HANA DataFrame
The SAP function hana_ml.dataframe.create_dataframe_from_pandas() [3] creates an SAP HANA DataFrame from a Pandas DataFrame and creates a table in SAP HANA, that is, table_name = 'OVERNIGHTSTAYS'.
Then, we display the first five observations of the SAP HANA DataFrame, and then DataFrame.collect() copies the current DataFrame to a new Python Pandas DataFrame [4].
1234567
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn,
pandas_df = df_data,
table_name = 'OVERNIGHTSTAYS',
force = True,
replace = False)
display(df_remote.head(5).collect())| MONTH | REGION | COUNTRYOFRESIDENCE | OVERNIGHTSTAYS | |
|---|---|---|---|---|
| 0 | 2022-01-01 | Graubünden | Switzerland | 392805 |
| 1 | 2022-01-01 | Graubünden | Baltic States | 0 |
| 2 | 2022-01-01 | Graubünden | Germany | 80648 |
| 3 | 2022-01-01 | Graubünden | France | 4229 |
| 4 | 2022-01-01 | Graubünden | Italy | 10083 |
We then show the structure of the table that has been created as follows:
Output:
[('MONTH', 'TIMESTAMP', 27, 27, 27, 0), ('REGION', 'NVARCHAR', 5000, 5000, 5000, 0), ('COUNTRYOFRESIDENCE', 'NVARCHAR', 5000, 5000, 5000, 0), ('OVERNIGHTSTAYS', 'INT', 10, 10, 10, 0)]Loading Data into SAP HANA DataFrames
We can also explicitly load the overnight stays data into the SAP HANA DataFrame as shown below.
Then, we display the SQL query backing the DataFrame.
Finally, we display the first six observations of the DataFrame.
123hdf_overnightstays = conn.table('OVERNIGHTSTAYS')
display(hdf_overnightstays.select_statement)
display(hdf_overnightstays.head(6).collect())| MONTH | REGION | COUNTRYOFRESIDENCE | OVERNIGHTSTAYS | |
|---|---|---|---|---|
| 0 | 2022-01-01 | Graubünden | Baltic States | 0 |
| 1 | 2022-01-01 | Graubünden | Australia, New Zealand, Oceania | 0 |
| 2 | 2022-01-01 | Eastern Switzerland | Baltic States | 0 |
| 3 | 2022-01-01 | Eastern Switzerland | Australia, New Zealand, Oceania | 0 |
| 4 | 2022-01-01 | Zurich Region | Baltic States | 0 |
| 5 | 2022-01-01 | Zurich Region | Australia, New Zealand, Oceania | 0 |
Counting the Rows in the Dataset
Next, DataFrame.count()[5] computes the number of rows in the SAP HANA DataFrame and displays it.
1display(hdf_overnightstays.count())29029
Analysis of the DataFrame
Data manipulation and analysis are vital in the field of Data Science. The SAP class method DataFrame.describe() returns a DataFrame that contains various statistics of data attributes/columns [6].
Such descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution. This helps to organize and present data in a meaningful manner.
For example,in the following figure, the row labeled 3 displays the statistics for the COUNTRYOFRESIDENCE column. This column represents the visitor's country of residence. We notice that it has 29029 observations as the rest of the dataset's columns. It has no null values, and we may remember it is a character-based data type; however, a numerical attribute is the first column, that is the OVERNIGHTSTAYS column, for which several statistics are computed, namely minimum, maximum, mean, median, and so on. We may remember the median is the value splitting the higher half from the lower half of the dataset [7].
1hdf_overnightstays.describe().head(10).collect()| column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OVERNIGHTSTAYS | 29029 | 5596 | 0 | 3307.070585 | 19415.55791 | 0.0 | 583762.0 | 269.0 | 58.0 | 58.0 | 269.0 | 269.0 | 1066.0 | 1066.0 |
| 1 | MONTH | 29029 | 29 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | REGION | 29029 | 13 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | COUNTRYOFRESIDENCE | 29029 | 77 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 rows × 15 columns
We can explore the data by looking at a graphical report as shown below:
1UnifiedReport(hdf_overnightstays).build().display()
The following image shows the Variable Type Distribution graph.

The following image shows the High Cardinality Variables chart.

The following image shows the Highly Skewed Variables chart.

Further Analysis with Data Aggregation
In order to perform further data analysis, we need to aggregate the overnight stays by month as shown below:
123hdf_overnightstays_agg = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH')
hdf_overnightstays_agg = hdf_overnightstays_agg.sort('MONTH')
hdf_overnightstays_agg.head(5).collect()| MONTH | OVERNIGHTSTAYS_SUM | |
|---|---|---|
| 0 | 2022-01-01 | 2204984 |
| 1 | 2022-02-01 | 2892697 |
| 2 | 2022-03-01 | 3053960 |
| 3 | 2022-04-01 | 2523861 |
| 4 | 2022-05-01 | 2820085 |
We proceed to plot the monthly data as follows:
12345df_data = hdf_overnightstays_agg.collect()
df_data.plot(x='MONTH')
plt.xticks( rotation='vertical')
plt.show()