Exploring and Manipulating Data with SAP HANA Dataframes

Objective

After completing this lesson, you will be able to demonstrate how to format, aggregate, and prepare the dataset for time series forecasting using SAP HANA DataFrames.

Preparing Data for Time Series Forecasting

Before building forecasting models, it's essential to prepare and shape the data in a way that aligns with time series modeling requirements. This includes loading the dataset into SAP HANA, transforming fields into time-aware formats, and performing basic aggregation - all using the SAP HANA DataFrame API, which enables SQL-like operations in Python.

Loading the Time Series Dataset

The overnight stays dataset is freely available and can be uploaded from [1] as shown below. Using the following code, we display the first five observations of the newly created Pandas DataFrame.

More material related to the overnight stays scenario can be found in the very same folder [2].

Code Snippet
123
df_data= pd.read_csv('./OVERNIGHTSTAYS.csv', sep = ',') df_data.MONTH = pd.to_datetime(df_data.MONTH, format='%d/%m/%Y') df_data.head(5)
 MONTHREGIONCOUNTRYOFRESIDENCEOVERNIGHTSTAYS
02022-01-01GraubündenSwitzerland392805
12022-01-01GraubündenBaltic States0
22022-01-01GraubündenGermany80648
32022-01-01GraubündenFrance4229
42022-01-01GraubündenItaly10083

Creating an SAP HANA DataFrame

The SAP function hana_ml.dataframe.create_dataframe_from_pandas() [3] creates an SAP HANA DataFrame from a Pandas DataFrame and creates a table in SAP HANA, that is, table_name = 'OVERNIGHTSTAYS'.

Then, we display the first five observations of the SAP HANA DataFrame, and then DataFrame.collect() copies the current DataFrame to a new Python Pandas DataFrame [4].

Code Snippet
1234567
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, pandas_df = df_data, table_name = 'OVERNIGHTSTAYS', force = True, replace = False) display(df_remote.head(5).collect())
100%|██████████| 1/1 [00:01<00:00, 1.44s/it]
 MONTHREGIONCOUNTRYOFRESIDENCEOVERNIGHTSTAYS
02022-01-01GraubündenSwitzerland392805
12022-01-01GraubündenBaltic States0
22022-01-01GraubündenGermany80648
32022-01-01GraubündenFrance4229
42022-01-01GraubündenItaly10083

We then show the structure of the table that has been created as follows:

Output:

[('MONTH', 'TIMESTAMP', 27, 27, 27, 0), ('REGION', 'NVARCHAR', 5000, 5000, 5000, 0), ('COUNTRYOFRESIDENCE', 'NVARCHAR', 5000, 5000, 5000, 0), ('OVERNIGHTSTAYS', 'INT', 10, 10, 10, 0)]

Loading Data into SAP HANA DataFrames

We can also explicitly load the overnight stays data into the SAP HANA DataFrame as shown below.

Then, we display the SQL query backing the DataFrame.

Finally, we display the first six observations of the DataFrame.

Code Snippet
123
hdf_overnightstays = conn.table('OVERNIGHTSTAYS') display(hdf_overnightstays.select_statement) display(hdf_overnightstays.head(6).collect())
'SELECT * FROM ''OVERNIGHTSTAYS'' '
 MONTHREGIONCOUNTRYOFRESIDENCEOVERNIGHTSTAYS
02022-01-01GraubündenBaltic States0
12022-01-01GraubündenAustralia, New Zealand, Oceania0
22022-01-01Eastern SwitzerlandBaltic States0
32022-01-01Eastern SwitzerlandAustralia, New Zealand, Oceania0
42022-01-01Zurich RegionBaltic States0
52022-01-01Zurich RegionAustralia, New Zealand, Oceania0

Counting the Rows in the Dataset

Next, DataFrame.count()[5] computes the number of rows in the SAP HANA DataFrame and displays it.

Code Snippet
1
display(hdf_overnightstays.count())

29029

Analysis of the DataFrame

Data manipulation and analysis are vital in the field of Data Science. The SAP class method DataFrame.describe() returns a DataFrame that contains various statistics of data attributes/columns [6].

Such descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution. This helps to organize and present data in a meaningful manner.

For example,in the following figure, the row labeled 3 displays the statistics for the COUNTRYOFRESIDENCE column. This column represents the visitor's country of residence. We notice that it has 29029 observations as the rest of the dataset's columns. It has no null values, and we may remember it is a character-based data type; however, a numerical attribute is the first column, that is the OVERNIGHTSTAYS column, for which several statistics are computed, namely minimum, maximum, mean, median, and so on. We may remember the median is the value splitting the higher half from the lower half of the dataset [7].

Code Snippet
1
hdf_overnightstays.describe().head(10).collect()
 columncountuniquenullsmeanstdminmaxmedian25_percent_cont25_percent_disc50_percent_cont50_percent_disc75_percent_cont75_percent_disc
0OVERNIGHTSTAYS29029559603307.07058519415.557910.0583762.0269.058.058.0269.0269.01066.01066.0
1MONTH29029290NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2REGION29029130NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3COUNTRYOFRESIDENCE29029770NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

4 rows × 15 columns

We can explore the data by looking at a graphical report as shown below:

Code Snippet
1
UnifiedReport(hdf_overnightstays).build().display()
WARNING:hana_ml.visualizers.unified_report: key has not been set. The first column has been treated as the index column.The Overview screen of the Dataset Report

The following image shows the Variable Type Distribution graph.

The Variable Type Distribution screen

The following image shows the High Cardinality Variables chart.

The High Cardinality Variables chart

The following image shows the Highly Skewed Variables chart.

The Highly Skewed Variables chart

Further Analysis with Data Aggregation

In order to perform further data analysis, we need to aggregate the overnight stays by month as shown below:

Code Snippet
123
hdf_overnightstays_agg = hdf_overnightstays.agg([('sum', 'OVERNIGHTSTAYS', 'OVERNIGHTSTAYS_SUM')], group_by='MONTH') hdf_overnightstays_agg = hdf_overnightstays_agg.sort('MONTH') hdf_overnightstays_agg.head(5).collect()
 MONTHOVERNIGHTSTAYS_SUM
02022-01-012204984
12022-02-012892697
22022-03-013053960
32022-04-012523861
42022-05-012820085

We proceed to plot the monthly data as follows:

Code Snippet
12345
df_data = hdf_overnightstays_agg.collect() df_data.plot(x='MONTH') plt.xticks( rotation='vertical') plt.show()
Monthly data for overnight stays shown in a line graph