Exploring Data with SAP HANA DataFrames

Objectives

After completing this lesson, you will be able to:
  • Utilize SAP HANA DataFrames for data handling
  • Visualize the attributes of the California housing dataset

Data Handling using SAP HANA DataFrames

This section covers how to manage and manipulate data in SAP HANA Cloud using SAP HANA DataFrames. The California housing dataset, pre-loaded into SAP HANA Cloud, serves as an example to illustrate this concept further [1].

Creating an SAP HANA DataFrame in Python

An SAP HANA DataFrame provides a way to view the data stored in SAP HANA without containing any of the physical data. hana-ml uses an SAP HANA DataFrame as the input for training and scoring purposes.

An SAP HANA DataFrame hides the underlying SQL statement, providing users with a Python interface to SAP HANA data.

To use an SAP HANA DataFrame, first create an object from the hana_ml.dataframe.ConnectionContext class (a connection to SAP HANA) [2], and then use the methods provided in the library to create an SAP HANA DataFrame.

The SAP HANA DataFrame is only usable while the connection is open, and is inaccessible once the connection is closed. For the full documentation about the classes and functions available, refer to [3].
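For illustration, the connection object conn used in the examples below could be created as sketched here. This is a minimal sketch, assuming an SAP HANA Cloud instance: the host name, user, and password are placeholders rather than values provided by this course, and the exact connection parameters depend on your own instance.

Python
# Minimal sketch: open a connection to SAP HANA Cloud
# (address, user, and password are placeholders, not course-provided values)
from hana_ml.dataframe import ConnectionContext

conn = ConnectionContext(
    address="<your-hana-cloud-hostname>",
    port=443,                  # SAP HANA Cloud typically listens on port 443
    user="<your-user>",
    password="<your-password>",
    encrypt=True               # SAP HANA Cloud requires encrypted connections
)

print(conn.hana_version())     # simple check that the connection is usable

# Close the connection when you are done; DataFrames created from it
# become inaccessible afterwards.
# conn.close()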

References

[1] California housing dataset

[2] SAP class hana_ml.dataframe.ConnectionContext

[3] SAP hana_ml.dataframe

Selecting data from the specified table

The method ConnectionContext.table() returns a DataFrame that represents the specified table, "california_housing", in its associated schema, "ML_DEMO" [1]. The SQL query backing the DataFrame is then displayed.

In order to inspect the dataset, the column names of the DataFrame and their data types are displayed.

Python
hdf = conn.table("california_housing", schema="ML_DEMO")
print(hdf.select_statement)

Output:

SELECT * FROM "ML_DEMO"."california_housing"

Python
print(hdf.columns)

Output:

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'Target']

Python
print(hdf.dtypes())

Output:

[('MedInc', 'DOUBLE', 15, 15, 15, 0), ('HouseAge', 'DOUBLE', 15, 15, 15, 0), ('AveRooms', 'DOUBLE', 15, 15, 15, 0), ('AveBedrms', 'DOUBLE', 15, 15, 15, 0), ('Population', 'DOUBLE', 15, 15, 15, 0), ('AveOccup', 'DOUBLE', 15, 15, 15, 0), ('Latitude', 'DOUBLE', 15, 15, 15, 0), ('Longitude', 'DOUBLE', 15, 15, 15, 0), ('Target', 'DOUBLE', 15, 15, 15, 0)]

References

[1] SAP hana_ml.dataframe - ConnectionContext.table()

Next, the first five observations of the SAP HANA DataFrame are displayed. The method DataFrame.collect() copies the current DataFrame into a new Python pandas DataFrame [1].

Python
# Only the use of the collect() method transfers data or result sets from SAP HANA to Python
hdf.head(5).collect()

Output:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0    1.24      52.0      2.92       0.91       396.0      4.65     37.80    -122.27    5.00
1    1.16      52.0      2.43       0.94      1349.0      5.39     37.87    -122.25    5.00
2    7.85      52.0      7.79       1.05       517.0      2.41     37.86    -122.24    5.00
3    9.39      52.0      7.51       0.95      1366.0      2.75     37.85    -122.24    5.00
4    7.87      52.0      8.28       1.04       947.0      2.62     37.83    -122.23    5.00

References

[1] SAP method hana_ml.dataframe - DataFrame.collect()

Analyzing data attributes

Data manipulation and analysis are vital in the field of data science. The method DataFrame.describe() returns a DataFrame that contains various statistics of the data attributes, or columns [1].

Such descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution. This helps to organize and present data in a meaningful manner.

For instance, the row labeled 3 (shown below) displays the statistics for the column 'AveBedrms', which represents the average number of bedrooms per household. Like the rest of the dataset's columns, it has 20,640 observations and no null values. Its mean value is 1.09, yet it ranges from a minimum of 0.33 to a maximum of 34.06. Interestingly, the median is 1.04 [2]. Remember that the median represents the middle value of a dataset.

Python
hdf.describe().collect()

Output:

   Column      Count  Unique  Nulls     Mean      Std      Min       Max   Median  25_percent_cont  25_percent_disc  50_percent_cont  50_percent_disc  75_percent_cont  75_percent_disc
0  MedInc      20640   12928      0     3.87     1.89     0.49     15.00     3.53             2.56             2.56             3.53             3.53             4.74             4.74
1  HouseAge    20640      52      0    28.63    12.58     1.00     52.00    29.00            18.00            18.00            29.00            29.00            37.00            37.00
2  AveRooms    20640   19392      0     5.42     2.47     0.84    141.90     5.22             4.44             4.44             5.22             5.22             6.05             6.05
3  AveBedrms   20640   14233      0     1.09     0.47     0.33     34.06     1.04             1.00             1.00             1.04             1.04             1.09             1.09
4  Population  20640    3888      0  1425.47  1132.46     3.00  35682.00  1166.00           787.00           787.00          1166.00          1166.00          1725.00          1725.00
5  AveOccup    20640   18841      0     3.07    10.38     0.69   1243.33     2.81             2.42             2.42             2.81             2.81             3.28             3.28
6  Latitude    20640     862      0    35.63     2.13    32.54     41.95    34.26            33.93            33.93            34.26            34.26            37.71            37.71
7  Longitude   20640     844      0  -119.56     2.00  -124.35   -114.31  -118.49          -121.80          -121.80          -118.49          -118.49          -118.01          -118.01
8  Target      20640    3842      0    2.068     1.15     0.14      5.00     1.79             1.19             1.19             1.79             1.79             2.64             2.64

References

[1] SAP method hana_ml.dataframe - DataFrame.describe()

[2] Statistics: Median

SAP HANA DataFrame filtering

Rows that match the given condition(s) can be selected.

For example, the following code selects the subset of rows where the 'Population' is greater than '300'. The first five rows are then displayed to confirm that all 'Population' values exceed '300'.

Python
hdf_filter = hdf.filter('"Population">300')
Python
print(hdf_filter.select_statement)

Output:

SELECT * FROM (SELECT * FROM "ML_DEMO"."california_housing") AS "DT_258" WHERE "Population">300

Python
hdf_filter.head(5).collect()

Output:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0    1.24      52.0      2.92       0.91       396.0      4.65     37.80    -122.27    5.00
1    1.16      52.0      2.43       0.94      1349.0      5.39     37.87    -122.25    5.00
2    7.85      52.0      7.79       1.05       517.0      2.41     37.86    -122.24    5.00
3    9.39      52.0      7.51       0.95      1366.0      2.75     37.85    -122.24    5.00
4    7.87      52.0      8.28       1.04       947.0      2.62     37.83    -122.23    5.00
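
As an additional check that is not part of the original example, the number of rows satisfying the condition could be counted on the database side. This is a minimal sketch, assuming the hana-ml DataFrame.count() method:

Python
# Count the rows remaining after the filter; the count runs in SAP HANA,
# and only the resulting number is transferred to Python
print(hdf_filter.count())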

Visual Data Exploration using SAP HANA DataFrame

Descriptive statistics also involve graphical representations of data via plots, which can further assist in visualizing and interpreting the information.

By employing descriptive statistics, one can summarize and share the main properties of a dataset, thereby gaining deeper insights into the data to support:

  • Further statistical analyses
  • Decision-making processes

You can use the hana-ml EDA visualization capabilities [1] to display a distribution plot of the SAP HANA DataFrame column named 'Population'.

Python
# Exploratory Data Visualizations
import time

from hana_ml.visualizers.eda import EDAVisualizer

start = time.time()
eda = EDAVisualizer(enable_plotly=True)
fig, trace, bin_data = eda.distribution_plot(data=hdf,
                                             debrief=True,
                                             column="Population",
                                             bins=60,
                                             x_axis_fontsize=13,
                                             x_axis_rotation=1,
                                             width=600,
                                             title="Distribution of feature: Population",
                                             height=400)
fig.show()
end = time.time()
print("Time taken to do this by getting the data from the server was: {}s".format(round(end - start, 3)))

Output:

Time taken to do this by getting the data from the server was: 0.08s

A distribution plot of the Population feature.

References

[1] SAP hana_ml.visualizers.eda - EDAVisualizer.distribution_plot

Adding an index to the dataset

Note that the dataset must have an 'ID' column so that it can be partitioned into disjoint subsets. This column is required by the partitioning algorithm [1], used later on to derive the 'Training' and 'Testing' subsets that support the execution of the machine learning workflow.

Ensure that this column is included in the dataset, as demonstrated below.

   ID  MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0   1    1.24      52.0      2.92       0.91       396.0      4.65     37.80    -122.27    5.00
1   2    1.16      52.0      2.43       0.94      1349.0      5.39     37.87    -122.25    5.00
2   3    7.85      52.0      7.79       1.05       517.0      2.41     37.86    -122.24    5.00
3   4    9.39      52.0      7.51       0.95      1366.0      2.75     37.85    -122.24    5.00
4   5    7.87      52.0      8.28       1.04       947.0      2.62     37.83    -122.23    5.00
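
If the dataset does not yet contain such a column, one way to add it is sketched below. This is a minimal sketch, assuming the hana-ml DataFrame.add_id() method, which returns a new SAP HANA DataFrame with a sequential integer key column:

Python
# Sketch: append a sequential integer column named 'ID'
# (assumed method: DataFrame.add_id)
hdf_id = hdf.add_id("ID")

# The 'ID' column is later consumed by the partitioning function
# hana_ml.algorithms.pal.partition.train_test_val_split [1].
hdf_id.head(5).collect()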

References

[1] SAP algorithm hana_ml.algorithms.pal.partition - train_test_val_split
