This section covers how to manage and manipulate data in SAP HANA Cloud using SAP HANA DataFrames. The California housing dataset, pre-loaded into SAP HANA Cloud, serves as an example to illustrate this concept further [1].
Creating a HANA Dataframe in Python
An SAP HANA DataFrame provides a way to view the data stored in SAP HANA without containing any of the physical data. HANA-ML makes use of a SAP HANA DataFrame as the input for training and scoring purposes.
An SAP HANA DataFrame hides the underlying SQL statement, providing users with a Python interface to SAP HANA data.
To use an SAP HANA DataFrame, firstly create an object from the hana_ml.dataframe.ConnectionContext class (a connection to SAP HANA) [2], and then use the methods provided in the library to create an SAP HANA DataFrame.
The SAP HANA DataFrame is only usable while the connection is open, and is inaccessible once the connection is closed. For the full documentation about the classes and functions available, refer to [3].
References
[1] California housing dataset
Selecting data from the specified table
The method ConnectionContext.table() returns a DataFrame that represents the specified table, that is, "california_housing", and its associated schema, that is, "ML_DEMO" [1]. Then, we display the SQL query backing the DataFrame.
In order to inspect the dataset, the column names of the DataFrame and their data types are displayed.
12hdf = conn.table("california_housing", schema="ML_DEMO")
print(hdf.select_statement)
Output:
SELECT * FROM "ML_DEMO"."california_housing"
1print(hdf.columns)
Output:
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'Target']
1print(hdf.dtypes())
Output:
[('MedInc', 'DOUBLE', 15, 15, 15, 0), ('HouseAge', 'DOUBLE', 15, 15, 15, 0), ('AveRooms', 'DOUBLE', 15, 15, 15, 0), ('AveBedrms', 'DOUBLE', 15, 15, 15, 0), ('Population', 'DOUBLE', 15, 15, 15, 0), ('AveOccup', 'DOUBLE', 15, 15, 15, 0), ('Latitude', 'DOUBLE', 15, 15, 15, 0), ('Longitude', 'DOUBLE', 15, 15, 15, 0), ('Target', 'DOUBLE', 15, 15, 15, 0)]
References
[1] SAP hana_ml.dataframe - ConnectionContext.table()
As given, the first five observations of the SAP DataFrame are displayed, then the method DataFrame.collect() copies the current DataFrame to a new Python Pandas DataFrame [1]
12#Only the use of the collect method, transfers data or result sets from SAP HANA to Python
hdf.head(5).collect()
Output:
MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.24 | 52.0 | 2.92 | 0.91 | 396.0 | 4.65 | 37.80 | -122.27 | 5.00 |
1 | 1.16 | 52.0 | 2.43 | 0.94 | 1349.0 | 5.39 | 37.87 | -122.25 | 5.00 |
2 | 7.85 | 52.0 | 7.79 | 1.05 | 517.0 | 2.41 | 37.86 | -122.24 | 5.00 |
3 | 9.39 | 52.0 | 7.51 | 0.95 | 1366.0 | 2.75 | 37.85 | -122.24 | 5.00 |
4 | 7.87 | 52.0 | 8.28 | 1.04 | 947.0 | 2.62 | 37.83 | -122.23 | 5.00 |
References
[1] SAP method hana_ml.dataframe - ConnectionContext.table()
Analyzing data attributes
Data manipulation and analysis are vital in the field of Data Science. The SAP class method DataFrame.describe() returns a DataFrame that contains various statistics of data attributes or columns [1].
Such descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution. This helps to organize, and present data in a meaningful manner.
For instance, the row labeled as number 3 (shown below) displays the statistics for column 'AveBedrms'. This column represents the average number of bedrooms per household. You will notice that it has '20,640' observations as the rest of the dataset's columns. It has no null values, and its mean value is '1.09'; however, it can take values from a minimum of '0.33' to a maximum value of '34.06'. Interestingly, the median is '1.04' [4]. You should remember that the median value represents the middle value of a dataset.
1hdf.describe().collect()
Column | Count | Unique | Nulls | Mean | Std | Min | Max | Median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MedInc | 20640 | 12928 | 0 | 3.87 | 1.89 | 0.49 | 15.00 | 3.53 | 2.56 | 2.56 | 3.53 | 3.53 | 4.74 | 4.74 |
1 | HouseAge | 20640 | 52 | 0 | 28.63 | 12.58 | 1.00 | 52.00 | 29.00 | 18.00 | 18.00 | 29.00 | 29.00 | 37.00 | 37.00 |
2 | AveRooms | 20640 | 19392 | 0 | 5.42 | 2.47 | 0.84 | 141.90 | 5.22 | 4.44 | 4.44 | 5.22 | 5.22 | 6.05 | 6.05 |
3 | AveBedrms | 20640 | 14233 | 0 | 1.09 | 0.47 | 0.33 | 34.06 | 1.04 | 1.00 | 1.00 | 1.04 | 1.04 | 1.09 | 1.09 |
4 | Population | 20640 | 3888 | 0 | 1425.47 | 1132.46 | 3.00 | 35682.00 | 1166.00 | 787.00 | 787.00 | 1166.00 | 1166.00 | 1725.00 | 1725.00 |
5 | AveOccup | 20640 | 18841 | 0 | 3.07 | 10.38 | 0.69 | 1243.33 | 2.81 | 2.42 | 2.42 | 2.81 | 2.81 | 3.28 | 3.28 |
6 | Latitude | 20640 | 862 | 0 | 35.63 | 2.13 | 32.54 | 41.95 | 34.26 | 33.93 | 33.93 | 34.26 | 34.26 | 37.71 | 37.71 |
7 | Longitude | 20640 | 844 | 0 | -119.56 | 2.00 | -124.35 | -114.31 | -118.49 | -121.80 | -121.80 | -118.49 | -118.49 | -118.01 | -118.01 |
8 | Target | 20640 | 3842 | 0 | 2.068 | 1.15 | 0.14 | 5.00 | 1.79 | 1.19 | 1.19 | 1.79 | 1.79 | 2.64 | 2.64 |
References
SAP HANA DataFrame filtering
Rows that match the given condition(s) can be selected.
For example, the following code selects the subset of rows where the 'Population' is greater than '300'. The first five rows are then displayed to confirm that all 'Population' values exceed '300'.
1hdf_filter=hdf.filter('"Population">300')
1print(hdf_filter.select_statement)
Output:
SELECT * FROM (SELECT * FROM "ML_DEMO"."california_housing") AS "DT_258" WHERE "Population">300
1hdf_filter.head(5).collect()
MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.24 | 52.0 | 2.92 | 0.91 | 396.0 | 4.65 | 37.80 | -122.27 | 5.00 |
1 | 1.16 | 52.0 | 2.43 | 0.94 | 1349.0 | 5.39 | 37.87 | -122.25 | 5.00 |
2 | 7.85 | 52.0 | 7.79 | 1.05 | 517.0 | 2.41 | 37.86 | -122.24 | 5.00 |
3 | 9.39 | 52.0 | 7.51 | 0.95 | 1366.0 | 2.75 | 37.85 | -122.24 | 5.00 |
4 | 7.87 | 52.0 | 8.28 | 1.04 | 947.0 | 2.62 | 37.83 | -122.23 | 5.00 |