SAP HANA Cloud includes an Automated Machine Learning (AutoML) capability. AutoML can, for example, give data scientists a head start in developing an initial machine learning model.
A machine learning model is a program that finds patterns and recommends decisions based on them. This is made possible by first training the model on a large dataset. During training, the machine learning algorithm is optimized to find certain patterns or outputs in the dataset, depending on the task. The output of this process, often a computer program with specific rules and data structures, is called a machine learning model.
AutoML with SAP HANA Cloud is a great starting point to see what is possible with a dataset, and whether it is worth investing more time in a use case.
Unique benefits of PAL (Predictive Analytics Library) AutoML include:
Improved PAL models and enhanced business impact
Composite pipeline models of multiple PAL algorithms
Automated algorithm comparison and selection
Parameter search with optimal selection resulting in ML predictions with higher accuracy and value
Productivity Uplift and Expert Experience
Expert data scientists can derive optimal models in less time and with better use of compute time
Comparable AutoML expertise that addresses trending and competitive capability gaps
Less time to maximum value
For this section, a dataset of customer transactions has already been loaded into the SAP HANA Cloud database as the table GX_TRANSACTIONS.
The goal of this challenge is to predict whether a transaction is fraudulent or not. Such use cases are often difficult because the data is imbalanced (fraudulent transactions are rare), so they require additional techniques before a machine learning model is trained.
Try it out!
Jupyter Notebook Extension in Business Application Studio
This exercise uses the Jupyter Notebook and Python extensions in SAP Business Application Studio (BAS). They make it possible to use Python code cells to run queries against the SAP HANA Cloud database.
Select this link to open the Business Application Studio.
If prompted to log in, do so with your user ID () and provided credentials.
On the landing page for BAS, select Create Dev Space to create a virtual environment in which to start a project.
Note
If you have an existing Dev Space for a HANA Native application associated with your user from a previous lesson or workbook, it’s perfectly fine to re-use that here. Otherwise, create a new Dev Space as described in the following steps.
For the Dev Space name, use _AutoML and select SAP HANA Native Application as the application type. On the right-hand side, under the list of Additional SAP Extensions, choose Python Tools and then select Create Dev Space to start the process. This creates a dedicated development environment with all the tools required to run the notebook.
It will take a few minutes for the Dev Space to start up. The status can be seen beside the name.
Once the status changes to Running, select the name (_AutoML) to open it.
Note
If the following pop-up appears, choose
If the Welcome screen for SAP HANA appears, select the highlighted icon to close it.
On the Get Started landing page of BAS, select the option Clone from Git to import a repository from GitHub.
A prompt will appear on top asking for a URL to the repository. Copy and paste in the following URL and then press Enter.
The following message will appear. Select Open to see the imported files in the Explorer window.
The files should now be visible in the Explorer pane on the left-hand side. Select the Jupyter Notebook FraudDetection_AutoML.ipynb to open it.
The notebook, which contains further information along with executable code cells, will be displayed on the screen.
In the next section, the Python environment will be configured so that the code cells can be executed from within the Jupyter notebook.
Configure Jupyter and Python Extensions
Read through the notebook until you get to the first code cell in the Install hana-ml libraries section.
Run the first code cell to install the required Python modules.
Note
To execute a code cell, select the play icon beside the cell. It is also possible to execute it by clicking into the code cell and pressing
When the first cell is executed, a prompt asking to choose a kernel source will appear. Select Python Environments.
On the next screen, select the recommended Python kernel (3.11.2).
The required libraries will now start installing, and some messages about port numbers will also be visible in the lower right corner. This can also take a few minutes to complete.
Once the libraries have been successfully installed, the kernel must be restarted. Do this by pressing the Restart option in the menu bar at the top.
On the following pop-up message box, choose Restart again to complete the process.
The kernel takes a few seconds to restart. Once it has, the environment is ready for use!
The hana_ml library enables you to directly connect to your SAP HANA Cloud tenant. Use the connection details from the table below to populate the cell values and then execute the cell:
Parameter: Value
hana_address:
hana_port: 443
hana_user:
hana_password: Provided during registration
hana_encrypt: True
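As a rough sketch, the connection cell might look like the following. It assumes the hana_ml `ConnectionContext` class and reuses the parameter names from the table. The two helper functions are hypothetical names chosen for illustration, and the address, user and password shown in the usage comment are placeholders, not real values.

```python
def connection_settings(hana_address, hana_user, hana_password,
                        hana_port=443, hana_encrypt=True):
    """Collect the connection parameters from the table into one dict."""
    return {
        "address": hana_address,
        "port": hana_port,
        "user": hana_user,
        "password": hana_password,
        "encrypt": hana_encrypt,  # request a TLS-encrypted connection
    }


def connect_to_hana(settings):
    """Open a hana_ml connection using the settings dict (helper name is hypothetical)."""
    # Imported lazily so the function can be defined before hana-ml is installed.
    from hana_ml import dataframe

    return dataframe.ConnectionContext(**settings)


# Hypothetical usage with placeholder values:
# conn = connect_to_hana(connection_settings("<hana_address>", "<hana_user>", "<hana_password>"))
# print(conn.hana_version())
```

Keeping the settings in a plain dict makes it easy to check the values before the connection is opened.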
Create a data frame through SQL or table function and get the row count.
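A minimal sketch of this step, assuming `conn` is an open hana_ml `ConnectionContext` and that the table is reachable in the current schema. The helper name `load_transactions` is hypothetical.

```python
def load_transactions(conn, table_name="GX_TRANSACTIONS"):
    """Return a hana_ml DataFrame for the table together with its row count."""
    # The hana_ml DataFrame is a reference to the table; the rows stay in the database.
    df = conn.table(table_name)
    # Alternative via SQL: df = conn.sql(f'SELECT * FROM "{table_name}"')
    row_count = df.count()  # runs a COUNT query in SAP HANA
    return df, row_count


# Hypothetical usage:
# df_transactions, n_rows = load_transactions(conn)
# print(n_rows)
```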
Check the data and convert the following variables accordingly.
Check the conversion and take a look at a short description of the data.
Note
The target variable is called
Note
Data types have been altered accordingly
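The conversion and inspection steps could be sketched as follows, assuming `df` is a hana_ml DataFrame. The column list and the target SQL type are placeholders that have to be taken from the notebook; the helper name is hypothetical.

```python
def convert_and_describe(df, columns, new_type="NVARCHAR(50)"):
    """Cast the given columns and return the new DataFrame, its types and a summary."""
    # cast() returns a new hana_ml DataFrame; the underlying table is not changed.
    df_cast = df.cast(columns, new_type)
    # dtypes() lists column names with their SQL types, useful to check the conversion.
    column_types = df_cast.dtypes()
    # describe() computes summary statistics in the database; collect() fetches them.
    summary = df_cast.describe().collect()
    return df_cast, column_types, summary


# Hypothetical usage (column name is a placeholder):
# df_cast, column_types, summary = convert_and_describe(df_transactions, ["<COLUMN>"])
```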
Split the data into a training and testing set.
Check the size of the training and testing datasets.
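One possible way to express the split and the size check uses the hana_ml function `train_test_val_split`. This is a sketch under assumptions: the 80/20 ratio, the stratified method and the seed are illustrative choices, the ID and label column names have to be supplied by the caller, and both helper names are hypothetical.

```python
def split_train_test(df, id_column, label_column,
                     training_percentage=0.8, testing_percentage=0.2,
                     random_seed=1234):
    """Split a hana_ml DataFrame into training and testing DataFrames."""
    # Imported lazily so the function can be defined before hana-ml is installed.
    from hana_ml.algorithms.pal.partition import train_test_val_split

    df_train, df_test, _ = train_test_val_split(
        data=df,
        id_column=id_column,
        random_seed=random_seed,
        partition_method="stratified",   # keep the class ratio in both parts
        stratified_column=label_column,
        training_percentage=training_percentage,
        testing_percentage=testing_percentage,
        validation_percentage=0,         # no separate validation set in this sketch
    )
    return df_train, df_test


def dataset_sizes(df_train, df_test):
    """Row counts of both parts, computed in the database."""
    return df_train.count(), df_test.count()
```

A stratified split is used here because the fraud label is imbalanced; a purely random split could leave very few fraudulent rows in the testing set.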
Import the following dependencies for the Automatic Classification.
Manage the workload in the SAP HANA Cloud tenant by creating workload classes. Execute the following SQL script to create the workload class that will be used in the Automatic Classification.
Note
Ignore the error if the work class
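For illustration, creating a workload class from Python could look like the sketch below. The class name `PAL_AUTOML_WORKLOAD` and the three limits are assumptions, not the values from the notebook. `conn` is assumed to be a hana_ml `ConnectionContext`, whose `connection` attribute exposes the underlying database connection. An error, for instance because the class already exists, is reported and otherwise ignored.

```python
WORKLOAD_CLASS = "PAL_AUTOML_WORKLOAD"  # assumed name, for illustration only


def create_workload_class(conn, name=WORKLOAD_CLASS):
    """Try to create the workload class; return True on success, False on error."""
    sql = (
        f'CREATE WORKLOAD CLASS "{name}" '
        "SET 'PRIORITY' = '3', "
        "'STATEMENT MEMORY LIMIT' = '4', "   # assumed limit
        "'STATEMENT THREAD LIMIT' = '20'"    # assumed limit
    )
    cursor = conn.connection.cursor()
    try:
        cursor.execute(sql)
        return True
    except Exception as err:
        # Typically raised when the workload class already exists.
        print(f"Workload class not created: {err}")
        return False
    finally:
        cursor.close()
```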
Run process
The AutoML approach automatically executes data processing, model fitting, comparison and optimization.
First, create an AutoML classifier object auto_c in the following cell. It is helpful to review and set the respective AutoML configuration parameters.
The defined scenario will run two iterations of pipeline optimization. The total number of pipelines which will be evaluated is equal to population_size + generations × offspring_size. Hence, in this case this amounts to 15 pipelines.
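The arithmetic can be checked with a few lines of Python. The text fixes the number of generations (2) and the total (15); the values population_size = 5 and offspring_size = 5 used here are an assumption that is consistent with that total, not values confirmed by the text.

```python
def total_pipelines(population_size, generations, offspring_size):
    """Number of pipelines evaluated: population_size + generations * offspring_size."""
    return population_size + generations * offspring_size


# Assumed values: population_size=5, offspring_size=5; generations=2 is given by the text.
print(total_pipelines(population_size=5, generations=2, offspring_size=5))  # 15
```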
With elite_number, you specify how many of the best pipelines you want to compare.
Setting random_seed=1234 helps to get reproducible AutoML runs.
Set the maximum runtime for individual pipeline evaluations with the parameter max_eval_time_mins, or use the early_stop parameter to stop the AutoML run if there is no improvement for the given number of generations. Further, you can choose the performance measure used for the optimization with the scoring parameter.
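The parameters described above could be passed to the hana_ml class `AutomaticClassification` roughly as follows. This is a sketch: generations=2 and random_seed=1234 come from the text, while population_size, offspring_size and elite_number are assumed values, and the progress ID and workload class name are supplied by the caller. Both helper names are hypothetical.

```python
def automl_settings(progress_id):
    """Keyword arguments for the AutoML classifier (partly assumed values)."""
    return {
        "generations": 2,          # two iterations of pipeline optimization
        "population_size": 5,      # assumed
        "offspring_size": 5,       # assumed
        "elite_number": 5,         # assumed: number of best pipelines to compare
        "random_seed": 1234,       # reproducible runs
        "progress_indicator_id": progress_id,
        # Optional, as described in the text:
        # "max_eval_time_mins": <minutes per pipeline evaluation>,
        # "early_stop": <generations without improvement>,
        # "scoring": <performance measure>,
    }


def build_classifier(settings, workload_class=None):
    """Create the AutoML classifier object and optionally attach a workload class."""
    # Imported lazily so the function can be defined before hana-ml is installed.
    from hana_ml.algorithms.pal.auto_ml import AutomaticClassification

    auto_c = AutomaticClassification(**settings)
    if workload_class is not None:
        auto_c.enable_workload_class(workload_class)
    return auto_c
```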
Important!
Change
Reinitialize and display the AutoML operators and their parameters.
Note
A default set of AutoML classification operators and parameters is provided as the global config-dict, which can be adjusted to the needs of the targeted AutoML scenario. Use methods like
Adjust some of the settings to narrow the search space. As the resampling method, choose SMOTETomek, since the data is imbalanced.
Exclude the Transformer methods. As machine learning algorithms, keep only the Hybrid Gradient Boosting Tree and Multi Logistic Regression.
Set some parameters for the optimization of the algorithms.
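The adjustments described in these steps could be sketched with the `delete_config_dict` and `update_config_dict` methods of the AutoML classifier. Several details are assumptions: the category name `Transformer`, the operator name `HGBT_Classifier`, the parameter `ITER_NUM` and its range are illustrative, and the list of operators to remove has to be read from the displayed configuration. The helper name is hypothetical.

```python
def narrow_search_space(auto_c, operators_to_remove):
    """Reduce the AutoML configuration of auto_c to a smaller search space."""
    # Drop every operator of the Transformer category (category name assumed).
    auto_c.delete_config_dict(category="Transformer")

    # Drop the resamplers and classifiers that should not be searched.
    # The exact operator names depend on the installed configuration.
    for operator_name in operators_to_remove:
        auto_c.delete_config_dict(operator_name=operator_name)

    # Example of a parameter setting for one remaining algorithm
    # (operator, parameter and range are assumptions).
    auto_c.update_config_dict(
        operator_name="HGBT_Classifier",
        param_name="ITER_NUM",
        param_config={"range": [100, 50, 300]},
    )
    return auto_c


# Hypothetical usage, with names taken from the displayed configuration:
# narrow_search_space(auto_c, ["<RESAMPLER_TO_DROP>", "<CLASSIFIER_TO_DROP>"])
# auto_c.display_config_dict()
```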
Review the complete AutoML configuration for the classification.
Fit the AutoML scenario on the training data. This may take a couple of minutes. If it takes too long, exclude SMOTETomek in the resampler() method of the config file.
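A minimal sketch of the fit step, assuming `auto_c` is the AutoML classifier object and `df_train` the training DataFrame. The key and label column names are not fixed here and have to be supplied; the helper name is hypothetical.

```python
def fit_automl(auto_c, df_train, key, label):
    """Run the AutoML search on the training data and return the best pipelines."""
    # fit() runs the pipeline search inside SAP HANA; this is the long-running step.
    auto_c.fit(data=df_train, key=key, label=label)
    # After fitting, best_pipeline_ holds the top pipelines with their scores.
    return auto_c.best_pipeline_


# Hypothetical usage (column names are placeholders):
# best = fit_automl(auto_c, df_train, key="<ID_COLUMN>", label="<TARGET_COLUMN>")
# print(best.collect())
```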
Inspect the pipeline progress through the execution logs.
Evaluate the best model on the testing data.
Create predictions with your machine learning model.
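Evaluation and prediction could be sketched as follows, assuming a fitted classifier `auto_c` and a testing DataFrame `df_test`. The column names are placeholders, and the helper name is hypothetical. The label column is removed before predicting so that the model only sees the features.

```python
def evaluate_and_predict(auto_c, df_test, key, label):
    """Score the best model on the testing data and create predictions."""
    # score() compares the predictions with the known labels of the testing data.
    scores = auto_c.score(df_test, key=key, label=label)
    # deselect() drops the label column, leaving the key and the feature columns.
    predictions = auto_c.predict(df_test.deselect(label), key=key)
    return scores, predictions


# Hypothetical usage (column names are placeholders):
# scores, predictions = evaluate_and_predict(auto_c, df_test, "<ID_COLUMN>", "<TARGET_COLUMN>")
# print(predictions.head(5).collect())
```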
Save the best model in SAP HANA. To do so, create a Model Storage. Change ‘YourSchema’ in the code below to .
Save the model through the following command.
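The two storage steps could look roughly like this, using the hana_ml `ModelStorage` class. The model name and version are assumptions chosen for illustration, the schema has to be supplied by the caller, and the helper name is hypothetical.

```python
def save_best_model(conn, auto_c, schema, name="AutoML Fraud Model", version=1):
    """Store the fitted AutoML model in the given schema and list the stored models."""
    # Imported lazily so the function can be defined before hana-ml is installed.
    from hana_ml.model_storage import ModelStorage

    storage = ModelStorage(connection_context=conn, schema=schema)
    # The storage identifies a model by its name and version attributes.
    auto_c.name = name          # assumed name
    auto_c.version = version    # assumed version
    storage.save_model(model=auto_c, if_exists="replace")
    return storage.list_models()


# Hypothetical usage (schema is a placeholder):
# print(save_best_model(conn, auto_c, schema="<YourSchema>"))
```

With `if_exists="replace"`, running the cell again overwrites the stored model with the same name and version instead of raising an error.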
Congratulations! This concludes the lesson on Automated Machine Learning in SAP HANA Cloud.
For further information on AutoML with HANA Cloud, check out the following links: