Deriving Insights by leveraging SAP Databricks ML Capabilities

Objective

After completing this lesson, you will be able to derive insights by leveraging SAP Databricks ML capabilities

Working with Notebooks in SAP Databricks

After performing exploratory analysis on your data, you'll begin the machine learning phase.

This phase involves writing code in notebooks. The supported code languages are Python and SQL.

Introducing Notebooks

A notebook is an interactive, web-based document that allows you to write and run code on your data, display the results, and combine the code and results with explanatory text and images, all in one place.

Python notebook code that generates a black scatter plot of sine function values over a range from 0 to 10.
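A notebook cell of the kind shown in the figure might look like the following sketch (the exact code in the image is not reproduced here; this is an illustrative example using NumPy and Matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample the sine function over the range 0 to 10
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Draw a black scatter plot of the sampled values
plt.scatter(x, y, color="black")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()
```

Running this cell in a notebook renders the plot directly below the code.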

The document is structured as a series of cells, where each cell contains either executable code or formatted text.

You can run these cells individually and in any order, allowing you to see the results of each step immediately without having to re-run an entire program.

Their step-by-step nature is perfect for a data science workflow: you can load data in one cell, clean it in the next, visualize it in another, and build a machine learning model further down. This makes it easy to experiment, debug, and see the immediate impact of your code.

Furthermore, because notebooks blend code with narrative text and visuals, they are an excellent tool for storytelling with data, creating reproducible research, and sharing your analysis with others in a format that is easy to read and understand.

Creating Notebooks in SAP Databricks

In SAP Databricks, you manage your notebooks in the Workspace section. You usually create a specific folder for notebooks to organize your workspace. You can also create subfolders for different projects.

SAP Databricks sidebar menu highlights the Workspace section with a custom Project Files folder under the Workspace folder. There is a Cashflow subfolder under the Project Files folder.

When you create a notebook, the editor opens where you'll enter your code. This editor is like the one used for SQL queries, though the language differs. By default, the language is Python, but you can change it to SQL.

SAP Databricks workspace features a new notebook with code cell for Python input and an assistant panel for code-related queries.

The data exploration and analysis phase could also be carried out in a Python notebook instead of with SQL.

A Cash Flow Forecasting Example

Cash Flow Forecasting Scenario

To illustrate the machine learning process, let's examine a cash flow forecasting scenario.

You want to predict the future incoming and outgoing cash by evaluating the data through a time series algorithm.

Before writing any code, you need to plan the different steps of your machine learning project.

Refer to the following video to learn about the three steps of a machine learning project.

Now, let's look at those three steps for the cash flow forecasting example.

Data Preparation

After sharing the cash flow data product with SAP Databricks, you performed some exploratory analysis and noticed many empty values, as well as some erroneous posting dates.

You also decided to summarize data by month to reduce the data volume.

You then do the following in your data preparation phase:

  1. Replace empty strings with null values.
  2. Select the necessary columns and filter out rows whose Posting Date is invalid or null.
  3. Floor the Posting Date column to the start of the month, and rename the date and value columns.
  4. Group the data by the date column and sum the cash flow per month.
  5. Generate a continuous time series between the minimum and maximum dates present in the data.
  6. Join the generated time sequence to the time series data to obtain a continuous time series DataFrame.
  7. Fill null values with 0, since a gap means no cash flow was recorded in that period.
  8. Convert the Spark DataFrame to a pandas DataFrame.
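The preparation steps can be sketched in pandas for brevity; the column names (posting_date, amount) and the sample rows are illustrative assumptions, not taken from the actual data product. In the notebook, the same steps run on a Spark DataFrame, which step 8 then converts with toPandas().

```python
import pandas as pd

# Hypothetical raw rows standing in for the cash flow data product
raw = pd.DataFrame({
    "posting_date": ["2024-01-05", "", "2024-01-20", "2024-03-02"],
    "amount": [100.0, 50.0, -30.0, 80.0],
})

# Steps 1-2: empty strings become nulls; invalid or null dates are dropped
raw["posting_date"] = pd.to_datetime(
    raw["posting_date"].replace("", pd.NA), errors="coerce")
clean = raw.dropna(subset=["posting_date"])

# Step 3: floor the posting date to the month, rename the columns
clean = (clean.assign(date=clean["posting_date"].dt.to_period("M").dt.to_timestamp())
              .rename(columns={"amount": "cash_flow"})[["date", "cash_flow"]])

# Step 4: sum the cash flow per month
monthly = clean.groupby("date", as_index=False)["cash_flow"].sum()

# Step 5: continuous monthly range between the min and max dates
calendar = pd.DataFrame({"date": pd.date_range(
    monthly["date"].min(), monthly["date"].max(), freq="MS")})

# Steps 6-7: join to make the series continuous, fill the gaps with 0
continuous = (calendar.merge(monthly, on="date", how="left")
                      .fillna({"cash_flow": 0.0}))
```

February has no transactions in the sample, so the join produces a gap that step 7 fills with 0.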

Model Training

Once the data is ready to be processed by the different algorithms, you will start the training phase:

  1. Train the Time Series algorithm using the AutoTS library.

    The AutoTS library can train and validate multiple models in parallel, trying different data preparation steps as well as different time series algorithms.

  2. Log the model to MLflow under the experiment called Time Series.

    MLflow is used to start a run, log the model, and register it for future use.

    Because the AutoTS package does not provide autologging for the model, you must wrap it in a Python prediction class to use the MLflow prediction capabilities.

Data Forecast

You will then use the production model for the prediction phase:

  1. Retrieve logged model from MLflow.
  2. Write prediction data to Delta Table.
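These two steps can be sketched as follows. Loading from the registry and writing to Delta only run inside Databricks, so those calls appear as comments; the model URI and table name are assumptions, and a stand-in frame shows the expected shape of the result.

```python
import pandas as pd

# In the notebook, the production model comes from the MLflow registry:
#   import mlflow
#   model = mlflow.pyfunc.load_model("models:/cash_flow_forecast/1")
#   forecast = model.predict(pd.DataFrame())  # the wrapper ignores the input
# Stand-in forecast frame (illustrative values):
forecast = pd.DataFrame(
    {"date": pd.to_datetime(["2024-04-01", "2024-05-01"]),
     "cash_flow": [120.0, 95.0]})

# Writing the predictions to a Delta table is then a single Spark call:
#   spark.createDataFrame(forecast) \
#        .write.mode("overwrite").saveAsTable("cash_flow_forecast")
```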

You are now ready to code those steps in your Python notebook.

Watch the following video to learn about the data preparation phase.

Now, watch the next video to learn about the model training phase.

Now, watch the final video to learn about the data forecasting phase.

Summary

  • Use Python notebooks for exploratory analysis and machine learning, integrating code, text, and visuals seamlessly.
  • Create and manage notebooks in the Workspace section, using folders for efficient project organization.
  • In your notebooks, prepare data by cleaning and grouping, train models, log the models with MLflow, and store the prediction results in a Delta table.