Objective
In this lesson, we will show how to load data hosted in Google Cloud Storage via a Data Flow. The data contains information used to assign sentiment scores to your current products. This provides some insight into how product reviews from social media or similar sources can be captured and incorporated into SAP Datasphere scenarios.

We use a Data Flow since we first want to perform multiple transformations to enrich the source data before we store it in SAP Datasphere.
Later in the modeling lesson we will join this source with the Product Dimension objects.
At the conclusion of this lesson your Data Flow will look like the following:

Review the Help Portal for more information at Creating a Data Flow.
With the SAP and Google Cloud data and analytics partnership, access to mission-critical business data is simplified. Using SAP Datasphere running on SAP BTP, you can connect to Google Cloud Storage and, in this use case, read the sentiment data.
Review the Help Portal for more information on Google connections:
Google Cloud Storage Connections
Select Connections (step ‘A’) in the side navigation area and your space if necessary.
Select the connection GCP_STORAGE (step ‘B’) and choose Edit (step ‘C’).

The main information is entered in the connection creation wizard for Google Cloud Storage:
The Google Cloud Storage connection supports both the Data Flow and Replication Flow data ingestion procedures. We continue with Data Flows in this lesson.
Since this connection type exists already, click the Cancel button and confirm with Leave to close the connection property window.

The Data Flow Builder offers a comprehensive approach to data warehousing and data integration for instant data-to-value.
Create a data flow to move and transform data in an intuitive graphical interface. You can drag and drop sources from the Source Browser, join them as appropriate, add other operators to remove or create columns, aggregate data, and do Python scripting, before writing the data to the target table.
Our Data Flow pipeline will accomplish the following:
Ingest a Product Review CSV file from GCS (product_review.csv).
Run a Python script to extract Sentiments from the Reviews.
Store the aggregated Sentiment score for each product in a table in SAP Datasphere.
The biggest advantage of Data Flow is the ability to run Python scripts to perform complex data wrangling tasks.
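In SAP Datasphere, the body of a script operator is a Python `transform` function that receives the source rows as a pandas DataFrame and returns the resulting DataFrame. A minimal sketch, with placeholder logic standing in for the real sentiment computation:

```python
import pandas as pd

# Shape of a Data Flow script operator: a transform function that receives
# the source rows as a pandas DataFrame and returns the result.
def transform(data):
    # Placeholder logic -- the real script computes a sentiment score per row.
    data["SENTIMENTS"] = 0
    return data
```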
The next step is to build the Data Flow.
Select Data Builder in the side navigation area. Choose your space if necessary.
Select the New Data Flow tile to open the flow editor.

Choose the Sources tab, expand Connections -> GCP_STORAGE -> DSP and locate the object product_review.csv.

Drag product_review.csv onto the canvas.

Choose the Preview Data operator to see the data of the CSV file. Every row corresponds to a review of a product by a customer.


Now, we will add a Python script to extract the sentiments. We will also add a new column - SENTIMENTS - to the output schema to store the computed sentiment.
Select the Script operator and drag it onto the diagram canvas. Release it where you want to create the script operator.

Select the product_review node on the canvas again and choose the Create Flow operator (arrow symbol) from the context tools.
Drag the Create Flow arrow onto the script operator node Script 1. A connection line is created between the source and the script operator nodes.

Select the script operator node on the canvas to display its properties in the side panel.
Change the label name to ExtractSentiments and the Python version to 3.11.

The Sentiment Extraction algorithm will parse the review, assign scores to positive, negative, and neutral words in the review, and store the cumulative score in a new output column named SENTIMENTS.
Still on the ExtractSentiments node, select Add ‘+’ in the Columns section of the properties panel.
Choose Create New Column.

Name the new column SENTIMENTS with data type int32.
Press the Save button.

In the next steps, we replace the default script code with a Python script, which is provided as a download file. The Python script extracts the sentiment values from the source.
Download the extract_sentiments.py file and save it locally. Windows Notepad or any other text editor will work to open the file.
Go to the Script section in the properties panel and select the Edit pencil icon to open the script editor.

In the script editor, highlight all the existing code (Ctrl+A) and paste in the code from the downloaded file ‘extract_sentiments.py’. This replaces the default code with the ‘extract_sentiments.py’ code.
Select the top-right Exit Full Screen icon to close the Python editor window.
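The downloaded file is the authoritative script. For illustration only, a simple word-list approach to such scoring might look like the sketch below; the word lists, the REVIEW_TEXT column name, and the scoring rule are hypothetical assumptions, not the contents of extract_sentiments.py:

```python
import pandas as pd

# Hypothetical word lists -- the real extract_sentiments.py defines its own logic.
POSITIVE_WORDS = {"good", "great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "broken"}

def score_review(text):
    # +1 per positive word, -1 per negative word; punctuation is stripped.
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)

def transform(data):
    # Populate the SENTIMENTS column that was added to the output schema;
    # REVIEW_TEXT is an assumed name for the review column.
    data["SENTIMENTS"] = data["REVIEW_TEXT"].apply(score_review).astype("int32")
    return data
```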

Next, we will add a Projection operator to project only two columns - PRODUCT_ID and SENTIMENTS.
Select the Projection operator, drag it onto the canvas and drop to the right of the script object.

Select the ExtractSentiments node again and choose the Create Flow operator (arrow symbol). Drag the arrow onto the projection operator node Projection 1. A connection line is created between both nodes.

Select the new Projection 1 operator node. If the properties panel on the right is not displayed, click the Details icon in the upper right.
Change the Label field to ProjectSentiments.

Go to the REVIEW_ID column, select the More (…) function and choose the Remove Column option.
Repeat this step for all columns except PRODUCT_ID and SENTIMENTS.

See the ProjectSentiments operator after all unwanted columns have been removed:
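Conceptually, the projection simply keeps two of the incoming columns. Expressed in pandas (sample column values assumed):

```python
import pandas as pd

# Sample input rows as they might arrive from the script operator.
reviews = pd.DataFrame({
    "REVIEW_ID": [1, 2],
    "PRODUCT_ID": ["P100", "P200"],
    "SENTIMENTS": [3, -1],
})
# Keep only the two columns the downstream aggregation needs.
projected = reviews[["PRODUCT_ID", "SENTIMENTS"]]
```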

Now, we will add an Aggregation operator that will aggregate the sentiments by Product IDs.
Select the Aggregation operator and drag it onto the canvas. Drop it to the right of ProjectSentiments.

Select the ProjectSentiments node again and choose the Create Flow operator (arrow symbol). Drag the arrow onto the new aggregation operator node Aggregation 1. A connection line is created between both nodes.
Select the aggregation operator node Aggregation 1 on the canvas and change the label field to AggregateSentiments in the properties panel.

Go to the SENTIMENTS column, select the More (…) function and choose the Change Aggregation option.
Select SUM as the new aggregation behavior for the SENTIMENTS column.
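The aggregation step is equivalent to a pandas group-by with a SUM aggregate (sample values assumed):

```python
import pandas as pd

# Sample per-review scores for two products.
scores = pd.DataFrame({
    "PRODUCT_ID": ["P100", "P100", "P200"],
    "SENTIMENTS": [2, -1, 3],
})
# SUM aggregation grouped by PRODUCT_ID, as configured in AggregateSentiments.
aggregated = scores.groupby("PRODUCT_ID", as_index=False)["SENTIMENTS"].sum()
```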

The final stage of this Data Flow pipeline is to persist the (PRODUCT_ID, SENTIMENTS) in a relational table in SAP Datasphere.
[Optional] If the free modeling space in the canvas area is too small to add further operators, click the Details button on top of the Properties Panel on the right side to hide it.
Select the Add Table operator and drop it to the right of the AggregateSentiments node on the canvas.

After adding the table, select the AggregateSentiments node and choose the Create Flow operator (arrow symbol). Drag the arrow onto the new table operator node ‘Table 1’. A connection line is created between both nodes.

Select the new table operator node Table 1.
[Optional] Click the Details button on top of the Properties Panel, in case it is not displayed.

Change the default business name Table 1 to LoadSentiments in the properties panel.

Select the Create and Deploy Table button.

Confirm the Create and Deploy task in the information pop-up window.

The LoadSentiments table supports both APPEND and TRUNCATE semantics. Each time the Data Flow pipeline runs, sentiment data is either appended to the table, or the existing data is truncated and the new rows are added.
After successful deployment of the table, select either option for this lesson.
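The two modes can be pictured with a small, purely illustrative simulation (this is not an SAP API, just the semantics):

```python
# Illustrative simulation of the target table's two write modes.
def write(existing, new_rows, mode):
    if mode == "TRUNCATE":
        # TRUNCATE: existing rows are removed before the new rows are inserted.
        return list(new_rows)
    # APPEND: new rows are added to whatever is already in the table.
    return list(existing) + list(new_rows)
```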

Save the Data Flow pipeline.
Select the Save icon in the header menu.

In the dialog popup, enter the new data flow name DF_Load_Sentiments and Save it.

Press the Deploy icon to make your data flow ready to run.

Running this pipeline will establish a connection with GCS, read the product review data, extract sentiments, project the necessary columns, aggregate the sentiments, and finally persist the sentiments in SAP Datasphere.
Choose the Run icon to launch the data flow execution.

Check the Run Status of the pipeline.
Select the Data Flow Monitor icon.

Note: It will take a minute or so for the pipeline to finish.
In the Data Integration Monitor, select the Refresh icon to watch the progress.

Data Flow pipelines can be scheduled to run at periodic intervals as shown here.
Choose the Schedule -> Create Schedule function.

This is shown only to highlight the capabilities. We are not scheduling any pipelines in this lesson.
Select Cancel, and then confirm to leave in the warning pop-up message.

Now that we have run the data flow pipeline successfully, let us verify the results in the target table. To do so, we leave the Data Integration Monitor and return to the Data Builder section.
Select Data Builder in the side navigation area, leaving the Data Integration Monitor.

In the Data Builder object list, LoadSentiments is the table which the pipeline created with aggregated sentiments. Let us open it and view the data.
Select the checkbox next to LoadSentiments and press the Edit pencil icon.

Data Preview will display the data. Feel free to explore the other tabs to understand the structure and metadata of the table.
Open the Data Viewer to list the data records at the bottom.

The LoadSentiments table has two columns - PRODUCT_ID and SENTIMENTS.
The Sentiments are aggregated across 42 unique Product IDs.
Note: You will see more records if you have run the pipeline multiple times and chosen the ‘APPEND’ option as the insert mode in step 45 (APPEND or TRUNCATE).

Close the table editor and return to the Data Builder object list.

The Data Builder perspective lists the created objects.

Congratulations! You have successfully created a Data Flow pipeline to ingest, transform, and persist data. The persisted table can be used in downstream modeling tasks by combining with other data sets if needed.