Working with Flows

Objective

After completing this lesson, you will be able to use flows in SAP Datasphere.

Flows

We have already explored the use of remote tables and their replication options. Let's now explore the possibilities of data flows, replication flows and transformation flows.

Types of Flows

The figure shows an overview of the different ways for data acquisition using SAP Datasphere connections: remote tables (including federation and replication, both snapshot and real-time), data flows, replication flows, and transformation flows.

Suppose the federation approach shows sub-optimal runtime performance or puts too much stress on your source system. In such cases, consider persisting the data. We have already explained the replication options of remote tables.

SAP Datasphere also enables the definition of more advanced ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) operations.

The figure shows a screenshot of 3 types of flows: data flow, replication flow and transformation flow.

The Flows tab in the Data Builder in SAP Datasphere offers the creation of the following three objects:

  • Data flow: For traditional ETL scenarios where data is transformed first and then stored.
  • Replication flow: For ELT scenarios where data is extracted and loaded first, then transformed.
  • Transformation flow: For post-load transformations on already loaded data.

Data flows follow the traditional ETL (Extract, Transform, Load) approach: data is first transformed and then stored. If you apply complex transformations or replicate mass data, resources in the source system are blocked for a long time.

In such cases, consider a different approach called ELT (Extract, Load, Transform), which means that you extract and load (store) data first and then apply the transformation only inside SAP Datasphere. This sequence allows businesses to initially load raw data into the data warehouse and then flexibly modify (transform) the same data set in different ways as needed for different use cases. In SAP Datasphere you can use replication flows and transformations flows for the ELT approach.

Data Flows

The figure shows an overview of the different ways for data acquisition using SAP Datasphere connections: remote tables (including federation and replication, both snapshot and real-time), data flows, replication flows, and transformation flows. The data flow is highlighted.

A data flow offers the following functionality:

  • Extract: Bring data into SAP Datasphere from various SAP sources (for example SAP S/4HANA CDS views) and non-SAP sources. A local table in SAP Datasphere can also be used as a source.
  • Transform: Perform business logic calculations and data transformations using standard operators, a limited set of built-in SQL functions or custom Python scripts.
  • Load: Store the data in SAP Datasphere local tables without any delta capability.

The figure shows a screenshot of a data flow, highlighting the option to use a Python script for transformation scenarios.

You can create a data flow to move and transform data in an intuitive graphical interface. You can drag and drop sources from the Source Browser, join them as appropriate, add other operators to remove or create columns, aggregate data, and do Python scripting, before writing the data to the target table.
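Inside the Python script operator, the incoming records are handed to a transform function as a pandas DataFrame, and the DataFrame you return is passed on to the next operator. The sketch below is illustrative only; the column names NET_AMOUNT and GROSS_AMOUNT and the business logic are invented for this example:

```python
import pandas as pd

def transform(data: pd.DataFrame) -> pd.DataFrame:
    # 'data' holds the records arriving from the previous operator.
    data = data.copy()
    # Calculated column: gross amount derived from net amount and tax rate.
    data["GROSS_AMOUNT"] = data["NET_AMOUNT"] * (1 + data["TAX_RATE"])
    # Simple row-level filter: drop implausible records before loading.
    return data[data["NET_AMOUNT"] >= 0]
```

Note that the script operator only offers a limited set of Python libraries (such as pandas and NumPy), so transformations must stay self-contained.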

Replication Flows

The figure shows an overview of the different ways for data acquisition using SAP Datasphere connections: remote tables (including federation and replication, both snapshot and real-time), data flows, replication flows, and transformation flows. The replication flow is highlighted.

The replication flow is based on a cloud-native replication tool, so on-premise components (for example, the Data Provisioning Agent) are no longer required for cloud-to-cloud replication scenarios. It can be used for mass data replication from supported source systems into supported target systems. It supports three load types: initial only, initial and delta, and delta only.

Note

To learn which load types are supported by your source or target connection, see Load Types and Connections for Your Replication Flows.

A replication flow offers the following functionality:

  • Apply column projections and row-level filters.
  • Allow the reuse of the same source objects in multiple flows.
  • Accelerate initial loads through partition-based parallelization up to a maximum of 16 parallel jobs (worker graphs).
  • Ensure data resiliency with automated error recovery and retry logic.
  • Update runtime settings (source and target thread limits, delta load frequency) without redeployment.
  • Configure email notification for failures at object level.
  • Support HANA SQL and Script views as sources, enabling initial load extraction using semantic keys when needed across SAP HANA Cloud, SAP HANA On-premise, and SAP Datasphere.
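Column projections and row-level filters are configured in the replication flow editor rather than coded. Purely as a conceptual analogy (the table and column names are invented), the same two operations look like this in pandas:

```python
import pandas as pd

orders = pd.DataFrame({
    "ORDER_ID": [1, 2, 3],
    "COUNTRY": ["DE", "US", "DE"],
    "INTERNAL_NOTE": ["a", "b", "c"],
})

# Projection: keep only the columns the target needs.
projected = orders[["ORDER_ID", "COUNTRY"]]

# Row-level filter: replicate only the rows matching a condition.
filtered = projected[projected["COUNTRY"] == "DE"]
```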

Replication flows can be used for the following three scenarios:

  • Inbound data movement: Replicate data from SAP or non-SAP sources into SAP Datasphere and capture data changes from the source in near real-time to update the target table.

  • Outbound data movement: Read data from SAP Datasphere and replicate it into supported SAP and non-SAP outbound targets while tracking data changes.

  • Pass-through option: Use SAP Datasphere as middleware and replicate data from supported sources into supported targets while tracking data changes occurring in near real-time in the source.

The figure shows the possible source systems and outbound targets when using a replication flow.

For a complete list of connections that can be used as a source or target for a replication flow, see Connection Types.

For inbound integration, you can replicate data from the following source objects into SAP Datasphere:

  • CDS views in ABAP-based SAP systems, which are enabled for extraction.

  • Tables that have a primary key.

  • Objects from ODP providers, such as extractors or SAP BW artifacts (from any SAP system that is based on SAP NetWeaver and has a suitable version of the DMIS add-on, see SAP Note 3412110).

You can use CDS views and ODP providers that do not have a primary key if the load type is Initial Only (delta loading is not supported).

For sources without a primary key, a technical column is automatically added as the last column (for certain targets). It contains an immutable unique identifier for each record.

You can find more information about the use of SAP S/4HANA and other ABAP sources for replication flows here: SAP S/4HANA and Other ABAP Sources for Replication Flows

Transformation Flows

The figure shows an overview of the different ways for data acquisition using SAP Datasphere connections: remote tables (including federation and replication, both snapshot and real-time), data flows, replication flows, and transformation flows. The transformation flow is highlighted.

A transformation flow run is a one-time event that completes when the relevant data is successfully loaded to the target table. You can run a transformation flow multiple times, for example if you are loading delta changes to the target table. Creating a transformation flow involves two important steps:

  1. Creating a graphical view transform or an SQL view transform to load data from source tables and to transform data.

  2. Adding a target table to the transformation flow and mapping the columns from the graphical view transform or SQL View transform created in the first step to the target table.

As sources, you can add local tables, views, Open SQL schema objects, and remote tables located in BW bridge spaces.

A transformation flow offers the following functionality:

  • Various SQL-based transformations (for example joins, aggregations, and calculated columns).
  • Scheduling through task chains, with integration into the Data Integration Monitor and support for input parameters.
  • A run mode setting for performance-optimized or memory-optimized consumption.
  • Simulated runs and generation of PlanViz files.
  • Generation of an SAP HANA ExplainPlan for transformation flows using the SAP HANA runtime, directly from the Data Integration Monitor, to simplify performance analysis.
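In the tool itself you model these transformations graphically or in SQL. Purely to illustrate the kind of join-and-aggregate logic a view transform expresses (tables and columns invented for this example), here is a pandas sketch:

```python
import pandas as pd

sales = pd.DataFrame({"PRODUCT_ID": [1, 1, 2], "AMOUNT": [10.0, 20.0, 5.0]})
products = pd.DataFrame({"PRODUCT_ID": [1, 2], "CATEGORY": ["A", "B"]})

# Join the two sources, then aggregate the amounts per category.
joined = sales.merge(products, on="PRODUCT_ID", how="inner")
result = joined.groupby("CATEGORY", as_index=False)["AMOUNT"].sum()
```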

You can also define transformation flows in SAP HANA Data Lake Files spaces, where the transformation flow runtime is set to Apache Spark. This brings additional options:

  • Use of Python script operators for transformations.
  • Optional use of batch processing when using the SAP HANA runtime and Python operators.

Comparing Types of Flows

The following table gives a comparison of different aspects of data flow, replication flow and transformation flow:

Comparison of Flows

  • Supported sources
    Data flow: SAP and non-SAP systems
    Replication flow: SAP and non-SAP systems
    Transformation flow: SAP Datasphere only (processes data from one local table to another local table)

  • Supported targets
    Data flow: Local table in SAP Datasphere only
    Replication flow: SAP and non-SAP systems
    Transformation flow: SAP Datasphere only (processes data from one local table to another local table)

  • Data load types
    Data flow: Initial, batch load
    Replication flow: Initial Only, Initial and Delta, Delta Only
    Transformation flow: Initial Only, Initial and Delta

  • Supported storage within SAP Datasphere
    Data flow: SAP HANA database (disk and in-memory)
    Replication flow: SAP HANA database (disk and in-memory) or SAP HANA Data Lake Files (object store)
    Transformation flow: SAP HANA database (disk and in-memory) or SAP HANA Data Lake Files (object store)

  • Parallelized data load and scalability
    Data flow: Partial
    Replication flow: Supports parallelization during the initial load through partitioning
    Transformation flow: You can define batches (in the Flows monitor detail screen) to divide large datasets

  • Operators and transform capabilities
    Data flow: Join, union, projection, calculated column, aggregation, Python script
    Replication flow: Projections and filters, for example adding, adjusting, and removing columns, as well as row-level filters on one or multiple objects
    Transformation flow: Join, aggregation, filter, projection, calculated columns, Python script (transformation flows in file spaces only)

  • Data recovery
    Data flow: No. When data flow runs fail, they restart from the beginning. However, users can select Automatic Restart on Failure in the advanced properties.
    Replication flow: Yes. If a replication flow run fails, it restarts where it failed. You can also pause and resume a replication flow, at the flow level or at an individual object level.
    Transformation flow: No