Implementing Advanced Features in a Flowgraph

Objectives

After completing this lesson, you will be able to:

  • Implement Advanced Transformation Nodes
  • Use Variables
  • Build a flowgraph from existing flowgraphs

Advanced Transformation Nodes

The flowgraph offers advanced nodes for the following use cases:

You have learned how to split a data set into different records by using a Case node. But, how do you flatten a dataset across multiple columns, based on its content? For example, if a column contains the country value, let's put each country into its own column. As you will see, we can do that with a Pivot node.

You've learned how to fetch an attribute such as a phone number and add it to a sales record with the corresponding person. But if this person has multiple phone numbers a join would create a copy of the transaction data record with this person for each phone number. Suppose, you need only the first phone number, a Lookup node is what you need. With a combination of a Lookup node and a Pivot node you could capture all the phone numbers.

When a record changes in a source system, you usually want to update the data target with the change. You can choose to overwrite the existing record with the newer version, but you could also decide to keep the existing record, mark it as 'old', and load the new record alongside. This way you keep the old records as well as provide the new record to the business. Use a History Preservation node to accomplish this. This special node allows us to automatically add time-validity information to the old and new records.

The following table lists all flowgraph nodes in the Advanced section

Advanced Node Types

Node TypeUSe Case

History Preservation

Allows for maintaining older versions of rows when a change occurs by generating new rows in a target.

Lookup

Retrieves a column value or values from a Lookup table that match a lookup condition you define.

Map Operation

Sorts input data and maps output data.

Pivot

Creates a row of data from existing rows.

Unpivot

Creates a new row for each value in a column identified as a pivot column.

Table Comparison

Compares two tables and produces the difference between them as a dataset with rows flagged as INSERT, UPDATE, or DELETE.

Let's take a closer look at two important examples: The Lookup node and the Pivot node.

Suppose, you need an address or phone number of a person and you have an address book with all addresses and numbers. You want the most frequent address, or the one that fits best with the sales date. Then you should use the lookup node.

What is a lookup? A lookup is almost a join. You define matching criteria and add the corresponding values from another column, but it has some differences:

  • The lookup table should not change dynamically during data load of the main records. Therefore, the lookup transformation can be processed in real-time.
  • You can specify lookup table column and sort value pairs to invoke a sort, which selects a single lookup table row when multiple rows are returned.
  • Configure default values in the form of constants to be output when no Lookup table rows are returned.

The lookup only returns one result per People_Id.

But if you want two results you use a combination of a join and a pivot node, provided that the contact counter always starts with 1. On the left side of the following table, we have the result of a join. The right side shows the result of the pivot node with the desired outcome.

How does it work? Let's take a look at the pivot node principle in another example with measures.

This example is created by the following user interface:

Let's get familiar with this procedure step by step.

How to Configure a Pivot Node

You have data with a combined key of ID, month, and country. You want a monthly overview of costs and revenue per country. A pivot table can help summarize the data by placing it in an output data set.

You want to create new columns for different months and combine data from several months into one row for each country. A pivot node with pivot axis month will help you achieve this.

For each unique value in a pivot axis column, it produces a column in the output data set.

Steps

  1. In your flowgraph, add a Pivot Node and connect its predecessor to the new node.

  2. In the Axis Attributes section, select Click to select axis. Select the column that you want to pivot on, and then choose OK. A set of pivoted columns is generated for each axis value. Create an axis value for each unique value in the axis column. At runtime, a new column is created for each pivoted attribute and each unique axis value in the Axis Attribute section.

  3. Select Add Values to create one or more columns to hold the pivoted data. Type the column name in the Value column, and enter a prefix in the Prefix column. The new columns with their prefixes are displayed in the list of output columns after you specify the data in the columns. For example, if you distribute by month, you can generate prefixes Jan for the axis value (month) 1 and Feb for the axis value (month) 2.

  4. Under Data Columns, choose the + icon for each data column whose values you want pivoted from rows into columns. For example, you can select Costs and Revenue columns that you wanted included. Notice that the Output Columns section generates Jan_Costs, Jan_Revenue,Feb_Costs , and Feb_Revenue as new columns. (The underscore is added automatically to separate the prefix name from the pivoted column name.)

  5. Under Output Columns, select Pass Through to select the columns that you want to output without pivoting, typically attributes, for example, Country. The pass through columns appear in the target table without modification.

  6. Set the Duplicate Strategy at the bottom to choose the behavior when a duplicate is encountered.

    • Select Abort when you want to cancel the transform process.
    • Select First Row when you want to store the value in the first row.
  7. Select Apply to return to the flowgraph.

Result

You have defined how the data will be transformed.

SAP HANA Smart Data Quality / Data Quality Management Nodes

SAP HANA provides flowgraph nodes that relate to data quality.

Data quality covers several topics. These include record de-duplication, address cleansing, generating missing data, and enriching records with geographic information.

SAP HANA on-premise and SAP HANA Cloud support data quality but they offer different flowgraph nodes.

SAP HANA on-premise provides several nodes that support data quality. These are:

  • Match - find and deal with records that might be duplicates
  • Cleanse - tidy-up fields and generate additional information
  • Geocode - generate geographic data for a record

For SAP HANA Cloud, the key data quality features are available via a subscription-based micorservice. To configure how this microservice is used in your flowgraph, include a node called Data Quality Management microservice Cleanse (DQMm Cleanse).

The Cleanse node identifies, parses, validates, and formats the following data:

  • Address
  • Person name (*)
  • Organization name (*)
  • Occupational title (*)
  • Phone number (*)
  • Email address (*)

Address cleansing follows country-specific standards and directories.

(*) These items are available for SAP HANA on-premise only.

The node has a single inbound port and a single outbound port.

The node processes input data and compares it with reference data stored in country-specific address directories. The directories can be downloaded and deployed as part of SAP HANA Smart Data Integration or Smart data qualitiy installation for SAP HANA on-premise.

The Geocode node generates latitude and longitude coordinates for an address, and generates addresses from latitude and longitude coordinates.

The node interprets the input data and compares them with reference information coming from local geographical information directories.

For detailed information about the Geocode node, see the "Modeling Guide for SAP HANA Smart Data Integration and SAP HANA Smart Data Quality", available on the SAP Help Portal.

For detailed information about the installation of the address Directories, see the "Installation and Configuration Guide", available on the SAP Help Portal.

Using Variables in Flowgraphs

When you define a flowgraph, many of the nodes require values for us in filters or expressions. Fixing the value might not be efficient. In that case, you would define a variable so that you don't have to provide the value during design-time of the flowgraph.

There are many reasons for implementing variables including:

  • You might want to define one flowgraph and reuse it across multiple cases.
  • The filter values are not known until the execution time of the flowgraph.

When you create variables, you can use them in nodes that accept them such as the Projection node and Aggregation node.

For example, in a Projection node, you might want to process only those records for a specific country, such as Spain. In this case, you create a variable to provide the country value, in the flowgraph properties. You can then use the variable in a filter condition or in an expression by surrounding the variable name with $$.

For example, you have defined a variable P_COUNTRY. You use it as follows:

Code snippet
"Filter1_Input"."COUNTRY" = $$P_COUNTRY$$
Expand

The variables are provided at the time of execution. This means you don't have to change the flowgraph definition each time you want to choose different countries.

When you execute the procedure, you can provide the variable value for each execution.

Nested Flowgraphs and Procedures in Flowgraphs

You already know that you can use SQL expressions in nodes such as projection, join or case nodes. However, if you need to apply a sequence of SQL expressions, adding several nodes may be tedious and difficult to maintain. In some cases, the statement depends on the content, current date, or a manual choice made by the person who calls the flowgraph.

The solution is to implement a Procedure node. Here are some examples of why you might do this:

  • To implement a complex sequence of SQL statements with just one node
  • To implement an SQL statement that depends on a parameter which is not provided by the predecessor node
  • To reuse an existing SQLScript
  • To reuse an existing flowgraph

The procedure node allows you to call database procedures (that is, in SQLScript) within a flowgraph.

The procedure node has an inbound or outbound port for every IN or OUT table parameter of the procedure.

If a procedure has scalar input parameters, the values for these parameters come from variables defined in the flowgraph.

These variables are created automatically. When the flowgraph is executed, you provide a value for each variable so that it can be passed to the input scalar parameters.

Note
The procedure node is not available for real time processing.

Nested Flowgraphs

If you have an existing basic flowgraph that should be called by one or more new top-level flowgraphs, you can deploy it as a procedure. Then, this procedure can be implemented in the new flowgraph.

The image shows that deploying an existing flowgraph can be deployed as a procedure. This procedure can be re-used in a new flowgraph.

What are the advantages?

  • By referencing the same basic flowgraph, you need to define the common part of the transformation only once.
  • When the requirement for the basic flowgraph changes, you only need to adapt the basic flowgraph. All top-level flowgraphs remain consistent as they automatically use the new logic.
  • When different procedures or flowgraphs should be scheduled at the same start time and with the same recurrance pattern, you can include all of them from a master flowgraph and then schedule only the master flowgraph.

So, now you have to learn how a job for running a flowgraph can be scheduled.

Log in to track your progress & complete quizzes