Implementing Common Nodes in a Flowgraph

Objectives

After completing this lesson, you will be able to:

  • Combine datasets using flowgraph nodes.
  • Transform data sets using flowgraph nodes.
  • Split data sets using flowgraph nodes.

Union or Join?

When you want to combine data, there are two approaches. You can either choose a union or a join.

A union operation adds rows, a Join operation adds columns. The sources of a union provide the same type of information in dfferent rows with similar structure and fields. The sources of a join provide the dffferent type of information in dfferent columns with different structure and a few matching fields with the same values.

A Union operation adds records to existing records to form a combined, larger data set. The added records provide typically the same type of information from a different area, for example, the sales orders of two regions or from two years. The structures of the two sources should be compatible. That means that the columns that are part of the union must have matching data types.

A Join operation attaches columns from a second data set to an existing data set. The added columns provide additional, related information to enrich the original record. For example, your sales order record includes only the product ID and not the product description. In this case you could join the sales order record to the corresponding product master record to capture the product description and add the product description column to the sales order record. Your original record now has an additional column.

Union

What can you do when you have different source systems for sales orders from different countries and you need a full worldwide data set? The Union node merges the input data from two or more preceding nodes with an identical column structure. When you have more than two sources, choose + to create an additional input port.

Note

Rename the input port to reflect the meaning of the incoming data from the preceding node.

All incoming columns are passed on by the union node. Therefore, add a predecessor projection node to remove any irrelevant columns and rename the relevant ones, if necessary. The union node has a single output port that generates a union of all the records from all of the inputs to pass to the next node in the flowgraph. After choosing the (inspect) button, you see the only available property, Union All. It acts as explained in the following table.

Effect of the Union All property

Union All settingOutput data set
OffDuplicate records are removed.
OnAll records are preserved.

Join

Suppose you have a table with country codes, but you also want to include the country name in the output. The country name is stored in a separate table. Then, you need to specify a Join node to fetch the additional column and add it to your output.

The join node executes an SQL SELECT statement on multiple inbound data sets, combines them based on a join condition, and creates a unique outbound data set. Output columns can be removed or renamed.

The image shows two source tables, Sales2, with records for ID 1, 2, and 5, and master data 1 and 2. A join node can be defined as Inner Join which only passes matching records, or Left Join which passes all fileds even those without master data.

Which lines are passed on to the result set is determined by the join type. The following join types are supported:

  • Inner Join (only lines with matching partners)
  • Left Outer Join (all lines from the left join partner)
  • Right Outer Join (all lines from the right join partner)
  • Full Outer Join (all lines from both join partners)
  • Cross Join (all combinations of all lines)

A join might be an important prerequisite when you want to aggregate values on an attribute level, such as the sales revenue sum per product category. Of course, then you need another node for the aggregation after the join.

Aggregating Data

A common requirement of data provisioning is to summarize source data and store the result in a target table in SAP HANA. This is known as aggregation. An aggregation is defined by selecting one or more measures to be aggregated, an aggregation behavior for each measure, and one or more attributes that determine the level of aggregation.

For example: I need the average (aggregation behavior) sales quantity (measure) by region (attribute) for each year (attribute).

You implement aggregation using a flowgraph node. The Aggregation node is always based on a single inbound data set and generates a single outbound data set. Every outbound row is the result of the aggregation of multiple inbound rows; its values are calculated based on the aggregation node configuration.

Let's now focus on the configuration of the output columns. You can remove columns. For each remaining column, one of the following actions can be used:

  • Group-by
  • Count
  • Min
  • Max
  • Average
  • Sum

The group-by columns determine the level of aggregation. All rows with identical value combination of the group-by columns are aggregated in a single outbound row. This means, any two output rows have different values in at least one of the group-by columns.

The other columns are calculated based on the given aggregation behavior. For example, they return the sum, the average value, or the number of distinct original values.

In combination with the aggregation, you can do the following:

  • Rename outbound columns
  • Apply filters to the inbound rows, using the specification of a filtering criteria. For example, include only France sales in the aggregation.
  • Apply filters to the outbound rows, using the specification of a HAVING criteria. For example, only generate an output row if the aggregated sales are greater than EUR 10,000.

Splitting Data Sets

You can distribute a data source into multiple outputs.

Here are some reasons why you might want to do this:

  • You can implement different storage strategies, such as load to in-memory storage for recent records, and load to disk storage for older records.
  • You can store data at different levels of aggregation. For example, you could aggregate older data and load this to a table and leave the newer data at the line level and store this in a separate table.
  • Different transformation logic may be required. For example, a bonus calculation that depends on the job role, such as senior manager or sales executive.

The Case node receives data from a single inbound port and distributes the rows unchanged to multiple outbound ports.

For every outbound port, a Boolean expression is defined to determine which inbound rows will be transferred to it.

One outbound port is marked as "Default" to receive all the inbound rows that haven't been dispatched to any of the other ports.

After the case node, you can add further nodes, such as projection or aggregation nodes, to define a specific transformation behavior for each of the subsets of data generated by the case node. If needed, you could merge the different result sets into a single output with an additional union node.

Launch the following video to learn how you can create a flowgraph with a variety of nodes to handle a complex data transformation example.

Log in to track your progress & complete quizzes