Modeling in SAP HANA Cloud

Working with Unions, Minus and Intersect Nodes

Objectives
After completing this lesson, you will be able to:

After completing this lesson, you will be able to:

  • Use Union nodes to combine data sets
  • Use set operations:Minus and Intersect

Union Node

A Union node is used to combine two or more data sets to a common set of columns.

Depending on how different the column names are from the data source, you can map the columns automatically by name (columns with identical names will be mapped automatically) or define the mapping manually.

Standard Union

Depending on the requirement, you can use one of the following approaches:

  • A standard union

  • A union with constant values

A standard union is where, for each target column in the union, there is always one source column that is mapped. In the example, you see how both sources provide a mapped column to the union. The source column does not have to have the same name as the target column. For example, a source column month could be mapped to a target column period. Also, a standard union does not have to provide a mapping for each source column. In other words, there could be source columns that are left behind and play no part in the union. This scenario is useful when you want to combine measures from multiple sources into one column. Because the data sources can provide an attribute that describes the type of measure (for example, Plan or Actual) it means that we do not lose the meaning of each row.

Union with Constant Values

A union with constant values is where you provide a fixed value that is used to fill a target column where the data source cannot provide a mapped column. For example, you have a data source that provides the actual amount and you also have a second data source that provides the planned amount. You decide to avoid combining these measures into one column as they would lose meaning, as there is no attribute to describe the meaning. In this case you would map the measure to a separate target column and provide a constant value ‘0’ to the target column for the missing measure on each side.

The choice between a standard union and a union with constant values depends on the data you are using and the way you want the end users to report on data.

If it is more beneficial to present different measures in different columns, you can use a union with constant values so that you have a way to provide a value (probably zero) to a column where there is no source mapping. On the contrary, if it is easier to present measures in a single column and differentiate them with an attribute, such as an ‘amount type’, use the standard union.

Mapping Columns Based on their Names

It is possible to map columns in a Union node based on their names. This helps to save time when the data structures have some similarities in terms of column names.

The Auto Map by Name feature can be used in two different ways, depending on whether you make a selection in the Data Source area before clicking the icon.

Two Options to Map Columns by Name Automatically

ScenarioResult
You have not selected any data source (default behavior)

All the columns of all data sources are added to the output.

Columns with matching names are mapped together.

You have selected one or several data source(s)

Only the columns of the selected data source(s) are added to the output.

The columns from the other data source(s) are added to the output only if they have a matching column in the selected data source(s). They are mapped to their matching columns.

Note
Selecting a data source before triggering Auto Map by Name is useful when one or several other data sources have a lot of columns that you do not want to include in the output.

Unmapped Columns in a Union Node

There could be instances when a union needs to be performed between data sources that do not provide the same data structure.

  • If column names are different but contain the same type of information (for example, CURRENCY versus CURRENCY_CODE, you must define the mapping manually. This can be done with drag and drop.

  • If a column exists in only one data source but you need it in the target, you map it to the target. Then, for the other source, you must decide whether you allow the data in this column to be null or you can define a constant value that will be assigned to all the rows from this data source.

As an example, you have a data source that provides the customer status attribute and you have a second data source where the customer status attribute cannot be provided; there is no column for this. You know this second data source contains only customers with the status Active. So you simply fill the target column with a constant value for this data source to Active for every row.

Even if you have a source column available for mapping, you can choose to use a constant value instead. This creates the effect of overriding the source value.

Manage Mapping of Unmapped Columns in a Union Node

Although we described how constants are often used when one of the data sources cannot, or should not, provide a value to the union target column, we can also use constants when none of the data sources can provide a value. To do this, we first create an empty column in the union target. We then define a constant value for each data source. For example, I would like to create a union between table A, that contains part-time employees’ data, and another table B, that contains full-time employees’ data. The employment status is very important for my analysis but neither table contains a column that indicates employment status. So I simply add a new column to the union target called ‘Employment Status’ and define a constant value for one data source as Full Time and the other data source as Part Time. Basically, I have manually tagged each source with a fixed label that now appears in each row and I can use this new column for filtering, aggregation, and so on.

To set the constant value, right-click the target column and choose Manage Mappings.

Union Node Pruning

In some circumstances, the query that is executed on top of a Union node can ignore completely one or several of the data sources.

Think about a Union node that combines two data sources: Cold (data up to 2016) and Hot (data for 2017 and later).

If a select statement queries only the data for 2017, the Cold data source can be ignored from the start (this is called pruning), which will provide a better performance than if the query execution scans the entire Cold data source looking for records for 2017, and eventually does not find a single one.

Note
You will learn how to define Union node pruning in the unit Optimization of Models.

Empty Union Behavior

We know that the data sources to a union can provide different columns. So what happens when a query requests columns that are present in one source but not another source? Do you want to be made aware that one source was not able to provide a row? Or is this not important?

To illustrate this feature, imagine you have a data source, A, that contains the columns Product and Product Group and another data source, B, that contains only the column Product. If a query requests only the column Product Group, how does data source B respond when it doesn't have this column?

The answer depends on the setting in the property Empty Union Behavior which is set for each data source. The default behavior, as you might expect, is to provide No Row for data source B.

But there are times when you might prefer to know that no rows could be returned from data source B. In that case you should proceed as follows:

  1. Add a constant column to the union output.

  2. Provide a constant value for each data source with a suitable value that help you identify each source.

  3. Change the property Empty Union Behavior to Row with Constants.

To test this, you simply consume the calculation view that contains the union, with a query that does not request any column from one of the data sources. In our case above, the query should request only the column Product Group. The results will then contain multiple rows from data source A, and also a single row with the constant value that you defined for the data source B and nulls will fill the empty columns that it could not provide.

Combining Several Data Sets: Union Versus Join

When you need to combine measures originating from more than one table or calculation view, you might want to create a join between these tables. However, under most circumstances, you must avoid using a join to address this requirement, because it is very costly from a computing standpoint.

It is more beneficial to use a Union node, which provides much better performance.

Note

Technically, a union is not a join type.

Union All or Union Distinct

A Union node generates a list of all values from the input data sources, even if values are repeated. For example, if two data sources both include the same customer then the customer will appear twice in the output data set.

This is the equivalent of UNION ALL in SQL and is desirable in many cases.

If repeating values are not required, then you should include an Aggregation node on top of the Union node to aggregate the attributes. We normally associate aggregation behavior with measures, such as SUM, AVE, and so on, but aggregation can also be performed on attributes. The effect of aggregating attributes is to simply produce a distinct list of values.

This is the equivalent of UNION DISTINCT (or just UNION) in SQL.

Combine Two Data Sources with a Union Node

Set Operations Nodes – Intersect and Minus

Watch this video to learn about the Set Operations - Intersect and Minus.

For the Minus node, the data sets are considered based on their order in the list of data sources for the node. So, the output contains items from the FIRST data source that are NOT in the SECOND data source. In SAP HANA Cloud it is possible to change the order of the data sources: just right click the Minus node and choose Switch Order.

Filtering relies on the list of attributes that are queried at runtime. In other words, a column that is provided by both source nodes, for example, Country, but is not queried at runtime, is ignored.

Note
It is possible to swap the two data sources of a Minus node. Of course, this has an impact on the semantic.

Save progress to your learning plan by logging in or creating an account