A Union node is used to combine two or more data sets to a common set of columns.
Depending on how different the column names are from the data source, you can map the columns automatically by name (columns with identical names will be mapped automatically) or define the mapping manually.
Depending on the requirement, you can use one of the following approaches:
A standard union
A union with constant values
A standard union is where, for each target column in the union, there is always one source column that is mapped. In the example, you see how both sources provide a mapped column to the union. The source column does not have to have the same name as the target column. For example, a source column month could be mapped to a target column period. Also, a standard union does not have to provide a mapping for each source column. In other words, there could be source columns that are left behind and play no part in the union. This scenario is useful when you want to combine measures from multiple sources into one column. Because the data sources can provide an attribute that describes the type of measure (for example, Plan or Actual) it means that we do not lose the meaning of each row.
Union with Constant Values
A union with constant values is where you provide a fixed value that is used to fill a target column where the data source cannot provide a mapped column. For example, you have a data source that provides the actual amount and you also have a second data source that provides the planned amount. You decide to avoid combining these measures into one column as they would lose meaning, as there is no attribute to describe the meaning. In this case you would map the measure to a separate target column and provide a constant value ‘0’ to the target column for the missing measure on each side.
The choice between a standard union and a union with constant values depends on the data you are using and the way you want the end users to report on data.
If it is more beneficial to present different measures in different columns, you can use a union with constant values so that you have a way to provide a value (probably zero) to a column where there is no source mapping. On the contrary, if it is easier to present measures in a single column and differentiate them with an attribute, such as an ‘amount type’, use the standard union.
Mapping Columns Based on their Names
It is possible to map columns in a Union node based on their names. This helps to save time when the data structures have some similarities in terms of column names.
The Auto Map by Name feature can be used in two different ways, depending on whether you make a selection in the Data Source area before clicking the icon.
Two Options to Map Columns by Name Automatically
|You have not selected any data source (default behavior)|
All the columns of all data sources are added to the output.
Columns with matching names are mapped together.
|You have selected one or several data source(s)|
Only the columns of the selected data source(s) are added to the output.
The columns from the other data source(s) are added to the output only if they have a matching column in the selected data source(s). They are mapped to their matching columns.
Unmapped Columns in a Union Node
There could be instances when a union needs to be performed between data sources that do not provide the same data structure.
If column names are different but contain the same type of information (for example, CURRENCY versus CURRENCY_CODE, you must define the mapping manually. This can be done with drag and drop.
If a column exists in only one data source but you need it in the target, you map it to the target. Then, for the other source, you must decide whether you allow the data in this column to be null or you can define a constant value that will be assigned to all the rows from this data source.
As an example, you have a data source that provides the customer status attribute and you have a second data source where the customer status attribute cannot be provided; there is no column for this. You know this second data source contains only customers with the status Active. So you simply fill the target column with a constant value for this data source to Active for every row.
Even if you have a source column available for mapping, you can choose to use a constant value instead. This creates the effect of overriding the source value.
Manage Mapping of Unmapped Columns in a Union Node
Although we described how constants are often used when one of the data sources cannot, or should not, provide a value to the union target column, we can also use constants when none of the data sources can provide a value. To do this, we first create an empty column in the union target. We then define a constant value for each data source. For example, I would like to create a union between table A, that contains part-time employees’ data, and another table B, that contains full-time employees’ data. The employment status is very important for my analysis but neither table contains a column that indicates employment status. So I simply add a new column to the union target called ‘Employment Status’ and define a constant value for one data source as Full Time and the other data source as Part Time. Basically, I have manually tagged each source with a fixed label that now appears in each row and I can use this new column for filtering, aggregation, and so on.
To set the constant value, right-click the target column and choose Manage Mappings.
Union Node Pruning
In some circumstances, the query that is executed on top of a Union node can ignore completely one or several of the data sources.
Think about a Union node that combines two data sources: Cold (data up to 2016) and Hot (data for 2017 and later).
If a select statement queries only the data for 2017, the Cold data source can be ignored from the start (this is called pruning), which will provide a better performance than if the query execution scans the entire Cold data source looking for records for 2017, and eventually does not find a single one.
Empty Union Behavior
We know that the data sources to a union can provide different columns. So what happens when a query requests columns that are present in one source but not another source? Do you want to be made aware that one source was not able to provide a row? Or is this not important?
To illustrate this feature, imagine you have a data source, A, that contains the columns Product and Product Group and another data source, B, that contains only the column Product. If a query requests only the column Product Group, how does data source B respond when it doesn't have this column?
The answer depends on the setting in the property Empty Union Behavior which is set for each data source. The default behavior, as you might expect, is to provide No Row for data source B.
But there are times when you might prefer to know that no rows could be returned from data source B. In that case you should proceed as follows:
Add a constant column to the union output.
Provide a constant value for each data source with a suitable value that help you identify each source.
Change the property Empty Union Behavior to Row with Constants.