The purpose of the Rank node is to enable the selection, within a data set, of the top or bottom 1, 2, ... n values for a defined measure, and to output these measures together with the corresponding attributes and, if needed, other measures.
For example, with a Rank node, you can easily build a subset of data that gives you the five products that generated the biggest revenue (considering the measure GROSS_AMOUNT) in each country. The country, in this example, defines a Logical Partition of the source data set.
Note
The Rank node itself does not perform any type of aggregation on the source data set (this important topic is discussed later on in this lesson).Main Settings of a Rank Node
The main settings of a Rank node are as follows:
Main Settings of a Rank Node
Setting | Purpose |
---|---|
Aggregation Function | Define the key computation executed on the data set (see below) |
Result Set Direction | Decide whether to extract the Top or the Bottom (Down) rows from the ordered data set |
Result Set Type | Absolute or Percentage |
Target Value and Offset | Define the number of rows to return |
Generate Rank Column | Indicate if you want to output a rank column and specify its name |
Logical Partition | Partition the source data set by one or several columns, before executing the rank computation |
Dynamic Partition Elements | Define whether the partition can be adjusted automatically based on the columns that are selected by an upper node or an upper view/query that you execute on top of the current one |
Sort Column | The columns that are used to order the data set to execute the ranking |
Partitioning the Source Data Set
The source data set can be partitioned by one or several columns. This means that the extraction rule you define - for example, return the top five total sales amount - will be executed in each partition, for example, for each Country and each Year.
If you choose the Dynamic Partition Element, the columns listed in the Partition will be ignored if they are not requested by an upper node or top query. To follow the same example, you could return the top five total sales for each Country only, for each year only, just by excluding the column you do not need from your top query, and without redesigning your calculation view.
Choosing an Aggregation Functions
After partitioning the source data set by one or several columns and ordering it, an Aggregation Function is applied to the data set.
The Rank node offers four Aggregation Functions which can be classified into two categories:
Three functions computing the row numbers. These are as follows:
- Row
- Rank
- Dense Rank
One function computing the values of the sorted column, for example, a Sales amount.
This is Sum.
The figure, Aggregation Function, shows the behavior of the different aggregation functions on the same data set. Here is the description of each of them:
Aggregation Functions
Aggregation Function | Description |
---|---|
Row | Standard row numbering (ties have different ranks) |
Rank | Olympic ranking (ties have the same rank) |
Dense Rank | Enumerates every rank value (no gap in numbering sequence) |
Sum | Adds up values of the Order By column |
Row, Rank, and Dense Rank only differ in the way they deal with tie values (identical values in the sorted column). The Sum aggregation function generates a cumulative sum of the sorted column up to the current row.
Using Multiple Sort Columns
In scenarios where more than one sort column are defined, they are treated in sequence of appearance for the aggregation functions Row, Rank, and Dense Rank. This can be useful to better handle identical values in the first sort column.
The Sum aggregation function only uses the first Sort Column.
Generating the Result Set
The final stage is to extract the result set. This is done based on the Target Value setting.
The setting must be a rank value - for example, extract the data up to the rank 2), except for the Sum aggregation function where you set a cumulative value (for example, extract the data up to a cumulative Sales amount of 200).
The target value can be fixed, that is, defined in the Rank Node definition, or it can be set at runtime by means of an input parameter.
Note
The rank node returns the rows for which the computed ranking is lower or equal to the target value.
SAP HANA Cloud QRC 2/2024 introduces an additional capability: you can now set the Target Value setting to All Values. The rank node then sorts the value without eliminating any of them from the result set.
Working with Percentage Instead of Absolute Values
Instead of Absolute, which has been used up to now in our examples, you can set the Result Set Type to Percentage.
Note
The property is called percentage, but the calculated values are rather fractions between 0 and 1. For 50%, you will write or see the value 0.5.
The figure, Percentage Result Type, shows the difference between the two methods.
With the Percentage Result Set Type, you can address requirements such as the following:
Return the 50% best-selling products for each customer (for example, with a Row aggregation function)
Return the best-selling products representing 30% of the total sales for each customer
(with a Sum aggregation function).
The figure, Percentage Result Type, shows in a frame the rows that would be included in the extracted data set.
Defining an Offset on the Result Set
It is possible to exclude a number of elements from the top of the sorted data set by defining an offset.
In this example, the rows ranked up to (and including) 2 are excluded. Then the rank node returns the following rows based on the target value you have set.
Another example is when you want to assign members of a statistic collection to their inter-quartile interval. You could define a rank node as follows:
Aggregation Function: Row
Result Set Type: Percentage
Target Value: 0.5
Offset: 0.25
Is an Aggregation Needed Before Ranking?
As already discussed, the Rank node can partition the source data, but it does NOT perform any aggregation on the source data.
In other words, the way the source data is structured has a major impact on the way the Rank node will compute this data.
For example, if your source data contains a Sales Order ID column, you might not process directly a ranking of best-selling products in each country, because there might be several rows for the same Country and Product (but different Sales Orders).
When the source data for the Rank node is a table, you must make sure that the data structure suits your modeling needs. If not, you might need to first aggregate data by adding an Aggregation node used as a data source by the Rank node.
When the source data for the Rank node is a CUBE calculation view, the aggregation defined in that calculation view is implicitly triggered based on the columns requested by the rank node, but a column that is mapped to the output of the Rank node but not consumed by the upper node will not be requested to the data source.
Caution
As is the case with other types of nodes, some columns are passed to the upper nodes even when they are not requested. For example, this is the case when the source calculation view has a sort order defined in the Semantics node (all the columns used for sorting are passed to consuming views). This is also the case when special settings such as Keep Flag are used.Finally, keep in mind that further processing after a rank node (in the calculation view containing the rank node, or one that consumes it) might change the order of rows in your result set. To avoid this, in some scenarios, you might need to do one of the following:
Use an ORDER BY clause on the rank column in the query to keep the result set ordered
Use the Sort Result Set option in the semantics of the calculation view.
Assigning a Type to the Rank Column
Depending on how you want to use the ranking information, you can decide to assign to the rank column the type Attribute or Measure in the Semantics node.
Rank Column: Measure or Attribute?
Attribute
Assigning the type, Attribute, is a simple approach.
It is probably a bit less error-prone because you are not tempted to perform an irrelevant aggregation of ranking positions.
Measure
With the type, Measure, the rank column can provide some flexibility.
For example, if you set the default aggregate function to MAX, you can retrieve summarized data, such as the total sales generated by the five biggest orders in each country, while keeping the information about how many orders are actually totaled in each country. Indeed, there might be countries that have received less than five orders over the considered period, and this information could be of interest when analyzing the data.
Because the information in the rank column can differ a lot depending on the aggregation function and result set type you used, you might want to give it an explicit name if you find that Rank_Column (default column name) is too generic.