Implementing Rank Nodes

Objective

After completing this lesson, you will be able to Configure a rank node to identify the top or bottom values of a data set.

Rank Node

The purpose of the Rank node is to enable the selection, within a data set, of the top or bottom 1, 2, ... n values for a defined measure, and to output these measures together with the corresponding attributes and, if needed, other measures.

For example, with a Rank node, you can easily build a subset of data that gives you the five products that generated the biggest revenue (considering the measure GROSS_AMOUNT) in each country. The country, in this example, defines a Logical Partition of the source data set.

Note

The Rank node itself does not perform any type of aggregation on the source data set (this important topic is discussed later on in this lesson).

Main Settings of a Rank Node

The main settings of a Rank node are as follows:

Main Settings of a Rank Node

SettingPurpose
Aggregation FunctionDefine the key computation executed on the data set (see below)
Result Set DirectionDecide whether to extract the Top or the Bottom (Down) rows from the ordered data set
Result Set TypeAbsolute or Percentage
Target Value and OffsetDefine the number of rows to return
Generate Rank ColumnIndicate if you want to output a rank column and specify its name
Logical PartitionPartition the source data set by one or several columns, before executing the rank computation
Dynamic Partition ElementsDefine whether the partition can be adjusted automatically based on the columns that are selected by an upper node or an upper view/query that you execute on top of the current one
Sort ColumnThe columns that are used to order the data set to execute the ranking

Partitioning the Source Data Set

The source data set can be partitioned by one or several columns. This means that the extraction rule you define - for example, return the top five total sales amount - will be executed in each partition, for example, for each Country and each Year.

If you choose the Dynamic Partition Element, the columns listed in the Partition will be ignored if they are not requested by an upper node or top query. To follow the same example, you could return the top five total sales for each Country only, for each year only, just by excluding the column you do not need from your top query, and without redesigning your calculation view.

Choosing an Aggregation Functions

After partitioning the source data set by one or several columns and ordering it, an Aggregation Function is applied to the data set.

The Rank node offers four Aggregation Functions which can be classified into two categories:

  • Three functions computing the row numbers. These are as follows:

    • Row
    • Rank
    • Dense Rank
  • One function computing the values of the sorted column, for example, a Sales amount.

    This is Sum.

Table with sample data and different possible outputs depending on the applied aggregation function.

The figure, Aggregation Function, shows the behavior of the different aggregation functions on the same data set. Here is the description of each of them:

Aggregation Functions

Aggregation FunctionDescription
RowStandard row numbering (ties have different ranks)
RankOlympic ranking (ties have the same rank)
Dense RankEnumerates every rank value (no gap in numbering sequence)
SumAdds up values of the Order By column

Row, Rank, and Dense Rank only differ in the way they deal with tie values (identical values in the sorted column). The Sum aggregation function generates a cumulative sum of the sorted column up to the current row.

Using Multiple Sort Columns

In scenarios where more than one sort column are defined, they are treated in sequence of appearance for the aggregation functions Row, Rank, and Dense Rank. This can be useful to better handle identical values in the first sort column.

The Sum aggregation function only uses the first Sort Column.

Generating the Result Set

The final stage is to extract the result set. This is done based on the Target Value setting.

Tables of sample data to illustrate the target value. Target value 2 always returns 2 rows if the aggregation function is Row, but may return more than 2 rows if the aggregation function is Rank. For the aggregation function Sum, the target value acts as a limit for the sum value.

The setting must be a rank value - for example, extract the data up to the rank 2), except for the Sum aggregation function where you set a cumulative value (for example, extract the data up to a cumulative Sales amount of 200).

The target value can be fixed, that is, defined in the Rank Node definition, or it can be set at runtime by means of an input parameter.

Note

The rank node returns the rows for which the computed ranking is lower or equal to the target value.

SAP HANA Cloud QRC 2/2024 introduces an additional capability: you can now set the Target Value setting to All Values. The rank node then sorts the value without eliminating any of them from the result set.

Working with Percentage Instead of Absolute Values

Instead of Absolute, which has been used up to now in our examples, you can set the Result Set Type to Percentage.

The figure, Percentage Result Type, shows the difference between the two methods.

Tables with sample data for Result Set Type Percentage for all possible aggregation types. Percentage value is calculated as absolute value of the current row divided by the maximum absolute value. The target value should be lower than 1, for example 0.5. The aggregation function may affect the number of rows returned.

With the Percentage Result Set Type, you can address requirements such as the following:

  • Return the 50% best-selling products for each customer (for example, with a Row aggregation function)

  • Return the best-selling products representing 30% of the total sales for each customer

    (with a Sum aggregation function).

The figure, Percentage Result Type, shows in a frame the rows that would be included in the extracted data set.

Defining an Offset on the Result Set

It is possible to exclude a number of elements from the top of the sorted data set by defining an offset.

Tables with sample data for the offset property for various aggregation types. Target value 2 and offset 2 means that up to 2 further rows with rank higher than 2 are returned.

In this example, the rows ranked up to (and including) 2 are excluded. Then the rank node returns the following rows based on the target value you have set.

Another example is when you want to assign members of a statistic collection to their inter-quartile interval. You could define a rank node as follows:

  • Aggregation Function: Row

  • Result Set Type: Percentage

  • Target Value: 0.5

  • Offset: 0.25

Is an Aggregation Needed Before Ranking?

As already discussed, the Rank node can partition the source data, but it does NOT perform any aggregation on the source data.

In other words, the way the source data is structured has a major impact on the way the Rank node will compute this data.

For example, if your source data contains a Sales Order ID column, you might not process directly a ranking of best-selling products in each country, because there might be several rows for the same Country and Product (but different Sales Orders).

The effect of aggregation before ranking. This example shows a source table with country (used for logical partitioning), product, order id, and sales amount. Option 1: Do not aggregate, returns the products and amounts of the best individual orders per country. Option 2, Aggregate data by Country and Product before ranking, returns the products per country that sell best overall.

When the source data for the Rank node is a table, you must make sure that the data structure suits your modeling needs. If not, you might need to first aggregate data by adding an Aggregation node used as a data source by the Rank node.

When the source data for the Rank node is a CUBE calculation view, the aggregation defined in that calculation view is implicitly triggered based on the columns requested by the rank node, but a column that is mapped to the output of the Rank node but not consumed by the upper node will not be requested to the data source.

Caution

As is the case with other types of nodes, some columns are passed to the upper nodes even when they are not requested. For example, this is the case when the source calculation view has a sort order defined in the Semantics node (all the columns used for sorting are passed to consuming views). This is also the case when special settings such as Keep Flag are used.

Finally, keep in mind that further processing after a rank node (in the calculation view containing the rank node, or one that consumes it) might change the order of rows in your result set. To avoid this, in some scenarios, you might need to do one of the following:

  • Use an ORDER BY clause on the rank column in the query to keep the result set ordered

  • Use the Sort Result Set option in the semantics of the calculation view.

Assigning a Type to the Rank Column

Depending on how you want to use the ranking information, you can decide to assign to the rank column the type Attribute or Measure in the Semantics node.

Rank Column: Measure or Attribute?

  • Attribute

    Assigning the type, Attribute, is a simple approach.

    It is probably a bit less error-prone because you are not tempted to perform an irrelevant aggregation of ranking positions.

  • Measure

    With the type, Measure, the rank column can provide some flexibility.

    For example, if you set the default aggregate function to MAX, you can retrieve summarized data, such as the total sales generated by the five biggest orders in each country, while keeping the information about how many orders are actually totaled in each country. Indeed, there might be countries that have received less than five orders over the considered period, and this information could be of interest when analyzing the data.

Because the information in the rank column can differ a lot depending on the aggregation function and result set type you used, you might want to give it an explicit name if you find that Rank_Column (default column name) is too generic.

Use Rank Nodes

Log in to track your progress & complete quizzes