Implementing Rank Nodes

Objectives
After completing this lesson, you will be able to:

After completing this lesson, you will be able to:

  • Configure a rank node to identify the top or bottom values of a data set

Rank Node

The purpose of the Rank node is to enable the selection, within a data set, of the top or bottom 1, 2, ... n values for a defined measure, and to output these measures together with the corresponding attributes and, if needed, other measures.

For example, with a Rank node, you can easily build a subset of data that gives you the five products that generated the biggest revenue (considering the measure GROSS_AMOUNT) in each country. The country, in this example, defines a Logical Partition of the source data set.

Note
The Rank node itself does not perform any type of aggregation on the source data set (this important topic is discussed later on in this lesson).

Main Settings of a Rank Node

The main settings of a Rank node are as follows:

Main Settings of a Rank Node

SettingPurpose
Aggregation FunctionDefine the key computation executed on the data set (see below)
Result Set DirectionDecide whether to extract the Top or the Bottom (Down) rows from the ordered data set
Result Set TypeAbsolute or Percentage
Target Value and OffsetDefine the number of rows to return
Generate Rank ColumnIndicate if you want to output a rank column and specify its name
Logical PartitionPartition the source data set by one or several columns, before executing the rank computation
Dynamic Partition ElementsDefine whether the partition can be adjusted automatically based on the columns that are selected by an upper node or an upper view/query that you execute on top of the current one
Sort ColumnThe columns that are used to order the data set to execute the ranking

Partitioning the Source Data Set

The source data set can be partitioned by one or several columns. This means that the extraction rule you define, for example, return the top five total sales amount, will be executed in each partition, for example, for each Country and each Year.

If you choose the Dynamic Partition Element, the columns listed in the Partition will be ignored if they are not requested by an upper node or top query. To follow the same example, you could return the top five total sales for each Country only, for each year only, just by excluding the column you do not need from your top query, and without redesigning your calculation view.

Choosing an Aggregation Functions

After partitioning the source data set by one or several columns and ordering it, an Aggregation Function is applied to the data set.

The Rank node offers four Aggregation Functions which can be classified into two categories:

  • Three functions computing the row numbers. These are:

    • Row

    • Rank

    • Dense Rank

  • One function computing the values of the sorted column, for example, a Sales amount. This is:

    • Sum

This diagram shows the behavior of the different aggregation functions on the same data set.

Row, Rank, and Dense Rank only differ in the way they deal with tie values (identical values in the sorted column). The Sum aggregation function generates a cumulative sum of the sorted column up to the current row.

Using Multiple Sort Columns

In scenarios where more than one sort column are defined, they are treated in sequence of appearance for the aggregation functions Row, Rank, and Dense Rank. This can be useful to better handle identical values in the first sort column.

The Sum aggregation function only uses the first Sort Column.

Generating the Result Set

The final stage is to extract the result set. This is done based on the Target Value setting.

The setting must be a rank value (for example, extract the data up to the rank 2), except for the Sum aggregation function where you set a cumulative value (for example, extract the data up to a cumulative Sales amount of 200).

The target value can be fixed, that is, defined in the Rank Node definition, or it can be set at runtime by means of an input parameter.

Note

The rank node returns the rows for which the computed ranking is lower or equal to the target value.

Working with Percentage Instead of Absolute Values

Instead of Absolute, which has been used up to now in our examples, you can set the Result Set Type to Percentage.

The figure, Percentage Result Type, shows the difference between the two methods.

With the Percentage Result Set Type, you can address requirements such as:

  • Return the 50% best-selling products for each customer (with a Row aggregation function, for example).

  • Return the best-selling products representing 30% of the total sales for each customer

    (with a Sum aggregation function).

The figure, Percentage Result Type, shows in a frame which rows would be included in the extracted data set.

Defining an Offset on the Result Set

It is possible to exclude a number of elements from the top of the sorted data set, by defining an offset.

In this example, the rows ranked up to (and including) 2 are excluded. Then the rank node returns the following rows based on the target value you have set.

Another example is when you want to assign members of a statistic collection to their inter-quartile interval. You could define a rank node as follows:

  • Aggregation Function: Row

  • Result Set Type: Percentage

  • Target Value: 0.5

  • Offset: 0.25

Is an Aggregation Needed Before Ranking?

As already discussed, the Ranknode can partition the source data, but it does NOT perform any aggregation on the source data.

In other words, the way the source data is structured has a major impact on the way the Ranknode will compute this data.

For example, if your source data contains a Sales Order ID column, you might not process directly a ranking of best-selling products in each country, because there might be several rows for the same Country and Product (but different Sales Orders).

When the source data for the Rank node is a table, you must make sure that the data structure suits your modeling needs. If not, you might need to first aggregate data by adding an Aggregation node used as a data source by the Rank node.

When the source data for the Rank node is a CUBE calculation view, the aggregation defined in that calculation view is implicitly triggered based on the columns requested by the rank node, but a column that is mapped to the output of the Ranknode but not consumed by the upper node will not be requested to the data source.

Caution
Like for other types of nodes, some columns are passed to the upper nodes even when they are not requested. For example, this is the case when the source calculation view has a sort order defined in the Semanticsnode (all the columns used for sorting are passed to consuming views). This is also the case when special settings such as Keep Flag are used.

Assigning a Type to the Rank Column

Depending on how you want to use the ranking information, you can decide to assign to the rank column the type Attribute or Measure in the Semantics node.

Rank Column: Measure or Attribute?

  • Attribute

    Assigning the type Attribute is a simple approach.

    It is probably a bit less error-prone, because you are not tempted to perform an irrelevant aggregation of ranking positions.

  • Measure

    With the type Measure, the rank column can provide some flexibility.

    For example, if you set the default aggregate function to MAX, you can retrieve summarized data, such as the total sales generated by the five biggest orders in each country, while keeping the information about how many orders are actually totaled in each country. Indeed, there might be countries that have received less than five orders over the considered period, and this information could be of interest when analyzing the data.

Because the information in the rank column can differ a lot depending on the aggregation function and result set type you used, you might want to give it an explicit name if you find that Rank_Column (default column name) is too generic.

Use Rank Nodes

Save progress to your learning plan by logging in or creating an account