Implement Good Modeling Practices

Objectives
After completing this lesson, you will be able to:

  • Implement good modeling practices

Best Practices for Modeling

In this lesson, we will cover recommendations and best practices for data modeling with SAP HANA Cloud.

General Design Recommendations

Break down large models into individual calculation views so that they are easier to understand and also allow you to define reusable components, thus avoiding redundancy and duplicated maintenance as a consequence.

Try to develop calculation views in layers so that each layer consumes the layers below it. Any change to the lower layers immediately impacts the higher layers. The lowest layers should provide the foundational calculation views, while the higher layers add increasing semantics to give the views more business meaning.

Note

Do not be concerned that breaking up large calculation views into individual building blocks will damage overall performance. At run time, the SQL compiler attempts to flatten the entire stack of calculation views into a single executable model. Remember the calculation view definitions, however many there are in the stack, are simply a logical view of the data flow and provide the essential logic. How the compiler puts together the final query execution can never be known at the design stage by simply observing the calculation view definition.

Avoid mixing CDS views with calculation views. It is tempting to reuse CDS views as data sources in your calculation views because this might provide a fast path to accessing data, such as master data. However, this is not recommended if you are looking for the very best performance.

It might seem pretty obvious but you must thoroughly test your calculation views on data sets that are realistic and not just with a few sample records. This applies not just to data volume but also to the variety of data values. Make sure you include unexpected values to ensure they are handled as you expect. For example, null values, or a hierarchy that would produce orphan nodes.

Aim for Query Unfolding

At run time, your calculation view is analyzed by the SQL processor. A key aim is to convert each calculation view operation into a corresponding SQL operator so that the complete data flow graph becomes a single SQL statement. This is called unfolding. A benefit of unfolding to SQL is that only the SQL engine is needed and no other SAP HANA engine is called, such as the calculation engine or other SAP HANA Cloud engines such as spatial or graph. However, due to your calculation view design, there might be steps that cannot be translated into SQL operators. Such steps are processed in the specialist SAP HANA engines, which generate one or more materialized column views. These materialized column views are then read by SQL. In other words, instead of SQL reading directly from the source tables, interim column views have to be generated first and the SQL then reads from them. The result is a partially unfolded query that might not perform as well as a completely unfolded query, because SQL optimizations are limited.

Some nodes and operators cannot be unfolded when there is no equivalent SQL operator for the calculation view feature you are implementing.

It should be emphasized that although unfolding is generally desirable, sometimes unfolding can actually damage overall performance. For example, query compilation time can increase when the query has to be unfolded each time the calculation view is called with new selections. Therefore, it is possible to set a SQL hint that prevents unfolding for individual calculation views. This setting is found in the Semantics node, under Properties → Advanced.

You can set a global parameter to prevent unfolding for all calculation views.
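
As an illustration, unfolding can also be influenced for an individual query with a SQL hint. The sketch below assumes a hypothetical calculation view name and uses the NO_CALC_VIEW_UNFOLDING hint; verify the exact hint name against the hint reference of your SAP HANA Cloud release.

Code snippet
-- Prevent unfolding for this query only (hint name per the SAP HANA hint reference)
SELECT
	"PRODUCT",
	SUM("AMOUNT") AS "AMOUNT"
FROM
	"CV_SALES"
GROUP BY
	"PRODUCT"
WITH HINT (NO_CALC_VIEW_UNFOLDING);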

You can check whether your calculation view is unfolded by using the SQL Analyzer, where you can see the list of operators and data sources. If a column view is used as a data source instead of a table, then you know the query was not unfolded, because the result has been materialized as an interim view rather than being read directly from a table. A completely unfolded query accesses only tables.

Also check the query compilation time versus the execution time to see how unfolding is affecting the overall run time.

It is possible to identify how many queries were not unfolded. Run a query on system view M_FEATURE_USAGE as follows:

Code snippet
SELECT
	FEATURE_NAME,
	CALL_COUNT
FROM
	"M_FEATURE_USAGE"
WHERE
	COMPONENT_NAME = 'CALCENGINE'
	AND FEATURE_NAME IN ('QUERY TOTAL', 'QUERY UNFOLDING')

Set Join Cardinalities

When you define a join, you can optionally set the cardinality to describe the relationship between the two data sources, for example, 1:1, 1:m, or n:m.

We recommend that you always set join cardinalities so the optimizer can decide if joins can be omitted at run time based on the query that is being sent to the calculation view.

If tables are used as the data sources, and not calculation views, then you can use the Propose Cardinality option.

Caution
The cardinality setting is proposed based on the content of the joined tables at the time you click the Propose button. The setting does not automatically update as the table content changes, so you might need to come back and adjust it if the relationship in the data changes. Also, remember to use realistic data if you plan to use the proposal: a cardinality proposed from a few test records may no longer make sense on live data.

One of the main benefits of having the correct cardinality set is that the SQL optimizer can prune an entire data source from a join if certain prerequisites are met.
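
To illustrate the idea, consider a hypothetical calculation view CV_SALES that joins a sales table to a PRODUCTS dimension with cardinality n..1. If a query requests no columns from the dimension, the optimizer can drop that join entirely:

Code snippet
-- Illustrative query against a hypothetical view CV_SALES that joins
-- sales data to a PRODUCTS dimension with cardinality n..1.
-- No column from PRODUCTS is requested, so the optimizer can prune the join.
SELECT
	"REGION",
	SUM("AMOUNT") AS "AMOUNT"
FROM
	"CV_SALES"
GROUP BY
	"REGION";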

Referential Join Type

Consider using the Referential join type if your joins do not have to enforce data integrity and are defined only to provide access to additional columns where you already trust the integrity of the data (for example, a sales item will always have a header).

If certain conditions are met, the SQL optimizer might be able to prune the data source from the join when it is not needed.

Unblock Filter Push-down

In your calculation view, if you have defined a node that is a source to more than one other node (in other words, the data flow splits), then you need to be aware that this can potentially block filter push-down. The SQL optimizer does not want to break the semantics of the query, so by default it keeps the filter at the node where it is defined, and this node might be high up in the stack, causing large data sets to travel up the stack.

You can set the flag Ignore Multiple Outputs for Filter so that the blockage is removed and the optimizer is free to push the filter down to the lowest level for the best performance. However, be very careful to check that the query semantics are not broken and that wrong results are not produced.

General Join Recommendations

Joins are a source of potential performance improvements. Make sure you consider all the recommendations when defining joins to ensure you get the best performance from your calculation views.

Switch Calculation View Expressions to SQL

Expressions are used in many places within calculation views including calculated columns, restricted measures, and filter expressions.

For SAP HANA on-premise, it is possible to write expressions using either plain SQL or the native HANA language called column engine language. SAP recommends that you use only SQL and not column engine language to ensure the best performance (column engine language limits query unfolding).

When you build expressions using SAP HANA Cloud calculation views, you can only choose SQL expression language. Column engine language is not available.

Calculation views that were imported to SAP HANA Cloud from SAP HANA on-premise and include column engine language will still run, but you should change the language to SQL to achieve better performance. Once you select SQL language, the language selector is then grayed out so that you cannot return to Column Engine language.

You should be able to recreate column engine expressions in plain SQL. For example, the column engine concatenation operator + is replaced with the plain SQL operator ||.
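
As a simple illustration (the column names are hypothetical), a concatenation written in column engine language can be rewritten as a plain SQL expression:

Code snippet
-- Column engine language (legacy on-premise expression):
--   "FIRSTNAME" + ' ' + "LASTNAME"
-- Equivalent plain SQL expression for the calculated column:
"FIRSTNAME" || ' ' || "LASTNAME"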

Use Dedicated Views or Tables for ‘List of Values’

For performance reasons, we generally recommend that you create dedicated calculation views or tables to generate the list of help values for variables and input parameters.

This approach enables a faster display of the list of values, without reading the entire view or table on which the calculation view is based.

This is also best practice so that the list of values is reusable and consistent from one calculation view to another, and users see the same selection values.

The list can be driven by simple fixed filters in the help calculation view or even by analytic privileges that offer a dynamic list restricted to what each user is allowed to choose from.
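
As a sketch of the underlying idea (object names are hypothetical, and in practice this would typically be a small dimension calculation view rather than a plain SQL view), a dedicated value help source only needs to expose the distinct selection values:

Code snippet
-- Hypothetical dedicated value help source for a COUNTRY variable
CREATE VIEW "VH_COUNTRY" AS
SELECT DISTINCT
	"COUNTRY"
FROM
	"COUNTRIES";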

Column Pruning

Make sure you do not map columns that are not needed. This might sound like a fairly obvious recommendation, but it is very common for developers to simply map all columns without considering if and why they are used. Columns are often added 'just in case' they are needed later. Too many attributes can lead to very large, granular data sets, especially when only aggregated data is needed.

Optimal Filtering and Calculated Columns

Always try to filter as early as possible in the stack. The objective is to prune data as early as possible to avoid large data sets that can harm performance.

Avoid including calculated columns in filter expressions. It is better to first create a new column based on the calculation, and then add a projection node on top that filters the result set on that column (which is no longer calculated at filter time).
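
Expressed loosely in SQL terms (the table and column names are hypothetical), the pattern is to materialize the calculation in a lower step and filter on the resulting column above it, rather than filtering directly on the expression:

Code snippet
-- Anti-pattern: filter applied directly to a calculated expression
SELECT "ORDER_ID", "AMOUNT"
FROM "ORDERS"
WHERE YEAR("ORDER_DATE") = 2024;

-- Preferred pattern: calculate the column first, then filter on it
SELECT "ORDER_ID", "AMOUNT"
FROM (
	SELECT "ORDER_ID", "AMOUNT", YEAR("ORDER_DATE") AS "ORDER_YEAR"
	FROM "ORDERS"
)
WHERE "ORDER_YEAR" = 2024;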

Calculate as high as you can in the stack so that you avoid calculating on granular data and instead calculate on aggregated data.

Optimal Aggregations

Always reduce the data set by introducing aggregations early in the modeling stack.

If you consume a calculation view that produces attributes, do not change them to measures, and vice-versa.

Choose a cube with star join to build dimensional OLAP models. Do not try to create OLAP models using standard join nodes, which produces a purely relational model that may not perform as well.

Caching View Results

Complex calculation views place a heavy load on the system. To reduce this, it is possible to cache the results of a calculation view in order to speed up the time needed to get the results when executing a query against the view. This feature is useful when you have complex scenarios or large amounts of data so that the results do not have to be recalculated each time the query is executed. The benefits are not just the reduced time for the results to appear for the user of the query, but it also means CPU and memory consumption is reduced leading to better performance for other tasks.

If the cache cannot be used, because a query requests a different data set than what is held in the cache, the query does not fail. It simply means that the query has to be processed completely.

Caching should not be applied to raw data but to data that has been highly processed and reduced through aggregations and filtering. In practice, this means applying caching to the top-most nodes of a calculation view.

The cache can only be used for calculation views that do not check analytic privileges. This means that, in a calculation view stack, analytic privileges should be defined on the top-most view only. The optimal design is to define the cache setting on the highest calculation view possible in the stack, but not at the very top where analytic privileges are checked. This way, user privileges are always checked against the cached data whenever the cache is usable.

It is possible to fine-tune the calculation view cache using column and filter settings.

Firstly, calculation view caching can be defined at the column level. This means queries that use only a sub-set of the cached columns can be served by the cache. It is recommended to cache only the columns that are frequently requested in order to reduce memory consumption and speed up cache refresh. If you do not specify columns in the cache settings, then all columns are cached.

The cache size can be further reduced by defining filters in the cache settings. The cache is used only if the query filter exactly matches the filter defined for the cache, or restricts the data further.

It is possible to define a retention period of the cache so that the cached data expires. The value entered is in minutes.

Note
It is currently not possible to define that cache should be invalidated if new data is loaded to the underlying tables. This feature is expected to come later. For now we just have a time-based expiry (in minutes).

In order to use the calculation view static cache there are some important prerequisites that must be met.

Firstly, the basic setting Enable Cache must be set in the calculation view to be cached.

Then, the query must be able to be fully unfolded. This means that the entire query must be translatable into a plain SQL statement without requiring the calculation of interim results. You can use the Explain Plan feature of the SQL Analyzer to check whether unfolding can occur.
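
You can also run EXPLAIN PLAN directly in the SQL console. The sketch below assumes a hypothetical view name and reads the standard EXPLAIN_PLAN_TABLE view for the result:

Code snippet
-- Generate the plan for a query on a hypothetical calculation view
EXPLAIN PLAN SET STATEMENT_NAME = 'check_unfolding' FOR
SELECT "PRODUCT", SUM("AMOUNT") FROM "CV_SALES" GROUP BY "PRODUCT";

-- Inspect the operators and data sources; a fully unfolded query
-- reads from tables only, not from interim column views
SELECT OPERATOR_NAME, TABLE_NAME
FROM EXPLAIN_PLAN_TABLE
WHERE STATEMENT_NAME = 'check_unfolding';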

There must be no calculations that depend on a specific level of calculated granularity as the cache is stored at the level of granularity requested by the first query. This may not align to the aggregation level required by subsequent queries and could cause incorrect results.

Even with these prerequisites met, the cache feature is disabled by default and is only considered when it is explicitly invoked with a RESULT_CACHE hint in the query.
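
As a sketch (the view name is hypothetical), the hint is appended to the query that consumes the calculation view; check the hint reference of your SAP HANA Cloud release for the exact RESULT_CACHE hint variants available:

Code snippet
-- Explicitly request use of the calculation view cache for this query
SELECT "PRODUCT", SUM("AMOUNT") AS "AMOUNT"
FROM "CV_SALES_CACHED"
GROUP BY "PRODUCT"
WITH HINT (RESULT_CACHE);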

If the prerequisites for cache usage are not met, it might still be possible to force the cache to be used. You can use the setting Force so that the cache is used. However, you should test extensively to ensure that correct results are returned.

Controlling Parallelization in a Data Flow

SAP HANA Cloud always attempts to automatically apply parallelization to queries in order to optimize performance. Following good modeling practices will ensure the queries that run on your calculation views have the best chance of being parallelized. However, there are cases when the optimizer will not apply parallelization in a complex, long-running data processing step as it is not able to understand the business semantics and is concerned it might break the results if it tried to parallelize. So, instead, it cautiously leaves the processing step as sequential even though it might harm performance.

However, if you are able to ensure the step could be safely run in a parallel way by splitting up the data into semantic partitions, then you can dictate when parallelization should occur.

Within a calculation view, it is possible to force parallelization of data processing by setting a flag Partition Local Execution to mark the start and also the end of the section of a data flow where parallelization is required. The reason you do this is to improve the performance of a calculation view by generating multiple processing threads at specific positions in the data flow where performance bottlenecks can occur.

The parallelization block begins with a calculation view node that is able to define a table as a data source. This node could be a projection node or an aggregation node. It is not possible to use any other type of data source such as a function or a view.

In the Properties of the chosen start node, the flag Partition Local Execution is set to signal that multiple threads should be generated from this point onwards. It is possible to define a source column as a partitioning value. This means that a partition is created for each distinct value of the selected column. For example, if the column chosen is COUNTRY, then a processing thread is created for each country. Of course, it makes sense to look for partitioning columns over which the data is evenly distributed. The partitioning column is optional: if you do not explicitly define one, then the partitioning rules of the underlying table are applied (assuming the table is partitioned).

To end the parallelization block, you use a union node. Unlike a regular union node, which always has at least two data sources, a union used to terminate the parallel processing block is fed from only one data source. The union node combines the multiple generated threads, but these inputs are not visible in the graphical editor, so the union node appears to have only one source from the node below. You cannot combine any other data sources in the union node that is used to terminate the parallelization. In the Properties of the union node, the flag Partition Local Execution is set to signal the end of the parallelization block.

There are some restrictions that you should be aware of.

The parallelization block always starts with a node that includes a data source that must be a table. You can use a table defined in the local container or a synonym which points to an external table. You cannot use another calculation view as the data source or a table function.

Only one parallelization block can be defined per query. This means you cannot stop and then start another block, either in the same calculation view or across calculation views in the complete stack. You also cannot nest parallelization blocks, for example, start a parallelization block and then start another one inside it.

It is possible to create multiple start nodes with different partitioning configurations, but to do this you must explicitly define the partitioning column. If you create multiple start nodes, then all the threads that were generated are combined in a single ending union node.

Note

In addition to defining logical partitions using this technique, always remember that SAP HANA will attempt to apply additional parallelization to the logical partitions.

To check the partitioning of the data you can define a calculated column within the parallelization block with the simple column engine expression partitionid(). In the resulting data set you will then see a partition number generated for each processing thread.

Other techniques to monitor parallelization:

  • You can also collect trace information by adding WITH PARAMETERS ('PLACEHOLDER' = ('$$CE_SUPPORT$$','')) to the end of the query statement (see the sketch after this list).

  • Use the SQL Analyzer to navigate to the nodes that should be parallelized; you should see that the nodes are duplicated according to the number of partitions.
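
For illustration (the view name is hypothetical), the trace placeholder from the first bullet is simply appended to the query that consumes the calculation view:

Code snippet
-- Collect calculation engine trace information for a query on a
-- hypothetical calculation view, to verify the generated partitions
SELECT "COUNTRY", SUM("AMOUNT") AS "AMOUNT"
FROM "CV_SALES_PARALLEL"
GROUP BY "COUNTRY"
WITH PARAMETERS ('PLACEHOLDER' = ('$$CE_SUPPORT$$', ''));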

Caution
It is important not to overuse this feature, as over-parallelizing can lead to poor overall performance by affecting other processes.

Best Practices for Writing SQL

One of the benefits of using graphical modeling tools is that you don't have to concern yourself with writing SQL code. However, there are many times when you do need to write the SQL code directly, such as when you are developing a function or a procedure.

Writing the SQL directly offers the most flexibility to control exactly what you want to achieve and how you want to achieve it, but you may write code that is suboptimal for SAP HANA Cloud. Even experienced SQL coders should pay attention to the following guidelines to avoid applying techniques that do not work well with SAP HANA Cloud. SAP has extended standard SQL, and misusing these extensions can potentially lead to poor performance of calculation views.

The official SAP Help documentation provides more details, but here are some of the basic guidelines:

You can reduce the complexity of SQL statements by using table variables. Table variables are used to capture the interim results of each SQL statement and can then be used as the inputs for subsequent steps. This makes understanding the code much easier and does not sacrifice performance. Using variables also means the calculated data set can be referred to multiple times within the script, thus avoiding repeating the same SQL statement. So we highly recommend that you use table variables.
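
As a minimal sketch (the function, table, and column names are hypothetical), table variables let you break a complex statement into readable steps and reuse an interim result without repeating the SQL:

Code snippet
-- Hypothetical table function illustrating table variables in SQLScript
CREATE FUNCTION "TF_TOP_CUSTOMERS" ()
RETURNS TABLE ("CUSTOMER_ID" NVARCHAR(10), "TOTAL_AMOUNT" DECIMAL(15,2))
LANGUAGE SQLSCRIPT SQL SECURITY INVOKER AS
BEGIN
	-- Capture an interim result in a table variable
	lt_sales = SELECT "CUSTOMER_ID", SUM("AMOUNT") AS "TOTAL_AMOUNT"
	           FROM "SALES"
	           GROUP BY "CUSTOMER_ID";

	-- Reuse the table variable in a subsequent step
	RETURN SELECT "CUSTOMER_ID", "TOTAL_AMOUNT"
	       FROM :lt_sales
	       WHERE "TOTAL_AMOUNT" > 10000;
END;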

You can reduce dependencies by ensuring that your SQL steps are independent from each other. When you wrap SQL expressions in looping constructs or use IF/THEN/ELSE expressions to determine the flow, you inhibit the ability of the SQL optimizer to produce the best plan, because you have declared a specific flow that the optimizer cannot break. Using expressions that control the logic also makes it harder to create a plan that can be parallelized and therefore achieve the best performance. So, always consider whether you can achieve the same results using only declarative language, and avoid controlling the flow with imperative constructs such as loops and IF/THEN/ELSE.

Cursors are a powerful feature of SQLScript and can be useful when you need to read a table one record at a time.

However, avoid cursors in SQLScript that will be consumed by calculation views. Think about writing the code using declarative language instead for maximum performance.
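
As a sketch (the table and column names are hypothetical), the result a cursor loop would build up row by row can usually be expressed as a single declarative statement that the optimizer can parallelize:

Code snippet
-- Instead of a cursor that loops over ORDERS and accumulates AMOUNT
-- per CUSTOMER_ID row by row, use one declarative statement:
SELECT "CUSTOMER_ID", SUM("AMOUNT") AS "TOTAL_AMOUNT"
FROM "ORDERS"
GROUP BY "CUSTOMER_ID";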

Union Pruning

What is Union Pruning?

There are opportunities to significantly improve the performance of unions by implementing union pruning. You can avoid accessing entire data sources that are not required in a union node by defining explicit or implicit pruning rules.

Explicit Pruning

For each data source in a union node, you can define an additional column and fill it with a constant value to explicitly tag the meaning of the data source. For example, plan or actual. If a query filter does not match the constant value for a specific source of a union, then the source is ignored. This is called explicit pruning. Explicit pruning helps performance by avoiding access to data sources that are not needed by the query.

In the example in the figure, Union Pruning, a new column is defined in the union node's Manage Mapping pane, and this column is named temperature. For each of the data sources a value is assigned, in this case either old or new. This is a simple way to filter out union sources where the rules are simple and rarely change. Although a union node usually has just one constant column, you can create and map multiple constant columns and use AND/OR conditions in your query. A key point, however, is that if you do not explicitly refer to the constant column in the sending query, then the data source is always included in the union and is not pruned.

Explicit pruning is a very simple and direct way to implement union pruning but it requires maintenance of the constant value in each calculation view union node.
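
For illustration (the view and column names follow the plan/actual example above and are otherwise hypothetical), a query only benefits from explicit pruning if it filters on the constant column:

Code snippet
-- The filter on the constant column lets the optimizer prune the
-- union source that carries plan data, so only actuals are read
SELECT "REGION", SUM("AMOUNT") AS "AMOUNT"
FROM "CV_SALES_UNION"
WHERE "SOURCE_TYPE" = 'ACTUAL'
GROUP BY "REGION";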

Implicit Pruning

Implicit pruning is implemented by defining one or more pruning rules in a dedicated table called the Pruning Configuration Table. This pruning configuration table allows you to describe in much more detail the rules that determine when each source of a union is valid. This implicit pruning approach does not need to provide constant values in the union node for each data source inside the calculation view. With a union pruning configuration table, you can define multiple single values per column that are treated with ‘OR’ logic. You can also define values across multiple columns in a rule. Values across columns are treated with ‘AND’ logic. This gives much more flexibility over explicit pruning where you can only enter one value per data source. It also means you can easily update the pruning rules as these are simply records in a table.

The first time you create a pruning configuration table, you will need to provide a table name. You then provide the pruning rules using various operators:

  • The possible operators are =, <, >, <=, >= and BETWEEN.

  • The BETWEEN operator refers to a closed interval (including the LOW and HIGH values).

  • If several pruning conditions exist for the same column of a view, they are combined with OR.

  • On different columns, the resulting conditions are combined with AND.

The conditions in a Pruning Configuration Table can be defined on columns of the following data types:

  • INT
  • BIGINT
  • VARCHAR
  • NVARCHAR
  • DATE

For example, you could define that a source of a union should only be accessed if the query requests (YEAR = 2008 OR YEAR = 2010 OR YEAR BETWEEN 2012 AND 2014) AND (STATUS = 'A' OR STATUS = 'X').

This means that if the requested data is 2008 and status ‘C’, then the source is ignored. If the request is for 2013 and status ‘A’, then the source is valid.
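
As a sketch of how such rules could be stored (the table name and column layout shown here are purely illustrative; the actual structure is created for you when you define the pruning configuration table in the modeler), the YEAR/STATUS example above might translate into rows like these:

Code snippet
-- Illustrative rule rows: one row per condition, combined with OR within
-- a column and with AND across columns
INSERT INTO "PRUNING_CONFIG" VALUES ('CV_SALES_UNION', 'ACTUALS', 'YEAR',   '=',       '2008', NULL);
INSERT INTO "PRUNING_CONFIG" VALUES ('CV_SALES_UNION', 'ACTUALS', 'YEAR',   '=',       '2010', NULL);
INSERT INTO "PRUNING_CONFIG" VALUES ('CV_SALES_UNION', 'ACTUALS', 'YEAR',   'BETWEEN', '2012', '2014');
INSERT INTO "PRUNING_CONFIG" VALUES ('CV_SALES_UNION', 'ACTUALS', 'STATUS', '=',       'A',    NULL);
INSERT INTO "PRUNING_CONFIG" VALUES ('CV_SALES_UNION', 'ACTUALS', 'STATUS', '=',       'X',    NULL);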

We can use input parameters to direct the query to use only the union sources that contain the data required.

Whichever technique you choose, you are providing the optimizer with important information that it uses when making run time decisions regarding the union sources. This ensures the best possible performance by avoiding processing sources that are not needed.

Implement Union Pruning

Optimize Union Pruning

Watch this video to learn about optimizing Union Pruning.
