Developing a Data Management Architecture

Objectives
After completing this lesson, you will be able to:

  • Implement a data management architecture

Partitioning Tables

Table Partitioning

Data in column store tables is stored as individual columns to support high query performance (queries read only the columns they need) and better memory management (only the required columns are loaded into memory). It is also possible to subdivide the rows of a column table into separate blocks of data. This is called table partitioning. If column separation is a vertical subdivision, think of table partitioning as a horizontal subdivision.

Reasons for Partitioning a Column Store

Note
Partitioning tables also enables more efficient delta merge management, because merges can be performed on individual partitions.
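For example, assuming a partitioned column table named "SALES" (a hypothetical name) in which partition 2 exists, the delta merge of a single partition could be requested with SQL along these lines:

  -- Request a delta merge for partition 2 only (table name and partition number are illustrative)
  MERGE DELTA OF "SALES" PART 2;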

Table partitioning is typically used in multiple-host systems, but it can also be beneficial in single-host systems, especially for very large tables from which queries request only small sets of data. Table partitioning is usually evaluated in the wider context of overall SAP HANA data management. Modelers are often not responsible for creating and maintaining partitions, but their input is important to the decisions being made about partitioning. That is why modelers must be aware of partitioning.

Table Partitioning

Note
Be careful not to confuse table partitioning with table distribution. Table distribution is where entire tables are distributed across hosts for better load balancing. Table partitioning is where the data of a single table is distributed across partitions, in single-host or multiple-host systems.

Partitions can be defined in a CREATE COLUMN TABLE statement. You can also define partitions on an existing table using an ALTER TABLE statement, even if the table already contains data.

Specifying table partitions with SQL
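For example, assuming a hypothetical "SALES" table, hash partitioning could be specified when the table is created, or added afterwards, with statements along these lines:

  -- Define partitions as part of the table definition
  CREATE COLUMN TABLE "SALES" (
    "ORDER_ID"   INTEGER,
    "ORDER_DATE" DATE,
    "AMOUNT"     DECIMAL(15,2),
    PRIMARY KEY ("ORDER_ID")
  )
  PARTITION BY HASH ("ORDER_ID") PARTITIONS 4;

  -- Or partition an existing table, even one that already contains data
  ALTER TABLE "SALES" PARTITION BY HASH ("ORDER_ID") PARTITIONS 4;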

Distribution of Records over Partitions

Partitioning is transparent to SQL query definitions; the SQL code does not refer to partitions. However, from a modeler's perspective, a key purpose of partitioning tables is to improve query performance, so knowing how a table is partitioned is essential. With this knowledge, the modeler can carefully design calculation views, functions, or procedures to exploit the way the partitions are formed. For example, a modeler who is aware of the partitions should ensure that filters on the partitioning column are applied as early as possible, so that only the data of the relevant partitions is loaded into memory.
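For example, assuming the hypothetical "SALES" table is range-partitioned by year on "ORDER_DATE", a filter on that column lets the database prune partitions so that only the matching partition needs to be read:

  -- The filter on the partitioning column restricts the query to the 2024 partition
  SELECT "ORDER_ID", "AMOUNT"
  FROM "SALES"
  WHERE "ORDER_DATE" >= '2024-01-01' AND "ORDER_DATE" < '2025-01-01';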

Partitions are usually defined when a column-store table is created, but it is also possible to partition a column-store table that was created without partitions. You do not have to drop the table to create partitions; you can partition an existing table whether it is empty or already contains data.

In addition to creating partitions, it is also possible to do the following (a sketch of the corresponding SQL follows the list):

  • Re-partition an already partitioned table, for example, to change the partitioning specification (from hash to round-robin, and so on), to change the partitioning columns, or to increase or decrease the number of partitions.

  • Merge partitions back to one table.

  • Add/delete individual partitions.

  • Move partitions to other hosts.
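
A sketch of the corresponding statements, again using the hypothetical "SALES" table (partition numbers and the host placeholder are illustrative):

  -- Re-partition with a different specification, here by increasing the number of hash partitions
  ALTER TABLE "SALES" PARTITION BY HASH ("ORDER_ID") PARTITIONS 8;

  -- Merge all partitions back into a single, unpartitioned table
  ALTER TABLE "SALES" MERGE PARTITIONS;

  -- Add or delete individual partitions (for a range-partitioned table)
  ALTER TABLE "SALES" ADD PARTITION '2025-01-01' <= VALUES < '2026-01-01';
  ALTER TABLE "SALES" DROP PARTITION '2024-01-01' <= VALUES < '2025-01-01';

  -- Move a partition to another host
  ALTER TABLE "SALES" MOVE PARTITION 2 TO '<host>:<port>';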

You can display partition information for a table by opening the Runtime Information tab of the table definition in the Database Explorer.

Multi-Level Partitions

With multi-level partitioning, you can create additional partitions on each partition; it might help to think of this as a hierarchy of partitions. It is typically used when you first want to distribute a large table over multiple hosts (using hash partitioning for load balancing), and then apply a second level of range partitions to each of those partitions (for example, breaking the data up by year), giving you host/year partitioning. Multi-level partitioning can also be used to overcome a limitation of single-level hash and range partitioning on tables with a primary key, namely that only key columns can be used as partitioning columns. Multi-level partitioning makes it possible to partition by a column that is not part of the primary key.
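A sketch of a hash-range multi-level specification for the hypothetical "SALES" table: the first level distributes rows by the key column "ORDER_ID", and the second level splits each of those partitions by year of "ORDER_DATE", which does not need to be part of the primary key:

  CREATE COLUMN TABLE "SALES" (
    "ORDER_ID"   INTEGER,
    "ORDER_DATE" DATE,
    "AMOUNT"     DECIMAL(15,2),
    PRIMARY KEY ("ORDER_ID")
  )
  -- First level: hash partitioning on the key column (for load balancing across hosts)
  PARTITION BY HASH ("ORDER_ID") PARTITIONS 4,
  -- Second level: range partitioning by year on a non-key column
  RANGE ("ORDER_DATE")
    (PARTITION '2023-01-01' <= VALUES < '2024-01-01',
     PARTITION '2024-01-01' <= VALUES < '2025-01-01',
     PARTITION OTHERS);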

Note
You cannot define partitions on a row-store table.

Data Tiering

What is Data Tiering?

In pursuit of the best performance for analytics, modelers should be aware of all the technical features provided by SAP HANA Cloud that can help to reach this goal. Data management storage architecture can have a significant impact on the performance of analytics.

SAP HANA Cloud provides an in-memory database, and we have an opportunity to fit more data in memory than ever before. It is now possible to size hardware so that even the largest enterprise databases can be stored completely in memory. However, it does not make sense to store old, infrequently accessed data in expensive memory. Memory should not be used as a data archive; it should be reserved for active data where instant access is needed. There are better solutions for managing infrequently accessed data, at a much lower cost than memory, while still providing good read performance.

Data storage can be thought of as a pyramid, with data classified into its layers. In SAP HANA Cloud, there are four main layers.

At the top is data that requires instant access, but storage at this level comes at the highest cost. The next layer loses some of the performance, but the cost reduction can be significant. The lower layers provide access speeds that might only be acceptable in some scenarios, but their storage capacity is huge and costs are relatively low.

SAP provides technical solutions to manage data at all data storage layers. The layering of data is known as data tiering. For all four layers, there are solutions available within SAP HANA Cloud.

Data Tiering Pyramid

Considerations for a Data Modeler

Why should a data modeler care about data management?

Many organizations now own huge amounts of data and have implemented data storage solutions. Although it is unlikely that a data modeler will be involved in the technical implementation of the tiering solutions, the decisions made about how data is tiered might influence the design of their data models. For example, a modeler might suggest that data older than one year is managed in the lower layers, so that they can implement union pruning rules or filters in their calculation views to prevent access to those layers when historical data is not required by queries. This can have a big impact on performance: why trawl through the lower-performance layers looking for data that is not needed?
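As an illustration only, union node pruning in a calculation view can be driven by a pruning configuration table. The column layout below follows the pruning configuration tables described in SAP's modeling documentation, but the table name, view name, input name, and values are assumptions:

  CREATE COLUMN TABLE "PRUNING_CONFIG" (
    "CALC_SCENARIO" NVARCHAR(256),  -- name of the calculation view the rule applies to
    "INPUT"         NVARCHAR(256),  -- union input (data source) to be pruned
    "COLUMN"        NVARCHAR(256),  -- column the pruning condition is evaluated on
    "OPTION"        NVARCHAR(20),   -- comparison operator, for example EQ or BT
    "LOW_VALUE"     NVARCHAR(256),
    "HIGH_VALUE"    NVARCHAR(256)
  );

  -- Hypothetical rule: the input holding historical data is only read
  -- when the query asks for order dates up to the end of the previous year
  INSERT INTO "PRUNING_CONFIG"
  VALUES ('my.package::SALES_UNION', 'HISTORICAL_SALES', 'ORDER_DATE', 'BT', '1900-01-01', '2023-12-31');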

Well before calculation views are created, a well-thought-out data management tiering architecture is an essential foundation on which to build. Modelers are important stakeholders in the design of the data management architecture and can heavily influence the solution with their detailed knowledge of data modeling capabilities within SAP HANA Cloud.
