Choosing a Reference Data Warehouse Architecture

Objective

After completing this lesson, you will be able to create a reference data warehouse architecture

Introduction

Launch the video below to get started.

Reference Data Warehouse Architecture

A modern data warehouse is usually embedded in a complex environment that includes different technologies. The data warehouse must be able to handle large amounts of data from many different source systems that provide data in many different formats.

From a simplified view, there are two main areas of data that a modern data warehouse usually has to deal with

  • Commercially-oriented data, such as Sales and Distribution (SD), Financial Accounting & Controlling (FI / CO), Production Planning (PP), and Materials Management (MM), This data often comes from ERP systems such as SAP S/4HANA.
  • Production-oriented data, such as machine/IoT data, delivered through a manufacturing execution system or big data processing frameworks (such as Hadoop). In some cases this data can also come from customers of the manufacturing company or other third parties.

Modern data warehouses must also be capable to provision the data to an established Business Intelligence with a wide set of different tools, for example, for analysis, planning, reporting, or predictive purposes. In short, a modern data warehouse connects business data, production data, and business users. This can be done excellently by utilizing SAP HANA

When designing a data warehouse on of the biggest challenges is to choose a design that serves not only the current requirements of the business, but can also cope with the inevitable changes over time. We know that businesses don't stand still and the systems that support the business must be agile. New opportunities arise that are sometimes driven by advances in technology, such as artificial intelligence (AI), or changes to the business that often result in additional data and even complete new systems being integrated with a well-established data warehouse. A data warehouse takes a lot of effort and cost to implement and is usually becomes a mission critical system that must not be disrupted, even when we need to make changes.

So it is essential that thorough planning and design of the data warehouse takes place before even a line of code is written. One of the goals is to design a data warehouse architecture that is flexible so that new data can be easily integrated without disrupting existing data flows. Typically, you will look at implement a data warehouse using layers. One the one hand you want to keep the design as simple as possible, so you should try to implement only a few layers. But then you also need to provide enough layers to properly integrate the data so that you are able to harmonize and cleanse data from different sources, and also generate different shapes of data for consumption from the common store. Added to this challenge, you need to bear in mind that with SAP HANA, data does not have to be copied from source systems to the data warehouse and instead data can be virtualized. So this means the layers do not have to store physical data and can be defined as logical layers where data is generated on-demand.

Example Data Warehouse Architecture

In this course we will implement a data warehouse architecture that uses the following layers:

  • Acquisition Layer, which consists of the Data-InHub, Replication Area, and Staging Area
  • RAW DWH
  • Business Integrated DWH
  • Virtual Analytical Layer
  • Data Mart
  • Agile Area

Let's take a look at each layer in the next section.

Data-InHub

Data is not persisted in the Data-InHub area. The following object types are used in this area:

  • Source objects: SAP HANA objects (CDS, CalcViews, ABAP objects, etc.), virtual access to SAP extractors (ERP, BW, etc.), database tables (DB2, Oracle, etc.) , Flat files (XLSX, CSV, etc.), others (for more information, see SAP the EIM guide)
  • Target objects: Virtual tables in SAP HANA

Replication Area

In the replication area, source tables are replicated 1:1 in real time and persisted permanently. The following object types are used in this area:

  • Source objects: Any database tables including SAP HANA, DB2, Oracle, etc.
  • Target objects: SAP HANA tables, replication tasks

Staging Area

In the staging area, data is persisted temporarily and loaded using the delta loading techniques. For data provisioning, a range of ETL tools can be used, for example: SAP HANA Smart Data Integration (SDI), SAP Data Services, other non-SAP ETL tools (such as Informatica Powercenter, IBM InfoSphere Data Stage), and SAP Extractors. The following object types are used in this area:

  • Source objects: Any sources
  • Target objects: SAP HANA tables

RAW Data Warehouse

In the RAW Data Warehouse layer, data is persisted as raw, unprocessed information. The data structures conform 100% with the source data structures. Data can undergo historization processes which represent the history of the source data. Data is enriched with surrogate keys (in connection with the Business Integrated DWH). There is no source system overlapping integration, no quality control, no business transformation, and no end user access. This data needs to be able to serve any future requirement of the business so it must remain unchanged.

  • Source objects: Data from the Acquisition layer
  • Target objects: SAP HANA tables

Business Integrated Data Warehouse

In the Business Integrated Data Warehouse, cross-application integration, harmonization, standardization, and surrogate key creation take place. Business transformation, enrichment, and calculation are performed. Quality management (such as bad data, dummy values, correction, and auto-completion) takes place. Data is also historized, versioned, and modeled utilizing the data vault modeling approach (hubs, satellites, links). There is no end user access to the Business Integrated Data Warehouse.

  • Source objects: Usually data from the RAW layer, but for some data sources, can also come directly from the Acquisition layer if RAW layer is not needed
  • Target objects: SAP HANA tables

Virtual Analytical Layer

Master data and measures are transferred from lower architectural areas to analytical reusable virtual objects in the form of SAP HANA calculation views. Master data is turned into business entities. Basic measures are enriched inside fact tables. Objects include business-relevant data. There is no attribute selection. Data is projected completely, without connectors. Measures are not aggregated. There is no denormalization in order to create dimensional structures. This layer is not directly consumable by front end applications because the individual calculation views need combining to form meaningful shapes in the final layer.

The Virtual Analytical Layer follows the virtualization-first approach. Only persist data if performance is not satisfactory by generating the data for this layer on-demand. Or if historization requirements cannot be met with virtualization.

It is possible to run multiple instances at the same time and to persist data in the Virtual Analytical Layer through Architected Analytical Objects (for example, to counteract performance issues). Due to all the aspects mentioned above, the Virtual Analytical Layer requires a security concept.

  • Source objects: Usually data from the Business Integrated layer, in some cases the RAW layer, but even possible to source data directly form the Acquisition layer if this keeps things simple. Quite often, a combination of all these data sources is required.
  • Target objects: SAP HANA calculation views

Data Mart

The Data Mart layer provisions virtual analytical models in the form of SAP HANA calculation views and represents the final business perspective. This layer enables dimension cutting, fact aggregation, and information filtering and it utilizes the star and/or snowflake schema. It is possible to run multiple instances of the Data Mart layer at the same time using different modelling approaches. It is also possible to persist data through Architected Data Marts (for example to counteract performance issues).

  • Source objects: Data from the Virtual Analytical Layer
  • Target objects: SAP HANA calculation views

Agile Area

In some organizations there are experts / power users who are capable of modeling data but are not part or the IT department. They might be trained to use tools to create department-specific data models, or even write SQL code to generate data sets. This layer provides a way to enrich the data warehouse data with local data, such as spreadsheets. This area is governed by business users and not corporate IT. End users are enabled to create their own models on a self-service basis using data from local sources or the Acquisition layer, without the need of specialized IT knowledge.

  • Source objects: Data from any local sources (e.g. CSV, Excel) and the Acquisition Layer.
  • Target objects: SAP HANA calculation views

Corporate Memory

The Corporate Memory area is the single place to retain a full history of all inbound data for the entire enterprise. The goal of this layer is to provide a reliable, always-available, and complete source of all data at the most granular level. Think of this layer as the service that is able to provide any raw data to the upper layers to meet current and future business requirements. In our reference model example above, we see how the Corporate Memory is the very first layer to capture data from the source systems. No changes are made to this data and it is preserved in its original format. Sometimes, Corporate Memory is not the first layer and can be positioned higher up in the modeling stack. For example, we could capture the data in Corporate Memory after changes to the data to provide a clean data set that has undergone transformation, cleansing and enrichment so it is ready to use in the final layers.

But often, this layer is not implemented because it might seem too similar to the RAW and BID layers and provides no additional value. But sometimes the RAW or BID layers do not capture all data fields, or the RAW and BID layers might be emptied periodically so the full history is not preserved.

Technically, this layer could be implemented using cheaper disk-based hardware instead of memory.

The following object types are used in this area:

  • Source objects: SAP HANA tables
  • Target objects: SAP HANA tables

Note

We will not be implementing Corporate Memory or the Agile Area in our course.

Log in to track your progress & complete quizzes