Objective
Launch the video below to get started.
A modern data warehouse is usually embedded in a complex environment that includes different technologies. It must be able to handle large amounts of data from numerous source systems that provide data in a wide variety of formats.
From a simplified view, there are two main areas of data that a modern data warehouse usually has to deal with:
Modern data warehouses must also be capable of provisioning data to an established Business Intelligence landscape with a wide set of tools, for example, for analysis, planning, reporting, or predictive purposes. In short, a modern data warehouse connects business data, production data, and business users. This can be achieved excellently by utilizing SAP HANA.
When designing a data warehouse, one of the biggest challenges is to choose a design that serves not only the current requirements of the business, but can also cope with the inevitable changes over time. We know that businesses don't stand still and the systems that support the business must be agile. New opportunities arise that are sometimes driven by advances in technology, such as artificial intelligence (AI), or by changes to the business that often result in additional data and even completely new systems being integrated with a well-established data warehouse. A data warehouse takes a lot of effort and cost to implement and usually becomes a mission-critical system that must not be disrupted, even when we need to make changes.
So it is essential that thorough planning and design of the data warehouse takes place before even a line of code is written. One of the goals is to design a data warehouse architecture that is flexible, so that new data can be easily integrated without disrupting existing data flows. Typically, you will look at implementing a data warehouse using layers. On the one hand, you want to keep the design as simple as possible, so you should try to implement only a few layers. On the other hand, you need to provide enough layers to properly integrate the data, so that you are able to harmonize and cleanse data from different sources, and also generate different shapes of data for consumption from the common store. Added to this challenge, bear in mind that with SAP HANA, data does not have to be copied from source systems to the data warehouse; instead, data can be virtualized. This means the layers do not have to store physical data and can be defined as logical layers where data is generated on demand.
In this course we will implement a data warehouse architecture that uses the following layers:
Let's take a look at each layer in the next section.
Data is not persisted in the Data-InHub area. The following object types are used in this area:
In the replication area, source tables are replicated 1:1 in real time and persisted permanently. The following object types are used in this area:
In the staging area, data is persisted temporarily and loaded using delta loading techniques. For data provisioning, a range of ETL tools can be used, for example: SAP HANA Smart Data Integration (SDI), SAP Data Services, other non-SAP ETL tools (such as Informatica PowerCenter or IBM InfoSphere DataStage), and SAP Extractors. The following object types are used in this area:
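The delta loading technique mentioned above can be sketched in a few lines. This is a minimal, hypothetical high-water-mark delta load; the table rows and the `change_ts` change-tracking column are illustrative assumptions, not part of any SAP tool API:

```python
from datetime import datetime

# Simulated source table rows; 'change_ts' is an assumed change-tracking column.
source = [
    {"id": 1, "amount": 100, "change_ts": datetime(2024, 1, 1)},
    {"id": 2, "amount": 200, "change_ts": datetime(2024, 1, 2)},
    {"id": 3, "amount": 300, "change_ts": datetime(2024, 1, 3)},
]

staging = []                         # temporary persistence area
last_loaded = datetime(2024, 1, 1)  # high-water mark from the previous run

def delta_load(source_rows, watermark):
    """Copy only rows changed after the watermark into staging."""
    delta = [r for r in source_rows if r["change_ts"] > watermark]
    staging.extend(delta)
    # Advance the watermark to the newest change we have seen.
    return max((r["change_ts"] for r in delta), default=watermark)

last_loaded = delta_load(source, last_loaded)
print(len(staging))  # 2 rows are newer than the 2024-01-01 watermark
```

In a real landscape, the watermark itself would be persisted between runs so that each load picks up exactly where the previous one stopped.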
In the RAW Data Warehouse layer, data is persisted as raw, unprocessed information. The data structures conform 100% to the source data structures. Data can undergo historization processes, which preserve the history of the source data. Data is enriched with surrogate keys (in connection with the Business Integrated DWH). There is no cross-source-system integration, no quality control, no business transformation, and no end user access. This data needs to be able to serve any future requirement of the business, so it must remain unchanged.
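The historization and surrogate key enrichment described above can be sketched as follows. This is only an illustrative model, assuming validity-interval columns (`valid_from`, `valid_to`) and a surrogate key column (`sid`); the real implementation depends on the tooling chosen:

```python
# Sketch of historization: each change to a source record becomes a new
# version with its own surrogate key and a validity interval.
raw_table = []
next_sid = 1  # surrogate key counter, independent of the source keys

def historize(business_key, attributes, load_date):
    """Close the current version of a record (if any) and insert a new one."""
    global next_sid
    for row in raw_table:
        if row["business_key"] == business_key and row["valid_to"] is None:
            row["valid_to"] = load_date  # close the previous version
    raw_table.append({
        "sid": next_sid,              # surrogate key
        "business_key": business_key,
        **attributes,
        "valid_from": load_date,
        "valid_to": None,             # open-ended current version
    })
    next_sid += 1

historize("CUST-1", {"city": "Berlin"}, "2024-01-01")
historize("CUST-1", {"city": "Hamburg"}, "2024-06-01")
print(len(raw_table))            # 2 versions of the same source record
print(raw_table[0]["valid_to"])  # first version was closed on 2024-06-01
```

Because old versions are only closed, never overwritten, the full history of the source data remains available to the upper layers.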
In the Business Integrated Data Warehouse, cross-application integration, harmonization, standardization, and surrogate key creation take place. Business transformation, enrichment, and calculation are performed. Quality management (such as handling bad data and dummy values, corrections, and auto-completion) takes place. Data is also historized, versioned, and modeled using the data vault modeling approach (hubs, satellites, links). There is no end user access to the Business Integrated Data Warehouse.
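To make the data vault terminology concrete, the following sketch shows the three object types in miniature: hubs hold business keys, satellites hold descriptive attributes over time, and links connect hubs. All table and column names here are illustrative assumptions, not a prescribed schema:

```python
# Hubs: one row per business key, identified by a hash key ('hk').
hub_customer = [{"hk": "h1", "business_key": "CUST-1"}]
hub_product  = [{"hk": "h2", "business_key": "PROD-9"}]

# Satellite: historized descriptive attributes attached to a hub.
sat_customer = [
    {"hk": "h1", "load_date": "2024-01-01", "city": "Berlin"},
    {"hk": "h1", "load_date": "2024-06-01", "city": "Hamburg"},
]

# Link: a relationship between two hubs (for example, a purchase).
link_purchase = [{"lk": "l1", "customer_hk": "h1", "product_hk": "h2"}]

def current_attributes(hub_key, satellite):
    """Return the most recent satellite row for a hub key."""
    rows = [r for r in satellite if r["hk"] == hub_key]
    return max(rows, key=lambda r: r["load_date"])

print(current_attributes("h1", sat_customer)["city"])  # Hamburg
```

Separating keys (hubs), relationships (links), and descriptive history (satellites) is what makes the data vault approach resilient to new sources: a new source system adds satellite or link rows without restructuring the existing tables.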
Master data and measures are transferred from lower architectural areas to analytical reusable virtual objects in the form of SAP HANA calculation views. Master data is turned into business entities. Basic measures are enriched inside fact tables. Objects include business-relevant data. There is no attribute selection. Data is projected completely, without connectors. Measures are not aggregated. There is no denormalization in order to create dimensional structures. This layer is not directly consumable by front end applications because the individual calculation views need combining to form meaningful shapes in the final layer.
The Virtual Analytical Layer follows the virtualization-first approach. Data is only persisted if generating the data for this layer on demand does not deliver satisfactory performance, or if historization requirements cannot be met with virtualization.
It is possible to run multiple instances at the same time and to persist data in the Virtual Analytical Layer through Architected Analytical Objects (for example, to counteract performance issues). Due to all the aspects mentioned above, the Virtual Analytical Layer requires a security concept.
The Data Mart layer provisions virtual analytical models in the form of SAP HANA calculation views and represents the final business perspective. This layer enables dimension cutting, fact aggregation, and information filtering, and it utilizes the star and/or snowflake schema. It is possible to run multiple instances of the Data Mart layer at the same time using different modeling approaches. It is also possible to persist data through Architected Data Marts (for example, to counteract performance issues).
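The dimension cutting, aggregation, and filtering performed by this layer can be sketched over a tiny star schema. This is a simplified stand-in for what a calculation view would do; the fact and dimension names are invented for illustration:

```python
# Dimension table: region_id -> region name.
dim_region = {1: "EMEA", 2: "APAC"}

# Fact table: one row per sale, with a foreign key into the dimension.
fact_sales = [
    {"region_id": 1, "amount": 100},
    {"region_id": 1, "amount": 150},
    {"region_id": 2, "amount": 200},
]

def sales_by_region(facts, dim, region_filter=None):
    """Aggregate fact rows by region name, optionally filtered to one region."""
    totals = {}
    for row in facts:
        region = dim[row["region_id"]]          # join fact -> dimension
        if region_filter and region != region_filter:
            continue                            # information filtering
        totals[region] = totals.get(region, 0) + row["amount"]  # aggregation
    return totals

print(sales_by_region(fact_sales, dim_region))  # {'EMEA': 250, 'APAC': 200}
```

In SAP HANA, this join-filter-aggregate pattern is expressed declaratively in a calculation view, so the engine can push the aggregation down and avoid materializing intermediate results.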
In some organizations there are experts or power users who are capable of modeling data but are not part of the IT department. They might be trained to use tools to create department-specific data models, or even write SQL code to generate data sets. This layer provides a way to enrich the data warehouse data with local data, such as spreadsheets. This area is governed by business users and not corporate IT. End users are enabled to create their own models on a self-service basis using data from local sources or the Acquisition layer, without the need for specialized IT knowledge.
The Corporate Memory area is the single place to retain a full history of all inbound data for the entire enterprise. The goal of this layer is to provide a reliable, always-available, and complete source of all data at the most granular level. Think of this layer as the service that is able to provide any raw data to the upper layers to meet current and future business requirements. In our reference model example above, we see how the Corporate Memory is the very first layer to capture data from the source systems. No changes are made to this data and it is preserved in its original format. Sometimes, Corporate Memory is not the first layer and can be positioned higher up in the modeling stack. For example, we could capture the data in Corporate Memory after changes to the data to provide a clean data set that has undergone transformation, cleansing and enrichment so it is ready to use in the final layers.
Often, however, this layer is not implemented because it might seem too similar to the RAW and BID layers and to provide no additional value. But sometimes the RAW or BID layers do not capture all data fields, or they might be emptied periodically so that the full history is not preserved.
Technically, this layer could be implemented using cheaper disk-based hardware instead of memory.
The following object types are used in this area:
Note
We will not be implementing Corporate Memory or the Agile Area in our course.