Developing the Persistent Layers of the Data Warehouse

Objective

After completing this lesson, you will be able to load data to the STAGE, RAW, and BID layers of the data warehouse.

Load the STAGE Layer

In the STAGE layer, flowgraphs move data from the source master data tables, which are located in an external schema and accessed through synonyms, to the target tables.

During the loading process, we add a load time-stamp to record the date and time each record was loaded. We also add a constant value that identifies the source system, which may be helpful later.

For this simple flow we will require the following nodes:

  • data source - this is our synonym
  • projection - to select the required columns and to generate the two new columns
  • data target - this is the target table defined using a CDS source file
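The logic of this simple stage flow can be sketched in Python. This is an illustrative sketch, not SAP HANA syntax; the column names and the source-system constant are assumptions, not taken from the project.

```python
from datetime import datetime, timezone

# Assumed constant identifying the source system; the real value is project-specific.
SOURCE_SYSTEM = "ERP_01"

def stage_rows(source_rows):
    """Enrich source rows with the two generated audit columns:
    a load time-stamp and a constant source-system identifier."""
    load_ts = datetime.now(timezone.utc)
    return [
        {**row, "LOAD_TS": load_ts, "SOURCE_SYSTEM": SOURCE_SYSTEM}
        for row in source_rows
    ]

# Example: two material master records read via the synonym.
staged = stage_rows([{"MATERIAL": "M-01"}, {"MATERIAL": "M-02"}])
```

In the actual flowgraph, the same two columns are generated in the projection node before the rows reach the data target.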

Launch the video to explore the synonyms and tables that we will use in our data flow.

Launch the video to watch how we create our first flowgraph to load the material master staging table.

Launch the video to watch how we execute the remaining flowgraphs using a procedure.

Load the BID Layer

Load the HUB tables

Once we have loaded the master data to the stage tables in the data warehouse, we can then use the stage tables to load the various tables of the data vault model.

The first table we load is the hub table. For each business key, we generate a surrogate key and store both of these in the hub table. Surrogate keys play an important role in a data warehouse, especially to support star schemas.

We define flowgraphs to load the hub tables in the Business Integrated Data Warehouse layer. Each flowgraph checks whether the business key already exists; if it does not, the record is inserted into the hub table. The surrogate keys are created in the Data Target node of each flowgraph, using a predefined sequence database object.

Sequence objects define rules for incrementing key attributes. In SAP HANA, a sequence is defined using a .hdbsequence source file. Sequences are stored as dedicated database objects and can be called from flowgraph Data Target nodes or from procedures. Each call to a sequence returns a number. The sequence source file defines how that number is generated: typically the last number generated is determined, for example by reading the highest value of a numeric key column in a table, and an increment, usually one, is added to it to produce the next number.
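The behavior described above can be sketched in Python as a counter that restarts from the highest key already stored. This is a conceptual illustration only, not .hdbsequence syntax:

```python
import itertools

def make_sequence(existing_keys, increment=1):
    """Return an iterator of surrogate keys that continues from the
    highest key already present (here a plain list stands in for a
    numeric key column of a table)."""
    start = max(existing_keys, default=0) + increment
    return itertools.count(start, increment)

# Continue from keys 10, 11, 12 already in the "table":
seq = make_sequence([10, 11, 12])
first, second = next(seq), next(seq)

# An empty table starts the sequence at 1:
fresh = make_sequence([])
```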

The nodes in our flowgraph to load the hub tables are as follows:

  • data source - our stage table that has already been loaded and includes the extra generated columns
  • projection - ensure that only distinct (non-duplicated) rows are passed on
  • lookup - read the target hub table to see whether the record already exists; if it does, write the business key to a new column
  • projection - apply a filter that checks the new column; if it is empty (IS NULL), the record is new and should be kept
  • data target - update the hub table with the new records and generate a surrogate key using a sequence
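The hub-load steps above can be sketched as follows. This is a minimal Python sketch under the assumption that the hub table is represented as a mapping from business key to surrogate key; names are illustrative, not from the project.

```python
import itertools

def load_hub(stage_rows, hub, next_key):
    """Insert stage rows whose business key is not yet in the hub.

    hub      -- dict mapping business key -> surrogate key (the hub table)
    next_key -- iterator yielding new surrogate keys (the sequence)
    """
    seen = set()
    for row in stage_rows:
        bk = row["BUSINESS_KEY"]
        # projection + lookup + filter: skip duplicates and existing keys
        if bk in seen or bk in hub:
            continue
        seen.add(bk)
        # data target: insert the record with a new surrogate key
        hub[bk] = next(next_key)
    return hub

# Example: C-100 already exists; C-200 appears twice in the stage data.
hub = {"C-100": 1}
seq = itertools.count(2)
load_hub(
    [{"BUSINESS_KEY": "C-100"},
     {"BUSINESS_KEY": "C-200"},
     {"BUSINESS_KEY": "C-200"}],
    hub, seq,
)
```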

Launch the video to watch how we load one of the hub tables using a flowgraph:

There is a separate flowgraph defined in our project to load each hub table. We already executed these flowgraphs so all the hub tables are now loaded.

Load the SATELLITE tables

Once the hub tables are loaded we can then load the satellite tables.

Flowgraphs for the satellite tables in the Business Integrated Data Warehouse load data from both the staged source tables and the hub tables. Due to this dependency, the hub table flowgraphs must always be executed before the satellite table flowgraphs. The stage and hub tables are then joined so that each output record carries both the business key and the surrogate key (as you may recall from the hub table flowgraph example, surrogate keys are created in the data targets of the hub flowgraphs). Afterward, a table comparison checks whether the rows to be added already exist before the data is written to the target table.

The nodes in our flowgraph to load the satellite tables are as follows:

  • data source - our stage table that has already been loaded and includes the extra generated columns
  • projection - ensure that only distinct (non-duplicated) rows are passed on
  • data source - provide the surrogate key needed in the join
  • join - for each input record join to the hub table to fetch the surrogate key and add it to the output
  • table comparison - generate the op code to determine if we need to insert or update the target table
  • data target - update the satellite table with the records using upsert
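The join-then-upsert pattern above can be sketched in Python as follows. This is an illustrative sketch, assuming the hub is a business-key-to-surrogate-key mapping and the satellite is keyed by surrogate key; the CITY attribute is a hypothetical example column.

```python
def load_satellite(stage_rows, hub, satellite):
    """Join each stage row to the hub to fetch its surrogate key,
    then upsert the enriched record into the satellite table."""
    for row in stage_rows:
        # join: fetch the surrogate key for this business key
        sk = hub.get(row["BUSINESS_KEY"])
        if sk is None:
            continue  # no hub entry, so the record cannot be loaded
        record = {**row, "SURROGATE_KEY": sk}
        # table comparison + upsert: new rows are inserted,
        # existing rows are overwritten with the latest attributes
        satellite[sk] = record
    return satellite

# Example: only C-100 has a hub entry, so only it reaches the satellite.
hub = {"C-100": 1}
satellite = {}
load_satellite(
    [{"BUSINESS_KEY": "C-100", "CITY": "Berlin"},
     {"BUSINESS_KEY": "C-999", "CITY": "Oslo"}],
    hub, satellite,
)
```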

Launch the video to watch how we load one of the satellite tables using a flowgraph:

There is a separate flowgraph defined in our project to load each satellite table. We already executed these flowgraphs so all the satellite tables are now loaded.

Load the LINK tables

Once the satellite tables are loaded we can then load the link tables.

As with satellite and hub table flowgraphs, link table flowgraphs also load the data from the staged source tables. Then a lookup is performed to check if the data to be added already exists. Afterward, the loaded data is joined with the already loaded hub tables. Note that due to this dependency, hub table flowgraphs in the reference architecture always need to be executed before the link table flowgraphs. Once the table is joined, data is delivered to the link table.

The nodes in our flowgraph to load the link tables are as follows:

  • data source - our stage table that has already been loaded and includes the extra generated columns
  • projection - ensure that only distinct (non-duplicated) rows are passed on
  • lookup - check if the record already exists in the target table, if it does write the key value to a new column 'lookup'
  • projection - apply a filter to eliminate records that already exist in the target table by checking if the column 'lookup' is null
  • data source - provide the surrogate key from the region hub table that is needed in the join
  • data source - provide the surrogate key from the country hub table that is needed in the join
  • join - for each input record join to the hub tables to fetch the surrogate keys
  • data target - update the link table with the new records
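The link-load steps above can be sketched in Python like this. It is a minimal sketch assuming the link table is keyed by the region/country business-key pair; all names are illustrative.

```python
def load_link(stage_rows, region_hub, country_hub, link):
    """Insert region/country surrogate-key pairs that are not yet
    in the link table."""
    for row in stage_rows:
        pair = (row["REGION"], row["COUNTRY"])
        # lookup + filter: skip records that already exist in the target
        if pair in link:
            continue
        # join to both hub tables to fetch the surrogate keys
        region_sk = region_hub.get(row["REGION"])
        country_sk = country_hub.get(row["COUNTRY"])
        if region_sk is not None and country_sk is not None:
            # data target: insert the new link record
            link[pair] = (region_sk, country_sk)
    return link

# Example: the same pair appears twice but is inserted only once.
region_hub = {"EU": 10}
country_hub = {"DE": 20}
link = {}
load_link(
    [{"REGION": "EU", "COUNTRY": "DE"},
     {"REGION": "EU", "COUNTRY": "DE"}],
    region_hub, country_hub, link,
)
```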

Launch the video to watch how we load one of the link tables using a flowgraph:

There is a separate flowgraph defined in our project to load each link table. We already executed these flowgraphs so all the link tables are now loaded.

Load the RAW Layer

Now that we have loaded the master data tables that form the data vault model, we can load the transaction tables.

We manage the transaction tables in the RAW layer.

The transaction tables contain some useful attributes that describe the measures, but the majority of attributes will come from the master data tables. This means we need to add a key to each transaction record to link to the corresponding master data record.

Flowgraphs in the RAW Data Warehouse also load the tables via synonyms. In our example there is no need to take the transaction tables through a staging layer, as we did for the master data tables, because the transaction tables are already complete.

To integrate the transaction data into the model, we need to include the surrogate keys from each of the hub tables so that we can join the master data to each transaction record. In our example we will include only the CUSTOMER, PRODUCT, and STORE master data. Additionally, new columns with the load date time-stamp and the source are added to the transaction tables to provide useful audit data.

The nodes in our flowgraph to load one of the transaction tables are as follows:

  • data source - this node points to our synonym which in turn points to the transaction table (actuals) in the external schema
  • projection - ensure that only distinct (non-duplicated) rows are passed on
  • data source - this is the customer hub table that provides the surrogate key
  • data source - this is the product hub table that provides the surrogate key
  • data source - this is the store hub table that provides the surrogate key
  • join - fetch the surrogate key from each hub table and add all three of them to the transaction record - these keys will be used later to join to the master data tables
  • table comparison - ensure the record does not already exist in the target table
  • data target - update the transaction table with the new records
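The transaction-load steps above can be sketched as follows. This is an illustrative Python sketch: the three hubs are assumed to be business-key-to-surrogate-key mappings, and DOC_ID is a hypothetical document identifier used to detect already-loaded rows.

```python
def load_transactions(source_rows, customer_hub, product_hub, store_hub, target):
    """Add the three surrogate keys to each transaction record and
    insert only records not already present in the target table."""
    for row in source_rows:
        enriched = {
            **row,
            "CUSTOMER_SK": customer_hub[row["CUSTOMER"]],  # join to customer hub
            "PRODUCT_SK": product_hub[row["PRODUCT"]],     # join to product hub
            "STORE_SK": store_hub[row["STORE"]],           # join to store hub
        }
        key = (row["CUSTOMER"], row["PRODUCT"], row["STORE"], row["DOC_ID"])
        # table comparison: skip rows that already exist in the target
        if key not in target:
            target[key] = enriched  # data target: insert the new record
    return target

# Example: the same transaction arrives twice but is inserted once.
customer_hub, product_hub, store_hub = {"C1": 1}, {"P1": 2}, {"S1": 3}
target = {}
row = {"CUSTOMER": "C1", "PRODUCT": "P1", "STORE": "S1",
       "DOC_ID": "D1", "AMOUNT": 100}
load_transactions([row, row], customer_hub, product_hub, store_hub, target)
```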

Launch the video to watch how we load one of the transaction tables using a flowgraph:

We have already loaded the plan table CE2IDEA using a similar flowgraph.
