Modeling Methods

Objectives
After completing this lesson, you will be able to:

After completing this lesson, you will be able to:

• Describe the common data modeling methods used to build data warehouses

Modeling Methods

Data Components

For more information, see Hans Hultgren, Modeling the Agile Data Warehouse with Data Vault, November 2012.

Example: 3NF Model

The 3NF is a strictly defined database normalization rule. Beside 3NF, there is a number of other normal forms (such as 1NF, 2NF, Boyce-Codd normal form) that come with different formal rules. All normal forms have in common that they are supposed to reduce data redundancy and improve data integrity.

Essentially, the 3NF approach enforces the following criteria:

• Identify each set of related data with a primary key
• Eliminate repeating groups in individual tables
• Create a separate table for each set of related data
• No partial dependencies (values depend on the whole of every candidate key)
• No transitive dependencies (values depend only on candidate keys)

Example: Dimensional Model

In dimensional modeling, data is separated into facts (measures) and dimensions (context for the measures). Conventionally, facts are numeric values while dimensions are groups of hierarchies and descriptors that define the facts. For example, the product price is a fact and the product description is a dimension. The key benefit of dimensional models is that they are relatively easy to understand for business users, because the structure is divided into measurements, facts and context, and dimensions. Also, the retrieval of data from the data warehouse tends to operate very quickly.

Data Vault Method

The main idea of the Data Vault is the explicit separation of structural information (in hubs and links) and descriptive context information/attributes (in satellites). It was developed by Dan Linsted and first described in a paper in 2000. Essentially, the Data Vault is a hybrid of the 3NF and the dimensional approach. The main advantages of the Data Vault approach are:

• Flexibility of the data model in case of extensions
• Full uni-temporal historization
• Auditability of data
• No dependence between data load processes
• Strong schematization of data load processes/utilization of ETL templates
• High parallelization of data load processes

Data Vault - Hub Tables

A hub table represents a core object of the business logic and is used to store a business key, which can also consist of several keys if required. Other descriptive information is not modeled in the hub. A hub does not contain foreign keys.

Formal rules for business keys in hub tables:

• Business keys should be natural keys used by the business (for example, national insurance number).
• Business keys should stand alone and have meaning to the business.
• Business keys should never change. They should have the same semantic meaning and the same granularity.

Focusing on business keys (instead of focusing on source system surrogates) ensures that the result serves the needs of the business.

A link models the relationship between two or more hubs, in other words, the business relationship between two or more business objects. The relationship must always be unique and specific to the business rules. A relationship is always n:m.

• Relationships/associations
• Foreign keys in OLTP systems
• Hierarchies and redefinitions
• Hierarchical relationships, which can be modeled by one link and two connections to hubs: HAL (parent-child LINK) and SAL (same-as LINK)

Transactions and events are often modeled as links (could also be a hub). For example, sales order or sensor data.

The modeling solution depends on the requirements, context, and so on.

Data Vault - Satellite Tables

A satellite contains all non-key/context attributes and is connected to exactly one hub or link. Hub or link tables can have several satellite tables, organized for example by source system. They can contain in the extreme case one column only or any number of columns (and the mandatory columns). Note that in this course we work without historicized data, so, the satellites shown only represent present data.

The criteria to design satellite tables (that is, to separate data into different satellite tables) are:

• Source system
• Rate of change
• Data types (such as separate CLOBS or other lengthy textual fields)
• Attributes which change together (temporal normalization), such as address satellite
• Attributes which are queried together
• Privacy and data protection aspects