Describing Data Provisioning

Objectives

After completing this lesson, you will be able to:

  • Explain the key concepts of data provisioning

Data Provisioning

Introduction

Data provisioning is a broad term that refers to the acquisition of data from a source system by a target system. The word acquisition is preferred over data loading because data can be acquired without the need to physically load it into a target system. In fact, with advances in technology, moving data around an organization is becoming less common: it's often much simpler to read the data remotely.

There are many reasons why data provisioning is needed. These include:

  • Extract data from business applications and load it into a central data warehouse
  • Provide real-time access to data sources for analytics
  • Distribute data from a central system to regional systems
  • Consolidate data from multiple systems into a central system
  • Keep systems in sync
  • Migrate data from a legacy system to a new system

In the simplest data provisioning scenarios, there are only two systems involved: the source and the target. But often there are more. For example, you might want to combine data from multiple source systems into a single target system. It could also go the other way: a single source system that distributes its data to multiple target systems. And, finally, we can even have a combination of both: multiple source systems consolidating data and distributing it to multiple target systems.

Application or Database Control of Data Provisioning

Data provisioning can be controlled by standalone, specialized applications or by the in-built tooling of a database. Let's consider each approach.

Application-controlled data provisioning is when a dedicated application controls the data flow. These applications provide tools to connect to data sources and data targets and to define data-flow rules that determine how data moves between systems. Examples of dedicated data provisioning applications include SAP Data Services, SAP Landscape Transformation, and SAP Datasphere.

With this application-controlled approach, the application extracts data from a source database and loads it into a target database. The extraction rules, flow logic, and loading methods are managed by the application. Think of the data provisioning application as the orchestrator of data movement between systems. In some cases, the application extracts the source data and stores it temporarily before sending it on to the target system. This staging area is often needed when multiple data sources must be combined and synchronized, because the data might arrive at different times.

A key reason for using a dedicated data provisioning application is that you're working with multiple data sources that use different technologies or come from different vendors. These dedicated applications can usually process data from many kinds of sources, for example, databases, CSV files, JSON files, and web services. Some can even connect to business applications directly: SAP BW/4HANA, for example, extracts from SAP S/4HANA at the application level and not from the database tables. In this case, the data flow logic is built at a level higher than the physical storage technology.

Now let's look at database-controlled data provisioning.

The basic requirement is that the database provides the data provisioning tools. The simplest type of data provisioning tool could be an export and import tool to move data from one database to another. But some databases, including SAP HANA, provide sophisticated tools to handle complex data provisioning scenarios such as those that require combining data, validating data, and enriching data. Using the in-built tooling of a database to manage data provisioning means you don’t have to implement separate data provisioning applications, as we described earlier. This approach supports a simpler landscape.

Working with database-provided tools is what we mean by database-controlled data provisioning. The data flow is controlled using tools that are part of the database.

This course covers the built-in data provisioning tooling of SAP HANA on-premise and SAP HANA Cloud.

Capturing Data Changes

Consider the following scenario: You've implemented a reporting application that runs on an SAP HANA database. The reporting application is used by management to analyze company spend. You decide to load the purchase order transactions daily from your procurement application to the SAP HANA target database. You begin by extracting the entire set of purchase order records from the procurement application and loading these to the SAP HANA target database. The source data and the target data are now in sync. But only for a while: the next day, new procurement records are added to the source database, some existing records are modified, and some records are deleted.

So, what should we load next time from the source database to ensure that the source and target databases are kept in sync? The entire data set again, or just the changes?

Of course, you could delete all the data in the target database and reload all the records. This would mean you're back in sync. But this approach would be rather wasteful, especially if only one or two records had changed in a very large data set. Imagine if no records had changed. Deleting the contents of tables and then reloading them isn't very smart and is a waste of system resources. Plus, it's risky to delete entire data sets because if something were to go wrong in the reload, you might end up with no data.

A more efficient approach is to capture only the changes to the records and update the target database with just those changes. Capturing the changes to records is often referred to as a delta load. It's also called change data capture (CDC).
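To make the difference concrete, here's a minimal sketch in plain Python; the table layout, keys, and change-set format are invented for illustration and don't reflect any particular product:

```python
# Minimal sketch: full reload vs. delta load for a purchase order table.
# The record layout and change-set format are invented for illustration.

target = {
    "PO-1001": {"vendor": "A", "amount": 500.0},
    "PO-1002": {"vendor": "B", "amount": 120.0},
}

def full_reload(target, source_records):
    """Wasteful approach: drop everything and copy the whole source again."""
    target.clear()
    target.update(source_records)

def apply_delta(target, changes):
    """Efficient approach: apply only the captured changes (the delta)."""
    for op, key, record in changes:
        if op in ("insert", "update"):
            target[key] = record      # upsert the new image of the record
        elif op == "delete":
            target.pop(key, None)     # remove the deleted record

# One day's worth of changes captured in the source system
changes = [
    ("update", "PO-1002", {"vendor": "B", "amount": 150.0}),
    ("insert", "PO-1003", {"vendor": "C", "amount": 75.0}),
    ("delete", "PO-1001", None),
]

apply_delta(target, changes)
print(target)  # PO-1001 gone, PO-1002 updated, PO-1003 added
```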

So, how do we identify the records that have changed in the source application?

Some data sources provide in-built change data capture (CDC) capabilities. This means that the responsibility to identify the changed records lies with the source system. Your data extraction mechanism simply requests the delta data from the change data capture mechanism instead of a full load. But not all sources provide a CDC capability, so you need to find another way to pick out the records that changed from those that didn't.

There are various techniques available including:

  • Read the source system's redo log (if it has one). This is where changes to the database are collected.
  • Set up triggers on tables in the source system that fire when a record is changed, so that it can be sent to the target system.
  • Provide a field in the source record, such as a time-stamp, that captures the time of the record change, so we can identify the records that have recently changed and filter out the ones we don't want. Many applications already include such a field. (This technique is sketched in the example after this list.)
  • Use data comparison tools that are often provided by data provisioning software to compare the source and target tables. The tool can then isolate the differences so that only those records are updated to the target system.
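As an illustration of the time-stamp technique from the list above, here's a short sketch; the records, the changed_at field, and the last_run bookmark are all assumptions made for the example, not features of any specific product:

```python
from datetime import datetime, timezone

# Sketch of time-stamp-based change detection. The records and the
# "changed_at" field are invented; a real system would query a database.
source_records = [
    {"id": 1, "amount": 100.0, "changed_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": 2, "amount": 250.0, "changed_at": datetime(2024, 5, 3, tzinfo=timezone.utc)},
    {"id": 3, "amount": 75.0,  "changed_at": datetime(2024, 5, 4, tzinfo=timezone.utc)},
]

# Bookmark: when the previous extraction run finished.
last_run = datetime(2024, 5, 2, tzinfo=timezone.utc)

# Select only the records changed since the last run -- the delta.
delta = [r for r in source_records if r["changed_at"] > last_run]
print(delta)  # records 2 and 3 only

# Advance the bookmark so the next run picks up where this one stopped.
last_run = max(r["changed_at"] for r in source_records)
```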

Finally, in the context of change data capture, you should be aware of the term snapshot. This term is often used when describing data provisioning.

A snapshot is a complete capture of all data from the source system at a precise moment in time. One of the reasons we use snapshots is to preserve a data set for a period of time before it's overwritten with new data. For example, we might load a snapshot of data at the end of the month to capture the position of our orders, and then repeat the snapshot at the end of the following month. Snapshots give us a stable data set on which we can work. Sometimes we even retain the snapshots to build up a history. But if we decide to retain our snapshots, data growth has to be considered, especially if each snapshot includes your complete data set.
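The following sketch illustrates the snapshot idea in plain Python; the orders data and the month-end dates are invented:

```python
import copy
from datetime import date

# Live data that keeps changing as the business runs.
orders = [{"order": "O-1", "status": "open"}, {"order": "O-2", "status": "shipped"}]

snapshots = {}  # snapshot_date -> frozen copy of the full data set

def take_snapshot(snapshot_date, data):
    # Deep-copy so later changes to the live data don't alter the snapshot.
    snapshots[snapshot_date] = copy.deepcopy(data)

take_snapshot(date(2024, 1, 31), orders)
orders[0]["status"] = "shipped"          # live data moves on
take_snapshot(date(2024, 2, 29), orders)

# Each retained snapshot stores the complete data set, so keeping a long
# history multiplies storage: data growth has to be considered.
print(len(snapshots), "snapshots,", sum(len(s) for s in snapshots.values()), "rows stored")
```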

In this course, you’ll learn about SAP HANA's capabilities regarding change data capture.

Types of Data Provisioning

Data Virtualization

As mentioned earlier, data provisioning doesn't always mean we have to load data to a target system. Data can be read from a source system and provided to a target system at the moment the data is needed. When the target system has finished reading the data, the connection is terminated and no copy of the data is stored in the target system. This is called data virtualization. It can also be referred to as data federation, especially when you have a network of systems connected using this technique.

A key reason for considering this approach is that the data changes frequently and you want to ensure that only the latest version is read. It also fits when the source data is read only occasionally, so it doesn't make sense to make a copy of it by continually loading it into a target system.

One of the concerns with data virtualization is that performance might not be satisfactory because data has to travel across the network each time it's needed by a target application. This might not be a problem while data volumes are low, but, as we know, data volumes usually increase over time, so what might start out as a workable solution might quickly become unworkable. However, data virtualization should always be considered before making the decision to move the data to a target system.
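To contrast the two access patterns, here's a small sketch; the remote_query function merely simulates whatever federation mechanism a database would actually provide, and all names are invented:

```python
# Sketch: virtual (read-on-demand) access vs. a persisted local copy.
# "remote_source" stands in for a database reachable over the network.
remote_source = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": 20.0},
]

def remote_query(predicate):
    """Simulate reading from the source at the moment the data is needed.
    Nothing is stored locally; every call sees the source's latest state."""
    return [row for row in remote_source if predicate(row)]

# Virtualized access: always current, but every read crosses the "network".
print(remote_query(lambda r: r["price"] > 15.0))

# Copied access: fast local reads, but the copy goes stale as soon as the
# source changes -- the trade-off described above.
local_copy = list(remote_source)
remote_source.append({"id": 3, "price": 30.0})
print(len(local_copy), "rows locally vs", len(remote_source), "at the source")
```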

Data Replication

When data should be kept in sync between systems, you implement data replication. Data is copied from one system to another, usually without changes. The idea is that the target system captures data as it changes in the source system, so that applications running on the target system always use the latest data. Replication happens in real time (synchronous) or in near real time (asynchronous).

Replication is a popular option when virtualization isn't a good choice, and also when data doesn’t require transformation or aggregation before it's loaded to a target system.

When you replicate data, you need to carefully monitor the data volumes. This is because when you replicate from systems that generate a lot of data, the target system can be filled very quickly. This is especially true when the source data is very granular, because replication usually extracts the data one-to-one without filtering or aggregating records.
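Here's a minimal sketch of the asynchronous pattern; the change log, event format, and replication loop are simplified stand-ins for what a real replication tool would manage:

```python
from collections import deque

# Sketch of asynchronous (near real-time) replication: the source appends
# change events to a log; the replicator drains them in order and applies
# them one-to-one, without filtering or transforming. All names invented.
change_log = deque()
target = {}

def on_source_change(op, key, row):
    change_log.append((op, key, row))     # source side: record every change

def replicate_pending():
    while change_log:                     # target side: apply in order
        op, key, row = change_log.popleft()
        if op == "delete":
            target.pop(key, None)
        else:                             # insert or update: copy as-is
            target[key] = row

on_source_change("insert", 1, {"qty": 5})
on_source_change("update", 1, {"qty": 7})
replicate_pending()                       # runs continuously in real systems
print(target)  # {1: {'qty': 7}}
```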

Data Transformation

If you need to make changes to the extracted data before it's loaded to a target system, this is referred to as data transformation. Transformation is a very broad term. It can mean something as simple as applying filters to the data so that you load only what you need, or generating new values from source data, for example, calculating the profit from each record's sales price and cost. Transformation might also refer to filling in missing fields and validating values. With transformation you can merge data from multiple systems, and you can define rules that determine how to split data and distribute it to multiple target systems.

As with replication, data transformation takes data from one or more source systems and loads it to a target system. The key difference is that replication usually does not make changes to the data, whereas transformation does.

One of the benefits of transformation over replication is that you can apply aggregation rules to summarize data instead of loading huge numbers of individual transactions. So, data growth can be controlled. However, transformations can quickly become complex and difficult to maintain as the data landscape changes.
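Putting these ideas together, here's a sketch of a transformation step that filters, derives a profit value (as in the example above), and aggregates per region; all field names and values are invented:

```python
# Sketch of a transformation step: filter, derive a new value, aggregate.
# The sales records and field names are invented for illustration.
sales = [
    {"region": "EMEA", "price": 120.0, "cost": 90.0},
    {"region": "EMEA", "price": 80.0,  "cost": 85.0},   # loss-making record
    {"region": "APJ",  "price": 200.0, "cost": 140.0},
]

def transform(records):
    for r in records:
        if r["price"] <= 0:                      # validation: drop implausible rows
            continue
        out = dict(r)
        out["profit"] = r["price"] - r["cost"]   # derived value
        yield out

# Aggregation: summarize per region instead of loading every individual
# transaction, which keeps data growth in the target under control.
totals = {}
for row in transform(sales):
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["profit"]

print(totals)  # {'EMEA': 25.0, 'APJ': 60.0}
```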
