Using Source-Based Changed Data Capture (CDC)

Objectives

After completing this lesson, you will be able to:

  • Use source-based change data capture (CDC)

Source-Based CDC

Source–based Changed Data Capture (CDC) is the preferred method of updating data because it improves performance by extracting the fewest rows. Source–based CDC, also referred to as incremental extraction, extracts only the changed rows from the source as shown in the following figure.

Time Stamp and Change Logs

To use source–based CDC, your source data must have some indication of the change. The indication can be a time stamp or a log file change.

  • Time Stamps:

    Use the time stamps in your source data to determine what rows have been added or changed since the last time data was extracted from the source. Your database tables must have at least an update time stamp to support this type of source-based CDC. Include a create time stamp also for optimal results.

  • Change Logs:

    Use the information captured by the RDBMS in the log files for the audit trail to determine what data has been changed.

Time-Based CDC

Some systems have time stamps with dates and times, some with just the dates, and some with monotonically–generated increasing numbers. You can treat dates and generated numbers in the same manner.

Time zones are important for time stamps based on real time. You can keep track of time stamps using the nomenclature of the source system and treat both temporal and logical time stamps in the same way.

Use Cases for Time-Based CDC

Time stamp–based CDC is an ideal solution to track changes if:

  • There are date and time fields in the tables being updated.
  • You are updating a large table that has a small percentage of changes between extracts and an index on the date and time fields.
  • You are not concerned about capturing intermediate results of each transaction between extracts, for example, if a customer changes regions twice in the same day.

Don't Use Time-Based CDC When

  • There are no time stamp columns available in the source tables to track changes.
  • You have a large table with a large percentage of it changing between extracts and there is no index on the time stamps.
  • You need to capture physical row deletes.
  • You need to capture multiple events occurring on the same row between extracts.

Time Stamp-Based Technique

For time-stamped CDC, you must create a work flow that contains:

  • A script that reads the target table and sets the value of a global variable to the latest time stamp.
  • A data flow that uses the global variable (or a parameter) in a WHERE clause to filter the data.

The data flow contains a source table, a query, and a target table. The query extracts only those rows that have time stamps later than the last update.

This example illustrates time stamp technique of tracking changes:

  • The last load occurred at 2:00 PM on January 1, 2008.
  • At that time, the source table had only one row (key=1).
  • Data Services loaded this row into the target table with the original time stamp of 1:10 PM on January 1, 2008.
  • After 2:00 PM, Data Services adds more rows to the source table.

At 3:00 PM on January 1, 2008, the job runs again. The job:

  1. Reads the Last_Update field from the target table as 01/01/2008 01:10 PM.
  2. Selects rows from the source table that have time stamps that are later than the value of Last_Update.

    The SQL command to select these rows is: SELECT * FROM SOURCE WHERE LAST_UPDATE > 01/01/2007 01:10 PM.

    This operation returns the second and third rows (key=2 and key=3).

  3. Loads these new rows into the target table.

Let's Try It

Let me guide you through the implementation of Source Based CDC:

Overlaps

There is a window of time when changes can be lost between two extraction runs. This overlap period affects source–based CDC because this capture relies on a static time stamp to determine changed data.

For example, if a table has 10,000 rows and a change is made to one of the rows after it was loaded but before the job ends, the second update can be lost.

There must be a strategy for overlaps. It may be possible to avoid overlaps or it may be necessary to perform overlap reconciliation, for example, by using the database transaction logs. In some cases, pre-sampling may help with overlaps.

You can find more information in the Designer Guide: https://help.sap.com/docs/SAP_DATA_SERVICES/ec06fadc50b64b6184f835e4f0e1f52f/5720dbaf6d6d1014b3fc9283b0e91070.html?locale=en-US.

Log in to track your progress & complete quizzes