Some systems have time stamps with dates and times, some with just the dates, and some with monotonically–generated increasing numbers. You can treat dates and generated numbers in the same manner.
Time zones are important for time stamps based on real time. You can keep track of time stamps using the nomenclature of the source system and treat both temporal and logical time stamps in the same way.
Use Cases for Time-Based CDC
Time stamp–based CDC is an ideal solution to track changes if:
- There are date and time fields in the tables being updated.
- You are updating a large table that has a small percentage of changes between extracts and an index on the date and time fields.
- You are not concerned about capturing intermediate results of each transaction between extracts, for example, if a customer changes regions twice in the same day.
Don't Use Time-Based CDC When
- There are no time stamp columns available in the source tables to track changes.
- You have a large table with a large percentage of it changing between extracts and there is no index on the time stamps.
- You need to capture physical row deletes.
- You need to capture multiple events occurring on the same row between extracts.
Time Stamp-Based Technique
For time-stamped CDC, you must create a work flow that contains:
- A script that reads the target table and sets the value of a global variable to the latest time stamp.
- A data flow that uses the global variable (or a parameter) in a
WHERE clause to filter the data.
The data flow contains a source table, a query, and a target table. The query extracts only those rows that have time stamps later than the last update.
This example illustrates time stamp technique of tracking changes:
- The last load occurred at 2:00 PM on January 1, 2008.
- At that time, the source table had only one row (key=1).
- Data Services loaded this row into the target table with the original time stamp of 1:10 PM on January 1, 2008.
- After 2:00 PM, Data Services adds more rows to the source table.
At 3:00 PM on January 1, 2008, the job runs again. The job:
- Reads the Last_Update field from the target table as 01/01/2008 01:10 PM.
- Selects rows from the source table that have time stamps that are later than the value of Last_Update.
The SQL command to select these rows is:
SELECT * FROM SOURCE WHERE LAST_UPDATE > 01/01/2007 01:10 PM.
This operation returns the second and third rows (key=2 and key=3).
- Loads these new rows into the target table.
Let's Try It
Let me guide you through the implementation of Source Based CDC:
There is a window of time when changes can be lost between two extraction runs. This overlap period affects source–based CDC because this capture relies on a static time stamp to determine changed data.
For example, if a table has 10,000 rows and a change is made to one of the rows after it was loaded but before the job ends, the second update can be lost.
There must be a strategy for overlaps. It may be possible to avoid overlaps or it may be necessary to perform overlap reconciliation, for example, by using the database transaction logs. In some cases, pre-sampling may help with overlaps.
You can find more information in the Designer Guide: https://help.sap.com/docs/SAP_DATA_SERVICES/ec06fadc50b64b6184f835e4f0e1f52f/5720dbaf6d6d1014b3fc9283b0e91070.html?locale=en-US.