Using Target-Based Changed Data Capture (CDC)

Objectives

After completing this lesson, you will be able to:

  • Use target-based CDC

Target-Based Changed Data Capture (CDC)

Source–based CDC evaluates the source tables to determine what has changed and only extracts changed rows to load into the target tables. Target–based CDC, by contrast, extracts all the data from the source, compares the source and target rows, and then loads only the changed rows into the target with new surrogate keys.

Use target-based CDC when source-based information is limited.

To implement target-based changed data capture, you must use the Table_Comparison transform. Other transforms can help you complete your data flow.

Target-based CDC Transforms

Three Data Services transforms support target-based CDC:

TransformDescription
Table ComparisonCompares two data sets and produces the difference between them as a data set with rows flagged asINSERT or UPDATE .
History PreservingConverts rows flagged as UPDATE to UPDATE plus INSERT, so that the original values are preserved in the target. Specify the column in which to look for updated data.
Key GenerationGenerates new keys for source data, starting from a value based on existing keys specified by you in the table.

The steps in the target-based change data capture process are outlined in the following figure:

Target-based CDC with History Preserving

Preserve history by creating a data flow that contains:

  • A source table that contains the rows to be evaluated.
  • A Query transform that maps columns from the source.
  • A Table Comparison transform that compares the data in the source table with the data in the target table to determine what has changed.
  • A History Preserving transform that converts certain UPDATE rows to INSERT rows based on the columns in which values have changed. This transform produces a second row in the target instead of overwriting the first row.
  • A Key Generation transform that generates new keys for the updated rows that are now flagged as INSERT .
  • A target table that receives the rows. The target table cannot be a template table.

The Table Comparison Transform

You can detect and forward changes that have occurred since the last time a target was updated with the Table Comparison transform. This transform compares two data sets and produces the difference between them as a data set with rows flagged as INSERT or UPDATE.

Input, Comparison, and Output

The transform compares the input table and the comparison table.  it selects rows from the comparison table based on the primary key values from the input data set. Then it compares columns that exist in the schemas for both inputs.

The input data set must be flagged as NORMAL.

The output data set contains only the rows that make up the difference between the tables. The schema of the output data set is the same as the schema of the comparison table.

If a column has a date data type in one table and a datetime data type in the other, the transform compares only the date section of the data. The columns can also be time and datetime data types, in which case Data Services only compares the time section of the data.

Table_Comparison Transform Outcomes

There are three possible outcomes from the transform for each row in the input data set:

  • The row is ignored

    The primary key value from the input data set matches a value in the comparison table, but the comparison does not indicate any changes to the row values.

  • An INSERT row is generated

    The primary key value from the input data set does not match a value in the comparison table. The transform produces an INSERT row with the values from the input data set row.

    If there are columns in the comparison table that are not present in the input data set, the transform adds these columns to the output schema and fills them with NULL values.

  • An UPDATE row is generated

    The primary key value from the input data set matches a value in the comparison table. Values in the non–key compare columns differ in the corresponding rows from the input data set and the comparison table.

    The transform produces an UPDATE row with the values from the input data set row.

    If there are columns in the comparison table that are not present in the input data set, the transform adds these columns to the output schema and fills them with values from the comparison table.

  • No DELETE operations are produced by default but you can select the Detect deleted row(s) from comparison table option to extract them from the comparison table.

Table_Comparison Transform Options

Table_Comparison Transform Options

OptionDescription
Table nameSpecifies the fully qualified name of the source table. Table name is represented as datastore.owner.table .Datastore is the name of the datastore Data Services uses to access the key source table. Owner depends on the database type associated with the table.
Generated key columnSpecifies a column in the comparison table. When there is more than one row in the comparison table with a given primary key value, this transform compares the row with the largest generated key value of these rows and ignores the other rows.
Input contains duplicate keysProvides support for input rows with duplicate primary key values.
Detect deleted rows from comparison tableIdentifies rows that have been deleted from the source.
Comparison methodAccess the comparison table by using row–by–row select, cached comparison table, or sorted input.
Input primary key columnsSpecifies the columns in the input data set that uniquely identify each row. These columns must be present in the comparison table with the same column names and data types.
Compare columnsImproves performance by comparing only the subset of columns you drag into this box from the input schema. If no columns are listed, all columns in the input data set that are also in the comparison table are used as compare columns.

The History Preserving Transform

The History Preserving transform has its own data input requirements, data output results, and options.

History Preserving Data Sets

  • Input Data Set

    This data set is the result of a comparison between two versions of the same data. Rows with changed data from the newer version are flagged as UPDATE rows and new data from the newer version are flagged as INSERT rows. It can contain rows flagged as DELETE if the option has been set in the Table_Comparison transform.

  • Output Data Set

    This data set contains rows flagged as INSERT or UPDATE.

History_Preserving Transform Options

History Preserving Options

The History Preserving transform offers the options outlined in this table:

OptionDescription
Valid fromSpecify a date or datetime column from the source schema. Specify a Valid from date column if the target uses an effective date to track changes in data.
Valid toSpecify a date value in the format: YYYY.MM.DD. The Valid to date cannot be the same as the Valid from date.
ColumnSpecify a column from the source schema that identifies the current valid row from a set of rows with the same primary key. The Flag column indicates whether a row is the most current data in the target for a given primary key.
Set valueDefine an expression that outputs a value with the same datatype as the value in the Set flag column. Use this value to update the current flag column. The new row in the target preserves the history of an existing row.
Reset valueDefine an expression that outputs a value with the same datatype as the value in the Reset flag column. Use this value to update the current flag column in an existing row in the target when the row in the target includes changes in one or more of the compare columns.
Preserve delete rows as update rowsConvertDELETE rows to UPDATE rows in the target. If you previously set Valid from and Valid to values, this option sets the Valid to value to the execution date. Use this option to maintain slowly changing dimensions by feeding a complete data set through the Table Comparison transform first. Select the Detect deleted rows from comparison table option in the Table Comparison transform.
Compare columnsList the columns in the input data set to compare for changes.
  • If the values in the specified compare columns in each version match, the transform flags the row as UPDATE . The row from the before version and the date and flag information is updated.
  • If the values in each version do not match, the row from the latest version is flagged as INSERT when output from the transform. This adds a new row to the warehouse with the new values.

    Updates to non-history preserving columns update all versions of the row if the update is performed on the natural key, but only update the latest version if the update is on the generated key.

The Key Generation Transform

The Key Generation transform generates new keys before inserting the data set into the target in the same way as the key_generation function.

When it is necessary to generate artificial keys in a table, this transform looks up the maximum existing key value from a table and uses it as the starting value to generate new keys. The transform expects the generated key column to be part of the input schema.

For example, the History Preserving transform produces rows to add to a warehouse, and these rows have the same primary key as rows that already exist in the warehouse. In this case, add a generated key to the warehouse table to distinguish these two rows that have the same primary key.

Key_Generation Data Sets

  • Input data set

    This data set is the result of a comparison between two versions of the same data. Changed data from the newer version is flagged as an UPDATE row and new data from the newer version is flagged as an INSERT row.

  • Output data set

    This data set is a duplicate of the input data set, with the addition of key values in the generated key column for input rows flagged as INSERT.

Key Generation Transform Options

The table outlines the options available with the Key Generation Transform:

OptionDescription
Table nameSpecify the fully qualified name of the key source table from which the maximum existing key is determined. This table must be already imported into the repository. Table name is represented as datastore.owner.table where datastore is the name of the datastore that Data Services uses to access the key source table and owner depends on the database type associated with the table.
Generated key columnSpecify the column in the key source table containing the existing key values. A column with the same name must exist in the input data set. The new key is inserted in this column.
Increment valuesIndicate the interval between generated key values.

Let's Try It

Let me guide you through the implementation of a Target Based Changed Data Capture data flow:

Log in to track your progress & complete quizzes