Creating Data Flows

Objectives

After completing this lesson, you will be able to:

  • Create a basic data flow

Data Services Data Flows

Data flows contain the source, transform, and target objects that represent the key activities in data integration and data quality processes.

Data Flow Usage

Data flows determine how information is extracted from sources, transformed, and loaded into targets. The lines connecting objects in a data flow represent the flow of data through the data integration and data quality processes.

Data flows:

  • Extract, transform and load data.
  • Determine the flow of data.
  • Are closed operations.
  • Are created in the Local Object Library or in the tool palette.

Each icon you place in the data flow diagram becomes a step in the data flow as shown in the figure.

You can use source and target objects and transforms as steps in a data flow. Make connections between the icons to determine the order in which Data Services completes the steps.

Data Flow Steps

Each step in a data flow, up to the target definition, produces an intermediate result. The intermediate result is called a data set.

The intermediate result is a set of rows from the previous operation, together with the schema in which the rows are arranged. This data set may be further processed and directed into another data set. For example, a data set could be the result of a query containing a WHERE clause that filters out some rows; this data set then flows to the next step in the data flow, which cleanses those rows.

Data flows are closed operations, even when they are steps in a work flow. Any data set created within a data flow is not available to other steps in the work flow or job. So the only way for a data flow to generate data that could be further processed is to load the data into a table or file.
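The idea of intermediate data sets can be sketched in plain Python (this is an illustration only, not Data Services syntax): each step consumes the rows produced by the previous step, and nothing is visible outside the data flow unless it is loaded into a target.

```python
# Illustrative sketch: each data flow step takes a data set
# (a set of rows plus their schema, here a list of dicts)
# and produces a new intermediate data set.

rows = [
    {"id": 1, "city": " paris ", "amount": 120},
    {"id": 2, "city": "london", "amount": 0},
    {"id": 3, "city": " berlin", "amount": 75},
]

# Step 1: a query with a WHERE clause filters out some rows.
filtered = [r for r in rows if r["amount"] > 0]

# Step 2: the next step cleanses the rows it receives.
cleansed = [{**r, "city": r["city"].strip().title()} for r in filtered]

# Step 3: only loading into a target makes the result available
# outside the data flow (printing stands in for a target here).
for row in cleansed:
    print(row)
```

The intermediate lists `filtered` and `cleansed` play the role of data sets: each exists only between two steps and disappears once the flow completes.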

Source Objects

A source object is generated from an object's metadata in a datastore, or from a file format, when you drag and drop it into the data flow workspace.

Datastore Source Objects

If you want to read data from a table or other object in a datastore, you must first import the metadata for this object. The object is then available in the object library, and you can drag and drop it into your data flow as a source.

Data Services analyzes the metadata to define the structure (the schema) of the input data. This schema is all that is needed to continue building the data flow. At execution time, of course, Data Services also uses the datastore information to connect to the source and fetch the actual data.

File Format Source Objects

File formats follow the same principle, except that once you have created the source object in the data flow, some properties are available for modification, such as the file location and the error handling options.

A file format created from a local sample file can then be redirected to access files on the job server. In fact, this is the only option that works at run time, because it is the job server that executes the job.

You can also create several source objects from the same file format in the same data flow, each fetching a different file (with the same structure), for example to join them.

Let's Try It

Let's now define source objects in your data flows.

If you want to test it by yourself, go ahead.

The Query Transform

The Query transform, which is one of the Platform transforms, is the most commonly used transform, and is included in most data flows. For this reason, it is included in the tool palette with other standard objects.

The Query transform enables you to select data from a source and filter it or reformat it as it moves to the target. The figure shows an example of a Query transform applying a filter between the source and the target.

Query Transform Operations

The Query transform can perform the following operations:
  • Filtering data extracted from sources
  • Joining data from multiple sources
  • Mapping columns from input to output schemas
  • Performing transformations and functions on the data
  • Performing data nesting and unnesting
  • Adding new columns, nested schemas, and function results to the output schema
  • Assigning primary keys to output columns
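As a rough analogy (plain Python, not Data Services syntax), a Query transform behaves much like a SQL SELECT over its input schema: it maps input columns to output columns, computes new columns with functions, filters rows, and orders the result. The column names below are illustrative.

```python
# Rough analogy: a Query transform modeled as a SELECT in Python.
source = [
    {"FIRST_NAME": "Ada", "LAST_NAME": "Lovelace", "SALARY": 5200},
    {"FIRST_NAME": "Alan", "LAST_NAME": "Turing", "SALARY": 4800},
    {"FIRST_NAME": "Grace", "LAST_NAME": "Hopper", "SALARY": 5600},
]

def query_transform(rows):
    output = []
    for r in rows:
        if r["SALARY"] >= 5000:  # Where tab: filter rows from the source
            output.append({
                # Mapping tab: map input columns to output columns
                "NAME": r["FIRST_NAME"] + " " + r["LAST_NAME"],
                # a new computed column added to the output schema
                "SALARY_K": r["SALARY"] / 1000,
            })
    # Order By tab: sort the output data set
    return sorted(output, key=lambda r: r["NAME"])

target = query_transform(source)
```

Each clause in this sketch corresponds to one tab of the Transform Editor described below.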

To define the operations you need, you use the Transform Editor. It is a graphical interface for defining the properties of transforms. The workspace contains the following areas:

  • Input schema
  • Output schema
  • Options

Input and Output Schemas

The following figure displays the input and output schema areas.

The input schema area displays the schema of the input data set. The output schema area displays the schema of the output data set, including any functions.

You need to define a relationship between the input and output schemas in order to move data from the source to the target. To achieve this, you have to map each input column to the corresponding output column.

Map Input Columns to Output Columns

Perform one of the following actions in the transform editor to map input columns to output columns:

  • Drag a single column from the input schema area to the output schema area.
  • Drag a single input column over the corresponding output column, release the cursor, and select Remap Column from the menu.
  • Select multiple input columns using Ctrl+click or Shift+click on your keyboard and drag to the Query output schema for automatic mapping.
  • Select the output column and manually enter the mapping on the Mapping tab in the options area. You can type the column name in the options area or drag the column from the input schema pane.
  • Select the output column, then highlight and manually delete the mapping on the Mapping tab in the options area.

Options Area

The Options area is below the input and output schema areas in the Query Transform Editor.

Options Area Tabs

  • Mapping: Specify how the selected output column is derived.
  • Select: Select only distinct rows, discarding duplicate rows.
  • From: Specify the input schemas used in the current output schema.
  • Outer Join: Specify an inner table and an outer table for joins that you want to treat as outer joins.
  • Where: Set conditions to determine which rows are output.
  • Group By: Specify a list of columns by which to group the output.
  • Order By: Specify columns to sort the output data set.
  • Advanced: Create separate subflows to process resource-intensive query clauses.
  • Find: Search for a specific item in the input schema or in the output schema.

Query Transform Editor for Joins

You define joins in the From tab of the Transform Editor. The From tab is displayed in the following figure.

Note

There is an older form of the join on the Where tab, but it is too restrictive. It is recommended to use the join feature on the From tab.
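To illustrate the difference (again in plain Python, not Data Services syntax): a WHERE-style join can only keep rows that match in both inputs, whereas the From tab lets you declare an outer join that also preserves unmatched rows from the outer table. The tables and columns below are hypothetical.

```python
# Two illustrative sources with a common key.
customers = [
    {"CUST_ID": 1, "NAME": "Acme"},
    {"CUST_ID": 2, "NAME": "Globex"},
]
orders = [
    {"CUST_ID": 1, "TOTAL": 250},
    {"CUST_ID": 1, "TOTAL": 90},
]

# WHERE-style join: effectively an inner join, so customers
# without orders are silently dropped.
inner = [
    {"NAME": c["NAME"], "TOTAL": o["TOTAL"]}
    for c in customers for o in orders
    if c["CUST_ID"] == o["CUST_ID"]
]

# From-tab-style left outer join: customers without orders are
# preserved, with the missing side left empty.
outer = []
for c in customers:
    matches = [o for o in orders if o["CUST_ID"] == c["CUST_ID"]]
    if matches:
        outer.extend({"NAME": c["NAME"], "TOTAL": o["TOTAL"]}
                     for o in matches)
    else:
        outer.append({"NAME": c["NAME"], "TOTAL": None})
```

The inner join loses the customer "Globex" entirely, which is exactly the restriction the From tab's outer join option removes.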

Let's Try It

Let's start with a Query transform filtering data.


I can also show you how to create a Query transform joining two sources.

Target Objects

The target object for your data flow can be a physical table or a file. You add targets to your data flow by dragging and dropping them from the datastore's metadata or from a file format, just as for source objects.

When your target object is a physical table in a database, the target table editor opens in the workspace. The editor contains tabs for database type properties, table loading options, and tuning techniques for loading a job.

Note
Most of the tabs in the target table editor focus on migration or on performance-tuning techniques. We will concentrate on the Options tab only.

Target Table Editor Options

  • Column comparison: Specify how the input columns are mapped to output columns. Validation errors occur if the data types of the columns do not match.
  • Delete data from a table before loading: Send a TRUNCATE statement to clear the contents of the table before loading during batch jobs. Not selected by default.
  • Ignore columns with value: Specify a value in a source column that you do not want updated in the target table.
  • Use input keys: Enable Data Services to use the primary keys from the source table. By default, Data Services uses the primary key of the target table.
  • Update key columns: Update key column values when loading data to the target.

File format as targets

If you use a file format as a target, you can change the file location, as for the source object.

You also have an option to delete the content of the file before loading it with new data.

Depending on what you defined in the file format, a header row is written or not; this setting is not editable in the target object.

Template Tables

You can use template tables in early application development when you are designing and testing a project.

Template tables have the following features:

  • They allow schema changes without going to the Relational Database Management System (RDBMS).
  • They do not exist in the underlying database until the data flow has been executed successfully once.
  • Once executed, they become actual tables in the underlying database.
  • They are only identified as template tables in the metadata within the Data Services Repository.
  • The Target Table Editor has an option to Drop and Recreate Table for template tables.
  • The Import Table option converts a template table into a normal table.

With template tables, you do not have to create a new table in your datastore and import the metadata into Data Services. Data Services automatically creates the table in the database with the schema defined by the data flow when you execute the job.

When you create a template table as a target in one data flow, you can use it as a source in other data flows.

You must convert template tables to normal tables so that you can use the new table in expressions, functions, and transform options. When you convert the template table, you can no longer alter the schema.

Let's Try It

Let's add a target table in the first job and see how to use a template table in the second one.


Job Execution

When you create your project, jobs, and associated data flows, you can execute the job in Data Services to move the data from source to target.

Immediate Jobs and Scheduled Jobs

You can run jobs in the following two ways:

  • Immediate Jobs

    Data Services initiates batch and real-time jobs and runs them immediately from within the Designer. The Designer and the designated job server must be running to execute the job. Run immediate jobs only during the development cycle.

  • Scheduled Jobs

    Batch jobs are scheduled. Use the Data Services Management Console or a third-party scheduler to schedule the job. The job server must be running to execute a scheduled job.

Note
A job does not execute if it has syntax errors.

Manage Execution Options

Before executing your job, you can edit some options:

  • By changing the job properties, so that every execution uses the same settings.
  • By changing the execution properties, so that the options apply to this particular execution only.

Shared Execution Properties

  • Print all trace messages: Record all trace messages in the log.
  • Collect statistics for optimization: Collect statistics so that the Data Services optimizer can choose an optimal in-memory or pageable cache type.
  • Collect statistics for monitoring: Display cache statistics in the Performance Monitor in the Administrator.
  • Use collected statistics: Use the cache statistics collected on a previous execution of the job.

Execution Properties not available as job properties

  • System configuration: Specify the system configuration to use when executing the job. A system configuration defines a set of datastore configurations, which define the datastore connections.
  • Job server or server group: Specify the job server or server group to execute the job.

Data Services can put failed request information into a log file. Data Services does not log failed requests by default because it can slow down performance. Placing failed request information in a log file must be enabled in the Management Console Administrator.

Check for errors

As a job executes, Data Services produces three log files, which are viewed from the Monitor tab of the project area. By default, the log files are also set to display automatically in the workspace when a job is executed.

Select the Trace, Monitor, and Error icons to view the log files created during job execution.

If the third icon, representing a cross, is displayed in color rather than grayed out, errors have been found.

It is possible for the trace log to report that the job completed successfully and still have some errors, if you have set the data flow to capture specific errors, such as data conversion errors during file access or file load.

Let's Try It

Now, finally, let's execute those jobs.


Work flows

As we have seen earlier, you can create work flows inside your job and design the data flows inside those work flows instead of putting them directly into the job.

Here are a couple of analogies to help understand the role of work flows in your jobs:

For people who might think like programmers, you can compare work flows to subroutines. The job is the main program and it can call work flows (like subroutines). Work flows can do almost anything a job can do, but they cannot execute independently (they must be called by a job or another work flow).

For non-programmers, think of work flows in the same way you think of parentheses in arithmetic. Parentheses allow you to manipulate the order of operations in mathematical expressions. Similarly, work flows let you control the order of execution of operations, such as scripts and data flows, inside a job.

For the simple jobs and data flows we are starting with here, we probably won't use work flows. But as your jobs become more complex, with multiple data flows, scripts, and so on, work flows can make parts of a job modular and more flexible, with no effect on performance. As a bonus, they are reusable across multiple jobs!

Remember that data flows are reusable. If two dependent data flows need to run together, perhaps with a script between them, you can wrap them in a work flow and reuse the complete work flow. (Remember also that scripts are not reusable objects on their own.) This saves a lot of design time!
