Optimizing data flows using the Data Transfer transform

Objectives

After completing this lesson, you will be able to:

  • Optimize data flows using the Data Transfer transform

Push Down Operations

When processing database data, SAP Data Services can leverage the power of the database engine, which can be very important for performance. The mechanism applied is called SQL pushdown: part of the transformation is "pushed down" to the database in the form of generated SQL statements. The database engine can often process data much faster than the Data Services engine. Processing within the database layer avoids or significantly reduces time-consuming data transfers between database server memory and Data Services memory and vice versa.

Data Services examines the database and its environment when determining which operations to push down to the database. There are two types of push-down operations:

  • Full: Pushes down all transform operations to the database servers and the data streams directly from the source database to the target database.
  • Partial: Pushes down the SELECT statements to the source database server.

Full Push-Down Operations

The Data Services optimizer always tries to perform a full push-down operation. In a full push-down, all transform operations are pushed down to the databases and the data streams directly from the source database to the target database.

For example, Data Services sends an SQL INSERT INTO... SELECT statement to the target database, and the embedded SELECT retrieves the data from the source.
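As a minimal sketch of the SQL shape a full push-down produces, the following uses an in-memory SQLite database as a stand-in for a single source/target datastore. The table and column names are illustrative, not from SAP Data Services.

```python
import sqlite3

# Stand-in for a datastore where source and target live in the same database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "EMEA"), (2, 20.0, "APAC"), (3, 5.0, "EMEA")])
conn.execute("CREATE TABLE tgt_region_totals (region TEXT, total REAL)")

# Full push-down shape: a single INSERT INTO ... SELECT statement.
# The rows never leave the database engine, so no data is transferred
# into the ETL tier at all.
conn.execute("""
    INSERT INTO tgt_region_totals (region, total)
    SELECT region, SUM(amount)
    FROM src_orders
    GROUP BY region
""")
rows = dict(conn.execute("SELECT region, total FROM tgt_region_totals"))
```

The key point is that the entire transformation, including the aggregation, executes as one statement inside the database.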

Full Push-Down Conditions

Full push–down operations to the source and target databases are only possible when the following conditions are met:

  • All of the operations between the source table and target table can be pushed down.
  • The source and target tables are from the same datastore or they are in datastores that have a database link defined between them.

Partial Push-Down Operations

When a full push-down operation is not possible, Data Services attempts a partial push-down by sending a SELECT statement to the source database.

SELECT Statement Operations that can be Pushed Down

The following operations within the SELECT statement can be pushed to the database:

  • Aggregations: Used with a GROUP BY statement to produce a data set smaller than or the same size as the original data set.
  • Distinct rows: Data Services outputs only unique rows when you use distinct rows.
  • Filtering: Produces a data set smaller than or equal to the original data set.
  • Joins: Produce a data set smaller than or similar in size to the original tables.
  • Ordering: Ordering does not affect data set size. Data Services can efficiently sort data sets that fit in memory. Because Data Services does not perform paging by writing out intermediate results to disk, use the DBMS itself to order large data sets.
  • Projections: Produce a smaller data set because they return only the columns referenced by the data flow.
  • Functions: Most Data Services functions that have equivalents in the underlying database are appropriately translated.
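To make the partial push-down concrete, the sketch below shows how filtering, projection, distinct rows, and ordering can all travel to the database inside one generated SELECT, so only the reduced result set crosses into the engine. SQLite stands in for the source database; all names are illustrative.

```python
import sqlite3

# Stand-in source database with a small customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT, status TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "DE", "active"), (2, "FR", "active"),
                  (3, "DE", "inactive"), (4, "DE", "active")])

# One pushed-down SELECT combining projection (country only),
# filtering (WHERE status), distinct rows, and ordering.
result = [row[0] for row in conn.execute("""
    SELECT DISTINCT country
    FROM customers
    WHERE status = 'active'
    ORDER BY country
""")]
```

Only two short rows leave the database here, rather than the full four-row table; on production volumes this difference is what makes push-down worthwhile.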

Operations that Cannot be Pushed Down

Data Services cannot push some transform operations to the database, for example:

  • Expressions that include Data Services functions that have no database equivalents.
  • Load operations that contain triggers.
  • Transforms other than the Query transform.
  • Joins between sources that are on different database servers without database links defined between them.
Note
For a full list of operators, functions and transforms that you can use as push-down functions with SAP Data Services, see SAP Note 2212730, SAP Data Services push-down operators, functions, and transforms.

The View Optimized SQL Feature

You can view the SQL generated by the data flow before running a job. You can then adjust your design to maximize the amount of SQL that is pushed down, improving performance, and refine the data flow where necessary.

Data Services only shows the SQL generated for table sources. Data Services does not show the SQL generated for SQL sources that are not table sources, for example, the lookup function, the Key Generation transform, the key_generation function, the Table Comparison transform, and target tables.

The following figure shows the Optimized SQL dialog box.

The Data Transfer Transform

The Data Transfer transform allows a data flow to push down resource–consuming operations to the database server.

The Data Transfer transform moves data from a source, or from the output of another transform, into a transfer object, and then reads data from the transfer object.

Use the Data Transfer transform to push down resource–intensive database operations that occur anywhere within the data flow.
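The staging pattern behind the Data Transfer transform can be sketched as follows, again using SQLite as a stand-in database and illustrative table names: intermediate rows are written to a transfer table inside the database so that a later, resource-intensive step (here a GROUP BY) can run as pushed-down SQL rather than in engine memory.

```python
import sqlite3

# Stand-in datastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 2), ("b", 3), ("a", 1)])

# Step 1: stage the intermediate result in a transfer table.
# (In Data Services this is the Data Transfer transform writing
# to its transfer object.)
conn.execute(
    "CREATE TABLE transfer_sales AS SELECT * FROM sales WHERE qty > 0")

# Step 2: the next part of the data flow reads from the transfer table,
# so the resource-intensive aggregation is pushed down to the database.
totals = dict(conn.execute(
    "SELECT item, SUM(qty) FROM transfer_sales GROUP BY item"))
```

Because the second step starts from a table in the database, its join, GROUP BY, or sort can execute entirely inside the database engine.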

Data Transfer Usage

  • Transfer a file or table from another database to obtain a full push-down operation in your data flow.
  • Push down resource-consuming operations such as joins, GROUP BY, and sorts.

The following figure shows the interface where you can push down operations.

Transfer Type

Choose one of the following transfer types to temporarily store the data:

  • Table: Database table from an existing datastore.

    Specify the Table options in Table name, Database type, and Array fetch size.

  • File: A flat file.

    Specify the File options in Root directory and File name.

  • Automatic: The optimizer chooses the transfer type from either:
    • Your datastores that have the Enable automatic data transfer checkbox selected.
    • The pageable cache directory that you specify in the Server Manager.

Data Input and Output Requirements

Data Input Requirements

  • When the input data set for the Data Transfer transform is a table or a file transfer type, the rows must be flagged with the NORMAL operation code.
  • The input data set must not contain hierarchical or nested data.

Data Output Requirements

Output data sets have the same schema and the same operation code as the input data sets.

  • The output rows are in the sort or GROUP BY order.
  • Data Services automatically splits the data flow into secondary data flows and executes them serially. The secondary data flow names use the format dataflowname_n, where n is the number of the secondary data flow.

Let's Try It

Let me guide you through the use of the Data Transfer transform:
