Implementing Imperative and Declarative Logic

Objective

After completing this lesson, you will be able to implement imperative and declarative logic.

Imperative Logic

What Is Imperative Logic?

Let’s begin by establishing that declarative logic is free from side effects. That means there is no change of status for the underlying objects at either the database or at the application level. In other words, declarative logic works on a read-only basis. By focusing purely on read-only operations, SAP HANA is able to fully optimize the whole processing logic and parallelize as much of the execution as possible.

Writing code using only declarative language limits what you can achieve, and sometimes you need to control the data flow logic using imperative (sometimes called orchestration) language. For these situations, SQLScript supports the use of imperative language.

Imperative language is used to implement data-flow and control-flow logic using constructs such as loops and conditionals. When you include these constructs, the code then executes procedurally and follows the logic of the loop or conditional. You should bear in mind that this reduces the possibilities for SAP HANA to apply optimizations to the code.

Examples of imperative logic:

  • Scalar variable manipulation

  • DDL/DML logic

  • FOR & WHILE loops

  • Emptiness checks

  • Branching logic based on some conditions, for example, IF-ELSE

  • Executed procedurally, but the SQLScript engine will still continue to make all efforts to optimize for parallelization

Note

The imperative logic is usually represented as a single node in the dataflow graph; thus, it is executed sequentially, and so the opportunities for optimizations are limited.

As a caution to the reader, for the next several lessons, we will be discussing syntaxes and structures that are explained in terms of each other. Structure A will be demonstrated within an example of Structure B, which hasn't been presented yet. Structure B will be shown in terms of Structure C, which in turn will be shown in terms of Structures D and E, which will be shown in terms of Structure A. Unfortunately, there is no one base structure that doesn't reference at least one other, so we will have to include warnings that a given item is yet to be discussed. At the conclusion of this unit and the next, you should review both in their entirety to understand everything in context.

Variables

Scalar Variables

  • Scalar variables store a typed value under a name.
  • They may be initialized empty (initial NULL value) or with a declared value:

    • DECLARE local_var int;
    • DECLARE C int = 5;
    • DECLARE a int DEFAULT 0;
  • Assignment done using "=":

    It can be referenced by using ":" as the prefix to the variable name.

    local_var = :C + :a + 3 ;

The local scalar and table variables are declared in the functions or the procedures by using the DECLARE keyword. They are optionally initialized in the declaration.

The variables can also be flagged as read-only by using the CONSTANT keyword, and the consequence is that you cannot override the variable anymore. Also, if you use the CONSTANT keyword, it should have a default value and cannot be NULL.

The value in a variable is accessed or retrieved by prefixing its name with a colon (:). Please note that the colon notation will have several different uses in SQLScript, but this is the most common. When you want to assign a value to another variable, use '='. The left side will contain the variable whose value will be updated, as below:

Code Snippet
1
local_var = :C + :a + 3 ;

In this example, the value of variables C and a are retrieved and added together with the constant value 3, then the sum is stored in the variable local_variable. If the syntax example above is the source of the variables, the local_variable is set to a value of 5 + 0 + 3 = 8.

An assignment is possible more than once, and it will overwrite the previous value stored in the scalar variable. At the same time, the SQLScript supports local variable declaration in a nested block (bounded by BEGIN and END). Local variables are only visible in the scope of the code block in which they are declared and defined, meaning the variable value inside the nested block is assigned inside the nested block. For example, if the variable is only declared outside the nested block, but also assigned inside the nested block, the inside assignment will override the value assigned outside of that nested block.

It is also possible to define local variables inside WHILE / FOR / IF-ELSE control structures (to be discussed later).

Table Variables

  • Table variables store a set of values in a table-like structure under a single name.
  • The table variable will have column name and type declarations, and some number of value sets (rows).
  • Defined by using the DECLARE keyword.

  • May be declared as explicit an column name/type list, as matching the structure of a query or function's output, or based on a table type
  • As with scalar variables, may be initialized empty or with a dataset declaration:

    • DECLARE temp TABLE (n int);
    • DECLARE inTab MY_TABLE_TYPE = UNNEST (:arr) as (i);
    • DECLARE temp CONSTANT MY_TABLE_TYPE DEFAULT SELECT * FROM TABLE;
    • DECLARE outTab MY_TABLE_TYPE;
  • Assignment done using "=":

    It can be referenced by using ":" as the prefix in the variable.

    outTab = SELECT * FROM :inTab;

Note

Table types have not yet been discussed. For now, think of a persistent structure that can be used as a shortcut to define table variables: one or more table variables can be defined to match the structure of a resuable table type.

Note

The terminology here is confusing. A table, a [data]type, and a table type are all different things. A table type is NOT the same as the category (type) of a table (column store vs row store)!

Global Session Variables

If there is a need to pass the value of a variable between procedures and functions that are running in the same session, global session variables can be used in SQLScript. This variable value is only available in that session and not visible from another session. The session variables are of type STRING, and if it is not of type STRING, it will implicitly be converted to type STRING.

  • Used to share a scalar value between procedures and functions that are running only in the same session.
  • Not visible from another session.
  • Set in a procedure or a function.

    SET 'MY_SES_VAR' = :new_value ;

  • Retrieve the session variable using SESSION_CONTEXT() function.

    var = SESSION_CONTEXT('MY_SES_VAR');

  • Default value is NULL.
  • Can be reset to NULL using UNSET.

    UNSET 'MY_VAR’ ;

  • Cannot be used in read-only procedures or functions.

The global session variables are set to NULL by default. They can be set and unset (reset to NULL) using the SET and UNSET command inside the procedures or functions.

Working with Arrays

An array is an indexed collection of elements of a single data type. In the following section, we explore the varying ways to define and use arrays in SQLScript. The array can be used to set, return, or remove elements in an array, concatenate two arrays into a single array, turn arrays from multiple arrays into a table using the UNNEST function, and so on.

  • Developers are allowed to define and construct arrays within SQLScript.
  • An array is an indexed collection of elements of a single data type.
  • Operations supported:
    • Set and access elements by index using <array>[<index>]

    • Add and remove elements from the ends of an array

    • Concatenate two arrays using CONCAT or ||

    • Convert arrays into a table using UNNEST()

    • Convert a table column into an array using ARRAY_AGG()

    • Determine the length of the array using CARDINALITY()

The following examples will show imperative logic in the context of procedures, the syntax for which has not yet been described. For the moment, we will use pseudocode to represent procedures so that we can focus on the imperative logic structures.

Code Snippet
123456789101112131415161718192021222324252627282930313233343536373839
-- START OF PSEUDOCODE -- Procedure is named "STUDENT01.PRODUCT_ARRAY", provides one output parameter -- Parameter "output_table" is a tablelike structure, based on an existing structure called IT_PRODUCT_SALES -- The structure of IT_PRODUCT_SALES will be compatible with the data stored in it at the end of the procedure: -- two varchar(20) columns and a decimal(15,2) column -- Procedure logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- declare three arrays for three ordered sets of data that will be populated later: DECLARE productid VARCHAR(20) ARRAY; DECLARE category VARCHAR(20) ARRAY; DECLARE price DECIMAL(15,2) ARRAY; -- populate the productid array with three address-ordered values -- note the := notation, where the value to the right is retrieved and sent to the left productid[1] := 'ProductA'; productid[2] := 'ProductB'; productid[3] := 'ProductC'; -- similarly, populate the category and price arrays category[1] := 'CategoryA'; category[2] := 'CategoryB'; category[3] := 'CategoryC'; price[1] := 19.99; price[2] := 29.99; price[3] := 39.99; -- populate the parameter output_table from the arrays; each row in the tablelike parameter will -- be given the [1] addressed array values in the first row, the [2] address array values in the -- second row, etc, and assigned to columns named in the AS clause -- note the : notation, which is used to retrieve values from the named arrays output_table = UNNEST(:productid, :category, :price) AS (PRODUCTID, CATEGORY, PRICE); -- when called, this procedure will return output_table with three rows END;

As shown in the example, the UNNEST() function converts one or more arrays into a table. The result table includes a row for each element of the specified array. The result of the UNNEST() function needs to be assigned to a table variable.

The ARRAY_AGG() function converts a column of a single table into an array. While converting the column of a table using the ARRAY_AGG() function, the columns can be placed in the array using ORDERY BY to do it in ascending or descending order, and also to place any NULLs at the beginning or in the end using NULLS FIRST or NULLS LAST.

Code Snippet
1
ARRAY_AGG(<table_name>.<column_name> [ORDER BY {<expression> [ASC | DESC] [NULLS FIRST | NULLS LAST]}) ;

Access Any Cell

The index-based cell access allows you to access any cell (read/write) of an intermediate table variable or table parameter directly.

Code Snippet
1
<table_variable>.<column_name>[<index>]

The index is a number which can be a number between 1 to 2^31 or an SQL expression, or a Scalar UDF that returns a number. Reading and writing can be performed on a cell of a table variable using the index value.

When writing a value to the table variable, use the following syntax:

Code Snippet
1
ex_emp.emp_id[1] = '10001';

When reading, you must precede the table variable with a ":"

Code Snippet
1
outvar = :intab.B[100];

Code Snippet
123456789101112131415161718192021222324
-- START OF PSEUDOCODE -- Procedure is named "HA150::build_emp_list", provides one output parameter -- Parameter "ex_emp" is a tablelike structure with four columns: -- emp_id, name.first, name.last are nvarchar(10, and saleamount is integer -- Procedure is read only -- Procedure logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- Assign values to the first row of ex_emp; note the colon notation is not used as the values -- assigned are not being retrieved from anywhere ex_emp.emp_id[1] = '10001'; ex_emp."name.first"[1] = 'Mickey'; ex_emp."name.last"[1] = 'Mouse'; ex_emp.saleamount[1] = 2000; -- Similarly, assign values to the second row ex_emp.emp_id[2] = '10002'; ex_emp."name.first"[2] = 'Don'; ex_emp."name.last"[2] = 'Duck'; ex_emp.saleamount[2] = 3000; -- when called, this procedure will return ex_emp with two rows END;

Restrictions:

  • Physical tables cannot be accessed.

  • Not applicable in SQL queries like SELECT :MY_TABLE_VAR.COL[55] AS A FROM DUMMY. You need to assign the value to be used to a scalar variable first.

Arrays can be used to support data manipulation and conversion.

Code Snippet
1234567891011121314151617181920212223242526272829303132
-- START OF PSEUDOCODE -- Function is named "Convert_OfficialsHours" and takes one input parameter and returns one output value -- Parameter im_to is varchar(1) and indicates the target time unit to convert accrued vacation hours to -- Output value lv_factor is decimal(5,2) and will hold the conversion of hours into the specified unit -- A table variable for holding calculations, lt_conversion, is created based on the output of an UNNEST() -- Functions are read only by definition -- Function logic is bounded by BEGIN and END -- END OF PSEUDOCODE BEGIN -- declare two arrays for two ordered sets of data that will be used in a calculation later -- array "unit" will indicate unit of time: minute, hour, or day -- array "factor" is a multiplier to convert hours to target unit DECLARE unit VARCHAR(1) ARRAY; DECLARE factor DEC(5,2) ARRAY; -- populate the arrays with conversion units and multiplier factors -- the formatting of the statements is to make it clear which unit goes with which factor unit[1] := 'm'; factor[1] := 60; unit[2] := 'h'; factor[2] := 1; unit[3] := 'd'; factor[3] := 1.0/24.0; lt_conversion = UNNEST(:unit, :factor) AS (Unit, Factor); SELECT COALESCE(Factor, 1.0) INTO lv_factor FROM :lt_conversion WHERE Unit = :im_to; -- additional syntax omitted, including return of lv_factor to the calling SELECT calling this function END;

The above syntax performs the following tasks:

  • The UNNEST() function converts the referenced arrays into a tablelike variable, with specified column names.
  • In this case, the unit and factor arrays become columns named Unit and Factor in table variable lt_conversion.
  • The COALESCE() function returns the first non-NULL value from its input arguments.
  • In this case, the :im_to parameter is searched in the lt_conversion table variable and the appropriate value of Factor is returned.
  • If the :im_to value is invalid, no rows are returned and Factor is NULL.
  • If no rows are returned and Factor is NULL, COALESCE() skips the value of Factor and returns a conversion ratio of 1.0.

There are certainly easier ways to convert a value from hours to minutes or days (CASE, IF, and a lookup table are just some possibilities), but this example allowed a relatively intuitive demonstration of array operations. Also note that an array was chosen instead of creating a table for conversion factors because tables (or other permanent structures) cannot be created in functions.

Conditional Logic

Within the imperative logic of SQLScript, there are control-of-flow statements like IF, and loop statements like WHILE and FOR. They will allow selective or iterative execution of blocks of SQLScript code.

The IF statement tests a Boolean expression condition and branches execution based on the outcome of a test. IF statements are part of nearly every programming language, and they are structured fundamentally the same, with only minor syntactic differences; if you know how to code an IF in any other language, this will look familiar.

Broadly speaking, the initial test evaluates to either true or false; if true, the block of code following the THEN keyword is executed and the terminates. If the initial test is false, then the execution skips to the next test, following the ELSE IF key phrase. If this new test is true, the following block of code is executed and the IF exits. If false, execution passes to each subsequent ELSE IF test until a true is encountered, or a final ELSE block is executed (if present). In SQLScript, IF statements are finished with an explicit END IF.

IF Condition

Flowchart showing the conditional logic of the IF condition.

The IF statement, as shown in the figure, will have one true condition and will be executed, while all others will be false (or not evaluated) and skipped. Note that the tests are evaluated in top-down order as written, and rearranging the tests may result in a different outcome.

WHILE Loop

Flowchart showing the conditional logic of the WHILE loop.

The WHILE loop executes the block of code in the body of the loop, as long as the Boolean expression at the condition check in the beginning of the loop evaluates to true. If the condition evaluates as false, the whole block of code is skipped. WHILE loops may be nested.

FOR Loop

Flowchart showing the conditional logic of the FOR loop.

The FOR loop iterates over a range of numeric values and binds the current value to a variable in ascending order. Iteration starts with the value of a starting value and is incremented by one until the variable is greater than ending value. If the start value is greater than the ending value in the range of numeric values, the loop will not be evaluated. Unlike WHILE loops which allow for any logical test, the FOR loop only asks if the ending iteration bound has been reached. The optional keyword REVERSE will cause the loop to iterate in descending order. FOR loops may be nested.

BREAK and CONTINUE

Code block showing the conditional logic of the BREAK and CONTINUE statements.

The BREAK and CONTINUE statements provide internal control functionality for loops.

BREAK causes an unconditional and immediate exit of the current loop and moves execution to first statement following the loop, if there is any. In nested loops, this causes an exit from the current nesting level and reverts to the immediately-higher level. Note that loops are not named, and so nested loops must each be explicitly exited; you cannot "jump" over intermediate levels of nesting to a non-adjacent level.

CONTINUE skips the remainder of the statements to the end of the loop move execution back to the top of the loop and re-evaluates the Boolean condition (WHILE loops) or the iteration count (FOR loops) to determine if the loop will run again. Unlike BREAK, the loop iteration may continue if the loop condition evaluates as true (WHILE) or there are still values in the iteration bounds (FOR).

Declarative Logic

Declarative Logic

The aim of declarative logic in SQLScript is to enable optimization for the execution of data-intensive computations. It does so by carefully writing SQL code that declares only the intent of the request and not how to process.

Declarative Logic

  • Allows the developer to describe the data flow declare the data selection via SELECT statements.
  • Is used for efficient execution of data-intensive computations.
  • Is internally represented as a data flow graph with nodes that might execute in parallel.
  • Each statement is bound to a variable to be passed as input to the next statement.
  • With no changes to the database, the declarative logic is side-effect free.
  • It can only use a limited set of SQLScript language.

You implement declarative logic by coding in SQLScript, following specific rules in order to preserve the optimization. Once you break the declarative coding rules, you begin to limit the optimization possibilities.

Declarative logic in SQLScript maximizes the parallelization possibilities, and as SAP HANA is built for high levels of parallel processing, declarative programming is highly recommended.

The main requirement of declarative coding is that the operations in a data-flow graph must be free of side effects. This means the statements must not change any global state in the database, or at the application level. SQLScript must execute read-only statements and any writing should be restricted to session variables and parameters which only exist during the execution and do not have a permanent effect on the database, such as changing a record in a table or creating a persistent SQL object.

Parallelization and being free of side effects can only be achieved by allowing changes to the data set that is passed as inputs to the operator, and by allowing only a limited subset of language features to express the logic of the operator.

When writing SQLScript code, careful planning and design is needed when specifying the statements in order to write the flow logic. If you do this well, then the SAP HANA database has the freedom to figure out the data flow step dependencies and thus optimize the data flow, which may result in better performance.

The following examples will show declarative logic in the context of procedures, and will continue the use of pseudocode to represent the procedure structures.

Code Snippet
1234567891011121314151617181920212223242526272829303132333435363738
-- START OF PSEUDOCODE -- Procedure is named "HA150::Colorhpcars", takes two input parameters: -- Parameter "hpval" is an integer, paramter "selclr" is varchar(10) -- Prodcedure is read-only -- Procedure logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- Q1 identifies cars with horsepower greater than the supplied input paramater, hpval OVERHP = SELECT * FROM "HA150::CAR" WHERE HP > :hpval; -- Q2 identifies cars from the high-HP set that match the color of input parameter selclr SELCOLOR = SELECT * FROM :OVERHP WHERE color = :selclr; -- Q3 identifies how many cars from the HP- and color-matched set are BMWs BMWHP = SELECT COUNT(*) FROM :SELCOLOR WHERE BRAND = 'BMW'; -- Q4 identifies all cars from the HP- and color-matched set that do not have registered owners NOOWNER = SELECT * FROM :SELCOLOR WHERE OWNER IS NULL; -- return the count of high-HP, color-matched BMWs SELECT * FROM :BMWHP; -- return the high-HP, color-matched cars with no registered owners SELECT * FROM :NOOWNER; END;

In this example, the placeholder names OVERHP, SELCOLOR, BMWHP, and NOOWNER are created by implication – a name is defined to reference the result of a simple SELECT, then passed as an input (":" notation ) to the next SELECT.

Example of a data flow graph.

A procedure in SQLScript that only uses declarative constructs can be completely translated into an acyclic (does not loop) dataflow graph, where each node represents a data transformation.

The above example defines a read-only procedure that has two scalar input parameters.

Query Q1, identifies the cars that have a HP over a certain value of any color (using the input parameter ’hpval’).

Query Q2, identifies cars from Q1 that meet the specified color (using the input parameter ‘selclr’).

Finally, this information is aggregated in two different ways:

  • Query Q3 is a count of the BMW cars from Q2.

  • Query Q4 returns cars that have no owner but are of any make; this selection is also from Q2.

  • These resulting tables constitute the output tables of the function.

  • There are no side-effects; OVERHP, SELCOLOR, BMWHP, and NOOWNER do not exist after the conclusion of the procedure's execution .

As you can see after Query Q2, there is parallelization happening for both queries Q3 and Q4.

All data flow operations have to be side-effect free, that is they must not change any global state either in the database or in the application logic, which will be optimized for parallel execution by the optimizer.

Table Types

In SQLScript, table types are a form of user-defined datatype that, instead of representing a single domain (such as values that may be found in a column), instead represent a tablelike structure. The table type, once created, can act as the basis of a table parameter or table variable, which will pick up the column names, definitions, and order of the underlying table type. Table types are created using the CREATE TYPE command:

Code Snippet
1
CREATE TYPE <type_name> AS TABLE (<column_list_definition>)

Where the <type_name> is a valid name, optionally prefixed by a schema name, and the <column_list_definition> is a set of name-datatype pairs, similar to the creation of a table.

Examples:

Code Snippet
1234567891011121314
CREATE TYPE CAR_TT AS TABLE (Car_ID varchar(5), PlateNumber varchar(10), Brand varchar(10), Color varchar(10), HP integer, Owner varchar(5)); CREATE TYPE OFFICIAL_TT AS TABLE (PNR varchar(5), Name varchar(10), Overtime integer, Salary varchar(5), Manager varchar(5));

After these table types are created, table parameters or table variables could be created referencing them. The parameter or variable would have whatever name is designated, and would have the structure of the underlying table type – it would inherit the column names and column datatypes.

Table types are persistent structures that may be reused until they are dropped with the DROP TYPE command.

Table Parameters

You can define table parameters in one of two ways.

  1. Define directly in the definition of the procedure or function:
    • (IN inTab TABLE1(I INT), OUT outTab TABLE2 (I INT, J DOUBLE))
    • Advantage: Define directly and not have to manage the object.
    • Disadvantage: It cannot be reused.
  2. Use a table type that was already defined:
    • (IN inTab tableType1, OUT outTab tableType2)
    • Advantage: It can be reused in other procedures and functions.
    • Disadvantage: Need to take care of the object lifecycle.

Table parameters are used to define inputs and outputs for procedures and functions. Table parameters that are defined in the procedure or function signature are either declared as input or output. They must be typed explicitly. This can be done either by using a table type previously defined with the CREATE TYPE command, or by writing it directly in the signature without any previously defined table type.

With table parameters, conceptually, this enables a temporary table. A reserved and locally visible schema is introduced with the table parameters. All objects within this special schema are only visible for the current transaction.

Table Variables

A table variable is an intermediate variable and appears in the body of a procedure or a table function and can be either derived from a query result, or declared explicitly.

If the table variable has derived its type from the SQL query, the SQLScript compiler determines its type from the first assignments of the variable thus providing a lot of flexibility. One disadvantage of this approach is that it also leads to many type conversions in the background because sometimes the derived table type does not match the typed table parameters in the signature. This can lead to additional unnecessary conversions. Another disadvantage is the unnecessary internal statement compilation to derive the types. To avoid this unnecessary effort, you can declare the type of a table variable explicitly.

Table Variable Type

The DECLARE keyword is used to define a table variable type in SQLScript. Again, you can define table variables by two methods.

  • Define the table variable explicitly: DECLARE local_table TABLE (n int);
  • Reference a table type that was already defined: DECLARE local_table MY_TABLE_TYPE;
  • Default value assignment: DECLARE temp MY_TABLE_TYPE = UNNEST (:arr) as (i);
  • Default value assignment for a read only table variable (CONSTANT): DECLARE temp CONSTANT MY_TABLE_TYPE DEFAULT SELECT * FROM TABLE;

A declared table variable is always initialized with empty content. The name of the table variable must be unique among all other scalar variables and table variables in the same code block. However, you can use names that are identical to the name of another variable in a different code block. Additionally, you can reference those identifiers only in their local scope.

Code Snippet
123456789101112131415161718192021222324252627282930313233343536
-- START OF PSEUDOCODE -- Procedure is named "HA150::LCL_DEF", and has no parameters -- Prodcedure is read-only -- Procedure logic is bounded by nested BEGIN/END pairs -- END OF PSUEDOCODE BEGIN -- scope level 1 starts -- declare table variable A, based on a previously-defined table type -- called CAR_TT, with a structure that matches the table CAR DECLARE A CAR_TT; -- populate A with the contents of the table CAR A = SELECT * FROM CAR; -- start a new, nested BEGIN/END block BEGIN -- scope level 2 starts -- declare table variable A in this new, nested context, based on a -- previously-defined table type called OFFICIAL_TT, with a structure that -- matches the table OFFICIAL DECLARE A OFFICIAL_TT; -- populate A with the contents of the table OFFICIAL A = SELECT * FROM OFFICIAL; END; -- scope level 2 ends -- retrieve from A; the table variable that was defined at scope level 2 ceased to -- exist when scope level 2 ended, but the table variable that was defined at scope -- level 1 still exists SELECT * FROM :A; END; -- scope level 1 ends

Binding and Referencing – Table Variable

Local table variables can be bound and referenced after they are assigned. Table variables are bound using an equals (=) operator. The results of a SELECT are passed to the intermediate table variable or to the output table parameter.

Binding example:

Code Snippet
1234
lt_expensive_books = SELECT title, price, crcy FROM :it_books WHERE price > :minPrice AND crcy = :currency;

The result set of the SQL statement on the right side of the expression is bound to table variable lt_expensive_books. The it_books variable in the FROM clause of the statement, refers to an input parameter of a table type or a previously-bound variable.

Table variables are referred to using the colon (:) symbol that prefixes the name of the table variables. Table variables can be referred to multiple times.

Note

If a table's variable is not consumed by a subsequent statement, then it is ignored by the optimizer.

Map Merge

The MAP_MERGE operator is used to apply each row of the input table to the mapper function and unite all intermediate result tables. The purpose of the operator is to replace sequential FOR loops and UNION patterns, like in the example below, with a parallel operator to significantly improve performance.

Diagram showing the input, map phase, merge phase, and output using the MAP_MERGE operator.

The mapper procedure is a read-only procedure with only one output that has a tabular output.

As an example, let us rewrite this example to leverage the parallel execution of the MAP_MERGE operator. We need to transform the procedure into a table function, because MAP_MERGE only supports table functions as <mapper_identifier>. We will replace this code ...

Code Snippet
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748
-- START OF PSEUDOCODE -- Procedure is named "Merger", provides one output parameter: -- Parameter "ret_table" is table, with one column (col_a nvarchar(200)) -- A defined table, tab, contains the data to run through the desired logic -- The desired logic is found in a separate procedure, "mapper" (not shown) -- Procedure logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- define a variable, i, to control the loop iterations DECLARE i int; -- define a variable, varb, to hold the content the data-bearing column of the table tab DECLARE varb nvarchar(200); -- placeholder name t is initialized with the contents of table tab t = SELECT * FROM tab; -- iterate the loop for each row in t FOR i IN 1 .. record_count(:t) DO -- populate varb with the ith position in the table (i = 1, 1st row; i = 2, 2nd row, etc) varb = :t.col_a[:i]; -- apply the logic in the procedure "mapper", taking the current row's data in varb as the input, -- returning the output of the logic to the placeholder name out_tab call mapper(:varb, out_tab); -- update ret_tab to equal its current content UNIONed with the current row's output of "mapper" -- which is stored in out_tab ret_tab = SELECT * from :out_tab UNION SELECT * FROM :ret_tab; END FOR; -- at the conclusion of the loop, "mapper" will have been called once and applied its logic -- to each row of the data table, tab (via the table variable t), and each output appended -- to the output set, ret_tab, by a series of UNIONs -- at the conclusion of the procedure, this cumulative output parameter, ret_tab, will be -- returned to whoever called Merger END;

… with this, much briefer code:

Code Snippet
1234567891011121314151617181920212223242526
-- START OF PSEUDOCODE -- Function is named "Merger", provides one output parameter: -- Parameter "ret_table" is table, with one column (col_a nvarchar(200)) -- A defined table, tab, contains the data to run through the desired logic -- The desired logic is found in a separate procedure, "mapper" (not shown) -- Function logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- placeholder name t is initialized with the contents of table tab t = SELECT * FROM tab; -- output ret_tab is constructed by calling the MAP_MERGE operation on -- the logic in the procedure "mapper", taking the current row's data in varb as the input, -- returning the output of the logic to the placeholder name out_tab ret_tab = MAP_MERGE(:t, mapper(:t.col_a)); -- at the conclusion of the MAP_MERGE, "mapper" will have been called once and applied its logic -- to each row of the data table, tab (via the table variable t), and each output held as -- an intermediate result set, which will be UNIONed together -- at the conclusion of the function, this cumulative output parameter, ret_tab, will be -- returned to whoever called Merger END;

MAP_REDUCE is a programming model introduced by Google that allows easy development of scalable parallel applications for processing big data on large clusters of commodity machines. The MAP_REDUCE operator is a specialization of the MAP_MERGE operator.

Map Reduce Example. Left table showing the intput with two columns as ID and String. Right table shows the required output with three columns as VAL, STMT_FREQ, and TOTAL_FREQ.
Diagram for MAP_REDUCE showing the tables as Source, Mapper, and Reducer.

1) Build the Mapper:

Code Snippet
12345678910111213141516171819202122232425262728
-- START OF PSEUDOCODE -- Function is named "mapper", takes two input parameters and returns one output table: -- Parameter "id" is integer, parameter "sentence" is varchar(5000) -- Output table structure has three columns: (id int, c varchar(5000), freq int) -- A block of logic is retrieved from a defined library; the library is called "lib", -- and the specific logic is called split_to_table ... this is the applied transformation (not shown) -- split to table will take a long sentence, and based on a supplied delimiter, split it into -- distinct elements and store them in a table -- Function logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- table variable tv is defined with one column result, to hold the result of the transformation declare tv table(result varchar(5000)); -- tv is populated with the output of the split_to_table logic -- input is the value of the parameter sentence, and a blank space (' ') is the defined delimiter tv = lib:split_to_table(:sentence, ' '); -- return as output of the function a table with each distinct element from the input sentence -- numbered sequentially with a count of occurrences RETURN SELECT :id as id, result as c, count(result) as freq FROM :tv GROUP By result; END;

2) Build the Reducer:

Code Snippet
123456789101112131415161718
-- START OF PSEUDOCODE -- Function is named "reducer", takes two input parameters and returns one output table: -- Parameter "c" is varchar(5000), parameter "values" is a table with two columns: -- (id int, freq int) -- Output table is three columns: (c varchar(5000), stmt_freq int, total_freq int) -- Function logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- return as output of the function a table with each distinct element from the input sentence -- along with a count of unique elements and total elements RETURN SELECT :c as c, count(distinct id) as stmt_freq, sum(freq) as total_freq FROM :values GROUP By result; END;

Final code:

Code Snippet
1234567891011121314151617181920212223242526
-- START OF PSEUDOCODE -- Call references to the mapper and reducer functions -- Source table, tab, and calls of the mapper and reducer functions are the inputs -- to the MAP_REDUCE operator -- Logic is bounded by BEGIN and END -- END OF PSUEDOCODE BEGIN -- build a table variable to hold the result of the MAP_REDUCE operator DECLARE result table (c varchar(5000), stmt_freq int, total_freq int); -- populate the result table variable with the MAP_REDUCE of the mapper and reducer functions result = MAP_REDUCE(tab, mapper(tab.id, tab.sentence) GROUP BY c as X, reducer(X.c, X)); -- return the result to the user SELECT * FROM :result ORDER BY c; END;