Orchestrate Processes via Scheduled Task Chains
The SAP HANA data warehousing scheduler (DWS) is a scheduling and process orchestration tool that comes with the Data Warehouse Foundation (DWF), which we introduced in Unit 1. You can use the DWS to:
- Define task chains to design dependencies of individual data load processes
- Use flexible and simple scheduling instructions
- Monitor job executions and logs
DWS provides a UI to design task chains. As in the flowgraph GUI and the calculation view GUI, you build workflows from nodes and the connections between them. The DWS then uses these task chains to control and orchestrate processes in the Data Warehouse. Task chains are based on Triggers, Tasks, and Collectors. Every task chain has a Trigger that starts the execution of its successor Tasks and Collectors. Tasks are the actual jobs to be executed. Collectors control the execution order of Tasks.
The following Task types can be used:
- Task chain: Execute another task chain
- Activate NDSO: Activate all requests of a given datastore
- Load via SQL: Write the result set of a query to a specified inbound table
- Load from a URL: Retrieve a CSV resource from the web and write it to a specified inbound table
- Execute flowgraph: Execute a flowgraph without parameters
- Execute Procedure: Execute a procedure (SAP HANA database procedures, for example)
- Clear Log: Delete all logs older than the entered parameter
The following Collector types can be used:
- First: The successor tasks of the First Collector are executed as soon as the collector is reached by one of its predecessor tasks for the first time. Use this collector when you run processes in parallel and want to schedule further, independent processes after them.
- Last: The successor tasks of the Last Collector are executed only after all of its predecessor tasks have reached the collector. Use this collector when you want to combine processes and further processing depends on all of these predecessors.
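The difference between the two Collector types can be illustrated with a small simulation. This is a toy model in Python, not the DWS implementation: the `Collector` class, the `release_points` helper, and the task names are hypothetical, and the model assumes that a collector releases its successors only once per run.

```python
from dataclasses import dataclass, field


@dataclass
class Collector:
    """Toy model of a collector node (hypothetical, not a DWS API)."""
    kind: str                      # "first" or "last"
    predecessors: frozenset[str]   # names of the tasks feeding this collector
    arrived: set[str] = field(default_factory=set)
    fired: bool = False

    def notify(self, task: str) -> bool:
        """Record that `task` finished; return True if successors start now."""
        if task not in self.predecessors:
            raise ValueError(f"{task!r} is not a predecessor of this collector")
        self.arrived.add(task)
        if self.fired:
            return False           # assumption: successors are released only once
        if self.kind == "first":
            self.fired = True      # the first arrival is enough
        elif self.kind == "last":
            # wait until every predecessor has reached the collector
            self.fired = self.arrived == self.predecessors
        else:
            raise ValueError(f"unknown collector kind: {self.kind!r}")
        return self.fired


def release_points(kind: str, finish_order: list[str]) -> list[str]:
    """Return the tasks whose completion released the collector's successors."""
    collector = Collector(kind, frozenset(finish_order))
    return [task for task in finish_order if collector.notify(task)]


finish_order = ["hub_a", "hub_b", "hub_c"]
print(release_points("first", finish_order))  # ['hub_a']
print(release_points("last", finish_order))   # ['hub_c']
```

With a First Collector the successors start when the earliest predecessor finishes; with a Last Collector they start only when the final predecessor finishes.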
Powerful Scheduling Options
To schedule your task chains, you can choose between immediate execution, a cron-based schedule, and a discrete schedule. Immediate schedules are executed right away. The cron-based definition uses a cron expression, known from the UNIX utility of the same name: a string of seven fields that defines specific dates and times. With the discrete schedule type, you manually pick a start and end time, an interval, and an interval unit.
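To make the discrete schedule type more concrete, the following Python sketch lists the run times that a start time, an end time, an interval, and an interval unit would produce. It is an illustration only: the function `discrete_runs`, the unit names, and the assumption that the end time is inclusive are not taken from the DWS and may differ from its actual behavior.

```python
from datetime import datetime, timedelta

# Hypothetical unit names; the units that DWS actually offers may differ.
UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def discrete_runs(start: datetime, end: datetime, interval: int, unit: str) -> list[datetime]:
    """List the run times from start to end, one every `interval` units."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    step = timedelta(seconds=interval * UNIT_SECONDS[unit])
    runs = []
    current = start
    while current <= end:          # assumption: the end time is inclusive
        runs.append(current)
        current += step
    return runs


runs = discrete_runs(
    start=datetime(2024, 1, 1, 6, 0),
    end=datetime(2024, 1, 1, 18, 0),
    interval=6,
    unit="hour",
)
for run in runs:
    print(run.isoformat())
# 2024-01-01T06:00:00
# 2024-01-01T12:00:00
# 2024-01-01T18:00:00
```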
Task Chain Examples
The task chain for the BID starts the flowgraphs that provision the BID tables with data from the staged tables. The important aspect is that the hub table flowgraphs are executed first, because they create the surrogate keys. The subsequent flowgraphs for the link and satellite tables depend on these surrogate keys and would fail if the surrogate key columns in the hub tables were empty. For this reason, it is very important to work with Collectors of the type Last in this task chain. The Last Collectors control the execution order of the Tasks within the task chain. The first Last Collector, placed after the hub table flowgraphs, makes sure that the link table flowgraphs start only once all hub table flowgraphs have finished. The second Last Collector, placed after the link table flowgraphs, does the same for the satellite table flowgraphs. The Last Collector after the satellite table flowgraphs is optional. Note that the screenshot below shows only a snippet of the whole task chain, which includes the execution of all hub, link, and satellite table flowgraphs.
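The role of the Last Collectors in this chain can be sketched as a sequence of stages, where each stage runs in parallel and the next stage waits for all tasks of the previous one. The Python sketch below is an analogy, not DWS code: `run_flowgraph` is a stand-in for an Execute flowgraph task, and the flowgraph names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait

log: list[str] = []


def run_flowgraph(name: str) -> str:
    """Stand-in for an Execute flowgraph task (hypothetical)."""
    log.append(name)   # list.append is thread-safe in CPython
    return name


# Hypothetical flowgraph names, grouped in the order described in the text.
stages = [
    ["hub_customer", "hub_product"],   # hubs create the surrogate keys
    ["link_customer_product"],         # links depend on the hub keys
    ["sat_customer", "sat_product"],   # satellites start after the links
]

with ThreadPoolExecutor() as pool:
    for stage in stages:
        futures = [pool.submit(run_flowgraph, name) for name in stage]
        wait(futures)                  # plays the role of a Last Collector
        for future in futures:
            future.result()            # re-raise a failure before the next stage

print(log)
```

Within a stage the order of execution is not fixed, but no link flowgraph starts before every hub flowgraph has finished, and no satellite flowgraph starts before the link flowgraph has finished.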
The orchestration task chain starts the execution of "child" task chains, which execute the flowgraphs that provision their respective DWH areas with data. There are child task chains for the areas Stage, BID, and RAW. Note the execution order of the task chains. It is important to execute the Stage task chain first, because the BID flowgraphs depend on the staged tables that the Stage flowgraphs produce. It is also important that the BID task chain is executed before the RAW task chain, because the BID tables produced by the BID flowgraphs contain the surrogate keys that the RAW flowgraphs need to execute successfully.
Comprehensive Monitoring and Logging
Each task chain execution is monitored and logged. The Data Warehouse Monitor shows comprehensive execution information, including the status of each execution and the corresponding log details.