Introduction to Data Transformation
Data transformation is the process of changing the format, structure, or values of raw data from ERP systems so that a file can be uploaded to the process mining system. This final file is called an event log and includes all recorded events with their timestamps, each assigned to a case ID.
This data transformation is typically achieved through:
- Translation and mapping
- Filtering, aggregation, and summarization
- Enrichment and imputation
- Indexing and ordering
- Anonymization and encryption
- Modeling, typecasting, formatting, and renaming
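A few of these techniques can be illustrated with a minimal Python sketch. The column names (DOC_NO, USR, TS, STATUS), the meaning of the "X" flag, and the sample rows are all hypothetical and chosen only for illustration. The hashing step is a simple pseudonymization, not a complete anonymization scheme.

```python
import hashlib
from datetime import datetime

# Hypothetical raw rows as they might come out of a source table.
raw_rows = [
    {"DOC_NO": "4500000002", "USR": "jsmith", "TS": "2024-03-02 09:15:00", "STATUS": "X"},
    {"DOC_NO": "4500000001", "USR": "mlee", "TS": "2024-03-01 14:30:00", "STATUS": ""},
    {"DOC_NO": "4500000003", "USR": "jsmith", "TS": "2024-03-01 08:00:00", "STATUS": ""},
]

# Renaming: map (assumed) source column names to target column names.
COLUMN_MAP = {"DOC_NO": "case_id", "USR": "user", "TS": "timestamp"}


def pseudonymize(value: str) -> str:
    """Replace a user name with a short, stable hash value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


def transform(rows):
    result = []
    for row in rows:
        # Filtering: skip rows flagged with "X" (assumed to mean "deleted").
        if row["STATUS"] == "X":
            continue
        out = {target: row[source] for source, target in COLUMN_MAP.items()}
        # Typecasting: turn the timestamp string into a datetime object.
        out["timestamp"] = datetime.strptime(out["timestamp"], "%Y-%m-%d %H:%M:%S")
        # Pseudonymization of the user name.
        out["user"] = pseudonymize(out["user"])
        result.append(out)
    # Ordering: sort the remaining rows chronologically.
    result.sort(key=lambda r: r["timestamp"])
    return result


transformed = transform(raw_rows)
```

After the call, transformed holds the two rows that were not flagged, renamed to the target columns and sorted by timestamp.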
Actions in the System Leave a Trace to Follow
Every step in a system is recorded and leaves a trace. In a procure-to-pay example, the steps act on certain business objects in the system:
- Purchase Requisition (PR)
- Purchase Order (PO)
All changes and transactions referring to these objects are stored in a database. Now, with Process Intelligence, those details can be explored. They get extracted and transformed in a way that allows backtracking of all the steps. Finally, those recreated steps are stored in an event log.
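To make the idea of recreating steps more concrete, the following Python sketch turns a handful of change records into an event log and then reads back the sequence of steps per case. The record layout, the activity names, and the case IDs are assumptions made for this example and do not reflect the structure of any particular source system.

```python
from collections import defaultdict

# Hypothetical change records: which object changed, how, and when.
change_records = [
    {"case_id": "C1", "object": "PO", "action": "created", "timestamp": "2024-03-03T10:00:00"},
    {"case_id": "C1", "object": "PR", "action": "created", "timestamp": "2024-03-01T09:00:00"},
    {"case_id": "C2", "object": "PR", "action": "created", "timestamp": "2024-03-02T11:00:00"},
    {"case_id": "C1", "object": "PR", "action": "approved", "timestamp": "2024-03-02T15:00:00"},
]

# Assumed mapping from (object, action) to a readable activity name.
ACTIVITY_NAMES = {
    ("PR", "created"): "Create Purchase Requisition",
    ("PR", "approved"): "Approve Purchase Requisition",
    ("PO", "created"): "Create Purchase Order",
}


def build_event_log(records):
    """Convert change records into events sorted by case and time."""
    events = [
        {
            "case_id": r["case_id"],
            "activity": ACTIVITY_NAMES[(r["object"], r["action"])],
            "timestamp": r["timestamp"],
        }
        for r in records
    ]
    # ISO 8601 strings in one uniform format sort chronologically as text.
    events.sort(key=lambda e: (e["case_id"], e["timestamp"]))
    return events


def traces(event_log):
    """Return the ordered list of activities for each case."""
    result = defaultdict(list)
    for event in event_log:
        result[event["case_id"]].append(event["activity"])
    return dict(result)


event_log = build_event_log(change_records)
case_traces = traces(event_log)
```

Here case_traces["C1"] lists the three steps of case C1 in the order in which they happened, even though the records arrived unordered.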
Why Does the Data Need to be Transformed?
All process data is stored in tables in a database. For analysis purposes, it is important that the data is uniform and standardized. There can be differences in the data, especially if it comes from different source systems (e.g., different data formats or data types). Typically, the data is aligned to a specific target format.
Steps for Data Transformation
- Definition of the target format
- Conversion of the extracted data
- Saving the converted data into a new file
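The three steps above can be sketched in Python as follows. The target columns, the two source timestamp formats, and the sample rows are assumptions for illustration. The example writes to an in-memory buffer. A real file opened with open(path, "w", newline="") could be passed to save in the same way.

```python
import csv
import io
from datetime import datetime

# Step 1: define the target format (assumed columns and timestamp layout).
TARGET_COLUMNS = ["case_id", "activity", "timestamp"]
TARGET_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

# Timestamp formats assumed to occur in the different source systems.
SOURCE_TIME_FORMATS = ["%d.%m.%Y %H:%M", "%Y-%m-%d %H:%M:%S"]


def to_target_timestamp(value: str) -> str:
    """Try each known source format and return the target representation."""
    for fmt in SOURCE_TIME_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime(TARGET_TIME_FORMAT)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {value!r}")


def convert(row: dict) -> dict:
    """Step 2: convert one extracted row into the target format."""
    return {
        "case_id": str(row["case_id"]),
        "activity": row["activity"].strip(),
        "timestamp": to_target_timestamp(row["timestamp"]),
    }


def save(rows, file_obj) -> None:
    """Step 3: write the converted rows as CSV to a file-like object."""
    writer = csv.DictWriter(file_obj, fieldnames=TARGET_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical extracted rows from two systems with different conventions.
extracted = [
    {"case_id": 1001, "activity": " Create Purchase Order ", "timestamp": "01.03.2024 14:30"},
    {"case_id": "1002", "activity": "Create Purchase Order", "timestamp": "2024-03-02 09:15:00"},
]

converted = [convert(row) for row in extracted]
buffer = io.StringIO()
save(converted, buffer)
csv_text = buffer.getvalue()
```

Raising an error for unknown timestamp formats, rather than silently skipping the row, is a design choice of this sketch. It makes unexpected source data visible early.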
Transformation is also necessary because the data is spread across different tables. The extracted data must be linked to its specific case. Otherwise, how would a system know that Order ID 123 in the order table and Invoice ID 456 in the invoice table belong to the same case?
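One common way to answer this question is a reference column that connects the two tables. The sketch below assumes that each invoice row carries the ID of the order it belongs to. That column, the table layouts, and the second sample invoice are hypothetical.

```python
# Hypothetical order and invoice tables. The "order_id" column in the
# invoice table is the assumed link between the two.
orders = [
    {"order_id": 123, "created_at": "2024-03-01T10:00:00"},
    {"order_id": 124, "created_at": "2024-03-02T10:00:00"},
]
invoices = [
    {"invoice_id": 456, "order_id": 123, "posted_at": "2024-03-05T16:00:00"},
    {"invoice_id": 457, "order_id": 124, "posted_at": "2024-03-06T12:00:00"},
]


def build_events(order_rows, invoice_rows):
    """Create events from both tables, using the order ID as the case ID."""
    events = []
    for order in order_rows:
        events.append({
            "case_id": order["order_id"],
            "activity": "Create Order",
            "timestamp": order["created_at"],
        })
    for invoice in invoice_rows:
        events.append({
            # The reference column assigns the invoice to the order's case.
            "case_id": invoice["order_id"],
            "activity": "Post Invoice",
            "timestamp": invoice["posted_at"],
        })
    events.sort(key=lambda e: (e["case_id"], e["timestamp"]))
    return events


events = build_events(orders, invoices)
case_123 = [e["activity"] for e in events if e["case_id"] == 123]
```

With this link in place, the creation of Order 123 and the posting of Invoice 456 end up in the same case.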
What's your Case?
Defining the correct case identifier (ID) is one of the most important points in data transformation. The case ID defines the scope of the process. It determines where the process starts and ends. In a procurement process, if the case ID is defined by the purchase requisition ID, every single request is considered a new case, regardless of whether multiple requests are later combined into one order.
If the case ID is defined by the order ID, the data set will contain all orders as cases, regardless of their underlying purchase requests. A combination of both would also lead to one case for every purchase request. In the end, the answer depends on which business object or document should be analyzed in terms of its lifecycle.
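The effect of the case ID choice can be shown with a small Python sketch. The link data is hypothetical: three purchase requests, two of which are combined into the same order.

```python
# Hypothetical links between purchase requests (PR) and orders (PO).
# PR-1 and PR-2 are combined into PO-10; PR-3 leads to PO-11.
links = [
    ("PR-1", "PO-10"),
    ("PR-2", "PO-10"),
    ("PR-3", "PO-11"),
]


def count_cases(link_rows, case_key):
    """Count distinct cases for a given case ID definition."""
    return len({case_key(link) for link in link_rows})


# Case ID = purchase request ID: one case per request.
cases_by_request = count_cases(links, lambda link: link[0])

# Case ID = order ID: one case per order.
cases_by_order = count_cases(links, lambda link: link[1])

# Case ID = combination of both: again one case per request.
cases_by_both = count_cases(links, lambda link: (link[0], link[1]))
```

For this sample data, the request-based and combined definitions each yield three cases, while the order-based definition yields two.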
The last part of ETL (extract, transform, load) is the Data Load phase. It covers the tasks needed to upload the transformed data into the process mining system. The following points need to be addressed for your data load.