Introduction to ETL
Introduction to ETL
Data and ETL
There is no data mining without data. For process mining, we analyze process-related data using a specialized mining tool obtain insights on process inefficiencies and improvements.
Process data is hidden in ERP systems that captures data in tables. This needs to first be extracted and then transformed into a specific format before they can be analyzed by a process mining tool such as SAP Signavio Process Intelligence.
With that in mind, these next sections will introduce you to the general concept on how to prepare transactional data for mining. This concept is called ETL and stands for:
- [Data] Extraction
- [Data] Transformation
- [Data] Load
So, what's data extraction? In terms of a business process, it means the retrieval of all business-related data within the system used for process mining.
Creating a data dump is easy. But what data is required and where is it stored? We need to ask ourselves this to identify the relevant data to extract.
- What process is it?
- Which IT-Systems are used?
- What is the timeframe?
- Which system-based activities (events) are executed in the process?
- Does all recorded activity have a timestamp?
- Are all activities tracked in the data system?
- What additional information is required for an analysis (e.g. type of product, order value, etc.)?
Minimum key requirements include a valid case with a case ID, event name identifier and a timestamp for each event. Ideally, the timeframe includes ALL records, but this can be a lot of data. Most companies will set the timeframe to a smaller parameter, such as 1 year.
Whenever a limited timeframe is considered, there will be incomplete cases since some may have fully executed in that timeframe. Now, you must ask whether they should be included from the extraction? These questions need to be answered before the data extraction.
Let's now take a look at how to extract our data. Follow these 3 steps to ensure the correct data gets extracted.
Data Extraction from Multiple Systems
The process can also be supported by multiple systems. In these cases, it's recommended to start small by extracting the data from one system to get your first results. More data can be included to expand the process in the next iteration.
If the data is difficult to extract (in case of external systems) or there is no unique identifier to track cases across the system, you can combine two values, such as order value and order time. You can also reduce the process timeframe if the ID can't be created.