What is Data Modeling?
Data modeling is the process of collecting, harmonizing and preparing data so that it is ready to use in business analytics.
Sounds complex? Let`s have a look at a real business example.
Mark from the sales department and his manager Kristina are discussing the sales figures of the current quarter which are displayed in the sales report. Mark is proud of the quantity of the products sold in Germany during the last quarter. But Kristina wants a view of the financial position. She would like to see the profit which is a calculation based on taking the overall sales revenue and deducting the costs. Mark agrees and is also keen to see how the sales figures of Germany compare to France. They both agree that adjustments need to be made to the quarterly sales report.
In this simple example, the requirements of the analytical report are defined by the business. But to meet these analytical requirements, it is first necessary to identify, prepare and to provide the required data. This is the responsibility of the data modeler. Usually, the source data is not ready for consumption by the analytical tools. The source of data is a data base table (or many tables) and this data is raw and usually requires tidying up and reformatting. For example, removing leading zeros from customer numbers, re-formatting the dates so they are consistent with the date formats of other countries, converting currencies and much more. A data modeler must carefully combine the new data with the existing data that is used in the report.
Kristina explains that the calculation of the overall profit consists of three steps: First, the sales figures are multiplied with the unit prices. A subsequent aggregation calculates the overall sales revenue. The subtraction of the total costs from the revenue generates the overall profit. These mathematical calculations need to be implemented through specific operations using the relevant set of data. Quite often, a simple calculation can easily be added to a report when the data that is needed for a calculation is already present in the report. In this case, this is simply an adjustment to the analytics, and not a data modeling task. But quite often, new calculations require complex data processing and often additional data that is not present in the existing report. That is when we need the skills of the data modeler who prepares the new data needed in the calculation.
In addition to the new profit calculation, Mark also requires another change to the quarterly sales report. He would like to see the results for Germany alongside the results for France by product, to enable an easy comparison. After identifying how to extract the French sales data and make it available, the data modeler then implements a join operation to bring together the data from the two countries.
All analytical requirements that come from the business rely on the skills of the data modeler who makes the data available.
A data modeler should have a basic understanding of databases and database development. This means that they should understand the objects in databases, such as tables, views, functions, procedures, and synonyms. They should understand modeling terms such as inner join, outer join, union, and star schema. A data modeler will usually have SQL skills, although with the use of graphical tools, this is less important. Even though the sources of data are very diverse, the principles of data modeling still apply.
Now, you have an idea of the role of data modeling. Let's dig a little deeper and describe some of the key tasks of a data modeler.
Watch the video to learn more about the key tasks.
Virtual versus Persistence Data Modeling
There are two distinct and complementary approaches to data modeling: the virtual data approach and the persistent data approach.
On the one hand, there is the virtualization approach which focuses mainly on reporting use cases of live, operational business data. The key aspect of this approach is real-time data access provided by an agile, data model that sits on the top of the raw, operational data. In the virtual approach, analytical data is not stored and is always generated on-demand.
Let us have a quick look at a simple purchasing example that illustrates what we mean by real-time reporting:
A purchasing clerk is mainly responsible for ordering raw materials, sub-components, or products to ensure that the deliveries are accurate. Besides the correct amount and quality of the deliveries it is the punctuality of the delivered materials that is critical. Delays of ordered components can cause production stops and increase the costs. To avoid that, there must be always enough material available and sub-quantities need to be increased with punctual reorders. To ensure that, the purchasing clerk needs to have a constant overview of the physical stock quantity in the warehouse. This critical figure is usually being displayed in a specific purchasing dashboard. This means that each physical change of the stock quantity needs to be mirrored in real-time in this dashboard. A timely delay between the physical change of the stock in the warehouse and the display in the dashboard could cause the negative impact mentioned above. Data modelers have the necessary set of tools and techniques to support that.
Now we come to the persistence approach. This approach is mainly used for strategic reporting where we need historical data to look back at trends and past performance so business decisions can be made to improve future performance. The persistence approach generates historical data snapshots that are later used for reporting purposes. In other words, the data modeler freezes the state of the data at a certain points in time to enable a historical view of data.
Here is a simple business example of reporting with historical data:
Let us take the same purchasing clerk who would now like to analyze the stock quantity trends over the last 8 years. This would require access to historical data that represent the stock position at any point in time for the last 8 years. For example, it should be possible to drill-down to the quantities of individual products as well as generating totals by region for groups of products.
Graphical Modeling versus Script-based Modeling
Data modelers have access to a wide range of tools and methods to prepare the data so that is meets the need of analytics. Data modelers either work with graphical tools or with script-based tools. Both options can be used side by side and many data modeling projects use a combination both graphical and script-based tools. The decision on which tools to use depends on many factors. These include the capability of the graphical tools and the experience of the data modeler. For example, do the data modelers know how to code in SQL? Often the choice of a tool is a strategic one, to ensure that a common tools set is used for all data modeling projects across an organization to promote consistency and sharing of team members across projects.
Custom Data Modeling versus Pre-delivered Content
Many data modeling solutions provide pre-delivered content. The advantage is that the data modeler has less work to do when the models are already created. However, the pre-delivered models rarely fulfill all business requirements. Custom data modeling is therefore needed. Custom data modeling is the creation of new data modeling content that is developed by the data modeler on top of the raw business data. Technically, there is no difference between custom data modeling content and the pre-delivered content and many projects use a combination of both.
Implementing Data Modeling Solutions
If you were able to scan a business report with an X-ray machine, you would see that all data that appears originates in database tables. So, why not just move the data directly to the report and skip the step of using a dedicated data modeling solution? Especially as many reporting tools also provide data preparation.
Below you can find some of the key benefits of using dedicated data modeling solutions. Watch the video to explore them.
Examples of Data Modeling Functions
A very important activity of data modeling is to express the relationships between the different data sources. The most common way to implement this is to use joins. In the demo below you will see two different source tables. Table 1 contains customer master data, including the customer id, customer name and country. Table 2 contains customer orders with one record for each order. The goal is to bring the data together to display all customer master data alongside each order. Watch the video to see the result.
When analyzing data, it is often necessary to filter the data to reduce the result set. Filtering data is often required when analyzing data, in order to reduce the result set. In the video below you will see that Table 3 contains customer and order-specific data for three countries (Germany, USA, Spain). The goal is to eliminate all data for all countries except Germany. Watch the video below to see the result.