Since we are talking about Artificial Intelligence, why not use a Large Language Model (LLM) to generate a sizable input dataset of synthetic customer Profile and Activity data, so we can see these models in action?
But before writing down the synthetic data requirements, let’s see one sample for Profile data:
{
"masterDataId": "1",
"firstName": "Tara",
"lastName": "Willis",
"gender": 9,
"birthDate": "1970-12-27",
"nationality": "KH",
"language": "sw",
"primaryEmail": "millsrobert@example.com",
"primaryPhone": "+1-337-249-3281x99476",
"country": "DM",
"timestamp": "2024-07-13T01:24:09.394159"
}
Note
gender is a numeric code taken from an extended version of the ISO 5218 table:
Code | Description |
---|---|
0 | Unknown |
1 | Male |
2 | Female |
3 | Nonbinary |
9 | Non Determined |
Now let’s see a sample for Order data:
{
"id": "50ad80ee-939d-439d-8f6c-1964b0220850",
"currency": "NAD",
"tax": 12.05,
"amount": 53.9,
"productId": "52",
"timestamp": "2024-10-18T17:34:50.513574",
"masterDataId": "1"
}
Keep in mind that LLMs can only return a limited amount of text per response. So instead of asking for all the synthetic data as one or more JSON payloads, we can ask the LLM to generate source code that in turn generates all the data we need. We then run that code on our own machines and upload the resulting JSON files to a cloud storage location used by the Application Source and its Events configured in the SAP Customer Data Platform Console.
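As a rough illustration of the "run the code locally, then upload the files" step, here is a minimal sketch that writes generated records to JSON files in chunks. The function name `write_json_chunks`, the file naming scheme, the chunk size, and the choice of one JSON array per file are all assumptions made for this example; check which file layout your Application Source actually expects before uploading.

```python
import json
from pathlib import Path


def write_json_chunks(records, out_dir, prefix, chunk_size=10_000):
    """Write records (a list of dicts) to numbered JSON files.

    Each file holds one JSON array with at most chunk_size records.
    This layout is an assumption, not a documented requirement.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        file_path = out_path / f"{prefix}_{start // chunk_size:04d}.json"
        with file_path.open("w", encoding="utf-8") as handle:
            json.dump(chunk, handle, indent=2)
        written.append(file_path)
    return written
```

Calling `write_json_chunks(profiles, "out", "profiles")` would produce files such as `out/profiles_0000.json`, which can then be uploaded with whatever tool your cloud storage provides.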
Your synthetic test data requirements may differ, but a good starter scenario for generating Profile data in a controlled way, one that gives a good example of how the Churn and CLV Models work, is as follows:
- Each profile includes a masterDataId, firstName, lastName, gender, birthDate, nationality, language, primaryEmail, primaryPhone, country, and a timestamp.
- The timestamp for each profile is generated to be between 1 year ago and 61 days ago from today.
- Gender values are assigned using the specified gender table.
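To make these requirements concrete, the following is a minimal sketch of the kind of generator an LLM might produce for Profiles. It uses only the Python standard library. The small value pools, the birth date range, the phone format, and all function names are assumptions for illustration; a generated script would likely use a fake-data library and much larger pools.

```python
import random
from datetime import datetime, timedelta

# Codes from the extended ISO 5218 gender table above.
GENDER_CODES = [0, 1, 2, 3, 9]

# Tiny illustrative value pools (assumptions, not part of the requirements).
FIRST_NAMES = ["Tara", "Robert", "Amina", "Chen", "Lucia"]
LAST_NAMES = ["Willis", "Mills", "Okafor", "Wang", "Rossi"]
COUNTRIES = ["KH", "DM", "DE", "US", "BR"]
LANGUAGES = ["sw", "en", "de", "pt", "km"]


def random_timestamp(rng, now, oldest_days, newest_days):
    """Return an ISO 8601 timestamp between oldest_days and newest_days before now."""
    start = now - timedelta(days=oldest_days)
    end = now - timedelta(days=newest_days)
    offset = rng.uniform(0, (end - start).total_seconds())
    return (start + timedelta(seconds=offset)).isoformat()


def make_profile(index, rng, now):
    """Build one profile dict with the attributes listed above."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # Assumed age range: roughly 18 to 80 years.
    birth = now.date() - timedelta(days=rng.randint(18 * 365, 80 * 365))
    return {
        "masterDataId": str(index),
        "firstName": first,
        "lastName": last,
        "gender": rng.choice(GENDER_CODES),
        "birthDate": birth.isoformat(),
        "nationality": rng.choice(COUNTRIES),
        "language": rng.choice(LANGUAGES),
        "primaryEmail": f"{first}.{last}{index}@example.com".lower(),
        "primaryPhone": f"+1-{rng.randint(200, 999)}-{rng.randint(200, 999)}-{rng.randint(0, 9999):04d}",
        "country": rng.choice(COUNTRIES),
        # Profile creation time: between 1 year ago and 61 days ago.
        "timestamp": random_timestamp(rng, now, 365, 61),
    }


def generate_profiles(count, seed=42, now=None):
    """Generate count profiles with sequential masterDataId values starting at 1."""
    rng = random.Random(seed)
    now = now or datetime.now()
    return [make_profile(i, rng, now) for i in range(1, count + 1)]


profiles = generate_profiles(1000)
```

A fixed seed keeps runs reproducible, which helps when you want to regenerate the same dataset after adjusting one requirement.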
The same goes for Orders. Here is a starter scenario for fake Orders data that can be used as input for the SAP Customer Data Platform AI Workbench Models:
- Each order includes currency, tax, amount, productId, timestamp, id, and masterDataId (matching the profile's masterDataId).
- The timestamp for each order is generated to be between 60 days ago and now.
- Orders are distributed across profiles as follows:
  - 10% of profiles have no orders.
  - 20% have one small order in the last 2 months.
  - 30% have 4 to 8 small orders in the last 2 months.
  - 20% have one large order in the last 2 months.
  - 20% have 4 to 8 large orders in the last 2 months.
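The order distribution above could be implemented along these lines. This is a sketch under stated assumptions: the amount ranges for "small" and "large" orders, the flat tax rate, the currency list, the product ID range, and the function names are not specified in the requirements and are chosen here purely for illustration.

```python
import random
import uuid
from datetime import datetime, timedelta

# (percent of profiles, min orders, max orders, order size), from the list above.
SEGMENTS = [
    (10, 0, 0, None),
    (20, 1, 1, "small"),
    (30, 4, 8, "small"),
    (20, 1, 1, "large"),
    (20, 4, 8, "large"),
]

# Assumed values: the requirements do not define "small", "large", or a tax rate.
AMOUNT_RANGES = {"small": (5.0, 100.0), "large": (500.0, 5000.0)}
TAX_RATE = 0.2
CURRENCIES = ["USD", "EUR", "NAD"]


def make_order(master_data_id, size, rng, now):
    """Build one order placed between 60 days ago and now."""
    low, high = AMOUNT_RANGES[size]
    amount = round(rng.uniform(low, high), 2)
    placed = now - timedelta(seconds=rng.uniform(0, 60 * 24 * 3600))
    return {
        # Seeded UUID so the whole dataset is reproducible.
        "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "currency": rng.choice(CURRENCIES),
        "tax": round(amount * TAX_RATE, 2),
        "amount": amount,
        "productId": str(rng.randint(1, 100)),
        "timestamp": placed.isoformat(),
        "masterDataId": master_data_id,
    }


def generate_orders(master_data_ids, seed=42, now=None):
    """Assign profiles to segments by percentage and generate their orders."""
    rng = random.Random(seed)
    now = now or datetime.now()
    ids = list(master_data_ids)
    rng.shuffle(ids)  # avoid tying a segment to a contiguous ID range
    orders = []
    start = 0
    cumulative = 0
    for percent, min_orders, max_orders, size in SEGMENTS:
        cumulative += percent
        end = cumulative * len(ids) // 100
        for master_data_id in ids[start:end]:
            for _ in range(rng.randint(min_orders, max_orders)):
                orders.append(make_order(master_data_id, size, rng, now))
        start = end
    return orders


orders = generate_orders([str(i) for i in range(1, 1001)])
```

Working with cumulative integer percentages keeps the segment boundaries exact, so every profile lands in exactly one segment even when the profile count is not a multiple of 10.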
As for volumes and timestamps for Profiles and Orders, one possibility is to ask the LLM for a dataset of 50,000 customer Profiles created between 1 year ago and 61 days ago. Some of these will have a number of Orders with low or high amounts, all placed between 60 days ago and today. Also, don’t forget to give the LLM the Profile and Order schema attributes.