Provisioning Customer Data Set for AI Workbench

Objective

After completing this lesson, you will be able to describe a customer dataset for AI Workbench models, consisting of synthetic sample customer data that yields meaningful Predictive Indicator values.

Lesson Overview

The purpose of this lesson is to explain how you can generate synthetic sample customer datasets that produce meaningful Predictive Indicator values after running the available AI Workbench models. You can skip this lesson if you already have a dataset that satisfies the minimum requirements for AI Workbench, such as production data or any other test customer data.

After completing this lesson, you will understand how input data shapes the results of running AI models. Both the type of data and the way it is distributed play a critical role in obtaining sharp Predictive Indicators that best reflect the most likely future customer behavior.

Building a Customer Data Set using a Generative AI Large Language Model (LLM) Prompt

What if we could delegate the generation of bulk synthetic sample customer data to generative AI? We can leverage any of the freely or commercially available mainstream chat-based LLMs to produce for us the bulk data we will use to run the predictive AI models. The best part of this approach is that, since we fully control the input data specification, we can ask the generative AI to produce source code that reflects the distributions we need to produce the expected Predictive Indicator results.

We cannot ask the LLM to generate the dataset itself, as the result would exceed the number of tokens the LLM can manage. Instead, we will ask the LLM to provide source code we can run locally to generate the customer data as plain text files in JSON format.

We will provide the LLM a sample JSON schema definition that applies to both profile and order data. We’ll keep these simple to minimize data ingestion volume and time.
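
To make the schemas concrete, here is a minimal sketch of what one profile record and one order record could look like. All values are invented for illustration (including the placeholder UUID); only the field names follow the specification below:

```python
import json

# Invented example values; field names follow the specification in this lesson.
sample_profile = {
    "firstName": "Jane",
    "lastName": "Doe",
    "gender": 2,  # 0 Unknown, 1 Male, 2 Female, 3 Nonbinary, 9 Not Determined
    "birthDate": "1985-04-12",
    "primaryEmail": "JaneDoe@email.com",
    "primaryPhone": "555-0100",
    "timestamp": "2024-05-01T10:15:00",
    "masterDataId": "1",
}

sample_order = {
    "currency": "USD",
    "tax": "4.50",       # 10% of the amount, converted to a string
    "amount": "45.00",   # order amount, converted to a string
    "productId": "42",   # integer between 1 and 100, converted to a string
    "timestamp": "2024-08-15T09:30:00",
    "id": "00000000-0000-4000-8000-000000000001",  # placeholder unique identifier
    "masterDataId": "1",  # links the order back to its profile
}

print(json.dumps(sample_profile, indent=4))
print(json.dumps(sample_order, indent=4))
```

Keeping both schemas this flat minimizes the ingestion volume while still exercising the fields the AI models need.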

Consider the following 10-item list of specifications for our sample dataset:

  1. Produce source code that generates bulk customer data
  2. To avoid data transformations, we will use JSON, one of the native formats for SAP Customer Data Platform
  3. Using one file for profiles and another for orders will give us ideal ingestion performance
  4. We need 50,000 profiles
  5. For each profile, we want the following attributes:
    1. firstName, matching the person’s gender
    2. lastName, avoiding duplicates of the combined firstName and lastName, so that each customer name is unique
    3. gender, with the following integer values: 0 for Unknown, 1 for Male, 2 for Female, 3 for Nonbinary, and 9 for Not Determined
    4. birthDate, a random JavaScript ISO date representing the birthday of someone between 18 and 90 years old
    5. primaryEmail, in the format firstName+lastName@email.com
    6. primaryPhone
    7. timestamp, as a random JavaScript ISO date set between one year old and 60 days old
    8. masterDataId, as a sequential number starting with 1, converted to a string
  6. Create orders with the following JSON schema:
    1. currency, set to USD
    2. tax, as 10% of the order amount, converted to a string
    3. amount, as a number converted to a string
    4. productId, as an integer between 1 and 100, also converted to a string
    5. timestamp, as a JavaScript ISO date
    6. id, as a unique identifier string
    7. masterDataId, value copied from the profile this order belongs to
  7. For each profile, generate orders split into two big distribution groups:
    1. For old orders:
      1. Assign each profile a random number of orders, between 5 and 10
      2. For each order amount, pick a random value between $20 and $80
      3. The timestamp is a random JavaScript ISO date between its profile’s timestamp and 60 days before the current date
    2. For recent orders:
      1. The timestamp is a random JavaScript ISO date up to 60 days in the past
      2. The first 10% of profiles should not have any orders
      3. The next 20% of profiles should have a single small order (less than $100)
      4. The next 30% of profiles should have between 4 to 8 small orders (each less than $100)
      5. The next 20% of profiles should have a single large order (more than $300) within the last 2 months
      6. The last 20% of profiles should have between 4 to 8 large orders (each more than $300)
  8. All JSON output has to be pretty printed
  9. The profiles file should be named profiles.json
  10. The orders file should be named orders.json
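
The recent-order percentages in item 7 translate into fixed index ranges over the 50,000 profiles. As a quick sanity check of that arithmetic (the bucket labels are our own shorthand, not part of the prompt):

```python
# Translate the recent-order percentages into index ranges over the profiles.
NUM_PROFILES = 50_000

shares = [
    ("no recent orders",         0.10),
    ("1 small order (<$100)",    0.20),
    ("4-8 small orders (<$100)", 0.30),
    ("1 large order (>$300)",    0.20),
    ("4-8 large orders (>$300)", 0.20),
]

counts = []
start = 0
for label, share in shares:
    count = round(NUM_PROFILES * share)
    counts.append(count)
    print(f"{label:<26} profiles {start:>6}..{start + count - 1:>6} ({count:,} profiles)")
    start += count
```

So the first 5,000 profiles get no recent orders, the next 10,000 get one small order, and so on, summing back to exactly 50,000.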

Producing a Customer Data Set using a Python Script

As a result of the prompt described in the previous section, we get the following Python script, which produces the profiles.json file containing the 50,000 synthetic customer profiles. Note that the script depends on the third-party Faker library for generating realistic names and phone numbers.

Code Snippet
import random
import uuid
import json
from datetime import datetime, timedelta
from faker import Faker

fake = Faker()

# Constants
NUM_PROFILES = 50000
EMAIL_DOMAIN = "@email.com"
GENDER_MAP = {0: "Unknown", 1: "Male", 2: "Female", 3: "Nonbinary", 9: "Not Determined"}
NOW = datetime.now()
ONE_YEAR_AGO = NOW - timedelta(days=365)
SIXTY_DAYS_AGO = NOW - timedelta(days=60)
EIGHTEEN_YEARS = timedelta(days=18 * 365)
NINETY_YEARS = timedelta(days=90 * 365)

# Generate unique names
unique_names = set()

def generate_unique_name(gender):
    while True:
        first = fake.first_name_male() if gender == 1 else fake.first_name_female() if gender == 2 else fake.first_name()
        last = fake.last_name()
        name = (first, last)
        if name not in unique_names:
            unique_names.add(name)
            return first, last

# Generate a random date between two datetimes
def random_date(start, end):
    return fake.date_time_between(start_date=start, end_date=end).isoformat()

# Generate profiles
profiles = []
profile_timestamps = []
for i in range(1, NUM_PROFILES + 1):
    gender = random.choice([0, 1, 2, 3, 9])
    first_name, last_name = generate_unique_name(gender)
    email = f"{first_name}{last_name}{EMAIL_DOMAIN}"
    birth_date = (NOW - timedelta(days=random.randint(18 * 365, 90 * 365))).date().isoformat()
    phone = fake.phone_number()
    timestamp = random_date(ONE_YEAR_AGO, SIXTY_DAYS_AGO)
    profile_timestamps.append(timestamp)
    profiles.append({
        "firstName": first_name,
        "lastName": last_name,
        "gender": gender,
        "birthDate": birth_date,
        "primaryEmail": email,
        "primaryPhone": phone,
        "timestamp": timestamp,
        "masterDataId": str(i)
    })

# Save profiles.json
profiles_path = "profiles.json"
with open(profiles_path, "w") as f:
    json.dump(profiles, f, indent=4)
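
Before generating orders, it can be worth sanity-checking the profiles the script produced. The helper below is our own addition (not part of the LLM output) and verifies a few of the constraints from the specification; in practice you would json.load the profiles.json file instead of the small in-memory sample shown here:

```python
from datetime import datetime, timedelta

# Our own sanity-check helper (not part of the LLM-generated script).
def validate_profiles(profiles):
    """Verify sequential masterDataId, unique names, and the timestamp window."""
    now = datetime.now()
    seen_names = set()
    for i, p in enumerate(profiles, start=1):
        # masterDataId must be a sequential number starting with 1, as a string
        assert p["masterDataId"] == str(i)
        # firstName + lastName combinations must be unique
        name = (p["firstName"], p["lastName"])
        assert name not in seen_names
        seen_names.add(name)
        # profile timestamp must be between one year old and 60 days old
        ts = datetime.fromisoformat(p["timestamp"])
        assert now - timedelta(days=365) <= ts <= now - timedelta(days=60)
    return True

# Tiny in-memory sample; in practice, load the real file instead:
#   with open("profiles.json") as f: profiles = json.load(f)
demo_profiles = [
    {"firstName": "Ada", "lastName": "Lovelace", "masterDataId": "1",
     "timestamp": (datetime.now() - timedelta(days=120)).isoformat()},
    {"firstName": "Alan", "lastName": "Turing", "masterDataId": "2",
     "timestamp": (datetime.now() - timedelta(days=200)).isoformat()},
]
print(validate_profiles(demo_profiles))  # → True
```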

For the orders, the LLM provided the following Python code, which generates them according to the specification.

Code Snippet
# Generate orders based on profile segments
orders = []
for idx, profile in enumerate(profiles):
    master_id = profile["masterDataId"]
    profile_timestamp = datetime.fromisoformat(profile["timestamp"])

    # Old orders (5-10 per profile, $20-$80, timestamp between profile ts and 60 days ago)
    for _ in range(random.randint(5, 10)):
        amount = round(random.uniform(20, 80), 2)
        tax = round(amount * 0.10, 2)
        timestamp = random_date(profile_timestamp, SIXTY_DAYS_AGO)
        orders.append({
            "currency": "USD",
            "tax": f"{tax:.2f}",
            "amount": f"{amount:.2f}",
            "productId": str(random.randint(1, 100)),
            "timestamp": timestamp,
            "id": str(uuid.uuid4()),
            "masterDataId": master_id
        })

    # Recent orders by segment
    segment = idx / NUM_PROFILES
    if segment < 0.10:
        continue  # No recent orders
    elif segment < 0.30:
        # 1 small order (<$100)
        amount = round(random.uniform(10, 99), 2)
        tax = round(amount * 0.10, 2)
        timestamp = random_date(SIXTY_DAYS_AGO, NOW)
        orders.append({
            "currency": "USD",
            "tax": f"{tax:.2f}",
            "amount": f"{amount:.2f}",
            "productId": str(random.randint(1, 100)),
            "timestamp": timestamp,
            "id": str(uuid.uuid4()),
            "masterDataId": master_id
        })
    elif segment < 0.60:
        # 4-8 small orders (<$100)
        for _ in range(random.randint(4, 8)):
            amount = round(random.uniform(10, 99), 2)
            tax = round(amount * 0.10, 2)
            timestamp = random_date(SIXTY_DAYS_AGO, NOW)
            orders.append({
                "currency": "USD",
                "tax": f"{tax:.2f}",
                "amount": f"{amount:.2f}",
                "productId": str(random.randint(1, 100)),
                "timestamp": timestamp,
                "id": str(uuid.uuid4()),
                "masterDataId": master_id
            })
    elif segment < 0.80:
        # 1 large order (>$300)
        amount = round(random.uniform(301, 600), 2)
        tax = round(amount * 0.10, 2)
        timestamp = random_date(SIXTY_DAYS_AGO, NOW)
        orders.append({
            "currency": "USD",
            "tax": f"{tax:.2f}",
            "amount": f"{amount:.2f}",
            "productId": str(random.randint(1, 100)),
            "timestamp": timestamp,
            "id": str(uuid.uuid4()),
            "masterDataId": master_id
        })
    else:
        # 4-8 large orders (>$300)
        for _ in range(random.randint(4, 8)):
            amount = round(random.uniform(301, 600), 2)
            tax = round(amount * 0.10, 2)
            timestamp = random_date(SIXTY_DAYS_AGO, NOW)
            orders.append({
                "currency": "USD",
                "tax": f"{tax:.2f}",
                "amount": f"{amount:.2f}",
                "productId": str(random.randint(1, 100)),
                "timestamp": timestamp,
                "id": str(uuid.uuid4()),
                "masterDataId": master_id
            })

# Save orders.json
orders_path = "orders.json"
with open(orders_path, "w") as f:
    json.dump(orders, f, indent=4)
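
Similarly, a quick hand-rolled check (again our own addition, not LLM output) can confirm how many recent orders each profile received, since that distribution is what drives the Predictive Indicator values:

```python
from collections import Counter
from datetime import datetime, timedelta

# Our own helper: count orders per profile whose timestamp falls
# within the last `days` days.
def recent_order_counts(orders, days=60):
    cutoff = datetime.now() - timedelta(days=days)
    counts = Counter()
    for order in orders:
        if datetime.fromisoformat(order["timestamp"]) >= cutoff:
            counts[order["masterDataId"]] += 1
    return counts

# Tiny in-memory sample; in practice, load orders.json instead.
now = datetime.now()
demo_orders = [
    {"masterDataId": "1", "timestamp": (now - timedelta(days=10)).isoformat()},
    {"masterDataId": "1", "timestamp": (now - timedelta(days=90)).isoformat()},  # too old
    {"masterDataId": "2", "timestamp": (now - timedelta(days=30)).isoformat()},
]
print(recent_order_counts(demo_orders))
```

Run against the full orders.json, the per-bucket counts should mirror the 10/20/30/20/20 percent split from the specification.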

The resulting profiles.json and orders.json files can now be ingested into a new Business Unit in the SAP Customer Data Platform console.
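
As a final optional check before ingestion, you can confirm that every order references an existing profile. A minimal sketch (our own helper, shown on a tiny in-memory sample):

```python
import json  # needed when loading the real files

# Our own helper: return order masterDataIds with no matching profile.
def check_referential_integrity(profiles, orders):
    profile_ids = {p["masterDataId"] for p in profiles}
    return {o["masterDataId"] for o in orders} - profile_ids

# Tiny in-memory sample; with the real files you would use:
#   with open("profiles.json") as f: profiles = json.load(f)
#   with open("orders.json") as f: orders = json.load(f)
demo_profiles = [{"masterDataId": "1"}, {"masterDataId": "2"}]
demo_orders = [{"masterDataId": "1"}, {"masterDataId": "3"}]
print(check_referential_integrity(demo_profiles, demo_orders))  # → {'3'}
```

An empty set means every order will match a profile during ingestion; any returned IDs point at orphaned orders worth fixing first.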