Using SAP Cloud SDK for AI to Evaluate Prompts

Objective

After completing this lesson, you will be able to evaluate prompts for a larger dataset using functions built with the SAP Cloud SDK for AI.

After learning to develop and refine prompts in the previous lesson, the next critical step is to ensure their quality and reliability. In this lesson, we’ll explore how to establish a systematic evaluation framework for your prompts using the SAP Cloud SDK for AI. You’ll learn to automate the assessment of your LLM’s output, setting a crucial baseline for continuous improvement in your generative AI applications.

Evaluate Prompts Using the SAP Cloud SDK for AI

You’ve successfully developed a prompt that assigns urgency, sentiment, and categories to customer messages in JSON format. For the facility solutions company, however, the accuracy of these outputs matters: they directly affect customer-facing applications and operational decisions.

Guaranteeing this accuracy calls for automated, consistent evaluation of prompts across a variety of scenarios. This is where custom evaluation functions, implemented using the SAP Cloud SDK for AI, become indispensable.

The SAP Cloud SDK for AI provides key benefits for prompt evaluation:

  • Reliable Testing: It automates prompt testing across various scenarios, ensuring consistent and efficient results.
  • Performance Measurement: It provides objective metrics (for example, relevance, coherence, fluency) to quantitatively assess response quality.
  • Tailored Evaluations: It allows you to create custom evaluators that meet specific needs, enabling more precise and relevant assessments.
  • Scalable Results: It supports large-scale evaluations, making it easier to test prompts on extensive datasets.

Implementing Evaluation Functions

  1. Import the packages and define a rate-limited iterator.
    Python
    from tqdm.auto import tqdm
    import time

    class RateLimitedIterator:
        """Wrap an iterable and throttle iteration to a maximum rate."""

        def __init__(self, iterable, max_iterations_per_minute):
            self._iterable = iter(iterable)
            self._max_iterations_per_minute = max_iterations_per_minute
            # Minimum number of seconds that must elapse between two items.
            self._min_interval = 1.0 / (max_iterations_per_minute / 60.0)
            self._last_yield_time = None

        def __iter__(self):
            return self

        def __next__(self):
            current_time = time.time()
            if self._last_yield_time is not None:
                # Sleep until the minimum interval since the last item has passed.
                elapsed_time = current_time - self._last_yield_time
                if elapsed_time < self._min_interval:
                    time.sleep(self._min_interval - elapsed_time)
            self._last_yield_time = time.time()
            return next(self._iterable)

    The code defines a "RateLimitedIterator" class to control the rate at which you can iterate over an iterable. By capping the number of iterations per minute, it keeps the loop below a defined speed so that you avoid hitting API rate limits. It uses the "tqdm" library for progress visualization and the "time" module for timing control.
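
    To see the iterator in action, here is a minimal usage sketch (the item list and the rate are invented for illustration). At 30 iterations per minute, each step waits at least two seconds after the previous one:

    Python
    # Hypothetical usage: throttle any loop to a fixed rate.
    items = [f"mail_{i}" for i in range(5)]  # placeholder payloads
    for item in tqdm(RateLimitedIterator(items, max_iterations_per_minute=30), total=len(items)):
        pass  # call the model or extraction function here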

  2. Define an evaluation function.
    Python
    import json
    from typing import Callable, Dict

    def evaluation(mail: Dict[str, str], extract_func: Callable, _print=True, **kwargs):
        response = extract_func(input=mail["message"], _print=_print, **kwargs)
        result = {
            "is_valid_json": False,
            "correct_categories": False,
            "correct_sentiment": False,
            "correct_urgency": False,
        }
        try:
            pred = json.loads(response)
        except json.JSONDecodeError:
            result["is_valid_json"] = False
        else:
            result["is_valid_json"] = True
            # Score the categories by the share of labels that agree with the
            # ground truth; the symmetric difference (^) counts labels present
            # in only one of the two sets. `categories` is the full list of
            # possible categories defined in the previous lesson.
            result["correct_categories"] = 1 - (
                len(set(mail["ground_truth"]["categories"]) ^ set(pred["categories"]))
                / len(categories)
            )
            result["correct_sentiment"] = pred["sentiment"] == mail["ground_truth"]["sentiment"]
            result["correct_urgency"] = pred["urgency"] == mail["ground_truth"]["urgency"]
        return result

    evaluation(mail, f_8)

    This code evaluates the predictions that an extraction function makes for a single email message. It runs the provided function on the email's content and compares the result against predefined ground-truth data, checking for valid JSON and for correct categories, sentiment, and urgency. This verifies that the extraction function performs accurately and consistently.

    The final call, evaluation(mail, f_8), runs this check on the combined prompt function that we developed previously.
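
    For orientation, each mail record bundles the raw message with its ground-truth labels. A minimal sketch of such a record (all field values are invented for illustration; the actual records come from the dataset prepared earlier) could look like this:

    Python
    # Hypothetical mail record with the fields the evaluation function reads.
    example_mail = {
        "message": "Our HVAC system failed this morning; we need a technician today.",
        "ground_truth": {
            "urgency": "high",
            "sentiment": "negative",
            "categories": ["emergency repair services"],
        },
    }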

    (Screenshot: a prompt requesting a JSON response for an urgent HVAC repair request email; the response JSON includes urgency, sentiment, and categories, shown alongside the correct response.)

    The evaluation shows that all predictions are correct except for the sentiment.

  3. Implement an evaluation function for a large number of mails to support large-scale evaluations, making it easier to test prompts on extensive datasets.
    Python
    from tqdm.auto import tqdm

    def transpose_list_of_dicts(list_of_dicts):
        """Turn a list of dicts into a dict of lists, keyed by the shared keys."""
        keys = list_of_dicts[0].keys()
        transposed_dict = {key: [] for key in keys}
        for d in list_of_dicts:
            for key, value in d.items():
                transposed_dict[key].append(value)
        return transposed_dict

    def evaluation_full_dataset(dataset, func, rate_limit=100, _print=False, **kwargs):
        # Evaluate every mail, throttled to `rate_limit` calls per minute.
        results = [
            evaluation(mail, func, _print=_print, **kwargs)
            for mail in tqdm(RateLimitedIterator(dataset, rate_limit), total=len(dataset))
        ]
        results = transpose_list_of_dicts(results)
        # Average each metric over the dataset (True counts as 1, False as 0).
        n = len(dataset)
        for k, v in results.items():
            results[k] = sum(v) / n
        return results

    def pretty_print_table(data):
        # Get all row names (outer dict keys)
        row_names = list(data.keys())
        # Get all column names (inner dict keys)
        column_names = list(data[row_names[0]].keys()) if row_names else []
        # Calculate column widths
        column_widths = [
            max(
                len(str(column_name)),
                max(len(f"{data[row][column_name]:.2f}") for row in row_names),
            )
            for column_name in column_names
        ]
        row_name_width = max(len(str(row_name)) for row_name in row_names)
        # Print header
        header = f"{'':>{row_name_width}} " + " ".join(
            f"{column_name:>{width}}"
            for column_name, width in zip(column_names, column_widths)
        )
        print(header)
        print("=" * len(header))
        # Print one row of percentages per evaluated prompt variant
        for row_name in row_names:
            row = f"{row_name:>{row_name_width}} " + " ".join(
                f"{data[row_name][column_name]:>{width}.1%}"
                for column_name, width in zip(column_names, column_widths)
            )
            print(row)

    overall_result = {}

    This code performs the evaluation on the entire dataset: it runs each entry through the evaluation function with rate limiting, transposes the results for aggregation, and pretty-prints the final evaluation metrics as a table. It uses the "tqdm" library to show a progress bar, making it easy to track processing status.
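
    To make the aggregation step concrete, here is a small sketch (values invented for illustration) of what transpose_list_of_dicts does to two per-mail results before they are averaged:

    Python
    per_mail = [
        {"is_valid_json": True, "correct_urgency": True},
        {"is_valid_json": True, "correct_urgency": False},
    ]
    print(transpose_list_of_dicts(per_mail))
    # {'is_valid_json': [True, True], 'correct_urgency': [True, False]}
    # Averaging each list (True counts as 1) yields the final metrics,
    # for example correct_urgency = 1 / 2 = 50.0%.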

  4. Apply the full-dataset evaluation to the final combined prompt function.
    Python
    overall_result["basic--llama3.1-70b"] = evaluation_full_dataset(test_set_small, f_8)
    pretty_print_table(overall_result)

    Running this produces output similar to the following:

    Code Snippet
      0%|          | 0/20 [00:00<?, ?it/s]
                        is_valid_json correct_categories correct_sentiment correct_urgency
    ======================================================================================
    basic--llama3.1-70b        100.0%              83.5%             30.0%           70.0%

    These are the results for the basic prompt. The output can vary noticeably when you rerun the code or use different input examples. This sets the baseline for further improvement in prompt accuracy and relevance.
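
    Because overall_result is keyed by a label of your choice, you can evaluate further prompt variants on the same dataset and compare them side by side. In this sketch, f_9 stands for a hypothetical refined prompt function:

    Python
    # Hypothetical comparison of a refined prompt variant against the baseline.
    overall_result["refined--llama3.1-70b"] = evaluation_full_dataset(test_set_small, f_9)
    pretty_print_table(overall_result)  # one row of metrics per prompt variant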

The key takeaway so far in this learning journey is that we can create a basic prompt and then evaluate it on a dataset to set a baseline for further enhancement.

Lesson Summary

Developing a prompt is one part of solving a business problem; rigorous and automated evaluation is essential for enterprise-grade generative AI. By leveraging the SAP Cloud SDK for AI, you can implement custom evaluation functions to consistently measure prompt performance, validate output quality (like JSON format and accuracy of extracted entities), and apply these evaluations at scale across datasets. This systematic approach allows you to establish a clear baseline and iteratively enhance your prompts to meet critical business requirements.