Implementing Advanced Prompt Engineering Techniques

Objective

After completing this lesson, you will be able to design a systematic approach to developing and evaluating prompt engineering techniques, starting from a simple baseline.

Few-shot Prompting

Let's implement prompting techniques and then evaluate the results to see how the prompt responses improve.

We use the following code:

Python
# json, random, and partial are needed below; send_request, dev_set,
# option_lists, and mail come from the earlier steps of this learning journey.
import json
import random
from functools import partial

prompt_10 = """Your task is to extract and categorize messages. Here are some examples:
---
{{?few_shot_examples}}
---
Use the examples when extracting and categorizing the following message:
---
{{?input}}
---
Extract and return a json with the following keys and values:
- "urgency" as one of {{?urgency}}
- "sentiment" as one of {{?sentiment}}
- "categories" list of the best matching support category tags from: {{?categories}}
Your complete message should be a valid json string that can be read directly and only contain the keys mentioned in the list above. Never enclose it in ```json...```, no newlines, no unnecessary whitespaces.
"""

# Sample k examples from the development set; the fixed seed keeps the selection reproducible.
random.seed(42)
k = 3
examples = random.sample(dev_set, k)

example_template = """<example>
{example_input}

## Output
{example_output}
</example>"""

# Format each sampled example as its input message plus the expected JSON output, then join them.
examples = '\n---\n'.join([example_template.format(example_input=example["message"],
                                                   example_output=json.dumps(example["ground_truth"]))
                           for example in examples])

# Bind the prompt, the few-shot examples, and the label options to send_request.
f_10 = partial(send_request, prompt=prompt_10, few_shot_examples=examples, **option_lists)
response = f_10(input=mail["message"])

The code aims to create a prompt template to extract and categorize messages according to their urgency, sentiment, and support category tags. By using randomly selected examples from a development set, it generates a formatted few-shot learning prompt. The prompt is sent to a language model to process and categorize a given input message, and the overall performance of the model is then evaluated and displayed in a table format.

Here’s an expanded explanation for a few parts of the code:

  1. Setting the Random Seed: It sets a random seed using "random.seed(42)" to ensure that the random sampling of the examples is reproducible. This helps in maintaining consistency in experiments and evaluations.
  2. Sampling Examples: The variable "k" is set to 3, indicating the number of examples to sample from the "dev_set" dataset. The "random.sample(dev_set, k)" function selects three random examples from the development set.
  3. Formatting Examples: The selected examples are formatted into a template "example_template". Each example includes the input message and the expected output in JSON format. This formatted string is then joined using "\n---\n" to create a cohesive set of examples.
  4. Partial Function Application: The "partial" function is used to bind the generated prompt and examples to the "send_request" function, creating a function "f_10" that can be called with just the input message. This streamlines the process of sending requests to the model with the necessary context.
  5. Sending Request and Evaluating: The script sends the request with "f_10(input=mail["message"])", passing the message from "mail["message"]" as input. The result is then evaluated against a small test dataset, "test_set_small", and the evaluation results are stored in "overall_result["few_shot--llama3-70b"]".
  6. Output Display: Finally, the "pretty_print_table(overall_result)" function displays the evaluation results in a formatted table, making them easier to interpret (see the sketch after this list).
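
The code block above ends with the request for a single message; the evaluation and display steps described in points 5 and 6 follow the same pattern used for the later prompts in this lesson. Here is a minimal sketch, assuming evalulation_full_dataset and pretty_print_table were defined in the earlier baseline step:

Python
# Sketch only: evalulation_full_dataset and pretty_print_table come from the
# baseline-evaluation step earlier in this learning journey.
overall_result["few_shot--llama3-70b"] = evalulation_full_dataset(test_set_small, f_10)
pretty_print_table(overall_result)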

You can get the following output prompts:

A prompt for extracting and categorizing messages, along with an example message and its categorized output. The example message is an inquiry about training programs for an in-house maintenance team. The output categorizes the message under general_inquiries and training_and_support_requests, with a neutral sentiment and low urgency.
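
Based on that description, and purely as an illustration of the shape the prompt asks for (the exact values come from your own run), the returned JSON would look something like this:

Python
# Illustrative only: the labels below are taken from the description above;
# your own run may return different values.
{"urgency": "low", "sentiment": "neutral", "categories": ["general_inquiries", "training_and_support_requests"]}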

You can see an example prompt here:

An email template and its corresponding output in JSON format. The email is a request for help with training programs from a support team. The JSON output categorizes the request and indicates its sentiment and urgency.

This is another prompt example here:

An email requesting urgent HVAC system repair, highlighting the issue's impact on comfort and daily activities, and asking for immediate professional intervention.

You can see another example prompt and the response here.

A progress bar and a table with evaluation metrics for two models.

This is the output for evaluation after implementing few-shot prompting.

You can see improvement in sentiment and urgency assignment.

We established a baseline earlier, and now we can evaluate and compare the results of the refined prompts with the baseline using the test data.
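
The evalulation_full_dataset helper used for this comparison is defined in the baseline step and is not reproduced here. As a rough idea of what such a helper does, here is a minimal sketch that computes per-field accuracy, assuming each test record contains a "message" and a "ground_truth" dictionary and that the prompt returns a plain JSON string:

Python
import json

# Minimal sketch of a per-field accuracy evaluation; the lesson's actual helper
# may compute additional metrics and handle malformed responses.
def evaluate_per_field(test_set, f, fields=("urgency", "sentiment", "categories")):
    correct = {field: 0 for field in fields}
    for record in test_set:
        # f is one of the partial functions (f_10, f_12, ...) and is assumed to
        # return the model's raw JSON string.
        prediction = json.loads(f(input=record["message"]))
        for field in fields:
            # For "categories" this checks exact list equality, which is a simplification.
            if prediction.get(field) == record["ground_truth"][field]:
                correct[field] += 1
    return {field: correct[field] / len(test_set) for field in fields}

# Example usage:
# print(evaluate_per_field(test_set_small, f_10))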

Metaprompting

Here, we'll implement metaprompting to create detailed guides for tags such as urgency and sentiment, which can then be used in prompts.

We use the following code:

Python
example_template_metaprompt = """<example>
{example_input}

## Output
{key}={example_output}
</example>"""

prompt_get_guide = """Here are some examples:
---
{{?examples}}
---
Use the examples above to come up with a guide on how to distinguish between {{?options}} {{?key}}.
Use the following format:
```
### **<category 1>**
- <instruction 1>
- <instruction 2>
- <instruction 3>

### **<category 2>**
- <instruction 1>
- <instruction 2>
- <instruction 3>
...
```
When creating the guide:
- make it step-by-step instructions
- Consider that some labels in the examples might be incorrect
- Avoid including explicit information from the examples in the guide
The guide has to cover: {{?options}}
"""

guides = {}
for i, key in enumerate(["categories", "urgency", "sentiment"]):
    options = option_lists[key]
    # Format every example in the development set with its ground-truth label for this key.
    selected_examples_txt_metaprompt = '\n---\n'.join([example_template_metaprompt.format(example_input=example["message"],
                                                                                          key=key,
                                                                                          example_output=example["ground_truth"][key])
                                                       for example in dev_set])
    # Ask a stronger model (gpt-4o) to write a labeling guide for this key.
    guides[f"guide_{key}"] = send_request(prompt=prompt_get_guide,
                                          examples=selected_examples_txt_metaprompt,
                                          key=key,
                                          options=options,
                                          _print=False,
                                          _model='gpt-4o')

print(guides['guide_urgency'])

This code generates step-by-step guides for the keys "categories", "urgency", and "sentiment" from labeled examples in a dataset. It formats the examples from the development set using a dedicated template, then sends them to a model that writes a guide for distinguishing between the possible labels of each key, based on the patterns in those examples.

Here's a more detailed explanation:

  1. Template Definitions:

    • "example_template_metaprompt": Defines a template to format examples, specifying how to structure input and output within an example.
    • "prompt_get_guide": Outlines a prompt format to request the generation of a guide based on formatted examples. It also specifies the format and requirements for the guide, including making it a step-by-step instruction, accounting for possible incorrect labels, and avoiding explicit replication of the examples.
  2. Guide Preparation:

    • The script iterates over three keys: "categories", "urgency", and "sentiment".
    • For each key, it retrieves relevant options from "option_lists".
  3. Example Selection and Formatting: It formats examples from "dev_set" using the predefined template for each key, embedding the input message and corresponding ground truth.

  4. Guide Generation:

    • It sends a formatted prompt along with the examples to a model (gpt-4o), requesting the generation of a guide for distinguishing between the specified options for each key.
    • It stores the generated guides in a dictionary (guides), with each guide associated with its respective key (for example, "guide_categories", "guide_urgency", "guide_sentiment").

This process generates a dedicated instruction guide for each classification task, which supports more consistent categorization of the text data.

The last line of the code prints the guide for urgency.

You can see the following output:

Contains guidelines for determining the urgency of issues, categorized into High Urgency, Medium Urgency, and Low Urgency, with specific criteria for each level.

The guide describes three rules for each urgency category and can be used directly in a prompt.

We use the following code to utilize these guides in a prompt.

Python
prompt_12 = """Your task is to classify messages.
This is an explanation of `urgency` labels:
---
{{?guide_urgency}}
---
This is an explanation of `sentiment` labels:
---
{{?guide_sentiment}}
---
This is an explanation of `support` categories:
---
{{?guide_categories}}
---
Given the following message:
---
{{?input}}
---
Extract and return a json with the following keys and values:
- "urgency" as one of {{?urgency}}
- "sentiment" as one of {{?sentiment}}
- "categories" list of the best matching support category tags from: {{?categories}}
Your complete message should be a valid json string that can be read directly and only contain the keys mentioned in the list above. Never enclose it in ```json...```, no newlines, no unnecessary whitespaces.
"""

# Bind the prompt, the label options, and the generated guides to send_request.
f_12 = partial(send_request, prompt=prompt_12, **option_lists, **guides)
response = f_12(input=mail["message"])

The code prepares a prompt for classifying messages based on urgency, sentiment, and support categories by utilizing predefined guides generated through the metaprompt code. It then uses a partial function to send this prompt as a request with specific options and guides. Finally, it processes an email message to extract and return these classifications in a JSON format.
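
The double-star unpacking works because the keyword names match the {{?...}} placeholders in the template: option_lists supplies urgency, sentiment, and categories, while guides supplies guide_urgency, guide_sentiment, and guide_categories. The call below spells out what the unpacking expands to; it is equivalent to the f_12 definition above, not an additional step:

Python
# Equivalent to f_12 above, written out without the ** unpacking to show how
# the dictionary keys line up with the {{?...}} placeholders in prompt_12.
f_12_expanded = partial(
    send_request,
    prompt=prompt_12,
    urgency=option_lists["urgency"],                  # fills {{?urgency}}
    sentiment=option_lists["sentiment"],              # fills {{?sentiment}}
    categories=option_lists["categories"],            # fills {{?categories}}
    guide_urgency=guides["guide_urgency"],            # fills {{?guide_urgency}}
    guide_sentiment=guides["guide_sentiment"],        # fills {{?guide_sentiment}}
    guide_categories=guides["guide_categories"],      # fills {{?guide_categories}}
)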

See the following video for the output.

Let’s evaluate this prompt and its response using the following code:

Python
overall_result["metaprompting--llama3-70b"] = evalulation_full_dataset(test_set_small, f_12)
pretty_print_table(overall_result)

You can get the following output:

A table with evaluation metrics for three different models

Now, we see that accuracy for urgency has improved; however, accuracy for the other fields is similar, and even worse in the case of sentiment.

Combining Metaprompting and Few-shot Prompting

We can combine metaprompting and few-shot prompting using the following code:

Python
prompt_13 = """Your task is to classify messages. Here are some examples:
---
{{?few_shot_examples}}
---
This is an explanation of `urgency` labels:
---
{{?guide_urgency}}
---
This is an explanation of `sentiment` labels:
---
{{?guide_sentiment}}
---
This is an explanation of `support` categories:
---
{{?guide_categories}}
---
Given the following message:
---
{{?input}}
---
Extract and return a json with the following keys and values:
- "urgency" as one of {{?urgency}}
- "sentiment" as one of {{?sentiment}}
- "categories" list of the best matching support category tags from: {{?categories}}
Your complete message should be a valid json string that can be read directly and only contain the keys mentioned in the list above. Never enclose it in ```json...```, no newlines, no unnecessary whitespaces.
"""

# Combine the few-shot examples with the metaprompting guides in one request.
f_13 = partial(send_request, prompt=prompt_13, **option_lists, few_shot_examples=examples, **guides)
response = f_13(input=mail["message"])

This Python code creates a template prompt for a message classification task, specifying how to extract and return information about urgency, sentiment, and support categories in JSON format. The code uses this prompt to configure a function, "f_13", that analyzes a given input message and generates a structured JSON response, keeping the classification output consistent and machine-readable.

You can see that it combines the few-shot examples with the guides generated during metaprompting.

See the following video for the output.

Let’s evaluate this prompt and its response using the following code:

Python
overall_result["metaprompting_and_few_shot--llama3-70b"] = evalulation_full_dataset(test_set_small, f_13)
pretty_print_table(overall_result)

You can get the following output:

A performance evaluation table for different models.

Now, we see that accuracy improves for almost all fields except urgency, giving this prompt good overall accuracy. However, it is a more expensive prompt that consumes more resources.

Note

You may get a slightly different response from the one shown here, and the same applies to all the remaining model responses shown in this learning journey.

When you execute the same prompt on your machine, an LLM produces varying outputs due to its probabilistic nature, temperature setting, and nondeterministic architecture, which leads to different responses even with slight setting changes or internal state shifts.
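
Independently of any particular SDK, the effect of temperature can be illustrated with a few lines of code: temperature rescales the model's token scores before sampling, so any value above zero leaves probability mass on more than one token and allows different runs to pick different tokens.

Python
import numpy as np

def apply_temperature(logits, temperature):
    """Turn raw token scores into sampling probabilities at a given temperature."""
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())   # subtract the max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]                  # hypothetical scores for three candidate tokens
print(apply_temperature(logits, 0.2))     # low temperature: almost all mass on the top token
print(apply_temperature(logits, 1.0))     # temperature 1.0: flatter distribution, more run-to-run variation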

Evaluation Summary

We need to consider the overall accuracy and quality of a model along with its cost and scale.

At times, smaller models and simpler techniques may give better results.

In the preceding output, we can see that few-shot prompting delivers strong performance with a less expensive prompt.

Let's recap what we have done to solve the business problem so far:

  1. We created a basic prompt in SAP AI Launchpad using an open-source model.
  2. We recreated the prompt using generative-ai-hub-sdk to scale the solution.
  3. We created a baseline evaluation method for the simple prompt.
  4. Finally, we used techniques like few-shot prompting and metaprompting to further enhance the prompts.
  5. The results show improvement in the quality of prompt responses after implementing advanced techniques.

We'll study the costs associated with these techniques using other models in the next unit.
