In this lesson, you will discover how to improve the intelligence and precision of LLM responses through advanced prompt engineering techniques. Building upon the baseline evaluations established previously, you’ll learn to implement powerful strategies like Few-shot Prompting and Meta-prompting using the SAP Cloud SDK for AI, and observe their impact on improving the quality and accuracy of your generative AI applications.
Few-Shot Prompting
Let's implement few-shot prompting and then evaluate the results to see how the prompt quality improves.
We use the following code:
prompt_10 = Template(
    messages=[
        SystemMessage(
            """You are an intelligent assistant. Your task is to extract and categorize messages. Here are some examples:
{{?few_shot_examples}}
Use the examples when extracting and categorizing the following message:
Extract and return a json with the following keys and values:
- "urgency" as one of {{?urgency}}
- "sentiment" as one of {{?sentiment}}
- "categories" list of the best matching support category tags from: {{?categories}}
Your complete message should be a valid json string that can be read directly and only contain the keys mentioned in the list above. Never enclose it in ```json...```, no newlines, no unnecessary whitespaces."""),
        UserMessage("{{?input}}")
    ]
)
import json
import random
from functools import partial

random.seed(42)  # make the example sampling reproducible
k = 3            # number of few-shot examples to sample
examples = random.sample(dev_set, k)
example_template = """<example>
{example_input}
## Output
{example_output}
</example>"""
examples = '\n---\n'.join([
    example_template.format(
        example_input=example["message"],
        example_output=json.dumps(example["ground_truth"])
    )
    for example in examples
])
f_10 = partial(send_request, prompt=prompt_10, few_shot_examples=examples, **option_lists)
response = f_10(input=mail["message"])
overall_result["few_shot--llama3.1-70b"] = evalulation_full_dataset(test_set_small, f_10)
pretty_print_table(overall_result)
The code aims to create a prompt template to extract and categorize messages according to their urgency, sentiment, and support category tags. By using randomly selected examples from a development set, it generates a formatted few-shot learning prompt. The prompt is sent to a language model to process and categorize a given input message, and the overall performance of the model is then evaluated and displayed in a table format.
Here’s an expanded explanation for a few parts of the code:
- Setting the Random Seed: It sets a random seed using "random.seed(42)" to ensure that the random sampling of the examples is reproducible. This helps in maintaining consistency in experiments and evaluations.
- Sampling Examples: The variable "k" is set to 3, indicating the number of examples to sample from the "dev_set" dataset. The "random.sample(dev_set, k)" function selects three random examples from the development set.
- Formatting Examples: The selected examples are formatted using the template "example_template". Each example includes the input message and the expected output in JSON format. The formatted strings are then joined using "\n---\n" to create a cohesive set of examples (see the rendering sketch after this list).
- Partial Function Application: The "partial" function is used to bind the generated prompt and examples to the "send_request" function, creating a function "f_10" that can be called with just the input message. This streamlines the process of sending requests to the model with the necessary context.
- Sending Request and Evaluating: The script sends a request using "f_10(input=mail["message"])", passing the input message from "mail["message"]". The prompt is then evaluated against a small test dataset "test_set_small", and the evaluation results are stored in "overall_result["few_shot--llama3.1-70b"]".
- Output Display: Finally, the "pretty_print_table(overall_result)" function is used to display the evaluation results in a formatted table, making it easier to interpret the results.
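To make the formatting step concrete, the following sketch renders a single example through example_template. The sample entry and its labels are hypothetical; your dev_set entries will differ, but they must provide the "message" and "ground_truth" keys used above.

import json

# Hypothetical dev_set-style entry, used only for illustration.
sample = {
    "message": "My VPN keeps disconnecting and I cannot join customer calls.",
    "ground_truth": {"urgency": "high", "sentiment": "negative", "categories": ["network"]},
}

rendered = example_template.format(
    example_input=sample["message"],
    example_output=json.dumps(sample["ground_truth"]),
)
print(rendered)
# <example>
# My VPN keeps disconnecting and I cannot join customer calls.
# ## Output
# {"urgency": "high", "sentiment": "negative", "categories": ["network"]}
# </example>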
Response Example:
0%|          | 0/20 [00:00<?, ?it/s]

                         is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================
basic--llama3.1-70b             100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b          100.0%               84.0%              50.0%            90.0%

This is the output for evaluation after implementing few-shot prompting.
You can see improvement in sentiment and urgency assignment.
We established a baseline earlier, and now we can evaluate and compare the results of the refined prompts with the baseline using the test data.
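The evalulation_full_dataset and pretty_print_table helpers come from the baseline lesson. For orientation, here is a minimal sketch of what such an evaluation helper could look like; the metric definitions, the assumption that the send function returns the raw model reply as a JSON string, and the Jaccard-style category score are illustrative assumptions, not the course implementation.

import json
from statistics import mean

def evaluation_full_dataset_sketch(dataset, send_fn):
    """Illustrative stand-in for the evalulation_full_dataset helper from the baseline lesson."""
    scores = {"is_valid_json": [], "correct_categories": [], "correct_sentiment": [], "correct_urgency": []}
    for item in dataset:
        reply = send_fn(input=item["message"])  # assumed to return the raw model reply as a string
        try:
            prediction = json.loads(reply)
            scores["is_valid_json"].append(1.0)
        except (TypeError, json.JSONDecodeError):
            prediction = {}
            scores["is_valid_json"].append(0.0)
        if not isinstance(prediction, dict):
            prediction = {}
        truth = item["ground_truth"]
        scores["correct_sentiment"].append(float(prediction.get("sentiment") == truth["sentiment"]))
        scores["correct_urgency"].append(float(prediction.get("urgency") == truth["urgency"]))
        predicted = set(prediction.get("categories", []))
        expected = set(truth["categories"])
        union = predicted | expected
        scores["correct_categories"].append(len(predicted & expected) / len(union) if union else 1.0)
    return {metric: mean(values) for metric, values in scores.items()}

A companion such as pretty_print_table would then, presumably, format these fractions as the percentage table shown in the outputs.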
Meta-prompting
Here we'll implement meta-prompting to create detailed guides for each label type, such as urgency, sentiment, and support categories, that can then be embedded in prompts.
We use the following code:
example_template_metaprompt = """<example>
{example_input}
## Output
{key}={example_output}
</example>"""

prompt_get_guide = Template(
    messages=[
        SystemMessage(
            """Here are some examples:
---
{{?examples}}
---
Use the examples above to come up with a guide on how to distinguish between {{?options}} {{?key}}.
Use the following format:
```
### **<category 1>**
- <instruction 1>
- <instruction 2>
- <instruction 3>
### **<category 2>**
- <instruction 1>
- <instruction 2>
- <instruction 3>
...
```
When creating the guide:
- make it step-by-step instructions
- Consider that some labels in the examples might be incorrect
- Avoid including explicit information from the examples in the guide
The guide has to cover: {{?options}}
"""
        ),
        UserMessage("{{?input}}")
    ]
)

guides = {}
for key in ["categories", "urgency", "sentiment"]:
    options = option_lists[key]
    selected_examples_txt_metaprompt = '\n---\n'.join([
        example_template_metaprompt.format(
            example_input=example["message"],
            key=key,
            example_output=example["ground_truth"][key]
        )
        for example in dev_set
    ])
    guides[f"guide_{key}"] = send_request(
        prompt=prompt_get_guide,
        examples=selected_examples_txt_metaprompt,
        input=selected_examples_txt_metaprompt,
        key=key,
        options=options,
        _print=False,
        _model='gpt-4o'
    )
print(guides['guide_urgency'])
This code generates step-by-step guides for the "categories", "urgency", and "sentiment" labels from the labeled examples in the development set.
It formats the examples using a dedicated template, sends them to a model, and asks it to produce step-by-step instructions for distinguishing between the possible labels. The resulting guides can later be embedded in prompts to help the model classify messages consistently.
Detailed explanation:
Template Definitions:
- "example_template_metaprompt": Defines a template to format examples, specifying how to structure input and output within an example.
- "prompt_get_guide": Outlines a prompt format to request the generation of a guide based on formatted examples. It also specifies the format and requirements for the guide, including making it a step-by-step instruction, accounting for possible incorrect labels, and avoiding explicit replication of the examples.
Guide Preparation:
- The script iterates over three keys: "categories", "urgency", and "sentiment".
- For each key, it retrieves relevant options from "option_lists".
Example Selection and Formatting: It formats examples from "dev_set" using the predefined template for each key, embedding the input message and corresponding ground truth.
Guide Generation:
- It sends a formatted prompt along with the examples to a model (gpt-4o), requesting the generation of a guide for distinguishing between the specified options for each key.
- It stores the generated guides in a dictionary (guides), with each guide associated with its respective key (for example, "guide_categories", "guide_urgency", "guide_sentiment").
This process ensures that comprehensive and accurate instruction guides are generated for different classification tasks, facilitating the correct categorization of text data.
The last line of the code prints the guide for urgency.
You will see the guide describing rules for each urgency category that can be used in a prompt.
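A generated urgency guide might look roughly like the following. This is a hand-written illustration, not actual model output; the label names are assumptions and your generated guide will differ.

### **high**
- Check whether the message describes a blocking issue, an outage, or a hard deadline.
- Look for explicit requests for immediate action or escalation.
### **medium**
- Look for problems that slow work down but have a workaround.
...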
We use the following code to utilize these guides in a prompt.
prompt_12 = Template(
    messages=[
        SystemMessage(
            """You are an intelligent assistant. Your task is to classify messages.
This is an explanation of `urgency` labels:
---
{{?guide_urgency}}
---
This is an explanation of `sentiment` labels:
---
{{?guide_sentiment}}
---
This is an explanation of `support` categories:
---
{{?guide_categories}}
---
Given the following message:
Extract and return a json with the following keys and values:
- "urgency" as one of {{?urgency}}
- "sentiment" as one of {{?sentiment}}
- "categories" list of the best matching support category tags from: {{?categories}}
Your complete message should be a valid json string that can be read directly and only contain the keys mentioned in the list above. Never enclose it in ```json...```, no newlines, no unnecessary whitespaces.
"""
        ),
        UserMessage("{{?input}}")
    ]
)
f_12 = partial(send_request, prompt=prompt_12, **option_lists, **guides)
response = f_12(input=mail["message"])
The code updates the system role in the prompt for classifying messages based on urgency, sentiment, and support categories by utilizing predefined guides generated through the meta-prompt code. It then uses a partial function to send this prompt as a request with specific options and guides. Finally, it processes an email message to extract and return these classifications in a JSON format.
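To see why the unpacking works, note that the keys of the two dictionaries match the template placeholders one-to-one; the short check below (assuming the guides dictionary built in the previous step) makes that explicit.

# **option_lists supplies {{?urgency}}, {{?sentiment}} and {{?categories}},
# while **guides supplies {{?guide_urgency}}, {{?guide_sentiment}} and {{?guide_categories}}.
print(sorted(guides))
# ['guide_categories', 'guide_sentiment', 'guide_urgency']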
Evaluate this prompt and its response, using the following code:
overall_result["metaprompting--llama3.1-70b"] = evalulation_full_dataset(test_set_small, f_12)
pretty_print_table(overall_result)
You can get the following output:
0%|          | 0/20 [00:00<?, ?it/s]

                              is_valid_json  correct_categories  correct_sentiment  correct_urgency
==============================================================================================
basic--llama3.1-70b                  100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b               100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b          100.0%               90.0%              30.0%            95.0%

Now we see that accuracy for urgency and categories has improved; however, sentiment accuracy is reduced compared to few-shot prompting.
Combining Meta-prompting and Few-shot Prompting
We can combine meta-prompting and few-shot prompting using the following code:
prompt_13 = Template(
    messages=[
        SystemMessage(
            """You are an intelligent assistant. Your task is to classify messages.
Here are some examples:
---
{{?few_shot_examples}}
---
This is an explanation of `urgency` labels:
---
{{?guide_urgency}}
---
This is an explanation of `sentiment` labels:
---
{{?guide_sentiment}}
---
This is an explanation of `support` categories:
---
{{?guide_categories}}
---
Given the following message:
Extract and return a json with the following keys and values:
- "urgency" as one of {{?urgency}}
- "sentiment" as one of {{?sentiment}}
- "categories" list of the best matching support category tags from: {{?categories}}
Your complete message should be a valid json string that can be read directly and only contain the keys mentioned in the list above. Never enclose it in ```json...```, no newlines, no unnecessary whitespaces.
"""
        ),
        UserMessage("{{?input}}")
    ]
)
f_13 = partial(send_request, prompt=prompt_13, **option_lists, few_shot_examples=examples, **guides)
response = f_13(input=mail["message"])
This code defines a template for an intelligent assistant to classify messages based on urgency, sentiment, and support categories. It uses partial application to customize the request handling with specific examples and guidelines, then processes the input message to return a structured JSON response. This aids in accurate and efficient message classification.
It combines the few-shot examples with the guides generated during meta-prompting.
Evaluate this prompt and its response using the following code:
overall_result["metaprompting_and_few_shot--llama3.1-70b"] = evalulation_full_dataset(test_set_small, f_13)
pretty_print_table(overall_result)
You will receive the following output:
0%|          | 0/20 [00:00<?, ?it/s]

                                           is_valid_json  correct_categories  correct_sentiment  correct_urgency
===========================================================================================================
basic--llama3.1-70b                               100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                            100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                       100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b          100.0%               88.5%              50.0%            90.0%

Now we see that the combined approach is similar to, or slightly worse than, the best individual technique in every metric. In addition, it is a more expensive prompt that needs more resources, as the rough comparison below illustrates.
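To get a rough feeling for the extra cost, you can compare how much context each technique adds to the system message. The sketch below simply counts characters of the examples string and the guides built earlier (and assumes send_request returned the guide text as plain strings); actual cost depends on the model's tokenizer and pricing.

few_shot_chars = len(examples)
guide_chars = sum(len(guide) for guide in guides.values())

print(f"few-shot examples add roughly {few_shot_chars} characters per request")
print(f"meta-prompting guides add roughly {guide_chars} characters per request")
print(f"the combined prompt carries roughly {few_shot_chars + guide_chars} extra characters per request")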
Note
You may get a slightly different response from the one shown here; the same applies to all the remaining model responses shown in this learning journey.
When you execute the same prompt on your machine, an LLM produces varying outputs due to its probabilistic nature, temperature setting, and nondeterministic architecture, which can lead to different responses even with only slight setting changes or internal state shifts.
Evaluation Summary
We need to consider the overall accuracy and quality of a model along with its cost and scale.
At times, smaller models and simpler techniques may give better results.
In the preceding output, we can see that few-shot prompting gives the best overall performance with a less expensive prompt.
Let's recap what we have done to solve the business problem so far:
- We created a basic prompt in SAP AI Launchpad using an open-source model.
- We recreated the prompt using SAP Cloud SDK for AI (Python) to scale the solution.
- We created a baseline evaluation method for the simple prompt.
- Finally, we used techniques like few-shot prompting and meta-prompting to further enhance the prompts.
- The results show improvement in the quality of prompt responses after implementing advanced techniques.
Lesson Summary
You’ve successfully implemented and evaluated key prompt engineering techniques: Few-shot Prompting to provide the LLM with context-rich examples, and Meta-prompting to generate explicit instructions and guides for consistent behavior. You also explored combining these methods. Through iterative evaluation, you’ve witnessed how these techniques, used with the SAP Cloud SDK for AI, can significantly enhance the accuracy and quality of LLM responses, moving your solutions closer to business-ready applications, while also understanding the trade-offs in terms of complexity and cost.