We’ve refined our prompts and even enhanced them with multi-modal input. Now, it’s time to choose the best model to power our solution. In this lesson, we’ll move beyond evaluating just prompts to evaluating the models themselves. You’ll learn how to systematically test the various models available in the generative AI hub using the SAP Cloud SDK for AI, comparing their results on our Facility Solutions Company scenario to make an informed, business-driven selection.
Let's now evaluate different models for the Facility Solutions Company problem that we're solving.
Solution Using Different Models
Continuing with our scenario, we’ve learned to create and refine prompts that assign urgency, sentiment, and categories to customer messages. We also evaluated these advanced prompting techniques to analyze their results.
We will now see how these prompts perform with different models available through the generative AI hub. This is important in building a robust solution because the choice of a model significantly impacts your application’s accuracy, efficiency, and overall cost.
Here are the key reasons why evaluating and selecting the right model is essential for your business problem:
| Reason | Description |
|---|---|
| Specialization | Different models are optimized for specific tasks. Some excel at creative text generation, others at precise classifications, summarizations, or even handling multi-modal inputs (like text and images). |
| Performance | Not all models perform equally on every task. Comparing them helps you find the most accurate or suitable model for your specific requirements. |
| Cost Efficiency | You can save costs by choosing a model that is just right for your task. Sometimes, a simpler, more affordable model can deliver the necessary accuracy, allowing you to avoid the higher costs associated with much more powerful models when they aren’t strictly needed. |
| Flexibility | Different models offer varied capabilities, including support for various input types or generating diverse output formats, providing a more comprehensive solution for complex needs. |
| Redundancy and Reliability | For critical enterprise applications, relying on a single model introduces risk. Evaluating multiple models provides tested alternatives, enhancing your solution’s robustness and minimizing downtime. |
Evaluating Different Models in Code
Mistral AI models
We begin with the Mistral AI models and the basic prompt. These are less expensive, open-source models hosted by SAP and available on the generative AI hub.
```python
overall_result["basic--mistral-large-instruct"] = evalulation_full_dataset(test_set_small, f_8, _model='mistralai--mistral-large-instruct')
pretty_print_table(overall_result)
```
This code evaluates the model on a dataset and prints the results. It measures the model's performance on a small test set, storing the scores under a descriptive key in the `overall_result` dictionary. The `pretty_print_table` function then formats and prints these results, making the evaluation data clear and easy to read.
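For orientation, here is a minimal, hypothetical sketch of what such an evaluation helper might do internally. The metric names mirror the columns in the printed tables below, but the example fields (`message`, `sentiment`, `urgency`, `categories`), the prompt-function signature, and the scoring logic are assumptions for illustration, not the course's actual implementation.

```python
import json
from tqdm import tqdm

def evaluation_sketch(test_set, prompt_fn, _model):
    """Hypothetical evaluation loop: run prompt_fn on every example with
    the given model and score its JSON output against the ground truth."""
    counts = {"is_valid_json": 0.0, "correct_categories": 0.0,
              "correct_sentiment": 0.0, "correct_urgency": 0.0}
    for example in tqdm(test_set):
        raw = prompt_fn(example["message"], _model=_model)  # assumed signature
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON earns zero on every metric
        counts["is_valid_json"] += 1
        counts["correct_sentiment"] += parsed.get("sentiment") == example["sentiment"]
        counts["correct_urgency"] += parsed.get("urgency") == example["urgency"]
        predicted = set(parsed.get("categories", []))
        expected = set(example["categories"])
        if expected:  # partial credit for overlapping category sets
            counts["correct_categories"] += len(predicted & expected) / len(expected)
    n = len(test_set)
    return {metric: f"{100 * value / n:.1f}%" for metric, value in counts.items()}
```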
Running the evaluation produces output similar to the following:
```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
```

Similarly, let's evaluate results using a combination of few-shot and meta-prompting for the same model.
```python
overall_result["metaprompting_and_few_shot--mistral-large-instruct"] = evalulation_full_dataset(test_set_small, f_13, _model='mistralai--mistral-large-instruct')
pretty_print_table(overall_result)
```
You will see the evaluation results:

```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
metaprompting_and_few_shot--mistral-large-instruct          80.0%               71.0%              40.0%            75.0%
```
OpenAI models
We perform similar steps with the OpenAI models. These are among the leading proprietary models available on the generative AI hub.
```python
overall_result["basic--gpt4o"] = evalulation_full_dataset(test_set_small, f_8, _model='gpt-4o')
pretty_print_table(overall_result)
```
You will see the evaluation results.
Similarly, let's evaluate results using a combination of few-shot and meta-prompting for the same model.
```python
overall_result["metaprompting_and_few_shot--gpt4o"] = evalulation_full_dataset(test_set_small, f_13, _model='gpt-4o')
pretty_print_table(overall_result)
```
You will see the evaluation results.
Gemini models
We perform similar steps with the Gemini models. These are among the best Google models available on the generative AI hub.
```python
overall_result["basic--gemini-2.5-flash"] = evalulation_full_dataset(test_set_small, f_8, _model='gemini-2.5-flash')
pretty_print_table(overall_result)
```
You should see output similar to the following:
```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
metaprompting_and_few_shot--mistral-large-instruct          80.0%               71.0%              40.0%            75.0%
basic--gpt4o                                               100.0%               89.0%              35.0%            55.0%
metaprompting_and_few_shot--gpt4o                          100.0%               91.5%              60.0%           100.0%
basic--gemini-2.5-flash                                    100.0%               91.0%              30.0%            60.0%
```

The table now accumulates results for every model and prompt combination evaluated so far.
Similarly, let's evaluate results using a combination of few-shot and meta-prompting for the same model.
```python
overall_result["metaprompting_and_few_shot--gemini-2.5-flash"] = evalulation_full_dataset(test_set_small, f_13, _model='gemini-2.5-flash')
pretty_print_table(overall_result)
```
You can see the evaluation results.
```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
metaprompting_and_few_shot--mistral-large-instruct          80.0%               71.0%              40.0%            75.0%
basic--gpt4o                                               100.0%               89.0%              35.0%            55.0%
metaprompting_and_few_shot--gpt4o                          100.0%               91.5%              60.0%           100.0%
basic--gemini-2.5-flash                                    100.0%               91.0%              30.0%            60.0%
metaprompting_and_few_shot--gemini-2.5-flash               100.0%               92.5%              55.0%            90.0%
```

Note
You may get a slightly different response from the one shown here, and the same applies to all remaining model responses shown in this learning journey.
When you execute the same prompt on your machine, a model can produce varying outputs due to its probabilistic nature, temperature setting, and non-deterministic serving infrastructure; even slight setting changes or internal state shifts can lead to different responses.
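If you want runs that are as repeatable as possible, you can pin generation parameters such as temperature when initializing a model. Below is a minimal sketch using the LangChain integration of SAP's generative-ai-hub-sdk Python package; it assumes configured SAP AI Core credentials, and exact parameter support varies by model.

```python
# Sketch: pin generation parameters for more repeatable runs.
# Assumes the generative-ai-hub-sdk package and configured AI Core credentials.
from gen_ai_hub.proxy.langchain.init_models import init_llm

# temperature=0.0 makes sampling close to deterministic, though providers
# do not guarantee identical outputs across runs.
llm = init_llm("gpt-4o", temperature=0.0, max_tokens=256)
response = llm.invoke("Classify the urgency of: 'The elevator is stuck again!'")
print(response.content)
```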
Exercise
In later exercises, you will explore how to select the optimal models for your business needs by leveraging the Model Library in SAP's generative AI hub.
Evaluation Guidelines for Different Models
When selecting a model within SAP's generative AI hub, pricing and several other factors play a crucial role. Key considerations include:
- Cost Efficiency: Assess whether a smaller, more affordable model can deliver the required performance for your specific task. It’s vital to weigh the model’s cost against the expected return on investment, as effective solutions don’t always require the most expensive models. Refer to SAP Notes 3437766 (Availability of Generative AI Models) and 3505347 (Orchestration) for pricing details in the generative AI hub.
- Model Updates and New Capabilities: Analyze technical details for available models, including token conversion rates, rate limits, and deprecation schedules, using SAP Note 3437766. This note also lists the latest generative AI hub models, such as the Claude and SAP-RPT-1 models, to help you integrate industry-standard capabilities into your AI solutions.
- Scalability: Consider how easily the model’s pricing and infrastructure can scale with your application’s growth. Subscription-based models offered in the generative AI hub provide predictable costs and are designed to support scalable AI development and deployment.
- Performance vs. Cost Balance: High-performing models typically come at a higher cost. Organizations must evaluate whether the incremental performance gains of a more powerful model truly justify the additional expense for their specific application and its business value. Sometimes, a slightly less performant but significantly cheaper model offers better overall value.
- Flexibility: Look for pricing and model options that allow for adjustments based on fluctuating usage patterns or evolving AI demands. This adaptability is crucial for optimizing spending in dynamic enterprise environments.
By considering these guidelines, businesses can make informed decisions about which generative AI models to deploy, achieving the best balance between cost, performance, and strategic fit for their SAP-integrated solutions.
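To make this performance-versus-cost balance concrete, you could fold an assumed cost figure into the comparison. The sketch below ranks the configurations collected in `overall_result` by a simple weighted score; the relative cost figures are made up for illustration (they are not actual generative AI hub pricing), and the sketch assumes each entry is a dict of percentage strings as printed in the tables above.

```python
# Hypothetical ranking helper: combine measured accuracy with an
# illustrative, made-up relative cost per model (not real pricing).
ASSUMED_RELATIVE_COST = {
    "llama3.1-70b": 0.5,
    "mistral-large-instruct": 1.0,
    "gemini-2.5-flash": 0.8,
    "gpt4o": 3.0,
}

def weighted_score(config_key, metrics, cost_weight=5.0):
    """Average the four accuracy metrics and subtract a cost penalty."""
    model = config_key.split("--", 1)[1]
    accuracy = sum(float(v.rstrip("%")) for v in metrics.values()) / len(metrics)
    return accuracy - cost_weight * ASSUMED_RELATIVE_COST.get(model, 1.0)

ranked = sorted(overall_result.items(),
                key=lambda item: weighted_score(*item), reverse=True)
for config_key, metrics in ranked:
    print(f"{weighted_score(config_key, metrics):6.1f}  {config_key}")
```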
Evaluation Summary
We saw how the generative AI hub can solve a business problem and learned about its features and options for supporting custom-built AI solutions.
Throughout this course, you’ve gained a comprehensive understanding of this process. We embarked on an iterative path:
- Starting with basic prompt creation in SAP AI Launchpad.
- Scaling our solution by recreating prompts and interactions using the SAP Cloud SDK for AI.
- Establishing a baseline through systematic evaluation.
- Enhancing prompt accuracy and effectiveness with advanced techniques like Few-shot Prompting and Meta-prompting, and even incorporating multi-modal input.
- Finally, we evaluated various models offered by the generative AI hub, comparing their performance, cost, and suitability for our specific business needs.
In our Facility Solutions Company scenario, for example, the evaluation clearly showed that combining few-shot prompting with an efficient, readily available model offered an optimal balance of accuracy, cost, and scalability for assigning urgency, sentiment, and categories to customer emails. This ensures the output is precise and ready for consumption by other applications within the organization, significantly enhancing customer service and operational efficiency.
SAP’s generative AI hub empowers you to develop, deploy, and manage custom-built AI solutions that programmatically enhance your existing business applications, driving innovation across your enterprise.