We’ve refined our prompts and even enhanced them with multi-modal input. Now, it’s time to choose the best model to power our solution. In this lesson, we’ll move beyond evaluating just prompts to evaluating the models themselves. You’ll learn how to systematically test the various models available in the generative AI hub using the SAP Cloud SDK for AI, comparing their results on our Facility Solutions Company scenario to make an informed, business-driven selection.
Let's now evaluate different models for the Facility Solutions Company problem that we're solving.
Solution Using Different Models
Continuing with our scenario, we’ve learned to create and refine prompts that assign urgency, sentiment, and categories to customer messages. We also evaluated these advanced prompting techniques to analyze their results.
We will now see how these prompts perform with different models available through the generative AI hub. This is important in building a robust solution because the choice of a model significantly impacts your application’s accuracy, efficiency, and overall cost.
Here are the key reasons why evaluating and selecting the right model is essential for your business problem:
| Reason | Description |
|---|---|
| Specialization | Different models are optimized for specific tasks. Some excel at creative text generation, others at precise classifications, summarizations, or even handling multi-modal inputs (like text and images). |
| Performance | Not all models perform equally on every task. Comparing them helps you find the most accurate or suitable model for your specific requirements. |
| Cost Efficiency | You can save costs by choosing a model that is just right for your task. Sometimes, a simpler, more affordable model can deliver the necessary accuracy, allowing you to avoid the higher costs associated with much more powerful models when they aren’t strictly needed. |
| Flexibility | Different models offer varied capabilities, including support for various input types or generating diverse output formats, providing a more comprehensive solution for complex needs. |
| Redundancy and Reliability | For critical enterprise applications, relying on a single model introduces risk. Evaluating multiple models provides tested alternatives, enhancing your solution’s robustness and minimizing downtime. |
Evaluating Different Models in Code
Mistral AI models
We begin with the Mistral AI models and the basic prompt. These are less expensive, open-source models hosted by SAP and available on the generative AI hub.
```python
overall_result["basic--mistral-large-instruct"] = evalulation_full_dataset(test_set_small, f_8, _model='mistralai--mistral-large-instruct')
pretty_print_table(overall_result)
```
This code evaluates the model on a dataset and prints the results. It measures the model's performance on a small test set, storing the scores under a descriptive key in the `overall_result` dictionary. The `pretty_print_table` function then formats and prints these results, making the evaluation data clear and easy to read.
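For orientation, here is a minimal, hypothetical sketch of what such an evaluation helper might do internally. The metric names mirror the columns in the printed tables below, but the example fields (`message`, `sentiment`, `urgency`, `categories`), the prompt-function signature, and the scoring logic are assumptions for illustration, not the course's actual implementation.

```python
import json
from tqdm import tqdm

def evaluation_sketch(test_set, prompt_fn, _model):
    """Hypothetical evaluation loop: run prompt_fn on every example with
    the given model and score its JSON output against the ground truth."""
    counts = {"is_valid_json": 0.0, "correct_categories": 0.0,
              "correct_sentiment": 0.0, "correct_urgency": 0.0}
    for example in tqdm(test_set):
        raw = prompt_fn(example["message"], _model=_model)  # assumed signature
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON earns zero on every metric
        counts["is_valid_json"] += 1
        counts["correct_sentiment"] += parsed.get("sentiment") == example["sentiment"]
        counts["correct_urgency"] += parsed.get("urgency") == example["urgency"]
        predicted = set(parsed.get("categories", []))
        expected = set(example["categories"])
        if expected:  # partial credit for overlapping category sets
            counts["correct_categories"] += len(predicted & expected) / len(expected)
    n = len(test_set)
    return {metric: f"{100 * value / n:.1f}%" for metric, value in counts.items()}
```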
Running the evaluation produces output similar to the following:
```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
```

Similarly, let's evaluate results using a combination of few-shot and meta-prompting for the same model.
```python
overall_result["metaprompting_and_few_shot--mistral-large-instruct"] = evalulation_full_dataset(test_set_small, f_13, _model='mistralai--mistral-large-instruct')
pretty_print_table(overall_result)
```
You will see the evaluation results:

```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
metaprompting_and_few_shot--mistral-large-instruct          80.0%               71.0%              40.0%            75.0%
```
OpenAI models
We perform similar steps with the OpenAI models. These are among the leading proprietary models available on the generative AI hub.
```python
overall_result["basic--gpt4o"] = evalulation_full_dataset(test_set_small, f_8, _model='gpt-4o')
pretty_print_table(overall_result)
```
You will see the evaluation results.
Similarly, let's evaluate results using a combination of few-shot and meta-prompting for the same model.
```python
overall_result["metaprompting_and_few_shot--gpt4o"] = evalulation_full_dataset(test_set_small, f_13, _model='gpt-4o')
pretty_print_table(overall_result)
```
You will see the evaluation results.
Gemini models
We perform similar steps with the Gemini models. These are among the best Google models available on the generative AI hub.
```python
overall_result["basic--gemini-2.5-flash"] = evalulation_full_dataset(test_set_small, f_8, _model='gemini-2.5-flash')
pretty_print_table(overall_result)
```
You should see output similar to the following:
```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
metaprompting_and_few_shot--mistral-large-instruct          80.0%               71.0%              40.0%            75.0%
basic--gpt4o                                               100.0%               89.0%              35.0%            55.0%
metaprompting_and_few_shot--gpt4o                          100.0%               91.5%              60.0%           100.0%
basic--gemini-2.5-flash                                    100.0%               91.0%              30.0%            60.0%
```

The table now accumulates results for every model and prompt combination evaluated so far.
Similarly, let's evaluate results using a combination of few-shot and meta-prompting for the same model.
```python
overall_result["metaprompting_and_few_shot--gemini-2.5-flash"] = evalulation_full_dataset(test_set_small, f_13, _model='gemini-2.5-flash')
pretty_print_table(overall_result)
```
You can see the evaluation results.
```
  0%|          | 0/20 [00:00<?, ?it/s]
                                                    is_valid_json  correct_categories  correct_sentiment  correct_urgency
=========================================================================================================================
basic--llama3.1-70b                                        100.0%               83.5%              30.0%            70.0%
few_shot--llama3.1-70b                                     100.0%               84.0%              50.0%            90.0%
metaprompting--llama3.1-70b                                100.0%               90.0%              30.0%            95.0%
metaprompting_and_few_shot--llama3.1-70b                   100.0%               88.5%              50.0%            90.0%
basic--mistral-large-instruct                               40.0%               34.5%              25.0%            20.0%
metaprompting_and_few_shot--mistral-large-instruct          80.0%               71.0%              40.0%            75.0%
basic--gpt4o                                               100.0%               89.0%              35.0%            55.0%
metaprompting_and_few_shot--gpt4o                          100.0%               91.5%              60.0%           100.0%
basic--gemini-2.5-flash                                    100.0%               91.0%              30.0%            60.0%
metaprompting_and_few_shot--gemini-2.5-flash               100.0%               92.5%              55.0%            90.0%
```

Note
You may get a slightly different response from the one shown here, and the same applies to all remaining model responses shown in this learning journey.
When you execute the same prompt on your machine, a model can produce varying outputs due to its probabilistic nature, temperature setting, and non-deterministic serving infrastructure; even slight setting changes or internal state shifts can lead to different responses.
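If you want runs that are as repeatable as possible, you can pin generation parameters such as temperature when initializing a model. Below is a minimal sketch using the LangChain integration of SAP's generative-ai-hub-sdk Python package; it assumes configured SAP AI Core credentials, and exact parameter support varies by model.

```python
# Sketch: pin generation parameters for more repeatable runs.
# Assumes the generative-ai-hub-sdk package and configured AI Core credentials.
from gen_ai_hub.proxy.langchain.init_models import init_llm

# temperature=0.0 makes sampling close to deterministic, though providers
# do not guarantee identical outputs across runs.
llm = init_llm("gpt-4o", temperature=0.0, max_tokens=256)
response = llm.invoke("Classify the urgency of: 'The elevator is stuck again!'")
print(response.content)
```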
Exercise
In later exercises, you will explore how to select the optimal models for your business needs by leveraging the Model Library in SAP's generative AI hub.
Evaluation Guidelines for Different Models
When selecting a model within SAP's generative AI hub, pricing and several other factors play a crucial role. Key considerations include:
- Cost Efficiency: Assess whether a smaller, more affordable model can deliver the required performance for your specific task. It’s vital to weigh the model’s cost against the expected return on investment, as effective solutions don’t always require the most expensive models. Refer to SAP Notes 3437766 (Availability of Generative AI Models) and 3505347 (Orchestration) for pricing details in the generative AI hub.
- Model Updates and New Capabilities: Analyze technical details for available models, including token conversion rates, rate limits, and deprecation schedules, using SAP Note 3437766. This note also lists the latest generative AI hub models, such as the Claude and SAP-RPT-1 models, to help you integrate industry-standard capabilities into your AI solutions.
- Scalability: Consider how easily the model’s pricing and infrastructure can scale with your application’s growth. Subscription-based models offered in the generative AI hub provide predictable costs and are designed to support scalable AI development and deployment.
- Performance vs. Cost Balance: High-performing models typically come at a higher cost. Organizations must evaluate whether the incremental performance gains of a more powerful model truly justify the additional expense for their specific application and its business value. Sometimes, a slightly less performant but significantly cheaper model offers better overall value.
- Flexibility: Look for pricing and model options that allow for adjustments based on fluctuating usage patterns or evolving AI demands. This adaptability is crucial for optimizing spending in dynamic enterprise environments.
By considering these guidelines, businesses can make informed decisions about which generative AI models to deploy, achieving the best balance between cost, performance, and strategic fit for their SAP-integrated solutions.
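To make this performance-versus-cost balance concrete, you could fold an assumed cost figure into the comparison. The sketch below ranks the configurations collected in `overall_result` by a simple weighted score; the relative cost figures are made up for illustration (they are not actual generative AI hub pricing), and the sketch assumes each entry is a dict of percentage strings as printed in the tables above.

```python
# Hypothetical ranking helper: combine measured accuracy with an
# illustrative, made-up relative cost per model (not real pricing).
ASSUMED_RELATIVE_COST = {
    "llama3.1-70b": 0.5,
    "mistral-large-instruct": 1.0,
    "gemini-2.5-flash": 0.8,
    "gpt4o": 3.0,
}

def weighted_score(config_key, metrics, cost_weight=5.0):
    """Average the four accuracy metrics and subtract a cost penalty."""
    model = config_key.split("--", 1)[1]
    accuracy = sum(float(v.rstrip("%")) for v in metrics.values()) / len(metrics)
    return accuracy - cost_weight * ASSUMED_RELATIVE_COST.get(model, 1.0)

ranked = sorted(overall_result.items(),
                key=lambda item: weighted_score(*item), reverse=True)
for config_key, metrics in ranked:
    print(f"{weighted_score(config_key, metrics):6.1f}  {config_key}")
```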
Evaluation Summary
We saw how the generative AI hub can solve a business problem and learned about its features and options for supporting custom-built AI solutions.
Throughout this course, you’ve gained a comprehensive understanding of this process. We embarked on an iterative path:
- Starting with basic prompt creation in SAP AI Launchpad.
- Scaling our solution by recreating prompts and interactions using the SAP Cloud SDK for AI.
- Establishing a baseline through systematic evaluation.
- Enhancing prompt accuracy and effectiveness with advanced techniques like Few-shot Prompting and Meta-prompting, and even incorporating multi-modal input.
- Finally, we evaluated various models offered by the generative AI hub, comparing their performance, cost, and suitability for our specific business needs.
In our Facility Solutions Company scenario, for example, the evaluation clearly showed that combining few-shot prompting with an efficient, readily available model offered an optimal balance of accuracy, cost, and scalability for assigning urgency, sentiment, and categories to customer emails. This ensures the output is precise and ready for consumption by other applications within the organization, significantly enhancing customer service and operational efficiency.
SAP’s generative AI hub empowers you to develop, deploy, and manage custom-built AI solutions that programmatically enhance your existing business applications, driving innovation across your enterprise.