Selecting the Suitable LLM

Objective

After completing this lesson, you will be able to evaluate different models using generative-ai-hub-sdk.

Evaluating Different Models Using generative-ai-hub-sdk

Let's now evaluate different models for the Facility Solutions Company problem that we're solving.

mistralai--mixtral-8x7b-instruct-v01

We begin with mistralai--mixtral-8x7b-instruct-v01 and use the basic prompt. This model is an example of one of the cheapest open-source, SAP-hosted models available on generative AI hub.

Python
overall_result["basic--mixtral-8x7b"] = evalulation_full_dataset(test_set_small, f_8, _model='mistralai--mixtral-8x7b-instruct-v01')
pretty_print_table(overall_result)

This code evaluates the model on the small test set and prints the results. The evalulation_full_dataset call measures the model's performance and the scores are stored under a descriptive key in the "overall_result" dictionary. The "pretty_print_table" function then formats and prints these results, making the evaluation data clear and easy to read.
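
The helper functions evalulation_full_dataset and pretty_print_table were defined earlier in the exercise. Their exact implementation is not repeated here; the following is only a minimal, hypothetical sketch of the pattern they follow, with the prompt call, field names, and scoring simplified for illustration:

Python

import json

def evalulation_full_dataset(test_set, prompt_fn, _model):
    """Illustrative sketch: run prompt_fn on every example and score the JSON output."""
    scores = {"is_valid_json": 0, "correct_categories": 0,
              "correct_sentiment": 0, "correct_urgency": 0}
    for example in test_set:
        raw = prompt_fn(example["message"], _model=_model)  # call the LLM via the SDK
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON earns no points for this example
        scores["is_valid_json"] += 1
        scores["correct_categories"] += parsed.get("categories") == example["categories"]
        scores["correct_sentiment"] += parsed.get("sentiment") == example["sentiment"]
        scores["correct_urgency"] += parsed.get("urgency") == example["urgency"]
    return {metric: value / len(test_set) for metric, value in scores.items()}

def pretty_print_table(results):
    """Illustrative sketch: print one row of metrics per evaluated configuration."""
    for config, metrics in results.items():
        row = "  ".join(f"{name}={value:.0%}" for name, value in metrics.items())
        print(f"{config:45s} {row}")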

You should see output similar to the following:

A table with performance metrics for different models. Metrics include valid JSON, correct categories, sentiment, and urgency. Models listed are llama3-70b, mixtral-8x7b, and few-shot variations.

Similarly, let's evaluate results using a combination of few-shot and metaprompting for the same model.

Python
overall_result["metaprompting_and_few_shot--mixtral-8x7b"] = evalulation_full_dataset(test_set_small, f_13, _model='mistralai--mixtral-8x7b-instruct-v01')
pretty_print_table(overall_result)

You should see output similar to the following:

A table with evaluation results for different models and prompting techniques. Metrics include valid JSON, correct categories, sentiment, and urgency, with scores for each model configuration.

You can see the evaluation results.
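
For reference, f_8 is the basic prompt function and f_13 is the few-shot plus metaprompting variant built in the previous lessons. Their exact wording is not repeated here; the sketch below is only a hypothetical illustration of how a few-shot prompt combined with a metaprompt (system instruction) is typically assembled:

Python

# Hypothetical sketch of a few-shot + metaprompting prompt builder; the real f_13
# from the earlier lesson uses different wording and examples.
METAPROMPT = (
    "You are an assistant for a facility management company. Classify the customer "
    "message and reply ONLY with JSON containing the keys 'categories', 'sentiment', "
    "and 'urgency'."
)

FEW_SHOT_EXAMPLES = [
    ("The elevator has been broken for two days and tenants are upset.",
     '{"categories": ["elevator"], "sentiment": "negative", "urgency": "high"}'),
    ("Thanks for the quick cleaning service last week!",
     '{"categories": ["cleaning"], "sentiment": "positive", "urgency": "low"}'),
]

def build_messages(customer_message):
    """Assemble a chat message list: metaprompt, worked examples, then the new input."""
    messages = [{"role": "system", "content": METAPROMPT}]
    for example_input, example_output in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": customer_message})
    return messages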

gpt-4o

We perform similar steps with gpt-4o. This model is an example of the best proprietary OpenAI models available on generative AI hub.

Python
overall_result["basic--gpt4o"] = evalulation_full_dataset(test_set_small, f_8, _model='gpt-4o')
pretty_print_table(overall_result)

You should see output similar to the following:

A table with performance metrics for different models and configurations. Metrics include JSON validity, correct categories, sentiment, and urgency. Models include llama3, mixtral, and gpt4o.

You can see the results for this model alongside the earlier runs.

Similarly, let's evaluate results using a combination of few-shot and metaprompting for the same model.

Python
overall_result["metaprompting_and_few_shot--gpt4o"] = evalulation_full_dataset(test_set_small, f_13, _model='gpt-4o')
pretty_print_table(overall_result)

You should see output similar to the following:

A table comparing different models (llama3-70b, mixtral-8x7b, gpt4o) on metrics: valid JSON, correct categories, sentiment, and urgency. All models have 100% valid JSON, with varying other scores.

You can see the evaluation results.

gemini-1.5-flash

We perform similar steps with gemini-1.5-flash. This model is the cheapest and fastest Google model available on generative AI hub.

Python
overall_result["basic--gemini-1.5-flash"] = evalulation_full_dataset(test_set_small, f_8, _model='gemini-1.5-flash')
pretty_print_table(overall_result)

You should see output similar to the following:

A table of model performance metrics, including is_valid_json, correct_categories, correct_sentiment, and correct_urgency, for various models like llama3-70b, mixtral-8x7b, and gpt4o.

You can see the results for this model alongside the earlier runs.

Similarly, let's evaluate results using a combination of few-shot and metaprompting for the same model.

Python
overall_result["metaprompting_and_few_shot--gemini-1.5-flash"] = evalulation_full_dataset(test_set_small, f_13, _model='gemini-1.5-flash')
pretty_print_table(overall_result)

You should see output similar to the following:

A table of model performance metrics, including is_valid_json, correct_categories, correct_sentiment, and correct_urgency, for various models like llama3-70b, mixtral-8x7b, and gpt4o.

You can see the evaluation results.

Note

You may get a slightly different response from the one shown here; the same applies to all remaining model responses shown in this learning journey.

When you execute the same prompt on your machine, an LLM can produce varying outputs because of its probabilistic sampling, its temperature setting, and other non-deterministic factors, so responses may differ even when settings change only slightly or the model's internal state shifts.
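
To reduce (but not eliminate) this variability, you can lower the sampling temperature. The snippet below is a minimal sketch that assumes the SDK's OpenAI-compatible chat proxy (gen_ai_hub.proxy.native.openai); verify the exact import path and parameter names against the generative-ai-hub-sdk documentation for your version:

Python

# Sketch: lowering temperature for more repeatable outputs.
# Assumption: the SDK exposes an OpenAI-compatible chat proxy with a model_name
# parameter; check the generative-ai-hub-sdk documentation for the exact API.
from gen_ai_hub.proxy.native.openai import chat

response = chat.completions.create(
    model_name="gpt-4o",   # any model deployed in your generative AI hub instance
    messages=[{"role": "user", "content": "Classify this message: The elevator is broken."}],
    temperature=0,         # 0 = least random sampling, more repeatable answers
)
print(response.choices[0].message.content)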

Evaluation Guidelines for Different Models

Pricing and rates play a crucial role in model selection within generative AI hub. Key considerations include:

  1. Cost Efficiency: Smaller, more affordable models often provide excellent results at a lower cost. They offer a cost-effective solution that balances performance with budget constraints for many applications. Companies need to weigh the cost of using advanced models against the expected return on investment (a simple cost-estimation sketch follows this list). Refer to SAP notes 3437766 - Availability of Generative AI Models and 3505347 - Orchestration for pricing details in generative AI hub.
  2. Scalability: Subscription-based pricing models provide predictable costs and are easier to scale. This is where generative AI hub can support scalable AI development and deployment.
  3. Performance vs. Cost: Although high-performing models come at a higher cost, smaller models can deliver comparable results for specific tasks. This makes them a viable option for organizations looking to optimize spending without compromising on quality. Organizations must evaluate whether the performance gains justify the additional expense by assessing the specific needs of the application and its business value.
  4. Flexibility: Pricing models that allow for adjustments based on usage patterns help organizations optimize their spending. This is particularly important in dynamic environments where AI capability demands fluctuate.
  5. Competitive Advantage: Strategic pricing can provide a competitive edge. Companies adopting innovative pricing strategies, such as outcome-based pricing, can attract more customers and drive higher adoption rates.
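
As a simple illustration of the cost-versus-performance trade-off mentioned above, the sketch below estimates monthly cost from token volumes and per-1,000-token prices. All figures are made up for illustration only; substitute the current rates from the SAP notes referenced in point 1:

Python

# Hypothetical cost estimation; all prices and volumes below are illustrative only.
# Real prices are listed in SAP notes 3437766 and 3505347.
def monthly_cost(requests_per_month, input_tokens, output_tokens,
                 price_in_per_1k, price_out_per_1k):
    """Estimate monthly cost from average token counts and per-1,000-token prices."""
    cost_per_request = (input_tokens / 1000) * price_in_per_1k \
                       + (output_tokens / 1000) * price_out_per_1k
    return requests_per_month * cost_per_request

# Example: 50,000 classified emails per month, ~600 input and ~150 output tokens each.
small_model = monthly_cost(50_000, 600, 150, price_in_per_1k=0.0005, price_out_per_1k=0.0015)
large_model = monthly_cost(50_000, 600, 150, price_in_per_1k=0.0050, price_out_per_1k=0.0150)
print(f"Smaller model: ~{small_model:.2f} per month")
print(f"Larger model:  ~{large_model:.2f} per month")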

Businesses can use these considerations to make informed decisions about which generative AI models to deploy, achieving the best balance between cost and performance.

Evaluation Summary

We saw how we can use generative AI hub to solve a business problem and learned about features and options that generative AI hub offers to develop, deploy, and manage custom-built AI solutions.

Let's recap what we have done so far to solve the business problem:

  1. We created a basic prompt in SAP AI Launchpad using an open-source model.
  2. We recreated the prompt using generative-ai-hub-sdk to scale the solution.
  3. We created a baseline evaluation method for the simple prompt.
  4. Finally, we used techniques like few-shot and metaprompting to further enhance the prompts.
  5. We tested various models in generative AI hub using SAP AI Launchpad.
  6. We evaluated various models for the problem using generative-ai-hub-sdk.

For example, in the Facility Solutions Company scenario, the evaluation shows that the few_shot--llama3-70b configuration gives the best overall result. Although some models achieve higher accuracy on certain fields, considering cost, scalability, and performance, the company can use the few-shot prompt with meta--llama3-70b-instruct to get a structured response in a format that other applications within the organization can consume.
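
If you want to perform this comparison programmatically rather than by reading the table, a small sketch like the following can rank the entries in overall_result, assuming each entry is a dictionary of metric scores between 0 and 1 as in the cells above; the weights are illustrative and should reflect your own business priorities:

Python

# Sketch: rank evaluated configurations by a weighted score.
# Assumption: overall_result maps configuration names (e.g. "few_shot--llama3-70b")
# to dictionaries of metric scores between 0 and 1. The weights are illustrative.
WEIGHTS = {
    "is_valid_json": 0.4,        # structured output matters most for downstream apps
    "correct_categories": 0.3,
    "correct_sentiment": 0.15,
    "correct_urgency": 0.15,
}

def weighted_score(metrics):
    """Combine individual metrics into one comparable score."""
    return sum(WEIGHTS.get(name, 0) * value for name, value in metrics.items())

ranking = sorted(overall_result.items(), key=lambda item: weighted_score(item[1]), reverse=True)
for config, metrics in ranking:
    print(f"{weighted_score(metrics):.2f}  {config}")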

Facility Solutions teams can leverage generative AI hub to categorize customer emails and prioritize tasks, enhancing customer service and the overall experience.

With SAP’s generative AI hub, you can build your own solutions that enhance business applications by using LLMs programmatically.
