Enhancing Prompt Effectiveness Through Multi-modal Input

Objective

After completing this lesson, you will be able to optimize AI responses by leveraging multi-modal input in your prompts.

We’ve mastered the art of refining text-based prompts to get structured information. But what if a picture could make your LLM’s job much easier? In the real world, problems often come with visual clues. This lesson will show you how to give your LLM that visual context by using multi-modal input, meaning both text and images. We’ll explore how to do this directly in SAP AI Launchpad and by extending our code with the SAP Cloud SDK for AI, leading to smarter and more accurate AI solutions.

Why Multi-modal Input Matters

Imagine a customer reporting a broken machine. They could describe it in text, but if they also include a photo of the damaged part, the problem becomes much clearer. By allowing your prompts to accept both text and images, you provide the LLM with a complete picture, which can lead to:

  • Better Understanding: The LLM can "see" what you mean, reducing confusion.
  • More Accurate Results: Visuals can help confirm details or reveal issues text alone might miss.
  • Solving New Problems: This opens possibilities for AI to help with tasks that require both visual and textual analysis, like quality inspections or maintenance.

The generative AI hub, with SAP AI Launchpad and the SAP Cloud SDK for AI, makes this powerful capability accessible.

Multi-modal Prompts in SAP AI Launchpad

You don’t always need to write code to use multi-modal prompts. The SAP AI Launchpad provides a user-friendly interface where you can easily combine text and images. It supports many multi-modal models, such as GPT-4o, allowing you to create and test these advanced prompts visually.

To see which models support multi-modal input, refer to the Model Library and the individual model cards.

Model Selection

Let’s look at how this appears in the Prompt Editor:

Prompt Editor

In the Prompt Editor, you’ve entered a sample email and instructions for the LLM to extract JSON output with urgency and sentiment. You’ll notice an "Upload Image" button. This is where you can add a visual component to your prompt.

The AI’s response for this text-only input might be: {"urgency": "high", "sentiment": "neutral"}.

Upload Image

Here, after clicking "Upload Image" (or dragging and dropping), a small, embedded image is now visible directly within the input text area. This image, showing a fallen tree in front of an entrance, is now part of the prompt.

Now the response might look like: {"urgency": "high", "sentiment": "negative"}.

You can see that even though the complaint text is very short, the additional visual information allows the AI to provide a more precise response.
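Whether the reply comes from the Launchpad or from code, the prompt asks the model for a bare JSON string, and in an application you would still validate it before use. Here is a minimal, hedged sketch of such a check (the function name and key set are illustrative, not part of any SDK):

```python
import json

def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and verify the expected keys are present.

    Raises ValueError if the reply is not the bare JSON we asked for.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {raw!r}") from exc
    missing = {"urgency", "sentiment"} - data.keys()
    if missing:
        raise ValueError(f"Reply is missing expected keys: {missing}")
    return data

result = parse_classification('{"urgency": "high", "sentiment": "negative"}')
print(result["urgency"])  # high
```

Even a small guard like this catches the common failure mode where the model wraps its answer in extra prose or code fences despite the instructions.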

Multi-modal Prompts with the SAP Cloud SDK for AI

For programmatic access and integrating multi-modal capabilities into your custom applications, you can use the SAP Cloud SDK for AI. The main change involves how we define the user’s message to the LLM. Instead of just a text string, the UserMessage can now take a list of content parts, where each part can be text or an image URL.

Here’s how we adapt our send_request function and the prompt (prompt_13_multimodal) to include an image:

Python
# We need to import TextContent and ImageUrlContent for multimodal messages
from typing import Optional
from functools import partial  # Imported for consistency with prior lesson code

from gen_ai_hub.orchestration.models.config import OrchestrationConfig
from gen_ai_hub.orchestration.models.llm import LLM
from gen_ai_hub.orchestration.models.message import SystemMessage, UserMessage, TextContent, ImageUrlContent
from gen_ai_hub.orchestration.models.template import Template, TemplateValue
from gen_ai_hub.orchestration.service import OrchestrationService

# The orchestration service instance is assumed from prior lessons, e.g.:
# orchestration_service = OrchestrationService(api_url=YOUR_ORCHESTRATION_DEPLOYMENT_URL)

# The send_request function is updated to accept an optional 'image_url'
def send_request(prompt: str, _print: bool = True,
                 _model: str = 'meta--llama3-70b-instruct',
                 image_url: Optional[str] = None, **kwargs):
    # We create a list to hold all parts of our message (text and optional image)
    content_parts = []

    # If an image URL is provided, we add it as an ImageUrlContent part
    if image_url:
        content_parts.append(ImageUrlContent(url=image_url))

    # We always add the text prompt as a TextContent part
    content_parts.append(TextContent(text=prompt))

    # Our OrchestrationConfig now uses a UserMessage with this list of content parts
    config = OrchestrationConfig(
        llm=LLM(name=_model),
        template=Template(messages=[UserMessage(content=content_parts)])  # Key change here!
    )

    template_values = [TemplateValue(name=key, value=value) for key, value in kwargs.items()]
    answer = orchestration_service.run(config=config, template_values=template_values)
    result = answer.module_results.llm.choices[0].message.content

    if _print:
        print(f"<--- PROMPT TEXT --->\n{prompt}")
        if image_url:
            print(f"<--- IMAGE URL --->\n{image_url}")
        print(f"<--- RESPONSE --->\n{result}")

    return result


# --- Updated multimodal prompt (prompt_13_multimodal) ---
prompt_13_multimodal = """Your task is to classify messages and the provided image.
Here are some examples:
---
{{?few_shot_examples}}
---
This is an explanation of `urgency` labels:
---
{{?guide_urgency}}
---
This is an explanation of `sentiment` labels:
---
{{?guide_sentiment}}
---
This is an explanation of `support` categories:
---
{{?guide_categories}}
---
Given the following message, and considering the image for visual context:
---
{{?input}}
---
extract and return a JSON with the following keys and values:
- "urgency" as one of {{?urgency}}
- "sentiment" as one of {{?sentiment}}
- "categories" list of the best matching support category tags from: {{?categories}}
Your complete message should be a valid json string that can be read directly and
only contain the keys mentioned in the list above. Never enclose it in
```json...```, no newlines, no unnecessary whitespaces.
"""

# --- Example usage (requires 'option_lists', 'examples', 'guides', 'mail' from prior lessons) ---
# For illustration purposes, let's assume these are set up:
# option_lists = {"urgency": ["low", "medium", "high"], ...}
# examples = "..."  # Formatted few-shot examples
# guides = {"guide_urgency": "...", ...}
# mail = {"message": "The HVAC system is making a loud banging noise and no longer cooling. It needs immediate attention."}

# This URL should point to a real image accessible by the LLM
example_image_url = "https://example.com/assets/faulty_hvac_part.png"  # Replace with a real image URL

# We create our partial function, now including the image_url
f_13_multimodal = partial(
    send_request,
    prompt=prompt_13_multimodal,
    # Pass all our usual options, few-shot examples, and guides
    **option_lists,
    few_shot_examples=examples,
    **guides,
    image_url=example_image_url  # This is the crucial addition!
)

# When you call this function, the LLM receives both text and the image
# response_multimodal = f_13_multimodal(input=mail["message"])
# print("\nReceived Multimodal Response:", response_multimodal)

Understanding Code Changes

  1. New Imports: We now import TextContent and ImageUrlContent from gen_ai_hub.orchestration.models.message. These special types tell the SDK that we’re sending different kinds of content.
  2. send_request Update:
    • It now accepts an optional image_url parameter.
    • Inside the function, we create a list called content_parts.
    • If image_url is provided, we add it to content_parts using ImageUrlContent(url=image_url).
    • The original text prompt is also added to content_parts using TextContent(text=prompt).
    • The UserMessage in our OrchestrationConfig now receives this content_parts list. This tells the SDK to send both the image and the text together to the LLM.
  3. prompt_13_multimodal: The text of the prompt itself is updated slightly to explicitly tell the LLM to "classify messages and the provided image" and to consider "the image for visual context." This helps guide the LLM’s attention.
  4. Calling the Function: When we create f_13_multimodal using partial, we simply include image_url=example_image_url as an argument. Now, every time f_13_multimodal is called, it will send the text from mail["message"] along with the image at example_image_url to the LLM.
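A useful side effect of the partial-function pattern in step 4 is that the baked-in image URL can be swapped out per call, because keyword arguments supplied at call time override those stored in the functools.partial object. A small stand-in sketch (this send_request is a dummy that echoes its arguments, not the SDK version, and the URLs are placeholders):

```python
from functools import partial

# Stand-in for send_request: just echoes what it would receive
def send_request(prompt, image_url=None, **kwargs):
    return {"prompt": prompt, "image_url": image_url, **kwargs}

f = partial(send_request,
            prompt="classify {{?input}}",
            image_url="https://example.com/a.png")

# The default image baked into the partial is used when none is given
print(f(input="short complaint")["image_url"])
# -> https://example.com/a.png

# Keyword arguments at call time override those stored in the partial,
# so the same helper can be reused with a different image per request
print(f(input="short complaint", image_url="https://example.com/b.png")["image_url"])
# -> https://example.com/b.png
```

This means you do not need a new partial for every image; one configured helper can serve many requests.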

Evaluating Multi-modal Responses

Just like with text-only prompts, it’s vital to evaluate multi-modal responses. You’ll use the same evaluation functions as before to check whether the JSON output is correctly formatted and whether the extracted categories, sentiment, and urgency are accurate. The key difference is that the LLM now has more information (the image) to arrive at its answer, so your ground truth for what counts as correct implicitly includes that visual context. This helps you confirm that adding images genuinely improves your AI’s understanding and accuracy.
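As a rough illustration of what such an evaluation can look like, here is a hedged sketch of a per-response scorer. The function name, field names, and ground-truth structure are illustrative; your evaluation functions from the earlier lessons may differ:

```python
import json

def evaluate_response(raw: str, truth: dict) -> dict:
    """Score one model reply against hand-labelled ground truth.

    Returns a JSON-validity flag plus per-field correctness flags.
    """
    try:
        pred = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False}
    scores = {"valid_json": True}
    for field in ("urgency", "sentiment"):
        scores[field] = pred.get(field) == truth.get(field)
    # Categories are compared as sets, since their order should not matter
    scores["categories"] = set(pred.get("categories", [])) == set(truth.get("categories", []))
    return scores

truth = {"urgency": "high", "sentiment": "negative",
         "categories": ["facility_management"]}
raw = '{"urgency": "high", "sentiment": "negative", "categories": ["facility_management"]}'
print(evaluate_response(raw, truth))
```

Running the same scorer over text-only and text-plus-image responses for the same messages gives you a direct, quantitative view of whether the image actually helps.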

Practical application

One practical application of this multi-modal capability of the generative AI hub is a web-based intelligent chatbot that interacts with users via text, audio, images, and video, returning context-aware responses from a multi-modal AI model.

See Multimodal Response Assistant Chatbot Using SAP AI Core

Lesson Summary

In this lesson, you’ve taken a significant step forward by learning to incorporate multi-modal input into your generative AI applications. Whether using the intuitive SAP AI Launchpad or programmatically with the SAP Cloud SDK for AI, you now understand how to provide LLMs with both text and image context. This powerful approach leads to more intelligent, precise, and contextually rich responses, expanding the range of real-world business problems you can effectively solve within the SAP ecosystem.