Once you have explored Large Language Models (LLMs), prompt engineering, fine-tuning, and other techniques to improve LLM performance, the next step is to assess how well the model actually performs for your use case.
In this unit, you will learn about the key metrics and methods used to assess LLM performance for your use case. You will learn basic concepts, such as model inference, that help provide stable and reliable LLM predictions. You will also identify some best practices to ensure optimal performance for LLM applications.
The following list describes common metrics used to assess LLM performance:
- Perplexity:
- Perplexity measures how well a language model predicts a given sequence of words. Lower perplexity indicates better performance.
- Bilingual Evaluation Understudy (BLEU):
- BLEU assesses the quality of machine-generated translations by comparing them to reference translations. It computes precision for n-grams (typically 1 to 4) and averages them. N-grams are sequences of n consecutive words (pairs of consecutive words, three consecutive words, and so on) extracted from a given sample of text or speech. They capture word order and local context for tasks such as text classification.
While commonly used in machine translation evaluation, BLEU can also be relevant for other text generation tasks. For instance, it can measure the similarity of generated text, such as summaries, to human-written reference text.
- Recall-Oriented Understudy for Gisting Evaluation (ROUGE):
- ROUGE evaluates the quality of text summarization systems. It measures the overlap between system-generated summaries and reference summaries.
- Classification Accuracy:
- This metric measures the proportion of correctly predicted instances out of the total. For example, if an LLM classifies customer reviews as positive or negative sentiment, accuracy tells us how often it predicts correctly.
- Precision and Recall:
- These metrics are essential for binary classification tasks.
- Precision quantifies how many of the predicted positive instances are actually positive. High precision means few false positives.
- Recall (also known as sensitivity) measures how many of the actual positive instances were correctly predicted. High recall means few false negatives.
- F1 Score:
- Often used for text classification tasks, the F1 score is the harmonic mean of precision and recall, balancing the two in a single number.
- Word Error Rate (WER):
- WER assesses the accuracy of automatic speech recognition systems by comparing system-generated transcriptions to human transcriptions.
- Semantic Similarity Metrics:
- These metrics include Cosine Similarity, Jaccard Index, and Word Mover’s Distance (WMD). They measure semantic similarity between sentences or documents.
The evaluation metric that you apply depends on your use case, because each metric provides different insights into model performance. The sketches below illustrate how several of these metrics can be computed in practice.
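
For illustration, here is a minimal Python sketch of how perplexity can be computed from the per-token probabilities a language model assigns to a sequence. The probability values are made up for demonstration; in practice they would come from your model.

```python
import math

def perplexity(token_probabilities):
    """Perplexity = exp of the average negative log-probability per token.

    Lower values mean the model found the sequence less surprising.
    """
    n = len(token_probabilities)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probabilities) / n
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities assigned by a model to a short sentence
probs = [0.25, 0.10, 0.60, 0.05, 0.30]
print(f"Perplexity: {perplexity(probs):.2f}")
```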
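
The next sketch computes a simplified BLEU score (clipped unigram and bigram precision with a brevity penalty) and a ROUGE-1-style recall for a candidate sentence against a single reference. It is a didactic simplification; established implementations in NLP toolkits additionally handle multiple references, smoothing, and longer n-grams.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * geo_mean

def rouge_1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams found in the candidate."""
    cand, ref = ngrams(candidate, 1), ngrams(reference, 1)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(f"BLEU (up to bigrams): {bleu(candidate, reference):.3f}")
print(f"ROUGE-1 recall:       {rouge_1_recall(candidate, reference):.3f}")
```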
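
Accuracy, precision, recall, and the F1 score can all be derived from counts of true and false positives and negatives. The sketch below uses hypothetical gold labels and LLM predictions for a binary sentiment task (1 = positive, 0 = negative); in practice you might rely on a library such as scikit-learn instead.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical gold labels and LLM predictions for sentiment (1 = positive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```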
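
Word Error Rate is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcription and the system output, divided by the number of reference words. Here is a minimal dynamic-programming sketch with a made-up transcription pair:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "turn on the kitchen lights"
hypothesis = "turn on kitchen light"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```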
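
Finally, cosine similarity between two texts can be illustrated with simple bag-of-words count vectors. Real semantic-similarity setups typically compare embedding vectors produced by a model rather than raw word counts, but the cosine formula itself is the same.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two made-up sentences that differ in a single word
print(cosine_similarity("the invoice was paid on time",
                        "the invoice was settled on time"))
```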