Trying it out - What's next? - Generative AI with SnapGPT

Objective

After completing this lesson, you will be able to apply the concepts of the previous lesson - What's next? - Generative AI with SnapGPT.

Exercise

Your Turn!

Now it’s your turn. Download this document for a recap of the blocks we covered in this lesson and some hands-on exercises for you to explore.

What You Have Learned in This Lesson

In this lesson you have learned about "next token prediction", one of the important concepts of generative AI and modern Large Language Models (LLMs).

If you are looking for teaching resources or more information, check out this blog post about Snap!GPT.

n-grams in Natural Language Processing

When Claude Shannon, also known as the "father of information theory", published his paper "A Mathematical Theory of Communication" in 1948, he developed a mathematical framework for understanding communication. In this work, he was trying to improve the reliability of long-distance communication channels, such as telephone or telegraph lines, in the presence of noise.

Shannon was interested in how information can be transmitted and stored efficiently and how the information in a message can be quantified. Therefore, he was investigating how to mathematically describe the structure of language – not from a grammatical but rather a statistical perspective.

At the time, most linguists tried to analyze and model language through complex grammatical rules. Shannon followed a different approach. He assumed that language contains statistical patterns: some words are more likely to follow certain other words. Because word sequences do not occur with equal probability, he argued that much of language structure could be described using these probabilities.

To do this, he introduced the concept of n-grams, sequences of n consecutive words (or characters). By studying how frequently such sequences occur, Shannon showed that it is possible to approximate language structure using statistical methods rather than explicit grammatical rules.

The text “Once upon a time there was a king with a great forest” is divided into its bigrams, a list containing each word and its successor.
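As a sketch of this idea in Python (rather than Snap! blocks), the bigrams of the example sentence can be built by pairing each word with its successor:

```python
# Split the example sentence into words and pair each word with its successor.
text = "Once upon a time there was a king with a great forest"
words = text.split()

# Zipping the word list with itself shifted by one yields (word, successor) pairs.
bigrams = list(zip(words, words[1:]))

print(bigrams[:3])  # → [('Once', 'upon'), ('upon', 'a'), ('a', 'time')]
```

The same pattern extends to any n-gram size: zip the list with copies of itself shifted by 1, 2, …, n−1 positions.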

If you want to dive deeper into Shannon’s mathematical approach to information theory, you can read his paper here.

Next Token Prediction

Next-token prediction, as used in modern generative AI, requires input data in which the order of tokens encodes meaning. This is the case for words in a sentence, where changing the order can change the meaning, and equally for the notes of a musical composition, whose order determines the melody.

In addition, next-token prediction only works well if some tokens are more likely than others to follow a given sequence: there must be meaningful differences between the probabilities of the potential next tokens.

For example, most humans know that in the sequence

Once upon a _

the next token is more likely to be the word "time" than the word "flowerpot" or "coffee".

A computer system should be able to predict that next token correctly as well, because its corpus (all the texts that were used for training) should contain far more tetragrams whose fourth element is the word "time" than tetragrams ending in any of the other suggested words.

In the tetragrams of the “30 fairy tales” corpus, there are 9 occurrences of the sequence “once upon a”, always followed by the word “time”.
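This counting-based prediction can be sketched in a few lines of Python. The tiny corpus below is a made-up stand-in for the lesson's "30 fairy tales" texts; the idea is the same: collect every tetragram, then pick the most frequent fourth word for a given three-word context.

```python
from collections import Counter

# Toy corpus (hypothetical; stands in for the "30 fairy tales" corpus).
corpus = (
    "once upon a time there was a king . "
    "once upon a time there lived a queen . "
    "once upon a time a fox met a crow ."
).split()

# Collect tetragrams: for every 3-word context, count which words follow it.
follower_counts = {}
for i in range(len(corpus) - 3):
    context = tuple(corpus[i:i + 3])
    follower_counts.setdefault(context, Counter())[corpus[i + 3]] += 1

# Predict the most frequent follower of "once upon a".
context = ("once", "upon", "a")
prediction = follower_counts[context].most_common(1)[0][0]
print(prediction)  # → time
```

In this toy corpus, every occurrence of "once upon a" is followed by "time", so the prediction is unambiguous; in a real corpus the counts would merely make "time" much more likely than the alternatives.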

Context Window

The more information you have about the preceding sequence, the easier it becomes to predict a fitting next token. The number of tokens a model can take into account when predicting the next one is called its "context window".

In our case, the context window was 4 words, because our maximum n-gram size was 5. In modern LLMs, the context window spans many thousands of tokens, so they can generate grammatically better text and keep track of at least large parts of a conversation.
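The effect of the context window size can be illustrated with a small Python sketch (the sentence is the lesson's example; the helper function is hypothetical). With a context window of just one word, the word "a" has several plausible successors; widening the window to three words pins the continuation down.

```python
from collections import Counter

# The example sentence from the lesson, lowercased for counting.
corpus = "Once upon a time there was a king with a great forest".lower().split()

def followers(corpus, context_window):
    """Map each context of `context_window` words to a Counter of next words."""
    table = {}
    for i in range(len(corpus) - context_window):
        ctx = tuple(corpus[i:i + context_window])
        table.setdefault(ctx, Counter())[corpus[i + context_window]] += 1
    return table

# Context window of 1: "a" has three possible successors in this sentence …
print(sorted(followers(corpus, 1)[("a",)]))  # → ['great', 'king', 'time']

# … context window of 3: "once upon a" leaves only one continuation.
print(sorted(followers(corpus, 3)[("once", "upon", "a")]))  # → ['time']
```

A larger context window therefore narrows the candidate set, which is exactly why models with bigger windows produce more coherent text.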