Your Turn!
Now it’s your turn. Download this document for a recap of the blocks we covered in this lesson and some hands-on exercises for you to explore.
What You Have Learned in This Lesson
In this lesson you have learned about "next token prediction", one of the core concepts behind generative AI and modern Large Language Models (LLMs).
If you are looking for teaching resources or more information, check out this blog post about Snap!GPT.
n-grams in Natural Language Processing
When Claude Shannon, also known as the "father of information theory", published his paper "A Mathematical Theory of Communication" in 1948, he laid out a mathematical framework for understanding communication. In this work, he was trying to make long-distance communication, such as telephone and telegraph transmission, more reliable in the presence of noise.
Shannon was interested in how information can be transmitted and stored efficiently and how the information in a message can be quantified. To that end, he investigated how to describe the structure of language mathematically – not from a grammatical but from a statistical perspective.
At the time, most linguists tried to analyze and model language through complex grammatical rules. Shannon followed a different approach. He assumed that language contains statistical patterns: some words are more likely to follow certain other words. Because word sequences do not occur with equal probability, he argued that much of language structure could be described using these probabilities.
To do this, he introduced the concept of n-grams, sequences of n consecutive words (or characters). By studying how frequently such sequences occur, Shannon showed that it is possible to approximate language structure using statistical methods rather than explicit grammatical rules.
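To make this idea concrete, here is a minimal Python sketch of n-gram counting. The toy text and the choice of bigrams (n = 2) are our own illustration, not part of Shannon's paper or this lesson's blocks:

```python
from collections import Counter

# A tiny toy corpus (made up for illustration).
text = "once upon a time there was a princess once upon a time there was a dragon"
words = text.split()

n = 2  # bigrams: sequences of two consecutive words
ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# The most frequent bigrams reveal statistical patterns in the text.
for gram, count in ngrams.most_common(3):
    print(" ".join(gram), "->", count)
```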

If you want to dive deeper into Shannon’s mathematical approach to information theory, you can read his paper here.
Next Token Prediction
Next token prediction, as used in modern generative AI, requires input data in which the order of tokens encodes meaning. This is the case for words in a sentence, where changing the order can change the meaning of the sentence. It is also true for musical notes in a composition, whose order determines the melody.
In addition, next-token prediction only works well if some tokens are more likely than others to follow a given sequence: there must be meaningful differences in the probabilities of potential next tokens.
For example, most humans know that in the sequence
Once upon a _
the next token is more likely to be the word "time" than the word "flowerpot" or "coffee".
A computer system should also be able to predict that next token correctly: its corpus (all the texts that were used for training) should contain more tetragrams that start with "Once upon a" and end with "time" than tetragrams ending with any of the other suggested words.
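As a rough sketch of this counting idea (the toy corpus below is made up for illustration and is far too small to be a real training set), tetragram-based prediction could look like this in Python:

```python
from collections import Counter

# Toy corpus standing in for the training texts (made up for illustration).
corpus = (
    "once upon a time there lived a king . "
    "once upon a time in a faraway land . "
    "once upon a coffee break nothing happened ."
).split()

n = 4  # tetragrams: three words of context, the fourth word is the prediction
context = ("once", "upon", "a")

# Count which fourth word follows the given three-word context.
next_words = Counter(
    corpus[i + 3]
    for i in range(len(corpus) - n + 1)
    if tuple(corpus[i:i + 3]) == context
)

# "time" wins because it follows "once upon a" most often in this corpus.
print(next_words.most_common())  # [('time', 2), ('coffee', 1)]
```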

Context Window
The more information you have about the preceding sequence, the easier it becomes to predict a fitting next token. The number of tokens a model can take into consideration when predicting the next one is called its "context window".
In our case, the context window was 4 words, because our maximum n-gram size was 5. In modern LLMs, the context window spans many thousands of tokens, which is why they can generate grammatically better texts and keep track of at least large parts of a conversation.
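To illustrate why a larger context window helps (again with a made-up toy corpus, not material from this lesson), compare predictions based on one word of context with predictions based on three words:

```python
from collections import Counter

corpus = (
    "once upon a time there was a princess . "
    "once upon a time there was a dragon ."
).split()

def next_word_counts(context_size: int, context: tuple) -> Counter:
    """Count possible next words for a given context of `context_size` words."""
    return Counter(
        corpus[i + context_size]
        for i in range(len(corpus) - context_size)
        if tuple(corpus[i:i + context_size]) == context
    )

# With a context window of one word, "a" is ambiguous...
print(next_word_counts(1, ("a",)))   # Counter({'time': 2, 'princess': 1, 'dragon': 1})
# ...but with a window of three words, the prediction becomes much sharper.
print(next_word_counts(3, ("once", "upon", "a")))  # Counter({'time': 2})
```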