Your Turn!
Now it’s your turn. Download this document for a recap of the blocks we covered in this lesson and some hands-on exercises for you to explore.
What You Have Learned in This Lesson
In this video, you learned about some of the differences between the simple next-token prediction from the last lesson and modern generative AI models.
Here’s the blog post with further teaching resources about Snap! GPT in case you haven’t read it yet.
Word Embeddings
Instead of working directly with words and text, modern LLMs use tokens in the form of vectors. For text, these are called word embeddings.
A word embedding is a vector representation of a word that encodes the meaning of that word. Synonymous words have similar vectors even across different languages.
In the video, this was illustrated by finding the closest word to the features of the word "dog" in German. The result was "Hund", which is indeed the German translation of "dog".
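This nearest-word lookup can be sketched in a few lines of Python. The tiny 3-dimensional vectors below are made-up illustrative values (real embeddings have hundreds of dimensions learned from data); the idea of ranking words by cosine similarity is the real technique.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values for illustration;
# real models learn vectors with hundreds of dimensions).
embeddings = {
    "dog":  [0.90, 0.10, 0.30],
    "Hund": [0.88, 0.12, 0.28],  # German for "dog" -- deliberately close
    "cat":  [0.10, 0.90, 0.30],
    "car":  [0.20, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def closest_word(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

print(closest_word(embeddings["dog"], exclude={"dog"}))  # -> Hund
```

Because "Hund" was given almost the same vector as "dog", it wins the similarity ranking, mirroring the cross-language lookup shown in the video.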

Additionally, distances and directions between words have meaning, too. You can solve word analogy problems with word embeddings like:
cat is to kitten as dog is to X
You can calculate the difference between the word embeddings of kitten and cat and add it to the word embedding of dog to solve what X is – puppy.
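The analogy calculation above translates directly into vector arithmetic. Again the four vectors are hand-picked toy values (an assumption for illustration), chosen so that the "adult → young animal" direction is the same for cats and dogs:

```python
# Toy embeddings (hypothetical values). The difference kitten - cat points
# in the same direction as puppy - dog: the "adult -> young animal" shift.
vocab = {
    "cat":    [0.5, 0.9, 0.1],
    "kitten": [0.5, 0.9, 0.8],
    "dog":    [0.9, 0.2, 0.1],
    "puppy":  [0.9, 0.2, 0.8],
}

def analogy(a, b, c):
    """Solve 'a is to b as c is to X' via vector arithmetic: X = b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    # Return the closest remaining word by squared Euclidean distance.
    def dist(word):
        return sum((t - v) ** 2 for t, v in zip(target, vocab[word]))
    return min((w for w in vocab if w not in (a, b, c)), key=dist)

print(analogy("cat", "kitten", "dog"))  # -> puppy
```

With these toy values the arithmetic works out exactly: kitten − cat + dog = [0.9, 0.2, 0.8], which is precisely the vector for "puppy". In real embedding spaces the match is only approximate, so the nearest-neighbor search at the end is essential.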

Transformer Architecture in LLMs
Modern LLMs use a so-called "transformer" architecture in which two neural networks – an encoder and a decoder – work together to generate new text.

Input is first tokenized, i.e. split into individual tokens that form the input for the encoder neural network. Tokens are not always whole words or expressions; sometimes a token is just part of a word or a punctuation mark.
You can see how text is tokenized in this OpenAI Tokenizer program:
https://platform.openai.com/tokenizer
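To get a feel for how words break into smaller pieces, here is a deliberately simplified tokenizer sketch. Real tokenizers (including OpenAI's) use byte-pair encoding with vocabularies of tens of thousands of entries; this toy version just greedily matches the longest piece from a tiny hand-made vocabulary, which is an assumption for illustration:

```python
# Tiny hand-made vocabulary (an illustrative assumption; real BPE
# vocabularies are learned from data and have ~100k entries).
VOCAB = ["token", "ization", "un", "believ", "able", "is", " ", "!"]

def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        if match is None:          # unknown character: emit it as its own token
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("tokenization is unbelievable!", VOCAB))
# -> ['token', 'ization', ' ', 'is', ' ', 'un', 'believ', 'able', '!']
```

Note how "tokenization" and "unbelievable" each split into several sub-word tokens while the exclamation mark becomes a token of its own – the same behavior you can observe in the OpenAI Tokenizer linked above.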
In the encoder neural network, those tokens are vectorized, meaning they are converted into word embeddings. Those embeddings aren’t recalculated by the LLM every time; instead, the model looks up each token in a huge, precomputed table of word embeddings.
But those aren’t the only vectors attached to each token. Since the order of words in a sentence also carries meaning, a positional vector is assigned to each token to keep track of the order of tokens.
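One common way to build such positional vectors is the sinusoidal scheme from the original Transformer paper: each position gets a fixed pattern of sine and cosine values at different frequencies, which is then added to the token's word embedding. A minimal sketch, assuming that scheme:

```python
import math

def positional_vector(pos, dim=8):
    """Sinusoidal positional encoding: even indices use sine, odd indices
    cosine, with the frequency decreasing along the dimensions."""
    vec = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        angle = pos * freq
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

# Every position in the sequence gets a distinct, fixed vector.
for pos in range(3):
    print(pos, [round(x, 2) for x in positional_vector(pos)])
```

Because the pattern is different for every position, the model can tell "dog bites man" from "man bites dog" even though both sentences contain exactly the same tokens.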
And there’s yet one more vector for each token. The self-attention vector encodes the relationship between the tokens in a sequence and represents how much the meaning of each token depends on the other tokens in that sequence.
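The core of self-attention can be sketched as scaled dot-product attention. The version below is simplified on purpose (an assumption for readability): it skips the learned query/key/value projection matrices that real transformers use, letting every token vector play all three roles, but the weighting logic is the genuine mechanism:

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention (simplified: no learned
    query/key/value projections)."""
    dim = len(vectors[0])
    output = []
    for query in vectors:
        # How strongly this token attends to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: a weighted mix of all token vectors.
        output.append([sum(w * v[i] for w, v in zip(weights, vectors))
                       for i in range(dim)])
    return output

# Three toy 2-dimensional token embeddings (hypothetical values).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mixed = self_attention(tokens)
```

Each output vector is a blend of all input vectors, weighted by how related the tokens are – which is exactly what "how much the meaning of each token depends on the other tokens" means in vector form.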
All these vectors are the output of the encoder neural network, which is fed into the decoder neural network, where the actual next-token prediction happens.
Pre-trained with a process called "masking", the decoder can then predict the next tokens for a sequence based on the input it receives.
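In decoder-style models, this masking is usually a causal mask: during training, each position may only attend to itself and earlier positions, so the model learns to predict the next token without peeking ahead. A minimal sketch of that mask (the −∞ entries are added to the attention scores before the softmax, which zeroes out the blocked positions):

```python
# Causal ("look-ahead") mask: 0.0 where attention is allowed,
# -inf where a token would otherwise peek at a future token.
def causal_mask(n):
    """Return an n x n mask for a sequence of n tokens."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(n)]
            for row in range(n)]

for row in causal_mask(4):
    print(row)
```

The lower-triangular pattern means token 1 sees only itself, token 2 sees tokens 1–2, and so on – exactly the setup needed to train next-token prediction on whole sequences at once.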