Trying it out - Maths with Language: Word Embeddings

Objective

After completing this lesson, you will be able to apply the concepts of the previous lesson - Maths with Language: Word Embeddings.

Exercise

Your Turn!

Now it’s your turn. Download this document for a recap of the blocks we covered in this lesson and some hands-on exercises for you to explore.

What You Have Learned in This Lesson

In this video you learned about some of the differences between the simple next token prediction from the last lesson and modern generative AI models.

Here’s the blog post with further teaching resources about Snap! GPT, in case you haven’t read it yet.

Word Embeddings

Instead of working directly with words and text, modern LLMs use tokens in the form of vectors. For text, those are called word embeddings.

A word embedding is a vector representation of a word that encodes the meaning of that word. Synonymous words have similar vectors even across different languages.

In the video, this was demonstrated by finding the word closest to the features of the word "dog" in German. The result was the word "Hund", which is indeed the German translation of "dog".

Finding the closest vector to the vector of a word in a different language results in the translation of that word.
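The idea of "closest vector" can be sketched with cosine similarity. This is a minimal illustration with made-up three-dimensional vectors (real embeddings have hundreds of dimensions, and real models use far larger vocabularies):

```python
import math

# Toy 3-D "embeddings" -- all values are invented for illustration only.
embeddings = {
    "dog":  [0.9, 0.1, 0.3],
    "Hund": [0.88, 0.12, 0.31],  # German for "dog": deliberately close to "dog"
    "car":  [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(word, candidates):
    """Return the candidate word whose vector is closest to `word`'s vector."""
    return max(candidates,
               key=lambda c: cosine_similarity(embeddings[word], embeddings[c]))

print(nearest("dog", ["Hund", "car"]))  # prints "Hund"
```

Because "Hund" was given a vector close to "dog", the nearest-neighbor search finds the translation, just as in the video.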

Additionally, distances and directions between words have meaning, too. You can solve word analogy problems with word embeddings like:

cat is to kitten as dog is to X

You can calculate the difference between the word embeddings of kitten and cat and add it to the word embedding of dog to find X: puppy.

In word embeddings, words that are closer together have more similar meanings. Word pairs separated by the same distance and direction are analogous: in this case, puppy is to dog as kitten is to cat. The script on the right can calculate the word for a baby dog (puppy) from the vectors of kitten, cat, and dog.
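The analogy calculation above can be sketched in a few lines. The vectors below are made up so that the third component acts as a "baby" direction; real embeddings learn such directions from data:

```python
import math

# Invented 3-D vectors for illustration; real embeddings are much larger.
vecs = {
    "cat":    [0.8, 0.2, 0.1],
    "kitten": [0.8, 0.2, 0.9],   # same as "cat", shifted along the "baby" axis
    "dog":    [0.2, 0.9, 0.1],
    "puppy":  [0.2, 0.9, 0.9],
    "car":    [0.5, 0.5, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# X = kitten - cat + dog
target = [k - c + d for k, c, d in zip(vecs["kitten"], vecs["cat"], vecs["dog"])]

# Find the closest known word to the target vector, excluding the inputs.
answer = max((w for w in vecs if w not in {"kitten", "cat", "dog"}),
             key=lambda w: cosine(vecs[w], target))
print(answer)  # prints "puppy"
```

The difference kitten minus cat isolates the "baby" direction; adding it to dog lands on the vector for puppy.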

Transformer Architecture in LLMs

Modern LLMs use a so-called "transformer" architecture, in which two neural networks, the encoder and the decoder, work together to generate new text.

Rough overview of the transformer architecture. The input is tokenized and then processed by the encoder neural network. The results of the encoder neural network are used by the decoder neural network to generate new output.

Input is first tokenized, i.e. split into individual tokens that form the input to the encoder neural network. Tokens are not always whole words or expressions; sometimes a token is just part of a word or a punctuation mark.

You can see how text is tokenized in this OpenAI Tokenizer program:

https://platform.openai.com/tokenizer
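A toy tokenizer makes the idea visible. Note that this is only a rough sketch: real LLM tokenizers (like the OpenAI one linked above) use learned subword vocabularies such as byte pair encoding, not a fixed rule like this one:

```python
import re

def toy_tokenize(text):
    # Split into runs of word characters, keeping each punctuation
    # character as its own token -- a crude stand-in for real subword
    # tokenization.
    return re.findall(r"\w+|[^\w\s]", text)

print(toy_tokenize("Transformers don't read words!"))
# ['Transformers', 'don', "'", 't', 'read', 'words', '!']
```

Even this simple splitter shows that tokens need not align with whole words: "don't" falls apart into three tokens, and "!" becomes its own token.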

In the encoder neural network, those tokens are vectorized, meaning they are first converted into word embeddings. These embeddings aren’t computed from scratch every time; instead, the model looks up the corresponding vector for each token in a huge, precomputed table of word embeddings.
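The lookup step itself is simple: conceptually it is a table from token to vector. The table and values below are invented for illustration; real embedding tables have tens of thousands of entries with hundreds of dimensions each:

```python
# A miniature stand-in for a precomputed embedding table.
embedding_table = {
    "the":  [0.1, 0.4],
    "dog":  [0.9, 0.1],
    "bark": [0.7, 0.3],
    "s":    [0.2, 0.2],   # a subword token, e.g. a plural ending
}

tokens = ["the", "dog", "bark", "s"]

# Embedding lookup: one precomputed vector per token, no computation needed.
embedded = [embedding_table[t] for t in tokens]
print(embedded)
```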

But those aren’t the only vectors attached to each token. Since the order of words in a sentence also carries meaning, a positional vector is assigned to each token to keep track of the order of tokens.
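The lesson doesn’t specify how the positional vectors are built; one well-known scheme, from the original transformer paper, uses sine and cosine waves of different frequencies so that every position gets a distinct vector. A minimal sketch, assuming that sinusoidal scheme:

```python
import math

def positional_vector(position, dim=4):
    """Sinusoidal positional encoding: a distinct vector for each position."""
    vec = []
    for i in range(0, dim, 2):
        # Lower dimensions oscillate fast, higher dimensions slowly.
        angle = position / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

print(positional_vector(0))  # [0.0, 1.0, 0.0, 1.0]
print(positional_vector(1))  # a different vector for the second position
```

Each token’s positional vector is added to its word embedding, so the same word at two different positions produces two different inputs to the network.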

And there’s yet one more vector for each token. The self-attention vector encodes the relationship between the tokens in a sequence and represents how much the meaning of each token depends on the other tokens in that sequence.
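The core of self-attention can be sketched as scaled dot products between token vectors. This is a simplification: real transformers first project each embedding into separate query, key, and value vectors, which are omitted here to keep the idea visible:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(vectors):
    """For each token, how strongly it attends to every token in the sequence."""
    dim = len(vectors[0])
    weights = []
    for q in vectors:                                   # each token attends...
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in vectors]                     # ...to every token
        weights.append(softmax(scores))
    return weights

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up token embeddings
for row in attention_weights(tokens):
    print([round(w, 2) for w in row])
```

Each row of the result sums to 1 and shows how much the meaning of that token depends on each of the other tokens; similar vectors attend to each other more strongly.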

All these vectors together form the output of the encoder neural network, which is fed into the decoder neural network, where the actual next-token prediction happens.

Pre-trained with a process called "masking", the decoder can then predict the next tokens for a sequence based on the input it receives.
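The training idea behind masking can be sketched without a neural network at all: hide the token that comes next and learn to predict it from what came before. Here a toy bigram counter stands in for the decoder; the corpus is invented, and a real decoder learns far richer patterns than word-pair counts:

```python
from collections import Counter, defaultdict

# A tiny made-up training corpus.
corpus = "the dog sees the cat the dog runs".split()

# "Training": for every position, mask the next token and record which
# token actually followed.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Predict the masked next token: the most frequent follower seen in training."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # prints "dog" ("dog" followed "the" most often)
```

A transformer decoder does the same job, but predicts from the whole preceding sequence (via the embedding, positional, and attention vectors described above) rather than from a single previous word.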