Maths with Language: Word Embeddings

Objective

After completing this lesson, you will be able to explain word embeddings: the vector representations of a vocabulary that Large Language Models use both during training and when generating text.

Content

Video summary:

The video gives an overview of how modern large language models work: tokens are converted into numeric vectors (embeddings), enriched with positional information, and processed by a transformer (encoder + decoder). The key transformer innovation is self-attention, which lets the model weigh how each token relates to the others. Training uses masking to teach the decoder to predict missing tokens, and generation repeatedly predicts and appends the next token.

Key points:

  • Tokenization + embeddings: words/tokens become high‑dimensional numeric vectors that capture semantic similarity.
  • Positional encoding: position information is added so that identical tokens in different positions are treated differently.
  • Self‑attention (encoder output): the model computes how much each token should attend to every other token, producing context‑aware representations.
  • Decoder training & generation: masking trains the decoder to predict masked tokens; generation uses next‑token prediction repeatedly to produce text (and the same ideas extend to music or sketches).
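To make the first point concrete, here is a minimal sketch of embeddings and semantic similarity. The vocabulary, the 4-dimensional vectors, and the random initialisation are all illustrative assumptions; real models learn tables with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy embedding table: each token maps to a 4-dimensional vector.
# (Hypothetical vocabulary; real embeddings are learned, not random.)
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car"]
embeddings = {tok: rng.normal(size=4) for tok in vocab}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(embeddings["cat"], embeddings["dog"])
print(f"similarity(cat, dog) = {sim:.3f}")
```

In a trained model, tokens with related meanings (such as "cat" and "dog") end up with a higher cosine similarity than unrelated ones.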
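Positional encoding can be sketched with the sinusoidal scheme used by the original transformer; the sequence length and dimensionality below are arbitrary choices for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding: even dimensions use sine,
    # odd dimensions use cosine, at wavelengths that vary by dimension.
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = positional_encoding(6, 8)
print(pe.shape)  # (6, 8): one encoding vector per position
```

Because each position gets a distinct vector, adding `pe` to the token embeddings makes identical tokens at different positions look different to the model.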
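Self-attention itself reduces to a few matrix operations. The sketch below implements single-head scaled dot-product attention with randomly initialised weight matrices (an assumption for illustration; trained models learn these weights).

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention.
    q, k, v = x @ wq, x @ wk, x @ wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # token-to-token relevance scores
    # Softmax over each row: attention weights sum to 1 per token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights      # context-aware representations

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))          # 3 tokens, 4-dimensional embeddings
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = self_attention(x, wq, wk, wv)
print(weights.round(3))              # each row sums to 1
```

Each output row is a weighted mixture of all token values, which is what "context-aware representation" means in practice.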
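The masking used in decoder training can be sketched as a causal mask (one common scheme, assumed here): position i may only attend to positions up to i, so each token's prediction cannot peek at the future.

```python
import numpy as np

# Causal mask for a 4-token sequence: True marks future positions.
seq_len = 4
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.zeros((seq_len, seq_len))
scores[mask] = -np.inf  # masked scores become 0 after softmax
print(scores)
```

With these scores, the softmax in self-attention gives future tokens zero weight, which is exactly what forces the decoder to learn next-token prediction.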
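Finally, the generation loop (predict, append, repeat) can be illustrated with a deliberately tiny stand-in for the model; the bigram lookup table below is a hypothetical toy, not how a real LLM predicts tokens.

```python
# Toy "model": a bigram table mapping each token to its most likely successor.
bigram = {"the": "cat", "cat": "sat", "sat": "down"}

def generate(prompt, steps):
    # Repeatedly predict the next token and append it to the sequence.
    tokens = prompt.split()
    for _ in range(steps):
        nxt = bigram.get(tokens[-1])
        if nxt is None:  # no prediction available: stop early
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the", 3))  # -> "the cat sat down"
```

A real model replaces the lookup table with a transformer that scores every vocabulary token, but the loop structure is the same, which is why the idea extends to music or sketches as well as text.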