srinivas raghav's blog

quick guide to understanding attention and transformers!

12 Nov, 2025

Rough Timeline

Should take about a day at most.

best explanation of attention

The Attention Mechanism in Large Language Models - YouTube - A brief introduction to embeddings and attention.

The math behind Attention: Keys, Queries, and Values matrices - YouTube - The math and intuition behind K, Q, V, multi-head attention (MHA), and scaled dot-product attention.

What are Transformer Models and how do they work? - YouTube - Putting Things Together.

Keys, Queries, and Values: The celestial mechanics of attention - YouTube - A Quick Look Again.

Attention? Attention! | Lil'Log

Cross Attention | Method Explanation | Math Explained - YouTube

Transformers from scratch | peterbloem.nl

tokenization

Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs – fast.ai

the intuition behind word embeddings

What Are Word Embeddings? - YouTube - An Introduction to Word Embeddings.

Word2vec from Scratch - Jake Tae

the intuition behind the position encoding

How do Transformer Models keep track of the order of words? Positional Encoding - YouTube

The wonderful world of positional encoding – Bocachancla 🫦🩴

Rotary Positional Embeddings Explained | Transformer - YouTube

understand the whole picture

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.

Attention is All You Need - Jake Tae

Some intuitions about transformers - Aryaman Arora

3Blue1Brown full lecture on transformers

Attention Is All You Need - alphaXiv blog

The Annotated GPT-2

Attention and Augmented Recurrent Neural Networks

The Transformer Family Version 2.0 | Lil'Log

general deep learning

Calculus on Computational Graphs: Backpropagation -- colah's blog

Deep Learning, NLP, and Representations - colah's blog

Neural Networks, Manifolds, and Topology -- colah's blog

Understanding LSTM Networks -- colah's blog

Introduction to seq2seq models - Jake Tae

Introduction to tf-idf - Jake Tae

Dissecting LSTMs - Jake Tae

A Brief Introduction to Recurrent Neural Networks - Jake Tae

Demystifying Entropy (And More) - Jake Tae

Recommendation Algorithm with SVD - Jake Tae

LoRA - Jake Tae

Likelihood and Probability - Jake Tae