Large Language Models (LLMs) are often treated as black boxes. We feed them prompts, and they spit out code, essays, or architectural advice. But underneath the billions of parameters and GPU clusters, what is actually happening?
On February 12, 2026, Andrej Karpathy published MicroGPT — a single Python file, about 200 lines, zero dependencies, that trains and runs a GPT from scratch. No PyTorch. No TensorFlow. The only imports are os, math, and random. He called it an “art project,” and I think that’s underselling it.
MicroGPT is the end point of a long series of Karpathy’s educational projects: micrograd (autograd engine), makemore (character-level generation), nanoGPT (practical GPT training), and finally this — the whole thing distilled into one file. As he put it: “This file is the complete algorithm. Everything else is just efficiency.”
What it actually does
MicroGPT trains on a dataset of about 32,000 baby names. It learns the statistical patterns of how names are spelled, then generates new fake names that sound real but never existed. Names like “karia” or “alend.” It’s a character-level language model, so each letter is a token.
That might sound trivial compared to ChatGPT, but the mechanism is identical. ChatGPT is this same core loop (predict next token, sample, repeat) scaled up with more data, more parameters, and post-training to make it conversational. When you chat with it, the system prompt, your message, and its reply are all just tokens in a sequence. The model is completing the document one token at a time, same as MicroGPT completing a name.
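That core loop is small enough to sketch directly. The snippet below is illustrative, not MicroGPT's actual interface: `model` stands in for any function that returns a probability for each token id.

```python
import random

def generate(model, context, n_tokens):
    # The loop shared by MicroGPT and ChatGPT alike:
    # predict a distribution over the next token, sample one, append, repeat.
    out = list(context)
    for _ in range(n_tokens):
        probs = model(out)                      # one probability per token id
        ids = list(range(len(probs)))
        next_id = random.choices(ids, weights=probs)[0]
        out.append(next_id)
    return out
```

Everything downstream (chat formatting, system prompts, tool calls) is just more tokens fed through this same loop.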
Why it is worth reading
Production models like Llama 3 or GPT-4 bury their core logic under layers of optimizations, distributed training, and hardware-specific tweaks. Good luck figuring out how attention actually works by reading that code.
MicroGPT strips all of that away. The model has exactly 4,192 parameters. GPT-2 had 1.5 billion. Modern LLMs have hundreds of billions. But the architecture is the same shape: embeddings, attention, feed-forward networks, residual connections. Just much, much smaller.
What makes MicroGPT unusual is that it does not just implement the model. It implements everything from scratch. The entire file contains:
- A character-level tokenizer (each unique character gets an integer ID)
- An autograd engine (the Value class, which tracks gradients through the computation graph)
- A GPT-2-like neural network (with RMSNorm instead of LayerNorm, ReLU instead of GeLU, no biases)
- The Adam optimizer
- The training loop and inference loop
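The tokenizer is the simplest of these pieces. A sketch of the character-level scheme described above (helper names are mine, not MicroGPT's):

```python
def build_tokenizer(text):
    # Each unique character gets an integer ID, sorted for determinism.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: ''.join(itos[i] for i in ids)
    return encode, decode

encode, decode = build_tokenizer("emma\nolivia\nava\n")
print(encode("ava"))  # → [1, 7, 1] for this text's sorted vocabulary
```

On the names dataset this yields 27 token ids: the 26 lowercase letters plus the newline that separates names.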
The autograd part is the thing that surprised me. In normal deep learning, you use PyTorch or JAX to compute gradients automatically. Here, Karpathy reimplements backpropagation from scratch in about 30 lines using a Value class that wraps scalars and tracks their dependencies. Every addition, multiplication, and activation function creates a node in a computation graph. When you call loss.backward(), it walks the graph in reverse and computes gradients using the chain rule. The same thing PyTorch does, just one scalar at a time instead of batched tensors on a GPU.
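To make that concrete, here is a minimal scalar autograd node in the same spirit. This is a sketch supporting only addition and multiplication, not Karpathy's exact code:

```python
class Value:
    # Wraps a scalar, remembers which Values produced it, and knows how to
    # push gradients back to them via the chain rule.
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad         # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a      # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
print(a.grad, b.grad)  # → 4.0 2.0
```

Add a few more operations (power, exp, ReLU) and this is enough to train a neural network.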
The actual code
Here is the model function, slightly cleaned up for readability. In the real code, it is a single flat function, not a class:
```python
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]   # token embedding
    pos_emb = state_dict['wpe'][pos_id]     # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    x = rmsnorm(x)

    for li in range(n_layer):
        # 1) Multi-head attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k)
        values[li].append(v)
        # ... attention computation with Q, K, V ...
        x = [a + b for a, b in zip(x, x_residual)]  # residual connection

        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]  # residual connection

    logits = linear(x, state_dict['lm_head'])
    return logits
```

No classes. No inheritance. No framework. Just functions that take lists of Value objects and return lists of Value objects. linear is a matrix-vector multiply. rmsnorm normalizes activations. softmax turns raw scores into probabilities. Each one is two or three lines long.
The state_dict is a plain dictionary of weight matrices, each made of Value objects that know how to compute their own gradients. That’s how the whole thing fits in 200 lines.
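For reference, here is what those helpers look like. This sketch operates on plain floats for clarity; in MicroGPT they operate on Value objects, which is what makes them differentiable:

```python
import math

def linear(x, W):
    # Matrix-vector multiply: each output is a dot product of a weight row with x.
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def rmsnorm(x, eps=1e-5):
    # Scale x so its root-mean-square is ~1 (no mean subtraction, unlike LayerNorm).
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(ms + eps) for xi in x]

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, normalize to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```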
Self-attention, explained concretely
Self-attention is what makes this a transformer instead of a bag-of-characters model. It is how the model learns that the “m” in “emma” relates to the “e” before it.
Each token gets projected into three vectors: a Query (Q), a Key (K), and a Value (V). The Query says “what am I looking for?”, the Key says “what do I contain?”, and the Value says “what information do I carry?” The model computes dot products between the current Query and all cached Keys, runs them through softmax to get attention weights, then takes a weighted sum of the Values.
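That dot-product-then-weighted-sum recipe can be sketched in a dozen lines. Function names here are mine; this is the standard scaled dot-product attention step for a single query against a cache of keys and values:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    return [e / t for e in exps]

def attend(q, keys, values):
    # Score the query against every cached key (scaled by sqrt of dimension),
    # softmax the scores into weights, then blend the cached values.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
```

A query that matches one key strongly pulls the output toward that key's value; a query that matches nothing gets an even blend of everything seen so far.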
In MicroGPT’s code, this is fully explicit. You can see each dot product being computed in a loop, one element at a time. In PyTorch, the same computation is hidden inside batched matrix multiplications. Same math, different packaging.
Running it yourself
The whole thing is a single file. You download it and run python train.py. That’s it. No pip install, no environment setup, no CUDA drivers.
On Karpathy’s MacBook, training takes about a minute. After 500 steps, the loss drops from around 3.3 (random guessing among 27 tokens) to about 2.37, and you start seeing plausible generated names in the output.
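That starting loss is no accident. A uniform guess over a 27-token vocabulary (26 letters plus a separator) assigns probability 1/27 to the correct next character, so the expected cross-entropy before any training is ln(27):

```python
import math

# Cross-entropy of a uniform guess over 27 tokens: -ln(1/27) = ln(27)
baseline = -math.log(1 / 27)
print(round(baseline, 4))  # → 3.2958
```

Any loss meaningfully below that baseline means the model has learned real statistical structure in the names.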
You can swap out the dataset and train on anything character-level — city names, Pokemon, short poems. The rest of the code does not change. You can also tweak the hyper-parameters to see what happens: change n_embd, n_layer, n_head, and watch how it affects what the model learns.
If you want the full guided walkthrough, Karpathy wrote a detailed companion blog post at karpathy.github.io/2026/02/12/microgpt that walks through every section of the code. There is also a Google Colab notebook if you want to run it without downloading anything.
The gap between this and ChatGPT
MicroGPT is the algorithm. ChatGPT is that same algorithm with a long list of engineering on top. Karpathy’s blog post spells out the differences section by section, and none of them change the core loop. The tokenizer goes from single characters to BPE with 100K tokens. The autograd goes from scalar Value objects to batched tensor operations on GPUs. The 4,192 parameters become hundreds of billions. The single-document training steps become batches of millions of tokens processed across thousands of GPUs for months.
But the structure is the same. Embed tokens, add positions, pass through attention and MLP blocks on a residual stream, project to logits, compute loss, backpropagate, update parameters. That is what MicroGPT makes visible.
If you are going to spend your days working alongside these tools, it helps to know what they are actually doing. MicroGPT makes that concrete in an afternoon of reading.