First Principles of Large Language Models
Harper Carroll AI · AI User to AI Builder · Cohort 1
The single most important shift you'll make today:
Every frustration you've had with AI comes from the gap
between these two mental models.
It's not magic. It's math.
How LLMs process language and why they hallucinate
Pre-training, fine-tuning, and RLHF
From chatbots to autonomous systems that take action
Navigating the model landscape and choosing the right engine
How LLMs think — and why they hallucinate
A next-token prediction engine trained on massive text — not a reasoning mind.
These capabilities can be layered ON TOP of the core prediction engine — we'll cover RAG, agents, and reasoning models later.
From text to the numbers a transformer actually processes.
Step 1 — Tokenize
Split text into chunks. Note: "building" → two tokens. 1 token ≈ 4 characters.
Step 2 — Token IDs
Each token maps to an integer via a fixed vocabulary table. Just a lookup.
Step 3 — Embeddings
Each ID maps to a high-dimensional vector.
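The three steps can be sketched in a few lines. This is a toy illustration — the vocabulary, the greedy longest-match rule, and the 4-dimensional random embeddings are all made up for display; real tokenizers (e.g. BPE) learn vocabularies of ~100K entries from data, and real embeddings have thousands of dimensions.

```python
import random

vocab = {"build": 0, "ing": 1, "the": 2, "cat": 3}   # token → ID (hypothetical table)

def tokenize(text):
    """Step 1: greedy longest-match split against the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            i += 1   # skip characters the toy vocab doesn't cover
    return tokens

random.seed(0)
embeddings = {tid: [random.random() for _ in range(4)]
              for tid in vocab.values()}   # 4 dims for display only

tokens = tokenize("building")            # Step 1: ["build", "ing"] — two tokens
ids = [vocab[t] for t in tokens]         # Step 2: [0, 1] — just a lookup
vectors = [embeddings[i] for i in ids]   # Step 3: two 4-dim vectors
```

From here on, the transformer only ever sees the vectors — never the original text.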
The model computes probabilities for every possible next token.
Prompt: "The capital of France is ___"
The model's working memory — everything it can "see" at once.
400K
tokens (~1,000 pages)
1M
tokens (beta; 200K standard)
1M
tokens (~2,500 pages)
Models pay the most attention to the beginning and end of the context window.
Information buried in the middle gets overlooked — even critical details like "my partner is allergic to shellfish."
The foundation that transformers are built on.
Layers of connected "neurons" that transform inputs into outputs.
We can see inputs and outputs — but not what happens in between.
Input
"The capital of France is"
?
Billions of weights
doing... something
Output
"Paris"
A visual primer before we dive in.
"Attention Is All You Need" — Vaswani et al., Google Brain, 2017. The paper that launched GPT, Claude, Gemini, and the entire modern AI era.
Simplified transformer block:
"The cat sat on the mat because it was tired"
What does "it" attend to?
The model learns that "it" refers to "cat" — not through grammar rules, but through statistical patterns.
They're not bugs — they're a fundamental property of the architecture.
The model doesn't directly retrieve information.
It generates statistically likely text.
It has no internal fact-checker.
It only knows what sounds right based on patterns in training data.
See Appendix for more details on Transformers
Retrieval-Augmented Generation: before the LLM answers, a retriever searches your documents and injects the relevant results into the prompt.
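A minimal sketch of that flow. The documents, query, and word-overlap scoring are invented for illustration — real RAG systems rank by embedding similarity, not shared words — but the shape (retrieve, then inject into the prompt) is the same.

```python
import string

docs = [
    "The Eiffel Tower is in Paris.",
    "Our refund policy allows returns within 30 days.",
    "Shellfish allergies can be severe.",
]

def words(text):
    """Lowercase and strip punctuation so 'policy?' matches 'policy'."""
    return set(text.lower().translate(str.maketrans("", "", string.punctuation)).split())

def retrieve(query, k=1):
    """Rank documents by shared words with the query (toy relevance score)."""
    q = words(query)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(query):
    context = "\n".join(retrieve(query))
    # The LLM now answers from retrieved text instead of pure statistical
    # recall — which reduces (not eliminates) hallucination.
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

prompt = build_prompt("What's the refund policy for returns?")
```

The key design point: retrieval happens *before* generation, so the relevant facts are sitting inside the context window when the model predicts its next tokens.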
Pre-training, Fine-tuning, and RLHF
(Reinforcement Learning from Human Feedback)
Learn language patterns
from massive text data
Learn to follow
instructions & tasks
Learn human preferences
& safety alignment
The product
you interact with
Read the internet. Learn patterns. Predict the next word.
15T+
tokens of training data
$100M+
estimated compute cost
Months
of training on thousands of GPUs
Teaching the base model to follow instructions and be useful.
User: What is the capital of France?
Model: What is the capital of Germany?
What is the capital of Spain?
What is the capital of Italy?...
It continues the pattern, not the conversation.
User: What is the capital of France?
Model: The capital of France is Paris.
It answers the question helpfully.
RLHF — teaching the model what humans actually want.
Model produces multiple responses to the same prompt
Human reviewers rank responses from best to worst
A reward model learns to predict human preferences
Main model is tuned to maximize the reward score
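The four steps, sketched with toy data. The "responses" and ranking are invented; the one real mechanic shown is how a human ranking becomes pairwise training examples — real RLHF trains the reward model so that reward(preferred) > reward(rejected) on every such pair.

```python
# Step 1: model produces multiple responses to the same prompt (abbreviated)
responses = {
    "helpful": "Lock picking is a legitimate skill used by locksmiths...",
    "refuse":  "I can't help with that.",
    "harmful": "Sure! First, get a tension wrench...",
}

# Step 2: a human reviewer ranks them, best → worst
human_ranking = ["helpful", "refuse", "harmful"]

# Step 3: the ranking expands into pairwise preferences the reward model
# learns from — (preferred, rejected) for every pair in the ordering
pairs = [(human_ranking[i], human_ranking[j])
         for i in range(len(human_ranking))
         for j in range(i + 1, len(human_ranking))]

# Step 4 (not shown): the main model is tuned — typically with PPO — to
# produce outputs that score highly under the learned reward model.
```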
Human reviewers rank model outputs — the model learns from their preferences.
Prompt: "How do I pick a lock?"
"Sure! First, get a tension wrench and a rake pick. Insert the wrench into the bottom of the keyhole..."
Rank: 3rd (last)
"I can't help with that. Lock picking could be used for illegal purposes and I'm not able to provide instructions."
Rank: 2nd
"Lock picking is a legitimate skill used by locksmiths and security professionals. If you're locked out, here are your options..."
Rank: 1st ★
This is why
Claude ≠ ChatGPT
Different reviewers, different rankings
RLHF teaches models to agree with you — even when you're wrong.
RLHF rewards responses humans prefer. Humans prefer responses that agree with them. So the model learns:
"Agreement = reward"
📚
School
Broad knowledge from reading the internet
Cost: $10M–$100M+
Data: trillions of tokens
Time: months
🎯
Job Training
Learn to follow instructions and be useful
Cost: $10K–$1M
Data: thousands of examples
Time: days
🤝
Social Skills
Learn what humans prefer and how to be safe
Cost: $1M–$10M
Data: human rankings
Time: weeks
Why open-source caught up so fast — and why small models keep getting better.
Distilled models are compressed knowledge — like great notes vs. actually taking the class. Not quite as good as the teacher, but dramatically cheaper to build and run.
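The core of distillation in one formula: train the student to match the teacher's output distribution, often by minimizing the KL divergence between them. The probabilities below are made-up numbers for a single next-token prediction; real training averages this loss over billions of tokens.

```python
import math

teacher = [0.7, 0.2, 0.1]   # teacher's next-token probabilities (made up)
student = [0.5, 0.3, 0.2]   # student's probabilities before this update

# KL(teacher || student): how far the student's distribution is from
# the teacher's. The training loss pushes this toward zero.
kl = sum(t * math.log(t / s) for t, s in zip(teacher, student))
```

A student that drives this loss down inherits the teacher's "notes" — its output behavior — without redoing the teacher's full pre-training.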
When distillation, export controls, and geopolitics collided.
Jan 20, 2025 — The claim
DeepSeek releases R1, claiming performance comparable to OpenAI's best models — trained for just $5.6 million on older chips.
Jan 27, 2025 — The plummet
NVIDIA loses $589 billion in market cap in a single day — the largest one-day loss in stock market history.
If you don't need expensive chips, you don't need NVIDIA.
Feb 2025 — The accusations
OpenAI tells Congress it detected DeepSeek employees using obfuscated methods to extract outputs from ChatGPT for training.
White House AI czar David Sacks: "substantial evidence" of distillation.
OpenAI accused DeepSeek of using their API outputs to train R1 — violating OpenAI's terms of service. Evidence presented to Congress but not made public.
Sources: Bloomberg, Feb 2026; Fortune, Jan 2025
DeepSeek allegedly acquired thousands of banned NVIDIA Blackwell chips despite US export controls. DOJ separately busted a $160M smuggling ring moving H100/H200 chips to China.
NVIDIA called the DeepSeek-specific claims "far-fetched." Sources: CNBC, Dec 2025; WinBuzzer, Dec 2025
From chatbots to autonomous systems that take action
Agents don't just respond — they take action in a loop.
One turn at a time. No autonomy.
How they connect to tools, how to control them, and how to build reliable agent workflows.
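The agent loop itself is small. Everything here is a stand-in: `fake_model` replaces a real LLM call, and the single `search` tool is invented — but the loop structure (model decides → tool runs → observation feeds back → repeat until done) is the core pattern.

```python
def fake_model(history):
    """Stand-in for an LLM call: looks at the history, returns the next action."""
    if "search_result" not in str(history):
        return {"tool": "search", "args": "weather Paris"}
    return {"tool": "finish", "args": "It's sunny in Paris."}

tools = {"search": lambda q: f"search_result: sunny ({q})"}

history = ["user: what's the weather in Paris?"]
while True:
    action = fake_model(history)                     # 1. model decides
    if action["tool"] == "finish":
        answer = action["args"]                      # 3. ...or stops and answers
        break
    result = tools[action["tool"]](action["args"])   # 2. we execute the tool
    history.append(result)                           # observation feeds back in
```

A chatbot is this loop with zero tool calls: one model invocation, one reply, done. Autonomy comes from letting the loop run.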
Navigating the model landscape — and choosing the right engine
You rent access. They control the model.
You can download, run, and modify them.
You rent access. They control the agent.
Claude Code — Anthropic
A coding assistant that runs on your computer. It can read your files, write code, and run commands for you. Your data stays on your machine.
Codex — OpenAI
A coding assistant that runs in the cloud. You give it a task and it works on it in the background, then delivers the result.
You can download, modify, and run them yourself.
OpenClaw — formerly Clawdbot
A personal AI assistant that connects to your messaging apps (WhatsApp, Slack, etc.) and can take actions for you — send emails, manage your calendar, browse the web, and more. Went viral in Jan 2026 (100K+ users in one week).
Heads up: OpenClaw had a serious security flaw in its first week that could let hackers take control. Open-source tools move fast — but you're responsible for vetting them.
| Dimension | Closed-Source | Open-Source (self-hosted) | Open-Source (hosted) |
|---|---|---|---|
| Performance | Frontier capability | Effectively caught up (2026) | Same models, same quality |
| Cost Model | Pay per token (API) | Infrastructure cost; zero per-token | Cheaper API; often 50-90% less |
| Data Privacy | Your data hits their servers | Runs on your own infra | Data hits provider servers |
| Customization | Limited to prompts + some fine-tuning | Full control: fine-tune, modify, merge | Some fine-tuning supported |
| Setup Effort | API key and go | Need GPU infra | API key and go |
| Vendor Risk | They can change models, pricing | Weights are yours forever | Can switch providers easily |
| Providers | OpenAI, Anthropic, Google | Your own GPUs / cloud | Together AI, Fireworks, Groq |
Simplified PyTorch — open source and closed source models all share this same structure.
# 1. Convert token IDs → vectors
class Embedding(nn.Module):
    def forward(self, token_ids):
        return self.embed(token_ids) \
             + self.position(positions)

# 2. One transformer block
class TransformerBlock(nn.Module):
    def forward(self, x):
        attn = self.attention(x)   # Attend
        x = x + attn               # Residual
        x = x + self.ffn(x)        # Transform
        return x

# 3. The full model
class LLM(nn.Module):
    def __init__(self):
        self.embed = Embedding()
        self.blocks = [TransformerBlock() for _ in range(96)]
        self.output = Linear(hidden, vocab)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        logits = self.output(x)
        return softmax(logits / T)
# Repeat trillions of times
for batch in training_data:
    # Forward pass — make predictions
    predictions = model(batch.input)

    # How wrong were we?
    loss = cross_entropy(
        predictions,
        batch.next_token
    )

    # Compute gradients
    # (how much did each weight
    #  contribute to the error?)
    loss.backward()

    # Update all weights by a tiny
    # amount to reduce the error
    # ← THIS IS LEARNING
    optimizer.step()
Bigger isn't always better. Match the model to the task.
~1B–8B parameters
Fast, cheap, good for simple tasks
GPT-5 Mini, Haiku 4.5, Llama 8B
~30B–100B parameters
Balanced performance and cost
Sonnet 4.5, Gemini Flash, Mistral 3
Hundreds of billions+
Maximum capability, highest cost
GPT-5.3, Opus 4.6, Gemini 3 Pro
First, install Node.js (needed for npm):
Mac: Install Homebrew first if you don't have it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then run: brew install node
Windows: Go to nodejs.org, download the LTS installer, and run it. Accept defaults.
Then install an AI coding agent:
AI tip: Use your favorite LLM to help you debug the installation process!
Transformers:
Context & Hallucination:
RLHF & Training:
AI tip: Try NotebookLM to break down long papers!
Think of a task from your own work that involves text:
We'll build a prompt system for it in Session 2.
Can you answer these from today's session?
What is the single core operation that an LLM performs?
Why do LLMs hallucinate? Explain in one sentence.
Name the three stages of the training pipeline and what each one teaches.
Does RAG eliminate hallucination? Why or why not?
What's one advantage of open-source models over closed-source models?
What makes an AI agent different from a standard chatbot?
Questions, ideas, and "wait, what?" moments welcome.
Harper Carroll AI · AI User to AI Builder · Session 1: The Engine · Cohort 1
Additional detail on transformer internals — for the curious.
Two steps, repeated dozens of times. Each layer builds deeper understanding.
Early layers (1–20)
Syntax, grammar, word relationships
Which adjective modifies which noun?
Middle layers (20–60)
Semantic meaning, idioms, logical patterns
Understanding metaphor, cause and effect
Deep layers (60–96+)
Abstract reasoning, world knowledge, intent
What does the user actually want?
Three concepts you need before we discuss temperature.
The raw scores from the final layer of the neural network — one number per possible next token. Not probabilities yet. Can be negative.
The function that converts logits into probabilities. Makes them all positive and sum to 100%. Bigger logits get exponentially bigger probabilities.
The full set of probabilities across all possible next tokens.
Sharp
One token dominates
Flat
Probability spread evenly
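All three concepts fit in a few lines. The logit values are made up; the softmax function is the real one.

```python
import math

logits = [3.2, 1.1, -0.5]   # raw scores, one per candidate token — can be negative

def softmax(xs):
    """Convert logits into probabilities: all positive, summing to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)   # the probability distribution over next tokens
# Bigger logits get exponentially bigger probabilities, so the top
# logit dominates — this is a "sharp" distribution.
```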
Each token asks: "Which other tokens matter to me?"
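That question is literally a dot product. A sketch of scaled dot-product attention for one token, with tiny made-up 2-dimensional vectors (real models use hundreds of dimensions and learn these vectors during training):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# "it" asks the question (its query vector); every other token offers
# a key vector to be compared against. Values are hypothetical.
query = [1.0, 0.0]                                        # vector for "it"
keys = {"cat": [0.9, 0.1], "mat": [0.2, 0.8], "because": [0.1, 0.1]}

# score = query · key, scaled by sqrt of the vector dimension
scores = [dot(query, k) / math.sqrt(len(query)) for k in keys.values()]
weights = softmax(scores)
# The largest weight lands on "cat": "it" attends most to "cat".
```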
The final step: turn enriched representations back into words.
"The cat sat on the mat because it was"
Attention + FFN ×96
→ 100K logits
→ probabilities
"tired"
A 500-word response means the model runs this full pipeline ~700 times.
Each pass: all 96 layers, all attention heads, all feed-forward networks. Billions of calculations per token.
Same logits, same model, same prompt — temperature just reshapes the curve. Low T = sharp (deterministic). High T = flat (creative). This is why the same prompt gives different answers: it's sampling differently from a reshaped distribution.
Deterministic
Prompt: "The cat sat on the ___"
Spike. Always picks "mat."
Best for: code, facts, analysis
Balanced
Prompt: "The cat sat on the ___"
Softened. Top token likely, but others have a chance.
Best for: conversation, writing
Creative
Prompt: "The cat sat on the ___"
Flat. Even unlikely tokens get real chances.
Best for: brainstorming, fiction
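The reshaping is a one-line change to softmax: divide the logits by T before exponentiating. The logit values below are invented stand-ins for "mat", "floor", "chair":

```python
import math

def softmax_t(logits, T):
    """Softmax with temperature: divide logits by T before exponentiating."""
    exps = [math.exp(x / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [4.0, 2.0, 1.0]          # same logits in all three cases

sharp = softmax_t(logits, 0.2)    # low T: top token takes nearly all the mass
balanced = softmax_t(logits, 1.0) # T=1: the unmodified distribution
flat = softmax_t(logits, 1.5)     # high T: probability spreads out
# sharp[0] > balanced[0] > flat[0] — the curve flattens as T rises
```

Sampling from `sharp` almost always picks the top token; sampling from `flat` gives even unlikely tokens a real chance — same model, same logits, different behavior.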