Session 1: The Engine


Session 1

The Engine

First Principles of Large Language Models

Harper Carroll AI  ·  AI User to AI Builder  ·  Cohort 1

The single most important shift you'll make today:

From: Treat AI like a person

To: Treat AI like a probabilistic machine

Every frustration you've had with AI comes from the gap
between these two mental models.

AI is Math

It's not magic. It's math.

Today's Roadmap

Four things you'll understand by the end

1. Transformers

How LLMs process language and why they hallucinate

2. Training Pipeline

Pre-training, fine-tuning, and RLHF

3. AI Agents

From chatbots to autonomous systems that take action

4. Open vs. Closed-Source

Navigating the model landscape and choosing the right engine

Part 1

Understanding
Transformers

How LLMs think — and why they hallucinate

What is a Large Language Model (LLM)?

A next-token prediction engine trained on massive text — not a reasoning mind.

What it actually is

  • A neural network trained on trillions of words: books, websites, code, conversations, papers
  • One job: predict the most likely next token. Given everything before, what probably comes next?
  • "Large" = billions of learned parameters: internal dials tuned during training to capture patterns

What it is NOT (at its core)

  • Not a search engine: it doesn't look things up — though RAG and tool use can add retrieval
  • Not a database: it doesn't store or retrieve facts — though it can be connected to databases via agents
  • Not a reasoning mind: it doesn't "think" — though reasoning models add "thinking" tokens that simulate deliberation

These capabilities can be layered ON TOP of the core prediction engine — we'll cover RAG, agents, and reasoning models later.

The key insight: Every answer an LLM gives is a statistically likely continuation of your prompt — not a looked-up fact. This single idea explains almost everything about how these models behave.

It all starts with tokens

From text to the numbers a transformer actually processes.

Step 1 — Tokenize

"I love building apps"
I love build ing apps

Split text into chunks. Note: "building" → two tokens. 1 token ≈ 4 characters.

Step 2 — Token IDs

I love build ing apps
40 8520 5765 287 6394

Each token maps to an integer via a fixed vocabulary table. Just a lookup.

Step 3 — Embeddings

40 8520 ...
[0.21, −0.87, …] [0.93, 0.12, …]

Each ID maps to a high-dimensional vector.

The punchline: The transformer never sees text. It operates on embedding vectors — meaning encoded as geometry.
[Embedding space diagram: similar meanings cluster together (king/queen/prince, dog/cat/puppy, happy/joy/glad)]
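The three steps can be sketched in a few lines. This is a toy illustration: the five-entry vocabulary, the IDs, and the embedding vectors are all made up (real tokenizers use byte-pair encoding over a vocabulary of ~100K entries, and embeddings are learned, not hand-written).

```python
# Toy vocabulary mapping tokens to IDs (values invented for illustration).
VOCAB = {"I": 40, "love": 8520, "build": 5765, "ing": 287, "apps": 6394}

# Step 1 — Tokenize: split text into known chunks.
def tokenize(text):
    tokens = []
    for word in text.split():
        if word in VOCAB:
            tokens.append(word)
        else:  # fall back to splitting off a known prefix, e.g. "building" -> "build" + "ing"
            for i in range(len(word), 0, -1):
                if word[:i] in VOCAB:
                    tokens.append(word[:i])
                    tokens.append(word[i:])
                    break
    return tokens

# Step 2 — Token IDs: a fixed lookup table. Just a dictionary access.
def to_ids(tokens):
    return [VOCAB[t] for t in tokens]

# Step 3 — Embeddings: each ID indexes a row of a learned matrix (toy 2-D rows here).
EMBEDDINGS = {40: [0.21, -0.87], 8520: [0.93, 0.12], 5765: [0.40, 0.55],
              287: [-0.10, 0.77], 6394: [0.66, -0.31]}

tokens = tokenize("I love building apps")
ids = to_ids(tokens)
vectors = [EMBEDDINGS[i] for i in ids]
print(tokens)  # ['I', 'love', 'build', 'ing', 'apps']
print(ids)     # [40, 8520, 5765, 287, 6394]
```

From here on, the model only ever sees `vectors` — the text itself never enters the transformer.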

It's all next-token prediction

The model computes probabilities for every possible next token.

Prompt: "The capital of France is ___"

Paris
94.2%
a
2.8%
known
1.1%
not
0.3%
The punchline: LLMs are the world's most sophisticated autocomplete. They don't "know" anything. They predict what text is likely to come next based on patterns in their training data.
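A toy way to see "prediction = patterns in training data" is to count what follows each word in a tiny corpus. The three-sentence corpus below is made up, and a real LLM replaces raw counting with a neural network conditioned on the whole context — but the objective is the same.

```python
from collections import Counter

# A tiny made-up "training corpus". A real model sees trillions of tokens.
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is a city",
]

# Count what follows each word.
follows = {}
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows.setdefault(prev, Counter())[nxt] += 1

# "Prediction" = the observed frequency distribution over next words.
def next_token_probs(prev):
    counts = follows[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_token_probs("is"))  # paris ≈ 0.67, a ≈ 0.33
```

"paris" wins not because the model knows geography, but because that continuation dominated the training data — exactly the behavior described above.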

The Context Window

The model's working memory — everything it can "see" at once.

GPT-5.2

400K

tokens (~1,000 pages)

Claude Opus 4.6

1M

tokens (beta; 200K standard)

Gemini 3 Pro

1M

tokens (~2,500 pages)

The "Lost in the Middle" problem (Liu et al., Stanford, 2024)
▲ HIGH ATTENTION
You are an expert travel planner. I need a 10-day itinerary for Japan covering Tokyo, Kyoto, and Osaka.
The trip starts on March 15. We are two adults traveling on a moderate budget of about $200/day.
We love street food, temples, and nature walks. We prefer trains over buses for intercity travel.
Day 1 should include Shinjuku Gyoen and Meiji Shrine with dinner in Omoide Yokocho for yakitori.
For Day 2, focus on Akihabara in the morning and Asakusa with Senso-ji Temple in the afternoon.
On Day 3, take the Shinkansen to Kyoto. Hotel should be near Kyoto Station for easy access to buses.
Day 4 should cover Fushimi Inari early morning and Kinkaku-ji in the afternoon. Budget about ¥3000 for lunch.
Day 5 could be Arashiyama bamboo grove and monkey park. Consider renting bikes for the area near the river.
IMPORTANT: My partner is allergic to shellfish. Please make sure all restaurant recommendations account for this.
Day 6 in Nara to see the deer park and Todai-ji. This can be a half-day trip returning to Kyoto by evening.
Day 7 is the transfer to Osaka via local train. Check into hotel near Namba for the street food scene.
▼ LOW ATTENTION — information here gets overlooked ▼
Day 8 should feature Osaka Castle in the morning and Dotonbori in the evening for takoyaki and okonomiyaki.
Day 9 is a flex day. Options: day trip to Himeji Castle or explore Shinsekai and Tsutenkaku Tower area.
Day 10 is departure from KIX. Allow 2 hours for airport transfer. Morning could include last-minute shopping.
Keep a small budget reserve for souvenirs — about ¥10,000. Pack light layers for March weather variability.
We'll need pocket wifi or eSIM. Prefer an eSIM that covers the full 10 days with unlimited data if possible.
Please output the itinerary as a day-by-day table with columns for Date, Location, Morning, Afternoon, Evening.
Include estimated costs per day in USD. Flag any days where we might exceed the $200 budget.
Format the response in markdown. Use bold for must-see attractions and italic for optional activities.
▲ HIGH ATTENTION

Models pay the most attention to the beginning and end of the context window.

Information buried in the middle gets overlooked — even critical details like "my partner is allergic to shellfish."

Builder's rule: Put your most important context first and last. Never bury key instructions in the middle of a long prompt.

Neural Networks in 60 Seconds

The foundation that transformers are built on.

What is a Neural Network?

Layers of connected "neurons" that transform inputs into outputs.

Input (tokens) → Hidden 1 (patterns) → Hidden 2 (concepts) → Output (meaning)
  • Layers of connected "neurons" that transform inputs step by step
  • Each connection has a learned weight: a number that controls how much one neuron influences the next. "Billions of parameters" = billions of these weights
  • Deeper layers detect more abstract patterns: tokens → word patterns → phrases → meaning
The analogy: Like a factory assembly line — raw materials go in, each station transforms them, and a finished product comes out the other end.
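The assembly line in minimal code: a two-layer network where each "station" is a weighted sum followed by a nonlinearity. The weights below are made-up constants; in a real network they are learned during training.

```python
# Made-up weights for a tiny 2-input, 2-hidden-neuron, 1-output network.
W1 = [[0.5, -0.2], [0.1, 0.8]]   # input layer -> hidden layer
W2 = [0.7, -0.3]                 # hidden layer -> single output

def relu(x):
    # The nonlinearity: pass positives through, zero out negatives.
    return max(0.0, x)

def forward(inputs):
    # Each hidden neuron: weighted sum of the inputs, then the nonlinearity.
    hidden = [relu(sum(w * x for w, x in zip(row, inputs))) for row in W1]
    # Output neuron: weighted sum of the hidden activations.
    return sum(w * h for w, h in zip(W2, hidden))

print(forward([1.0, 2.0]))  # ≈ -0.44
```

Training is just the process of nudging `W1` and `W2` until `forward` produces useful outputs; "billions of parameters" means billions of these numbers.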

The "black box" problem

We can see inputs and outputs — but not what happens in between.

Input

"The capital of France is"

?

Billions of weights
doing... something

Output

"Paris"

What we know

  • Each weight is just a number. But we don't know what concept it represents
  • The network works — often remarkably well. But we can't fully explain why it gives a specific answer

Why it matters

  • Hard to predict when a model will fail, or why it made a particular mistake
  • Active research area: mechanistic interpretability. Anthropic and others are working to reverse-engineer what's inside

Transformers Overview

A visual primer before we dive in.

From Neural Nets to Transformers

"Attention Is All You Need" — Vaswani et al., Google Brain, 2017. The paper that launched GPT, Claude, Gemini, and the entire modern AI era.

Traditional Neural Net

The cat sat ... One word at a time → Slow. Forgets long sequences.

Transformer

The cat sat ... All words at once ⚡ Every word attends to every other. Fast. Handles long context.
  • Traditional nets process words sequentially: slow, and they forget the start of long sequences
  • Transformers process all words in parallel: massively faster and better at long context
  • The secret weapon: self-attention. Every word can look at every other word

Simplified transformer block:

Input Text
Embedding
Self-Attention
+
Feed-Forward
repeat N times
Output Prediction

What a Transformer actually does

The core loop

  • Takes in a sequence of tokens
  • Every token looks at every other token: this is the "attention" mechanism
  • Computes relevance scores between all tokens: "How much should I care about this word to predict the next one?"
  • Multi-head attention runs many attention operations in parallel: each "head" captures a different relationship type — grammar, coreference, meaning
  • Outputs a probability distribution over ALL possible next tokens

Attention in action

"The cat sat on the mat because it was tired"

What does "it" attend to?

cat
0.82
mat
0.09
sat
0.04
the
0.02

The model learns that "it" refers to "cat" — not through grammar rules, but through statistical patterns.
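A sketch of where those attention weights come from: raw relevance scores pushed through softmax so they form a weighted mix. The scores below are made-up stand-ins for the query-key dot products a real model computes.

```python
import math

# Made-up relevance scores for "it" against earlier tokens
# (a real model computes these as dot products of learned query/key vectors).
scores = {"cat": 4.0, "mat": 1.8, "sat": 1.0, "the": 0.3}

def softmax(values):
    # Exponentiate and normalize so the scores become weights summing to 1.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

weights = dict(zip(scores, softmax(list(scores.values()))))
print(weights)  # "cat" gets ≈ 0.84 of the attention
```

The representation of "it" is then rebuilt as this weighted mix of the other tokens' vectors — so most of its new meaning flows in from "cat".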

Why hallucinations happen

They're not bugs — they're a fundamental property of the architecture.

The model will confidently:

  • Invent citations that don't exist: the pattern of a citation is correct; the specific one is fabricated
  • State false facts with certainty: it has no mechanism to distinguish "true" from "sounds true"
  • Create plausible but wrong code: syntactically valid, logically flawed
  • Fabricate data, names, dates: if the pattern fits, it generates

Why this happens:

The model doesn't directly retrieve information.

It generates statistically likely text.

It has no internal fact-checker.

It only knows what sounds right based on patterns in training data.

Builder's takeaway: Always design verification layers into anything you build with AI.

See Appendix for more details on Transformers

Reducing hallucinations with RAG

Retrieval-Augmented Generation: before the LLM answers, a retriever searches your documents and injects the relevant results into the prompt.

1. USER QUERY: "What were our Q3 earnings?" (not in training data)
2. VECTORIZE: an embedding model turns the query into a vector, e.g. [0.21, 0.87, −0.14…]
3. VECTOR DATABASE: search over PDFs, databases, company docs
4. RELEVANT CHUNKS: "Q3 revenue was $4.7B, up 12%..." / "Net income grew to $890M in Q3..." / "Operating margin expanded to 19%..." / "Guidance for Q4 remains strong..."
5. AUGMENTED PROMPT: Context (the retrieved chunks) + User Question: "What were our Q3 earnings?" → LLM (still predicting the most probable next token)
6. RESPONSE: "Q3 earnings were $890M, up 12% from last year." (more accurate, not guaranteed)

Why RAG helps

  • Model does reading comprehension on real documents, not generation from memory
  • Answers are grounded in your actual data — PDFs, databases, company docs
  • Dramatically reduces hallucination rate in practice

Why hallucination still happens

  • The LLM step still outputs probabilities over tokens — it remains a prediction machine
  • Can misinterpret retrieved docs or blend them with training data
  • Can generate plausible inferences the source doesn't actually support
Key insight: RAG changes what the model reads, not how it generates. The underlying mechanism is unchanged — you always need verification layers.
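The retrieve-then-augment loop can be sketched in a few lines. A simple word-overlap score stands in for the embedding-similarity search a real vector database performs; the chunks reuse the made-up examples from the diagram above.

```python
import string

def words(text):
    # Normalize: lowercase, strip punctuation, split into a set of words.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def relevance(chunk, query):
    # Stand-in for vector similarity: count shared words.
    return len(words(chunk) & words(query))

chunks = [
    "Q3 revenue was $4.7B, up 12% year over year",
    "The office cafeteria menu changes weekly",
    "Net income grew to $890M in Q3",
]

query = "What were our Q3 earnings?"

# Steps 1-4: retrieve the most relevant chunks for the query...
ranked = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)
context = "\n".join(ranked[:2])

# Step 5: ...and inject them into the prompt. The LLM step is unchanged:
# it still predicts the most probable next token, now with the facts in view.
augmented_prompt = f"Context:\n{context}\n\nUser question: {query}"
print(augmented_prompt)
```

The irrelevant cafeteria chunk gets filtered out before the model ever sees the prompt — that filtering, not any change to the model, is what RAG adds.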
Part 2

The Training
Pipeline

Pre-training, Fine-tuning, and RLHF
(Reinforcement Learning from Human Feedback)

Three stages build a model

Pre-Training

Learn language patterns
from massive text data

Fine-Tuning

Learn to follow
instructions & tasks

RLHF

Learn human preferences
& safety alignment

ChatGPT / Claude

The product
you interact with

The analogy:   School → Job training → Social skills → Ready for work
Stage 1

Pre-Training

Read the internet. Learn patterns. Predict the next word.

What happens

  • Model reads trillions of tokens of text: books, Wikipedia, code, forums, papers, news
  • Single objective: predict the next token. Given everything before, what comes next?
  • Learns grammar, facts, reasoning, code — all from this one task
  • Produces a "base model": can continue text, but can't hold a conversation

By the numbers

15T+

tokens of training data

$100M+

estimated compute cost

Months

of training on thousands of GPUs

Key insight: The training data has a cutoff date. The model literally doesn't know about events after that date.
Stage 2

Fine-Tuning

Teaching the base model to follow instructions and be useful.

The problem it solves

Base model behavior:

User: What is the capital of France?
Model: What is the capital of Germany?
What is the capital of Spain?
What is the capital of Italy?...

It continues the pattern, not the conversation.

After fine-tuning:

User: What is the capital of France?
Model: The capital of France is Paris.

It answers the question helpfully.

How it works

  • Train on curated (instruction, response) pairs: thousands of examples of good Q&A behavior
  • Model learns the "assistant" role: follow instructions, stay on topic, be helpful
  • Much smaller dataset than pre-training: quality over quantity
  • Can be domain-specific: medical, legal, finance, customer service
For builders: Fine-tuning is how you customize models for YOUR specific use cases. We'll cover this in later sessions.
Stage 3

Reinforcement Learning from Human Feedback

RLHF — teaching the model what humans actually want.

The process

1. Generate

Model produces multiple responses to the same prompt

2. Rank

Human reviewers rank responses from best to worst

3. Learn

A reward model learns to predict human preferences

4. Optimize

Main model is tuned to maximize the reward score

What it teaches

  • Be helpful and follow instructions
  • Admit when uncertain: "I'm not sure, but..." vs. confidently wrong
  • Decline dangerous or harmful requests
  • Match human communication style
Why models feel different: Claude, ChatGPT, and Gemini have different "personalities" because each company has different RLHF priorities and training approaches.
RLHF in Practice

What RLHF looks like

Human reviewers rank model outputs — the model learns from their preferences.

Prompt: "How do I pick a lock?"

Response A

"Sure! First, get a tension wrench and a rake pick. Insert the wrench into the bottom of the keyhole..."

Rank: 3rd (last)

Response B

"I can't help with that. Lock picking could be used for illegal purposes and I'm not able to provide instructions."

Rank: 2nd

Response C

"Lock picking is a legitimate skill used by locksmiths and security professionals. If you're locked out, here are your options..."

Rank: 1st ★

The reward model learns: Helpful + responsible > Refusal > Helpful + irresponsible. Thousands of rankings like this shape the model's "personality."
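How thousands of rankings become a training signal: a common formulation (Bradley-Terry) assigns each response a score and trains the reward model so the human-preferred response gets the higher score. The scores below are made up for illustration.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical scores the reward model currently assigns to the three responses.
rewards = {
    "C_helpful_responsible": 2.1,
    "B_refusal": 0.4,
    "A_helpful_irresponsible": -1.3,
}

# Probability the model assigns to the human's judgment "C is better than B".
p_c_over_b = sigmoid(rewards["C_helpful_responsible"] - rewards["B_refusal"])

# Training minimizes -log(p) for every human-ranked pair, nudging the scores
# until they reproduce the reviewers' orderings.
loss = -math.log(p_c_over_b)
print(p_c_over_b, loss)
```

Here the model already agrees with the reviewers (C ≻ B gets probability ≈ 0.85), so the loss is small; a pair it ranks backwards would produce a large loss and a big correction.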

This is why

Claude ≠ ChatGPT

Different reviewers, different rankings

RLHF Side Effect

The Sycophancy Problem

RLHF teaches models to agree with you — even when you're wrong.

What happens

RLHF rewards responses humans prefer. Humans prefer responses that agree with them. So the model learns:

"Agreement = reward"

  • Validates wrong answers instead of correcting them
  • Gets worse with scale — smarter models flatter better. Inverse scaling: larger models are more sycophantic (Sharma et al., Anthropic)
  • OpenAI rolled back a GPT-4o update (April 2025) for excessive agreeableness

How to defend against it

  • Be skeptical when the model agrees: especially on opinions or decisions — enthusiastic agreement is a red flag
  • Ask it to argue the opposite: "Now tell me why this is a bad idea" often reveals the real answer
  • Use system prompts to counteract: "Prioritize accuracy over agreeableness. Push back when I'm wrong."
  • Never tell the model what you think: "When Truth is Overridden" (Wang et al., 2025) found models are significantly more likely to agree with your stated beliefs
Builder's takeaway: The model that feels most helpful might be the least trustworthy. Design for truth, not comfort.

The full picture

Pre-Training

📚

School

Broad knowledge from reading the internet

Cost: $10M–$100M+
Data: trillions of tokens
Time: months

Fine-Tuning

🎯

Job Training

Learn to follow instructions and be useful

Cost: $10K–$1M
Data: thousands of examples
Time: days

RLHF

🤝

Social Skills

Learn what humans prefer and how to be safe

Cost: $1M–$10M
Data: human rankings
Time: weeks

The shortcut: model distillation

Why open-source caught up so fast — and why small models keep getting better.

Teacher (Claude, GPT, Gemini): pre-trained + fine-tuned + RLHF'd. $100M+ to build.
    ↓ millions of prompt → response pairs used as training data (includes all the behaviors from fine-tuning + RLHF)
Student (smaller, cheaper model): learns by imitating the teacher. $1M or less to build.
    Skipped: fine-tuning and RLHF (already baked into the teacher's outputs)

The trade-off

Distilled models are compressed knowledge — like great notes vs. actually taking the class. Not quite as good as the teacher, but dramatically cheaper to build and run.

Why this matters

This is why the model landscape changes so fast. A new frontier model drops, and within weeks there are dozens of distilled variants optimized for different use cases.
  • Open-source explosionDeepSeek-R1-Distill, Alpaca, Vicuna, Orca, Phi — all distilled from frontier models at a fraction of the cost
  • 100× cheaperSkip the $100M pre-training and the army of human reviewers
  • Small models, big capabilityA 7B parameter model distilled from a frontier model can outperform a 70B model trained from scratch
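What "learning by imitating the teacher" means mechanically: the student is trained to match the teacher's full next-token distribution, not just a single "correct" token. A sketch with made-up distributions over four candidate tokens:

```python
import math

# Made-up probability distributions over 4 candidate next tokens.
teacher = [0.70, 0.20, 0.08, 0.02]   # what the frontier model predicts
student = [0.40, 0.30, 0.20, 0.10]   # what the small model currently predicts

# Cross-entropy of the student against the teacher's "soft labels" --
# the quantity gradient descent drives down during distillation.
loss = -sum(t * math.log(s) for t, s in zip(teacher, student))

# Lower bound: a student that matches the teacher exactly pays only
# the teacher's own entropy.
floor = -sum(t * math.log(t) for t in teacher)
print(loss, floor)
```

The gap between `loss` (≈ 1.06 here) and `floor` (≈ 0.85) is what training shrinks; soft labels carry far more signal per example than a single right answer, which is part of why distillation is so cheap.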

Case study: DeepSeek R1

When distillation, export controls, and geopolitics collided.

Jan 20, 2025 — The claim

DeepSeek releases R1, claiming performance comparable to OpenAI's best models — trained for just $5.6 million on older chips.

Jan 27, 2025 — The plummet

NVIDIA loses $589 billion in market cap in a single day — the largest one-day loss in stock market history.

If you don't need expensive chips, you don't need NVIDIA.

Feb 2025 — The accusations

OpenAI tells Congress it detected DeepSeek employees using obfuscated methods to extract outputs from ChatGPT for training.

White House AI czar David Sacks: "substantial evidence" of distillation.

Why this matters for builders: Distillation sits at the intersection of technology, intellectual property, and geopolitics. The legal and ethical boundaries are still being drawn — in real time.

Allegation 1: Illegal distillation

OpenAI accused DeepSeek of using their API outputs to train R1 — violating OpenAI's terms of service. Evidence presented to Congress but not made public.

Sources: Bloomberg, Feb 2026; Fortune, Jan 2025

Allegation 2: Chip smuggling

DeepSeek allegedly acquired thousands of banned NVIDIA Blackwell chips despite US export controls. DOJ separately busted a $160M smuggling ring moving H100/H200 chips to China.

NVIDIA called the DeepSeek-specific claims "far-fetched." Sources: CNBC, Dec 2025; WinBuzzer, Dec 2025

Part 3

AI Agents

From chatbots to autonomous systems that take action

What makes something an agent?

Agents don't just respond — they take action in a loop.

Chatbot

You ask a question
Model generates response
Waits for next input

One turn at a time. No autonomy.

Agent

You give a task
Think & plan
Use tools (search, code, files)
Observe result
loop until done

We'll go deeper on agents in Session 3

How they connect to tools, how to control them, and how to build reliable agent workflows.

For now, remember: Chatbots answer questions. Agents complete tasks. That distinction changes how you think about building with AI.
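The task → think → act → observe loop can be shown in miniature. Everything here is a hard-coded stand-in: `toy_model` plays the role of the LLM choosing the next action, and the "tools" are fake — a real agent calls an actual model and real search/code/file tools.

```python
# Stand-in for the LLM deciding the next action from the task and what
# it has observed so far (a real agent prompts a model for this).
def toy_model(task, observations):
    if not observations:
        return ("search", task)                 # step 1: gather information
    if len(observations) == 1:
        return ("summarize", observations[0])   # step 2: process it
    return ("done", observations[-1])           # step 3: finish

def run_tool(name, arg):
    # Hypothetical tools; real agents call search APIs, run code, edit files.
    tools = {
        "search": lambda q: f"results for: {q}",
        "summarize": lambda text: f"summary of ({text})",
    }
    return tools[name](arg)

def agent(task, max_steps=5):
    observations = []
    for _ in range(max_steps):                  # loop until done, with a step cap
        action, arg = toy_model(task, observations)
        if action == "done":
            return arg
        observations.append(run_tool(action, arg))
    return "gave up"

print(agent("best ramen in Tokyo"))
```

The structural difference from a chatbot is visible in the code: the `for` loop. A chatbot returns after one model call; an agent keeps choosing actions, observing results, and feeding them back in until the task is done.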
Part 4

Open vs.
Closed-Source

Navigating the model landscape — and choosing the right engine

The model landscape in 2025-2026

Closed-Source

You rent access. They control the model.

GPT-5.3 / o3 / o4-mini
OpenAI
Claude Opus 4.6 / Sonnet 4.5
Anthropic
Gemini 3 Pro
Google

Open-Source / Open-Weight

You can download, run, and modify them.

Llama 4
Meta
Mistral 3
Mistral AI
DeepSeek-V3.2
DeepSeek

The agent landscape (early 2026)

Closed-Source Agents

You rent access. They control the agent.

Claude Code — Anthropic

A coding assistant that runs on your computer. It can read your files, write code, and run commands for you. Your data stays on your machine.

Codex — OpenAI

A coding assistant that runs in the cloud. You give it a task and it works on it in the background, then delivers the result.

Open-Source Agents

You can download, modify, and run them yourself.

OpenClaw — formerly Clawdbot

A personal AI assistant that connects to your messaging apps (WhatsApp, Slack, etc.) and can take actions for you — send emails, manage your calendar, browse the web, and more. Went viral in Jan 2026 (100K+ users in one week).

Heads up: OpenClaw had a serious security flaw in its first week that could let hackers take control. Open-source tools move fast — but you're responsible for vetting them.

Where this is going: Agents are becoming the main way people use AI — not just chatting, but getting things done. We'll build with Claude Code in Session 3.

The tradeoffs

| Dimension | Closed-Source | Open-Source (self-hosted) | Open-Source (hosted) |
| --- | --- | --- | --- |
| Performance | Frontier capability | Effectively caught up (2026) | Same models, same quality |
| Cost Model | Pay per token (API) | Infrastructure cost; zero per-token | Cheaper API; often 50-90% less |
| Data Privacy | Your data hits their servers | Runs on your own infra | Data hits provider servers |
| Customization | Limited to prompts + some fine-tuning | Full control: fine-tune, modify, merge | Some fine-tuning supported |
| Setup Effort | API key and go | Need GPU infra | API key and go |
| Vendor Risk | They can change models, pricing | Weights are yours forever | Can switch providers easily |
| Providers | OpenAI, Anthropic, Google | Your own GPUs / cloud | Together AI, Fireworks, Groq |

What an LLM looks like in code

Simplified PyTorch — open-source and closed-source models share this same structure.

Model Architecture

# 1. Convert token IDs → vectors
class Embedding(nn.Module):
  def forward(self, token_ids):
    return self.embed(token_ids) \
         + self.position(positions)

# 2. One transformer block
class TransformerBlock(nn.Module):
  def forward(self, x):
    attn = self.attention(x)  # Attend
    x = x + attn              # Residual
    x = x + self.ffn(x)      # Transform
    return x

# 3. The full model
class LLM(nn.Module):
  def __init__(self):
    self.embed  = Embedding()
    self.blocks = [TransformerBlock()
                   for _ in range(96)]
    self.output = Linear(hidden, vocab)

  def forward(self, token_ids):
    x = self.embed(token_ids)
    for block in self.blocks:
      x = block(x)
    logits = self.output(x)
    return softmax(logits / T)

Training Loop (Gradient Descent)

# Repeat trillions of times
for batch in training_data:

  # Forward pass — make predictions
  predictions = model(batch.input)

  # How wrong were we?
  loss = cross_entropy(
    predictions,
    batch.next_token
  )

  # Compute gradients
  # (how much did each weight
  #  contribute to the error?)
  loss.backward()

  # Update all weights by a tiny
  # amount to reduce the error
  # ← THIS IS LEARNING
  optimizer.step()
The insight: The core algorithm fits on one slide. The difference between open and closed source isn't the algorithm — it's who has access to the trained weights and the data that shaped them.

A decision framework

Start with closed-source when:

  • You need maximum capability
  • You're prototyping and want to move fast
  • Data privacy isn't a hard constraint
  • You don't want to manage infrastructure
  • You need multimodal (vision, audio, etc.)

Consider open-source when:

  • Data must stay on your infrastructure
  • You need deep customization / fine-tuning
  • Cost at scale is a concern (high volume)
  • You need to run offline or on-device
  • You can't accept vendor lock-in risk
Pro tip: Many production systems use both. Closed-source for hard reasoning tasks; open-source for high-volume, simpler tasks. The right answer is rarely "always one or the other."

Right-sizing your model

Bigger isn't always better. Match the model to the task.

Small Models

~1B–8B parameters

Fast, cheap, good for simple tasks

  • Classification
  • Extraction
  • Simple Q&A

GPT-5 Mini, Haiku 4.5, Llama 8B

Medium Models

~30B–100B parameters

Balanced performance and cost

  • Content generation
  • Code assistance
  • Analysis

Sonnet 4.5, Gemini Flash, Mistral 3

Frontier Models

Hundreds of billions+

Maximum capability, highest cost

  • Complex reasoning
  • Novel problem solving
  • Multi-step planning

GPT-5.3, Opus 4.6, Gemini 3 Pro

Builder's rule: Use the smallest model that can reliably do the job. Sending a classification task to Claude Opus is like hiring a PhD to sort mail.
Key Takeaways

What to remember from today

Homework

Before Session 2

1. Install a tool

First, install Node.js (needed for npm):

Mac: Install Homebrew first if you don't have it:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then run: brew install node

Windows: Go to nodejs.org, download the LTS installer, and run it. Accept defaults.

Then install an AI coding agent:

# Claude Code
npm install -g @anthropic-ai/claude-code

# OpenAI Codex
npm install -g @openai/codex

AI tip: Use your favorite LLM to help you debug the installation process!

2. Additional reading

Transformers:

Context & Hallucination:

RLHF & Training:

AI tip: Try NotebookLM to break down long papers!

3. Bring a use case

Think of a task from your own work that involves text:

  • Writing you do repeatedly
  • Data you analyze or summarize
  • Research or information gathering
  • Customer communications
  • Reports or documentation

We'll build a prompt system for it in Session 2.

AI tip: Ask your favorite LLM to help you find a good candidate!
I want to identify one high-value AI use case from my real work.

Phase 1 (prompt-only first):
- Ask me 8–10 short questions (one at a time) to identify repeatable tasks (writing, analysis, research, customer communication, reporting, documentation, etc.).
- Based only on my answers, give me my top 3 use cases ranked by impact and ease.
- For each use case, include: why it fits, expected time savings, risks, and required inputs.
- Recommend one "starter use case" and write:
  a) a first-draft prompt I can use immediately
  b) a refined prompt template with placeholders

Phase 2 (optional agentic upgrade):
- For the same starter use case, propose how agentic capabilities could be added later.
- Specify what the agent could do, what tools/data access it would need, what approvals/human checkpoints are required, and key safety controls.
- Keep this as an upgrade path; do not assume agentic behavior in the Phase 1 prompt.

Keep outputs practical and specific to my workflow.
Quiz

Test yourself

Can you answer these from today's session?

Q1

What is the single core operation that an LLM performs?

Q2

Why do LLMs hallucinate? Explain in one sentence.

Q3

Name the three stages of the training pipeline and what each one teaches.

Q4

Does RAG eliminate hallucination? Why or why not?

Q5

What's one advantage of open-source models over closed-source models?

Q6

What makes an AI agent different from a standard chatbot?

Thank You

Questions, ideas, and "wait, what?" moments welcome.

Harper Carroll AI  ·  AI User to AI Builder  ·  Session 1: The Engine  ·  Cohort 1

Appendix

Technical
Deep Dives

Additional detail on transformer internals — for the curious.

The transformer block

Two steps, repeated dozens of times. Each layer builds deeper understanding.

Token embeddings in → [Transformer block: Self-Attention (tokens share information with each other) + normalize → Feed-Forward Network (each token "thinks" independently) + normalize] → repeat ×96 layers → rich token representations out

What each layer learns

Early layers (1–20)

Syntax, grammar, word relationships
Which adjective modifies which noun?

Middle layers (20–60)

Semantic meaning, idioms, logical patterns
Understanding metaphor, cause and effect

Deep layers (60–96+)

Abstract reasoning, world knowledge, intent
What does the user actually want?

Analogy: Attention is the conversation step — tokens talk to each other. Feed-forward is the thinking step — each token digests what it heard.

Terms to know

Three concepts you need before we discuss temperature.

Input ("The capital of France") → embeddings → hidden layers (attention, feed-forward) → logits: 4.2, 1.8, 0.3, −0.7, −1.2, …, −3.1 (100K raw scores) → softmax → probs: 62%, 20%, 8%, 4%, 2%, …, 0% (sums to 100%)

Logits

The raw scores from the final layer of the neural network — one number per possible next token. Not probabilities yet. Can be negative.

Softmax

The function that converts logits into probabilities. Makes them all positive and sum to 100%. Bigger logits get exponentially bigger probabilities.

Probability Distribution

The full set of probabilities across all possible next tokens.

Sharp

One token dominates

Flat

Probability spread evenly

Inside self-attention

Each token asks: "Which other tokens matter to me?"

Tokens: "The", "cat", "sat", …, "it". The token "it" asks every other token "how relevant are you to me?" and gets scores (0.05, 0.82, 0.09, 0.04, …), then mixes their vectors by those weights. The result is an enriched "it" that now contains 82% of "cat"'s information plus traces of every other token — "it" now effectively knows it refers to "cat". Not through grammar rules — through learned statistical patterns.
The key insight: Every token checks every other token for relevance. The thicker the connection, the more information flows. This happens simultaneously across multiple "attention heads" — each one learning different relationship types (grammar, meaning, position). The combined result: each token builds a rich understanding of its context.

From layers to output

The final step: turn enriched representations back into words.

Input

"The cat sat on the mat because it was"

96 Layers

Attention + FFN ×96

Map to vocab

→ 100K logits

Softmax

→ probabilities

Sample

"tired"

The autoregressive loop

Step 1: "The cat sat" → "on"
Step 2: "The cat sat on" → "the"
Step 3: "The cat sat on the" → "mat"
Step 4: "The cat sat on the mat" → "because"
...each step runs the FULL pipeline again
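The loop above in code, with a hard-coded `toy_next` standing in for the full 96-layer pipeline (a real model recomputes attention over the whole growing sequence on every step):

```python
# Stand-in for one full forward pass: look at the last token and return
# the "most likely" continuation (hard-coded for this toy example).
def toy_next(tokens):
    continuation = {"sat": "on", "on": "the", "the": "mat", "mat": "because"}
    return continuation.get(tokens[-1], "<end>")

tokens = ["The", "cat", "sat"]
while True:
    nxt = toy_next(tokens)   # the FULL pipeline runs again every iteration
    if nxt == "<end>":
        break
    tokens.append(nxt)       # the output is appended and fed back in

print(" ".join(tokens))      # The cat sat on the mat because
```

Each appended token triggers another complete pass — which is exactly why generation cost scales with output length.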

The cost of generation

A 500-word response means the model runs this full pipeline ~700 times.

Each pass: all 96 layers, all attention heads, all feed-forward networks. Billions of calculations per token.

This is why: AI costs money per token, longer responses cost more, and output tokens are more expensive than input tokens.

Temperature reshapes the distribution

Same logits, same model, same prompt — temperature just reshapes the curve. Low T = sharp (deterministic). High T = flat (creative). This is why the same prompt gives different answers: it's sampling differently from a reshaped distribution.

P(token) = softmax( logit / T ) OpenAI/Gemini: 0–2  |  Claude: 0–1
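The formula in code, with made-up logits for the four candidate tokens from the panels below:

```python
import math

# Invented logits for the candidate next tokens.
logits = {"mat": 3.0, "floor": 1.9, "couch": 1.4, "moon": 0.2}

def softmax_with_temperature(logits, T):
    # P(token) = softmax(logit / T)
    scaled = {tok: v / T for tok, v in logits.items()}
    m = max(scaled.values())                       # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

for T in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(T, {tok: round(p, 3) for tok, p in probs.items()})
```

Same logits every time; only T changes. At T = 0.1 "mat" takes essentially all the probability mass (≈ 1.0); at T = 1.5 it drops to roughly half, and even "moon" gets a real chance.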

T → 0

Deterministic

Prompt: "The cat sat on the ___"

mat
97%
floor
couch
moon

Spike. Always picks "mat."

Best for: code, facts, analysis

T = 0.7

Balanced

Prompt: "The cat sat on the ___"

mat
62%
floor
20%
couch
12%
moon

Softened. Top token likely, but others have a chance.

Best for: conversation, writing

T = 1.5

Creative

Prompt: "The cat sat on the ___"

mat
30%
floor
24%
couch
22%
moon
18%

Flat. Even unlikely tokens get real chances.

Best for: brainstorming, fiction