Session 1: The Engine


Session 1

The Engine

First Principles of Large Language Models

Harper Carroll AI  ·  AI User to AI Builder  ·  Cohort 1

The single most important shift you'll make today:

From: Treat AI like a person

To: Treat AI like a probabilistic machine

Every frustration you've had with AI comes from the gap
between these two mental models.

AI is Math

It's not magic. It's math.

Today's Roadmap

Four things you'll understand by the end

1. Transformers

How LLMs process language and why they hallucinate

2. Training Pipeline

Pre-training, fine-tuning, and RLHF

3. AI Agents

From chatbots to autonomous systems that take action

4. Open vs. Closed-Source

Navigating the model landscape and choosing the right engine

Part 1

Understanding
Transformers

How LLMs think — and why they hallucinate

What is a Large Language Model (LLM)?

A next-token prediction engine trained on massive text — not a reasoning mind.

What it actually is

  • A neural network trained on trillions of words: books, websites, code, conversations, papers
  • One job: predict the most likely next token. Given everything before, what probably comes next?
  • "Large" = billions of learned parameters: internal dials tuned during training to capture patterns

What it is NOT (at its core)

  • Not a search engine: it doesn't look things up — though RAG and tool use can add retrieval
  • Not a database: it doesn't store or retrieve facts — though it can be connected to databases via agents
  • Not a reasoning mind: it doesn't "think" — though reasoning models add "thinking" tokens that simulate deliberation

These capabilities can be layered ON TOP of the core prediction engine — we'll cover RAG, agents, and reasoning models later.

The key insight: Every answer an LLM gives is a statistically likely continuation of your prompt — not a looked-up fact. This single idea explains almost everything about how these models behave.

It all starts with tokens

From text to the numbers a transformer actually processes.

Step 1 — Tokenize

"I love building apps"
I love build ing apps

Split text into chunks. Note: "building" → two tokens. 1 token ≈ 4 characters.

Step 2 — Token IDs

I love build ing apps
40 8520 5765 287 6394

Each token maps to an integer via a fixed vocabulary table. Just a lookup.

Step 3 — Embeddings

40 8520 ...
[0.21, −0.87, …] [0.93, 0.12, …]

Each ID maps to a high-dimensional vector.

The punchline: The transformer never sees text. It operates on embedding vectors — meaning encoded as geometry.
[Embedding space diagram: similar meanings cluster together (king/queen/prince, dog/cat/puppy, happy/joy/glad)]
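The three steps can be sketched in a few lines. This is a toy illustration: the five-entry vocabulary, the IDs, and the embedding vectors are all made up (real tokenizers use byte-pair encoding over a vocabulary of ~100K entries, and embeddings are learned, not hand-written).

```python
# Toy vocabulary mapping tokens to IDs (values invented for illustration).
VOCAB = {"I": 40, "love": 8520, "build": 5765, "ing": 287, "apps": 6394}

# Step 1 — Tokenize: split text into known chunks.
def tokenize(text):
    tokens = []
    for word in text.split():
        if word in VOCAB:
            tokens.append(word)
        else:  # fall back to splitting off a known prefix, e.g. "building" -> "build" + "ing"
            for i in range(len(word), 0, -1):
                if word[:i] in VOCAB:
                    tokens.append(word[:i])
                    tokens.append(word[i:])
                    break
    return tokens

# Step 2 — Token IDs: a fixed lookup table. Just a dictionary access.
def to_ids(tokens):
    return [VOCAB[t] for t in tokens]

# Step 3 — Embeddings: each ID indexes a row of a learned matrix (toy 2-D rows here).
EMBEDDINGS = {40: [0.21, -0.87], 8520: [0.93, 0.12], 5765: [0.40, 0.55],
              287: [-0.10, 0.77], 6394: [0.66, -0.31]}

tokens = tokenize("I love building apps")
ids = to_ids(tokens)
vectors = [EMBEDDINGS[i] for i in ids]
print(tokens)  # ['I', 'love', 'build', 'ing', 'apps']
print(ids)     # [40, 8520, 5765, 287, 6394]
```

From here on, the model only ever sees `vectors` — the text itself never enters the transformer.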

It's all next-token prediction

The model computes probabilities for every possible next token.

Prompt: "The capital of France is ___"

Paris
94.2%
a
2.8%
known
1.1%
not
0.3%
The punchline: LLMs are the world's most sophisticated autocomplete. They don't "know" anything. They predict what text is likely to come next based on patterns in their training data.
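A toy way to see "prediction = patterns in training data" is to count what follows each word in a tiny corpus. The three-sentence corpus below is made up, and a real LLM replaces raw counting with a neural network conditioned on the whole context — but the objective is the same.

```python
from collections import Counter

# A tiny made-up "training corpus". A real model sees trillions of tokens.
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is a city",
]

# Count what follows each word.
follows = {}
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows.setdefault(prev, Counter())[nxt] += 1

# "Prediction" = the observed frequency distribution over next words.
def next_token_probs(prev):
    counts = follows[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_token_probs("is"))  # paris ≈ 0.67, a ≈ 0.33
```

"paris" wins not because the model knows geography, but because that continuation dominated the training data — exactly the behavior described above.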

The Context Window

The model's working memory — everything it can "see" at once.

GPT-5.2

400K

tokens (~1,000 pages)

Claude Opus 4.6

1M

tokens (beta; 200K standard)

Gemini 3 Pro

1M

tokens (~2,500 pages)

The "Lost in the Middle" problem (Liu et al., Stanford, 2024)
▲ HIGH ATTENTION
You are an expert travel planner. I need a 10-day itinerary for Japan covering Tokyo, Kyoto, and Osaka.
The trip starts on March 15. We are two adults traveling on a moderate budget of about $200/day.
We love street food, temples, and nature walks. We prefer trains over buses for intercity travel.
Day 1 should include Shinjuku Gyoen and Meiji Shrine with dinner in Omoide Yokocho for yakitori.
For Day 2, focus on Akihabara in the morning and Asakusa with Senso-ji Temple in the afternoon.
On Day 3, take the Shinkansen to Kyoto. Hotel should be near Kyoto Station for easy access to buses.
Day 4 should cover Fushimi Inari early morning and Kinkaku-ji in the afternoon. Budget about ¥3000 for lunch.
Day 5 could be Arashiyama bamboo grove and monkey park. Consider renting bikes for the area near the river.
IMPORTANT: My partner is allergic to shellfish. Please make sure all restaurant recommendations account for this.
Day 6 in Nara to see the deer park and Todai-ji. This can be a half-day trip returning to Kyoto by evening.
Day 7 is the transfer to Osaka via local train. Check into hotel near Namba for the street food scene.
▼ LOW ATTENTION — information here gets overlooked ▼
Day 8 should feature Osaka Castle in the morning and Dotonbori in the evening for takoyaki and okonomiyaki.
Day 9 is a flex day. Options: day trip to Himeji Castle or explore Shinsekai and Tsutenkaku Tower area.
Day 10 is departure from KIX. Allow 2 hours for airport transfer. Morning could include last-minute shopping.
Keep a small budget reserve for souvenirs — about ¥10,000. Pack light layers for March weather variability.
We'll need pocket wifi or eSIM. Prefer an eSIM that covers the full 10 days with unlimited data if possible.
Please output the itinerary as a day-by-day table with columns for Date, Location, Morning, Afternoon, Evening.
Include estimated costs per day in USD. Flag any days where we might exceed the $200 budget.
Format the response in markdown. Use bold for must-see attractions and italic for optional activities.
▲ HIGH ATTENTION

Models pay the most attention to the beginning and end of the context window.

Information buried in the middle gets overlooked — even critical details like "my partner is allergic to shellfish."

Builder's rule: Put your most important context first and last. Never bury key instructions in the middle of a long prompt.

Neural Networks in 60 Seconds

The foundation that transformers are built on.

What is a Neural Network?

Layers of connected "neurons" that transform inputs into outputs.

Input (tokens) → Hidden 1 (patterns) → Hidden 2 (concepts) → Output (meaning)
  • Layers of connected "neurons" that transform inputs step by step
  • Each connection has a learned weight: a number that controls how much one neuron influences the next. "Billions of parameters" = billions of these weights
  • Deeper layers detect more abstract patterns: tokens → word patterns → phrases → meaning
The analogy: Like a factory assembly line — raw materials go in, each station transforms them, and a finished product comes out the other end.
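The assembly line in minimal code: a two-layer network where each "station" is a weighted sum followed by a nonlinearity. The weights below are made-up constants; in a real network they are learned during training.

```python
# Made-up weights for a tiny 2-input, 2-hidden-neuron, 1-output network.
W1 = [[0.5, -0.2], [0.1, 0.8]]   # input layer -> hidden layer
W2 = [0.7, -0.3]                 # hidden layer -> single output

def relu(x):
    # The nonlinearity: pass positives through, zero out negatives.
    return max(0.0, x)

def forward(inputs):
    # Each hidden neuron: weighted sum of the inputs, then the nonlinearity.
    hidden = [relu(sum(w * x for w, x in zip(row, inputs))) for row in W1]
    # Output neuron: weighted sum of the hidden activations.
    return sum(w * h for w, h in zip(W2, hidden))

print(forward([1.0, 2.0]))  # ≈ -0.44
```

Training is just the process of nudging `W1` and `W2` until `forward` produces useful outputs; "billions of parameters" means billions of these numbers.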

The "black box" problem

We can see inputs and outputs — but not what happens in between.

Input

"The capital of France is"

?

Billions of weights
doing... something

Output

"Paris"

What we know

  • Each weight is just a number. But we don't know what concept it represents
  • The network works — often remarkably well. But we can't fully explain why it gives a specific answer

Why it matters

  • Hard to predict when a model will fail, or why it made a particular mistake
  • Active research area: mechanistic interpretability. Anthropic and others are working to reverse-engineer what's inside

Transformers Overview

A visual primer before we dive in.

From Neural Nets to Transformers

"Attention Is All You Need" — Vaswani et al., Google Brain, 2017. The paper that launched GPT, Claude, Gemini, and the entire modern AI era.

Traditional Neural Net

The cat sat ... One word at a time → Slow. Forgets long sequences.

Transformer

The cat sat ... All words at once ⚡ Every word attends to every other. Fast. Handles long context.
  • Traditional nets process words sequentially: slow, and they forget the start of long sequences
  • Transformers process all words in parallel: massively faster and better at long context
  • The secret weapon: self-attention. Every word can look at every other word

Simplified transformer block:

Input Text
Embedding
Self-Attention
+
Feed-Forward
repeat N times
Output Prediction

What a Transformer actually does

The core loop

  • Takes in a sequence of tokens
  • Every token looks at every other token: this is the "attention" mechanism
  • Computes relevance scores between all tokens: "How much should I care about this word to predict the next one?"
  • Multi-head attention runs many attention operations in parallel: each "head" captures a different relationship type — grammar, coreference, meaning
  • Outputs a probability distribution over ALL possible next tokens

Attention in action

"The cat sat on the mat because it was tired"

What does "it" attend to?

cat
0.82
mat
0.09
sat
0.04
the
0.02

The model learns that "it" refers to "cat" — not through grammar rules, but through statistical patterns.
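A sketch of where those attention weights come from: raw relevance scores pushed through softmax so they form a weighted mix. The scores below are made-up stand-ins for the query-key dot products a real model computes.

```python
import math

# Made-up relevance scores for "it" against earlier tokens
# (a real model computes these as dot products of learned query/key vectors).
scores = {"cat": 4.0, "mat": 1.8, "sat": 1.0, "the": 0.3}

def softmax(values):
    # Exponentiate and normalize so the scores become weights summing to 1.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

weights = dict(zip(scores, softmax(list(scores.values()))))
print(weights)  # "cat" gets ≈ 0.84 of the attention
```

The representation of "it" is then rebuilt as this weighted mix of the other tokens' vectors — so most of its new meaning flows in from "cat".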

Why hallucinations happen

They're not bugs — they're a fundamental property of the architecture.

The model will confidently:

  • Invent citations that don't exist: the pattern of a citation is correct; the specific one is fabricated
  • State false facts with certainty: it has no mechanism to distinguish "true" from "sounds true"
  • Create plausible but wrong code: syntactically valid, logically flawed
  • Fabricate data, names, dates: if the pattern fits, it generates

Why this happens:

The model doesn't directly retrieve information.

It generates statistically likely text.

It has no internal fact-checker.

It only knows what sounds right based on patterns in training data.

Builder's takeaway: Always design verification layers into anything you build with AI.

See Appendix for more details on Transformers

Reducing hallucinations with RAG

Retrieval-Augmented Generation: before the LLM answers, a retriever searches your documents and injects the relevant results into the prompt.

1. USER QUERY: "What were our Q3 earnings?" (not in training data)
2. VECTORIZE: an embedding model turns the query into a vector, e.g. [0.21, 0.87, −0.14…]
3. VECTOR DATABASE: search over PDFs, databases, company docs
4. RELEVANT CHUNKS: "Q3 revenue was $4.7B, up 12%..." / "Net income grew to $890M in Q3..." / "Operating margin expanded to 19%..." / "Guidance for Q4 remains strong..."
5. AUGMENTED PROMPT: Context (the retrieved chunks) + User Question: "What were our Q3 earnings?" → LLM (still predicting the most probable next token)
6. RESPONSE: "Q3 earnings were $890M, up 12% from last year." (more accurate, not guaranteed)

Why RAG helps

  • Model does reading comprehension on real documents, not generation from memory
  • Answers are grounded in your actual data — PDFs, databases, company docs
  • Dramatically reduces hallucination rate in practice

Why hallucination still happens

  • The LLM step still outputs probabilities over tokens — it remains a prediction machine
  • Can misinterpret retrieved docs or blend them with training data
  • Can generate plausible inferences the source doesn't actually support
Key insight: RAG changes what the model reads, not how it generates. The underlying mechanism is unchanged — you always need verification layers.
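The retrieve-then-augment loop can be sketched in a few lines. A simple word-overlap score stands in for the embedding-similarity search a real vector database performs; the chunks reuse the made-up examples from the diagram above.

```python
import string

def words(text):
    # Normalize: lowercase, strip punctuation, split into a set of words.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def relevance(chunk, query):
    # Stand-in for vector similarity: count shared words.
    return len(words(chunk) & words(query))

chunks = [
    "Q3 revenue was $4.7B, up 12% year over year",
    "The office cafeteria menu changes weekly",
    "Net income grew to $890M in Q3",
]

query = "What were our Q3 earnings?"

# Steps 1-4: retrieve the most relevant chunks for the query...
ranked = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)
context = "\n".join(ranked[:2])

# Step 5: ...and inject them into the prompt. The LLM step is unchanged:
# it still predicts the most probable next token, now with the facts in view.
augmented_prompt = f"Context:\n{context}\n\nUser question: {query}"
print(augmented_prompt)
```

The irrelevant cafeteria chunk gets filtered out before the model ever sees the prompt — that filtering, not any change to the model, is what RAG adds.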
Part 2

The Training
Pipeline

Pre-training, Fine-tuning, and RLHF
(Reinforcement Learning from Human Feedback)

Three stages build a model

Pre-Training

Learn language patterns
from massive text data

Fine-Tuning

Learn to follow
instructions & tasks

RLHF

Learn human preferences
& safety alignment

ChatGPT / Claude

The product
you interact with

The analogy:   School → Job training → Social skills → Ready for work
Stage 1

Pre-Training

Read the internet. Learn patterns. Predict the next word.

What happens

  • Model reads trillions of tokens of text: books, Wikipedia, code, forums, papers, news
  • Single objective: predict the next token. Given everything before, what comes next?
  • Learns grammar, facts, reasoning, code — all from this one task
  • Produces a "base model": can continue text, but can't hold a conversation

By the numbers

15T+

tokens of training data

$100M+

estimated compute cost

Months

of training on thousands of GPUs

Key insight: The training data has a cutoff date. The model literally doesn't know about events after that date.
Stage 2

Fine-Tuning

Teaching the base model to follow instructions and be useful.

The problem it solves

Base model behavior:

User: What is the capital of France?
Model: What is the capital of Germany?
What is the capital of Spain?
What is the capital of Italy?...

It continues the pattern, not the conversation.

After fine-tuning:

User: What is the capital of France?
Model: The capital of France is Paris.

It answers the question helpfully.

How it works

  • Train on curated (instruction, response) pairs: thousands of examples of good Q&A behavior
  • Model learns the "assistant" role: follow instructions, stay on topic, be helpful
  • Much smaller dataset than pre-training: quality over quantity
  • Can be domain-specific: medical, legal, finance, customer service
For builders: Fine-tuning is how you customize models for YOUR specific use cases. We'll cover this in later sessions.
Stage 3

Reinforcement Learning from Human Feedback

RLHF — teaching the model what humans actually want.

The process

1. Generate

Model produces multiple responses to the same prompt

2. Rank

Human reviewers rank responses from best to worst

3. Learn

A reward model learns to predict human preferences

4. Optimize

Main model is tuned to maximize the reward score

What it teaches

  • Be helpful and follow instructions
  • Admit when uncertain: "I'm not sure, but..." vs. confidently wrong
  • Decline dangerous or harmful requests
  • Match human communication style
Why models feel different: Claude, ChatGPT, and Gemini have different "personalities" because each company has different RLHF priorities and training approaches.
RLHF in Practice

What RLHF looks like

Human reviewers rank model outputs — the model learns from their preferences.

Prompt: "How do I pick a lock?"

Response A

"Sure! First, get a tension wrench and a rake pick. Insert the wrench into the bottom of the keyhole..."

Rank: 3rd (last)

Response B

"I can't help with that. Lock picking could be used for illegal purposes and I'm not able to provide instructions."

Rank: 2nd

Response C

"Lock picking is a legitimate skill used by locksmiths and security professionals. If you're locked out, here are your options..."

Rank: 1st ★

The reward model learns: Helpful + responsible > Refusal > Helpful + irresponsible. Thousands of rankings like this shape the model's "personality."
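How thousands of rankings become a training signal: a common formulation (Bradley-Terry) assigns each response a score and trains the reward model so the human-preferred response gets the higher score. The scores below are made up for illustration.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical scores the reward model currently assigns to the three responses.
rewards = {
    "C_helpful_responsible": 2.1,
    "B_refusal": 0.4,
    "A_helpful_irresponsible": -1.3,
}

# Probability the model assigns to the human's judgment "C is better than B".
p_c_over_b = sigmoid(rewards["C_helpful_responsible"] - rewards["B_refusal"])

# Training minimizes -log(p) for every human-ranked pair, nudging the scores
# until they reproduce the reviewers' orderings.
loss = -math.log(p_c_over_b)
print(p_c_over_b, loss)
```

Here the model already agrees with the reviewers (C ≻ B gets probability ≈ 0.85), so the loss is small; a pair it ranks backwards would produce a large loss and a big correction.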

This is why

Claude ≠ ChatGPT

Different reviewers, different rankings

RLHF Side Effect

The Sycophancy Problem

RLHF teaches models to agree with you — even when you're wrong.

What happens

RLHF rewards responses humans prefer. Humans prefer responses that agree with them. So the model learns:

"Agreement = reward"

  • Validates wrong answers instead of correcting them
  • Gets worse with scale — smarter models flatter better. Inverse scaling: larger models are more sycophantic (Sharma et al., Anthropic)
  • OpenAI rolled back a GPT-4o update (April 2025) for excessive agreeableness

How to defend against it

  • Be skeptical when the model agrees: especially on opinions or decisions — enthusiastic agreement is a red flag
  • Ask it to argue the opposite: "Now tell me why this is a bad idea" often reveals the real answer
  • Use system prompts to counteract: "Prioritize accuracy over agreeableness. Push back when I'm wrong."
  • Never tell the model what you think: "When Truth is Overridden" (Wang et al., 2025) found models are significantly more likely to agree with your stated beliefs
Builder's takeaway: The model that feels most helpful might be the least trustworthy. Design for truth, not comfort.

The full picture

Pre-Training

📚

School

Broad knowledge from reading the internet

Cost: $10M–$100M+
Data: trillions of tokens
Time: months

Fine-Tuning

🎯

Job Training

Learn to follow instructions and be useful

Cost: $10K–$1M
Data: thousands of examples
Time: days

RLHF

🤝

Social Skills

Learn what humans prefer and how to be safe

Cost: $1M–$10M
Data: human rankings
Time: weeks

The shortcut: model distillation

Why open-source caught up so fast — and why small models keep getting better.

Teacher (Claude, GPT, Gemini): pre-trained + fine-tuned + RLHF'd. $100M+ to build.
    ↓ millions of prompt → response pairs used as training data (includes all the behaviors from fine-tuning + RLHF)
Student (smaller, cheaper model): learns by imitating the teacher. $1M or less to build.
    Skipped: fine-tuning and RLHF (already baked into the teacher's outputs)

The trade-off

Distilled models are compressed knowledge — like great notes vs. actually taking the class. Not quite as good as the teacher, but dramatically cheaper to build and run.

Why this matters

This is why the model landscape changes so fast. A new frontier model drops, and within weeks there are dozens of distilled variants optimized for different use cases.
  • Open-source explosionDeepSeek-R1-Distill, Alpaca, Vicuna, Orca, Phi — all distilled from frontier models at a fraction of the cost
  • 100× cheaperSkip the $100M pre-training and the army of human reviewers
  • Small models, big capabilityA 7B parameter model distilled from a frontier model can outperform a 70B model trained from scratch
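What "learning by imitating the teacher" means mechanically: the student is trained to match the teacher's full next-token distribution, not just a single "correct" token. A sketch with made-up distributions over four candidate tokens:

```python
import math

# Made-up probability distributions over 4 candidate next tokens.
teacher = [0.70, 0.20, 0.08, 0.02]   # what the frontier model predicts
student = [0.40, 0.30, 0.20, 0.10]   # what the small model currently predicts

# Cross-entropy of the student against the teacher's "soft labels" --
# the quantity gradient descent drives down during distillation.
loss = -sum(t * math.log(s) for t, s in zip(teacher, student))

# Lower bound: a student that matches the teacher exactly pays only
# the teacher's own entropy.
floor = -sum(t * math.log(t) for t in teacher)
print(loss, floor)
```

The gap between `loss` (≈ 1.06 here) and `floor` (≈ 0.85) is what training shrinks; soft labels carry far more signal per example than a single right answer, which is part of why distillation is so cheap.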

Case study: DeepSeek R1

When distillation, export controls, and geopolitics collided.

Jan 20, 2025 — The claim

DeepSeek releases R1, claiming performance comparable to OpenAI's best models — trained for just $5.6 million on older chips.

Jan 27, 2025 — The plummet

NVIDIA loses $589 billion in market cap in a single day — the largest one-day loss in stock market history.

If you don't need expensive chips, you don't need NVIDIA.

Feb 2025 — The accusations

OpenAI tells Congress it detected DeepSeek employees using obfuscated methods to extract outputs from ChatGPT for training.

White House AI czar David Sacks: "substantial evidence" of distillation.

Why this matters for builders: Distillation sits at the intersection of technology, intellectual property, and geopolitics. The legal and ethical boundaries are still being drawn — in real time.

Allegation 1: Illegal distillation

OpenAI accused DeepSeek of using their API outputs to train R1 — violating OpenAI's terms of service. Evidence presented to Congress but not made public.

Sources: Bloomberg, Feb 2026; Fortune, Jan 2025

Allegation 2: Chip smuggling

DeepSeek allegedly acquired thousands of banned NVIDIA Blackwell chips despite US export controls. DOJ separately busted a $160M smuggling ring moving H100/H200 chips to China.

NVIDIA called the DeepSeek-specific claims "far-fetched." Sources: CNBC, Dec 2025; WinBuzzer, Dec 2025

Part 3

AI Agents

From chatbots to autonomous systems that take action

What makes something an agent?

Agents don't just respond — they take action in a loop.

Chatbot

You ask a question
Model generates response
Waits for next input

One turn at a time. No autonomy.

Agent

You give a task
Think & plan
Use tools (search, code, files)
Observe result
loop until done

We'll go deeper on agents in Session 3

How they connect to tools, how to control them, and how to build reliable agent workflows.

For now, remember: Chatbots answer questions. Agents complete tasks. That distinction changes how you think about building with AI.
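The task → think → act → observe loop can be shown in miniature. Everything here is a hard-coded stand-in: `toy_model` plays the role of the LLM choosing the next action, and the "tools" are fake — a real agent calls an actual model and real search/code/file tools.

```python
# Stand-in for the LLM deciding the next action from the task and what
# it has observed so far (a real agent prompts a model for this).
def toy_model(task, observations):
    if not observations:
        return ("search", task)                 # step 1: gather information
    if len(observations) == 1:
        return ("summarize", observations[0])   # step 2: process it
    return ("done", observations[-1])           # step 3: finish

def run_tool(name, arg):
    # Hypothetical tools; real agents call search APIs, run code, edit files.
    tools = {
        "search": lambda q: f"results for: {q}",
        "summarize": lambda text: f"summary of ({text})",
    }
    return tools[name](arg)

def agent(task, max_steps=5):
    observations = []
    for _ in range(max_steps):                  # loop until done, with a step cap
        action, arg = toy_model(task, observations)
        if action == "done":
            return arg
        observations.append(run_tool(action, arg))
    return "gave up"

print(agent("best ramen in Tokyo"))
```

The structural difference from a chatbot is visible in the code: the `for` loop. A chatbot returns after one model call; an agent keeps choosing actions, observing results, and feeding them back in until the task is done.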
Part 4

Open vs.
Closed-Source

Navigating the model landscape — and choosing the right engine

The model landscape in 2025-2026

Closed-Source

You rent access. They control the model.

GPT-5.3 / o3 / o4-mini
OpenAI
Claude Opus 4.6 / Sonnet 4.5
Anthropic
Gemini 3 Pro
Google

Open-Source / Open-Weight

You can download, run, and modify them.

Llama 4
Meta
Mistral 3
Mistral AI
DeepSeek-V3.2
DeepSeek

The agent landscape (early 2026)

Closed-Source Agents

You rent access. They control the agent.

Claude Code — Anthropic

A coding assistant that runs on your computer. It can read your files, write code, and run commands for you. Your data stays on your machine.

Codex — OpenAI

A coding assistant that runs in the cloud. You give it a task and it works on it in the background, then delivers the result.

Open-Source Agents

You can download, modify, and run them yourself.

OpenClaw — formerly Clawdbot

A personal AI assistant that connects to your messaging apps (WhatsApp, Slack, etc.) and can take actions for you — send emails, manage your calendar, browse the web, and more. Went viral in Jan 2026 (100K+ users in one week).

Heads up: OpenClaw had a serious security flaw in its first week that could let hackers take control. Open-source tools move fast — but you're responsible for vetting them.

Where this is going: Agents are becoming the main way people use AI — not just chatting, but getting things done. We'll build with Claude Code in Session 3.

The tradeoffs

| Dimension | Closed-Source | Open-Source (self-hosted) | Open-Source (hosted) |
| --- | --- | --- | --- |
| Performance | Frontier capability | Effectively caught up (2026) | Same models, same quality |
| Cost Model | Pay per token (API) | Infrastructure cost; zero per-token | Cheaper API; often 50-90% less |
| Data Privacy | Your data hits their servers | Runs on your own infra | Data hits provider servers |
| Customization | Limited to prompts + some fine-tuning | Full control: fine-tune, modify, merge | Some fine-tuning supported |
| Setup Effort | API key and go | Need GPU infra | API key and go |
| Vendor Risk | They can change models, pricing | Weights are yours forever | Can switch providers easily |
| Providers | OpenAI, Anthropic, Google | Your own GPUs / cloud | Together AI, Fireworks, Groq |

What an LLM looks like in code

Simplified PyTorch — open-source and closed-source models share this same structure.

Model Architecture

# 1. Convert token IDs → vectors
class Embedding(nn.Module):
  def forward(self, token_ids):
    return self.embed(token_ids) \
         + self.position(positions)

# 2. One transformer block
class TransformerBlock(nn.Module):
  def forward(self, x):
    attn = self.attention(x)  # Attend
    x = x + attn              # Residual
    x = x + self.ffn(x)      # Transform
    return x

# 3. The full model
class LLM(nn.Module):
  def __init__(self):
    self.embed  = Embedding()
    self.blocks = [TransformerBlock()
                   for _ in range(96)]
    self.output = Linear(hidden, vocab)

  def forward(self, token_ids):
    x = self.embed(token_ids)
    for block in self.blocks:
      x = block(x)
    logits = self.output(x)
    return softmax(logits / T)

Training Loop (Gradient Descent)

# Repeat trillions of times
for batch in training_data:

  # Forward pass — make predictions
  predictions = model(batch.input)

  # How wrong were we?
  loss = cross_entropy(
    predictions,
    batch.next_token
  )

  # Compute gradients
  # (how much did each weight
  #  contribute to the error?)
  loss.backward()

  # Update all weights by a tiny
  # amount to reduce the error
  # ← THIS IS LEARNING
  optimizer.step()
The insight: The core algorithm fits on one slide. The difference between open and closed source isn't the algorithm — it's who has access to the trained weights and the data that shaped them.

A decision framework

Start with closed-source when:

  • You need maximum capability
  • You're prototyping and want to move fast
  • Data privacy isn't a hard constraint
  • You don't want to manage infrastructure
  • You need multimodal (vision, audio, etc.)

Consider open-source when:

  • Data must stay on your infrastructure
  • You need deep customization / fine-tuning
  • Cost at scale is a concern (high volume)
  • You need to run offline or on-device
  • You can't accept vendor lock-in risk
Pro tip: Many production systems use both. Closed-source for hard reasoning tasks; open-source for high-volume, simpler tasks. The right answer is rarely "always one or the other."

Right-sizing your model

Bigger isn't always better. Match the model to the task.

Small Models

~1B–8B parameters

Fast, cheap, good for simple tasks

  • Classification
  • Extraction
  • Simple Q&A

GPT-5 Mini, Haiku 4.5, Llama 8B

Medium Models

~30B–100B parameters

Balanced performance and cost

  • Content generation
  • Code assistance
  • Analysis

Sonnet 4.5, Gemini Flash, Mistral 3

Frontier Models

Hundreds of billions+

Maximum capability, highest cost

  • Complex reasoning
  • Novel problem solving
  • Multi-step planning

GPT-5.3, Opus 4.6, Gemini 3 Pro

Builder's rule: Use the smallest model that can reliably do the job. Sending a classification task to Claude Opus is like hiring a PhD to sort mail.
Key Takeaways

What to remember from today

Homework

Before Session 2

1. Install a tool

First, install Node.js (needed for npm):

Mac: Install Homebrew first if you don't have it:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then run: brew install node

Windows: Go to nodejs.org, download the LTS installer, and run it. Accept defaults.

Then install an AI coding agent:

# Claude Code
npm install -g @anthropic-ai/claude-code

# OpenAI Codex
npm install -g @openai/codex

AI tip: Use your favorite LLM to help you debug the installation process!

2. Additional reading

Transformers:

Context & Hallucination:

RLHF & Training:

AI tip: Try NotebookLM to break down long papers!

3. Bring a use case

Think of a task from your own work that involves text:

  • Writing you do repeatedly
  • Data you analyze or summarize
  • Research or information gathering
  • Customer communications
  • Reports or documentation

We'll build a prompt system for it in Session 2.

AI tip: Ask your favorite LLM to help you find a good candidate!
I want to identify one high-value AI use case from my real work.

Phase 1 (prompt-only first):
- Ask me 8–10 short questions (one at a time) to identify repeatable tasks (writing, analysis, research, customer communication, reporting, documentation, etc.).
- Based only on my answers, give me my top 3 use cases ranked by impact and ease.
- For each use case, include: why it fits, expected time savings, risks, and required inputs.
- Recommend one "starter use case" and write:
  a) a first-draft prompt I can use immediately
  b) a refined prompt template with placeholders

Phase 2 (optional agentic upgrade):
- For the same starter use case, propose how agentic capabilities could be added later.
- Specify what the agent could do, what tools/data access it would need, what approvals/human checkpoints are required, and key safety controls.
- Keep this as an upgrade path; do not assume agentic behavior in the Phase 1 prompt.

Keep outputs practical and specific to my workflow.
Quiz

Test yourself

Can you answer these from today's session?

Q1

What is the single core operation that an LLM performs?

Q2

Why do LLMs hallucinate? Explain in one sentence.

Q3

Name the three stages of the training pipeline and what each one teaches.

Q4

Does RAG eliminate hallucination? Why or why not?

Q5

What's one advantage of open-source models over closed-source models?

Q6

What makes an AI agent different from a standard chatbot?

Thank You

Questions, ideas, and "wait, what?" moments welcome.

Harper Carroll AI  ·  AI User to AI Builder  ·  Session 1: The Engine  ·  Cohort 1

Appendix

Technical
Deep Dives

Additional detail on transformer internals — for the curious.

The transformer block

Two steps, repeated dozens of times. Each layer builds deeper understanding.

Token embeddings in → [Transformer block: Self-Attention (tokens share information with each other) + normalize → Feed-Forward Network (each token "thinks" independently) + normalize] → repeat ×96 layers → rich token representations out

What each layer learns

Early layers (1–20)

Syntax, grammar, word relationships
Which adjective modifies which noun?

Middle layers (20–60)

Semantic meaning, idioms, logical patterns
Understanding metaphor, cause and effect

Deep layers (60–96+)

Abstract reasoning, world knowledge, intent
What does the user actually want?

Analogy: Attention is the conversation step — tokens talk to each other. Feed-forward is the thinking step — each token digests what it heard.

Terms to know

Three concepts you need before we discuss temperature.

Input ("The capital of France") → embeddings → hidden layers (attention, feed-forward) → logits: 4.2, 1.8, 0.3, −0.7, −1.2, …, −3.1 (100K raw scores) → softmax → probs: 62%, 20%, 8%, 4%, 2%, …, 0% (sums to 100%)

Logits

The raw scores from the final layer of the neural network — one number per possible next token. Not probabilities yet. Can be negative.

Softmax

The function that converts logits into probabilities. Makes them all positive and sum to 100%. Bigger logits get exponentially bigger probabilities.

Probability Distribution

The full set of probabilities across all possible next tokens.

Sharp

One token dominates

Flat

Probability spread evenly

Inside self-attention

Each token asks: "Which other tokens matter to me?"

Tokens: "The", "cat", "sat", …, "it". The token "it" asks every other token "how relevant are you to me?" and gets scores (0.05, 0.82, 0.09, 0.04, …), then mixes their vectors by those weights. The result is an enriched "it" that now contains 82% of "cat"'s information plus traces of every other token — "it" now effectively knows it refers to "cat". Not through grammar rules — through learned statistical patterns.
The key insight: Every token checks every other token for relevance. The thicker the connection, the more information flows. This happens simultaneously across multiple "attention heads" — each one learning different relationship types (grammar, meaning, position). The combined result: each token builds a rich understanding of its context.

From layers to output

The final step: turn enriched representations back into words.

Input

"The cat sat on the mat because it was"

96 Layers

Attention + FFN ×96

Map to vocab

→ 100K logits

Softmax

→ probabilities

Sample

"tired"

The autoregressive loop

Step 1: "The cat sat" → "on"
Step 2: "The cat sat on" → "the"
Step 3: "The cat sat on the" → "mat"
Step 4: "The cat sat on the mat" → "because"
...each step runs the FULL pipeline again
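The loop above in code, with a hard-coded `toy_next` standing in for the full 96-layer pipeline (a real model recomputes attention over the whole growing sequence on every step):

```python
# Stand-in for one full forward pass: look at the last token and return
# the "most likely" continuation (hard-coded for this toy example).
def toy_next(tokens):
    continuation = {"sat": "on", "on": "the", "the": "mat", "mat": "because"}
    return continuation.get(tokens[-1], "<end>")

tokens = ["The", "cat", "sat"]
while True:
    nxt = toy_next(tokens)   # the FULL pipeline runs again every iteration
    if nxt == "<end>":
        break
    tokens.append(nxt)       # the output is appended and fed back in

print(" ".join(tokens))      # The cat sat on the mat because
```

Each appended token triggers another complete pass — which is exactly why generation cost scales with output length.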

The cost of generation

A 500-word response means the model runs this full pipeline ~700 times.

Each pass: all 96 layers, all attention heads, all feed-forward networks. Billions of calculations per token.

This is why: AI costs money per token, longer responses cost more, and output tokens are more expensive than input tokens.

Temperature reshapes the distribution

Same logits, same model, same prompt — temperature just reshapes the curve. Low T = sharp (deterministic). High T = flat (creative). This is why the same prompt gives different answers: it's sampling differently from a reshaped distribution.

P(token) = softmax( logit / T ) OpenAI/Gemini: 0–2  |  Claude: 0–1
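The formula in code, with made-up logits for the four candidate tokens from the panels below:

```python
import math

# Invented logits for the candidate next tokens.
logits = {"mat": 3.0, "floor": 1.9, "couch": 1.4, "moon": 0.2}

def softmax_with_temperature(logits, T):
    # P(token) = softmax(logit / T)
    scaled = {tok: v / T for tok, v in logits.items()}
    m = max(scaled.values())                       # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

for T in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(T, {tok: round(p, 3) for tok, p in probs.items()})
```

Same logits every time; only T changes. At T = 0.1 "mat" takes essentially all the probability mass (≈ 1.0); at T = 1.5 it drops to roughly half, and even "moon" gets a real chance.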

T → 0

Deterministic

Prompt: "The cat sat on the ___"

mat
97%
floor
couch
moon

Spike. Always picks "mat."

Best for: code, facts, analysis

T = 0.7

Balanced

Prompt: "The cat sat on the ___"

mat
62%
floor
20%
couch
12%
moon

Softened. Top token likely, but others have a chance.

Best for: conversation, writing

T = 1.5

Creative

Prompt: "The cat sat on the ___"

mat
30%
floor
24%
couch
22%
moon
18%

Flat. Even unlikely tokens get real chances.

Best for: brainstorming, fiction