Research · 14 min read

What the DeepSeek R1 paper actually says

March 28, 2026

The DeepSeek R1 paper dropped in January 2025 and broke the AI discourse for two weeks. Most of the takes were wrong in both directions: the "China beats OpenAI" framing and the defensive dismissals alike.

Here's what the paper actually demonstrates.

What R1 is

DeepSeek R1 is a reasoning model trained primarily through reinforcement learning on outcome-based rewards, not the standard RLHF pipeline. The key result comes from R1-Zero, the precursor trained with RL alone: competitive reasoning behavior emerges without a supervised fine-tuning stage. (R1 itself adds a small "cold start" SFT stage before RL, mainly to improve readability of its outputs.)

This matters because SFT data — high-quality chain-of-thought examples — is hard to produce at scale. If you can get RL to find good reasoning traces on its own, you reduce a major bottleneck.
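To make "RL from outcomes" concrete, here is a hypothetical reward function in the spirit of the paper's rule-based accuracy rewards: it grades only the final answer and never looks at the reasoning trace. The `\boxed{}` convention is borrowed from math benchmarks as an assumption, not the paper's exact format.

```python
def outcome_reward(completion: str, reference_answer: str) -> float:
    """Score a model completion by its final answer alone.

    Assumes the answer is wrapped in \\boxed{...} (a common math-benchmark
    convention). Everything before it — the chain of thought — is never graded.
    """
    marker = "\\boxed{"
    start = completion.rfind(marker)
    if start == -1:
        return 0.0  # no parseable final answer
    start += len(marker)
    end = completion.find("}", start)
    if end == -1:
        return 0.0  # malformed answer span
    answer = completion[start:end].strip()
    return 1.0 if answer == reference_answer.strip() else 0.0
```

Because the reward only checks the endpoint, any reasoning trace that reliably reaches correct answers gets reinforced, whatever it looks like along the way.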

The GRPO result

The paper relies on Group Relative Policy Optimization (GRPO), a PPO variant introduced in DeepSeek's earlier DeepSeekMath work that drops the separate value network: instead of a learned critic, the baseline comes from a group of sampled completions for the same prompt. The simplified objective makes training cheaper in memory and more stable.
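The group-relative baseline is the part worth seeing concretely. A minimal sketch of the advantage computation (it omits the clipped importance ratio and KL penalty from the full objective, and the paper's normalization differs in small details such as an epsilon term):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages for G sampled completions of one prompt.

    Each completion's reward is centered on the group mean and scaled by the
    group's standard deviation. This statistical baseline replaces the learned
    value network that PPO would otherwise need.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: uniform rewards give std 0
    return [(r - mean) / std for r in rewards]
```

With binary outcome rewards, a group like `[1, 0, 0, 1]` yields positive advantages for the correct completions and negative for the rest; a group where every sample gets the same reward carries no learning signal at all, which is why group sampling per prompt matters.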

The headline benchmark numbers (matching o1 on AIME, MATH, and coding benchmarks) are real. But benchmark matching doesn't mean architectural parity — it means the model found different paths to similar outputs.

What the paper doesn't show

It doesn't show parity with o1 across the board: the comparisons are on reasoning and coding benchmarks, not general capability, and matching scores doesn't establish that the models work the same way internally. It also doesn't settle the cost debate; the paper describes the training recipe but gives no full accounting of the compute behind the base model.

What it actually changes

The real contribution is demonstrating that RL-only fine-tuning can produce emergent chain-of-thought behavior — the model learns to reason step-by-step without being explicitly taught to do so.

That's a meaningful result. It suggests the reasoning capability is more latent in base models than we realized, and that the training pipeline is less of a blocker than the base model quality.

The competitive landscape point

The paper is notable evidence that capable reasoning models can be trained outside the OpenAI/Anthropic/Google infrastructure cluster. The cost advantage is real, even if overstated in the press. That matters for the field, regardless of where you think the geopolitical stakes land.


The full paper is worth reading — it's unusually clear for a research release. Section 4 on the RL training details is where the interesting technical decisions are.