What the DeepSeek R1 paper actually says
March 28, 2026
The DeepSeek R1 paper dropped in January 2025 and broke the AI discourse for two weeks. Most of the takes were wrong, in one direction or the other: the "China beats OpenAI" framing and the defensive dismissals alike.
Here's what the paper actually demonstrates.
What R1 is
DeepSeek R1 is a reasoning model trained primarily through reinforcement learning on verifiable outcomes, not the standard RLHF pipeline. The key result, shown most starkly by the R1-Zero variant, is that you can train competitive reasoning behavior with RL alone: R1-Zero skips supervised fine-tuning entirely, and the released R1 uses only a small cold-start SFT set before the RL stage.
This matters because SFT data — high-quality chain-of-thought examples — is hard to produce at scale. If you can get RL to find good reasoning traces on its own, you reduce a major bottleneck.
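Concretely, "outcomes" here means answers that can be checked by a rule rather than rated by humans or a learned reward model. A minimal sketch of that kind of accuracy reward, assuming a math-style setup where the model is asked to put its final answer in \boxed{} (the extraction rule and exact-match check are illustrative, not the paper's exact implementation):

```python
import re

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Outcome-based reward: check only the final answer, not the reasoning.
    Training data can be plain question/answer pairs; no chain-of-thought
    demonstrations are required."""
    match = re.search(r"\\boxed\{(.+?)\}", completion)  # illustrative answer format
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0
```

The paper also adds a format reward that checks the output structure; the point is that every signal is rule-based and cheap to compute at scale.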
The GRPO result
R1 is trained with Group Relative Policy Optimization (GRPO), a PPO variant introduced in DeepSeek's earlier DeepSeekMath work that doesn't require a separate value network: each sampled completion is scored relative to the other completions drawn for the same prompt. Dropping the value model makes training cheaper and simpler.
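As a rough sketch of the mechanic, with my own variable names and a simplified objective (the paper's full loss also carries a KL penalty against a reference policy): rewards for a group of sampled completions are normalized against the group itself, and those group-relative advantages plug into the usual PPO-style clipped surrogate.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO's baseline: normalize each completion's reward against its own
    sampling group, so no separate value network is needed."""
    # rewards: (num_prompts, group_size), one outcome reward per sampled completion
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def clipped_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate, reused with the group-relative advantages."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The baseline comes from the group itself, which is exactly what removes the value model and its memory cost.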
The headline benchmark numbers (matching o1 on AIME, MATH, and coding benchmarks) are real. But matching benchmark scores doesn't mean the models are equivalent; it means R1 found different paths to similar outputs on those tasks.
What the paper doesn't show
- The training cost is not $6M. That figure refers to the pre-training run of DeepSeek-V3. R1 is post-training on top of the V3 base model, not a model trained from scratch.
- It's not "the same as o1." R1 is slower at inference, more verbose, and behaves differently on distribution-shifted inputs. The benchmarks overlap; the models are distinct.
- The distilled models are good, but they're distilled. R1-Distill-Qwen-7B is impressive for its size because it learned from R1's traces, not because small models have become brilliant.
What it actually changes
The real contribution is demonstrating that RL-only fine-tuning can produce emergent chain-of-thought behavior — the model learns to reason step-by-step without being explicitly taught to do so.
That's a meaningful result. It suggests that reasoning capability is more latent in base models than we realized, and that the training pipeline is less of a bottleneck than base-model quality.
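For a sense of how little the pipeline specifies up front: the R1-Zero-style training prompt only fixes where the reasoning should go, not what it should look like. A rough paraphrase of that kind of template (the wording below is illustrative, not the paper's exact text):

```python
# Illustrative paraphrase of an R1-Zero-style prompt template, not the paper's
# exact wording: the model is told where to put its reasoning, never how to reason.
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "through the problem inside <think> </think> tags, then gives the final "
    "answer inside <answer> </answer> tags.\n"
    "User: {question}\n"
    "Assistant:"
)

def build_prompt(question: str) -> str:
    return TEMPLATE.format(question=question)
```

Everything between the think tags, including the self-checking and backtracking the paper highlights, shows up through the reward signal rather than through demonstrations.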
The competitive landscape point
The paper is notable evidence that capable reasoning models can be trained outside the OpenAI/Anthropic/Google infrastructure cluster. The cost advantage is real, even if overstated in the press. That matters for the field, regardless of where you think the geopolitical stakes land.
The full paper is worth reading; it's unusually clear for a research release. The sections on the RL training setup (reward design, the prompt template, and what didn't work) are where the interesting technical decisions are.