21 Comments

Thank you for the great visualization!

Would it be okay for me to translate it and share it with the Korean community?

I’ve translated several of Maarten’s posts in the past.

Thank you!


Of course, go right ahead. Send me the link after; I'd love to link to it.


Hi!

Here is the first draft:

https://tulip-phalange-a1e.notion.site/DeepSeek-R1-189c32470be2801c94b6e5648735447d?pvs=4

I will polish and update it as you add more detail.

Thank you~


Wonderful! Thanks! Added.


Thanks! It was well received in the Korean AI community as well.

Will update the post as you add more details :)


Thank you for sharing! I've translated it into Turkish because I think it's a great read for both the ML community and other enthusiasts: https://gist.github.com/gsamil/0a5ca3bf44e979151e6c5d33345ede16


Brilliant! Thank you! Added.


Many thanks for this great post. Here is an issue: according to the R1 paper, the model that goes through Supervised Fine-Tuning (SFT) should be the interim reasoning model (in your context), not the original DeepSeek-V3-Base model.

"We fine-tune DeepSeek-V3-Base for two epochs..." That wording certainly causes some confusion.

Here is the evidence: "When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round." Here SUBSEQUENT implies the SFT follows the RL, which means the authors use the output model of the first round as the input. "To further align the model with human preferences, we implement a secondary reinforcement learning stage." SECONDARY also indicates the first RL is the primary stage.

Most importantly, if the SFT were conducted on V3-Base, the final R1 would certainly lose the reasoning quality of R1-Zero.

So I think the four steps (cold start, reasoning-oriented reinforcement learning, SFT, and all-scenarios RL) are a sequential process.
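
A minimal sketch of the two readings being debated here, with the stage names taken from the paper; the function and strings below are placeholders, not real APIs:

```python
# Illustrative only: the four DeepSeek-R1 stages, with the contested choice
# (which checkpoint the ~800K-sample SFT starts from) made explicit.
def r1_pipeline(sft_starts_from_base: bool) -> list[str]:
    stages = [
        "1. cold-start SFT on DeepSeek-V3-Base (long-CoT seed data)",
        "2. reasoning-oriented RL -> interim reasoning checkpoint",
    ]
    sft_start = "DeepSeek-V3-Base" if sft_starts_from_base else "interim RL checkpoint"
    stages.append(f"3. SFT on ~800K curated samples, starting from {sft_start}")
    stages.append("4. RL for all scenarios (reasoning + helpfulness/harmlessness)")
    return stages

# The article's reading vs. this comment's reading:
print(*r1_pipeline(sft_starts_from_base=True), sep="\n")
print(*r1_pipeline(sft_starts_from_base=False), sep="\n")
```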


Great point! It's not how I first read it, but it's totally plausible now that you mention it!


This question puzzles me a lot. Look at this reproduction: https://github.com/huggingface/open-r1. Their process is just like the author said: fine-tuning the base model.


As always, thank you Jay for such amazing material. It might be helpful to mention that the 200K examples in your last figure are non-reasoning data points generated with DeepSeek-V3. I had to go back to the paper to remember where that number came from.

In total, 600K CoT examples + 200K non-reasoning examples = 800K SFT examples.
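
For reference, a back-of-the-envelope sketch of that composition (the numbers come from the paper; the variable names are just labels):

```python
# Rough composition of the ~800K SFT examples described in the R1 paper.
reasoning_cot_examples = 600_000    # reasoning traces collected via rejection sampling
non_reasoning_examples = 200_000    # non-reasoning data generated with DeepSeek-V3
total_sft_examples = reasoning_cot_examples + non_reasoning_examples
assert total_sft_examples == 800_000
```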


Loved it, Jay.


Thank you, Jay, for this amazing information.


Thank you, Jay, for explaining it in detail. It helps a lot.


Great read.


Great read, Jay! Thank you for putting this together


Great article! Thank you Jay

Been reading and going through the DeepSeek-V3 and R1 papers, and this came just in time!


"DeepSeek-R1 generates one token at a time" , yes, but with a twist. DS-R1 ( via DS-V3 ) has multi-token prediction training objective.


The article provides an excellent high-level summary but simplifies some technical nuances. While it is useful for non-expert readers, the DeepSeek-R1 technical report clarifies several details:

1. Misrepresentation of SFT and RL Training Order

Jay’s claim: The model follows a standard LLM pipeline, with SFT happening first, then RL.

Technical Report: DeepSeek-R1-Zero is trained without any SFT, relying entirely on RL at first. SFT is introduced later in DeepSeek-R1's development, after an initial RL phase, to enhance usability and align with human preferences.

Observation: While Jay acknowledges RL’s role, he presents SFT as preceding RL, whereas the official report states RL was first applied purely (DeepSeek-R1-Zero), then SFT was introduced to enhance readability and general usability.

2. Overstatement of the Role of "Thinking Tokens"

Jay’s claim: DeepSeek-R1 generates explicit “thinking tokens” that explain its chain of thought.

Technical Report: While the model does generate extended reasoning traces, there is no explicit mention of unique "thinking tokens" as a distinct training artifact. The reasoning process is structured but follows standard RL-guided CoT (Chain of Thought) reasoning.

Observation: The model does develop longer reasoning chains naturally, but the phrase "thinking tokens" may be an oversimplification of how the model structures its outputs (see the short sketch below).

3. Role of the Interim Reasoning Model

Jay’s claim: The interim reasoning model exists primarily to generate reasoning SFT data.

Technical Report: The interim reasoning model is a direct result of reinforcement learning with cold-start data, which was necessary to stabilize the RL.

Observation: While Jay correctly states that an interim model was used for generating reasoning SFT data, he does not emphasize that the cold-start data was introduced to mitigate instability in early RL training.
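
On point 2: the paper's prompt template does enclose the reasoning trace in <think> ... </think> tags, so "thinking tokens" most likely refers to the tokens emitted inside that span rather than to special vocabulary items. A minimal sketch of splitting such an output (the example response string is made up):

```python
import re

response = (
    "<think>The user asks for 12 * 7. 12 * 7 = 84.</think>\n"
    "12 multiplied by 7 is 84."
)

# Everything between <think> and </think> is the reasoning trace;
# the rest of the response is the final answer shown to the user.
match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

print("reasoning:", reasoning)
print("answer:", answer)
```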


Thank you very much for sharing. I translated this article into Chinese based on my own understanding; the link is https://zhuanlan.zhihu.com/p/21175143007


Great! Adding.
