Thank you for the great visualization!
Would it be okay for me to translate it and share it with the Korean community?
I’ve translated several of Maarten’s posts in the past.
Thank you!
Of course, go right ahead. Send me the link after; I'd love to link to it.
Hi!
Here is the first draft:
https://tulip-phalange-a1e.notion.site/DeepSeek-R1-189c32470be2801c94b6e5648735447d?pvs=4
I will polish and update it as you add more detail.
Thank you~
Wonderful! Thanks! Added.
Thanks! It was received well in the Korean AI community as well.
Will update the post as you add more details :)
Thank you for sharing! I've translated it to Turkish because I think it's a great read for both the ML community and other enthusiasts: https://gist.github.com/gsamil/0a5ca3bf44e979151e6c5d33345ede16
Brilliant! Thank you! Added.
Many thanks for your great sharing. Here is an issue: according to the R1 paper, the model that goes through Supervised Fine-Tuning (SFT) should be the interim reasoning model (in your context), not the original DeepSeek-V3-Base model.
The narration "We fine-tune DeepSeek-V3-Base for two epochs..." surely causes some confusion.
Here is the evidence: "When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round." Here SUBSEQUENT implies the SFT follows the RL, which means the authors use the output model of the first round as the input. "To further align the model with human preferences, we implement a secondary reinforcement learning stage." SECONDARY also indicates the first RL is the primary stage.
Most importantly, if the SFT were conducted on V3-Base, the final R1 would certainly lose the reasoning quality of R1-Zero.
So I think the four steps (cold start, reasoning-oriented reinforcement learning, SFT, and all-scenarios RL) are a sequential process, as sketched below.
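To make the ordering concrete, here is a rough sketch of how I read the pipeline. The function names are placeholders I made up for the four stages, not anything from the paper or from an actual codebase:

```python
# Rough sketch of the R1 pipeline as I read the paper. The stage functions
# are empty placeholders; only the order of the calls is the point here.

def sft(model, data):                # supervised fine-tuning (placeholder)
    return f"SFT({model})"

def reasoning_oriented_rl(model):    # reasoning-oriented RL (placeholder)
    return f"ReasoningRL({model})"

def all_scenarios_rl(model):         # final all-scenarios RL (placeholder)
    return f"AllScenariosRL({model})"

def collect_sft_data(model):         # ~600K reasoning + 200K non-reasoning examples
    return f"data_sampled_from({model})"

# Step 1: cold start -- SFT on a small set of long-CoT examples
interim = sft("DeepSeek-V3-Base", "cold_start_data")

# Step 2: reasoning-oriented RL until convergence
interim = reasoning_oriented_rl(interim)

# Step 3: SFT on the ~800K curated examples. My reading: this starts from
# `interim`; the paper's wording "We fine-tune DeepSeek-V3-Base..." reads
# as if it starts from the base model instead.
sft_model = sft(interim, collect_sft_data(interim))

# Step 4: secondary, all-scenarios RL to align with human preferences
r1 = all_scenarios_rl(sft_model)
print(r1)
```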
Great point! It's not how I first read it, but it's totally plausible now that you mention it!
This question puzzles me a lot. Look at this reproduction: https://github.com/huggingface/open-r1. Their process is just like the author said, fine-tuning the base model.
As always, thank you Jay for such amazing material. It might be helpful to mention that the 200K examples in your last figure are non-reasoning data points generated with DeepSeek-V3. I had to go back to the paper to remember where that number came from.
In total, 600K CoT examples + 200K non-reasoning examples = 800K SFT examples.
Loved it Jay
Thank you, Jay, for this amazing information.
Thank you, Jay, for explaining it in detail. It helps a lot.
Great read.
Great read, Jay! Thank you for putting this together
Great article! Thank you Jay
Been reading and going through the DeepSeek V3, and R1 papers, and this came just in time!
"DeepSeek-R1 generates one token at a time" , yes, but with a twist. DS-R1 ( via DS-V3 ) has multi-token prediction training objective.
The article provides an excellent high-level summary but simplifies some technical nuances. While it is useful for non-expert readers, the DeepSeek-R1 technical report clarifies several details:
1. Misrepresentation of SFT and RL Training Order
Jay’s claim: The model follows a standard LLM pipeline, with SFT happening first, then RL.
Technical Report: DeepSeek-R1-Zero is trained without any SFT, relying entirely on RL at first. SFT is introduced later in DeepSeek-R1’s development, after an initial RL phase, to enhance usability and align with human preferences.
Observation: While Jay acknowledges RL’s role, he presents SFT as preceding RL, whereas the official report states RL was first applied purely (DeepSeek-R1-Zero), then SFT was introduced to enhance readability and general usability.
2. Overstatement of the Role of "Thinking Tokens"
Jay’s claim: DeepSeek-R1 generates explicit “thinking tokens” that explain its chain of thought.
Technical Report: While the model does generate extended reasoning traces, there is no explicit mention of unique "thinking tokens" as a distinct training artifact. The reasoning process is structured but follows standard RL-guided CoT (Chain of Thought) reasoning.
Observation: The model does develop longer reasoning chains naturally, but the phrase "thinking tokens" may be an oversimplification of how the model structures its outputs (see the example after this list).
3. Role of the Interim Reasoning Model
Jay’s claim: The interim reasoning model exists primarily to generate reasoning SFT data.
Technical Report: The interim reasoning model is a direct result of reinforcement learning with cold-start data, which was necessary to stabilize the RL.
Observation: While Jay correctly states that an interim model was used for generating reasoning SFT data, he does not emphasize that the cold-start data was introduced to mitigate instability in early RL training.
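On point 2, for concreteness, the report's training template asks the model to enclose its reasoning and its answer in tags, so a response looks roughly like the following (the problem and wording here are invented for illustration):

```
<think>
17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408
</think>
<answer>
408
</answer>
```

Whether these delimiter tags count as "thinking tokens" in their own right is largely a question of terminology.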
Thank you very much for sharing. I translated this article into Chinese according to my own understanding. The address is https://zhuanlan.zhihu.com/p/21175143007
Great! Adding.