25 Comments
Michael Shin:

Thank you for the great visualization!

Would it be okay for me to translate it and share it with the Korean community?

I’ve translated several of Maarten’s posts in the past.

Thank you!

Jay Alammar:

Of course, go right ahead. Send me the link after; I'd love to link to it.

Michael Shin:

Hi!

Here is the first draft:

https://tulip-phalange-a1e.notion.site/DeepSeek-R1-189c32470be2801c94b6e5648735447d?pvs=4

I will polish and update it as you add more detail.

Thank you~

Jay Alammar:

Wonderful! Thanks! Added.

Michael Shin:

Thanks! It was well received in the Korean AI community as well.

Will update the post as you add more details :)

Abdullah Güser:

Thank you for sharing! I've translated it into Turkish because I think it's a great read for both the ML community and other enthusiasts: https://gist.github.com/gsamil/0a5ca3bf44e979151e6c5d33345ede16

Jay Alammar:

Brilliant! Thank you! Added.

Shawn:

Many thanks for the great write-up. Here is an issue: according to the R1 paper, the model that goes through Supervised Fine-Tuning (SFT) should be the interim reasoning model (in your terminology), not the original DeepSeek-V3-Base model.

The sentence "We fine-tune DeepSeek-V3-Base for two epochs..." certainly causes some confusion.

Here is the evidence: "When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round." Here SUBSEQUENT implies that the SFT follows the RL, which means the authors use the output model of the first round as the input. Also: "To further align the model with human preferences, we implement a secondary reinforcement learning stage." SECONDARY likewise indicates that the first RL run is the primary stage.

Most importantly, if the SFT were conducted on V3-Base, the final R1 would certainly lose the reasoning quality of R1-Zero.

So I think the four steps (cold start, reasoning-oriented reinforcement learning, SFT, and all-scenarios RL) form a sequential process.
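
To make the claimed ordering concrete, here is a minimal sketch of the four stages as read above; every function below is a stub placeholder, not code from the paper or any repository, and only the ordering and the model each stage starts from are the point.

```python
# Illustrative sketch of the sequential pipeline described in the comment
# above. All functions are hypothetical stubs standing in for training
# stages; nothing here is DeepSeek's actual code.

def supervised_finetune(model, data):      return f"SFT({model})"
def reasoning_rl(model):                   return f"ReasoningRL({model})"
def collect_sft_data(model):               return f"samples from {model}"
def all_scenarios_rl(model, preferences):  return f"AllScenariosRL({model})"

# Stage 1: cold-start SFT of V3-Base on a small long-CoT dataset.
cold_started = supervised_finetune("DeepSeek-V3-Base", "cold-start CoT data")

# Stage 2: reasoning-oriented RL -> the interim reasoning model.
interim = reasoning_rl(cold_started)

# Stage 3: SFT on the samples collected from the interim checkpoint,
# continuing from that checkpoint (Shawn's reading), not from V3-Base.
sft_model = supervised_finetune(interim, collect_sft_data(interim))

# Stage 4: secondary, all-scenarios RL for human-preference alignment.
deepseek_r1 = all_scenarios_rl(sft_model, "preference data")
print(deepseek_r1)
```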

Jay Alammar:

Great point! It's not how I first read it, but it's totally plausible now that you mention it!

Oscar Liu:

This question puzzles me a lot. Look at this reproduction: https://github.com/huggingface/open-r1. Their process is just like the author said: fine-tuning the base model.

Omar U. Florez:

As always, thank you Jay for such amazing material. It might be helpful to mention that the 200K examples in your last figure are non-reasoning data points generated with DeepSeek-V3. I had to go back to the paper to remember where that number came from.

In total, 600K CoT examples + 200K non-reasoning examples = 800K SFT examples.
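
As a quick tally of that mixture (counts as quoted from the paper in this thread; the variable names are only illustrative):

```python
# Tally of the SFT mixture described above; the counts are the ones quoted
# from the DeepSeek-R1 paper, the variable names are made up for clarity.
reasoning_cot_examples = 600_000   # CoT samples from the interim reasoning model
non_reasoning_examples = 200_000   # non-reasoning data generated with DeepSeek-V3
total_sft_examples = reasoning_cot_examples + non_reasoning_examples
assert total_sft_examples == 800_000
print(f"{total_sft_examples:,} SFT examples")
```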

Mahmoud:

Loved it Jay

Jaimin Mungalpara:

Thank you, Jay, for this amazing information.

Vipin Thazhissery:

Thank you, Jay, for explaining it in detail. It helps a lot.

Sankarshan:

Great read.

Shahriar Hooshmand:

Great read, Jay! Thank you for putting this together

Alex Razvant:

Great article! Thank you Jay

I've been reading through the DeepSeek-V3 and R1 papers, and this came just in time!

Mahmoud:

Well done, boss! God bless you.

Prashant:

Hi Jay

Great information. I have your book too. Question: can I use this information in my pptx (giving credit to you)?

Jay Alammar:

Yes, please do!

Prashant:

Thank you Jay Sir

Ranko Mosic:

"DeepSeek-R1 generates one token at a time" , yes, but with a twist. DS-R1 ( via DS-V3 ) has multi-token prediction training objective.

Atif Khan:

The article provides an excellent high-level summary but simplifies some technical nuances. While it is useful for non-expert readers, the DeepSeek-R1 technical report clarifies several details:

1. Misrepresentation of SFT and RL Training Order

Jay’s claim: The model follows a standard LLM pipeline, with SFT happening first, then RL.

Technical Report: DeepSeek-R1-Zero is trained without any SFT, relying entirely on RL at first. SFT is introduced later in DeepSeek-R1's development, after an initial RL phase, to enhance usability and align with human preferences.

Observation: While Jay acknowledges RL’s role, he presents SFT as preceding RL, whereas the official report states RL was first applied purely (DeepSeek-R1-Zero), then SFT was introduced to enhance readability and general usability.

2. Overstatement of the Role of "Thinking Tokens"

Jay’s claim: DeepSeek-R1 generates explicit “thinking tokens” that explain its chain of thought.

Technical Report: While the model does generate extended reasoning traces, there is no explicit mention of unique "thinking tokens" as a distinct training artifact. The reasoning process is structured but follows standard RL-guided CoT (Chain of Thought) reasoning.

Observation: The model does develop longer reasoning chains naturally, but the phrase "thinking tokens" may be an oversimplification of how the model structures its outputs.

3. Role of the Interim Reasoning Model

Jay’s claim: The interim reasoning model exists primarily to generate reasoning SFT data.

Technical Report: The interim reasoning model is a direct result of reinforcement learning with cold-start data, which was necessary to stabilize the RL.

Observation: While Jay correctly states that an interim model was used for generating reasoning SFT data, he does not emphasize that the cold-start data was introduced to mitigate instability in early RL training.
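
On point 2: the R1 report's training template does instruct the model to enclose its reasoning and final answer in explicit tags, which is likely what the article's "thinking tokens" refers to. Here is a minimal parsing sketch, assuming that <think>/<answer> tag format; the sample string and helper function are made up for illustration.

```python
# Minimal sketch: splitting a tagged response into reasoning trace and answer,
# assuming the <think>/<answer> format described in the R1 report's template.
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a tagged response."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else response.strip(),
    )

sample = "<think>2 + 2 is 4, then doubled is 8.</think><answer>8</answer>"
reasoning, answer = split_reasoning(sample)
print(reasoning)  # -> 2 + 2 is 4, then doubled is 8.
print(answer)     # -> 8
```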

Real:

Thank you very much for sharing. I translated this article into Chinese according to my own understanding. The address is https://zhuanlan.zhihu.com/p/21175143007