(1)DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and
(2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of
long Chain-of-Thought (CoT) examples.
Distill the reasoning capability from DeepSeek-R1 to small dense models.