DeepSeek-R1

We present:

(1) DeepSeek-R1-Zero, which applies reinforcement learning (RL) directly to the base model without any supervised fine-tuning (SFT) data; (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples; and (3) distillation of the reasoning capability from DeepSeek-R1 into small dense models.