DeepSeek-R1 is a reasoning model built on DeepSeek-V3 that made two main contributions to the field of model training:

  • In post-training, they discovered that directly applying GRPO (introduced in DeepSeekMath) to the base model, without any SFT, enables the model to explore chain-of-thought (CoT) reasoning for solving complex problems. This discovery led to DeepSeek-R1-Zero.
  • In distillation, they discovered that performing SFT on smaller models using data generated by larger reasoning models yielded better results than applying the same reinforcement learning pipeline directly to those smaller models.
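The core idea behind GRPO mentioned above is to drop the learned value function and instead compute each sample's advantage relative to a group of responses drawn for the same prompt. A minimal sketch of that group-relative normalization (illustrative only; the function name and structure are assumptions, not DeepSeek's actual code):

```python
def grpo_advantages(rewards: list[float]) -> list[float]:
    """Compute group-relative advantages for one prompt's sampled responses.

    Each response's reward is normalized by the mean and standard deviation
    of all rewards in the group, replacing a learned critic/value model.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against zero std when all rewards are equal
    return [(r - mean) / std for r in rewards]


# Example: two sampled responses, one wrong (reward 0) and one correct (reward 1)
print(grpo_advantages([0.0, 1.0]))  # → [-1.0, 1.0]
```

These advantages then weight the policy-gradient update for each response's tokens, so responses better than their group's average are reinforced and worse ones are suppressed, with no separate value network to train.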

References