Summary

The Gemma 2 models range from 2B to 27B parameters and have an 8,192-token context window.

Compared to Gemma, Gemma 2 uses:

  • Interleaved local (sliding-window) and global attention, alternating between the two in successive layers
  • Logit soft-capping on the attention logits and the final logits

It also adopts some techniques that had become popular in other models:

  • Post-norm and pre-norm RMSNorm to normalise the input and output of each transformer sub-layer (the attention layer and the feedforward layer) and stabilise training, as sketched in the first code block after this list
  • GQA with num_groups = 2 (two query heads per key/value head) to speed up inference while maintaining downstream performance (second sketch below)
  • Knowledge distillation, training the smaller models to match the token-level probability distribution of a larger teacher rather than one-hot next-token targets (third sketch below)
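
Below is a minimal PyTorch sketch of that pre-norm plus post-norm ("sandwich") arrangement. The `RMSNorm` and `SandwichBlock` names, dimensions, and placeholder sub-layers are illustrative assumptions, not Gemma 2's actual code.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no mean subtraction."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SandwichBlock(nn.Module):
    """Transformer block with pre-norm AND post-norm around each sub-layer."""

    def __init__(self, dim: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.pre_attn, self.post_attn = RMSNorm(dim), RMSNorm(dim)
        self.pre_ffn, self.post_ffn = RMSNorm(dim), RMSNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-layer: normalise its input, then normalise its output
        # again before the residual addition.
        x = x + self.post_attn(self.attn(self.pre_attn(x)))
        # Feedforward sub-layer: same double normalisation.
        x = x + self.post_ffn(self.ffn(self.pre_ffn(x)))
        return x


# Usage with placeholder sub-layers (any shape-preserving modules work):
block = SandwichBlock(dim=64, attn=nn.Identity(), ffn=nn.Identity())
out = block(torch.randn(2, 10, 64))  # -> (2, 10, 64)
```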
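
Next, a sketch of grouped-query attention. The head counts are assumptions chosen so that two query heads share each key/value head, matching the two-per-group ratio described above; Gemma 2's real head counts vary by model size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    """GQA: groups of query heads share one key/value head, shrinking the KV cache."""

    def __init__(self, dim: int, n_heads: int = 8, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # Fewer K/V heads than query heads: one per group, not one per head.
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Broadcast each shared K/V head to the query heads in its group.
        group = self.n_heads // self.n_kv_heads  # 2 query heads per K/V head
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


# Usage: 8 query heads sharing 4 K/V heads.
attn = GroupedQueryAttention(dim=64)
out = attn(torch.randn(2, 10, 64))  # -> (2, 10, 64)
```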
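
Finally, a sketch of a token-level distillation objective of the kind the last bullet describes: the student is trained to match the teacher's full next-token distribution rather than a one-hot target. The function name, temperature parameter, and tensor shapes are illustrative assumptions, not Gemma 2's training code.

```python
import torch
import torch.nn.functional as F


def distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq_len, vocab)
    teacher_logits: torch.Tensor,  # (batch, seq_len, vocab)
    temperature: float = 1.0,
) -> torch.Tensor:
    """KL divergence from the teacher's per-token distribution to the student's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Flatten (batch, seq) so "batchmean" averages over every token position.
    return F.kl_div(
        student_log_probs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    ) * (t * t)


# Usage with random logits standing in for real model outputs.
student = torch.randn(2, 10, 256)
teacher = torch.randn(2, 10, 256)
loss = distillation_loss(student, teacher)
```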