Summary
The Gemma 2 models come in 2B, 9B, and 27B parameter sizes and have an 8k-token (8192) context window.
Compared to Gemma, Gemma 2 adds (each sketched after this list):
- interleaved local sliding-window attention (SWA) and global (full scaled dot-product) attention layers
- logit soft-capping on the attention logits and the final logits
- model merging during post-training
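
The interleaving simply alternates the attention type from layer to layer: some layers restrict attention to a local sliding window (4096 tokens in the paper) while the others attend over the full context. A minimal sketch of the per-layer mask; which parity is local versus global here is an assumption for illustration, not the exact Gemma 2 layout:

```python
import torch

def attention_mask(layer_idx: int, seq_len: int, window: int = 4096) -> torch.Tensor:
    """Boolean causal mask for one layer: even layers use a local sliding
    window, odd layers attend globally (parity chosen here for illustration)."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]            # query i may attend to key j <= i
    if layer_idx % 2 == 0:                           # local sliding-window layer
        in_window = (pos[:, None] - pos[None, :]) < window
        return causal & in_window
    return causal                                    # global layer: full causal attention

# Toy sizes: a local layer only sees the previous `window` keys, a global layer sees all past keys.
local_mask = attention_mask(layer_idx=0, seq_len=16, window=4)
global_mask = attention_mask(layer_idx=1, seq_len=16, window=4)
```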
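
Logit soft-capping replaces a hard clamp with a scaled tanh, so values stay inside a fixed bound while gradients never vanish abruptly; the paper caps attention logits at 50.0 and final logits at 30.0. A minimal sketch (the toy shapes and vocab size are placeholders):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap): roughly identity near zero,
    # saturating at +/- cap for large magnitudes.
    return cap * torch.tanh(logits / cap)

attn_logits = soft_cap(torch.randn(2, 8, 128, 128) * 80, cap=50.0)   # attention logits
final_logits = soft_cap(torch.randn(2, 128, 32_000), cap=30.0)       # logits over a toy vocab
```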
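
Model merging combines the weights of several fine-tuned checkpoints into one model. The sketch below is only the simplest form (a uniform "model soup" average); Gemma 2's actual merging recipe is more involved and is not reproduced here:

```python
import torch

def merge_state_dicts(state_dicts):
    """Uniformly average parameters across checkpoints with identical architecture.
    This is an illustrative simplification, not the exact Gemma 2 procedure."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Usage: merged = merge_state_dicts([model_a.state_dict(), model_b.state_dict()])
```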
It also adopts techniques that had already become popular in other models (also sketched below):
- pre-norm and post-norm RMSNorm, normalising the input and output of each transformer sub-layer (the attention layer and the feedforward layer) to stabilise training
- grouped-query attention (GQA) with 2 groups, to increase speed at inference time while maintaining downstream performance
- knowledge distillation from a larger teacher model (for the smaller variants)
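
The pre-/post-norm layout means each sub-layer is normalised on the way in and again on the way out, before the residual add. A minimal PyTorch sketch, with linear layers standing in for the real attention and feedforward modules:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale features by the reciprocal of their root mean square.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class Block(nn.Module):
    """Pre-/post-norm transformer block: normalise input and output of each
    sub-layer before adding the residual. `attn` and `mlp` are placeholders."""

    def __init__(self, dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.pre_attn_norm, self.post_attn_norm = RMSNorm(dim), RMSNorm(dim)
        self.pre_mlp_norm, self.post_mlp_norm = RMSNorm(dim), RMSNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.post_attn_norm(self.attn(self.pre_attn_norm(x)))   # attention sub-layer
        x = x + self.post_mlp_norm(self.mlp(self.pre_mlp_norm(x)))      # feedforward sub-layer
        return x

# Toy usage with linear layers in place of real attention / feedforward modules.
block = Block(dim=64, attn=nn.Linear(64, 64), mlp=nn.Linear(64, 64))
out = block(torch.randn(2, 16, 64))
```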
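
GQA with 2 groups means there are half as many key/value heads as query heads, so each key/value head is shared by two query heads; this shrinks the KV cache and speeds up decoding. A minimal sketch with toy head counts (not the real Gemma 2 dimensions):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, groups: int = 2):
    """GQA: k/v have `groups` times fewer heads than q, so each k/v head is
    shared by `groups` query heads."""
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_q_heads // groups, seq, head_dim)
    k = k.repeat_interleave(groups, dim=1)   # expand k/v heads to line up with query heads
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

b, seq, head_dim = 2, 16, 64
q = torch.randn(b, 8, seq, head_dim)   # 8 query heads
k = torch.randn(b, 4, seq, head_dim)   # 4 k/v heads -> 2 query heads per k/v head
v = torch.randn(b, 4, seq, head_dim)
out = grouped_query_attention(q, k, v, groups=2)
```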
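
Knowledge distillation trains the student against the teacher's full next-token distribution rather than only the one-hot target. A minimal sketch of one common per-token loss (cross-entropy against teacher probabilities); the exact loss and temperature setup are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the student against the teacher's token distribution,
    averaged over batch and sequence. Shapes: (batch, seq, vocab)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    # - sum_x P_teacher(x) * log P_student(x)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

loss = distillation_loss(torch.randn(2, 16, 1000), torch.randn(2, 16, 1000))
```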