Summary
The Gemma 2 models come in 2B, 9B, and 27B parameter sizes and have an 8k-token (8192) context window.
Compared to Gemma, Gemma 2 adds (each sketched after this list):
- interleaved local sliding-window attention (SWA) and global (full scaled dot-product) attention layers
- logit soft-capping on the attention logits and the final logits
- model merging during post-training
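
The interleaving simply alternates the attention type from layer to layer: some layers restrict attention to a local sliding window (4096 tokens in the paper) while the others attend over the full context. A minimal sketch of the per-layer mask; which parity is local versus global here is an assumption for illustration, not the exact Gemma 2 layout:

```python
import torch

def attention_mask(layer_idx: int, seq_len: int, window: int = 4096) -> torch.Tensor:
    """Boolean causal mask for one layer: even layers use a local sliding
    window, odd layers attend globally (parity chosen here for illustration)."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]            # query i may attend to key j <= i
    if layer_idx % 2 == 0:                           # local sliding-window layer
        in_window = (pos[:, None] - pos[None, :]) < window
        return causal & in_window
    return causal                                    # global layer: full causal attention

# Toy sizes: a local layer only sees the previous `window` keys, a global layer sees all past keys.
local_mask = attention_mask(layer_idx=0, seq_len=16, window=4)
global_mask = attention_mask(layer_idx=1, seq_len=16, window=4)
```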
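
Logit soft-capping replaces a hard clamp with a scaled tanh, so values stay inside a fixed bound while gradients never vanish abruptly; the paper caps attention logits at 50.0 and final logits at 30.0. A minimal sketch (the toy shapes and vocab size are placeholders):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap): roughly identity near zero,
    # saturating at +/- cap for large magnitudes.
    return cap * torch.tanh(logits / cap)

attn_logits = soft_cap(torch.randn(2, 8, 128, 128) * 80, cap=50.0)   # attention logits
final_logits = soft_cap(torch.randn(2, 128, 32_000), cap=30.0)       # logits over a toy vocab
```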
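
Model merging combines the weights of several fine-tuned checkpoints into one model. The sketch below is only the simplest form (a uniform "model soup" average); Gemma 2's actual merging recipe is more involved and is not reproduced here:

```python
import torch

def merge_state_dicts(state_dicts):
    """Uniformly average parameters across checkpoints with identical architecture.
    This is an illustrative simplification, not the exact Gemma 2 procedure."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Usage: merged = merge_state_dicts([model_a.state_dict(), model_b.state_dict()])
```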
It also adopts techniques that had already become popular in other models (also sketched below):
- pre-norm and post-norm RMSNorm, normalising the input and output of each transformer sub-layer (the attention layer and the feedforward layer) to stabilise training
- grouped-query attention (GQA) with 2 groups, to increase speed at inference time while maintaining downstream performance
- knowledge distillation from a larger teacher model (for the smaller variants)
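
The pre-/post-norm layout means each sub-layer is normalised on the way in and again on the way out, before the residual add. A minimal PyTorch sketch, with linear layers standing in for the real attention and feedforward modules:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale features by the reciprocal of their root mean square.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class Block(nn.Module):
    """Pre-/post-norm transformer block: normalise input and output of each
    sub-layer before adding the residual. `attn` and `mlp` are placeholders."""

    def __init__(self, dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.pre_attn_norm, self.post_attn_norm = RMSNorm(dim), RMSNorm(dim)
        self.pre_mlp_norm, self.post_mlp_norm = RMSNorm(dim), RMSNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.post_attn_norm(self.attn(self.pre_attn_norm(x)))   # attention sub-layer
        x = x + self.post_mlp_norm(self.mlp(self.pre_mlp_norm(x)))      # feedforward sub-layer
        return x

# Toy usage with linear layers in place of real attention / feedforward modules.
block = Block(dim=64, attn=nn.Linear(64, 64), mlp=nn.Linear(64, 64))
out = block(torch.randn(2, 16, 64))
```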
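
GQA with 2 groups means there are half as many key/value heads as query heads, so each key/value head is shared by two query heads; this shrinks the KV cache and speeds up decoding. A minimal sketch with toy head counts (not the real Gemma 2 dimensions):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, groups: int = 2):
    """GQA: k/v have `groups` times fewer heads than q, so each k/v head is
    shared by `groups` query heads."""
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_q_heads // groups, seq, head_dim)
    k = k.repeat_interleave(groups, dim=1)   # expand k/v heads to line up with query heads
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

b, seq, head_dim = 2, 16, 64
q = torch.randn(b, 8, seq, head_dim)   # 8 query heads
k = torch.randn(b, 4, seq, head_dim)   # 4 k/v heads -> 2 query heads per k/v head
v = torch.randn(b, 4, seq, head_dim)
out = grouped_query_attention(q, k, v, groups=2)
```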
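
Knowledge distillation trains the student against the teacher's full next-token distribution rather than only the one-hot target. A minimal sketch of one common per-token loss (cross-entropy against teacher probabilities); the exact loss and temperature setup are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the student against the teacher's token distribution,
    averaged over batch and sequence. Shapes: (batch, seq, vocab)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    # - sum_x P_teacher(x) * log P_student(x)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

loss = distillation_loss(torch.randn(2, 16, 1000), torch.randn(2, 16, 1000))
```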