Megatron-LM Update

This paper builds on the tensor-parallel scheme introduced by Megatron-LM by adding two techniques: sequence parallelism and selective activation recomputation.
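To make the effect of sequence parallelism and selective activation recomputation concrete, here is a rough sketch of the per-layer activation memory estimates as I recall them from the referenced paper. It assumes 16-bit activations, sequence length `s`, micro-batch size `b`, hidden size `h`, `a` attention heads, and tensor-parallel size `t`. The function name and its flags are hypothetical and are not part of Megatron-LM's API.

```python
def activation_bytes_per_layer(s, b, h, a, t=1,
                               sequence_parallel=False,
                               selective_recompute=False):
    """Approximate activation memory in bytes for one transformer layer.

    Hypothetical helper for illustration only; the constants follow the
    per-layer estimates in the referenced paper (16-bit activations).
    """
    # Attention-core activations (softmax output, dropout mask, etc.) grow
    # with s^2; selective recomputation drops them and recomputes on backward.
    attention_term = 0.0 if selective_recompute else 5 * a * s / h

    if sequence_parallel:
        # Layer norms and dropouts are also split along the sequence
        # dimension, so every term is divided by t.
        per_element = (34 + attention_term) / t
    else:
        # Tensor parallelism alone leaves a replicated 10*s*b*h on each rank.
        per_element = 10 + (24 + attention_term) / t

    return s * b * h * per_element


# Example shape (assumed for illustration): s=2048, b=1, h=12288, a=96, t=8.
if __name__ == "__main__":
    shape = dict(s=2048, b=1, h=12288, a=96)
    for sp in (False, True):
        for sel in (False, True):
            n = activation_bytes_per_layer(t=8, sequence_parallel=sp,
                                           selective_recompute=sel, **shape)
            print(f"sequence_parallel={sp!s:5} selective={sel!s:5} "
                  f"{n / 2**20:.1f} MiB")
```

With these example values the attention term `5as/h` equals 80, so it dominates the constant 34. That is why recomputing only the attention core recovers most of the memory at a small compute cost.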

References

Reducing Activation Recomputation in Large Transformer Models
