Since releasing Gemma 4 two months ago, we have been continuously working to expand its capabilities. First, we launched Multi-Token Prediction (MTP) to speed up inference, and just a few days ago, we released a 12B model to bridge the gap between our E4B and 26B MoE models.
Today, we’re releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so you can run models locally on everyday edge devices and consumer GPUs.
By simulating quantization during training, QAT minimizes quality loss when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format as well as a novel quantization format specialized for mobile use cases. Using this mobile format, we’ve reduced the memory footprint of Gemma 4 E2B to 1GB. Together, these dramatically reduce memory requirements while preserving the capabilities and quality you expect from Gemma 4.
Preserving model quality while making models smaller
Quantization is a key technique for running models on consumer hardware: it reduces their memory footprint while also accelerating decode speed. However, standard Post-Training Quantization (PTQ) often leads to performance degradation. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ is already effective at preserving quality, our QAT results yield even higher overall quality compared to standard PTQ baselines.
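To make "simulating quantization during training" concrete, here is a minimal sketch of the forward-pass step usually called fake quantization: weights are rounded to a 4-bit grid and immediately mapped back to floats, so the rest of the network sees the rounding error it will face after compression. This is an illustration under stated assumptions, not the recipe used for the Gemma 4 checkpoints. The function name `fake_quantize_4bit`, the block size of 32, and the symmetric per-block scaling are all choices made for this example.

```python
import numpy as np


def fake_quantize_4bit(weights: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Round weights to a 4-bit grid per block, then map them back to floats.

    Hypothetical helper for illustration: symmetric scaling with one scale
    per block of `block_size` weights (an assumption, not Gemma's recipe).
    """
    if weights.size % block_size != 0:
        raise ValueError("weight count must be a multiple of block_size")

    blocks = weights.astype(np.float32).reshape(-1, block_size)

    # One scale per block, chosen so the largest magnitude maps to level 7.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)

    # Quantize to signed 4-bit integer levels, then dequantize right away.
    levels = np.clip(np.round(blocks / scale), -8, 7)
    return (levels * scale).reshape(weights.shape).astype(np.float32)


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
w_q = fake_quantize_4bit(w)
print("max absolute rounding error:", float(np.abs(w - w_q).max()))
```

In actual QAT, a step like this sits inside the training loop, and gradients are typically passed through the non-differentiable rounding with a straight-through estimator, which this NumPy sketch does not show. PTQ, by contrast, applies the rounding once after training, so the weights never get a chance to adapt to it.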
We applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a special mobile-specialized quantization schema.
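For a rough sense of why Q4_0 shrinks a model so much, the sketch below estimates weight storage from a parameter count. It assumes the Q4_0 block layout used by GGML and llama.cpp (32 weights per block, 4 bits each, plus one 16-bit scale per block, or 4.5 bits per weight). The 1-billion-parameter figure is a hypothetical example, not a Gemma 4 model size, and the estimate ignores activations, the KV cache, and any tensors kept at higher precision.

```python
import math


def q4_0_weight_bytes(num_params: int, block_size: int = 32) -> int:
    """Estimate storage for weights in a Q4_0-style layout.

    Assumes 4 bits per weight plus one 16-bit scale per block.
    """
    num_blocks = math.ceil(num_params / block_size)
    bits_per_block = block_size * 4 + 16  # 144 bits = 18 bytes when block_size is 32
    return num_blocks * bits_per_block // 8


def bf16_weight_bytes(num_params: int) -> int:
    """Storage for the same weights at 16 bits each."""
    return num_params * 2


params = 1_000_000_000  # hypothetical model size, for illustration only
q4 = q4_0_weight_bytes(params)
bf16 = bf16_weight_bytes(params)
print(f"Q4_0:  {q4 / 1e9:.2f} GB")
print(f"BF16:  {bf16 / 1e9:.2f} GB")
print(f"ratio: {bf16 / q4:.2f}x smaller")
```

The per-block scale is why the saving is a little under 4x rather than exactly 4x compared with 16-bit weights.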
Saving on VRAM and Storage
Below are the approximate memory requirements indicating how much VRAM is required to load the models:
