Fine-Tune Gemma 4 with Ertas

    Google's April 2026 open-weight model family — the first Gemma generation released under Apache 2.0, spanning a dense 31B flagship, a 26B-A3.8B mixture-of-experts variant, and edge-optimized 4B and 2B models, all with native multimodal capabilities.

    2B (e2b) · 4B (e4b) · 26B-A3.8B · 31B · Google

    Overview

    Gemma 4, released April 2, 2026, is Google's most significant open-weight release to date and a major shift in licensing posture. Where prior Gemma generations shipped under the custom Gemma License (which included usage restrictions and disallowed certain applications), Gemma 4 is released under Apache 2.0 — the most permissive standard open-source license. This brings Gemma into licensing parity with Qwen, Mistral, and OLMo, and removes a major friction point for commercial integration.

    The family spans four sizes: a dense 31B flagship model targeting workstation and small-server deployment; a 26B-A3.8B mixture-of-experts variant designed for consumer GPU inference at large-model quality; a 4B effective-parameter (e4b) edge model; and a 2B effective-parameter (e2b) model targeting on-device deployment on phones and laptops. All four variants share a common multimodal architecture — text, images, and short-form audio inputs are supported across the entire family.

    Gemma 4 builds on Gemma 3's multilingual training (140+ languages) and 128K context window while substantially improving on reasoning, coding, and instruction following. The MoE variant in particular is positioned as Google's response to the Qwen 3 / DeepSeek V3 line of efficient MoE models — combining sparse activation efficiency with the engineering and safety work that distinguishes the Gemma series.

    Weights are available on Hugging Face under `google/gemma-4-31b`, `google/gemma-4-26b-moe`, `google/gemma-4-e4b`, and `google/gemma-4-e2b`. Quantized GGUF builds, MLX builds (for Apple Silicon), and ONNX exports are widely available, reflecting Google's investment in cross-platform deployment.
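
    If Gemma 4 follows the transformers loading pattern of earlier Gemma releases, pulling a checkpoint for local text generation looks roughly like the sketch below; the repo ID comes from the list above, and the exact class names and dtype handling at release may differ.

    ```python
    # Minimal sketch: text generation with the MoE variant via Hugging Face
    # transformers, assuming Gemma 4 reuses the Gemma 3 chat-template interface.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/gemma-4-26b-moe"  # repo ID from the listing above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # use a quantized build instead on 24GB GPUs
        device_map="auto",           # shard across whatever devices are available
    )

    messages = [{"role": "user", "content": "Summarize Apache 2.0 in one sentence."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
    ```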

    Key Features

    Apache 2.0 licensing is the headline change. For commercial users, it removes the usage-policy uncertainty that had limited Gemma 3 adoption in regulated industries and in the applications the prior license restricted. Gemma 4 weights, derivatives, and fine-tuned variants can be used commercially without the restrictive terms that set the Gemma License apart from standard open-weight releases.

    The 26B-A3.8B MoE variant is engineered specifically for consumer hardware deployment. With only 3.8B parameters active per token, inference speed is dominated by the active count — comparable to a 4B dense model — while the model's effective quality approaches the 31B dense variant on most benchmarks. This makes high-quality local inference practical on a single 24GB consumer GPU, which is the deployment sweet spot for self-hosted developer tools and on-premise applications.
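
    The trade is easy to quantify with back-of-envelope arithmetic (illustrative ratios, not benchmarks): per-token compute tracks active parameters, while resident memory tracks total parameters.

    ```python
    # Rough MoE serving arithmetic; ratios are illustrative, not measured.
    active_b, total_b, dense_b = 3.8, 26.0, 31.0

    # Per-token compute tracks ACTIVE parameters: ~8x less work than dense 31B.
    print(f"compute per token vs 31B dense: {active_b / dense_b:.2f}x")  # 0.12x

    # Resident weights track TOTAL parameters: every expert stays loaded.
    print(f"weights resident vs 4B dense: {total_b / 4.0:.1f}x")  # 6.5x
    ```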

    Native multimodal support across all four sizes is unusual. Most model families restrict multimodal capability to a flagship variant, leaving smaller models text-only. Gemma 4's e2b variant — only 2B effective parameters — accepts image input, making it the smallest credible multimodal open-weight model and unlocking on-device patterns like OCR, screen-reading assistants, and camera-based augmented reality applications that previously required server-side inference.
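
    Concretely, if Gemma 4 exposes the same processor and chat-template interface that transformers provides for Gemma 3's multimodal models, an e2b image-input call might look like the following sketch (class names and message schema are assumptions until the release lands):

    ```python
    # Image input on the e2b variant, assuming a Gemma 3-style multimodal API.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "google/gemma-4-e2b"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder
            {"type": "text", "text": "Read the total amount on this receipt."},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                           skip_special_tokens=True))
    ```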

    The 128K context window is consistent across the family, and Gemma 4 includes Google's standard safety stack: an updated ShieldGemma classifier, content-safety post-training, and PaliGemma-style structured output support for high-fidelity tool use. These additions make Gemma 4 particularly attractive for production deployments where safety review is part of the integration cycle.

    Fine-Tuning with Ertas

    Gemma 4's family of sizes covers nearly every fine-tuning scenario in Ertas Studio. The e2b and e4b edge models can be fine-tuned with QLoRA on consumer GPUs with 6-12GB VRAM, making them ideal for rapid iteration and small-scale specialization. The 26B-A3.8B MoE variant is particularly well-suited for fine-tuning given its low active parameter count — QLoRA fits comfortably on a 24GB GPU with full sequence lengths, training at speeds substantially faster than equivalently-sized dense models.

    The 31B dense flagship requires more memory for fine-tuning. QLoRA at typical sequence lengths (4K tokens) needs approximately 28-40GB VRAM, fitting on a single 48GB GPU or two 24GB GPUs with model parallelism. Full-parameter fine-tuning is impractical on single-GPU setups but supported in Ertas Studio's multi-GPU configurations.
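
    Ertas Studio drives these runs from its UI, but the underlying recipe is standard QLoRA. A minimal open-source equivalent using bitsandbytes, peft, and trl (assuming the checkpoints load as ordinary causal LMs; hyperparameters and the dataset path are placeholders) looks roughly like:

    ```python
    # QLoRA sketch with the open Hugging Face stack, not Ertas Studio's API.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    model_id = "google/gemma-4-e4b"
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                      # frozen base weights in 4-bit (QLoRA)
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto"
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=load_dataset("json", data_files="train.jsonl", split="train"),
        peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                               task_type="CAUSAL_LM"),
        args=SFTConfig(output_dir="gemma4-e4b-qlora",
                       per_device_train_batch_size=1,
                       gradient_accumulation_steps=8),
    )
    trainer.train()
    ```

    The NF4 base quantization is what brings an e4b run into the 6-12GB range cited above: the frozen weights sit at roughly 4 bits each, and only the small LoRA adapters train in higher precision.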

    For multimodal fine-tuning, Ertas Studio supports interleaved text and image training data formats native to Gemma 4. This is particularly valuable for domain adaptation in visual reasoning tasks — fine-tuning on annotated medical images, technical diagrams, retail product catalogs, or industry-specific document layouts. After training, models export to GGUF (with multimodal projector preservation) or MLX for Apple Silicon deployment, with single-click compatibility for Ollama, llama.cpp, and LM Studio.
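
    Ertas Studio's exact schema isn't documented here, but an interleaved text-and-image training record written as JSONL typically looks something like this (the field names are hypothetical):

    ```python
    # Hypothetical interleaved multimodal record; field names are assumptions,
    # not Ertas Studio's documented format.
    import json

    record = {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "path": "scans/invoice_0001.png"},
                {"type": "text", "text": "Extract the vendor name and invoice total."},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": "Vendor: Acme Corp. Total: $1,240.00"},
            ]},
        ]
    }
    with open("train.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    ```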

    Use Cases

    The 31B dense variant is positioned for high-quality on-premise deployment in enterprise applications: regulated industry chat assistants, internal knowledge retrieval, document analysis, and code assistance for engineering teams. The Apache 2.0 license combined with strong multilingual capabilities makes it a natural fit for companies that previously chose Llama or Mistral primarily for licensing reasons.

    The 26B-A3.8B MoE variant excels in cost-sensitive production serving. Customer support automation, content moderation pipelines, and document processing workflows all benefit from the 4B-class inference speed combined with quality competitive with the 31B dense model. For teams running self-hosted inference and watching token-cost economics, the MoE variant is often the right default choice.

    The e4b and e2b edge models target on-device deployment patterns: mobile chat assistants with privacy-by-design (no data leaving the device), browser-based AI tools, smart home device integration, and field deployment scenarios where connectivity is unreliable. The native multimodal support across these small sizes makes them particularly valuable for camera-based and screen-reading applications.

    Hardware Requirements

    The Gemma 4 e2b model at Q4_K_M quantization requires approximately 1.5GB of memory and runs on phones, laptops, and any GPU with 4GB+ VRAM. The e4b at Q4_K_M needs approximately 2.5GB, making it suitable for any modern consumer device.

    The 26B-A3.8B MoE variant requires loading all expert weights — approximately 16GB at Q4_K_M and 28GB at Q8_0. A 24GB consumer GPU such as the RTX 4090 (or the 32GB RTX 5090) is the deployment sweet spot. Inference speed is dominated by the 3.8B active parameter count, so token generation runs at approximately 4B-class speeds, making this variant unusually fast for its memory footprint.

    The dense 31B model at Q4_K_M needs approximately 18-20GB of VRAM, fitting on a single 24GB GPU with margin for context. At Q8_0, expect approximately 33GB. For fine-tuning in Ertas Studio: e2b/e4b need 6-12GB VRAM, the 26B-A3.8B MoE needs 20-24GB, and the 31B dense needs 28-40GB at typical training sequence lengths.
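
    These figures follow from simple arithmetic: weight memory is roughly parameter count times effective bits per weight, divided by eight, with KV cache and runtime overhead on top. A rough estimator (the bits-per-weight values are approximations, not measurements):

    ```python
    # Back-of-envelope weight-memory estimator. Effective bits per weight are
    # approximations; KV cache and runtime overhead are NOT included.
    BITS = {"Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

    def weight_gb(params_billion: float, quant: str) -> float:
        """GB needed just to hold the weights of a model of the given size."""
        return params_billion * BITS[quant] / 8

    print(f"e2b @ Q4_K_M: {weight_gb(2, 'Q4_K_M'):.1f} GB")   # ~1.2 GB
    print(f"MoE @ Q4_K_M: {weight_gb(26, 'Q4_K_M'):.1f} GB")  # ~15.6 GB
    print(f"31B @ Q4_K_M: {weight_gb(31, 'Q4_K_M'):.1f} GB")  # ~18.6 GB
    print(f"31B @ Q8_0:   {weight_gb(31, 'Q8_0'):.1f} GB")    # ~32.9 GB
    ```

    The Q4_K_M outputs land within about a gigabyte of the figures quoted above; the gap is the context and runtime overhead the estimator deliberately omits.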

    Supported Quantizations

    Q4_0 · Q4_K_M · Q5_K_M · Q6_K · Q8_0 · F16
