Fine-Tune Gemma 3 with Ertas

    Google's latest open-weight model family built on Gemini technology, available in 1B, 4B, 12B, and 27B sizes, with native multimodal vision-language capabilities in the 4B and larger models and a context window of up to 128K tokens.


    Overview

    Gemma 3 is Google's third-generation open-weight model family, released in March 2025. Built on the same research and technology that powers Google's Gemini models, Gemma 3 delivers state-of-the-art performance across its size range. The family includes four sizes — 1B, 4B, 12B, and 27B parameters — with the 4B and larger models supporting native multimodal inputs (text and images).

    The 27B flagship model is particularly notable, matching or exceeding Llama 3 70B on many benchmarks despite having fewer than half the parameters. This efficiency comes from architectural innovations, training-data quality, and Google's extensive experience with transformer optimization. The 4B, 12B, and 27B models support a 128K-token context window; the 1B model supports 32K tokens.

    Gemma 3 uses a dense transformer architecture with several Google-specific innovations, including QK-norm in the attention layers (replacing the logit soft-capping used in Gemma 2) for training stability, interleaved local and global attention layers for efficient long-context processing, and a SentencePiece tokenizer with a 262K-entry vocabulary. The large vocabulary provides exceptional tokenization efficiency across languages.

    All models are released under the Gemma license, which permits commercial use with lightweight responsible-use restrictions. Google provides optimized versions for multiple frameworks including JAX, PyTorch, and Keras, and the models are well-supported by the broader ecosystem including Ollama, llama.cpp, and LM Studio.

    Key Features

    Native multimodal capability is a standout feature of Gemma 3 (available in 4B, 12B, and 27B sizes). The models can process interleaved text and image inputs, enabling visual question answering, image-based reasoning, chart and document understanding, and multimodal content generation. This is powered by a SigLIP vision encoder integrated directly into the model architecture, not bolted on as an afterthought.
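
    Because Ollama's Gemma 3 builds (4B and up) expose this vision capability, a quick way to try it is the ollama Python client. A minimal sketch, assuming Ollama is running locally with the gemma3:12b model pulled; the image path is a placeholder:

    ```python
    # Ask a locally served Gemma 3 12B about an image via the ollama client.
    # Assumes: `pip install ollama` and `ollama pull gemma3:12b` have been run.
    import ollama

    response = ollama.chat(
        model="gemma3:12b",
        messages=[{
            "role": "user",
            "content": "Describe any visible defects in this photo.",
            "images": ["unit_0041.jpg"],  # placeholder path to a local image
        }],
    )
    print(response["message"]["content"])
    ```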

    The interleaved local-global attention mechanism is an architectural innovation that alternates between local sliding-window attention (a 1024-token window, for efficient processing of nearby context) and global full attention (for capturing long-range dependencies), at a ratio of five local layers per global layer. This hybrid approach achieves near-full-attention quality while significantly reducing the computational and memory cost of processing long sequences.
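
    To make the mechanism concrete, here is a minimal sketch of the two mask types, assuming the 1024-token sliding window; it only builds boolean attention masks, not a full attention implementation:

    ```python
    import torch

    def local_causal_mask(seq_len: int, window: int = 1024) -> torch.Tensor:
        # True where query position i may attend to key position j:
        # causal (j <= i) and within the sliding window (i - j < window).
        i = torch.arange(seq_len).unsqueeze(1)
        j = torch.arange(seq_len).unsqueeze(0)
        return (j <= i) & (i - j < window)

    def global_causal_mask(seq_len: int) -> torch.Tensor:
        # Standard causal mask used by the global layers.
        i = torch.arange(seq_len).unsqueeze(1)
        j = torch.arange(seq_len).unsqueeze(0)
        return j <= i

    # Interleaving pattern: five local layers for every global layer.
    LAYER_PATTERN = ["local"] * 5 + ["global"]
    ```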

    Gemma 3 demonstrates particularly strong performance on instruction following, safety alignment, and factual accuracy. Google's training process includes extensive reinforcement learning from human feedback and carefully designed safety evaluations, producing models that are well-calibrated and resistant to common jailbreak techniques while remaining helpful.

    Fine-Tuning with Ertas

    Gemma 3 models are excellent candidates for fine-tuning in Ertas Studio. The 1B model can be fine-tuned with standard (16-bit) LoRA on GPUs with just 4-6GB VRAM; with QLoRA, the 4B model requires 8-10GB, the 12B needs 12-16GB, and the 27B requires 16-24GB at 4-bit. The 27B model's ability to match 70B-class quality makes it exceptional value for fine-tuning: you get near-frontier performance at a very manageable training cost.
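
    For readers curious what a 4-bit QLoRA setup looks like under the hood, here is a sketch using Hugging Face transformers, bitsandbytes, and peft. Ertas Studio configures this automatically; the hyperparameters below (rank 16, the chosen target modules, and so on) are illustrative defaults, not the platform's actual settings:

    ```python
    # Minimal QLoRA sketch: 4-bit base weights + small trainable LoRA adapters.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # base weights quantized to 4-bit NF4
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    )

    # Text-only tuning path; multimodal checkpoints may instead need
    # Gemma3ForConditionalGeneration.
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-3-27b-it",
        quantization_config=bnb_config,
        device_map="auto",
    )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the LoRA adapters train
    ```

    Freezing the 4-bit base weights and training only the low-rank adapters is what brings the 27B model into the 16-24GB budget quoted above.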

    For multimodal fine-tuning, Ertas Studio supports image-text datasets with Gemma 3. Upload paired image-text examples, and the platform handles the vision encoder integration and data preprocessing. This enables creating custom visual AI models — for instance, a fine-tuned Gemma 3 12B that can identify specific product defects from images, read specialized medical imagery, or process industry-specific document formats.
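
    The exact upload schema is defined by Ertas Studio; as a rough illustration, paired image-text examples typically look something like this (field names hypothetical):

    ```python
    # Hypothetical paired image-text fine-tuning examples (schema illustrative).
    examples = [
        {
            "image": "defects/unit_0041.jpg",
            "prompt": "Inspect this unit and report any visible defects.",
            "response": "Hairline crack along the upper-left weld seam; reject.",
        },
        {
            "image": "defects/unit_0042.jpg",
            "prompt": "Inspect this unit and report any visible defects.",
            "response": "No visible defects; pass.",
        },
    ]
    ```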

    After training, Ertas Studio exports to GGUF format with optimized quantization. Gemma 3 27B at Q4_K_M produces a model of approximately 16GB that runs well on consumer hardware, delivering quality that would typically require a 40GB+ model file from other families. Deploy through Ollama or llama.cpp for immediate local inference.
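
    Once the exported GGUF file is imported into Ollama, querying it from code takes only a few lines. A sketch using the ollama Python client; the model name my-gemma3-27b is a placeholder for whatever name you register the export under:

    ```python
    # Query the fine-tuned export served by a local Ollama instance.
    import ollama

    response = ollama.chat(
        model="my-gemma3-27b",  # placeholder: your imported GGUF model name
        messages=[{"role": "user", "content": "Summarize our returns policy."}],
    )
    print(response["message"]["content"])
    ```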

    Use Cases

    Gemma 3 is ideal for applications requiring a balance of high quality and moderate resource usage. The 27B model is particularly well-positioned for organizations that want near-frontier quality without the cost of running 70B+ models. It excels at complex instruction following, analytical writing, code generation, and multi-step reasoning tasks.

    The multimodal capabilities open up rich application possibilities: document processing pipelines that can read and reason about forms, invoices, and contracts; visual QA systems for accessibility; product catalog enrichment from images; and automated quality inspection in manufacturing. Fine-tuned Gemma 3 4B or 12B models offer an excellent cost-quality tradeoff for domain-specific vision tasks.

    The 1B model serves as a fast, efficient option for simple tasks: text classification, entity extraction, sentiment analysis, and basic question answering. It runs on virtually any hardware and can handle high-throughput workloads cost-effectively.

    Hardware Requirements

    Gemma 3 1B at Q4_K_M requires approximately 800MB of RAM, suitable for edge devices and mobile deployment. The 4B model needs about 2.5GB, the 12B about 7.5GB, and the 27B about 16GB at Q4_K_M. At Q8_0, the 27B model requires approximately 29GB, fitting on a single A6000 48GB or systems with 32GB+ RAM.

    Full FP16 inference for the 27B model requires approximately 54GB VRAM, suitable for an A100 80GB or a dual-A6000 setup. At Q4_K_M, the model fits comfortably on a consumer RTX 4090 24GB, and it even runs well on M-series MacBooks with 32GB of unified memory, at around 15-25 tokens per second.
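
    These figures follow from a back-of-envelope rule: file size in bytes is roughly parameters times bits per weight divided by 8. A quick sketch; the bits-per-weight values are approximate averages for each scheme, and actual VRAM use adds headroom for the KV cache and activations:

    ```python
    # Rough model-size estimates: GB ≈ params (billions) × bits_per_weight / 8.
    def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * bits_per_weight / 8  # 1e9 params and 1e9 bytes cancel

    print(f"27B @ Q4_K_M: {model_size_gb(27, 4.85):.1f} GB")  # ~16 GB
    print(f"27B @ Q8_0:   {model_size_gb(27, 8.5):.1f} GB")   # ~29 GB
    print(f"27B @ FP16:   {model_size_gb(27, 16):.1f} GB")    # ~54 GB
    ```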

    For fine-tuning in Ertas Studio, the 27B model with QLoRA needs 16-24GB VRAM (single RTX 4090 or A5000), while the 12B needs 12-16GB and the 4B needs 8-10GB. The smaller models allow rapid iteration on consumer hardware before scaling to the 27B for production quality.

    Supported Quantizations

    Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
