The strongest open-weight models that fit in under 10GB of VRAM at standard Q4_K_M quantization — for laptop GPUs, RTX 3060 12GB and RTX 4060 8GB cards, and any deployment where memory is the binding constraint.
By Hardware · Updated 2026-04-30 · 5 picks
Introduction
Under-10GB VRAM is the practical sweet spot for laptop deployment, consumer GPUs (RTX 3060 12GB, RTX 4060 8GB, gaming laptops), and embedded systems where memory is the binding constraint. The 2025-2026 generation of small models has become substantially more capable than their predecessors — a 7-14B model in 2026 can handle workloads that required 30B+ models a year earlier, thanks to better training data, more efficient architectures, and improved quantization techniques.
This ranking covers models that fit in under 10GB of VRAM at standard Q4_K_M quantization (the same models land near 8GB at Q3_K_M for those even more constrained). We weight three factors: capability at the parameter scale, ecosystem maturity for consumer/laptop deployment, and licensing for commercial use. The sketch below shows how the quoted footprints are estimated.
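As a rough guide to the footprint math: Q4_K_M averages roughly 4.85 bits per weight (it mixes 4- and 6-bit blocks), so weight size is close to parameter count × bits / 8. The bits-per-weight values below are approximations, and real GGUF files vary by a few percent with architecture:

```python
# Back-of-envelope GGUF weight footprints. Bits-per-weight values are
# approximate averages; real file sizes vary a few percent by architecture.
BPW = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}

def weight_gb(params_billions: float, quant: str = "Q4_K_M") -> float:
    """Approximate VRAM for the weights alone -- KV cache comes on top."""
    return params_billions * BPW[quant] / 8

for name, b in [("Phi-4 14B", 14.0), ("Qwen 3 8B", 8.0)]:
    print(f"{name}: ~{weight_gb(b):.1f} GB at Q4_K_M")
```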
1. Phi-4
Microsoft's Phi-4 (14B dense) at Q4_K_M is approximately 8.5GB — fitting comfortably under the 10GB threshold while delivering exceptional capability per parameter. Phi-4 was specifically engineered to punch above its weight class through careful curation of synthetic training data, and it competes with much larger general-purpose models on math, code, and reasoning benchmarks. MIT licensing makes it the strongest commercially-deployable choice in this VRAM tier. A minimal launch sketch follows the trade-offs.
Strengths
MIT license — fully commercially permissive
14B parameters at ~8.5GB Q4_K_M leaves headroom for context
Strong math and code reasoning for parameter count
Phi-4-mini (3.8B) and Phi-4-multimodal (5.6B) variants for tighter constraints
Trade-offs
Heavy reliance on synthetic training data can introduce artifacts in informal, conversational language
Behind larger models on broad multilingual capability
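Getting Phi-4 running locally takes a few lines once a runtime is installed. A minimal sketch with the Ollama Python client (pip install ollama); the phi4 model tag is an assumption, so verify it against ollama list:

```python
# Minimal local chat with Phi-4 via the Ollama Python client.
# The "phi4" tag is an assumption -- substitute the tag `ollama list` shows.
import ollama

response = ollama.chat(
    model="phi4",
    messages=[{"role": "user", "content": "Factor x^2 - 5x + 6."}],
)
print(response["message"]["content"])
```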
2. Llama 3 8B
Llama 3 8B at Q4_K_M is approximately 4.5GB — leaving substantial headroom for context and KV cache even on 6-8GB cards; the arithmetic behind that headroom is sketched after the trade-offs. The mature ecosystem of community fine-tunes, deployment guides, and integrations makes it the lowest-friction path to a working local LLM under 10GB. For most laptop and entry-GPU deployments, Llama 3 8B is the workhorse choice that handles general chat, summarization, and basic code completion reliably.
Strengths
4.5GB at Q4_K_M leaves headroom on 6-8GB GPUs
Massive ecosystem of community fine-tunes
Mature deployment across Ollama, llama.cpp, vLLM
Llama Guard 3 safety classifier available as companion
Trade-offs
Llama Community License usage caps and attribution requirements
Behind 2026 frontier 8B-class models on capability
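The headroom claim is easy to sanity-check: Llama 3 8B has 32 layers of grouped-query attention with 8 KV heads of dimension 128, so an fp16 KV cache costs a fixed amount per token. A back-of-envelope sketch:

```python
# KV-cache cost for Llama 3 8B: 32 layers, GQA with 8 KV heads, head dim 128.
# Factor of 2 for K and V; 2 bytes per value at fp16.
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes   # 131072 bytes
ctx = 8192
print(f"per token:  {per_token // 1024} KiB")               # 128 KiB
print(f"at {ctx} ctx: {per_token * ctx / 2**30:.2f} GiB")   # ~1.00 GiB
# ~4.5 GB of weights + ~1 GiB of cache still fits an 8 GB card.
```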
3. Gemma 4 e2b/e4b
Gemma 4's edge variants (e4b ~2.5GB at Q4_K_M, e2b ~1.5GB at Q4_K_M) are exceptional small-VRAM options. The new Apache 2.0 licensing combined with native multimodal support across both variants makes them uniquely capable in this size class. For deployments under 4GB VRAM (where Llama 3 8B and Phi-4 don't fit), Gemma 4 e2b/e4b are the strongest choices available — particularly when image input is a requirement; a short image-input sketch follows the trade-offs.
Strengths
e2b at 1.5GB fits on integrated graphics and 4GB+ GPUs
Native image input — the only credible multimodal option at this size
Apache 2.0 license (new in Gemma 4)
Strong MLX/llama.cpp deployment support
Trade-offs
Below 4GB scale, capability is genuinely limited vs. larger models
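Image input through the Ollama Python client looks like the sketch below. The gemma4:e2b tag is an assumption about how the release will be packaged; substitute whatever tag your runtime actually exposes:

```python
# Image + text prompt via the Ollama Python client. "gemma4:e2b" is an
# assumed tag -- check `ollama list` for the real one on your machine.
import ollama

response = ollama.chat(
    model="gemma4:e2b",
    messages=[{
        "role": "user",
        "content": "Describe what is in this photo.",
        "images": ["./photo.png"],  # local path; the client encodes it
    }],
)
print(response["message"]["content"])
```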
4. Qwen 3 4B/8B
Qwen 3's smaller variants (4B at ~2.5GB, 8B at ~5GB Q4_K_M) cover the under-10GB tier cleanly. Apache 2.0 licensing combined with broad 119-language multilingual coverage makes Qwen 3 the strongest small-VRAM choice for international deployments. The hybrid thinking mode at 4B+ adds reasoning capability that vanilla 4B-class models lack; a sketch of toggling it follows the trade-offs. For deployments serving non-English users on consumer hardware, Qwen 3 is often the better pick than Llama 3 8B.
Strengths
Apache 2.0 license — fully commercial
119-language multilingual coverage at small scales
Hybrid thinking mode in 4B+ variants
Native Qwen-Agent integration with MCP and tool support
Trade-offs
Smaller MLX/community ecosystem than Llama 3
8B variant slightly larger than Llama 3 8B at equivalent quantization
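The thinking switch is exposed through Qwen 3's chat template. A minimal Hugging Face transformers sketch; Qwen/Qwen3-4B follows Qwen's published naming, but verify the exact ID on the Hub:

```python
# Toggling Qwen 3's hybrid thinking mode via the chat template
# (pip install transformers accelerate). Verify the model ID on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many primes are below 100?"}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False to skip the <think> reasoning block
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```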
5. Falcon H1R-7B
TII's Falcon H1R-7B at Q4_K_M is approximately 4.5GB and delivers outstanding math reasoning — scoring 83.1% on AIME 2025, competitive with reasoning models 5-7x its size. The hybrid Mamba+Transformer architecture provides better long-context efficiency than pure-transformer alternatives at the same parameter count, supporting 256K context on 16GB+ devices; a context-capping sketch for smaller budgets follows the trade-offs. For under-10GB deployment specifically targeting math, science, or reasoning workloads, H1R is uniquely capable in its size class.
Strengths
AIME 2025 score of 83.1% — exceptional for 7B parameters
256K context window via hybrid Mamba+Transformer architecture
Strong long-context efficiency at small scale
Falcon LLM License (commercial-permissive)
Trade-offs
Falcon LLM License is not Apache 2.0 (review for commercial fit)
Strengths concentrated in math/reasoning rather than general chat
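A full 256K context cannot sit next to the weights in a 10GB budget, so in practice you cap the context window at load time. A sketch with llama-cpp-python, assuming a build that supports the hybrid architecture; the GGUF filename is a placeholder:

```python
# Capping context to fit a long-context model in a small VRAM budget
# with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./falcon-h1r-7b.Q4_K_M.gguf",  # placeholder filename
    n_ctx=32768,       # well below the 256K maximum, but fits under 10GB
    n_gpu_layers=-1,   # offload every layer to the GPU
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the sum of the first 50 primes?"}]
)
print(out["choices"][0]["message"]["content"])
```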
How we evaluated
We evaluated models on Q4_K_M memory footprint (the standard quantization for consumer deployment), capability at that quantization level (some models degrade more than others at Q4_K_M), inference speed on laptop-class GPUs, and licensing for commercial deployment. We deliberately weighted real-world consumer deployment patterns (Ollama, llama.cpp, LM Studio) over theoretical benchmark scores — a model that scores well in research but is unsupported by mainstream consumer tools is not useful in this category.
Bottom Line
For most under-10GB deployments, Phi-4 is the strongest commercial pick — MIT license, exceptional capability per parameter, and 14B-class reasoning at 8.5GB. Llama 3 8B is the workhorse choice when ecosystem maturity matters more than peak capability. Gemma 4 e2b/e4b are the right picks for deployments under 4GB or where multimodal input is required. Qwen 3 4B/8B are the multilingual specialists. Falcon H1R-7B is uniquely capable for math/reasoning workloads at 7B scale. Whichever model you choose, fine-tuning in Ertas Studio with QLoRA fits comfortably on the same hardware as inference, making continued model improvement accessible without requiring server-class infrastructure.
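As a rough illustration of why fine-tuning fits the same hardware: QLoRA keeps the base weights frozen in 4-bit and trains only small adapter matrices. Ertas Studio handles this setup internally; the sketch below shows the equivalent open-source configuration with transformers, peft, and bitsandbytes, with illustrative (not tuned) hyperparameters:

```python
# QLoRA setup sketch: 4-bit NF4 base weights plus small trainable LoRA
# adapters (pip install transformers peft bitsandbytes). Hyperparameters
# here are illustrative, not tuned recommendations.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # any pick from this list works
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters are a tiny fraction of the 8B
```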