What is AWQ?
Activation-aware Weight Quantization — a 4-bit quantization method that protects salient weights based on activation magnitude, producing higher-quality compressed models than naive quantization at the same bit-width.
Definition
AWQ (Activation-aware Weight Quantization) is a post-training quantization technique that compresses model weights to 4-bit precision while preserving substantially more quality than naive uniform quantization. The core insight: not all weights are equally important to model output. Weights that operate on high-magnitude activations have outsized influence on predictions, while weights operating on near-zero activations contribute little. AWQ identifies the top ~1% of 'salient' weight channels using activation statistics from a small calibration dataset, then scales those channels up before quantization, folding the inverse scale into the preceding operation so the layer's output is mathematically unchanged. Because a scaled-up weight is larger relative to the fixed rounding step, its relative quantization error shrinks, which is what protects the weights that matter most.
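To make the mechanism concrete, here is a minimal NumPy sketch of the activation-aware scale search, not the reference implementation: it grid-searches an exponent alpha, scales each input channel by its mean activation magnitude raised to alpha, and keeps the scales that minimize output error on the calibration data. Function names, shapes, and the grid size are illustrative assumptions.

```python
import numpy as np

def rtn_quantize(w, n_bits=4):
    # Naive symmetric round-to-nearest, one scale per output row.
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max(axis=1, keepdims=True) / qmax + 1e-8
    return np.round(w / step) * step

def search_awq_scales(w, calib_acts, n_bits=4, grid=20):
    # w: (out_features, in_features); calib_acts: (n_samples, in_features).
    # Per-input-channel salience = mean activation magnitude on calibration data.
    act_mag = np.abs(calib_acts).mean(axis=0)
    best_err, best_s = np.inf, np.ones_like(act_mag)
    for i in range(grid + 1):
        alpha = i / grid                        # grid-search the exponent in [0, 1]
        s = np.maximum(act_mag, 1e-4) ** alpha  # scale salient channels up
        # Quantize the scaled weights, then undo the scale; columns with
        # large s see smaller relative rounding error.
        w_q = rtn_quantize(w * s, n_bits) / s
        err = ((calib_acts @ w.T - calib_acts @ w_q.T) ** 2).mean()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

Note that alpha = 0 recovers naive RTN (all scales equal 1), so the search can never do worse than the baseline on the calibration set.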
The practical result is that an AWQ-quantized 4-bit model typically retains 95-99% of the original model's accuracy on standard benchmarks while using approximately 4x less memory than the FP16 original. This makes AWQ a popular choice for inference deployments where memory is the constraint — particularly for serving large models on consumer GPUs or for fitting more concurrent requests on server hardware.
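In practice the whole post-training workflow is a few lines with an off-the-shelf implementation. Below is a sketch using the open-source AutoAWQ library; the model and output paths are placeholders, and the quant_config values are the commonly used settings (4-bit weights, group size 128).

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder base model
quant_path = "mistral-7b-instruct-awq"              # placeholder output dir
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Runs calibration and quantizes weights to 4-bit; no fine-tuning involved.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```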
Why It Matters
Quantization is the difference between a model that fits on your hardware and one that doesn't. A 70B-parameter model in FP16 needs ~140 GB of memory; the same model with AWQ 4-bit fits in ~40 GB. AWQ produces higher-quality 4-bit quantized models than older methods like RTN (round-to-nearest) and is competitive with or better than GPTQ for many model families. For inference frameworks like vLLM and TensorRT-LLM, AWQ has become a standard quantization option alongside GPTQ.
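A quick back-of-envelope check of those numbers, assuming a typical group size of 128 and roughly 4 bytes of scale/zero-point metadata per group (the exact overhead varies by format):

```python
# Approximate weight-memory arithmetic for a 70B-parameter model.
params = 70e9
fp16_gb = params * 2 / 1e9                  # 2 bytes per weight -> ~140 GB
weights_gb = params * 0.5 / 1e9             # 4 bits per weight  -> ~35 GB
overhead_gb = params / 128 * 4 / 1e9        # group-wise scales/zeros -> ~2 GB
print(fp16_gb, weights_gb + overhead_gb)    # ~140 GB vs ~37 GB before runtime overhead
```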
Key Takeaways
- AWQ is a post-training 4-bit weight quantization method — no fine-tuning required
- Identifies and protects ~1% of salient weights based on activation magnitude
- Typically retains 95-99% of FP16 accuracy at ~4x memory reduction
- Supported by vLLM, TensorRT-LLM, and other major inference frameworks; llama.cpp can apply AWQ scales during GGUF conversion rather than running AWQ checkpoints directly
- Common alternative to GPTQ — different methods can win on different model families
How Ertas Helps
After fine-tuning a model in Ertas Studio, you can export it with AWQ quantization alongside GGUF (which uses its own quantization schemes such as Q4_K_M) and other formats. Choosing between AWQ, GPTQ, and GGUF comes down to your inference framework: vLLM and TensorRT-LLM prefer AWQ/GPTQ, while Ollama and llama.cpp expect GGUF. Ertas Studio's export pipeline handles all three.
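For example, serving an AWQ checkpoint with vLLM takes one extra argument at load time. The model name below is a placeholder for any AWQ-quantized checkpoint:

```python
from vllm import LLM, SamplingParams

# quantization="awq" tells vLLM to load the checkpoint with its AWQ kernels.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
outputs = llm.generate(["What is AWQ?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```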