What is DeepSeek Sparse Attention (DSA)?
A learned sparse attention mechanism, introduced in DeepSeek-V3.2 and carried forward in V4, that routes each query token to a subset of key tokens rather than attending to all of them, dramatically reducing the compute cost of long-context inference.
Definition
DeepSeek Sparse Attention (DSA) is a learned attention sparsification mechanism that reduces the quadratic compute cost of standard transformer attention by having each query token attend to only a small, learned subset of key tokens rather than the full sequence. The selection itself is trained: the model learns during training which keys are relevant for each query, producing an attention pattern that is sparse but task-aware.
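For intuition, here is a minimal top-k sketch in PyTorch. This is not DeepSeek's actual kernel; the `sparse_attention` function, the cheap indexer projections `idx_q`/`idx_k`, and all shapes are illustrative assumptions. The idea it demonstrates: a lightweight scorer ranks every key for each query, the top-k keys are kept, and full softmax attention runs only over that subset.

```python
# Minimal sketch of learned top-k sparse attention (illustrative only,
# not DeepSeek's implementation). Causal masking omitted for brevity.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, idx_q, idx_k, top_k):
    # q, k, v: [seq, dim]; idx_q, idx_k: [seq, idx_dim] cheap indexer projections
    scores = idx_q @ idx_k.T                    # [seq, seq] lightweight relevance scores
    sel = scores.topk(top_k, dim=-1).indices    # [seq, top_k] keys chosen per query
    k_sel, v_sel = k[sel], v[sel]               # gather only the selected keys/values
    attn = torch.einsum("qd,qkd->qk", q, k_sel) / q.shape[-1] ** 0.5
    w = F.softmax(attn, dim=-1)                 # softmax over the selected keys only
    return torch.einsum("qk,qkd->qd", w, v_sel)

seq, dim, idx_dim, top_k = 8, 16, 4, 3
q, k, v = (torch.randn(seq, dim) for _ in range(3))
idx_q, idx_k = torch.randn(seq, idx_dim), torch.randn(seq, idx_dim)
print(sparse_attention(q, k, v, idx_q, idx_k, top_k).shape)  # torch.Size([8, 16])
```

In a real model the indexer projections would be trained end to end, so the top-k selection adapts to content rather than staying random as in this toy.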
DSA was introduced with DeepSeek-V3.2 in late 2025 and carried forward as a core architectural feature in DeepSeek V4 (April 2026). It is a key reason DeepSeek V4 can support a 1 million token context window in practice: naive attention at that length would be prohibitively expensive in both compute and KV-cache memory. With DSA, long-context inference cost grows far more slowly than quadratically in sequence length while maintaining usable retrieval quality.
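Some back-of-envelope arithmetic makes the scaling point concrete. The selected-key budget `k` below is an assumed illustrative value, not a published figure, and real costs also depend on head count, dimensions, and kernel details.

```python
# Rough attention-score cost at 1M context, per head (illustrative).
L = 1_000_000        # sequence length
k = 2048             # assumed number of selected keys per query (hypothetical)

dense = L * L        # score pairs for dense attention: ~1e12
sparse = L * k       # score pairs under top-k selection: ~2e9
print(f"dense/sparse ratio: {dense / sparse:.0f}x")  # ~488x fewer score computations
```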
Why It Matters
Sparse attention is a fundamental architectural lever for making long-context models economical. Models that support 1M tokens are useless if inference is too expensive to actually use that context. DSA makes long-context inference tractable enough that production deployments can use it routinely — full-codebase code review, long-document analysis, multi-document synthesis. As long-context use cases proliferate, learned sparse attention mechanisms are likely to become standard rather than exotic.
Key Takeaways
- DSA is a learned sparse attention mechanism — the model decides which keys to attend to
- Substantially reduces long-context inference cost compared to dense attention
- Introduced in DeepSeek-V3.2, continued in DeepSeek V4 (1M context support)
- Different from heuristic sparse attention (e.g., sliding window): task-aware rather than position-aware (see the sketch after this list)
- Architectural pattern likely to spread as 1M+ context windows become mainstream
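The position-aware vs. task-aware distinction is easy to see in the masks themselves. The toy below is a hypothetical contrast, with a random scorer standing in for a trained indexer and causal masking omitted for brevity.

```python
# Toy contrast: position-based sliding-window mask vs. learned top-k mask.
import torch

seq, window, top_k = 8, 3, 3
pos = torch.arange(seq)

# Heuristic sparsity: each query sees the previous `window` positions,
# regardless of what those tokens contain.
dist = pos[:, None] - pos[None, :]
sliding = ((dist >= 0) & (dist < window)).long()

# Learned sparsity: a scorer ranks all keys per query and keeps the
# top-k, so relevant tokens are reachable wherever they sit.
scores = torch.randn(seq, seq)  # stand-in for trained indexer logits
learned = torch.zeros(seq, seq, dtype=torch.long)
learned.scatter_(1, scores.topk(top_k, dim=-1).indices, 1)

print(sliding)  # fixed banded pattern near the diagonal
print(learned)  # pattern follows the scores, not token position
```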
How Ertas Helps
DeepSeek V4, in both its V4 Pro and V4 Flash variants, uses DSA and is supported in Ertas Studio's fine-tuning pipeline. Fine-tuning a DSA model preserves the sparse attention behavior, and because training-time and inference-time compute are dominated by the active attention pattern, the cost savings over dense attention carry over to fine-tuning workflows as well.