What is DeepSeek Sparse Attention (DSA)?
A learned sparse attention mechanism, introduced in DeepSeek-V3.2 and carried forward in V4, that routes each query token to a subset of key tokens rather than attending to all of them, dramatically reducing the compute cost of long-context inference.
Definition
DeepSeek Sparse Attention (DSA) is a learned attention sparsification mechanism that reduces the quadratic compute cost of standard transformer attention by having each query token attend to only a small, learned subset of key tokens rather than the full sequence. The selection itself is trained: the model learns during training which keys are relevant for each query, producing an attention pattern that is sparse but task-aware.
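For intuition, here is a minimal top-k sketch in PyTorch. This is not DeepSeek's actual kernel; the `sparse_attention` function, the cheap indexer projections `idx_q`/`idx_k`, and all shapes are illustrative assumptions. The idea it demonstrates: a lightweight scorer ranks every key for each query, the top-k keys are kept, and full softmax attention runs only over that subset.

```python
# Minimal sketch of learned top-k sparse attention (illustrative only,
# not DeepSeek's implementation). Causal masking omitted for brevity.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, idx_q, idx_k, top_k):
    # q, k, v: [seq, dim]; idx_q, idx_k: [seq, idx_dim] cheap indexer projections
    scores = idx_q @ idx_k.T                    # [seq, seq] lightweight relevance scores
    sel = scores.topk(top_k, dim=-1).indices    # [seq, top_k] keys chosen per query
    k_sel, v_sel = k[sel], v[sel]               # gather only the selected keys/values
    attn = torch.einsum("qd,qkd->qk", q, k_sel) / q.shape[-1] ** 0.5
    w = F.softmax(attn, dim=-1)                 # softmax over the selected keys only
    return torch.einsum("qk,qkd->qd", w, v_sel)

seq, dim, idx_dim, top_k = 8, 16, 4, 3
q, k, v = (torch.randn(seq, dim) for _ in range(3))
idx_q, idx_k = torch.randn(seq, idx_dim), torch.randn(seq, idx_dim)
print(sparse_attention(q, k, v, idx_q, idx_k, top_k).shape)  # torch.Size([8, 16])
```

In a real model the indexer projections would be trained end to end, so the top-k selection adapts to content rather than staying random as in this toy.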
DSA was introduced with DeepSeek-V3.2 in late 2025 and carried forward as a core architectural feature in DeepSeek V4 (April 2026). It is a key reason DeepSeek V4 can support a 1 million token context window in practice: naive attention at that length would be prohibitively expensive in both compute and KV-cache memory. With DSA, long-context inference cost grows far more slowly than quadratically in sequence length while maintaining usable retrieval quality.
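Some back-of-envelope arithmetic makes the scaling point concrete. The selected-key budget `k` below is an assumed illustrative value, not a published figure, and real costs also depend on head count, dimensions, and kernel details.

```python
# Rough attention-score cost at 1M context, per head (illustrative).
L = 1_000_000        # sequence length
k = 2048             # assumed number of selected keys per query (hypothetical)

dense = L * L        # score pairs for dense attention: ~1e12
sparse = L * k       # score pairs under top-k selection: ~2e9
print(f"dense/sparse ratio: {dense / sparse:.0f}x")  # ~488x fewer score computations
```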
Why It Matters
Sparse attention is a fundamental architectural lever for making long-context models economical. Models that support 1M tokens are useless if inference is too expensive to actually use that context. DSA makes long-context inference tractable enough that production deployments can use it routinely — full-codebase code review, long-document analysis, multi-document synthesis. As long-context use cases proliferate, learned sparse attention mechanisms are likely to become standard rather than exotic.
Key Takeaways
- DSA is a learned sparse attention mechanism — the model decides which keys to attend to
- Substantially reduces long-context inference cost compared to dense attention
- Introduced in DeepSeek-V3.2, continued in DeepSeek V4 (1M context support)
- Different from heuristic sparse attention (e.g., sliding window): task-aware rather than position-aware (see the sketch after this list)
- Architectural pattern likely to spread as 1M+ context windows become mainstream
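The position-aware vs. task-aware distinction is easy to see in the masks themselves. The toy below is a hypothetical contrast, with a random scorer standing in for a trained indexer and causal masking omitted for brevity.

```python
# Toy contrast: position-based sliding-window mask vs. learned top-k mask.
import torch

seq, window, top_k = 8, 3, 3
pos = torch.arange(seq)

# Heuristic sparsity: each query sees the previous `window` positions,
# regardless of what those tokens contain.
dist = pos[:, None] - pos[None, :]
sliding = ((dist >= 0) & (dist < window)).long()

# Learned sparsity: a scorer ranks all keys per query and keeps the
# top-k, so relevant tokens are reachable wherever they sit.
scores = torch.randn(seq, seq)  # stand-in for trained indexer logits
learned = torch.zeros(seq, seq, dtype=torch.long)
learned.scatter_(1, scores.topk(top_k, dim=-1).indices, 1)

print(sliding)  # fixed banded pattern near the diagonal
print(learned)  # pattern follows the scores, not token position
```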
How Ertas Helps
DeepSeek V4, in both its V4 Pro and V4 Flash variants, uses DSA and is supported in Ertas Studio's fine-tuning pipeline. Fine-tuning a DSA model preserves the sparse attention behavior, and because training-time and inference-time compute are dominated by the active attention pattern, the cost savings over dense attention carry over to fine-tuning workflows as well.