Fine-Tune DeepSeek V3.2 with Ertas
DeepSeek's late-2025 release that introduced DeepSeek Sparse Attention (DSA) — a learned sparse attention mechanism enabling efficient long-context inference, paired with a unified thinking mode toggle. Direct predecessor to DeepSeek V4. MIT-style license.
Overview
DeepSeek V3.2, released in late 2025, is the architectural predecessor to DeepSeek V4 and introduced two innovations that became central to the DeepSeek lineage: DeepSeek Sparse Attention (DSA) and a unified thinking mode that folds reasoning capability into a standard chat checkpoint. The model keeps the 671B-A37B mixture-of-experts architecture of DeepSeek V3 (671B total parameters, 37B active per token), adding substantially better long-context performance through DSA and operational simplification through the unified thinking mode.
DSA is a learned sparse attention mechanism that routes each query token to a subset of key tokens rather than attending to all of them. This dramatically reduces the compute cost of long-context inference and was the architectural breakthrough that enabled the 1M token context window in DeepSeek V4. While V3.2 itself does not match V4's 1M context, DSA in V3.2 produces measurably better long-context retrieval quality than dense-attention models at equivalent context lengths.
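The mechanism is easiest to see in miniature. Below is a minimal sketch that assumes nothing about DeepSeek's actual implementation: the production model selects keys with a separate learned indexer, whereas this toy reuses the attention scores themselves, so it only illustrates the compute shape of per-query top-k attention.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Toy per-query top-k attention: each query attends to its top_k
    highest-scoring keys instead of all n_k keys (dense attention).
    Illustrative only; real DSA picks keys with a cheap learned
    indexer, not the full attention scores used here."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_k)
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    masked = np.full_like(scores, -np.inf)             # drop non-selected keys
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over top_k keys
    return weights @ v                                 # (n_q, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))                           # 8 query tokens
k = rng.normal(size=(1024, 64))                        # 1024-token context
v = rng.normal(size=(1024, 64))
out = topk_sparse_attention(q, k, v, top_k=64)         # 64 of 1024 keys per query
```

Once the selection step itself is made cheap (as DSA's learned indexer does), the expensive softmax-weighted mixing touches only the selected keys, which is where the long-context savings come from.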
The unified thinking mode in V3.2 replaced the prior separate-deployment pattern (V3 chat + R1 reasoning) with a single checkpoint that toggles between modes. The same V3.2 weights serve both fast direct-response and extended-reasoning queries via a runtime control parameter — a pattern that has now become standard in the 2026 generation of flagship models. V3.2 is released under the DeepSeek License, an MIT-style commercial-permissive license.
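In practice the toggle is a per-request parameter. The exact flag name depends on the serving stack; the sketch below assumes a self-hosted OpenAI-compatible server that switches modes through a chat-template argument, with the endpoint URL, model name, and the "thinking" key all as placeholders.

```python
from openai import OpenAI

# Assumed self-hosted, OpenAI-compatible endpoint serving the V3.2 checkpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(prompt: str, thinking: bool) -> str:
    resp = client.chat.completions.create(
        model="deepseek-v3.2",  # placeholder served-model name
        messages=[{"role": "user", "content": prompt}],
        # vLLM-style servers pass chat-template flags via extra_body;
        # the flag name here is an assumption, not a documented API.
        extra_body={"chat_template_kwargs": {"thinking": thinking}},
    )
    return resp.choices[0].message.content

fast = ask("Summarize this ticket in one line: ...", thinking=False)
deep = ask("Find the race condition in this locking scheme: ...", thinking=True)
```

Both calls hit the same weights on the same endpoint, which is what eliminates the cross-model routing of the old V3 + R1 pattern.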
DeepSeek V3.2 was superseded as the DeepSeek flagship by V4 in April 2026, but V3.2 remains widely deployed in production environments where teams want the operational simplicity and architectural innovations without the multi-GPU footprint of V4 Pro. The DeepSeek-V3.2-Exp variant continues to be referenced in research and production deployments testing the DSA architecture.
Key Features
DeepSeek Sparse Attention (DSA) is V3.2's defining architectural innovation. By learning which key tokens are relevant for each query, DSA reduces long-context attention compute substantially below the quadratic cost of dense attention while maintaining usable retrieval quality. This was the architectural foundation that V4 built on to support 1M context.
Unified thinking mode in V3.2 was the first major implementation of the now-standard pattern. Instead of maintaining separate R1 (reasoning) and V3 (chat) deployments with cross-model routing, V3.2 ships both behaviors in a single checkpoint. Operationally, this dramatically simplifies production agent infrastructure — most queries get fast direct responses, and only the harder subset that benefits from reasoning consumes the extended-reasoning compute.
The 671B-A37B MoE architecture is inherited from V3 and remains an excellent quality-to-compute trade-off. With 37B active parameters, per-token generation runs at speeds comparable to a 37B dense model while drawing on the knowledge of the full 671B (which must still be resident in memory). For teams with the multi-GPU server infrastructure to host it, V3.2 delivers strong reasoning and code performance.
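A back-of-envelope calculation shows why active parameters dominate decode speed: assuming generation is memory-bandwidth bound, each token reads roughly the active weights once. The bandwidth and bytes-per-weight figures below are illustrative assumptions, not measurements.

```python
# Rough decode-throughput ceiling under a memory-bandwidth-bound assumption.
ACTIVE_PARAMS = 37e9         # A37B: weights actually read per token
TOTAL_PARAMS = 671e9         # full MoE: must be resident, but not all read
BYTES_PER_WEIGHT = 0.55      # ~4.4 effective bits/weight, Q4-class (assumed)
AGG_BANDWIDTH = 8 * 3.0e12   # 8 GPUs x ~3 TB/s HBM each (assumed)

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_WEIGHT          # ~20 GB/token
ceiling = AGG_BANDWIDTH / bytes_per_token
print(f"decode ceiling ~ {ceiling:,.0f} tok/s (ignores KV cache, batching)")
print(f"dense 671B would read ~{TOTAL_PARAMS / ACTIVE_PARAMS:.0f}x more per token")
```

Real throughput lands well below this ceiling once KV-cache reads, expert-routing imbalance, and interconnect overhead are counted, but the roughly 18x gap versus a dense 671B model is the point.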
The MIT-style DeepSeek License combined with the model's operational simplicity made V3.2 a popular production choice through early 2026 for teams that wanted DeepSeek capability without committing to the larger V4 Pro infrastructure footprint.
Fine-Tuning with Ertas
DeepSeek V3.2 is at the upper end of practical fine-tuning. Ertas Studio supports QLoRA fine-tuning on multi-GPU server configurations (8x A100 80GB or 8x H100 80GB), with approximately 380-450GB of total VRAM required at typical sequence lengths.
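Ertas Studio configures the run for you, but for intuition, the general shape of a QLoRA setup on a checkpoint this size looks roughly like the Hugging Face sketch below. The repo id and target-module names are placeholders, not V3.2's actual identifiers.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Base weights load in 4-bit; only small LoRA adapters are trained.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3.2",      # placeholder repo id
    quantization_config=bnb,
    device_map="auto",                 # shard across the 8-GPU server
    trust_remote_code=True,
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "kv_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()     # adapters are a tiny fraction of 671B
```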
For most teams without 8-GPU server access, the recommended pattern is to use V3.2 as a teacher for synthetic data generation, then fine-tune one of the DeepSeek-R1 distilled variants (Qwen 7B-32B or Llama 70B distilled) on that data. This produces a domain-specialized model at single-GPU deployment cost while inheriting V3.2's reasoning and coding patterns via distillation.
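A minimal sketch of the teacher step follows, assuming V3.2 is self-hosted behind an OpenAI-compatible endpoint (URL and model name are placeholders). The resulting JSONL becomes the fine-tuning set for the smaller distilled base.

```python
import json
from openai import OpenAI

teacher = OpenAI(base_url="http://v32-server:8000/v1", api_key="unused")

def generate_distill_set(prompts, out_path="distill.jsonl"):
    """Have the V3.2 teacher answer domain prompts; save prompt/completion
    pairs for fine-tuning an R1-distill base in Ertas Studio."""
    with open(out_path, "w") as f:
        for p in prompts:
            resp = teacher.chat.completions.create(
                model="deepseek-v3.2",   # placeholder served-model name
                messages=[{"role": "user", "content": p}],
                temperature=0.7,         # some diversity in teacher outputs
            )
            f.write(json.dumps({
                "prompt": p,
                "completion": resp.choices[0].message.content,
            }) + "\n")

generate_distill_set(["Explain why this query plan falls back to a full scan: ..."])
```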
When fine-tuning V3.2 directly, Ertas Studio handles the DSA architecture's training-time considerations automatically — including expert routing stability and sparse attention pattern preservation during low-rank adaptation. After training, Ertas Studio exports to GGUF format. The Q4_K_M quantization of V3.2 is approximately 360GB, requiring multi-GPU server deployment.
Use Cases
DeepSeek V3.2 excels at workloads that benefit from V4-quality capability but where V4 Pro infrastructure (8-GPU server) is not available. Production deployments running on 4-6 GPU configurations often choose V3.2 over V4 Pro for the lower hardware footprint, especially when 1M context isn't a hard requirement.
The unified thinking mode makes V3.2 well-suited for adaptive agent deployments: fast direct responses for routine tickets, escalation to reasoning mode for complex queries. This pattern delivers substantial cost savings versus running every request in reasoning mode, while maintaining quality on the queries that actually benefit from extended thinking.
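A routing layer for this pattern can start as a plain heuristic. The sketch below builds on the ask(prompt, thinking) helper from the Overview section; the keyword list is a stand-in for whatever escalation signal (a small classifier, a cheap first-pass check) a deployment actually uses.

```python
# Builds on ask(prompt, thinking) from the Overview sketch.
HARD_MARKERS = ("prove", "debug", "trace through", "reconcile", "step by step")

def needs_thinking(query: str) -> bool:
    """Heuristic escalation rule; a placeholder for a real router."""
    return len(query) > 500 or any(m in query.lower() for m in HARD_MARKERS)

def handle(query: str) -> str:
    # Routine tickets take the fast path; only hard queries pay for reasoning.
    return ask(query, thinking=needs_thinking(query))
```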
For teams running DeepSeek-R1 distilled variants in production, V3.2 is also a strong teacher model for ongoing distillation cycles — generating new synthetic training data as your domain evolves and refreshing the smaller deployed models with that data.
Hardware Requirements
DeepSeek V3.2 at Q4_K_M requires approximately 360GB of memory, fitting on an 8x A100 80GB or 8x H100 80GB server, or a CPU inference host with 512GB+ RAM. Once loaded, the 37B active parameter count, not the full 671B, determines per-token generation throughput.
For smaller deployments, Q3_K_M quantization (approximately 270GB) trades modest quality for reduced memory, fitting on a 4x H100 80GB server with margin. Going below Q3 is not recommended for production deployments — quality degradation on long-context retrieval becomes noticeable, particularly on the DSA-dependent benchmarks where V3.2's competitive edge originates.
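For capacity planning, file size scales nearly linearly with effective bits per weight. The estimator below backs those rates out of the figures above (roughly 4.3 and 3.2 bits per weight); both are assumptions for illustration, since K-quants mix precisions per tensor and real file sizes vary.

```python
TOTAL_PARAMS = 671e9

def gguf_size_gb(effective_bits_per_weight: float) -> float:
    """Back-of-envelope GGUF size; ignores metadata and per-tensor mixes."""
    return TOTAL_PARAMS * effective_bits_per_weight / 8 / 1e9

print(f"Q4_K_M ~ {gguf_size_gb(4.3):.0f} GB")   # ~360 GB, matching above
print(f"Q3_K_M ~ {gguf_size_gb(3.2):.0f} GB")   # ~270 GB
```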
For fine-tuning in Ertas Studio: V3.2 QLoRA needs approximately 380-450GB total VRAM (multi-GPU server). For most teams, distillation onto smaller bases (R1-Distill-Qwen-32B, R1-Distill-Llama-70B) via teacher-generated synthetic data is the more practical path.
Supported Quantizations
Q4_K_M: approximately 360GB. Recommended production default; fits an 8x A100/H100 80GB server or a 512GB+ RAM CPU host.
Q3_K_M: approximately 270GB. Fits a 4x H100 80GB server with margin, at a modest quality trade-off.
Quantizations below Q3 are not recommended for production; see Hardware Requirements.