Fine-Tune StepFun Step-3.5-Flash with Ertas

    StepFun's February 2026 small-giant release: a 196-billion-parameter mixture-of-experts with 11B active parameters that outperforms Kimi K2.5 (1T) and DeepSeek V3.2 (671B) on agentic, reasoning, and coding benchmarks at 3-5× smaller scale. Apache 2.0 licensed, serving roughly 100 tok/sec at 128K context on Hopper GPUs.


    Overview

    Step-3.5-Flash, released by StepFun on February 1, 2026, is one of the most architecturally efficient open-weight releases of the year: a 196-billion-parameter mixture-of-experts with only 11B active parameters per token that punches substantially above its weight class on benchmarks. The model outperforms Kimi K2.5 (1T total parameters, 32B active) and DeepSeek V3.2 (671B total, 37B active) on multiple agentic, reasoning, and coding evaluations, while being 3-5× smaller in total parameter count and substantially cheaper to run at inference time.

    The headline efficiency claim is 100 tokens per second at 128K context on Hopper GPUs (H100/H200) — approximately 3× faster than DeepSeek V3.2's 33 tok/sec on equivalent hardware. This dramatic throughput improvement reflects both the smaller active parameter count and StepFun's specific architectural and inference-optimization investments. For production serving where token-cost economics matter, Step-3.5-Flash is among the most attractive 2026 options.

    Apache 2.0 licensing combined with the small-giant inference economics makes Step-3.5-Flash particularly compelling for self-hosted production deployment. The license has no usage restrictions, attribution requirements, or commercial caps — straightforward commercial deployment at any scale. The 196B total parameter count fits on a 2-GPU server (2x A100 80GB or 2x H100 80GB) at Q4 quantization, making it accessible to substantially smaller deployment teams than the trillion-parameter alternatives.

    StepFun has historically been a less-prominent Chinese AI lab compared to DeepSeek, Qwen, and Kimi, but Step-3.5-Flash establishes the company as a serious competitor on the architectural-efficiency axis. While the model doesn't dominate any specific benchmark category against the absolute frontier, the combination of strong capability and exceptional inference economics produces a particularly attractive cost-quality tradeoff. Weights are available on Hugging Face under `stepfun-ai/Step-3.5-Flash`.

    Key Features

    The 17.8:1 total-to-active parameter ratio (196B / 11B) is more aggressive than most contemporaries and contributes substantially to the inference cost advantages. Combined with carefully optimized expert routing and inference-time optimizations, Step-3.5-Flash achieves token generation throughput substantially better than alternatives at equivalent benchmark quality.

    The '3-5× smaller while outperforming' positioning against Kimi K2.5 and DeepSeek V3.2 is the headline benchmark claim. While different benchmark categories produce different specific results — and Step-3.5-Flash doesn't claim absolute leaderboard dominance — the consistent pattern across multiple agentic, reasoning, and coding evaluations is that Step-3.5-Flash matches or exceeds models with substantially more inference cost. For production deployment economics, this translates directly to lower per-request costs.

    100 tok/sec at 128K context on Hopper GPUs is a specific operational claim that translates well to production serving. Most open-weight models at equivalent quality serve at 30-50 tok/sec on the same hardware. The throughput advantage compounds at high request volumes — at sufficient scale, Step-3.5-Flash can serve the same user load on substantially fewer GPUs than competing flagships.
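The compounding effect at scale is simple ceiling arithmetic. A minimal sketch, using the 100 and 33 tok/sec figures cited above (the aggregate load is an illustrative assumption, not a benchmark number):

```python
# Back-of-envelope serving math: how many servers are needed to sustain
# a given aggregate decode load at a per-server throughput.
# Per-server throughput figures are the article's claims; the 1,000
# tok/sec aggregate load is an assumed example workload.
import math

def servers_needed(total_tok_per_sec: float, per_server_tok_per_sec: float) -> int:
    """Minimum number of servers to sustain an aggregate decode rate."""
    return math.ceil(total_tok_per_sec / per_server_tok_per_sec)

load = 1_000  # assumed aggregate decode tokens/sec across all requests

flash_servers = servers_needed(load, 100)  # Step-3.5-Flash claim
v32_servers = servers_needed(load, 33)     # DeepSeek V3.2 figure cited above

print(flash_servers, v32_servers)  # 10 vs. 31 servers for the same load
```

The same per-server hardware serves roughly 3× the load, which is where the fleet-size savings at high request volumes come from.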

    Apache 2.0 licensing positions Step-3.5-Flash favorably for commercial deployment. Unlike some Chinese-lab releases with custom licensing terms that require legal review, Step-3.5-Flash uses the standard permissive open-source license that commercial deployment teams can deploy without licensing-review overhead.

    Fine-Tuning with Ertas

    Step-3.5-Flash's 11B active parameter count makes it particularly efficient to fine-tune in Ertas Studio. QLoRA training fits comfortably on a single 80GB GPU at typical sequence lengths, or splits across two 48GB GPUs with model parallelism. Training step throughput is dominated by the active parameter count, so training proceeds at approximately 11B-class speeds despite the 196B total parameter footprint.
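The "11B-class training speed" claim follows from the standard compute rule of thumb that a training step costs roughly 6 FLOPs per active parameter per token. A sketch of that arithmetic (the 6*N*D approximation is a general heuristic, not an Ertas- or StepFun-published figure):

```python
# Training compute per token scales with ACTIVE parameters in an MoE,
# not total parameters: roughly 6 FLOPs per active parameter per token
# for a combined forward + backward pass (the 6*N*D rule of thumb).
ACTIVE = 11e9   # Step-3.5-Flash active parameters per token
TOTAL = 196e9   # total parameters (resident in memory, mostly idle per token)

flops_per_token = 6 * ACTIVE     # MoE per-token training cost
dense_equivalent = 6 * TOTAL     # cost if all 196B were active (dense)

ratio = dense_equivalent / flops_per_token
print(f"MoE step is ~{ratio:.1f}x cheaper per token than a dense 196B model")
```

This is the same 17.8:1 total-to-active ratio noted earlier, seen from the compute side: memory must hold all 196B parameters, but each step only pays for 11B.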

    For the MoE architecture, Ertas Studio handles expert routing stability during low-rank adaptation automatically. Training data formats with multi-turn conversations, agentic execution traces, and reasoning examples all work natively. The aggressive total-to-active ratio means fine-tuning can effectively specialize specific experts for domain-specific patterns without affecting the broader model's general capability.

    For most teams interested in domain specialization, Step-3.5-Flash is among the most attractive base choices in the 2026 ecosystem — combining strong base capability with accessible fine-tuning hardware requirements and Apache 2.0 licensing for commercial deployment of the resulting fine-tuned variant.

    After training, Ertas Studio exports to GGUF format with full Step-3.5-Flash chat template preservation. The Q4_K_M quantization is approximately 110GB — fitting on a 2-GPU server (2x A100 80GB or 2x H100 80GB) — with the 11B active parameter count delivering substantially better throughput than alternatives at equivalent memory footprint.

    Use Cases

    High-throughput production API serving is Step-3.5-Flash's most natural use case. The combination of strong cross-domain capability and exceptional inference economics makes it particularly attractive for customer support automation, content generation pipelines, document processing systems, and similar workloads where token-cost matters significantly at scale. Teams running on per-request pricing models or comparing API costs to self-hosted alternatives find Step-3.5-Flash among the most economically attractive options.

    For agentic deployments where reasoning capability matters but full trillion-parameter inference cost is prohibitive, Step-3.5-Flash provides a particularly favorable tradeoff. The model handles multi-step reasoning, tool use, and structured output adherence at competitive quality with substantially better economics than larger alternatives.

    For smaller deployment teams, Step-3.5-Flash's accessibility relative to trillion-parameter alternatives is structurally significant. Where DeepSeek V4, Kimi K2.6, and similar require 8-GPU server configurations for full-quality deployment, Step-3.5-Flash works on 2-GPU configurations — opening up frontier-tier capability to teams with substantially smaller infrastructure budgets.

    Hardware Requirements

    Step-3.5-Flash at Q4_K_M quantization requires approximately 110GB of memory, fitting on a 2x A100 80GB or 2x H100 80GB server, or a CPU inference host with 192GB+ RAM. Active parameter count of 11B determines token generation throughput — combined with StepFun's inference optimizations, this delivers the headline 100 tok/sec at 128K context claim on Hopper GPU configurations.

    For smaller deployments, Q3_K_M quantization (approximately 85GB) trades modest quality for reduced memory, fitting on a single 80GB GPU with margin. The 11B active parameter count means inference speed advantages persist even at lower quantization tiers — a particularly attractive characteristic for cost-sensitive production deployments.
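These footprint figures are straightforward bits-per-weight arithmetic. A minimal sketch, where the effective bits-per-weight values are rough assumptions consistent with the ~85GB and ~110GB figures above, not StepFun-published numbers:

```python
# Estimate quantized model file size from total parameter count and an
# assumed average effective bits-per-weight for each quantization tier.
# Bit widths are rough assumptions, not official figures.
TOTAL_PARAMS = 196e9  # Step-3.5-Flash total parameters

BITS_PER_WEIGHT = {   # assumed average effective bits per weight
    "Q3_K_M": 3.5,
    "Q4_K_M": 4.5,
    "Q8_0": 8.5,
}

def quantized_size_gb(params: float, bits: float) -> float:
    """Rough model file size in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9

for tier, bits in BITS_PER_WEIGHT.items():
    print(f"{tier}: ~{quantized_size_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Under these assumptions Q4_K_M lands at roughly 110GB and Q3_K_M at roughly 86GB, matching the two-GPU and single-GPU deployment tiers described above.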

    For fine-tuning in Ertas Studio: Step-3.5-Flash QLoRA needs approximately 60-90GB of total VRAM; the lower end of that range fits on a single 80GB GPU at typical sequence lengths. Training step throughput at 11B active parameters is substantially faster than fine-tuning equivalent-quality dense or larger-active alternatives. Long-context fine-tuning (32K-64K sequences) is tractable on 80GB GPUs with gradient checkpointing.

    Supported Quantizations

    Q3_K_M, Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0
