Fine-Tune StarCoder with Ertas
An open-access code generation model trained on permissively licensed source code, available in 3B, 7B, and 15B sizes with transparent training data governance and strong multi-language programming support.
Overview
StarCoder is a family of code-focused language models developed by BigCode, a collaboration between Hugging Face and ServiceNow Research. The project is distinguished by its commitment to responsible and transparent AI development — the training data consists exclusively of permissively licensed code from The Stack v2 dataset, with clear data governance practices including an opt-out mechanism for code authors.
The StarCoder 2 series (the current generation) includes three sizes: 3B, 7B, and 15B parameters. The 15B model was trained on approximately 4 trillion tokens from over 600 programming languages, making it one of the most linguistically diverse code models available. Despite its moderate size, StarCoder 2 15B achieves competitive performance with larger models on code generation benchmarks like HumanEval and MBPP.
Architecturally, StarCoder 2 uses a decoder-only transformer with grouped-query attention (GQA) across all three sizes (the original StarCoder 15B used multi-query attention). All models support fill-in-the-middle (FIM) for code completion tasks and have a 16K token context window with 4K sliding-window attention. The models use a custom tokenizer trained specifically on code, achieving efficient tokenization of programming language syntax, whitespace patterns, and common code constructs.
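For reference, here is a minimal inference sketch using the Hugging Face transformers library and the public bigcode/starcoder2-3b checkpoint; the prompt and generation settings are illustrative only and not specific to Ertas.

```python
# Minimal sketch: plain left-to-right completion with StarCoder 2 via transformers.
# Assumes the public bigcode/starcoder2-3b checkpoint and a CUDA-capable GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n: int) -> int:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```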
StarCoder models are released under the BigCode OpenRAIL-M license, which permits commercial use with responsible-use restrictions. The transparent training data provenance makes StarCoder particularly attractive for organizations with strict compliance requirements regarding training data licensing.
Key Features
Training data transparency is StarCoder's strongest differentiator. Every file in the training dataset comes from a permissively licensed repository, and BigCode provides the Am I In The Stack tool that allows code authors to check whether their code was included and opt out if desired. This level of transparency is unique among code models and makes StarCoder defensible from a legal and compliance perspective — important for enterprise adoption.
Multi-language support extends far beyond mainstream languages. While most code models focus on Python, JavaScript, Java, and a handful of others, StarCoder covers 600+ programming languages including specialized languages like VHDL, SystemVerilog, Solidity, Julia, R, MATLAB, Fortran, and many more. This breadth makes it valuable for teams working with niche or domain-specific languages.
The fill-in-the-middle (FIM) capability supports three modes: prefix-suffix-middle (PSM), suffix-prefix-middle (SPM), and a combination mode. This flexibility in FIM formatting makes the model easier to integrate with different IDE backends and code completion services. StarCoder also supports repository-level context, where file paths and other files from the same project are included as context for more accurate completions.
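To make the FIM and repository-context formats concrete, the sketch below builds a PSM-ordered prompt and a repository-context prompt. The sentinel tokens shown (<fim_prefix>, <fim_suffix>, <fim_middle>, <repo_name>, <file_sep>) are the ones commonly associated with StarCoder tokenizers, but treat them as assumptions and confirm them against the special-token list of the exact checkpoint you deploy; SPM simply reorders the same segments.

```python
# Sketch: constructing FIM and repository-context prompts for StarCoder-style models.
# Sentinel token names are assumptions; check tokenizer.special_tokens_map first.
prefix = "def add_cents(a, b):\n    "
suffix = "\n    return total\n"

# PSM order: prefix, then suffix, then ask the model to generate the middle.
psm_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Repository-level context (assumed format): repo name plus sibling files,
# so completions can draw on project-wide information.
repo_prompt = (
    "<repo_name>acme/payments"
    "<file_sep>utils/currency.py\n"
    "def to_cents(amount): ...\n"
    "<file_sep>billing/invoice.py\n"
    + psm_prompt
)
```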
Fine-Tuning with Ertas
StarCoder models are excellent candidates for fine-tuning in Ertas Studio, especially for organizations with specific programming language needs or coding standards. The 3B model requires only 4-6GB VRAM with QLoRA, making it trainable on virtually any modern GPU. The 7B model needs 8-10GB, and the 15B model requires 10-14GB — all well within consumer hardware capabilities.
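To show why those VRAM figures are plausible, here is a rough QLoRA sketch using the Hugging Face transformers, bitsandbytes, and peft libraries: the base weights are loaded in 4-bit NF4 and only small LoRA adapters are trained. It illustrates the general technique rather than the Ertas Studio pipeline, and the rank, alpha, and target module names are assumptions.

```python
# Rough QLoRA sketch: 4-bit NF4 base weights plus small trainable LoRA adapters.
# Illustrative only; hyperparameters and target modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```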
For organizations concerned about training data provenance, fine-tuning StarCoder provides a defensible foundation. Start with a model trained exclusively on permissively licensed code, then fine-tune on your own proprietary codebase. This creates a clean chain of data provenance that legal and compliance teams can verify.
Ertas Studio supports code-specific fine-tuning formats including instruction-response pairs, FIM triples, and repository-context examples. After training, export to GGUF and deploy through Ollama or llama.cpp. StarCoder's moderate size means fine-tuned GGUF models are highly portable — a Q4_K_M quantized 15B model is approximately 9GB, suitable for distribution as part of development tools or CI/CD pipelines.
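As one way to consume the exported file, the sketch below loads a Q4_K_M GGUF with the llama-cpp-python bindings; the model path is a placeholder for whatever your fine-tuning run produces, and the generation settings are illustrative.

```python
# Sketch: running a fine-tuned, Q4_K_M-quantized GGUF export with llama-cpp-python.
# The model path is a placeholder; point it at the file exported from your run.
from llama_cpp import Llama

llm = Llama(
    model_path="./starcoder2-15b-finetuned.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

out = llm(
    "# Write a function that validates an ISO-8601 date string\n",
    max_tokens=128,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```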
Use Cases
StarCoder is the preferred choice for organizations with strict requirements around training data licensing. Legal tech companies, government agencies, and enterprises with strong compliance cultures choose StarCoder because they can verify the entire training data pipeline. Fine-tuned on internal codebases, StarCoder becomes a fully defensible code assistant.
The broad programming language coverage makes StarCoder valuable for teams working in specialized domains: hardware description languages (VHDL, SystemVerilog), scientific computing (Julia, R, MATLAB), blockchain development (Solidity), and embedded systems (C, Rust). Many of these languages have limited representation in general-purpose code models.
StarCoder's moderate size makes it ideal for integration into development tooling: VS Code extensions, JetBrains plugins, Vim/Neovim plugins, and command-line tools. The 3B model is fast enough for real-time code completion on CPU, while the 7B and 15B models provide higher quality for offline tasks like code review, documentation generation, and test case creation.
Hardware Requirements
StarCoder 3B at Q4_K_M requires approximately 2GB of RAM, running on virtually any development machine. The 7B needs about 4.3GB, and the 15B about 9GB. These modest requirements mean StarCoder models can run alongside a developer's normal workload without requiring dedicated hardware. CPU inference is practical for the 3B model, providing 10-20 tokens per second on modern laptop CPUs.
For GPU inference, the 15B model at Q4_K_M runs on any GPU with 10GB+ VRAM, achieving 40-60 tokens per second on RTX 4070 Ti and above. Full FP16 inference requires approximately 6.5GB (3B), 14.5GB (7B), or 30GB (15B), all achievable on consumer or workstation GPUs.
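These figures follow from a simple rule of thumb: memory is roughly parameter count times bits per weight divided by 8, plus overhead for the KV cache and runtime buffers. A quick back-of-the-envelope helper (the bits-per-weight values are approximations):

```python
# Back-of-the-envelope memory estimate: parameters * bits-per-weight / 8.
# Bits-per-weight values are approximate; real files add metadata, and the
# KV cache grows with context length.
def approx_model_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params in (3, 7, 15):
    fp16 = approx_model_gb(params, 16)     # FP16/BF16 weights
    q4km = approx_model_gb(params, 4.85)   # Q4_K_M averages roughly 4.85 bpw
    print(f"{params}B  FP16 ~= {fp16:.1f} GB   Q4_K_M ~= {q4km:.1f} GB")
```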
For fine-tuning in Ertas Studio, the 3B model needs 4-6GB VRAM (QLoRA), the 7B needs 8-10GB, and the 15B needs 10-14GB. The small sizes enable rapid experimentation: train multiple variants on different datasets, compare results, and iterate quickly before settling on a production configuration.