Code completion

    Fine-tune a 3B-class code model on a niche language or in-house framework so completions inside a creator app or IDE plugin understand syntax that base models botch.

    Base code models know Python, JavaScript, TypeScript, Java, C++, Rust, and Go very well. They know everything else somewhere between "passably" and "wrong." A fine-tune for code completion is most useful when your users write in a language or framework that sits in the second bucket: an internal DSL, an in-house framework with strict conventions, or a niche language where base models default to the wrong dialect.

    This recipe uses a worked example: imagine Roblox wants to add AI completions for Luau inside Roblox Studio. Luau is Roblox's gradually typed language derived from Lua 5.1. It ships a different standard library (task.wait instead of the deprecated global wait, no io library, no os.execute) and adds language features of its own (type annotations and an opt-in strict mode), and millions of creators edit it every month. Out of the box, a 3B-class code model will give them Lua 5.1 completions, often with the wrong API surface; a fine-tune teaches it Luau-the-language and Studio-the-environment. Roblox is the anchor because the scale and the language-fit gap make the trade-offs concrete; substitute any vertical with its own scripting layer.

    Recipe also fits

    The Roblox / Luau example is one of many. The same shape applies to:

    • Niche-language IDE plugins for Lean, Solidity, Forth, Idris, or other languages base models barely know.
    • In-house DSL completions for any team with a custom scripting layer used by hundreds of internal developers.
    • Vertical-language plugins for Salesforce Apex, HashiCorp HCL, SAP ABAP, or PL/SQL, where strong house conventions meet weak base-model coverage.
    • Domain-specific scripting in scientific tools (Mathematica, R, MATLAB) or shader languages (GLSL, HLSL).
    • Creator-platform scripting such as Unity C# conventions, Unreal Engine C++ gameplay code, or Godot GDScript.

    If your users edit code in an environment where the base model defaults to the wrong dialect or misses your house conventions, this is the recipe to start with.

    When this is the right fit

    This recipe is the right fit when:

    • The target language or framework is underserved by base models. Niche languages (Luau, Solidity dialects, GLSL variants, Lean), in-house DSLs, and legacy or enterprise languages (PL/SQL, COBOL, VB.NET) all qualify. Current mainstream code (Python 3.12, TypeScript 5.6) does not: the base model is already strong there, and a fine-tune in this size class is unlikely to beat it.
    • You have access to a non-trivial corpus of well-written code in the target. A million lines is comfortable; a hundred thousand is workable; ten thousand is hard. Open-source repos, internal codebases, generated examples all count.
    • You want completions to respect house conventions, not just be syntactically valid. Variable naming, function shape, error-handling idioms, comment style, type-annotation density: the fine-tune learns all of these.
    • The user surface is an editor pane with a cursor, and completions trigger on idle or explicit invocation. If the surface is chat, this recipe is the wrong shape; see Customer support bot for that pattern.

    It is not the right fit when:

    • You want agentic code editing (multi-file refactors, "fix this test"). A 3B fine-tune is too small for the planning step. Use this recipe for the completion side and a frontier API for the agent side, if at all.
    • You want completions in a dozen languages. Fine-tunes are most effective when scoped tightly. Train one model per language family.
    • The target language is well-served by frontier APIs and latency is not a concern. The fine-tune wins on latency, offline use, privacy, and per-character cost; if none of those matter to your users, do not bother.

    The dataset

    Code completion is a fill-in-the-middle (FIM) task. The model sees the code before the cursor (the prefix) and after the cursor (the suffix) and predicts what should sit in between (the middle). The training data has to mirror that shape.

    For the Luau scenario, 20,000 rows is a reasonable starting point. Scale up if quality lags, or down if authoring that many rows is hard. A typical mix of sources:

    | Source | Share | Notes |
    |---|---|---|
    | Open-source Roblox games (GitHub, via GitHub Search) | 50% (~10,000) | Well-starred repos filtered for active maintenance and idiomatic style |
    | Roblox Creator Documentation code samples | 20% (~4,000) | Authoritative examples for the Studio APIs and patterns the team wants to promote |
    | In-house style-guide examples | 15% (~3,000) | Curated examples for conventions the team wants reinforced |
    | Synthetic FIM pairs from existing code | 15% (~3,000) | Take canonical scripts and machine-generate FIM cuts at multiple cursor positions |

    A "row" is one (prefix, suffix, middle) triple. From a single 500-line script you can derive 20 to 50 high-quality FIM rows by choosing different cursor positions.
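    Here is a minimal Python sketch of assembling that mix. It assumes you have already collected candidate rows into one pool per source. The pool names and helpers are hypothetical, and Python is used only because dataset tooling runs offline, outside Studio. Any rounding leftover goes to the largest source, so the per-source targets always sum to the requested total.

```python
import random
from typing import Dict, List

# Shares from the source-mix table; the keys are hypothetical pool names.
SHARES: Dict[str, float] = {
    "open_source_games": 0.50,
    "creator_docs": 0.20,
    "style_guide": 0.15,
    "synthetic_fim": 0.15,
}


def row_targets(total: int, shares: Dict[str, float]) -> Dict[str, int]:
    """Convert fractional shares into whole-row targets that sum to `total`."""
    targets = {name: round(total * share) for name, share in shares.items()}
    # Give any rounding leftover (positive or negative) to the largest source.
    largest = max(shares, key=lambda name: shares[name])
    targets[largest] += total - sum(targets.values())
    return targets


def sample_mix(pools: Dict[str, List[dict]], total: int, seed: int = 0) -> List[dict]:
    """Sample each source pool down to its target, then shuffle the union.

    A pool smaller than its target contributes everything it has, so the
    result can fall short of `total`; top up or re-balance in that case.
    """
    rng = random.Random(seed)
    mixed: List[dict] = []
    for name, target in row_targets(total, SHARES).items():
        pool = pools.get(name, [])
        mixed.extend(rng.sample(pool, min(target, len(pool))))
    rng.shuffle(mixed)
    return mixed
```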

    Generating FIM pairs

    The naive approach (pick a random character offset, cut there) produces a lot of low-signal rows because most cursor positions are inside identifiers or in whitespace. Cut at boundary points instead:

    • Start of a statement
    • After a function signature or a block opener such as then or do (model completes the body)
    • After a local x = (model completes the expression)
    • Inside a string literal where the literal is a known canonical value (model completes the string)
    • After a comment that describes intent (the comment is part of the prefix)

    A heuristic that works well: parse the file with the target language's parser, walk the AST, and emit FIM rows at every statement boundary, every function-body opener, and every right-hand side of an assignment. The Luau open-source parser is straightforward to run from a script.
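    If wiring up the Luau parser is more than you want on day one, a line-based approximation gets part of the way. The Python sketch below is that approximation, not an AST walk. It assumes one statement per line, and that each block's closing end sits at the opener's indentation. It emits rows at three cut points:

    • statement starts
    • after local x =
    • after block openers (function signatures and lines ending in then or do)

    The helper names are hypothetical.

```python
import re
from typing import Dict, List

# Function signatures, or lines ending in `then` / `do`, open a block in Luau.
BLOCK_OPENER = re.compile(r"(\bfunction\b.*\)|\bthen|\bdo)\s*$")
# `local name =` or `local name: Type =`, capturing everything up to the RHS.
LOCAL_ASSIGN = re.compile(r"^(\s*local [A-Za-z_]\w*(?:\s*:\s*[^=]+)?\s*=\s*)\S")
# Lines that close or continue a block rather than start a statement.
CLOSER = re.compile(r"^(end|else|elseif)\b")


def indent_of(line: str) -> int:
    return len(line) - len(line.lstrip(" \t"))


def fim_rows(source: str, max_body_lines: int = 15) -> List[Dict[str, str]]:
    """Emit prefix/middle/suffix rows at statement-level cut points."""
    lines = source.splitlines(keepends=True)
    offsets = [0]  # offsets[i] is the character offset where line i starts
    for line in lines:
        offsets.append(offsets[-1] + len(line))
    rows: List[Dict[str, str]] = []

    def emit(start: int, end: int) -> None:
        if source[start:end].strip():
            rows.append({"prefix": source[:start],
                         "middle": source[start:end],
                         "suffix": source[end:]})

    for i, line in enumerate(lines):
        text = line.rstrip("\r\n")
        stripped = text.strip()
        if not stripped or stripped.startswith("--") or CLOSER.match(stripped):
            continue
        start, end = offsets[i], offsets[i] + len(text)
        # 1. Statement start: the model writes the rest of the line.
        emit(start + indent_of(text), end)
        # 2. Right-hand side of `local x =`: the model completes the expression.
        assign = LOCAL_ASSIGN.match(text)
        if assign:
            emit(start + assign.end(1), end)
        # 3. Block opener: the model writes the body up to the aligned closer.
        if BLOCK_OPENER.search(text):
            last = min(len(lines), i + 2 + max_body_lines)
            for j in range(i + 1, last):
                if lines[j].strip() and indent_of(lines[j]) <= indent_of(text):
                    if CLOSER.match(lines[j].strip()) and j > i + 1:
                        emit(offsets[i + 1], offsets[j])
                    break
    return rows
```

    Every row satisfies prefix + middle + suffix == source. That is a cheap invariant to assert in the pipeline before the rows go anywhere else.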

    Dataset format

    Use the instruction/output schema. Encode the FIM shape into the instruction:

    {
      "instruction": "Complete the Luau code. Output only the middle, between the prefix and suffix.\n\nPrefix:\nlocal function teleportPlayer(player, position)\n    if not player or not player.Character then return end\n    local humanoidRoot = player.Character:FindFirstChild('HumanoidRootPart')\n    if not humanoidRoot then return end\n    \n\nSuffix:\nend\n\nMiddle:",
      "output": "humanoidRoot.CFrame = CFrame.new(position)"
    }
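    Below is a small Python helper that produces rows in this shape from a (prefix, middle, suffix) triple. It is a sketch, not Ertas tooling; the only thing it takes from this page is the instruction text. Keeping that text in one constant is the simplest way to guarantee it never varies between rows.

```python
import json
from typing import Dict

# One fixed instruction for every row; the plugin must send the same text at
# inference time, so keep this string in a single shared place.
FIM_TEMPLATE = (
    "Complete the Luau code. Output only the middle, between the prefix and suffix."
    "\n\nPrefix:\n{prefix}\n\nSuffix:\n{suffix}\n\nMiddle:"
)


def to_training_row(prefix: str, middle: str, suffix: str) -> Dict[str, str]:
    """Encode one FIM triple in the instruction/output schema."""
    # str.format only expands the braces in the template, so Luau table
    # constructors like `{}` inside prefix or suffix are left untouched.
    return {
        "instruction": FIM_TEMPLATE.format(prefix=prefix, suffix=suffix),
        "output": middle,
    }


def to_jsonl_line(prefix: str, middle: str, suffix: str) -> str:
    """Serialise a row as one line of a JSON Lines training file."""
    return json.dumps(to_training_row(prefix, middle, suffix), ensure_ascii=False)
```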

    Three details that matter:

    • The instruction is the same on every row. The fine-tune is teaching the model that "Prefix / Suffix / Middle" means FIM; if the instruction phrasing varies, the model treats it as content. See Instruction tuning.
    • Strip the leak. If the prefix ends with humanoidRoot.CFrame = the answer is obvious from the prefix alone. Cut the prefix earlier so the model has to think. Drop rows where the middle is trivially deducible.
    • Keep the suffix. Without a suffix the task becomes "continue this code" rather than "fill this in." That is a different task and produces noticeably worse mid-line completions because the model does not anticipate how the surrounding code constrains the middle.
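    One way to automate part of the leak check is a crude filter, sketched below in Python. Its threshold is an assumption to tune against your own data. It drops rows whose middle is too short to teach anything, and rows whose middle already appears verbatim in the prefix or suffix. It will not catch every leak, such as a prefix that stops just before an obvious completion, so still spot-check a sample by hand.

```python
from typing import Dict, Iterable, List


def is_leaky(prefix: str, middle: str, suffix: str, min_chars: int = 4) -> bool:
    """Heuristic: True if the middle is too trivial or already visible."""
    core = middle.strip()
    if len(core) < min_chars:
        return True  # e.g. a lone `)` or `end`: nothing to learn
    if core in prefix or core in suffix:
        return True  # the answer is copied verbatim elsewhere in the context
    return False


def drop_leaky(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows that pass the leak heuristic."""
    return [r for r in rows if not is_leaky(r["prefix"], r["middle"], r["suffix"])]
```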

    Multi-line vs single-line middles

    About 60% of the dataset should have single-line middles (the cursor is partway through a line). These are the most frequent completions in a real editor.

    The remaining 40% should be 2-to-15-line middles (whole function bodies, conditional blocks, loops). These teach the model to scope multi-line completions sensibly. Anything longer than 15 lines is too speculative for a completion model and tends to drift; route those to a chat-style task instead.

    Never include rows where the middle runs past 15 lines. If 30-line middles slip into the dataset, the model will start producing 30-line completions in production, which is overwhelming in an editor. Cap middles at 15 lines and the model self-limits.
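    Here is a Python sketch of enforcing both rules when sampling the final set. The 60% share and the 15-line cap come from the guidance above; the helper names are hypothetical. Rows over the cap are dropped outright rather than truncated, because a truncated middle could teach the model to stop mid-block.

```python
import random
from typing import Dict, List

MAX_MIDDLE_LINES = 15      # hard cap from the guidance above
SINGLE_LINE_SHARE = 0.6    # ~60% single-line, ~40% 2-to-15-line middles


def middle_lines(row: Dict[str, str]) -> int:
    """Count lines in the middle, ignoring a trailing newline."""
    return max(1, len(row["middle"].strip("\n").splitlines()))


def balance_by_length(rows: List[Dict[str, str]], total: int, seed: int = 0) -> List[Dict[str, str]]:
    """Drop over-long middles, then sample toward a 60/40 single/multi-line split."""
    rng = random.Random(seed)
    kept = [r for r in rows if middle_lines(r) <= MAX_MIDDLE_LINES]
    single = [r for r in kept if middle_lines(r) == 1]
    multi = [r for r in kept if middle_lines(r) > 1]
    n_single = min(len(single), round(total * SINGLE_LINE_SHARE))
    n_multi = min(len(multi), total - n_single)
    picked = rng.sample(single, n_single) + rng.sample(multi, n_multi)
    rng.shuffle(picked)
    return picked
```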

    The base model

    Pick Qwen 2.5 Coder 3B from the Ertas catalogue. The reasoning:

    • It is code-specialised. The base ships with FIM training already baked in, so the fine-tune is reinforcing a known capability rather than teaching a new one.
    • At Q4_K_M, the GGUF is about 2.1 GB. That fits a Roblox Studio install footprint comfortably.
    • The 32k context window is generous for repository-aware completion (you can fit multiple related files into the prefix; see Integration).

    GPU tier: Qwen 2.5 Coder 3B trains on a T4, fitting the Free plan. The paid-plan upgrade is documented two paragraphs down (Qwen 2.5 Coder 7B on A10G).

    If you need to go smaller (older creator machines, web-based code editor target), Qwen 2.5 Coder 1.5B at Q4_K_M (~1 GB) is the next step down. Expect noticeably worse multi-line completions; single-line completions stay usable.

    If you need to go larger (you found that 3B misses too many house-convention details), Qwen 2.5 Coder 7B at Q4_K_M (~4.5 GB) is the next step up. The quality gain is real, and so is the latency cost (1.5 to 3 seconds per completion versus 0.5 to 1.5 seconds for the 3B). For an editor completion that fires on every pause in typing, that latency gap is the difference between "useful" and "annoying."

    Training config

    For 20,000 FIM rows, start with:

    | Setting | Value | Why |
    |---|---|---|
    | Schema | instruction/output | FIM encoded into the instruction |
    | Epochs | 2 | Code overfits quickly; two epochs is usually enough |
    | Learning rate | 1e-4 | Code-specialised bases want a gentler LR than generic instruction tuning |
    | LoRA rank | 16 | Captures the language idioms and the API conventions |
    | LoRA alpha | 32 | 2x rank |
    | Batch size | 4 | Code sequences are shorter on average than prose |
    | Grad accumulation | 4 | Effective batch 16 |
    | Warmup | 10% of steps | Slightly longer warmup helps stabilise on code |
    | Max sequence length | 4096 | Long enough for whole-function FIM with surrounding context |
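    For a sense of scale, the step arithmetic behind those settings is sketched below. It assumes one optimiser step per effective batch of 16 rows. Trainers differ slightly in how they count a final partial batch, so treat the numbers as estimates rather than what the Training Config picker will report.

```python
def schedule(rows: int, epochs: int, batch_size: int, grad_accum: int,
             warmup_frac: float, lora_rank: int, lora_alpha: int) -> dict:
    """Back-of-envelope step counts for a LoRA run."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = -(-rows // effective_batch)  # ceiling division
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": round(total_steps * warmup_frac),
        # Standard LoRA scales the adapter update by alpha / rank.
        "lora_scale": lora_alpha / lora_rank,
    }


print(schedule(20000, 2, 4, 4, 0.10, 16, 32))
# {'effective_batch': 16, 'steps_per_epoch': 1250, 'total_steps': 2500, 'warmup_steps': 250, 'lora_scale': 2.0}
```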

    Wall-clock time and credit cost depend on the GPU tier and dataset size. Ertas's Training Config picker shows an estimate before you press play; see Credits and usage for the current rates.

    A common mistake: using a higher learning rate (3e-4) because "more is better." Code models punish high LRs more than instruction-tuned chat models because code demands exact syntax; slightly overcooked weights and the model starts producing almost-right identifiers. Stay conservative.

    Integration: Studio plugin via local Ollama

    Roblox Studio supports plugins written in Luau, including HTTP-out plugins. The integration calls a local Ollama instance over HTTP. The user installs the Ertas-shipped Ollama bundle once; the plugin talks to http://localhost:11434 thereafter.

    -- LuauCompletePlugin.lua (excerpt)
    local HttpService = game:GetService("HttpService")
    
    local function complete(prefix: string, suffix: string): string?
        local instruction = string.format(
            "Complete the Luau code. Output only the middle, between the prefix and suffix.\n\nPrefix:\n%s\n\nSuffix:\n%s\n\nMiddle:",
            prefix, suffix
        )
    
        local body = HttpService:JSONEncode({
            model = "roblox-luau-completer",
            prompt = instruction,
            stream = false,
            options = {
                temperature = 0.1,
                top_p = 0.95,
                num_predict = 120,
                stop = {"\n\nPrefix:", "\n\nSuffix:"},
            },
        })
    
        local ok, response = pcall(function()
            return HttpService:PostAsync(
                "http://localhost:11434/api/generate",
                body,
                Enum.HttpContentType.ApplicationJson
            )
        end)
    
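        if not ok then return nil end
        -- Decoding can also fail on a malformed body, so guard it too.
        local decoded, data = pcall(HttpService.JSONDecode, HttpService, response)
        if not decoded or type(data) ~= "table" then return nil end
        return data.response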
    end

    Key choices:

    • Temperature 0.1: code completion wants near-deterministic output. The same prefix and suffix should produce essentially the same middle.
    • num_predict: 120: caps the completion at roughly 6 to 12 lines. That sits a little under the 15-line cap on training middles, so the longest completions can be cut short; raise num_predict if users see truncated function bodies.
    • stop sequences: protect against the model continuing into a synthetic next FIM row. If the model emits a "Prefix:" or "Suffix:" marker, generation stops immediately.
    • pcall wrapper: PostAsync throws when the request fails (for example, when Ollama is not running), so the call is wrapped in pcall. A failed request returns nil and the user's edit carries on uninterrupted.

    Picking what to send

    The naive approach sends everything before the cursor as the prefix and an empty suffix. A better approach sends roughly 2,000 characters before the cursor as the prefix and 500 characters after it as the suffix. In longer files, it also prepends the top-of-file local declarations to the prefix so the model can see imported services and modules.

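    -- Leading blank lines, comments (incl. --!strict) and `local Name =` / `local Name: T =` imports.
    local function readHeader(source: string): string
        local lines = {}
        for line in (source .. "\n"):gmatch("(.-)\n") do
            if not (line:match("^%s*$") or line:match("^%-%-") or line:match("^local [%w_]+%s*[:=]")) then
                break
            end
            table.insert(lines, line)
        end
        return table.concat(lines, "\n")
    end
    
    local function buildContext(script: LuaSourceContainer, cursorPos: number): (string, string)
        local source = script.Source
        local suffix = source:sub(cursorPos + 1, cursorPos + 500)
        if cursorPos <= 2000 then
            return source:sub(1, cursorPos), suffix -- short file: send everything before the cursor
        end
        -- Long file: the import header plus the last ~1,800 characters before the cursor.
        local tail = source:sub(cursorPos - 1800, cursorPos)
        return readHeader(source) .. "\n\n-- ...\n\n" .. tail, suffix
    end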

    The header trick is the easiest single quality improvement. Most Luau completions need to know which Roblox services and modules are imported, and those declarations conventionally sit at the top of the file. Without the trick, completions occasionally invent service names; with it, they use the right ones.
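    If you also apply this windowing when generating training rows, the prefixes the model learns from should look like the ones the plugin sends. Below is a Python sketch of the same logic. It uses a 0-based cursor offset and hypothetical names, and its window sizes match the Luau version to within one character.

```python
import re
from typing import Tuple

# Blank lines, comments (including --!strict) and `local Name =` / `local Name: T =`.
HEADER_LINE = re.compile(r"^(\s*$|--|local [A-Za-z_]\w*\s*[:=])")


def read_header(source: str) -> str:
    """Return the leading run of import-style lines at the top of a file."""
    kept = []
    for line in source.split("\n"):
        if not HEADER_LINE.match(line):
            break
        kept.append(line)
    return "\n".join(kept)


def build_context(source: str, cursor: int, max_prefix: int = 2000,
                  tail_chars: int = 1800, suffix_chars: int = 500) -> Tuple[str, str]:
    """Prefix/suffix window around a 0-based cursor offset."""
    suffix = source[cursor:cursor + suffix_chars]
    if cursor <= max_prefix:
        return source[:cursor], suffix
    tail = source[max(0, cursor - tail_chars):cursor]
    return read_header(source) + "\n\n-- ...\n\n" + tail, suffix
```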

    Probe set

    Ten prompts that exercise the language coverage. Run them by pasting the prefix and suffix into your harness; the column on the right is what good looks like.

    | # | Scenario (prefix, cursor at the end) | Pass criteria |
    |---|---|---|
    | 1 | Mid-statement: humanoid.Health = humanoid.MaxHealth (suffix: \nend) | Completes with - takeDamage or similar arithmetic; not a new statement |
    | 2 | Function body: local function isAlive(humanoid)\n (suffix: end) | One-line body returning humanoid.Health > 0; no multi-line ceremony |
    | 3 | Studio API call: local part = workspace:FindFirstChild( | Closes with a name argument and recovery code; uses workspace, not game.Workspace |
    | 4 | Wait pattern: task.wait( | Completes with a numeric argument; does NOT suggest the deprecated global wait() |
    | 5 | Type annotation: local function move(player: Player, position: | Completes with Vector3) or CFrame), not generic types |
    | 6 | Module require: local Module = require(script.Parent. | Completes with a plausible module name from the surrounding context |
    | 7 | Event hookup: player.CharacterAdded:Connect(function(character)\n | Writes a short handler body; uses character, not Character |
    | 8 | Error handling: local ok, result = pcall(function()\n | Single-line wrapped call inside; pcall, not xpcall, for a basic case |
    | 9 | Strict-mode header: --!strict\n\n (cursor on a new line) | Starts with local function ..., not function ... (idiomatic Luau prefers local) |
    | 10 | API-version drift: local players = game:GetService( | Closes with "Players"), not the legacy game.Players lookup |

    Most of these should pass cleanly on a properly trained model. Probe 4 (Luau's task.wait versus the deprecated global wait()) is the canonical pass/fail: a base model typically gets it wrong, and a properly fine-tuned model gets it right. Probe 10 (fetching services through the modern GetService API) is the second-most-useful tell.
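    Below is a minimal Python harness for running the probes against the same local Ollama endpoint the plugin uses. It assumes the model is registered in Ollama as roblox-luau-completer, as in the plugin excerpt. It mirrors the plugin's prompt template and sampling options. The checks for probes 4 and 10 are illustrative regular expressions, not a complete grader.

```python
import json
import re
import urllib.request
from typing import Any, Dict

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "roblox-luau-completer"  # the model name used in the plugin excerpt
FIM_TEMPLATE = (
    "Complete the Luau code. Output only the middle, between the prefix and suffix."
    "\n\nPrefix:\n{prefix}\n\nSuffix:\n{suffix}\n\nMiddle:"
)


def build_payload(prefix: str, suffix: str) -> Dict[str, Any]:
    """Same prompt and sampling options as the Studio plugin."""
    return {
        "model": MODEL,
        "prompt": FIM_TEMPLATE.format(prefix=prefix, suffix=suffix),
        "stream": False,
        "options": {
            "temperature": 0.1,
            "top_p": 0.95,
            "num_predict": 120,
            "stop": ["\n\nPrefix:", "\n\nSuffix:"],
        },
    }


def complete(prefix: str, suffix: str, timeout: float = 30.0) -> str:
    """POST one probe to the local Ollama server and return the middle."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prefix, suffix)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


def passes_probe_4(middle: str) -> bool:
    """task.wait( -> numeric argument, and no bare legacy wait() call."""
    numeric = re.match(r"\s*\d", middle) is not None
    legacy = re.search(r"(?<![\w.:])wait\s*\(", middle) is not None
    return numeric and not legacy


def passes_probe_10(middle: str) -> bool:
    """game:GetService( -> closes with "Players") rather than game.Players."""
    return re.match(r"""\s*["']Players["']\)""", middle) is not None
```

    With Ollama running and the model loaded, a probe run looks like passes_probe_4(complete("task.wait(", "")). The other eight probes need either similar checks or a human reading the output against the table.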

    Limits

    • No project-wide awareness. The model sees the current file and the small header you stitch in. It does not know about cross-file types, calls into other scripts, or what is in ServerScriptService. For multi-file refactors, escalate to a frontier model with proper retrieval.
    • No chat memory. Completions are stateless. If the user accepts a completion and edits it, the next completion does not know what the user edited. That is fine for completion; it is wrong if you try to use this model for chat-style help.
    • Style drift on rare patterns. Patterns that appear fewer than 100 times in the dataset will not be reliably learned. If a house convention is rare in real code, you may need to seed the dataset with extra synthetic examples of it.
    • Latency spikes on long contexts. The 2,000-character context is a good default; bumping it to 8,000 to fit more of the file roughly triples first-token latency and starts to feel slow. Tune for the user's hardware floor.
    • No symbol resolution. The model does not know what types exist in your project. It will sometimes fabricate type names that look plausible. Treat completions as suggestions, not authoritative; the user's eyes are the final filter.
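    To find which house conventions fall under the 100-occurrence threshold before training, count them across the source files. The Python sketch below does this with regular expressions. The three patterns are illustrative examples drawn from the probe set, not a recommended list. Regex counting only approximates what a parser would report.

```python
import re
from typing import Dict, Iterable, List

# Illustrative conventions; swap in the patterns from your own style guide.
CONVENTIONS: Dict[str, re.Pattern] = {
    "task.wait": re.compile(r"\btask\.wait\s*\("),
    "GetService": re.compile(r":GetService\s*\(\s*[\"']"),
    "typed local function": re.compile(r"\blocal function \w+\([^)]*:\s*\w"),
}


def convention_counts(texts: Iterable[str],
                      conventions: Dict[str, re.Pattern] = CONVENTIONS) -> Dict[str, int]:
    """Count occurrences of each convention across the given source texts."""
    counts = {name: 0 for name in conventions}
    for text in texts:
        for name, pattern in conventions.items():
            counts[name] += len(pattern.findall(text))
    return counts


def needs_seeding(counts: Dict[str, int], threshold: int = 100) -> List[str]:
    """Conventions seen fewer than `threshold` times, sorted by name."""
    return sorted(name for name, n in counts.items() if n < threshold)
```

    Count over the original scripts rather than the FIM rows, since overlapping rows cut from one file would inflate the totals.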

    What's next