Picking a base model
How to choose between Llama, Mistral, Phi, Gemma, Qwen, and the rest of the catalog for your task and target device.
The base model is the single decision that most affects training time, on-device size, license terms, and final quality. Studio gives you two ways to pick one: the curated catalog in the model picker, or a direct Hugging Face URL.
This page is a practical decision guide. For a full feature-by-feature comparison, see the models index.
Open the model picker
On the canvas, click the + under the red Base Model leg of any Action Module. The picker opens.
You will see three regions:
- Search: type to filter the catalog by name, family, or description.
- HuggingFace Model: paste a Hugging Face URL or `org/model` identifier to validate a model that is not in the catalog.
- Model list: the curated catalog, sorted by family. Each card shows the family, parameter count, and a lock badge if your plan cannot run it.
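The two accepted input forms for the HuggingFace Model box can be reduced to a single repository identifier. The sketch below shows one way to do that. `to_repo_id` is a hypothetical helper for illustration only, and the accepted character set is an assumption rather than Studio's actual validation rule.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for an "org/model" identifier (assumed, not Studio's rule).
_REPO_ID = re.compile(r"^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$")


def to_repo_id(text: str) -> str:
    """Return "org/model" from either a Hugging Face URL or a bare identifier."""
    text = text.strip()
    if text.startswith(("http://", "https://")):
        parsed = urlparse(text)
        if parsed.netloc not in ("huggingface.co", "www.huggingface.co"):
            raise ValueError(f"not a Hugging Face URL: {text!r}")
        # Keep only the first two path segments; drop suffixes like /tree/main.
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) < 2:
            raise ValueError(f"URL does not name a model: {text!r}")
        text = "/".join(parts[:2])
    if not _REPO_ID.match(text):
        raise ValueError(f"expected org/model, got {text!r}")
    return text


if __name__ == "__main__":
    print(to_repo_id("https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main"))
    print(to_repo_id("mistralai/Mistral-7B-Instruct-v0.3"))
```

Normalising early means the rest of a pipeline only ever has to handle one form of the identifier.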
Match the model to your task
Picking a base is mostly about three things: task fit, size, and license.
Task fit
| If you want to... | Try... | GPU tier |
|---|---|---|
| Build a customer-support agent or chat assistant | Llama 3.2 3B Instruct, Phi-3 mini, then Mistral 7B Instruct | T4 for the smaller two, A10G for 7B |
| Specialise on a non-English language | Gemma 3 2B (broad coverage), Qwen 2.5 7B for harder cases | T4 for small Gemma, A10G for 7B Qwen |
| Build a code-completion model | StarCoder 2 (3B), Code Llama 7B | T4 for the 3B, A10G for 7B |
| Build a long-context summariser | Qwen 2.5 / 3 (up to 128K), Llama 3.1 (128K) | A10G |
| Run on a phone or low-spec laptop | Phi-3 mini (3.8B), Gemma 3 2B / 4B, SmolLM, TinyLlama | T4 |
| Pick the smallest possible "good enough" model | Phi-3.5 mini, Gemma 3 1B, SmolLM 1.7B | T4 |
The picker description for each model summarises its strengths. Hover any card for license info before you commit.
Size and device fit
Fine-tuned models ship as GGUF files quantised to 4-bit. As a rough guide:
| Parameter count | GGUF size (Q4_K_M) | Reasonable target devices |
|---|---|---|
| 1B to 2B | 0.6 to 1.5 GB | Phones, smartwatches, embedded |
| 3B to 4B | 1.5 to 2.5 GB | Phones, low-spec laptops |
| 7B to 8B | 4 to 5 GB | Mid-range laptops, M-series Macs, desktops |
| 13B to 14B | 7 to 9 GB | High-spec laptops, desktops, servers |
| 22B+ | 12+ GB | Desktops with discrete GPU, servers |
Pick the smallest model that gets your task done well. A well-tuned 3B often outperforms a generic 8B on a narrow task. Smaller models also train faster, cost fewer credits, and ship into more places.
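The sizes in the table follow from simple arithmetic: parameter count times average bits per weight, divided by eight. The sketch below assumes Q4_K_M averages roughly 4.85 bits per weight. That figure is an approximation that varies by architecture, and `estimate_gguf_gb` is a hypothetical helper, not a Studio function.

```python
# Assumption: Q4_K_M averages about 4.85 bits per weight. Real files differ
# by architecture because some tensors are kept at higher precision.
BITS_PER_WEIGHT_Q4_K_M = 4.85


def estimate_gguf_gb(params_billions: float,
                     bits_per_weight: float = BITS_PER_WEIGHT_Q4_K_M) -> float:
    """Rough GGUF file size in GB (10^9 bytes) for a given parameter count."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (1.0, 3.8, 7.0, 14.0):
        print(f"{size:>5.1f}B parameters -> about {estimate_gguf_gb(size):.1f} GB")
```

Treat the result as a planning estimate. The exported file is the only reliable measure of on-device size.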
License
Every catalog model has a license that survives fine-tuning. The model picker shows the license badge, and each model entry explains the terms. Verified summaries:
- Llama 3 / 3.1 / 3.3 / 4 (Meta): Llama Community License. Royalty-free commercial use. 700 million monthly-active-user threshold: products at this scale must request a separate license from Meta, who may grant or refuse at their discretion. Attribution ("Built with Llama") required.
- Mistral 7B, Mixtral 8x7B: Apache 2.0. Very permissive. Commercial use is unrestricted; redistributions must include the Apache 2.0 license text and attribute Mistral AI.
- Gemma 3: Gemma Terms of Use. Commercial use is allowed, but the custom license includes a Prohibited Use Policy with broad clauses (harm to minors, attacks on critical infrastructure, etc.). Some enterprise legal teams flag the ambiguity.
- Gemma 4: Apache 2.0. Google moved Gemma 4 to a true open-source license, removing the Gemma 3 ambiguity.
- Phi-3 / 3.5 / 4 (Microsoft): MIT. As permissive as it gets.
- Qwen 2.5: Apache 2.0 for all sizes except 3B and 72B, which use a separate Qwen license that requires arrangement with Alibaba for commercial deployment.
- Qwen 3: Apache 2.0 across the entire lineup, no 3B/72B exception.
- Code Llama, StarCoder, DeepSeek, Falcon, and others: read the model card. Open-source code models tend to be permissive but have specific attribution and notice clauses.
If you are shipping a commercial product, read the license before training. Switching base models later is cheap; switching after release is expensive. The model entries page links the canonical license document for each family.
GPU tier and plan gating
Two GPU tiers are available:
- T4 (16 GB VRAM): handles 4-bit fine-tunes of models under 5B total parameters.
- A10G (24 GB VRAM): required for any model 5B or larger total parameters. A10G also supports longer contexts and larger batches.
The cutoff is based on total parameter count, not active or effective parameters. On a T4, a run with a model of 5B or more total parameters runs out of memory (OOM) before the first training step.
Gemma 4 E2B is in the A10G-required bucket despite the "E2B" naming. The "E" stands for Effective: E2B has 2.3 billion effective compute parameters but 5.1 billion total parameters once its Per-Layer Embedding lookup tables are counted. The lookup tables sit in VRAM during training even though they contribute little compute at inference time, so the training memory footprint exceeds what a T4 can hold. Gemma 4 E4B has a larger total again and also requires A10G.
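The tier rule described above can be written as a single comparison on total parameter count. The sketch below is a minimal illustration. `min_gpu_tier` is a hypothetical helper, and the only inputs it uses are the 5B cutoff and the parameter counts stated on this page.

```python
# Cutoff from this page: models under 5B total parameters fit on T4.
T4_MAX_TOTAL_PARAMS_B = 5.0


def min_gpu_tier(total_params_billions: float) -> str:
    """Return the minimum GPU tier for a 4-bit fine-tune.

    Pass the TOTAL parameter count, not the effective or active count.
    """
    return "T4" if total_params_billions < T4_MAX_TOTAL_PARAMS_B else "A10G"


if __name__ == "__main__":
    print("Phi-3 mini (3.8B total):", min_gpu_tier(3.8))
    # Gemma 4 E2B: 2.3B effective, but 5.1B total, so the total decides.
    print("Gemma 4 E2B (5.1B total):", min_gpu_tier(5.1))
    print("7B model:", min_gpu_tier(7.0))
```

Passing the effective count for Gemma 4 E2B (2.3) would wrongly return T4, which is why the function is documented to take the total.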
Studio enforces this at the picker level: catalog models in this bucket show a lock badge for plans without A10G access. If you bring such a model via Hugging Face URL on a T4-only plan, the run will be blocked or will fail with an OOM at start.
Each model in the catalog declares its minimum GPU tier. For free-tier users, models that require A10G show a lock badge that names the minimum required plan. Clicking a locked card opens an upgrade prompt rather than queuing a job that would fail later.
The Training Config picker will only let you select a GPU tier that both your plan allows and the chosen model supports. If you change the base model after configuring the Training Config, Ertas normalises the tier upward when required.
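The selection rule above amounts to: take the higher of the requested tier and the model's minimum tier, then check that the plan allows it. The sketch below is an assumption about how such logic could look, not Ertas's implementation. `resolve_tier` and the choice to raise an error for a disallowed tier are hypothetical.

```python
# Tiers in ascending order of capability, as listed on this page.
TIER_ORDER = ["T4", "A10G"]


def resolve_tier(selected: str, model_min_tier: str, plan_tiers: set) -> str:
    """Bump the selected tier up to the model's minimum, then check the plan."""
    # Normalise upward: never go below what the model requires.
    needed = max(selected, model_min_tier, key=TIER_ORDER.index)
    if needed not in plan_tiers:
        # Assumed behaviour: refuse rather than queue a run that would fail.
        raise PermissionError(f"plan does not include the {needed} tier")
    return needed


if __name__ == "__main__":
    paid = {"T4", "A10G"}
    # The config was set to T4, then the base model changed to one needing A10G.
    print(resolve_tier("T4", "A10G", paid))
    print(resolve_tier("T4", "T4", {"T4"}))
```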
Bring your own model from Hugging Face
If the catalog does not have the model you want, paste a Hugging Face URL into the HuggingFace Model box. Ertas validates the model in two steps:
Architecture check
Ertas fetches the model config from Hugging Face and confirms it is one of the architectures the training pipeline supports (Llama-style decoders, Mistral, Qwen, Phi, Gemma, and a few others).
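A check of this kind can be approximated locally before you paste a URL. Hugging Face model configs usually carry an `architectures` list naming the model class. The sketch below compares that list against an allow-list. The set shown is illustrative only and is not Ertas's actual supported list, and `is_supported_architecture` is a hypothetical helper.

```python
# Illustrative allow-list (an assumption). The names are model classes from
# the transformers library; the pipeline's real list may differ.
SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "Qwen2ForCausalLM",
    "Phi3ForCausalLM",
    "GemmaForCausalLM",
    "Gemma2ForCausalLM",
}


def is_supported_architecture(config: dict) -> bool:
    """Return True if any architecture named in the config is on the allow-list."""
    architectures = config.get("architectures") or []
    return any(name in SUPPORTED_ARCHITECTURES for name in architectures)


if __name__ == "__main__":
    # A config as it would look after loading a model's config.json.
    config = {"architectures": ["MistralForCausalLM"], "model_type": "mistral"}
    print(is_supported_architecture(config))
```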
Compatibility check
Ertas reports one of three outcomes: green check (compatible with fine-tuning), amber warning (architecture not recognised, but may still work with a generic ChatML template), or red error (cannot train).
If the validator returns an amber warning, you can still queue the run, but you must check the consent box first. Training failures on unverified models are not refunded in credits. Use a small, low-step run to sanity-check the architecture before committing a long training job.
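The three outcomes and the consent rule can be modelled as a small state check. The sketch below is a hypothetical illustration of the rules stated above. The `Outcome` enum and `can_queue` function are not part of any Ertas API.

```python
from enum import Enum


class Outcome(Enum):
    GREEN = "compatible with fine-tuning"
    AMBER = "architecture not recognised"
    RED = "cannot train"


def can_queue(outcome: Outcome, consent_given: bool = False) -> bool:
    """Decide whether a run may be queued for a validated model."""
    if outcome is Outcome.GREEN:
        return True
    if outcome is Outcome.AMBER:
        # Unverified models need explicit consent before queuing.
        return consent_given
    return False  # RED: never queue


if __name__ == "__main__":
    print(can_queue(Outcome.AMBER))
    print(can_queue(Outcome.AMBER, consent_given=True))
```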
When you add an HF model, Ertas passes the Hugging Face URL through to the training job. The base model node persists the URL in its data so subsequent runs reuse it without re-validating.
Common picking patterns
A few decision shortcuts that work well in practice:
- First-ever Ertas run on the Free plan: pick a model under 5B so it runs on T4. Phi-3 mini (3.8B), Llama 3.2 3B Instruct, or Gemma 3 2B are all strong defaults: cheap to train, well-documented, and shippable to phones.
- Phone-deployable assistant: pick Phi-3 mini or Gemma 3 2B. Both fit T4 and quantise to under 2.5 GB.
- Strict on-device latency budget (under 50 ms per token on M-class hardware): pick a 3B or smaller.
- Multi-lingual support agent on a paid plan: pick Qwen 2.5 7B Instruct on A10G.
- Code completion sidecar: StarCoder 2 (3B) on T4, or Code Llama 7B on A10G if you need more capability.
- You already have a fine-tune you want to improve: keep the same base model so adapter merges and evaluations stay comparable.
If you are unsure and on a paid plan, start with a 7B Instruct model on A10G. Most tasks land somewhere between "this 3B is plenty" and "we need to step up to 13B," and 7B is the cheapest place to discover which side of that line you are on.
If you are on the Free plan, start with Phi-3 mini. It is the most capable model that fits T4 and trains in well under an hour on small datasets.