Fine-Tune SmolLM with Ertas
HuggingFace's family of ultra-compact language models in 135M, 360M, and 1.7B sizes, trained on a high-quality data mixture anchored by the synthetic Cosmopedia dataset and designed for on-device AI applications with minimal resource requirements.
Overview
SmolLM is a family of compact language models developed by HuggingFace, specifically designed for deployment on edge devices, mobile phones, and resource-constrained environments. The family includes three sizes: 135M, 360M, and 1.7B parameters. Despite their tiny footprints, SmolLM models are surprisingly capable, outperforming many larger models on per-parameter efficiency metrics.
The models were trained on a carefully curated data mixture anchored by Cosmopedia, a massive synthetic dataset of textbook-style content generated by larger models. This educational content, combined with filtered web data and code, produces models with strong foundational knowledge relative to their size. SmolLM2, the current generation, was trained on approximately 11 trillion tokens for the 1.7B model, an exceptionally high data-to-parameter ratio that maximizes the information density of the model's limited parameters.
Architecturally, SmolLM uses a standard dense transformer decoder scaled down to its target sizes. In the SmolLM2 generation, the 135M model has 30 layers with a hidden dimension of 576, the 360M has 32 layers with a hidden dimension of 960, and the 1.7B has 24 layers with a hidden dimension of 2048. The 135M and 360M models use grouped-query attention, all models use RoPE positional embeddings, and SmolLM2 supports context windows up to 8K tokens.
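These shapes are easy to verify directly from the published configurations. The sketch below uses the transformers library and the SmolLM2 repository IDs on the HuggingFace Hub:

```python
from transformers import AutoConfig

# SmolLM2 repositories on the HuggingFace Hub
model_ids = [
    "HuggingFaceTB/SmolLM2-135M",
    "HuggingFaceTB/SmolLM2-360M",
    "HuggingFaceTB/SmolLM2-1.7B",
]

for model_id in model_ids:
    cfg = AutoConfig.from_pretrained(model_id)
    print(
        f"{model_id}: {cfg.num_hidden_layers} layers, "
        f"hidden size {cfg.hidden_size}, "
        f"{cfg.num_attention_heads} attention heads "
        f"({cfg.num_key_value_heads} KV heads)"
    )
```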
SmolLM models are released under the Apache 2.0 license. HuggingFace provides the models in multiple formats including ONNX (for cross-platform deployment), CoreML (for Apple devices), and standard safetensors, making SmolLM one of the most deployment-flexible model families available.
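For a first smoke test, the safetensors weights load directly through transformers. A minimal generation sketch using the 135M instruct variant (the prompt text is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Build a chat-style prompt with the model's built-in chat template
messages = [{"role": "user", "content": "Summarize what RoPE embeddings do in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```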
Key Features
The Cosmopedia training dataset is a key differentiator for SmolLM. This synthetic dataset contains billions of tokens of textbook-quality educational content covering science, mathematics, history, technology, and general knowledge. By training on curated educational content rather than raw web text, SmolLM models develop more structured knowledge representations than similarly-sized models trained on unfiltered data, leading to better reasoning and factual accuracy.
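Cosmopedia itself is public on the Hub, so it is easy to inspect the kind of text the models were trained on. A sketch that streams a few samples (the "stories" subset is one of the published configs):

```python
from datasets import load_dataset

# Stream so the multi-billion-token dataset is not downloaded in full
ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample["text"][:200], "...")
    if i >= 2:
        break
```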
Multi-format model availability makes SmolLM exceptionally easy to deploy across platforms. HuggingFace provides ONNX exports for cross-platform deployment, CoreML packages for iOS and macOS integration, TensorFlow Lite for Android, and WebAssembly builds for browser deployment. This means a single SmolLM model can be deployed on iOS apps, Android apps, desktop applications, web pages, and server backends using native runtime optimizations for each platform.
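As one example of the multi-format story, the ONNX path can be driven from Python through HuggingFace's optimum package, which wraps onnxruntime. A minimal sketch (export=True converts the checkpoint on the fly rather than fetching a prebuilt ONNX artifact):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```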
The 135M model is particularly noteworthy — at under 300MB at FP16, it is one of the smallest coherent language models available. It can run on devices with as little as 512MB of free RAM, opening up deployment scenarios on ultra-low-resource devices, feature phones, and deeply embedded systems. While its capabilities are limited compared to billion-parameter models, it handles focused tasks like classification, simple extraction, and template-based generation effectively.
Fine-Tuning with Ertas
SmolLM models are the fastest and most resource-efficient models to fine-tune in Ertas Studio. The 135M model can be fully fine-tuned (not just LoRA) with as little as 1-2GB of VRAM, which covers virtually any GPU, including older laptop GPUs. The 360M model requires 2-3GB for full fine-tuning, and the 1.7B model requires 3-5GB for QLoRA or 6-8GB for full fine-tuning.
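Ertas Studio drives this workflow from its UI, but for a sense of how little code a full fine-tune at this scale involves, here is a sketch using TRL's SFTTrainer; the dataset and hyperparameters are illustrative, not Ertas defaults:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative instruction dataset; substitute your own data
dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="smollm2-135m-sft",
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",  # full fine-tune, no LoRA adapter
    train_dataset=dataset,
    args=config,
)
trainer.train()
```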
The small model sizes enable a unique fine-tuning workflow: you can afford to try many configurations. Run 10-20 experiments in a single afternoon, varying dataset composition, learning rate, training duration, and LoRA rank. This rapid iteration produces a well-optimized model much faster than is possible with larger models, where each training run takes hours.
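That sweep-style workflow reduces to a simple loop over configurations. A sketch reusing the TRL setup above, with a hypothetical grid of learning rates and epoch counts:

```python
import itertools

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Small slice for fast experimental runs
dataset = load_dataset("trl-lib/Capybara", split="train[:2000]")

# Hypothetical grid; at 135M scale each run finishes in minutes
learning_rates = [1e-4, 3e-4, 1e-3]
epoch_counts = [1, 2]

for lr, epochs in itertools.product(learning_rates, epoch_counts):
    run_dir = f"smollm2-135m-lr{lr}-ep{epochs}"
    trainer = SFTTrainer(
        model="HuggingFaceTB/SmolLM2-135M",
        train_dataset=dataset,
        args=SFTConfig(output_dir=run_dir, learning_rate=lr, num_train_epochs=epochs),
    )
    trainer.train()
```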
After fine-tuning, Ertas Studio exports to GGUF format. SmolLM GGUF files are tiny: the 135M at Q4_K_M is approximately 100MB, the 360M is approximately 230MB, and the 1.7B is approximately 1GB. These can be bundled directly into applications, distributed through app stores, or included in container images with negligible size impact. Deploy through Ollama for local API access or integrate directly via llama.cpp's library interface.
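For direct integration, the exported GGUF loads through llama.cpp's Python bindings (llama-cpp-python). A minimal sketch; the file name is an example and should point at whatever Ertas Studio exported:

```python
from llama_cpp import Llama

# Example path; point this at your exported GGUF file
llm = Llama(model_path="smollm2-135m-finetuned.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Classify the sentiment of this review: 'Great battery life, terrible screen.'\nSentiment:",
    max_tokens=8,
    temperature=0.0,  # deterministic output suits focused tasks
)
print(output["choices"][0]["text"].strip())
```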
Use Cases
SmolLM models are designed for on-device AI where the model must ship as part of the application. Mobile apps that need offline text processing, browser extensions with built-in AI features, desktop applications with integrated assistants, and IoT devices with local intelligence all benefit from SmolLM's minimal footprint. The model files are small enough to download over cellular connections and store on mobile devices without significant storage impact.
Focused NLP tasks are SmolLM's sweet spot: text classification, sentiment analysis, entity extraction, language detection, simple summarization, and template-based generation. Fine-tuned on task-specific data, SmolLM models can match the accuracy of much larger models on narrow tasks while running orders of magnitude faster and cheaper. Many production systems use SmolLM for high-throughput classification and routing tasks.
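A routing task of that kind reduces to a short, constrained generation call. The sketch below uses the 360M instruct model; the label set and prompt are hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

LABELS = ["billing", "technical", "account", "other"]  # hypothetical routing labels

def route(ticket: str) -> str:
    messages = [{
        "role": "user",
        "content": f"Classify this support ticket as one of {LABELS}: {ticket}\n"
                   "Answer with a single word.",
    }]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(inputs, max_new_tokens=4, do_sample=False)
    answer = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True).strip().lower()
    # Fall back to "other" if the model wanders off the label set
    return answer if answer in LABELS else "other"

print(route("I was charged twice for my subscription this month."))
```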
SmolLM is also valuable for privacy-sensitive applications where data cannot leave the device. On-device text analysis for health apps, financial apps, and messaging apps can use SmolLM to process sensitive information locally without any network communication. The model's small size means it can run as a background service without impacting the user experience.
Hardware Requirements
SmolLM 135M at Q4_K_M requires approximately 100MB of RAM — runnable on virtually any computing device manufactured in the last decade. The 360M model needs approximately 230MB, and the 1.7B needs approximately 1GB. Even at FP16, the requirements are minimal: 270MB (135M), 720MB (360M), and 3.4GB (1.7B). These are among the absolute lowest requirements for any language model capable of coherent generation.
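These figures follow from parameter count times bytes per weight, plus runtime and KV-cache overhead. A back-of-the-envelope calculator (the 4.5 bits/weight figure for Q4_K_M is an approximation):

```python
def weight_memory_mb(params_millions: float, bits_per_weight: float) -> float:
    """Raw weight memory in MB, ignoring KV cache and runtime overhead."""
    return params_millions * 1e6 * bits_per_weight / 8 / 1e6

for name, params in [("135M", 135), ("360M", 360), ("1.7B", 1700)]:
    fp16 = weight_memory_mb(params, 16)
    q4 = weight_memory_mb(params, 4.5)  # Q4_K_M averages roughly 4.5 bits/weight
    print(f"{name}: ~{fp16:.0f} MB at FP16, ~{q4:.0f} MB at Q4_K_M (weights only)")
```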
Inference speed is exceptionally fast. The 135M model on a modern CPU generates 100+ tokens per second. The 1.7B model on an RTX 4090 generates 100+ tokens per second at Q4_K_M, and 30-50 tokens per second on CPU. On mobile devices, the 135M and 360M models provide real-time inference with latencies under 50ms per token.
For fine-tuning in Ertas Studio, the 135M needs 1-2GB VRAM (full fine-tuning), the 360M needs 2-3GB, and the 1.7B needs 3-5GB with QLoRA or 6-8GB for full fine-tuning. Complete training runs finish in minutes for the smaller models, enabling extremely rapid iteration.