Fine-Tune TinyLlama with Ertas

    A compact 1.1-billion parameter model trained on 3 trillion tokens — far more data than typical for its size — delivering surprisingly capable performance for edge deployment, mobile applications, and resource-constrained environments.

    Overview

    TinyLlama is a 1.1-billion parameter language model developed by the TinyLlama open-source project, led by Peiyuan Zhang at the Singapore University of Technology and Design. What makes TinyLlama unique is not its architecture but its training approach: the model was trained on approximately 3 trillion tokens, exceeding the 2 trillion used to train Llama 2 7B despite being more than 6x smaller. This extensive training relative to model size goes far beyond the Chinchilla-optimal ratio of roughly 20 tokens per parameter and pushes the boundaries of what tiny models can achieve.

    TinyLlama uses the same architecture as Llama 2 but scaled down: 22 transformer layers, a hidden dimension of 2048, 32 attention heads, and grouped-query attention with 4 key-value heads. The model supports a 2K token context window and uses the Llama tokenizer with a 32K vocabulary. Despite its small size, TinyLlama demonstrates coherent text generation, basic reasoning, and useful instruction-following when fine-tuned.
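
    These numbers are easy to confirm from the published model configuration. Below is a minimal sketch, assuming the transformers library and the public TinyLlama/TinyLlama-1.1B-Chat-v1.0 checkpoint on Hugging Face:

    ```python
    # Inspect TinyLlama's architecture directly from its Hugging Face config.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    print(config.num_hidden_layers)        # 22 transformer layers
    print(config.hidden_size)              # 2048 hidden dimension
    print(config.num_attention_heads)      # 32 attention heads
    print(config.num_key_value_heads)      # 4 key-value heads (GQA)
    print(config.max_position_embeddings)  # 2048-token context window
    print(config.vocab_size)               # 32000-entry Llama vocabulary
    ```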

    The project was designed to be practical and accessible. Training was completed on 16 A100-40GB GPUs over approximately 90 days, and the team released intermediate checkpoints throughout training so researchers could study the training dynamics. The model's small size means it can run on virtually any computing device, from smartphones to microcontrollers.

    TinyLlama is released under the Apache 2.0 license. The TinyLlama Chat variant, fine-tuned on a mix of conversational datasets, provides a compact chatbot that runs on minimal hardware. The project has inspired a wave of research into training tiny models with large data budgets.

    Key Features

    TinyLlama's most notable feature is its data-to-parameter ratio. At 3 trillion tokens for 1.1B parameters, the model has seen roughly 2,700 tokens per parameter during training — far more than the typical 20-100 tokens per parameter seen in most models. This extensive training pushes the model to extract maximum utility from its limited parameter count, producing a model that punches significantly above its weight class on many tasks.
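
    The arithmetic behind that ratio is worth seeing once. A quick sketch:

    ```python
    # Tokens-per-parameter ratio versus the Chinchilla-optimal heuristic.
    params = 1.1e9   # TinyLlama parameter count
    tokens = 3e12    # training tokens

    print(round(tokens / params))       # ~2727 tokens per parameter
    print(f"{20 * params / 1e9:.0f}B")  # Chinchilla-optimal budget: ~22B tokens
    ```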

    The model's extreme compactness enables deployment scenarios that are impossible for larger models. At full FP16 precision, TinyLlama requires only 2.2GB of RAM. With INT4 quantization, this drops to approximately 700MB. This means TinyLlama can run on Raspberry Pi devices, smartphones with 2GB+ RAM, WebAssembly environments in browsers, and even some microcontroller platforms with external RAM.
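
    These footprints follow almost directly from parameter count times bytes per weight. The sketch below estimates weight memory only, ignoring activation and KV-cache overhead:

    ```python
    # Rough weight-memory estimate for a 1.1B-parameter model.
    PARAMS = 1.1e9

    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        gb = PARAMS * bits / 8 / 1e9
        print(f"{name}: ~{gb:.1f} GB")
    # FP16: ~2.2 GB, INT8: ~1.1 GB, INT4: ~0.6 GB. Real GGUF files store
    # per-block scales, which is why Q4_K_M lands nearer 0.7 GB.
    ```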

    TinyLlama serves as an important research artifact for studying scaling laws. The extensive training and publicly released intermediate checkpoints allow researchers to study how model capabilities emerge and evolve during training at a scale that is computationally accessible for academic labs. The project demonstrates that small models continue to improve when given far more data than they typically receive.

    Fine-Tuning with Ertas

    TinyLlama is the most accessible model for fine-tuning in Ertas Studio. With QLoRA, fine-tuning requires as little as 2-4GB VRAM — this runs on virtually any GPU, including older models like the GTX 1060 6GB or integrated graphics with sufficient shared memory. Full LoRA (without quantization) requires only 3-5GB VRAM, and even full fine-tuning is feasible on consumer GPUs with 8GB+ VRAM.
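
    Ertas Studio configures all of this through its interface, but it can help to see the moving parts. Below is a minimal open-source QLoRA setup sketch, assuming the transformers, peft, and bitsandbytes libraries; it illustrates the technique, not Ertas's internal implementation:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Attach small trainable LoRA adapters to the attention projections.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # a fraction of a percent of 1.1B
    ```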

    The model's small size enables uniquely fast iteration. A complete fine-tuning run on 10,000 examples typically completes in 15-30 minutes on a single consumer GPU. This allows for rapid experimentation with different datasets, hyperparameters, and training strategies. You can run dozens of experiments in a day, testing different approaches before settling on the best configuration.
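
    Because individual runs are so cheap, grid sweeps become practical. A sketch of that workflow, where run_finetune is a hypothetical stand-in for however you launch a single run:

    ```python
    import random
    from itertools import product

    # Hypothetical helper: launch one fine-tuning run, return its eval loss.
    # The random value is only a placeholder for a real training invocation.
    def run_finetune(rank: int, lr: float) -> float:
        return random.random()

    # With ~20-minute runs, a 3x3 grid fits comfortably in one working day.
    results = {}
    for rank, lr in product([8, 16, 32], [1e-4, 2e-4, 5e-4]):
        results[(rank, lr)] = run_finetune(rank=rank, lr=lr)

    best = min(results, key=results.get)
    print(f"best config: rank={best[0]}, lr={best[1]}")
    ```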

    After fine-tuning, Ertas Studio exports to GGUF format. A Q4_K_M quantized TinyLlama is only about 670MB — small enough to distribute as part of a desktop application, embed in a mobile app, or include in a Docker container without significant size overhead. Deploy through Ollama for API access or llama.cpp for direct integration into your application.
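
    Once exported, the GGUF file can be loaded from any llama.cpp binding. As one illustration, a minimal sketch using the llama-cpp-python package, with a hypothetical path to the exported file:

    ```python
    from llama_cpp import Llama

    # Load the exported Q4_K_M GGUF (path is a hypothetical example).
    llm = Llama(model_path="./exports/tinyllama-q4_k_m.gguf", n_ctx=2048)

    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
        max_tokens=64,
    )
    print(response["choices"][0]["message"]["content"])
    ```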

    Use Cases

    TinyLlama's primary use case is on-device deployment where model size is the primary constraint. Mobile applications that need offline AI capability, embedded systems in IoT devices, browser-based AI features via WebAssembly, and edge computing scenarios all benefit from TinyLlama's minimal resource requirements. The model can run alongside other applications without competing for resources.

    The model excels as a fast preprocessing or classification layer in larger systems. Use it for intent detection, simple entity extraction, text classification, sentiment analysis, and query routing. While it lacks the sophistication for complex reasoning, it performs these focused tasks efficiently at minimal cost. Many production systems use TinyLlama as a lightweight first-pass filter.
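
    As a concrete illustration of that first-pass-filter pattern, here is a sketch of a tiny intent router on top of llama-cpp-python; the label set, model path, and prompt are assumptions made for the example:

    ```python
    from llama_cpp import Llama

    llm = Llama(model_path="./exports/tinyllama-q4_k_m.gguf", n_ctx=2048)

    INTENTS = ["billing", "technical_support", "sales", "other"]  # example labels

    def route(query: str) -> str:
        """Classify a query into one intent label via a constrained prompt."""
        prompt = (
            f"Classify the message into exactly one of: {', '.join(INTENTS)}.\n"
            f"Message: {query}\n"
            "Answer with only the label.\nLabel:"
        )
        out = llm(prompt, max_tokens=5, temperature=0.0, stop=["\n"])
        label = out["choices"][0]["text"].strip().lower()
        return label if label in INTENTS else "other"

    print(route("My invoice was charged twice this month"))  # likely "billing"
    ```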

    TinyLlama is also invaluable for educational purposes. Students learning about language models can fine-tune, experiment with, and deploy TinyLlama on personal laptops without any special hardware. The complete training pipeline — from data preparation through fine-tuning to deployment — can be experienced in hours rather than days, making it an ideal teaching tool.

    Hardware Requirements

    TinyLlama at Q4_K_M quantization requires approximately 670MB of RAM. This is small enough to run on Raspberry Pi 4 (4GB), most smartphones, and even browser environments via WebAssembly. At Q8_0, the requirement is approximately 1.2GB. Full FP16 requires approximately 2.2GB. These are among the lowest requirements for any coherent language model.

    Inference speed is remarkably fast due to the tiny model size. On an RTX 4090, expect 100+ tokens per second at Q4_K_M. On Apple M1 with 8GB, expect 30-50 tokens per second. Even CPU inference on a modern laptop yields 15-30 tokens per second, making TinyLlama responsive for interactive applications without any GPU.
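
    Throughput on your own hardware is easy to measure. A quick sketch, again with llama-cpp-python and a hypothetical model path:

    ```python
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./exports/tinyllama-q4_k_m.gguf", n_ctx=2048)

    start = time.perf_counter()
    out = llm("Explain edge deployment in one paragraph.", max_tokens=128)
    elapsed = time.perf_counter() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated / elapsed:.1f} tokens/sec")
    ```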

    For fine-tuning in Ertas Studio, QLoRA requires 2-4GB VRAM, and full fine-tuning requires 8GB+ VRAM. The extremely low requirements mean TinyLlama can be fine-tuned on virtually any machine with a discrete GPU, and even some integrated GPU configurations. Training completes in minutes to tens of minutes for typical dataset sizes.

    Supported Quantizations

    Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.