Fine-Tune Falcon with Ertas

    The Technology Innovation Institute's open-weight model family in 7B, 40B, and 180B sizes, trained on the massive RefinedWeb dataset and pioneering the use of high-quality filtered web data for LLM training.


    Overview

    Falcon is a family of large language models developed by the Technology Innovation Institute (TII) in Abu Dhabi, United Arab Emirates. When Falcon 40B was released in May 2023, it briefly topped the Hugging Face Open LLM Leaderboard, demonstrating that carefully curated web data could produce models rivaling those trained on more expensive, manually curated datasets.

    The Falcon family includes three sizes: 7B, 40B, and 180B parameters. The models were trained primarily on RefinedWeb, a massive dataset of filtered web pages that TII created by applying extensive quality filtering, deduplication, and content extraction to Common Crawl data. The 180B model was trained on approximately 3.5 trillion tokens, making it one of the largest openly trained models at the time of its release.

    Architecturally, Falcon uses a decoder-only transformer with multi-query attention (a single key-value head shared across all query heads) in the 7B variant and grouped-query attention in the 40B and 180B variants. The models use a custom tokenizer with a vocabulary of approximately 65K tokens and support a 2K context window (extendable through fine-tuning and RoPE scaling).
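    These architectural details are visible directly in the model's Hugging Face configuration. The sketch below, assuming the transformers library and the public tiiuae/falcon-7b checkpoint, prints the fields discussed above; exact attribute names can vary between transformers versions, hence the getattr fallbacks.

```python
# Sketch: inspect Falcon 7B's architecture from its Hugging Face config.
# Assumes the "tiiuae/falcon-7b" repo id; attribute names may differ
# slightly across transformers versions.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-7b")

print(config.vocab_size)                                 # ~65K custom vocabulary
print(getattr(config, "multi_query", None))              # True on the 7B (single KV head)
print(getattr(config, "num_kv_heads", None))             # >1 on the GQA 40B/180B variants
print(getattr(config, "max_position_embeddings", None))  # 2K default context window
```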

    Falcon models are released under the Apache 2.0 license. While newer models have surpassed Falcon on most benchmarks, its contribution to demonstrating the viability of web-data-centric training was influential in shaping subsequent model development practices across the industry.

    Key Features

    The RefinedWeb dataset is Falcon's most significant contribution to the LLM ecosystem. TII demonstrated that with sufficiently rigorous filtering — including URL-based filtering, content extraction with trafilatura, exact and near-duplicate removal with MinHash, and quality scoring — web-crawled data alone can produce models competitive with those trained on curated datasets. This finding influenced the training data strategies of many subsequent models.
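    As a rough illustration of that recipe (not TII's actual pipeline), the sketch below extracts main content with trafilatura and drops near-duplicates with MinHash LSH from the datasketch library; the thresholds, the minimum-length check, and the crawled_pages input are assumptions for the example.

```python
# Minimal RefinedWeb-style filtering sketch, not TII's production code.
import trafilatura
from datasketch import MinHash, MinHashLSH

def to_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # near-duplicate threshold (illustrative)
kept = []

for i, html in enumerate(crawled_pages):        # crawled_pages: your own iterable of raw HTML
    text = trafilatura.extract(html)            # main-content extraction
    if not text or len(text.split()) < 50:      # crude quality floor (illustrative)
        continue
    m = to_minhash(text)
    if lsh.query(m):                            # near-duplicate of a page already kept
        continue
    lsh.insert(f"doc-{i}", m)
    kept.append(text)
```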

    Multi-query attention in Falcon 7B reduces the KV cache to a single head, providing exceptional inference throughput. This makes Falcon 7B particularly efficient for high-concurrency serving scenarios where memory bandwidth is the bottleneck. The 40B and 180B models use grouped-query attention for a balance of efficiency and model quality.
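    A quick back-of-the-envelope calculation shows why this matters at high concurrency. The figures below assume approximate Falcon 7B shapes (32 layers, 71 query heads, head dimension 64) and fp16 cache entries; treat them as illustrative rather than exact published values.

```python
# KV-cache sizing sketch: MQA (one shared KV head) vs. full multi-head caching.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val  # 2x for K and V

seq_len, batch = 2048, 32
mqa = kv_cache_bytes(32, 1, 64, seq_len, batch)    # Falcon 7B style: single shared KV head
mha = kv_cache_bytes(32, 71, 64, seq_len, batch)   # same shapes with one KV head per query head

print(f"MQA cache: {mqa / 1e9:.1f} GB vs. full MHA cache: {mha / 1e9:.1f} GB")
# Roughly 0.5 GB vs. 38 GB for a 32-way batch at 2K tokens.
```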

    Falcon's instruction-tuned variants (Falcon Instruct) were fine-tuned on a mix of chat and instruction data, demonstrating competent conversational ability. The models respond well to further fine-tuning, with the community producing numerous specialized variants for different domains and languages, particularly Arabic, given TII's connection to the UAE.

    Fine-Tuning with Ertas

    Falcon models are straightforward to fine-tune in Ertas Studio. The 7B model is particularly efficient, requiring only 6-10GB VRAM with QLoRA due to its multi-query attention reducing the memory overhead. The 40B model requires 24-32GB VRAM, fitting on a single A100 40GB or A6000 48GB. The 180B model requires multi-GPU setups for fine-tuning.
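    Ertas Studio handles this configuration for you, but for orientation, a minimal QLoRA recipe for Falcon 7B with transformers, bitsandbytes, and peft looks roughly like the sketch below; the LoRA hyperparameters are illustrative assumptions, not tuned recommendations.

```python
# Illustrative QLoRA setup for Falcon 7B: 4-bit base weights + LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative values
    target_modules=["query_key_value"],       # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()            # only the adapter weights train
```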

    Falcon responds well to fine-tuning on domain-specific data, and its RefinedWeb training provides a solid general-knowledge foundation. For Arabic-language applications, Falcon is a strong starting point — the RefinedWeb dataset includes Arabic content, and TII has released Arabic-specific variants. Fine-tuning on Arabic conversational or domain data in Ertas Studio can produce a capable Arabic AI assistant.

    After training, export to GGUF format for deployment. Note that Falcon's shorter default context window (2K) may require explicit RoPE scaling configuration if your application needs longer contexts. Ertas Studio includes options for context extension during fine-tuning, allowing you to extend Falcon's effective context length to 8K or 16K tokens.
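    As a rough sketch of what the deployed artifact looks like, the snippet below loads an exported GGUF with llama-cpp-python and requests an 8K window; the file path is a placeholder, and rope_freq_scale=0.25 assumes simple linear scaling from the 2K base (2048 / 8192).

```python
# Loading a fine-tuned Falcon GGUF with an extended context (sketch).
from llama_cpp import Llama

llm = Llama(
    model_path="falcon-7b-finetuned.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,             # requested context window
    rope_freq_scale=0.25,   # 2048 / 8192, assuming linear RoPE scaling
)

out = llm("Summarize the following report:", max_tokens=64)
print(out["choices"][0]["text"])
```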

    Use Cases

    Falcon 7B is a solid choice for applications requiring fast, efficient inference with good general quality. Its multi-query attention makes it one of the most throughput-efficient 7B models for API serving, and it performs well on standard NLP tasks: summarization, question answering, classification, and conversational AI.

    The 40B model is suitable for enterprise applications where quality matters but frontier-model performance is not required. It handles complex instruction following, content generation, and analytical tasks competently. Organizations that adopted Falcon early and have existing fine-tuned variants may find it cost-effective to continue with the Falcon ecosystem rather than migrating.

    Falcon has particular relevance for Arabic-language AI applications, given TII's ongoing investment in Arabic NLP. Fine-tuned Falcon models serve Arabic customer support, content generation, and translation tasks across the Middle East and North Africa region.

    Hardware Requirements

    Falcon 7B at Q4_K_M requires approximately 4.3GB of RAM, easily runnable on consumer hardware with 8GB+ RAM. The 40B model at Q4_K_M needs approximately 23GB, fitting on RTX 4090 24GB (tight) or A6000 48GB. The 180B at Q4_K_M requires approximately 103GB, necessitating multi-GPU setups or large-memory CPU inference.

    At Q8_0, the requirements are approximately 7.5GB (7B), 43GB (40B), and 190GB (180B). Full FP16 inference requires 14.5GB (7B), 80GB (40B), and 360GB (180B). The 7B model's multi-query attention provides excellent tokens-per-second performance, often 20-30% faster than comparable GQA models at the same parameter count.
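    These figures follow a simple rule of thumb: weight memory is roughly parameter count times bits per weight, with the KV cache and runtime buffers adding a little on top. The bits-per-weight values in the sketch below are approximations for the llama.cpp quant formats, not exact specifications.

```python
# Rough weight-memory estimate behind the figures above (weights only).
def approx_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8  # 1e9 params * bits / 8 bits-per-byte = GB

for name, bpw in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("F16", 16.0)]:
    sizes = {f"{p}B": round(approx_gb(p, bpw), 1) for p in (7, 40, 180)}
    print(name, sizes)
```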

    For fine-tuning in Ertas Studio, the 7B needs 6-10GB VRAM, the 40B needs 24-32GB, and the 180B needs 80-120GB with QLoRA. The 7B model's low requirements make it accessible for individual developers and small teams exploring custom model development.

    Supported Quantizations

    Q4_0 · Q4_K_M · Q5_K_M · Q6_K · Q8_0 · F16
