Fine-Tune InternLM with Ertas

    Shanghai AI Laboratory's multilingual model series in 7B and 20B sizes, featuring strong Chinese-English capabilities, long-context support, and excellent performance on reasoning and tool-use benchmarks.

    7B · 20B · Shanghai AI Lab

    Overview

    InternLM is a series of large language models developed by the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) in collaboration with several Chinese universities and research institutions. The current generation, InternLM 2.5, is available in 7B and 20B parameter sizes and represents one of the strongest Chinese-developed open-weight model families.

    The models are trained on a diverse corpus exceeding 2.6 trillion tokens, carefully curated to include high-quality Chinese and English text, code, mathematical content, and scientific literature. InternLM 2.5 demonstrates particularly strong performance on tasks requiring reasoning, tool use, and long-context understanding, frequently ranking among the top models on Chinese-language benchmarks while maintaining competitive English performance.

    Architecturally, InternLM 2.5 uses a dense transformer decoder with grouped-query attention, SwiGLU activations, and RoPE positional embeddings. The models support a 1 million token context window through dynamic NTK-aware interpolation, one of the longest context windows available in any open-weight model. This enables processing of extremely long documents, entire codebases, and extensive conversation histories.

    InternLM models are released under the Apache 2.0 license, supporting both research and commercial use. The Shanghai AI Lab also provides a comprehensive ecosystem around InternLM, including the InternLM-XComposer multimodal model, the InternLM-Math reasoning model, and the Lagent agent framework.

    Key Features

    InternLM 2.5's 1 million token context window is its most striking feature. While many models claim long context through RoPE scaling, InternLM demonstrates reliable performance across its full context range, maintaining coherent understanding and accurate retrieval even at extreme sequence lengths. This is achieved through a combination of dynamic NTK interpolation and specialized long-context training data.

    Tool use and agent capabilities are another area where InternLM excels. The model was specifically trained with tool-use data, including code interpreter integration, web search, and function calling. InternLM serves as the backbone of the Lagent agent framework, demonstrating strong performance on agent benchmarks like AgentBench and T-Bench. The model can plan multi-step tool interactions, handle tool errors gracefully, and synthesize results from multiple tool calls.

    InternLM demonstrates strong mathematical and scientific reasoning, with specialized training on mathematical proofs, scientific papers, and structured reasoning tasks. The InternLM-Math variant pushes this further, achieving competitive results on mathematical olympiad problems and graduate-level science questions.

    Fine-Tuning with Ertas

    InternLM models are well-suited for fine-tuning in Ertas Studio, particularly for applications requiring Chinese-English bilingual capability or agentic tool-use behavior. The 7B model requires 8-12GB VRAM with QLoRA, making it accessible on consumer GPUs like the RTX 4070 Ti or RTX 4080. The 20B model requires 14-20GB VRAM, fitting on an RTX 4090 or A5000.
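
    For context on what these memory budgets correspond to: QLoRA fine-tuning loads the base model in 4-bit NF4 precision and trains small low-rank adapters on top. The sketch below shows roughly what that setup looks like with the Hugging Face transformers, bitsandbytes, and peft libraries; the model ID, LoRA rank, and target module names are illustrative assumptions, and Ertas Studio's own pipeline handles the equivalent configuration for you.

        # Rough QLoRA setup sketch. Model id, rank, and target modules are assumptions,
        # not the exact configuration Ertas Studio uses internally.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

        model_id = "internlm/internlm2_5-7b-chat"  # assumed Hugging Face repo id

        # Load the base model in 4-bit NF4, keeping compute in bfloat16.
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True,
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

        # Attach LoRA adapters; rank, alpha, and targets are illustrative defaults.
        model = prepare_model_for_kbit_training(model)
        lora_config = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.05,
            target_modules=["wqkv", "wo", "w1", "w2", "w3"],  # InternLM2-style naming (assumption)
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()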

    For agent and tool-use fine-tuning, Ertas Studio supports training datasets that include tool call annotations. Structure your examples with natural language queries, the expected tool invocations, and the final synthesized response. InternLM's pre-existing tool-use capabilities mean it requires relatively little fine-tuning data to adapt to new tools and APIs — as few as 500-1000 examples can produce reliable tool-calling behavior for custom APIs.
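
    The exact schema Ertas Studio expects is not spelled out here, but a tool-use training example generally pairs a user query with the tool invocation(s) the model should emit, the tool's result, and the final synthesized answer. A hypothetical example, with illustrative field and tool names:

        # Hypothetical tool-use training example. Field names, the tool name, and its
        # arguments are illustrative, not a fixed Ertas Studio schema.
        example = {
            "messages": [
                {"role": "user", "content": "What's the shipping status of order 48213?"},
                {
                    "role": "assistant",
                    "tool_calls": [
                        {"name": "get_order_status", "arguments": {"order_id": "48213"}}
                    ],
                },
                {
                    "role": "tool",
                    "name": "get_order_status",
                    "content": '{"status": "in_transit", "eta": "2024-06-03"}',
                },
                {
                    "role": "assistant",
                    "content": "Order 48213 is in transit and expected to arrive on June 3.",
                },
            ]
        }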

    After fine-tuning, export to GGUF format for local deployment. InternLM's long context capability is preserved through quantization, though extremely long contexts will require proportionally more RAM for the KV cache. Deploy through Ollama or llama.cpp for integration into your application stack.
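
    Once the fine-tuned model is exported to GGUF, it can be served through Ollama or loaded directly from Python via llama-cpp-python. A minimal inference sketch, assuming a local file named internlm2_5-7b-custom.Q4_K_M.gguf (the filename and context size are placeholders):

        # Minimal local-inference sketch with llama-cpp-python.
        # The GGUF filename and context size are assumptions for illustration.
        from llama_cpp import Llama

        llm = Llama(
            model_path="internlm2_5-7b-custom.Q4_K_M.gguf",
            n_ctx=32768,      # context window; larger values grow the KV cache
            n_gpu_layers=-1,  # offload all layers to GPU if one is available
        )

        response = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Summarize the attached contract clause."}],
            max_tokens=256,
        )
        print(response["choices"][0]["message"]["content"])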

    Use Cases

    InternLM is an excellent choice for building AI agents that need to interact with tools, APIs, and external data sources. Its strong tool-use training makes it reliable for function calling, code execution, web search integration, and multi-step task planning. Organizations building internal AI assistants that need to query databases, call internal APIs, and synthesize results from multiple sources find InternLM to be a strong foundation.

    The 1 million token context window makes InternLM valuable for extreme long-context applications: processing entire books or document collections, analyzing large codebases in a single pass, and maintaining very long conversation histories for persistent AI assistants. This is particularly useful for legal document review, patent analysis, and comprehensive literature surveys.

    Bilingual Chinese-English applications are another key use case. InternLM performs competitively with dedicated Chinese models like Yi and Qwen on Chinese tasks while maintaining strong English capability. Organizations serving markets in both China and English-speaking regions can use InternLM as a single model backbone for both languages.

    Hardware Requirements

    InternLM 7B at Q4_K_M quantization requires approximately 4.5GB of RAM for the model weights. However, the 1M context window means KV cache can consume significant additional memory for long sequences — processing 100K tokens may require an additional 8-12GB of RAM for the KV cache alone. Plan memory accordingly based on your expected context lengths.
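
    As a back-of-the-envelope check, the KV cache grows linearly with context length: two tensors (K and V) per layer, each sized by the number of KV heads and the head dimension. The sketch below uses assumed InternLM 2.5 7B dimensions (32 layers, 8 KV heads via grouped-query attention, head dimension 128); verify against the actual model config before relying on the numbers.

        # Back-of-the-envelope KV cache estimate (assumed InternLM 2.5 7B dimensions).
        def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
            # Two tensors (K and V) per layer, each tokens x kv_heads x head_dim elements.
            total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
            return total_bytes / 1024**3

        print(f"{kv_cache_gib(100_000):.1f} GiB for 100K tokens at FP16")                 # ~12.2 GiB
        print(f"{kv_cache_gib(100_000, bytes_per_elem=1):.1f} GiB with an 8-bit KV cache") # ~6.1 GiB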

    The 20B model at Q4_K_M requires approximately 12GB for model weights; its KV cache grows with context length in the same way, but is larger per token because the 20B model has more layers. At Q8_0, the 20B model needs about 21GB for weights. Full FP16 inference requires approximately 14.5GB (7B) or 40GB (20B) for weights alone.

    For fine-tuning in Ertas Studio, the 7B model needs 8-12GB VRAM with QLoRA, and the 20B needs 14-20GB. Training with long-context examples will require additional memory proportional to sequence length. For most fine-tuning tasks, a context length of 4K-8K tokens is sufficient and keeps memory requirements manageable.

    Supported Quantizations

    Q4_0 · Q4_K_M · Q5_K_M · Q6_K · Q8_0 · F16
