Fine-Tune xLAM with Ertas
Salesforce's open-weight Large Action Model family — small models trained specifically to plan, call tools, and execute multi-step actions, with first-class support in vLLM, llama.cpp, and the Berkeley Function Calling Leaderboard ecosystem.
Overview
xLAM (Large Action Model) is Salesforce AI Research's open-weight model family designed specifically for agentic workflows: planning, tool calling, and multi-step task execution. The family spans dense small models (xLAM-1b, xLAM-7b) and mixture-of-experts variants (xLAM-8x7b-r, xLAM-8x22b-r), all trained on a curated corpus of function-call traces, agent rollouts, and structured action sequences. Where general-purpose instruction-tuned models pick up tool-calling competence as a side effect of broader training, xLAM is purpose-built around it from the start.
The family's defining trait is its consistency on the Berkeley Function Calling Leaderboard (BFCL). At 1.35B parameters, xLAM-1b has held a top position in its size class, repeatedly outperforming 3B–7B general-purpose alternatives on parallel function calls, nested calls, and multi-turn conversations with optional tool use. The 7B variant is competitive with frontier API models on standard agentic tasks despite being roughly two orders of magnitude smaller.
xLAM is supported natively in vLLM with a dedicated tool-call parser, in llama.cpp via standard GGUF builds, and in the major agent frameworks (LangGraph, Pydantic AI, Smolagents) through OpenAI-compatible endpoints. Salesforce has been unusually thorough about documenting the recommended prompt format, which makes xLAM easy to drop into an existing agent pipeline.
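A minimal serving sketch, assuming a recent vLLM build with the xLAM tool-call parser and the Salesforce/xLAM-7b-fc-r checkpoint (verify both against your vLLM version; the `get_ticket` tool is hypothetical, for illustration only):

```python
# Serve xLAM behind an OpenAI-compatible endpoint (check the parser name
# and model ID against your vLLM version):
#   vllm serve Salesforce/xLAM-7b-fc-r \
#     --enable-auto-tool-choice --tool-call-parser xlam

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket",  # hypothetical tool, for illustration
        "description": "Fetch a support ticket by ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Salesforce/xLAM-7b-fc-r",
    messages=[{"role": "user", "content": "Pull up ticket TK-1042."}],
    tools=tools,
)

# vLLM's xLAM parser maps the model's JSON output onto standard tool_calls.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```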
Key Features
xLAM is licensed under CC-BY-NC-4.0 for the dense variants and a Salesforce-specific research license for the MoE variants. This non-commercial restriction is a meaningful constraint — xLAM is well-suited for research, prototyping, and internal evaluation but requires a separate commercial agreement with Salesforce for revenue-generating deployments. Teams evaluating xLAM should plan around this from the start.
The model supports multiple JSON output styles (the xLAM team published evaluations on at least four common formats), and the vLLM tool-call parser handles all of them transparently. This flexibility is unusual — most function-calling models are sensitive to a specific schema convention — and makes xLAM particularly valuable when integrating with agent frameworks that have their own JSON conventions (Pydantic AI's strict typing, OpenAI's tool-call schema, LangGraph's custom dispatch formats).
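As a concrete illustration of that flexibility, here is a small normalizer covering two JSON shapes commonly attributed to xLAM-style output, a bare list of calls and a wrapper object with a `tool_calls` key; the exact shapes a given checkpoint emits should be confirmed against its model card:

```python
import json
from typing import Any


def normalize_tool_calls(raw: str) -> list[dict[str, Any]]:
    """Map xLAM-style JSON output onto a single [{name, arguments}] shape.

    Handles two assumed formats (not exhaustive):
      1. a bare list:       [{"name": ..., "arguments": {...}}]
      2. a wrapper object:  {"tool_calls": [{"name": ..., "arguments": {...}}]}
    """
    parsed = json.loads(raw)
    calls = parsed.get("tool_calls", []) if isinstance(parsed, dict) else parsed
    return [{"name": c["name"], "arguments": c.get("arguments", {})} for c in calls]


print(normalize_tool_calls(
    '[{"name": "get_ticket", "arguments": {"ticket_id": "TK-1042"}}]'))
```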
xLAM's training data is publicly described in the APIGen-MT paper and includes synthetic agentic trajectories generated by larger models, then verified by execution. This data-generation methodology is itself influential — several other 2026 agent-specialist models cite the APIGen approach as the inspiration for their own training corpora.
Fine-Tuning with Ertas
xLAM is well-suited to Ertas Studio fine-tuning when the task involves multi-tool planning rather than single function calls. Where FunctionGemma is the right base for clean intent-to-invocation mapping, xLAM is the right base when the agent needs to chain multiple tool calls, recover from failed calls, or interleave reasoning with tool use.
The recommended Ertas workflow for xLAM-7B is QLoRA fine-tuning on agentic trajectories: each training example is a multi-turn conversation with embedded function calls and observations. Studio's data format supports this natively: JSONL with `messages` arrays containing user, assistant, tool_call, and tool_observation roles. A 12–16GB consumer GPU handles xLAM-7B QLoRA at 2048-token sequence lengths; the larger MoE variants need 24–48GB.
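A minimal sketch of one training line in that format, written from Python. The role names follow the description above; treat the exact field schema as an assumption to check against Studio's data-format documentation:

```python
import json

# One agentic trajectory = one JSONL line. Role names (tool_call,
# tool_observation) follow the prose above; confirm the exact schema
# against Studio's data-format docs.
example = {
    "messages": [
        {"role": "user",
         "content": "Refund order #8841 and email the customer."},
        {"role": "tool_call", "content": json.dumps(
            {"name": "refund_order", "arguments": {"order_id": "8841"}})},
        {"role": "tool_observation", "content": '{"status": "refunded"}'},
        {"role": "tool_call", "content": json.dumps(
            {"name": "send_email",
             "arguments": {"order_id": "8841",
                           "template": "refund_confirmation"}})},
        {"role": "tool_observation", "content": '{"status": "sent"}'},
        {"role": "assistant",
         "content": "Order #8841 is refunded and the customer has been notified."},
    ]
}

with open("trajectories.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```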
The non-commercial license affects the deployment story. Studio handles training and evaluation, but for production deployment teams should plan to either negotiate a commercial license with Salesforce, deploy in non-commercial contexts (research, internal tooling, education), or use the trained adapter as a teacher to distill into a permissively-licensed base (Llama 3, Qwen 3, Gemma 4) — Studio supports this distillation workflow.
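One way to realize that teacher step outside of Studio is plain sequence-level distillation: sample trajectories from the fine-tuned xLAM teacher, then run ordinary supervised fine-tuning of the permissively licensed student on those outputs. A compressed sketch using Hugging Face Transformers (model IDs are placeholders; Studio's built-in workflow may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_id = "my-org/xlam-7b-support-agent"   # hypothetical merged adapter
student_id = "Qwen/Qwen2.5-7B-Instruct"       # permissive base; swap as needed

tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id, device_map="auto")


def distill_sample(prompt: str, max_new_tokens: int = 512) -> str:
    """Sample one teacher trajectory to use as a student training target."""
    inputs = tok(prompt, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens,
                           do_sample=True, temperature=0.7)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)

# Collect (prompt, teacher_output) pairs across your task distribution, then
# fine-tune the student on them with standard SFT (e.g., TRL's SFTTrainer).
```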
Use Cases
xLAM's strongest fit is multi-step agentic workflows where the model needs to plan, execute, observe, and replan: customer-support agents that handle a ticket end-to-end through several CRM and database tools; research agents that browse, summarize, and cross-reference sources; coding agents that read files, run tests, and edit code in a loop. On these tasks, xLAM-7B routinely matches or exceeds general-purpose 14B–34B models, particularly on the multi-turn tool-use sub-benchmarks of BFCL v4.
For research teams and academic labs, xLAM is one of the strongest open baselines for agent-specific research — its training data methodology is documented, its evaluation set is published, and its results are reproducible. Teams building custom agentic benchmarks or new training-data generation pipelines often start with xLAM as the reference point.
For commercial mobile deployment, xLAM is not the right choice given the licensing constraint — fine-tuned Qwen 3 or Gemma 4 derivatives are usually the better path to production. xLAM's role is more often the upstream teacher in a knowledge-distillation pipeline that produces a deployable, permissively-licensed model with similar agentic capabilities.
Hardware Requirements
xLAM-1B at Q4_K_M quantization is approximately 700MB and runs comfortably on phones, laptops, and any GPU with 2GB+ VRAM. Inference throughput on a modern laptop CPU is 60–90 tokens per second; on consumer GPUs (RTX 3060 and above) it exceeds 200 tokens per second.
xLAM-7B at Q4_K_M is approximately 4.2GB. A 6–8GB consumer GPU is sufficient for inference; QLoRA fine-tuning fits on 12–16GB. Throughput on consumer GPUs is typically 60–100 tokens per second at standard context lengths.
The MoE variants (xLAM-8x7B and xLAM-8x22B) require loading all expert weights at inference time even though only a subset are active per token — 28GB and 90GB respectively at Q4_K_M. A 24GB consumer GPU handles xLAM-8x7B at the lower quantization tiers; xLAM-8x22B is a server-class deployment. For Studio fine-tuning, the dense xLAM-7B is the practical sweet spot.
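The sizes above follow from parameter count times average bits per weight. A quick back-of-envelope helper, assuming Q4_K_M averages roughly 4.5–5 bits per weight across tensors (an approximation that varies by layer mix):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Rough on-disk size for a quantized model.

    Q4_K_M averages ~4.5-5 bits/weight across tensors (approximate).
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# Parameter counts are total (not active) weights, since all experts load.
for name, n_params in [("xLAM-1b", 1.35), ("xLAM-7b", 7.0), ("xLAM-8x7b", 46.7)]:
    print(f"{name}: ~{quantized_size_gb(n_params):.1f} GB at Q4_K_M")
```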
Supported Quantizations
Related Resources
Fine-Tuning for Tool Calling: How to Build Reliable AI Agents with Small Models
Stop Paying GPT-4 to Call Your APIs: Fine-Tune a Local Tool-Calling Model
FunctionGemma and the Rise of Dedicated Tool-Calling Models
LangGraph
llama.cpp
Ollama
smolagents
vLLM