
    On-Device AI in React Native with llama.rn

    How to run language models directly on the user's phone in a React Native app. Setup, model loading, streaming generation, and cross-platform considerations using llama.rn.

Ertas Team

    llama.rn is a React Native library that provides JavaScript bindings to llama.cpp. It runs GGUF language models natively on both iOS (Metal) and Android (CPU/Vulkan) through the same JavaScript API.

    For React Native developers, this means on-device AI with one codebase, one API, and zero cloud dependency.

    Installation

    npm install llama.rn
    # or
    yarn add llama.rn
    

    For iOS, run pod install:

    cd ios && pod install
    

    For Android, the native library is included automatically via autolinking.

    Expo

    If you are using Expo, you need a development build (not Expo Go) since llama.rn includes native code:

    npx expo prebuild
    npx expo run:ios  # or run:android
    

    Loading a Model

    import { initLlama, LlamaContext } from "llama.rn";
    
    let context: LlamaContext | null = null;
    
    async function loadModel(modelPath: string) {
      context = await initLlama({
        model: modelPath,
        n_ctx: 2048,        // Context window
        n_threads: 4,       // CPU threads
        n_gpu_layers: 99,   // Offload to GPU (Metal/Vulkan)
        use_mlock: true,    // Lock model in memory
      });
    
      console.log("Model loaded successfully");
    }
    

    Model Path

    The model path must point to a local file on the device's filesystem. How you get the file there depends on your delivery strategy:

    Bundled with app:

// iOS: copy to the app bundle and reference it via RNFS
import RNFS from "react-native-fs";
const iosModelPath = `${RNFS.MainBundlePath}/model.gguf`;

// Android: copy from assets to the files directory on first launch
// (see the sketch below), then reference the copy
const androidModelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
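
The Android comment above glosses over the copy itself. A minimal sketch of that first-launch copy, using react-native-fs's Android-only copyFileAssets and assuming the model ships at android/app/src/main/assets/model.gguf:

import RNFS from "react-native-fs";
import { Platform } from "react-native";

// Copies the bundled model out of the app package on first launch.
// Assumes the file ships at android/app/src/main/assets/model.gguf.
async function ensureAndroidModelCopied(): Promise<string> {
  const destPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;

  if (Platform.OS === "android" && !(await RNFS.exists(destPath))) {
    // copyFileAssets is Android-only and reads from the APK/AAB assets folder
    await RNFS.copyFileAssets("model.gguf", destPath);
  }

  return destPath;
}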
    

    Downloaded post-install:

    import RNFS from "react-native-fs";
    
    const modelUrl = "https://cdn.example.com/model.gguf";
    const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
    
    const download = RNFS.downloadFile({
      fromUrl: modelUrl,
      toFile: modelPath,
      progress: (res) => {
        const percentage = (res.bytesWritten / res.contentLength) * 100;
        setDownloadProgress(percentage);
      },
    });
    
    await download.promise;
    

    Generating Text

    Simple Generation

    async function generate(prompt: string): Promise<string> {
      if (!context) throw new Error("Model not loaded");
    
      const result = await context.completion({
        prompt: prompt,
        n_predict: 256,
        temperature: 0.7,
        top_p: 0.9,
        stop: ["</s>", "<|eot_id|>"],  // Stop tokens
      });
    
      return result.text;
    }
    

    Streaming Generation

    async function generateStream(
      prompt: string,
      onToken: (token: string) => void
    ): Promise<string> {
      if (!context) throw new Error("Model not loaded");
    
      const result = await context.completion(
        {
          prompt: prompt,
          n_predict: 256,
          temperature: 0.7,
          stop: ["</s>", "<|eot_id|>"],
        },
        (data) => {
          // Called for each generated token
          onToken(data.token);
        }
      );
    
      return result.text;
    }
    

    Chat Completion

    For multi-turn conversations, format the prompt using the model's chat template:

    interface Message {
      role: "system" | "user" | "assistant";
      content: string;
    }
    
    function formatChat(messages: Message[]): string {
      // Llama 3.2 chat template
      let prompt = "<|begin_of_text|>";
    
      for (const msg of messages) {
        prompt += `<|start_header_id|>${msg.role}<|end_header_id|>\n\n${msg.content}<|eot_id|>`;
      }
    
      prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n";
      return prompt;
    }
    
    async function chat(messages: Message[], onToken: (token: string) => void) {
      const prompt = formatChat(messages);
      return generateStream(prompt, onToken);
    }
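
Hand-rolling templates is easy to get subtly wrong, and it hard-codes one model family. Recent llama.rn versions can also take a messages array and apply the chat template embedded in the GGUF; a sketch, assuming the model file you use ships a template:

// Alternative: pass messages directly and let llama.rn apply the chat
// template embedded in the GGUF (most instruct models ship one).
async function chatWithTemplate(
  messages: Message[],
  onToken: (token: string) => void
): Promise<string> {
  if (!context) throw new Error("Model not loaded");

  const result = await context.completion(
    {
      messages,          // formatted via the model's built-in template
      n_predict: 256,
      temperature: 0.7,
    },
    (data) => onToken(data.token)
  );

  return result.text;
}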
    

    React Hook Pattern

    import { useState, useCallback, useRef } from "react";
    import { initLlama, LlamaContext } from "llama.rn";
    
    export function useLlama(modelPath: string) {
      const contextRef = useRef<LlamaContext | null>(null);
      const [isLoaded, setIsLoaded] = useState(false);
      const [isGenerating, setIsGenerating] = useState(false);
      const [response, setResponse] = useState("");
    
      const load = useCallback(async () => {
        contextRef.current = await initLlama({
          model: modelPath,
          n_ctx: 2048,
          n_threads: 4,
          n_gpu_layers: 99,
        });
        setIsLoaded(true);
      }, [modelPath]);
    
      const generate = useCallback(async (prompt: string) => {
        if (!contextRef.current) return;
        setIsGenerating(true);
        setResponse("");
    
        await contextRef.current.completion(
          {
            prompt,
            n_predict: 256,
            temperature: 0.7,
            stop: ["</s>", "<|eot_id|>"],
          },
          (data) => {
            setResponse((prev) => prev + data.token);
          }
        );
    
        setIsGenerating(false);
      }, []);
    
      const unload = useCallback(() => {
        contextRef.current?.release();
        contextRef.current = null;
        setIsLoaded(false);
      }, []);
    
      return { load, generate, unload, isLoaded, isGenerating, response };
    }
    

    Usage in a Component

import { useState, useEffect } from "react";
import { View, ScrollView, Text, TextInput, Button } from "react-native";

// Assumes modelPath is resolved elsewhere (see "Model Path" above)
// and styles is defined with StyleSheet.create (see the sketch below).
function AiChat() {
  const { load, generate, unload, isLoaded, isGenerating, response } =
    useLlama(modelPath);
  const [input, setInput] = useState("");
    
      useEffect(() => {
        load();
        return () => unload();
      }, []);
    
      return (
        <View style={styles.container}>
          <ScrollView style={styles.responseArea}>
            <Text>{response}</Text>
          </ScrollView>
          <View style={styles.inputRow}>
            <TextInput
              value={input}
              onChangeText={setInput}
              style={styles.input}
              editable={!isGenerating}
            />
            <Button
              title="Send"
              onPress={() => {
                generate(input);
                setInput("");
              }}
              disabled={!isLoaded || isGenerating}
            />
          </View>
        </View>
      );
    }
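
The component references a styles object it never defines; a minimal StyleSheet to make the example self-contained (all layout values are arbitrary):

import { StyleSheet } from "react-native";

// One possible minimal layout: response area fills the screen,
// input row pinned below it.
const styles = StyleSheet.create({
  container: { flex: 1, padding: 16 },
  responseArea: { flex: 1, marginBottom: 8 },
  inputRow: { flexDirection: "row", alignItems: "center" },
  input: {
    flex: 1,
    borderWidth: 1,
    borderRadius: 8,
    padding: 8,
    marginRight: 8,
  },
});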
    

    Cross-Platform Considerations

    Performance Parity

    llama.rn runs native code on both platforms. The JavaScript bridge is only used for passing prompts and receiving tokens. Actual inference performance is identical to native Swift/Kotlin integration:

Device              1B Model (tok/s)   3B Model (tok/s)
iPhone 15 Pro       35-45              18-25
iPhone 14           25-32              14-18
Galaxy S24          35-45              18-25
Mid-range Android   18-25              8-12

The JS bridge adds under 1ms of overhead per token, which is negligible in practice.

    Model File Path Differences

    iOS and Android store files in different locations. Use react-native-fs to get platform-appropriate paths:

import RNFS from "react-native-fs";

// The same constant works on both platforms; RNFS resolves it to the
// iOS sandbox Documents directory or Android's internal files directory.
const modelDir = RNFS.DocumentDirectoryPath;
    

    Memory Management

    React Native does not expose direct memory APIs. Use platform-specific checks via a native module if you need to verify available RAM before loading. Alternatively, catch load failures and show an appropriate message.
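
A sketch of the fallback approach, assuming initLlama rejects its promise when the native load fails (for example, on a device without enough free memory); the alert copy is illustrative:

import { Alert } from "react-native";

// Wraps loadModel (from "Loading a Model" above) with a graceful failure path.
async function loadModelSafely(modelPath: string): Promise<boolean> {
  try {
    await loadModel(modelPath);
    return true;
  } catch (err) {
    // Load failures on low-RAM devices typically surface here.
    console.warn("Model load failed:", err);
    Alert.alert(
      "AI unavailable",
      "This device doesn't have enough free memory to load the model."
    );
    return false;
  }
}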

    Model Delivery in React Native

    Strategy 1: Bundle for Small Models

    For 1B models (~600MB), bundling with the app is feasible:

    • iOS: Add to Xcode project as a resource
• Android: Use Play Asset Delivery for files over 150MB

    Strategy 2: Download for All Models

    For 3B models (~1.7GB) or to keep the initial download small:

    async function ensureModelReady(): Promise<string> {
      const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
      const exists = await RNFS.exists(modelPath);
    
      if (exists) return modelPath;
    
      // Download with progress
      await RNFS.downloadFile({
        fromUrl: MODEL_CDN_URL,
        toFile: modelPath,
        progress: (res) => {
          updateProgress(res.bytesWritten / res.contentLength);
        },
      }).promise;
    
      // Verify integrity
      const hash = await RNFS.hash(modelPath, "sha256");
      if (hash !== EXPECTED_HASH) {
        await RNFS.unlink(modelPath);
        throw new Error("Model download corrupted");
      }
    
      return modelPath;
    }
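
Tying this back to the loading section, startup becomes a two-step call:

// Run before the AI feature becomes usable: download (or reuse)
// the cached model, then load it into a llama.rn context.
async function prepareAi(): Promise<void> {
  const modelPath = await ensureModelReady();
  await loadModel(modelPath); // from "Loading a Model" above
}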
    

    Production Best Practices

    1. Load model lazily. Only load when the user accesses the AI feature.
    2. Unload on blur. Release model memory when the AI screen is not focused.
    3. Handle errors gracefully. Model load can fail on low-memory devices. Show a clear message.
    4. Verify downloads. SHA256 hash check after download. Corrupted models cause crashes.
5. Buffer tokens. Batch 2-3 tokens before UI updates for smoother text display (see the sketch after this list).
6. Cancel support. Allow users to stop generation mid-stream (also shown below).
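
A sketch covering points 5 and 6 together. llama.rn exposes stopCompletion() on the context for cancellation; the flush threshold of three tokens is arbitrary and worth tuning for your UI:

import type { LlamaContext } from "llama.rn";

// Buffers tokens and flushes them to the UI every few tokens (plus any
// remainder at the end), reducing re-renders; stop() aborts mid-stream.
function createBufferedSession(
  ctx: LlamaContext,
  onFlush: (chunk: string) => void
) {
  let buffer = "";
  let pending = 0;

  const run = async (prompt: string) => {
    const result = await ctx.completion(
      { prompt, n_predict: 256, temperature: 0.7, stop: ["</s>", "<|eot_id|>"] },
      (data) => {
        buffer += data.token;
        pending += 1;
        if (pending >= 3) {        // flush every ~3 tokens
          onFlush(buffer);
          buffer = "";
          pending = 0;
        }
      }
    );
    if (buffer) onFlush(buffer);   // flush the remainder
    return result.text;
  };

  // stopCompletion() cancels the in-flight native generation
  const stop = () => ctx.stopCompletion();

  return { run, stop };
}

In the hook from earlier, onFlush would be (chunk) => setResponse((prev) => prev + chunk).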

    The model quality depends on fine-tuning. A base model gives generic responses. A model fine-tuned on your domain data (via Ertas or similar platforms) gives responses tailored to your app's specific use case, running at the same speed on the same hardware.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
