
    On-Device AI in React Native with llama.rn

    How to run language models directly on the user's phone in a React Native app. Setup, model loading, streaming generation, and cross-platform considerations using llama.rn.

Ertas Team

    llama.rn is a React Native library that provides JavaScript bindings to llama.cpp. It runs GGUF language models natively on both iOS (Metal) and Android (CPU/Vulkan) through the same JavaScript API.

    For React Native developers, this means on-device AI with one codebase, one API, and zero cloud dependency.

    Installation

    npm install llama.rn
    # or
    yarn add llama.rn
    

    For iOS, run pod install:

    cd ios && pod install
    

    For Android, the native library is included automatically via autolinking.

    Expo

    If you are using Expo, you need a development build (not Expo Go) since llama.rn includes native code:

    npx expo prebuild
    npx expo run:ios  # or run:android
    

    Loading a Model

    import { initLlama, LlamaContext } from "llama.rn";
    
    let context: LlamaContext | null = null;
    
    async function loadModel(modelPath: string) {
      context = await initLlama({
        model: modelPath,
        n_ctx: 2048,        // Context window
        n_threads: 4,       // CPU threads
        n_gpu_layers: 99,   // Offload to GPU (Metal/Vulkan)
        use_mlock: true,    // Lock model in memory
      });
    
      console.log("Model loaded successfully");
    }
    

    Model Path

    The model path must point to a local file on the device's filesystem. How you get the file there depends on your delivery strategy:

    Bundled with app:

// iOS: copy to the app bundle and reference it via RNFS
import RNFS from "react-native-fs";
const iosModelPath = `${RNFS.MainBundlePath}/model.gguf`;

// Android: copy from assets to the files directory on first launch
// (see the sketch below), then reference the copy
const androidModelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
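
The Android comment above glosses over the copy itself. A minimal sketch of that first-launch copy, using react-native-fs's Android-only copyFileAssets and assuming the model ships at android/app/src/main/assets/model.gguf:

import RNFS from "react-native-fs";
import { Platform } from "react-native";

// Copies the bundled model out of the app package on first launch.
// Assumes the file ships at android/app/src/main/assets/model.gguf.
async function ensureAndroidModelCopied(): Promise<string> {
  const destPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;

  if (Platform.OS === "android" && !(await RNFS.exists(destPath))) {
    // copyFileAssets is Android-only and reads from the APK/AAB assets folder
    await RNFS.copyFileAssets("model.gguf", destPath);
  }

  return destPath;
}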
    

    Downloaded post-install:

    import RNFS from "react-native-fs";
    
    const modelUrl = "https://cdn.example.com/model.gguf";
    const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
    
    const download = RNFS.downloadFile({
      fromUrl: modelUrl,
      toFile: modelPath,
      progress: (res) => {
        const percentage = (res.bytesWritten / res.contentLength) * 100;
        setDownloadProgress(percentage);
      },
    });
    
    await download.promise;
    

    Generating Text

    Simple Generation

    async function generate(prompt: string): Promise<string> {
      if (!context) throw new Error("Model not loaded");
    
      const result = await context.completion({
        prompt: prompt,
        n_predict: 256,
        temperature: 0.7,
        top_p: 0.9,
        stop: ["</s>", "<|eot_id|>"],  // Stop tokens
      });
    
      return result.text;
    }
    

    Streaming Generation

    async function generateStream(
      prompt: string,
      onToken: (token: string) => void
    ): Promise<string> {
      if (!context) throw new Error("Model not loaded");
    
      const result = await context.completion(
        {
          prompt: prompt,
          n_predict: 256,
          temperature: 0.7,
          stop: ["</s>", "<|eot_id|>"],
        },
        (data) => {
          // Called for each generated token
          onToken(data.token);
        }
      );
    
      return result.text;
    }
    

    Chat Completion

    For multi-turn conversations, format the prompt using the model's chat template:

    interface Message {
      role: "system" | "user" | "assistant";
      content: string;
    }
    
    function formatChat(messages: Message[]): string {
      // Llama 3.2 chat template
      let prompt = "<|begin_of_text|>";
    
      for (const msg of messages) {
        prompt += `<|start_header_id|>${msg.role}<|end_header_id|>\n\n${msg.content}<|eot_id|>`;
      }
    
      prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n";
      return prompt;
    }
    
    async function chat(messages: Message[], onToken: (token: string) => void) {
      const prompt = formatChat(messages);
      return generateStream(prompt, onToken);
    }
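
Hand-rolling templates is easy to get subtly wrong, and it hard-codes one model family. Recent llama.rn versions can also take a messages array and apply the chat template embedded in the GGUF; a sketch, assuming the model file you use ships a template:

// Alternative: pass messages directly and let llama.rn apply the chat
// template embedded in the GGUF (most instruct models ship one).
async function chatWithTemplate(
  messages: Message[],
  onToken: (token: string) => void
): Promise<string> {
  if (!context) throw new Error("Model not loaded");

  const result = await context.completion(
    {
      messages,          // formatted via the model's built-in template
      n_predict: 256,
      temperature: 0.7,
    },
    (data) => onToken(data.token)
  );

  return result.text;
}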
    

    React Hook Pattern

    import { useState, useCallback, useRef } from "react";
    import { initLlama, LlamaContext } from "llama.rn";
    
    export function useLlama(modelPath: string) {
      const contextRef = useRef<LlamaContext | null>(null);
      const [isLoaded, setIsLoaded] = useState(false);
      const [isGenerating, setIsGenerating] = useState(false);
      const [response, setResponse] = useState("");
    
      const load = useCallback(async () => {
        contextRef.current = await initLlama({
          model: modelPath,
          n_ctx: 2048,
          n_threads: 4,
          n_gpu_layers: 99,
        });
        setIsLoaded(true);
      }, [modelPath]);
    
      const generate = useCallback(async (prompt: string) => {
        if (!contextRef.current) return;
        setIsGenerating(true);
        setResponse("");
    
        await contextRef.current.completion(
          {
            prompt,
            n_predict: 256,
            temperature: 0.7,
            stop: ["</s>", "<|eot_id|>"],
          },
          (data) => {
            setResponse((prev) => prev + data.token);
          }
        );
    
        setIsGenerating(false);
      }, []);
    
      const unload = useCallback(() => {
        contextRef.current?.release();
        contextRef.current = null;
        setIsLoaded(false);
      }, []);
    
      return { load, generate, unload, isLoaded, isGenerating, response };
    }
    

    Usage in a Component

import { useState, useEffect } from "react";
import { View, ScrollView, Text, TextInput, Button } from "react-native";

// Assumes modelPath is resolved elsewhere (see "Model Path" above)
// and styles is defined with StyleSheet.create (see the sketch below).
function AiChat() {
  const { load, generate, unload, isLoaded, isGenerating, response } =
    useLlama(modelPath);
  const [input, setInput] = useState("");
    
      useEffect(() => {
        load();
        return () => unload();
      }, []);
    
      return (
        <View style={styles.container}>
          <ScrollView style={styles.responseArea}>
            <Text>{response}</Text>
          </ScrollView>
          <View style={styles.inputRow}>
            <TextInput
              value={input}
              onChangeText={setInput}
              style={styles.input}
              editable={!isGenerating}
            />
            <Button
              title="Send"
              onPress={() => {
                generate(input);
                setInput("");
              }}
              disabled={!isLoaded || isGenerating}
            />
          </View>
        </View>
      );
    }
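
The component references a styles object it never defines; a minimal StyleSheet to make the example self-contained (all layout values are arbitrary):

import { StyleSheet } from "react-native";

// One possible minimal layout: response area fills the screen,
// input row pinned below it.
const styles = StyleSheet.create({
  container: { flex: 1, padding: 16 },
  responseArea: { flex: 1, marginBottom: 8 },
  inputRow: { flexDirection: "row", alignItems: "center" },
  input: {
    flex: 1,
    borderWidth: 1,
    borderRadius: 8,
    padding: 8,
    marginRight: 8,
  },
});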
    

    Cross-Platform Considerations

    Performance Parity

    llama.rn runs native code on both platforms. The JavaScript bridge is only used for passing prompts and receiving tokens. Actual inference performance is identical to native Swift/Kotlin integration:

Device              1B Model (tok/s)   3B Model (tok/s)
iPhone 15 Pro       35-45              18-25
iPhone 14           25-32              14-18
Galaxy S24          35-45              18-25
Mid-range Android   18-25              8-12

The JS bridge adds under 1ms of overhead per token, which is negligible in practice.

    Model File Path Differences

    iOS and Android store files in different locations. Use react-native-fs to get platform-appropriate paths:

import RNFS from "react-native-fs";

// The same constant works on both platforms; RNFS resolves it to the
// iOS sandbox Documents directory or Android's internal files directory.
const modelDir = RNFS.DocumentDirectoryPath;
    

    Memory Management

    React Native does not expose direct memory APIs. Use platform-specific checks via a native module if you need to verify available RAM before loading. Alternatively, catch load failures and show an appropriate message.
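
A sketch of the fallback approach, assuming initLlama rejects its promise when the native load fails (for example, on a device without enough free memory); the alert copy is illustrative:

import { Alert } from "react-native";

// Wraps loadModel (from "Loading a Model" above) with a graceful failure path.
async function loadModelSafely(modelPath: string): Promise<boolean> {
  try {
    await loadModel(modelPath);
    return true;
  } catch (err) {
    // Load failures on low-RAM devices typically surface here.
    console.warn("Model load failed:", err);
    Alert.alert(
      "AI unavailable",
      "This device doesn't have enough free memory to load the model."
    );
    return false;
  }
}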

    Model Delivery in React Native

    Strategy 1: Bundle for Small Models

    For 1B models (~600MB), bundling with the app is feasible:

    • iOS: Add to Xcode project as a resource
• Android: Use Play Asset Delivery for files over 150MB

    Strategy 2: Download for All Models

    For 3B models (~1.7GB) or to keep the initial download small:

    async function ensureModelReady(): Promise<string> {
      const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
      const exists = await RNFS.exists(modelPath);
    
      if (exists) return modelPath;
    
      // Download with progress
      await RNFS.downloadFile({
        fromUrl: MODEL_CDN_URL,
        toFile: modelPath,
        progress: (res) => {
          updateProgress(res.bytesWritten / res.contentLength);
        },
      }).promise;
    
      // Verify integrity
      const hash = await RNFS.hash(modelPath, "sha256");
      if (hash !== EXPECTED_HASH) {
        await RNFS.unlink(modelPath);
        throw new Error("Model download corrupted");
      }
    
      return modelPath;
    }
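
Tying this back to the loading section, startup becomes a two-step call:

// Run before the AI feature becomes usable: download (or reuse)
// the cached model, then load it into a llama.rn context.
async function prepareAi(): Promise<void> {
  const modelPath = await ensureModelReady();
  await loadModel(modelPath); // from "Loading a Model" above
}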
    

    Production Best Practices

    1. Load model lazily. Only load when the user accesses the AI feature.
    2. Unload on blur. Release model memory when the AI screen is not focused.
    3. Handle errors gracefully. Model load can fail on low-memory devices. Show a clear message.
    4. Verify downloads. SHA256 hash check after download. Corrupted models cause crashes.
5. Buffer tokens. Batch 2-3 tokens before UI updates for smoother text display (see the sketch after this list).
6. Cancel support. Allow users to stop generation mid-stream (also shown below).
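
A sketch covering points 5 and 6 together. llama.rn exposes stopCompletion() on the context for cancellation; the flush threshold of three tokens is arbitrary and worth tuning for your UI:

import type { LlamaContext } from "llama.rn";

// Buffers tokens and flushes them to the UI every few tokens (plus any
// remainder at the end), reducing re-renders; stop() aborts mid-stream.
function createBufferedSession(
  ctx: LlamaContext,
  onFlush: (chunk: string) => void
) {
  let buffer = "";
  let pending = 0;

  const run = async (prompt: string) => {
    const result = await ctx.completion(
      { prompt, n_predict: 256, temperature: 0.7, stop: ["</s>", "<|eot_id|>"] },
      (data) => {
        buffer += data.token;
        pending += 1;
        if (pending >= 3) {        // flush every ~3 tokens
          onFlush(buffer);
          buffer = "";
          pending = 0;
        }
      }
    );
    if (buffer) onFlush(buffer);   // flush the remainder
    return result.text;
  };

  // stopCompletion() cancels the in-flight native generation
  const stop = () => ctx.stopCompletion();

  return { run, stop };
}

In the hook from earlier, onFlush would be (chunk) => setResponse((prev) => prev + chunk).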

    The model quality depends on fine-tuning. A base model gives generic responses. A model fine-tuned on your domain data (via Ertas or similar platforms) gives responses tailored to your app's specific use case, running at the same speed on the same hardware.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
