On-Device AI in React Native with llama.rn

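llama.rn is a React Native library that provides JavaScript bindings for llama.cpp. It runs GGUF language models natively on iOS (Metal) and Android (CPU/Vulkan) through the same JavaScript API.

For React Native developers, this means on-device AI with one codebase, one API, and zero cloud dependencies.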

Installation

npm install llama.rn
# or
yarn add llama.rn

On iOS, run pod install:

cd ios && pod install

On Android, the native library is included automatically through autolinking.

Expo

If you use Expo, you need a development build (not Expo Go), because llama.rn contains native code:

npx expo prebuild
npx expo run:ios  # or run:android

Loading a Model

import { initLlama, LlamaContext } from "llama.rn";

let context: LlamaContext | null = null;

async function loadModel(modelPath: string) {
  context = await initLlama({
    model: modelPath,
    n_ctx: 2048,        // context window
    n_threads: 4,       // CPU threads
    n_gpu_layers: 99,   // offload to GPU (Metal/Vulkan)
    use_mlock: true,    // lock the model in memory
  });

  console.log("Model loaded successfully");
}

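Model Path

The model path must point to a local file on the device's file system. How the file gets onto the device depends on your delivery strategy:

Bundled with the app: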

import RNFS from "react-native-fs";

// iOS: the model is copied into the app bundle and referenced through RNFS
const iosModelPath = `${RNFS.MainBundlePath}/model.gguf`;

// Android: copy from assets to the files directory on first launch
const androidModelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
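
The Android comment above implies a one-time copy step. Here is a minimal sketch of that step, written against a small file-system interface so it can be exercised without a device. On a real device, react-native-fs's `exists` and `copyFileAssets` are assumed to fill that interface; `ensureBundledModel` and `AssetFs` are hypothetical names, not part of any library.

```typescript
// Minimal file-system surface this sketch needs. On device, react-native-fs
// (RNFS.exists, RNFS.copyFileAssets) is assumed to satisfy it.
interface AssetFs {
  exists(path: string): Promise<boolean>;
  copyFileAssets(assetName: string, destPath: string): Promise<void>;
}

// Copies the bundled asset into the app's files directory once and
// returns the destination path. Later launches skip the copy.
async function ensureBundledModel(
  fs: AssetFs,
  assetName: string,
  destDir: string
): Promise<string> {
  const destPath = `${destDir}/${assetName}`;
  if (!(await fs.exists(destPath))) {
    await fs.copyFileAssets(assetName, destPath);
  }
  return destPath;
}
```

On Android this might be called as `ensureBundledModel(RNFS, "model.gguf", RNFS.DocumentDirectoryPath)`, with the returned path passed to `initLlama`.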

Downloaded after install:

import RNFS from "react-native-fs";

const modelUrl = "https://cdn.example.com/model.gguf";
const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;

const download = RNFS.downloadFile({
  fromUrl: modelUrl,
  toFile: modelPath,
  progress: (res) => {
    const percentage = (res.bytesWritten / res.contentLength) * 100;
    setDownloadProgress(percentage);
  },
});

await download.promise;
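
The progress callback above divides by `res.contentLength`, which only gives a meaningful number when the server reports a size. A small helper, sketched here under the assumption that a missing size shows up as zero or a negative value, keeps the UI from showing NaN or values past 100. `downloadPercent` is a hypothetical name.

```typescript
// Returns a percentage in [0, 100], or null when the total size is unknown.
function downloadPercent(
  bytesWritten: number,
  contentLength: number
): number | null {
  // Covers 0, negative values, and NaN (assumed markers for "size unknown").
  if (!(contentLength > 0)) return null;
  const pct = (bytesWritten / contentLength) * 100;
  // Clamp so a misreported size never pushes the bar outside its range.
  return Math.min(100, Math.max(0, pct));
}
```

When the helper returns null, the UI can fall back to an indeterminate spinner instead of a percentage.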

Generating Text

Simple Generation

async function generate(prompt: string): Promise<string> {
  if (!context) throw new Error("Model not loaded");

  const result = await context.completion({
    prompt: prompt,
    n_predict: 256,
    temperature: 0.7,
    top_p: 0.9,
    stop: ["</s>", "<|eot_id|>"],  // stop tokens
  });

  return result.text;
}

Streaming Generation

async function generateStream(
  prompt: string,
  onToken: (token: string) => void
): Promise<string> {
  if (!context) throw new Error("Model not loaded");

  const result = await context.completion(
    {
      prompt: prompt,
      n_predict: 256,
      temperature: 0.7,
      stop: ["</s>", "<|eot_id|>"],
    },
    (data) => {
      // Called once for each generated token
      onToken(data.token);
    }
  );

  return result.text;
}
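
Because streaming hands control back to JavaScript on every token, it is also a natural place to let the user stop a long response. The sketch below assumes the context object exposes a `stopCompletion()` method alongside `completion()`. That method is an assumption here and is not shown in this article, so check it against the llama.rn version you use. `StreamingContext` and `createCancellableRun` are hypothetical names.

```typescript
interface TokenData {
  token: string;
}

// The slice of the context this sketch relies on.
interface StreamingContext {
  completion(
    params: { prompt: string; n_predict: number },
    onToken: (data: TokenData) => void
  ): Promise<{ text: string }>;
  stopCompletion(): Promise<void>; // assumed API, verify before use
}

function createCancellableRun(ctx: StreamingContext) {
  let running = false;
  return {
    // Streams tokens to onToken and resolves with the text produced so far.
    async run(prompt: string, onToken: (t: string) => void): Promise<string> {
      running = true;
      try {
        const result = await ctx.completion(
          { prompt, n_predict: 256 },
          (data) => onToken(data.token)
        );
        return result.text;
      } finally {
        running = false;
      }
    },
    // Safe to call at any time; does nothing when no generation is active.
    async cancel(): Promise<void> {
      if (running) await ctx.stopCompletion();
    },
  };
}
```

A "Stop" button would call `cancel()`, and the pending `run()` promise then resolves with whatever text was generated before the stop.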

Chat Completion

For multi-turn conversations, format the prompt with the model's chat template:

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

function formatChat(messages: Message[]): string {
  // Llama 3.2 chat template
  let prompt = "<|begin_of_text|>";

  for (const msg of messages) {
    prompt += `<|start_header_id|>${msg.role}<|end_header_id|>\n\n${msg.content}<|eot_id|>`;
  }

  prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n";
  return prompt;
}

async function chat(messages: Message[], onToken: (token: string) => void) {
  const prompt = formatChat(messages);
  return generateStream(prompt, onToken);
}
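
Every turn added to `messages` lengthens the prompt, and the prompt plus the reply has to fit inside `n_ctx`. One hedged way to handle this is to drop the oldest non-system turns until a rough estimate fits. The sketch below assumes about 4 characters per token and a fixed per-message overhead for the template headers. Both numbers are crude guesses, not tokenizer counts, and `trimHistory` and `estimateTokens` are hypothetical helpers.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

const CHARS_PER_TOKEN = 4;      // rough assumption, not a real tokenizer
const PER_MESSAGE_OVERHEAD = 4; // assumed cost of the template header tokens

function estimateTokens(msg: ChatMessage): number {
  return Math.ceil(msg.content.length / CHARS_PER_TOKEN) + PER_MESSAGE_OVERHEAD;
}

// Keeps all system messages plus as many of the newest turns as fit
// in nCtx after reserving nPredict tokens for the reply.
function trimHistory(
  messages: ChatMessage[],
  nCtx: number,
  nPredict: number
): ChatMessage[] {
  const budget = nCtx - nPredict;
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + estimateTokens(m), 0);
  const kept: ChatMessage[] = [];

  // Walk from newest to oldest, stopping at the first turn that does not fit.
  for (const turn of turns.slice().reverse()) {
    const cost = estimateTokens(turn);
    if (used + cost > budget) break;
    used += cost;
    kept.unshift(turn);
  }
  return system.concat(kept);
}
```

Calling `formatChat(trimHistory(messages, 2048, 256))` would keep the prompt within the context size used in the loading example, under these estimates.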

React Hook Pattern

import { useState, useCallback, useRef } from "react";
import { initLlama, LlamaContext } from "llama.rn";

export function useLlama(modelPath: string) {
  const contextRef = useRef<LlamaContext | null>(null);
  const [isLoaded, setIsLoaded] = useState(false);
  const [isGenerating, setIsGenerating] = useState(false);
  const [response, setResponse] = useState("");

  const load = useCallback(async () => {
    contextRef.current = await initLlama({
      model: modelPath,
      n_ctx: 2048,
      n_threads: 4,
      n_gpu_layers: 99,
    });
    setIsLoaded(true);
  }, [modelPath]);

  const generate = useCallback(async (prompt: string) => {
    if (!contextRef.current) return;
    setIsGenerating(true);
    setResponse("");

    try {
      await contextRef.current.completion(
        {
          prompt,
          n_predict: 256,
          temperature: 0.7,
          stop: ["</s>", "<|eot_id|>"],
        },
        (data) => {
          setResponse((prev) => prev + data.token);
        }
      );
    } finally {
      // Reset the flag even if completion throws, so the UI is not left disabled
      setIsGenerating(false);
    }
  }, []);

  const unload = useCallback(() => {
    contextRef.current?.release();
    contextRef.current = null;
    setIsLoaded(false);
  }, []);

  return { load, generate, unload, isLoaded, isGenerating, response };
}

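Using It in a Component

import { useEffect, useState } from "react";
import { Button, ScrollView, Text, TextInput, View } from "react-native";

// modelPath and styles are assumed to be defined elsewhere in the app
function AiChat() {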
  const { load, generate, unload, isLoaded, isGenerating, response } =
    useLlama(modelPath);
  const [input, setInput] = useState("");

  useEffect(() => {
    load();
    return () => unload();
  }, []);

  return (
    <View style={styles.container}>
      <ScrollView style={styles.responseArea}>
        <Text>{response}</Text>
      </ScrollView>
      <View style={styles.inputRow}>
        <TextInput
          value={input}
          onChangeText={setInput}
          style={styles.input}
          editable={!isGenerating}
        />
        <Button
          title="Send"
          onPress={() => {
            generate(input);
            setInput("");
          }}
          disabled={!isLoaded || isGenerating}
        />
      </View>
    </View>
  );
}

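Cross-Platform Considerations

Performance Consistency

llama.rn runs native code on both platforms. The JavaScript bridge is used only to pass prompts and receive tokens. Actual inference performance matches a native Swift/Kotlin integration: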

Device	1B model (tok/s)	3B model (tok/s)
iPhone 15 Pro	35-45	18-25
iPhone 14	25-32	14-18
Galaxy S24	35-45	18-25
Mid-range Android	18-25	8-12

The JS bridge adds less than 1 ms of overhead per token, which is negligible.
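
To turn the throughput table into a rough user-facing wait time, divide the number of tokens by the rate. A short sketch of that arithmetic, using the 3B mid-range Android figures above and the `n_predict: 256` limit from the earlier examples. It ignores prompt processing time, which the table does not cover, and `generationSeconds` is a hypothetical helper.

```typescript
// Seconds needed to generate `tokens` tokens at a steady rate.
function generationSeconds(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

// 256 tokens with a 3B model on a mid-range Android device (8-12 tok/s)
const slowest = generationSeconds(256, 8);  // 32 seconds
const fastest = generationSeconds(256, 12); // about 21.3 seconds
```

A full 256-token reply on that class of device therefore takes roughly 21 to 32 seconds, which is one reason to stream tokens instead of waiting for the whole response.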

Model File Path Differences

iOS and Android store files in different locations. Use react-native-fs to get the platform-appropriate path:

import RNFS from "react-native-fs";

// Same API on both platforms; the underlying path differs per OS
const modelDir = RNFS.DocumentDirectoryPath;

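Memory Management

React Native does not expose a direct memory API. To verify available RAM before loading, run a platform-specific check through a native module. Alternatively, catch the load failure and show an appropriate message.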
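
The second option, catching the load failure, can be sketched as a thin wrapper that turns an exception into a value the UI can render. The loader is passed in as a function, so the sketch does not depend on any particular llama.rn signature. `safeLoad` and `LoadResult` are hypothetical names, and the message text is a placeholder.

```typescript
type LoadResult<T> =
  | { ok: true; context: T }
  | { ok: false; message: string };

// Runs the loader and reports failure as data instead of throwing.
async function safeLoad<T>(loader: () => Promise<T>): Promise<LoadResult<T>> {
  try {
    return { ok: true, context: await loader() };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return {
      ok: false,
      message: `Could not load the model on this device: ${detail}`,
    };
  }
}
```

With the loading example from earlier, the call might look like `safeLoad(() => initLlama({ model: modelPath, n_ctx: 2048 }))`, with the `message` shown to the user when `ok` is false.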

Model Delivery in React Native

Strategy 1: Bundle Small Models

For 1B models (about 600 MB), bundling with the app is feasible:

iOS: add the model as a resource in the Xcode project
Android: use Play Asset Delivery for files over 150 MB

Strategy 2: Download All Models

For 3B models (about 1.7 GB), or to keep the initial download size small:

async function ensureModelReady(): Promise<string> {
  const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
  const exists = await RNFS.exists(modelPath);

  if (exists) return modelPath;

  // Download with progress reporting
  await RNFS.downloadFile({
    fromUrl: MODEL_CDN_URL,
    toFile: modelPath,
    progress: (res) => {
      updateProgress(res.bytesWritten / res.contentLength);
    },
  }).promise;

  // Verify integrity
  const hash = await RNFS.hash(modelPath, "sha256");
  if (hash !== EXPECTED_HASH) {
    await RNFS.unlink(modelPath);
    throw new Error("Model download corrupted");
  }

  return modelPath;
}
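
One gap in `ensureModelReady` as written: if the app is killed mid-download, a partial file is left at `modelPath`, and the `exists` check accepts it on the next launch. A hedged variant downloads to a temporary path and moves the file into place only after the hash matches. The sketch is written against a small interface. On device, react-native-fs's `exists`, `unlink`, `hash`, and `moveFile` are assumed to fill it, and the `download` argument stands in for the `RNFS.downloadFile(...).promise` call above. `downloadVerified`, `ModelFs`, and the `.part` suffix are hypothetical.

```typescript
interface ModelFs {
  exists(path: string): Promise<boolean>;
  unlink(path: string): Promise<void>;
  hash(path: string, algorithm: string): Promise<string>;
  moveFile(from: string, to: string): Promise<void>;
}

async function downloadVerified(
  fs: ModelFs,
  download: (toFile: string) => Promise<void>,
  finalPath: string,
  expectedSha256: string
): Promise<string> {
  // Only a fully verified file ever lands at finalPath.
  if (await fs.exists(finalPath)) return finalPath;

  const tempPath = `${finalPath}.part`;
  // Remove a stale partial file left by an interrupted run.
  if (await fs.exists(tempPath)) await fs.unlink(tempPath);
  await download(tempPath);

  const actual = await fs.hash(tempPath, "sha256");
  if (actual.toLowerCase() !== expectedSha256.toLowerCase()) {
    await fs.unlink(tempPath);
    throw new Error("Model download corrupted");
  }

  await fs.moveFile(tempPath, finalPath);
  return finalPath;
}
```

This version restarts an interrupted download from zero. Resuming a partial download would need server support for range requests and is outside this sketch.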

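Production Best Practices

Lazy-load the model. Load it only when the user opens the AI feature.
Unload on blur. Release model memory when the AI screen is not in focus.
Handle errors gracefully. Model loading can fail on low-memory devices. Show a clear message.
Verify downloads. Run a SHA256 hash check after downloading. A corrupted model causes crashes.
Buffer tokens. Batch 2-3 tokens before updating the UI for smoother text display.
Support cancellation. Let users stop a generation partway through.

Model quality depends on fine-tuning. Base models give generic responses. A model fine-tuned on your domain data (through Ertas or a similar platform) gives responses tailored to your app's specific use case, and it runs at the same speed on the same hardware.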
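
The token-buffering advice can be sketched as a small batching helper that collects tokens and hands them to the UI in groups. This version flushes by count only, with a batch size chosen from the 2-3 range above. `createTokenBuffer` is a hypothetical helper, and a production version might also flush on a timer so slow generations still update promptly.

```typescript
// Collects tokens and emits them to onFlush in batches of batchSize.
function createTokenBuffer(
  batchSize: number,
  onFlush: (text: string) => void
) {
  let pending: string[] = [];

  const flush = (): void => {
    if (pending.length === 0) return;
    const text = pending.join("");
    pending = [];
    onFlush(text);
  };

  return {
    push(token: string): void {
      pending.push(token);
      if (pending.length >= batchSize) flush();
    },
    // Call once after the completion resolves to emit any remainder.
    flush,
  };
}
```

In the hook shown earlier, the streaming callback would call `buffer.push(data.token)`, with `onFlush` appending to the response state, and `buffer.flush()` would run after the `completion` call resolves.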