
AI in Flutter Apps: Cloud APIs, TFLite, and On-Device LLMs
Three paths to AI in Flutter. Cloud APIs via the http package, TensorFlow Lite for classical ML tasks, and on-device LLMs via llama.cpp for text generation. A practical comparison for Dart developers.
Flutter developers building AI features have three distinct paths. Cloud APIs give you access to frontier models through HTTP calls. TensorFlow Lite handles classical ML tasks on-device. And llama.cpp brings full LLM text generation to the device via platform channels or dart:ffi.
Each serves a different purpose. This guide compares them from a Dart developer's perspective.
Path 1: Cloud APIs
Flutter's http or dio packages make cloud API integration straightforward. The pattern works identically on iOS, Android, web, and desktop.
Basic Integration
import 'dart:convert';
import 'package:http/http.dart' as http;

// Assumes `apiKey` is defined elsewhere; in production, route requests
// through your own backend rather than shipping the key in the app.
Future<String> generateResponse(String prompt) async {
  final response = await http.post(
    Uri.parse('https://api.openai.com/v1/chat/completions'),
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer $apiKey',
    },
    body: jsonEncode({
      'model': 'gpt-4o-mini',
      'messages': [
        {'role': 'user', 'content': prompt},
      ],
    }),
  );
  if (response.statusCode != 200) {
    throw Exception('API error ${response.statusCode}: ${response.body}');
  }
  final data = jsonDecode(response.body);
  return data['choices'][0]['message']['content'];
}
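In a widget, the call slots into the usual async UI pattern. A minimal sketch of a send-button handler; the _loading, _messages, _error, and _controller state fields are illustrative, not part of the API above:

// Hypothetical handler inside a StatefulWidget's State class.
Future<void> _onSendPressed() async {
  setState(() => _loading = true);
  try {
    final reply = await generateResponse(_controller.text);
    setState(() => _messages.add(reply));
  } catch (e) {
    // Surface network and API failures instead of failing silently.
    setState(() => _error = e.toString());
  } finally {
    setState(() => _loading = false);
  }
}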
Streaming with SSE
For chat interfaces that display tokens as they arrive:
import 'dart:convert';
import 'package:flutter_client_sse/flutter_client_sse.dart';

void streamResponse(String prompt, void Function(String) onToken) {
  SSEClient.subscribeToSSE(
    // POST-with-body support requires flutter_client_sse 2.x.
    method: SSERequestType.POST,
    url: 'https://api.openai.com/v1/chat/completions',
    header: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer $apiKey',
    },
    body: {
      'model': 'gpt-4o-mini',
      'messages': [
        {'role': 'user', 'content': prompt},
      ],
      'stream': true,
    },
  ).listen((event) {
    final data = event.data?.trim();
    // OpenAI terminates the stream with a literal [DONE] sentinel.
    if (data == null || data.isEmpty || data == '[DONE]') return;
    final parsed = jsonDecode(data);
    final token = parsed['choices'][0]['delta']['content'];
    if (token != null) onToken(token);
  });
}
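Wiring the stream into a chat UI is then a matter of accumulating tokens into state. A sketch, assuming a _currentReply field on the surrounding State class:

final buffer = StringBuffer();
streamResponse('Summarize this article for me', (token) {
  buffer.write(token);
  // Rebuild on every token so the reply appears to type itself out.
  setState(() => _currentReply = buffer.toString());
});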
When to Use Cloud APIs
Cloud APIs are the right choice for prototyping, feature validation, and very low-volume apps. They require no native code, work across all Flutter platforms, and give you access to frontier models.
The trade-offs are the usual ones for cloud inference: per-token costs that scale with usage, a hard network dependency, 500-3,000 ms of round-trip latency, and data leaving the device on every request.
Path 2: TensorFlow Lite
The tflite_flutter plugin provides TFLite support for Flutter. TFLite runs optimized ML models on-device for specific tasks.
What TFLite Does Well
- Image classification and object detection
- Text classification and sentiment analysis
- On-device translation (pre-built models)
- Pose estimation
- Audio classification
Integration Pattern
import 'package:tflite_flutter/tflite_flutter.dart';

class TextClassifier {
  // Output width of the model; 2 is an assumption (e.g. positive/negative).
  static const numClasses = 2;

  late Interpreter _interpreter;

  Future<void> loadModel() async {
    // Asset path as declared in pubspec.yaml.
    _interpreter = await Interpreter.fromAsset('model.tflite');
  }

  List<double> classify(List<int> tokenizedInput) {
    final output = List.filled(1 * numClasses, 0.0).reshape([1, numClasses]);
    _interpreter.run([tokenizedInput], output);
    return List<double>.from(output[0]);
  }
}
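Using it end to end looks like this; tokenize is a hypothetical helper that must reproduce the exact vocabulary, sequence length, and padding the model was trained with:

import 'dart:math';

Future<void> runSentiment() async {
  final classifier = TextClassifier();
  await classifier.loadModel();

  final scores = classifier.classify(tokenize('great app, works offline'));
  // Pick the highest-scoring class index.
  final predicted = scores.indexOf(scores.reduce(max));
  print('class $predicted (score: ${scores[predicted]})');
}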
What TFLite Cannot Do
TFLite does not support large language models for open-ended text generation. There is no TFLite equivalent of ChatGPT or Claude. You cannot use TFLite for conversational AI, content drafting, summarization, or any task that requires generating natural language responses.
For those tasks, you need either a cloud API or an on-device LLM.
Cost
Free. TFLite runs entirely on-device. The models are small (typically 1-50MB) and bundled with the app.
Path 3: On-Device LLMs via llama.cpp
Run a full language model on the user's device. llama.cpp handles inference. GGUF models provide the intelligence. Flutter communicates through platform channels (method channels or FFI).
Integration Approaches
Platform channels: Write a thin native wrapper in Swift (iOS) and Kotlin (Android) that calls llama.cpp, then communicate from Dart via MethodChannel.
// Dart side
import 'package:flutter/services.dart';

class OnDeviceLlm {
  static const _channel = MethodChannel('com.app/llm');
  static const _eventChannel = EventChannel('com.app/llm_stream');

  Future<void> loadModel(String path) async {
    await _channel.invokeMethod('loadModel', {'path': path});
  }

  Future<String> generate(String prompt) async {
    return await _channel.invokeMethod('generate', {'prompt': prompt});
  }

  Stream<String> generateStream(String prompt) {
    // The native side should buffer tokens or start emitting only once
    // the stream is listened to; events sent before a listener attaches
    // are dropped.
    _channel.invokeMethod('startGeneration', {'prompt': prompt});
    return _eventChannel.receiveBroadcastStream().map((e) => e as String);
  }
}
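From a widget, streaming into the UI mirrors the cloud SSE pattern. A sketch, assuming the native wrapper emits one decoded token per event and a _reply state field exists:

Future<void> runLocalGeneration(String modelPath) async {
  final llm = OnDeviceLlm();
  await llm.loadModel(modelPath);

  final buffer = StringBuffer();
  await for (final token in llm.generateStream('Draft a welcome message')) {
    buffer.write(token);
    setState(() => _reply = buffer.toString());
  }
}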
Dart FFI: Use dart:ffi to call llama.cpp's C API directly. This avoids the platform channel overhead but requires more setup:
import 'dart:ffi';
import 'package:ffi/ffi.dart'; // for Utf8 and the allocators

// Bind to the llama.cpp shared library.
final llamaLib = DynamicLibrary.open('libllama.so'); // Android
// Use DynamicLibrary.process() on iOS (statically linked).

// Simplified signature: llama.cpp's real loader also takes a params struct.
typedef LlamaInitNative = Pointer<Void> Function(Pointer<Utf8>);
typedef LlamaInit = Pointer<Void> Function(Pointer<Utf8>);

final llamaInit = llamaLib
    .lookupFunction<LlamaInitNative, LlamaInit>('llama_load_model');
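Calling the bound function means marshalling the Dart string to native UTF-8 and freeing it afterward. A minimal sketch using package:ffi's default malloc allocator:

Pointer<Void> loadModelNative(String path) {
  final pathPtr = path.toNativeUtf8(); // heap-allocated; must be freed
  try {
    return llamaInit(pathPtr);
  } finally {
    malloc.free(pathPtr);
  }
}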
Model Delivery in Flutter
Bundled: Place the GGUF file in the platform-specific asset directories. For Android, use asset delivery for large files. For iOS, add to the Xcode project.
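However it is bundled, llama.cpp ultimately needs a real filesystem path, so the asset has to be copied out of the bundle once on first launch. A sketch using Flutter assets; note that rootBundle.load reads the whole file into memory, which suits smaller models, while multi-GB files are better served by native asset APIs or a download:

import 'dart:io';
import 'package:flutter/services.dart';
import 'package:path_provider/path_provider.dart';

Future<String> materializeModel() async {
  final dir = await getApplicationDocumentsDirectory();
  final file = File('${dir.path}/model.gguf');
  if (!await file.exists()) {
    final data = await rootBundle.load('assets/model.gguf');
    await file.writeAsBytes(data.buffer.asUint8List(), flush: true);
  }
  return file.path;
}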
Downloaded: Use dio or http for background downloads with progress:
import 'package:dio/dio.dart';
import 'package:path_provider/path_provider.dart';

// `modelCdnUrl` points at wherever you host the GGUF file.
Future<void> downloadModel() async {
  final dir = await getApplicationDocumentsDirectory();
  final modelPath = '${dir.path}/model.gguf';
  await Dio().download(
    modelCdnUrl,
    modelPath,
    onReceiveProgress: (received, total) {
      // `total` is -1 when the server omits Content-Length.
      final progress = total > 0 ? received / total : null;
      // Update UI with download progress.
    },
  );
}
Performance
On-device inference speed matches native apps, since llama.cpp itself runs as native code rather than in the Dart VM; the platform channel or FFI overhead is negligible (under 1 ms per token).
| Device | 1B Model (tok/s) | 3B Model (tok/s) |
|---|---|---|
| iPhone 15 Pro (A17) | 35-45 | 18-25 |
| Galaxy S24 (SD 8 Gen 3) | 35-45 | 18-25 |
| Pixel 9 (Tensor G4) | 30-40 | 15-22 |
| Mid-range 2024+ | 18-25 | 8-12 |
The Comparison
| Factor | Cloud API | TFLite | On-Device LLM |
|---|---|---|---|
| Text generation | Yes | No | Yes |
| Image classification | Via API | Yes (optimized) | No |
| Offline support | No | Yes | Yes |
| Cost per inference | $0.0001-$0.01 | $0 | $0 |
| Flutter integration | Native Dart | Plugin | Platform channel/FFI |
| Custom models | Via API selection | Custom TFLite | Any GGUF model |
| Model size | N/A (server-side) | 1-50MB | 600MB-1.7GB |
Practical Decision Framework
Use cloud APIs when you are validating a feature, the user base is small, or you need frontier-model reasoning. The http package makes this trivial in Dart.
Use TFLite when you need image classification, object detection, text classification, or other classical ML tasks. Google's pre-built models cover many common use cases.
Use on-device LLMs when you need conversational AI, content generation, summarization, or any text-heavy AI feature. The zero per-inference cost, offline support, and privacy guarantees make this the right choice for production apps at scale.
The fine-tuning step is where you make the on-device model competitive with cloud APIs on your specific task. Platforms like Ertas handle the full workflow: upload training data, fine-tune with LoRA, export GGUF, deploy to any device. A fine-tuned 3B model typically outperforms prompted cloud models on domain-specific tasks.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.