
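LangChain + Fine-Tuned Local Models: Build Pipelines with Zero API Fees
LangChain works with any OpenAI-compatible API, including Ollama. Replace the API calls in your LangChain pipeline with a fine-tuned local model: same chain structure, no per-token billing.
LangChain is one of the most widely used frameworks for building AI pipelines: document processing, RAG, agents, chains. Most LangChain tutorials point at OpenAI's API, and your production bill reflects that.
LangChain supports local models through Ollama. Because Ollama exposes an OpenAI-compatible endpoint, you can swap the AI backend of a LangChain pipeline with minimal code changes. Combined with a fine-tuned model, you get a pipeline that is faster on domain tasks, cheaper at scale, and private by default.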
LangChain + Ollama integration options
LangChain offers two integration paths for Ollama:
Option 1: ChatOllama (LangChain native)
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="your-fine-tuned-model",
    base_url="http://localhost:11434",
    temperature=0.3,
)

# Used the same way as ChatOpenAI
response = llm.invoke("Generate a listing for this property: ...")
print(response.content)
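Option 2: ChatOpenAI with the Ollama base URL (the preferred drop-in replacement)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="your-fine-tuned-model",
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required field, not validated by Ollama
    temperature=0.3,
)
# Same interface as the cloud-backed ChatOpenAI
Option 2 is the cleanest migration path: if you already have LangChain code that uses ChatOpenAI, the only changes are base_url and model (plus a placeholder api_key).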
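One way to keep both backends available is to build the ChatOpenAI arguments from configuration. The sketch below is only an illustration: the environment variable names (USE_LOCAL_LLM, LOCAL_MODEL, OLLAMA_BASE_URL, CLOUD_MODEL) and the helper chat_model_kwargs are hypothetical, not part of LangChain or Ollama.

```python
import os


def chat_model_kwargs(env=os.environ):
    """Return keyword arguments for ChatOpenAI based on environment settings.

    All variable names here are hypothetical; adapt them to your deployment.
    """
    if env.get("USE_LOCAL_LLM", "0") == "1":
        return {
            "model": env.get("LOCAL_MODEL", "your-fine-tuned-model"),
            "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
            "api_key": "ollama",  # placeholder value; Ollama does not check it
        }
    # Cloud default: no base_url override, so the client uses OpenAI's endpoint
    return {"model": env.get("CLOUD_MODEL", "gpt-4o")}


local_kwargs = chat_model_kwargs({"USE_LOCAL_LLM": "1"})
cloud_kwargs = chat_model_kwargs({})
print(local_kwargs)
print(cloud_kwargs)
```

With this in place, the model would be created as `ChatOpenAI(**chat_model_kwargs())`, and switching backends becomes a deployment setting rather than a code change.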
Common pipeline patterns
Pattern 1: Document processing chain
Before (OpenAI API, GPT-4o):
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o")  # $0.005 per 1K input tokens

template = PromptTemplate.from_template(
    "Classify this support ticket:\n{ticket}\nOutput: category, priority, suggested_response"
)
chain = LLMChain(llm=llm, prompt=template)
result = chain.invoke({"ticket": ticket_text})
After (fine-tuned local model):
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Only the LLM constructor changes
llm = ChatOpenAI(
    model="support-classifier-v3",  # your fine-tuned classifier
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# The chain definition is unchanged
template = PromptTemplate.from_template(
    "Classify this support ticket:\n{ticket}\nOutput: category, priority, suggested_response"
)
chain = LLMChain(llm=llm, prompt=template)
result = chain.invoke({"ticket": ticket_text})
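Processing 10,000 tickets (assuming roughly 1K input tokens per ticket and counting input tokens only):
- Before: 10,000 x $0.005 = $50 in API fees
- After: a $50/month VPS, amortized across everything you process on it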
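The comparison above is simple arithmetic, and it can be useful to rerun it with your own numbers. This is a rough sketch: the 1K tokens per ticket figure is an assumption implied by the $0.005 per ticket cost, output tokens are ignored, and the function names are made up for illustration.

```python
def api_cost(tickets, tokens_per_ticket=1_000, price_per_1k=0.005):
    """Input-token API cost in dollars for a batch of tickets."""
    return tickets * tokens_per_ticket / 1_000 * price_per_1k


def break_even_tickets(monthly_server_cost=50.0, tokens_per_ticket=1_000,
                       price_per_1k=0.005):
    """Monthly ticket volume at which a flat-rate server matches the API bill."""
    cost_per_ticket = tokens_per_ticket / 1_000 * price_per_1k
    return monthly_server_cost / cost_per_ticket


batch_cost = api_cost(10_000)
threshold = break_even_tickets()
print(f"API cost for 10,000 tickets: ${batch_cost:.2f}")
print(f"Break-even volume: {threshold:,.0f} tickets per month")
```

Under these assumptions the two options cost the same at 10,000 tickets per month; every ticket beyond that is effectively free on the flat-rate server, while the API bill keeps growing linearly.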
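Pattern 2: RAG pipeline with a fine-tuned reader
For retrieval-augmented generation (RAG), you usually want both the retrieval model (embeddings) and the reader model (answer generation) calibrated to your domain.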
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma

# Local embeddings via Ollama (nomic-embed-text works well)
embeddings = OpenAIEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    # Send raw text; the default path tokenizes with tiktoken and sends
    # token IDs, which non-OpenAI servers may not accept
    check_embedding_ctx_length=False,
)

# Fine-tuned reader model
reader_llm = ChatOpenAI(
    model="your-domain-reader-model",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Build the vector store from your documents
vectorstore = Chroma.from_documents(
    documents=your_docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# RAG chain: no cloud API calls at inference time
qa_chain = RetrievalQA.from_chain_type(
    llm=reader_llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)
answer = qa_chain.invoke({"query": "What is our return policy for sale items?"})
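Embedding and generation both run locally. The retrieval model understands your domain terminology. The reader model is fine-tuned on your domain's Q&A. No cloud API calls.
Pattern 3: LangGraph agent with a local tool executor
LangGraph (LangChain's agent framework) works with any LangChain-compatible LLM: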
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

# The agent uses a local model for orchestration
orchestrator = ChatOpenAI(
    model="your-orchestrator-model",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# The tool executor uses a dedicated fine-tuned model
domain_executor = ChatOpenAI(
    model="your-domain-model",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

def run_domain_task(state):
    task = state["current_task"]
    result = domain_executor.invoke(task)
    return {"result": result.content}

# Build the graph
graph = StateGraph(dict)
graph.add_node("domain_executor", run_domain_task)
# ... add orchestration logic
app = graph.compile()
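LangChain + Ollama performance tuning
Batching: For bulk work (classifying 5,000 tickets), use LangChain's batch methods to parallelize calls:
# Process all tickets, at most 10 at a time (run inside an async function)
results = await chain.abatch(
    [{"ticket": t} for t in tickets],
    config={"max_concurrency": 10},  # up to 10 concurrent Ollama calls
)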
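To make the effect of a concurrency cap concrete, here is a minimal standard-library sketch of the same idea: many tasks are submitted, but only a fixed number run at once. It is not LangChain's implementation; bounded_map and fake_classify are hypothetical names, and the sleep stands in for a model call.

```python
import asyncio


async def bounded_map(worker, items, max_concurrency=10):
    """Run worker(item) for every item, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:  # wait here while the cap is reached
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_classify(ticket):
    """Stand-in for a model call; a real worker would call the chain."""
    await asyncio.sleep(0.01)
    return f"classified:{ticket}"


results = asyncio.run(bounded_map(fake_classify, ["a", "b", "c"], max_concurrency=2))
print(results)
```

The cap matters with a local server because the model shares one machine's memory and compute; raising it beyond what the hardware can serve in parallel mostly adds queueing.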
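Caching: Enable LangChain's LLM cache to avoid redundant model calls. SQLiteCache is an exact-match cache, so only identical prompts are served from it:
from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))
# Identical prompts return the cached result, with no Ollama call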
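The behavior of a cache like this can be illustrated with a small standard-library sketch. It is a simplified stand-in, not LangChain's SQLiteCache: the ExactMatchCache class, its table layout, and cached_call are all hypothetical. The lookup key is the model name plus the exact prompt text, so a prompt that differs by even one character is a miss.

```python
import sqlite3


class ExactMatchCache:
    """Tiny prompt cache keyed on (model, prompt); illustrative only."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "model TEXT, prompt TEXT, response TEXT, "
            "PRIMARY KEY (model, prompt))"
        )

    def get(self, model, prompt):
        row = self.conn.execute(
            "SELECT response FROM cache WHERE model = ? AND prompt = ?",
            (model, prompt),
        ).fetchone()
        return row[0] if row else None

    def put(self, model, prompt, response):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (model, prompt, response),
        )
        self.conn.commit()


def cached_call(cache, model, prompt, call_model):
    """Return the cached response if present, otherwise call the model once."""
    hit = cache.get(model, prompt)
    if hit is not None:
        return hit
    response = call_model(prompt)
    cache.put(model, prompt, response)
    return response
```

Because the key includes the model name, responses from one fine-tuned version are not served for another. Exact-match caching pays off for repeated inputs (duplicate tickets, retried jobs) and does nothing for paraphrases.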
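Context length: 7B models typically support a 4K-8K context window, so check the limit for the model you deploy. For long documents, use LangChain's text splitters to chunk the text before passing it to the model.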
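The chunking idea can be sketched in a few lines of plain Python. This is a simplified stand-in for LangChain's splitters (such as RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence boundaries): it cuts by character count with a fixed overlap. The function name and the default sizes are assumptions chosen for illustration.

```python
def split_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks. Sizes here are in characters, not tokens, so leave headroom below the model's context limit for the prompt template and the generated answer.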
When to use fine-tuned local vs. cloud in a LangChain pipeline
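| Task | Fine-tuned local | Cloud (GPT-4) |
|---|---|---|
| Domain classification | Better and cheaper | Overkill |
| Domain generation | Better and cheaper | Overkill |
| Complex reasoning chains | Evaluate case by case | Better |
| Current events / web | Not applicable | Required |
| High-volume batch processing | Much cheaper | Expensive |
| One-off / low volume | Either | Either |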
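In a hybrid pipeline, the table can be turned into a small routing rule that picks a backend per task type. The sketch below is one possible encoding: the task labels, the ROUTES mapping, and pick_backend are hypothetical, and the choice of cloud for complex reasoning reflects the table's "Better" column rather than a hard rule.

```python
LOCAL = "local"
CLOUD = "cloud"

# One possible encoding of the table; labels are illustrative
ROUTES = {
    "domain_classification": LOCAL,
    "domain_generation": LOCAL,
    "high_volume_batch": LOCAL,
    "complex_reasoning": CLOUD,
    "current_events": CLOUD,
}


def pick_backend(task_type, default=LOCAL):
    """Return which backend to use; unknown or low-volume tasks use the default."""
    return ROUTES.get(task_type, default)


print(pick_backend("domain_classification"))
print(pick_backend("current_events"))
print(pick_backend("one_off"))
```

The returned label would then select between a local ChatOpenAI instance pointed at Ollama and a cloud-backed one, keeping the rest of the chain unchanged.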