LM Studio Server API + Ertas
Serve Ertas-trained models as local API endpoints using LM Studio's built-in server mode for application integration, development, and testing.
Overview
LM Studio is a desktop application for discovering, downloading, and running local language models. While it is widely known for its chat interface, LM Studio's server mode is equally powerful — it turns any loaded model into a fully functional OpenAI-compatible API server running on localhost. This local server exposes /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints that are drop-in compatible with the OpenAI SDK, making it trivial to switch any application from a cloud API to a local model.
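As a minimal sketch of the /v1/embeddings route, assuming an embedding-capable GGUF model is loaded in LM Studio (the model name below is illustrative):

import OpenAI from "openai";

// Point the standard OpenAI SDK at LM Studio's local server
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio", // any placeholder string works; no real key is required
});

// POST /v1/embeddings — "nomic-embed-text" is an illustrative model name;
// use whichever embedding model you have loaded
const { data } = await client.embeddings.create({
  model: "nomic-embed-text",
  input: ["termination clause", "payment schedule"],
});

console.log(data[0].embedding.length); // vector dimensionality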
LM Studio's server mode is particularly valuable for development and testing workflows. Instead of burning API credits while iterating on prompts and application logic, developers can run their fine-tuned model locally through LM Studio and test against the same API contract they will use in production. The server provides request logging, performance metrics, and GPU utilization monitoring — giving developers visibility into how their model performs under different load patterns and context lengths. For teams that need a user-friendly way to serve models locally without managing Docker containers or CLI tools, LM Studio Server provides a one-click solution.
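A common pattern is to select the backend through configuration, so the identical code path hits LM Studio in development and a hosted API in production. A sketch, with illustrative environment variable names:

import OpenAI from "openai";

// LLM_BASE_URL and LLM_API_KEY are illustrative names; default to the
// local LM Studio server when they are unset
const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL ?? "http://localhost:1234/v1",
  apiKey: process.env.LLM_API_KEY ?? "lm-studio",
});

Switching to a cloud endpoint then becomes a configuration change rather than a code change.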
How Ertas Integrates
After fine-tuning a model in Ertas Studio, you download the GGUF file and load it directly into LM Studio. From there, enabling server mode is a single toggle — LM Studio immediately starts serving the model on a configurable port with full OpenAI API compatibility. Any application, framework, or tool that supports the OpenAI API can connect to your Ertas-trained model without code changes beyond updating the base URL.
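To confirm the server is up and your model is being served, you can query the OpenAI-compatible model listing. A minimal sketch:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

// GET /v1/models — lists the models currently loaded in LM Studio,
// including your Ertas-trained GGUF
const models = await client.models.list();
for (const model of models.data) {
  console.log(model.id);
}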
This integration path is especially useful during the development phase of AI applications. Teams can fine-tune multiple model variants in Ertas Studio — different base models, different LoRA configurations, different quantization levels — and quickly switch between them in LM Studio to compare outputs. LM Studio's conversation view lets you test the model interactively while the server mode simultaneously serves it to your application. Once you have identified the best model configuration, you can deploy it to a production inference server like vLLM or Ertas Cloud while keeping LM Studio as your local development and debugging tool.
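For a quick side-by-side comparison, run the same prompt against each loaded variant. A sketch, assuming the identifiers below match what LM Studio reports for your models (the names are illustrative):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

// Illustrative identifiers for variants fine-tuned in Ertas Studio
const variants = ["ertas-legal-7b-q4_k_m", "ertas-legal-7b-q8_0"];
const prompt = "Summarize the indemnification clause in two sentences.";

for (const model of variants) {
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0, // near-deterministic output makes comparisons fairer
  });
  console.log(`--- ${model} ---\n${res.choices[0].message.content}\n`);
}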
Getting Started
1. Export your model from Ertas Studio
   Download the fine-tuned model in GGUF format from Ertas Studio. Choose the quantization level that balances quality and speed for your hardware.
2. Load the model in LM Studio
   Open LM Studio and load your GGUF file. Configure the context length, GPU layers, and other inference parameters in the model settings panel.
3. Enable server mode
   Toggle server mode in LM Studio's server tab. The API server starts on localhost:1234 by default, exposing OpenAI-compatible endpoints.
4. Connect your application
   Point your application to http://localhost:1234/v1 as the base URL. Use any OpenAI SDK or HTTP client; the API contract is identical to OpenAI's (see the example after this list).
5. Monitor and iterate
   Use LM Studio's built-in logging and metrics to monitor request latency, token throughput, and GPU utilization. Swap models without restarting the server to compare performance.
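The TypeScript example below shows step 4 in practice: the official OpenAI SDK pointed at the local server, calling an Ertas-trained model exactly as it would call a hosted one.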
import OpenAI from "openai";

// Connect to LM Studio's local server running your Ertas-trained model
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio", // LM Studio doesn't require a real key
});

async function analyzeContract(text: string) {
  const response = await client.chat.completions.create({
    model: "ertas-legal-7b",
    messages: [
      { role: "system", content: "You are a contract analyst. Extract key terms and obligations." },
      { role: "user", content: `Analyze this contract clause:\n\n${text}` },
    ],
    temperature: 0.1,
    max_tokens: 1024,
  });
  return response.choices[0].message.content;
}

// Works identically to calling OpenAI's API
const analysis = await analyzeContract("The Licensee shall pay...");
console.log(analysis);

Benefits
- One-click server mode with zero CLI or Docker configuration
- Full OpenAI API compatibility for seamless application integration, including streaming (see the sketch below)
- Built-in request logging and performance metrics for debugging
- Hot-swap models without restarting the server during development
- GPU layer offloading controls for optimal performance on any hardware
- Interactive chat and API server running simultaneously for testing
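Streaming works over the same compatible endpoint, which is handy for testing perceived latency locally before deploying. A minimal sketch, reusing the example model name from above:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

// stream: true returns tokens incrementally over the same endpoint
const stream = await client.chat.completions.create({
  model: "ertas-legal-7b",
  messages: [{ role: "user", content: "Summarize the key obligations in a standard NDA." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}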
Related Resources
- Topics: Fine-Tuning, GGUF, Inference, LoRA, Quantization
- Guides: Running AI Models Locally: The Complete Guide to Local LLM Inference; Getting Started with Ertas: Fine-Tune and Deploy Custom AI Models; Privacy-Conscious AI Development: Fine-Tune in the Cloud, Run on Your Terms; Self-Hosted AI for Indie Apps: Replace GPT-4 with Your Own Model; How to Fine-Tune an LLM: The Complete 2026 Guide
- Local runtimes: Jan, llama.cpp, LM Studio, Ollama, vLLM
- Use cases: Ertas for SaaS Product Teams, Ertas for Legal, Ertas for Code Generation, Ertas for Indie Developers & Vibe-Coded Apps