ShareGPT Format Guide
Multi-turn conversation format for chat model training
The ShareGPT format is a JSON/JSONL conversation dataset structure that captures multi-turn dialogues between users and AI assistants. Named after the ShareGPT website that enabled users to share ChatGPT conversations, the format represents each conversation as an object containing a list of messages, where each message has a "from" field (indicating the speaker) and a "value" field (containing the message content). This turn-based structure naturally captures the flow of real conversations, making it ideal for training chat models that need to handle multi-turn context.
The standard ShareGPT format uses the speaker identifiers "human" and "gpt" for user and assistant messages respectively, with an optional "system" role for system prompts. Some variants use "user" and "assistant" instead. Each conversation is a self-contained dialogue that begins with a human message and typically alternates between human and assistant turns. The format supports conversations of arbitrary length, from simple single-turn Q&A to extended multi-turn dialogues with dozens of exchanges.
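These structural conventions are easy to check mechanically. The sketch below validates one record against them (optional leading "system" turn, a "human" opening, strict human/gpt alternation). It is a minimal illustration; the function name and error messages are ours, not part of any specification:

```python
def validate_sharegpt(conversation: dict) -> list[str]:
    """Check one ShareGPT record against the usual structural conventions."""
    errors = []
    turns = conversation.get("conversations", [])
    if not turns:
        return ["conversation has no turns"]

    # An optional system prompt conventionally appears only as the first turn.
    if turns[0].get("from") == "system":
        turns = turns[1:]

    if not turns or turns[0].get("from") != "human":
        errors.append("dialogue should open with a 'human' turn")

    # Remaining turns are expected to alternate strictly between "human" and "gpt".
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        speaker = turn.get("from")
        if speaker != expected:
            errors.append(f"turn {i}: expected '{expected}', got '{speaker}'")
        if not isinstance(turn.get("value"), str):
            errors.append(f"turn {i}: missing or non-string 'value'")
    return errors
```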
ShareGPT format has been widely adopted by the open-source LLM community, particularly through datasets like ShareGPT-Vicuna, OpenAssistant Conversations, and WizardLM training data. Training frameworks including Axolotl, LLaMA-Factory, FastChat, and the Hugging Face TRL library all support ShareGPT as a first-class input format. The format's popularity stems from its natural alignment with how users interact with chat models — it captures not just individual responses but the conversational dynamics that chat models need to learn.
When to Use ShareGPT Format
Use ShareGPT format when your training data consists of multi-turn conversations and you want to train a chat model that handles context across multiple exchanges. It is the natural choice for training data sourced from real conversational interactions, customer support transcripts, tutoring sessions, or any dialogue where the assistant's response depends on the preceding conversation history. If your model needs to reference earlier parts of the conversation when generating responses, ShareGPT format captures this context explicitly.
Choose ShareGPT over Alpaca format when your training examples involve multiple turns of conversation. A single ShareGPT example can contain an entire conversation thread, while Alpaca is limited to single instruction-response pairs. ShareGPT also supports system prompts within the conversation structure, enabling you to train models with different personas or behavioral guidelines embedded in the training data.
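To make the relationship concrete, a single Alpaca record maps onto a one-exchange ShareGPT conversation. A minimal sketch, assuming the common Alpaca field names instruction, input, and output:

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Wrap one Alpaca instruction-response pair as a one-turn ShareGPT conversation."""
    # Alpaca's optional "input" field is commonly appended to the instruction.
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```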
ShareGPT format is less suitable when your training data is purely single-turn instruction following (use Alpaca format for simplicity), when you need standardized chat template formatting (use ChatML, which has a more rigid structure), or when your training framework specifically requires a different conversation format. Note that conversion between ShareGPT and other conversation formats is straightforward with most training frameworks.
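As a rough illustration of such a conversion, the sketch below maps the classic "from"/"value" keys onto the role/content variant shown in the schema that follows; mapping human to user and gpt to assistant is the common convention rather than a mandated rule:

```python
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(conversation: dict) -> dict:
    """Convert classic 'from'/'value' turns into the role/content variant."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in conversation["conversations"]
        ]
    }
```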
Schema / Structure
// ShareGPT format schema
interface ShareGPTMessage {
  from: "system" | "human" | "gpt"; // Speaker role
  value: string;                    // Message content
}

interface ShareGPTConversation {
  conversations: ShareGPTMessage[]; // Ordered list of turns
  id?: string;                      // Optional conversation ID
  source?: string;                  // Optional data source
}

// Alternative variant used by some frameworks
interface ShareGPTVariant {
  messages: {
    role: "system" | "user" | "assistant";
    content: string;
  }[];
}

Example Data
[
  {
    "id": "conv_001",
    "conversations": [
      {"from": "system", "value": "You are a helpful Python programming tutor. Explain concepts clearly with examples."},
      {"from": "human", "value": "What's the difference between a list and a tuple in Python?"},
      {"from": "gpt", "value": "Lists and tuples are both sequence types in Python, but they have key differences:\n\n**Lists** are mutable — you can add, remove, or change elements after creation:\n```python\nfruits = ['apple', 'banana']\nfruits.append('cherry') # Works fine\nfruits[0] = 'avocado' # Also works\n```\n\n**Tuples** are immutable — once created, they cannot be modified:\n```python\ncoords = (10, 20)\ncoords[0] = 15 # TypeError!\n```\n\nUse lists when your data needs to change, and tuples when it should stay constant."},
      {"from": "human", "value": "When would I actually use a tuple over a list?"},
      {"from": "gpt", "value": "Great question! Here are the main scenarios for tuples:\n\n1. **Dictionary keys**: Tuples can be dictionary keys, lists cannot\n```python\nlocations = {(40.7, -74.0): 'NYC', (51.5, -0.1): 'London'}\n```\n\n2. **Function return values**: Returning multiple values\n```python\ndef get_dimensions():\n return (1920, 1080) # width, height\n```\n\n3. **Data integrity**: When you want to guarantee data won't be accidentally modified\n\n4. **Performance**: Tuples are slightly faster and use less memory than lists\n\nA good rule of thumb: if the data represents a fixed collection of related items (like x,y coordinates), use a tuple. If it's a collection that may grow or change, use a list."}
    ]
  }
]

Ertas Support
Ertas Data Suite fully supports ShareGPT format for conversational training data preparation. You can import ShareGPT JSON/JSONL datasets, apply PII redaction across all conversation turns while maintaining dialogue coherence, validate conversation structure (ensuring proper turn alternation and required fields), and export processed datasets for fine-tuning. The platform preserves conversation threading during all transformations, ensuring that multi-turn context relationships remain intact.
Ertas Studio accepts ShareGPT-format datasets for chat model fine-tuning, automatically applying the appropriate chat template for your target model architecture during tokenization.
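For context on what applying a chat template involves, here is a minimal sketch using the Hugging Face transformers tokenizer API on a converted conversation. This illustrates the general mechanism, not Ertas Studio's internal pipeline, and the checkpoint name is only an example:

```python
from transformers import AutoTokenizer

# Any chat model that ships a template works; this checkpoint is an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "user", "content": "What's the difference between a list and a tuple?"},
    {"role": "assistant", "content": "Lists are mutable; tuples are immutable."},
]

# Renders the turns into the model-specific prompt string used at training time.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```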