CoNLL Format Guide
Column-based annotation format for NER and POS tagging
CoNLL format, named after the Conference on Computational Natural Language Learning, is a family of column-based text annotation formats used primarily for sequence labeling tasks in NLP, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, syntactic chunking, and dependency parsing. The format puts one token on each line, with tab- or space-separated columns, and uses blank lines to mark sentence boundaries. Different CoNLL shared tasks introduced slightly different column schemas, with CoNLL-2003 (for NER) and CoNLL-U (for Universal Dependencies) being the most widely used variants today.
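The one-token-per-line layout with blank-line sentence breaks can be read with a few lines of code. The following is a minimal sketch, assuming whitespace-separated columns with no spaces inside a field; the function name read_sentences is illustrative and does not come from any library.

```python
from typing import Iterator, List


def read_sentences(text: str) -> Iterator[List[List[str]]]:
    """Yield one sentence at a time as a list of column lists."""
    sentence: List[List[str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # split() with no argument handles both tabs and runs of spaces.
        sentence.append(line.split())
    if sentence:
        # The input may not end with a blank line.
        yield sentence


sample = "John NNP B-NP B-PER\nSmith NNP I-NP I-PER\n\nHe PRP B-NP O\n"
sentences = list(read_sentences(sample))
```

Here sentences holds two sentences, the first with two tokens and the second with one. Formats that allow spaces inside a field would need a strict tab split instead.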
In the CoNLL-2003 NER format, each line contains four columns: the token (word), POS tag, syntactic chunk tag, and named entity tag. Entity tags are usually written in the IOB2 scheme, also called BIO (Beginning-Inside-Outside): B-PER marks the beginning of a person entity, I-PER continues a person entity, and O marks tokens outside any entity. The original shared-task files used the older IOB1 variant, which writes B- only to separate adjacent entities of the same type, but many current distributions are converted to IOB2. Other common entity types include ORG (organization), LOC (location), and MISC (miscellaneous). The B/I distinction is what makes multi-token entities recoverable: in "New York City" the first token gets B-LOC and the subsequent tokens get I-LOC.
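To make the tagging scheme concrete, here is a minimal sketch that groups IOB2 tags back into entity spans. The function name bio_to_spans and the lenient handling of a stray I- tag (treated as the start of a new entity) are choices made for this illustration, not part of any specification.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) triples; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # Continue only when an I- tag matches the open entity's type.
        if prefix == "I" and current == label:
            continue
        # Anything else closes the open entity, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # B- opens a new entity. A stray I- is also treated as an
        # opening, which is a lenient choice made for this sketch.
        if prefix in ("B", "I"):
            start, current = i, label
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["John", "Smith", "works", "at", "Google", "in", "Mountain", "View"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
entities = [(label, " ".join(tokens[s:e])) for label, s, e in bio_to_spans(tags)]
```

With the tags above, entities pairs PER with "John Smith", ORG with "Google", and LOC with "Mountain View".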
CoNLL-U, the format for Universal Dependencies treebanks, extends the column structure to ten tab-separated fields: ID, FORM (word), LEMMA, UPOS (universal POS), XPOS (language-specific POS), FEATS (morphological features), HEAD (dependency head), DEPREL (dependency relation), DEPS (enhanced dependencies), and MISC. An underscore (_) marks an empty field. Each sentence can be preceded by comment lines prefixed by # containing metadata such as the sentence ID and the original untokenized text. This format has become the standard for multilingual NLP annotation and is used by over 200 treebanks across 100+ languages.
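As an illustration of the ten-field layout, the sketch below parses one CoNLL-U sentence into a dictionary of comments and a list of token dictionaries. It assumes tab-separated fields and "# key = value" comments; the names FIELDS and parse_conllu_sentence are made up for this example, and dedicated parsing libraries cover more cases (multiword tokens, empty nodes) than this does.

```python
from typing import Any, Dict, List

FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(block: str) -> Dict[str, Any]:
    """Parse one sentence block into comments and token dicts."""
    comments: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            # Metadata comments look like "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 tab-separated fields: {line!r}")
        tokens.append(dict(zip(FIELDS, columns)))
    return {"comments": comments, "tokens": tokens}


# Rows are written with spaces for readability, then joined with tabs.
rows = [
    "1 The the DET DT Definite=Def 2 det _ _",
    "2 cat cat NOUN NN Number=Sing 3 nsubj _ _",
    "3 sat sit VERB VBD Tense=Past 0 root _ _",
]
block = "# sent_id = 1\n# text = The cat sat\n" + "\n".join(
    "\t".join(row.split(" ")) for row in rows
)
parsed = parse_conllu_sentence(block)
root = next(t for t in parsed["tokens"] if t["head"] == "0")
```

The token whose HEAD is 0 is the root of the dependency tree, so root["form"] is "sat" in this shortened sentence.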
When to Use CoNLL
Use CoNLL format when training sequence labeling models for NER, POS tagging, chunking, or dependency parsing. Most NLP frameworks, including spaCy, Flair, Hugging Face Transformers (via the datasets library), and Stanford NLP, can work with CoNLL-formatted data, either directly or after a conversion step. If you are training a token classification model, CoNLL is likely the expected input format. It is also the standard format for NER evaluation benchmarks and shared tasks.
Choose CoNLL format when your annotation task requires token-level labels that align with whitespace-tokenized text. The one-token-per-line structure makes it easy to calculate inter-annotator agreement at the token level, identify annotation errors through visual inspection, and apply simple text-processing scripts for data analysis. CoNLL is also the natural choice when your annotation workflow produces output from tools like BRAT, Prodigy, or Label Studio that support CoNLL export.
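Token-level agreement is straightforward to compute because two annotators' files line up row by row. The sketch below computes Cohen's kappa over two tag sequences; it is one common agreement measure, chosen here for illustration, and the function name cohen_kappa is an assumption of this example.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two equally long sequences of token tags."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty tag sequences of equal length")
    n = len(a)
    # Fraction of tokens where both annotators chose the same tag.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each annotator's tag frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[t] * counts_b[t] for t in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
annotator_2 = ["B-PER", "I-PER", "O", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Because most tokens in NER data are tagged O, token-level scores tend to look high; comparing entity spans as well gives a fuller picture.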
CoNLL format is less suitable for tasks that require character-level or span-level annotations with arbitrary boundaries (use standoff annotation formats instead), for documents where sentence boundaries are ambiguous or irrelevant, or for tasks that combine token labels with document-level metadata or cross-sentence relations. For very large datasets, the verbose one-token-per-line format results in larger file sizes compared to JSON-based formats that represent annotations as spans.
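For comparison, a span-based (standoff-style) record stores the text once and lists entities by character offset. The sketch below converts tokens plus token-level spans into such a record, assuming tokens are joined by single spaces; the function name to_standoff and the output keys are hypothetical and do not follow any particular tool's schema.

```python
from typing import Any, Dict, List, Tuple


def to_standoff(tokens: List[str],
                spans: List[Tuple[str, int, int]]) -> Dict[str, Any]:
    """Join tokens with single spaces and map token spans to char offsets."""
    offsets = []
    position = 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1  # +1 for the joining space
    text = " ".join(tokens)
    # Each span is (label, first_token, last_token_exclusive).
    entities = [
        {"label": label, "start": offsets[s][0], "end": offsets[e - 1][1]}
        for label, s, e in spans
    ]
    return {"text": text, "entities": entities}


doc = to_standoff(["John", "Smith", "works", "at", "Google"],
                  [("PER", 0, 2), ("ORG", 4, 5)])
```

Slicing doc["text"] with an entity's start and end offsets recovers the entity string, for example "John Smith" for the PER span.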
Schema / Structure
CoNLL-2003 NER Format (4 columns):
<token> <POS> <chunk> <NER_tag>
Tagging scheme: IOB2 (BIO)
B-XXX = Beginning of entity type XXX
I-XXX = Inside (continuation) of entity type XXX
O = Outside any entity
Common entity types:
PER = Person, ORG = Organization
LOC = Location, MISC = Miscellaneous
CoNLL-U Format (10 columns):
<ID> <FORM> <LEMMA> <UPOS> <XPOS> <FEATS> <HEAD> <DEPREL> <DEPS> <MISC>
Sentence boundaries: blank lines
Comment lines: start with #
Example Data
# CoNLL-2003 NER example
John NNP B-NP B-PER
Smith NNP I-NP I-PER
works VBZ B-VP O
at IN B-PP O
Google NNP B-NP B-ORG
in IN B-PP O
Mountain NNP B-NP B-LOC
View NNP I-NP I-LOC
, , O O
California NNP B-NP B-LOC
. . O O

He PRP B-NP O
joined VBD B-VP O
in IN B-PP O
2019 CD B-NP O
. . O O
# CoNLL-U example
# sent_id = 1
# text = The cat sat on the mat.
1 The the DET DT Definite=Def 2 det _ _
2 cat cat NOUN NN Number=Sing 3 nsubj _ _
3 sat sit VERB VBD Tense=Past 0 root _ _
4 on on ADP IN _ 6 case _ _
5 the the DET DT Definite=Def 6 det _ _
6 mat mat NOUN NN Number=Sing 3 obl _ SpaceAfter=No
7 . . PUNCT . _ 3 punct _ _
Ertas Support
Ertas Data Suite supports CoNLL format import and export for NER and sequence labeling training data. You can import CoNLL-annotated datasets, apply PII redaction at the entity level (automatically updating BIO tags when entities are masked), validate tag consistency (checking for I-tags without preceding B-tags), and export cleaned datasets in CoNLL format ready for model training. The data lineage system tracks annotations through the complete preparation pipeline.
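The tag-consistency check mentioned above can be pictured with a short sketch. This is a generic illustration of flagging I- tags that do not continue an entity, not Ertas's implementation; the function name find_orphan_i_tags is hypothetical.

```python
from typing import List, Tuple


def find_orphan_i_tags(tags: List[str]) -> List[Tuple[int, str]]:
    """Return (index, tag) for every I- tag that does not continue an entity."""
    problems = []
    previous = "O"  # a sentence starts outside any entity
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            label = tag[2:]
            # Valid only directly after B-label or I-label of the same type.
            if previous not in (f"B-{label}", f"I-{label}"):
                problems.append((i, tag))
        previous = tag
    return problems


tags = ["B-PER", "I-PER", "O", "I-ORG", "B-LOC", "I-PER"]
problems = find_orphan_i_tags(tags)
```

In this example the I-ORG at index 3 follows an O tag and the I-PER at index 5 follows a B-LOC, so both are reported. The check should run per sentence, since an entity cannot continue across a blank line.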