CoNLL Format Guide

Column-based annotation format for NER and POS tagging

Specification

CoNLL (Conference on Natural Language Learning) format is a family of column-based text annotation formats used primarily for sequence labeling tasks in NLP, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, syntactic chunking, and dependency parsing. Each line holds one token, with its annotations in tab- or space-separated columns, and a blank line marks the end of each sentence. The CoNLL shared tasks each introduced a slightly different column schema; CoNLL-2003 (for NER) and CoNLL-U (for Universal Dependencies) are the most widely used variants today.

In the CoNLL-2003 NER format, each line contains four columns: the token (word), POS tag, syntactic chunk tag, and named entity tag. Entity tags are usually written in the IOB2 scheme, also called BIO (Beginning, Inside, Outside): B-PER marks the first token of a person entity, I-PER marks each following token of the same entity, and O marks tokens outside any entity. (The original shared-task files used the older IOB1 variant, in which a B- tag appears only when two entities of the same type are adjacent; many redistributed copies and tools use IOB2 instead.) Other common entity types include ORG (organization), LOC (location), and MISC (miscellaneous). The B-/I- distinction is what makes multi-token entities recoverable: in "New York City", the first token gets B-LOC and the two following tokens get I-LOC.
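To make the tagging scheme concrete, here is a minimal sketch of how IOB2 tags can be turned back into entity spans. The function name `bio_to_spans` and the lenient handling of a stray I- tag (it is treated as the start of a new entity) are illustrative assumptions, not part of any CoNLL specification.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end_exclusive) spans from IOB2 tags.

    Hypothetical helper for illustration only.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # An open entity continues only on an I- tag of the same type.
        continues = prefix == "I" and current == etype
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # Assumption: a stray I- tag opens a new entity instead of failing.
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["New", "York", "City", "is", "large"]
tags = ["B-LOC", "I-LOC", "I-LOC", "O", "O"]
for etype, start, end in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

Spans are returned as token indices with an exclusive end, so `tokens[start:end]` yields the entity text directly.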

CoNLL-U, the format for Universal Dependencies treebanks, extends the column structure to ten tab-separated fields: ID, FORM (word), LEMMA, UPOS (universal POS), XPOS (language-specific POS), FEATS (morphological features), HEAD (dependency head), DEPREL (dependency relation), DEPS (enhanced dependencies), and MISC. An underscore (_) marks an empty field. Each sentence can be preceded by comment lines prefixed by # that carry metadata such as the sentence ID and the original untokenized text. This format has become the standard for multilingual NLP annotation and is used by over 200 treebanks across 100+ languages.
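The ten-field layout maps naturally onto one dictionary per token. The sketch below is a minimal standard-library reader, assuming tab-separated fields and `# key = value` comment lines; the names `CONLLU_FIELDS` and `parse_conllu` are hypothetical, and all values (including ID and HEAD) are kept as strings so that multiword-token ranges such as `1-2` pass through unchanged.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences.

    Each sentence is a dict with 'comments' (metadata from # lines)
    and 'tokens' (one dict per token line, keyed by CONLLU_FIELDS).
    """
    sentences = []
    comments, tokens = {}, []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = {}, []
        elif line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:  # keep only "key = value" style comments
                comments[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 tab-separated fields: {line!r}")
            tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    if tokens:  # file may not end with a blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences
```

For production use, an established CoNLL-U library is likely a better choice than a hand-rolled reader; this sketch only illustrates the structure.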

When to Use CoNLL

Use CoNLL format when training sequence labeling models for NER, POS tagging, chunking, or dependency parsing. Most NLP frameworks, including spaCy, Flair, Hugging Face Transformers (via the datasets library), and Stanford NLP, accept CoNLL-formatted input. If you are training a token classification model, CoNLL is likely the expected input format. It is also the standard format for NER evaluation benchmarks and shared tasks.

Choose CoNLL format when your annotation task requires token-level labels that align with whitespace-tokenized text. The one-token-per-line structure makes it easy to calculate inter-annotator agreement at the token level, identify annotation errors through visual inspection, and apply simple text-processing scripts for data analysis. CoNLL is also the natural choice when your annotation workflow produces output from tools like BRAT, Prodigy, or Label Studio whose output can be exported or converted to CoNLL.
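As an illustration of token-level agreement, the sketch below computes Cohen's kappa between two annotators' tag sequences for the same tokens. The function name `cohen_kappa` is hypothetical, and the choice of kappa is an assumption; the text above does not prescribe a particular agreement measure.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two equal-length tag sequences over the same tokens."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("sequences must be non-empty and the same length")
    n = len(tags_a)
    # Observed agreement: fraction of tokens given the same tag.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected agreement by chance, from each annotator's tag distribution.
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["B-PER", "I-PER", "O", "O"]
annotator_2 = ["B-PER", "O", "O", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Because most tokens in NER data are tagged O, raw percentage agreement tends to look high; a chance-corrected measure like kappa is one way to account for that.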

CoNLL format is less suitable for tasks that require character-level or span-level annotations with arbitrary boundaries (use standoff annotation formats instead), for documents where sentence boundaries are ambiguous or irrelevant, or for tasks that combine token labels with document-level metadata or cross-sentence relations. For very large datasets, the verbose one-token-per-line format results in larger file sizes compared to JSON-based formats that represent annotations as spans.

Schema / Structure

CoNLL-2003 NER Format (4 columns):
<token> <POS> <chunk> <NER_tag>

Tagging scheme: IOB2 (BIO)
  B-XXX  = Beginning of entity type XXX
  I-XXX  = Inside (continuation) of entity type XXX
  O      = Outside any entity

Common entity types:
  PER = Person, ORG = Organization
  LOC = Location, MISC = Miscellaneous

CoNLL-U Format (10 columns):
<ID> <FORM> <LEMMA> <UPOS> <XPOS> <FEATS> <HEAD> <DEPREL> <DEPS> <MISC>

Sentence boundaries: blank lines
Comment lines: start with #

CoNLL-2003 and CoNLL-U format specifications with column definitions and tagging scheme
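The blank-line sentence boundary is straightforward to implement. Below is a minimal sketch of a reader for the four-column variant, assuming whitespace-separated columns. The function name `read_sentences` is hypothetical. Lines starting with `-DOCSTART-` are skipped because the original CoNLL-2003 files use them as document markers; comment handling is left out on purpose, since `#` can be a legitimate token in four-column data.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of rows; each row is a list of column values."""
    sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("-DOCSTART-"):
            continue  # document marker, not a token
        sentence.append(line.split())
    if sentence:  # input may not end with a blank line
        yield sentence


sample = """John NNP B-NP B-PER
Smith NNP I-NP I-PER
works VBZ B-VP O

He PRP B-NP O
joined VBD B-VP O
"""
for sent in read_sentences(sample.splitlines()):
    print([row[0] for row in sent], [row[-1] for row in sent])
```

The same function accepts an open file object, since iterating over a file yields its lines.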

Example Data

# CoNLL-2003 NER example
John NNP B-NP B-PER
Smith NNP I-NP I-PER
works VBZ B-VP O
at IN B-PP O
Google NNP B-NP B-ORG
in IN B-PP O
Mountain NNP B-NP B-LOC
View NNP I-NP I-LOC
, , O O
California NNP B-NP B-LOC
. . O O

He PRP B-NP O
joined VBD B-VP O
in IN B-PP O
2019 CD B-NP O
. . O O

# CoNLL-U example
# sent_id = 1
# text = The cat sat on the mat.
1	The	the	DET	DT	Definite=Def	2	det	_	_
2	cat	cat	NOUN	NN	Number=Sing	3	nsubj	_	_
3	sat	sit	VERB	VBD	Tense=Past	0	root	_	_
4	on	on	ADP	IN	_	6	case	_	_
5	the	the	DET	DT	Definite=Def	6	det	_	_
6	mat	mat	NOUN	NN	Number=Sing	3	obl	_	SpaceAfter=No
7	.	.	PUNCT	.	_	3	punct	_	_

CoNLL-2003 NER annotation and CoNLL-U dependency parsing annotation examples

Ertas Support

Ertas Data Suite supports CoNLL format import and export for NER and sequence labeling training data. You can import CoNLL-annotated datasets, apply PII redaction at the entity level (automatically updating BIO tags when entities are masked), validate tag consistency (checking for I-tags without preceding B-tags), and export cleaned datasets in CoNLL format ready for model training. The data lineage system tracks annotations through the complete preparation pipeline.
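To show what a tag consistency check of this kind involves, here is a generic sketch that flags I- tags with no preceding B- or I- tag of the same type. It is not Ertas Data Suite's implementation or API; the function name `find_orphan_i_tags` is hypothetical.

```python
def find_orphan_i_tags(tags):
    """Return indices of I- tags not preceded by a B-/I- tag of the same type."""
    problems = []
    prev = "O"  # treat the sentence start as outside any entity
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev not in (f"B-{etype}", f"I-{etype}"):
                problems.append(i)
        prev = tag
    return problems


tags = ["O", "I-PER", "B-ORG", "I-ORG", "I-LOC"]
print(find_orphan_i_tags(tags))
```

Only the first tag of an invalid run is reported, because the tags that follow it are preceded by an I- tag of the same type. The check should be run per sentence, since an entity cannot continue across a sentence boundary.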
