The IOB format (short for inside, outside, beginning), also commonly referred to as the BIO format, is a common format for tagging tokens in a chunking task in computational linguistics (e.g. named-entity recognition).[1] It was presented by Ramshaw and Marcus in their 1995 paper "Text Chunking using Transformation-Based Learning".[2] The I- prefix before a tag indicates that the token is inside a chunk. An O tag indicates that the token belongs to no chunk. The B- prefix before a tag indicates that the token is the beginning of a chunk that immediately follows another chunk of the same type without O tags between them. It is used only in that case: when a chunk comes after an O tag, the first token of the chunk takes the I- prefix.
Another similar, widely used format is IOB2, which is the same as the IOB format except that the B- prefix is used at the beginning of every chunk (i.e. all chunks start with a B- tag).
A readable introduction to entity tagging is given in Bob Carpenter's blog post, "Coding Chunkers as Taggers".[3]
An example with IOB format:
Alex I-PER
is O
going O
to O
Los I-LOC
Angeles I-LOC
in O
California I-LOC
Notice how "Alex", "Los" and "California", although first tokens of their chunk, have the "I-" prefix.
The same example after filtering out the stop words "is", "to" and "in":
Alex I-PER
going O
Los I-LOC
Angeles I-LOC
California B-LOC
Notice how "California" now has the "B-" prefix, because it immediately follows another LOC chunk.
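The rule for when the original IOB format uses the "B-" prefix can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `encode_iob1` and the representation of chunks as `(start, end, type)` spans with an exclusive end are assumptions made for the example.

```python
from typing import List, Tuple

# A chunk is (start, end, type), with `end` exclusive.
# This span representation is an assumption made for this sketch.
Chunk = Tuple[int, int, str]


def encode_iob1(n_tokens: int, chunks: List[Chunk]) -> List[str]:
    """Tag n_tokens tokens using the original IOB convention."""
    tags = ["O"] * n_tokens
    prev_end, prev_type = None, None
    for start, end, ctype in sorted(chunks):
        # B- is used only when this chunk directly follows
        # another chunk of the same type, with no O tag between them.
        adjacent_same = prev_end == start and prev_type == ctype
        tags[start] = ("B-" if adjacent_same else "I-") + ctype
        for i in range(start + 1, end):
            tags[i] = "I-" + ctype
        prev_end, prev_type = end, ctype
    return tags


# Alex is going to Los Angeles in California
full = encode_iob1(8, [(0, 1, "PER"), (4, 6, "LOC"), (7, 8, "LOC")])

# Alex going Los Angeles California (stop words removed)
filtered = encode_iob1(5, [(0, 1, "PER"), (2, 4, "LOC"), (4, 5, "LOC")])
```

With the spans above, `full` reproduces the first example and `filtered` reproduces the second, where "California" becomes B-LOC only because the O-tagged "in" no longer separates it from "Los Angeles".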
The same example with IOB2 format (with tagging unaffected by stop word filtering):
Alex B-PER
is O
going O
to O
Los B-LOC
Angeles I-LOC
in O
California B-LOC
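Because IOB2 differs from IOB only in where the "B-" prefix appears, one scheme can be converted to the other mechanically. The following Python sketch shows one way to do so; the function name `iob1_to_iob2` is hypothetical, and the code assumes every non-O tag has the form `PREFIX-TYPE`.

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Convert original IOB tags to IOB2 (every chunk starts with B-)."""
    out = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            out.append(tag)
        else:
            prefix, ctype = tag.split("-", 1)
            # A chunk starts at an explicit B-, after an O tag,
            # or after a chunk of a different type.
            starts_chunk = (
                prefix == "B"
                or prev == "O"
                or prev.split("-", 1)[1] != ctype
            )
            out.append(("B-" if starts_chunk else "I-") + ctype)
        prev = tag
    return out


iob1 = ["I-PER", "O", "O", "O", "I-LOC", "I-LOC", "O", "I-LOC"]
iob2 = iob1_to_iob2(iob1)
```

Applied to the IOB tags of the full sentence, this yields the IOB2 tags listed above. Applied to the stop-word-filtered version, "California" keeps its B-LOC tag while "Alex" and "Los" change from "I-" to "B-".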
Related tagging schemes sometimes include "START/END: This consists of the tags B, E, I, S or O where S is used to represent a chunk containing a single token. Chunks of length greater than or equal to two always start with the B tag and end with the E tag."[4]
Other tagging schemes include BIOES and BILOU, where "E" (ending) or "L" (last) marks the final token of a chunk, and "S" (single) or "U" (unit) marks a chunk consisting of a single token.
An example with BIOES format:
Alex S-PER
is O
going O
with O
Marty B-PER
A. I-PER
Rick E-PER
to O
Los B-LOC
Angeles E-LOC
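The BIOES tags can be derived from IOB2 tags by looking one token ahead to see whether the current chunk continues. The Python sketch below illustrates this under the assumption that the input is well-formed IOB2; the function name `iob2_to_bioes` is hypothetical.

```python
from typing import List


def iob2_to_bioes(tags: List[str]) -> List[str]:
    """Convert well-formed IOB2 tags to BIOES tags."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, ctype = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # In IOB2 the chunk continues only if the next tag is I- of the same type.
        continues = next_tag == "I-" + ctype
        if prefix == "B":
            # A chunk that starts and ends on the same token is a single (S-).
            out.append(("B-" if continues else "S-") + ctype)
        else:
            # An inside token that is not followed by another one is the end (E-).
            out.append(("I-" if continues else "E-") + ctype)
    return out


# Alex is going with Marty A. Rick to Los Angeles
iob2 = ["B-PER", "O", "O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
bioes = iob2_to_bioes(iob2)
```

For the sentence above, the result matches the BIOES example: "Alex" becomes S-PER, "Rick" becomes E-PER and "Angeles" becomes E-LOC.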
Drawbacks
IOB syntax does not permit any nesting, so (unless extended) it cannot represent even very simple phenomena such as sentence boundaries (which are not trivial to locate reliably), the scope of parenthetical expressions in sentences, grammatical structures, or nested named entities such as "University of Wisconsin Dept. of Computer Science". It also leaves no place for metadata such as an identifier for the particular sample or the confidence level of the NER assignment, which are commonplace in NLP systems.
Because of these limitations, data must often be converted out of IOB format, or projects must create custom extensions, which has led to a large number of not-quite-interoperable "IOB-like" formats. Many extended variations will also "pass" a non-extended parser, so it is easy to process such data incorrectly without noticing.
The space and "O" (meaning "not in any chunk") convey no information and could simply be omitted. The same is true for putting the "type" suffix on "I-" or "E-" markers as in some variants of "BIOES"; and for marking both "I" and "E" (if you have begun and not ended you are "in", and if you are "in", you have begun and not ended). Some other formats deploy verbosity to improve readability and/or error-checking, but no such benefits appear to come to IOB in exchange for its verbosity.
IOB's "one token per line" layout depends on the tokenization used, even though tokenization is not standardized in NLP and its details need not be entangled with the representation of named entities. "11/31/2019" could be anywhere from one to five tokens in different systems, but the named entity is the same. Some systems even permit whitespace within tokens, and space as a delimiter collides with this, narrowing the applicability of IOB and motivating more extensions. "Space" might or might not include tab, multiple spaces, hard spaces, and so on, differences which are difficult to detect when proofreading.
IOB variants that allow multiple tokens per line often use "/" or another reserved character to separate the label from the token. This effectively "reserves" that character, which then cannot occur in tokens (or must be escaped, introducing more incompatibilities).
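The collision between a reserved separator and token content can be illustrated with a short Python sketch. The input line, the DATE label and the function name `parse_slash_tagged` are all hypothetical, and the parser relies on the assumption that labels never contain "/".

```python
from typing import List, Tuple


def parse_slash_tagged(line: str) -> List[Tuple[str, str]]:
    """Parse whitespace-separated token/label items, splitting on the LAST '/'."""
    pairs = []
    for item in line.split():
        token, sep, label = item.rpartition("/")
        if not sep:
            raise ValueError(f"no '/' separator in {item!r}")
        pairs.append((token, label))
    return pairs


line = "due/O 11/31/2019/I-DATE"  # hypothetical input with a made-up DATE label

# Splitting on the last '/' keeps the slashes inside the token.
pairs = parse_slash_tagged(line)

# A naive split on every '/' breaks the date into four fields.
naive_fields = "11/31/2019/I-DATE".split("/")
```

Splitting on the last separator recovers the token "11/31/2019" here, but only because of the assumption about labels; a tool that splits on every "/" sees four fields instead of two, which is the kind of incompatibility described above.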
IOB files have no place to put commonly needed metadata, such as the character encoding being used, the data source, internal location markers, and so on.
More powerful formats (most obviously XML, but even JSON or s-expressions) can handle far more diverse annotations, have far less variation between implementations, and are often shorter and more readable as well. For example:
XML takes 80 bytes to do the same things as the 91 byte BIOES version shown above, or the 79 byte IOB version. However, it can easily also support sentence boundaries, part-of-speech annotations, location markers, and other features commonly needed in NLP systems. Breaking all tokens in particular places is not strictly part of the NER task; but even if every token were tagged (like "<T>is</T>") the total would grow only to 139 bytes: