IOBBER -- a chunker for Slavic languages based on CRF++ and WCCL.
2012, Adam Radziszewski, Wrocław University of Technology

This is free software. See LICENCE for details.

The chunker reads input file(s) and adds chunk annotation. It is also able
to recognise chunks' syntactic heads.

By default, both the input and the output format are assumed to be CCL. This
may be changed with the -i and -o options. The following formats are supported:
* xces -- morphosyntactically annotated document divided into sentences,
tokens and usually paragraphs (by default this division is assumed; if the
input is not divided into paragraphs or the existing division should be
ignored, use --sent-only); this is the XCES variant used in the IPI PAN Corpus
of Polish (korpus.pl);
* ccl -- a simple modification of the above format that allows chunk-style
annotations and their heads to be included; the specs may be found at:
http://nlp.pwr.wroc.pl/redmine/projects/corpus2/wiki/CCL_format
* iob-chan -- a very simple format that stores morphosyntactic annotation
(limited to one lemma,tag pair per token) and chunk-style annotation per
"channel". The format doesn't support chunk heads.

NOTE: the other formats defined in corpus2 should theoretically work as well,
but if you run into any trouble, it is safer to use maca-convert to pre-convert
the input to one of the above formats.


The above formats that support chunk annotation (ccl and iob-chan) assume that
there may be a number of independent "channels". Each channel holds the
chunking for one phrase type. This makes it possible to annotate differently
defined phrases in one file, even if some of the chunks overlap.

Although the data formats (and internal data representation) treat each chunk
type as a separate channel, Iobber may treat several chunk types as one
"layer", effectively treating them as one channel. This means that no chunks
from a given layer may overlap.

The kpwr.ini config defines two layers:
* layer1 with simple agreement-based noun/adj phrases: chunk_agp,
* layer2 with phrases based on pred-arg structure: chunk_np, chunk_adjp and
chunk_vp.
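
The relationship between channels and layers can be illustrated with a small
Python sketch. It is not part of IOBBER: the data structures and the function
names (spans_overlap, layer_overlaps) are hypothetical, and the token spans are
made up. Only the layer and channel names are taken from the kpwr.ini
description above.

```python
from itertools import combinations

# Layer definitions as described for kpwr.ini above.
LAYERS = {
    "layer1": ["chunk_agp"],
    "layer2": ["chunk_np", "chunk_adjp", "chunk_vp"],
}


def spans_overlap(a, b):
    """Return True if two half-open token spans (start, end) share a token."""
    return a[0] < b[1] and b[0] < a[1]


def layer_overlaps(channels, layer_channels):
    """List pairs of overlapping chunks among the channels of one layer.

    `channels` maps a channel name to a list of (start, end) token spans.
    This representation is an assumption made for illustration only.
    """
    chunks = [
        (name, span)
        for name in layer_channels
        for span in channels.get(name, [])
    ]
    return [
        (a, b)
        for a, b in combinations(chunks, 2)
        if spans_overlap(a[1], b[1])
    ]


# Made-up annotation of a short sentence (token indices, end exclusive).
channels = {
    "chunk_agp": [(0, 1), (3, 5)],
    "chunk_np": [(0, 1), (3, 6)],
    "chunk_vp": [(1, 3)],
}

for layer, names in LAYERS.items():
    # An empty list means the layer satisfies the no-overlap constraint.
    print(layer, layer_overlaps(channels, names))
```

In this made-up data, chunk_agp and chunk_np cover some of the same tokens.
That is acceptable because they belong to different layers; the no-overlap
constraint applies only among channels grouped into the same layer.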

There are two trained models distributed with IOBBER:
* model-kpwr11-H: recognises chunks and their syntactic heads,
* model-kpwr11: chunks but no heads.

NOTE: the current version of IOBBER is unable to recognise discontinuous chunks.
Discontinuities, however, may be expressed in the CCL format, and support for
them is likely to be added in the future.

If the input file contains annotations in channels other than those defined
in the config, they will be preserved (this makes it possible to e.g. chunk
files already annotated with named entities).



Example call using the bundled model:

iobber kpwr.ini -d model-kpwr11/ my_xces_input.xml -i xces -O ccl_chunked_output.xml

This will read the given XCES-encoded file and produce CCL output. Note: the
kpwr.ini config assumes that the input file has been morphosyntactically tagged
using the NKJP tagset.
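
When IOBBER is driven from a script, the same call can be assembled
programmatically. The following is a minimal Python sketch, not part of
IOBBER: the helper name iobber_command is hypothetical, it uses only the
options mentioned in this README (-d, -i, -O, --sent-only), and the position
of --sent-only on the command line is an assumption.

```python
import shlex


def iobber_command(config, model_dir, input_path, output_path,
                   input_format=None, sent_only=False):
    """Build an argument list mirroring the example call above.

    Hypothetical helper: only options named in this README are used.
    """
    argv = ["iobber", config, "-d", model_dir, input_path]
    if input_format is not None:
        argv += ["-i", input_format]
    if sent_only:
        # Assumed placement; see `iobber -h` for the authoritative usage.
        argv.append("--sent-only")
    argv += ["-O", output_path]
    return argv


argv = iobber_command("kpwr.ini", "model-kpwr11/", "my_xces_input.xml",
                      "ccl_chunked_output.xml", input_format="xces")
# Print the command in a shell-safe form instead of running it.
print(shlex.join(argv))
```

On a machine where IOBBER is installed, the resulting list could be passed to
subprocess.run(argv, check=True) to perform the actual chunking.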

Iobber also supports chunking stdin to stdout, as well as chunking multiple
files at a time; see -h for details.


There is also a convenience tool, named iobber_txt, that processes plain text
directly. The tool has an additional requirement: the WCRFT tagger must be
installed (http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki).

echo 'Polacy wciąż jadają zbyt mało ryb.' | iobber_txt -