diff --git a/INSTALL b/INSTALL
new file mode 100644
index 0000000000000000000000000000000000000000..cf575a5434326e851c866c3aa527ce9341d8b8a6
--- /dev/null
+++ b/INSTALL
@@ -0,0 +1,24 @@
+IOBBER, a chunker for Slavic languages based on CRF++ and WCCL
+(c) 2012, Adam Radziszewski (name.surname at pwr.wroc.pl)
+Institute of Informatics, Wrocław University of Technology
+
+
+The software is written in Python, but requires additional C++/Python modules to work.
+
+You need to install the following packages beforehand:
+* Python setuptools for installation,
+* WCCL with Python support; http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki
+* Corpus2 with Python support (also required by WCCL); http://nlp.pwr.wroc.pl/redmine/projects/corpus2/wiki
+* CRF++ with Python support (install CRF++ itself first, then enter the `python' subdir and install Python wrappers); http://crfpp.googlecode.com/svn/trunk/doc/index.html
+
+If the above packages have been correctly installed, the installation of iobber is simple:
+sudo python setup.py install
+
+This will install the Python modules (the iobber package), the iobber executable, the default configuration for KPWr and a trained model ready to use.
+
+To use the trained model, run the following command (for more details please consult README and the output of iobber -h):
+
+iobber kpwr.ini -d model-kpwr03/ my_xces_input.xml -i xces -O ccl_chunked_output.xml
+
+NOTE: the kpwr.ini configuration assumes that the input is morphosyntactically tagged.
+
diff --git a/README b/README
new file mode 100644
index 0000000000000000000000000000000000000000..3c9f078e4f87e7d6eca1c2f708096fed75bc371b
--- /dev/null
+++ b/README
@@ -0,0 +1,43 @@
+IOBBER -- a chunker for Slavic languages based on CRF++ and WCCL.
+2012, Adam Radziszewski, Wrocław University of Technology
+
+The chunker reads input file(s) and adds chunk annotation. By default, the
+input and output format is assumed to be CCL. This may be altered by using the
+-i and -o options. The following formats are supported:
+* xces -- morphosyntactically annotated document divided into sentences,
+tokens and usually paragraphs (by default this division is assumed; if the
+input is not divided into paragraphs or the existing division should be ignored,
+use --sent-only); this is the XCES variant as used in the IPI PAN Corpus of
+Polish (korpus.pl);
+* ccl -- a simple modification of the above format that allows chunk-style
+annotations to be included; the specs may be found at:
+http://nlp.pwr.wroc.pl/redmine/projects/corpus2/wiki/CCL_format
+* iob-chan -- a very simple format that allows storing morphosyntactic
+annotation (limited to one lemma-tag pair per token) and chunk-style
+annotation per "channel".
+
+NOTE: the rest of the formats defined in corpus2 should theoretically work,
+but in case of any trouble it is safer to use maca-convert to pre-convert
+the input to one of the above formats.
+
+
+The above formats that support chunk annotation (ccl and iob-chan) assume that
+there may be a number of independent "channels". A channel defines the chunking
+of one phrase type. This makes it possible to annotate differently defined
+phrases in one file, even if some of the chunks would overlap.
+
+Although the data formats (and internal data representation) treat each chunk
+type as a separate channel, Iobber may treat several chunk types as one
+"layer", effectively treating them as one channel. This means that no chunks
+from a given layer may overlap.
+
+The kpwr.ini config (and its trained model -- model-kpwr03) defines two layers:
+* layer1 with simple agreement-based noun/adj phrases: chunk_agp,
+* layer2 with phrases based on predicate-argument structure: chunk_np,
+chunk_adjp and chunk_vp.
+
+
+NOTE: the current version of iobber does not recognise chunks' syntactic heads.
+Also, it is unable to recognise discontinuous chunks. Both types of information
+may be included in ccl format and are likely to be supported in the future.
+
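The README's distinction between channels and layers can be illustrated with a small sketch. This is not iobber's internal code: the span representation (inclusive token indices), the LAYERS dictionary and the function overlapping_chunks are hypothetical names introduced only for illustration. Only the channel names and their grouping into layer1 and layer2 come from the README text above.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a chunk is (first_token, last_token), inclusive.
Span = Tuple[int, int]

# Grouping of channels into layers as described in the README for kpwr.ini.
LAYERS: Dict[str, List[str]] = {
    "layer1": ["chunk_agp"],
    "layer2": ["chunk_np", "chunk_adjp", "chunk_vp"],
}


def overlapping_chunks(channels: Dict[str, List[Span]], layer: List[str]):
    """Return pairs of overlapping chunks among the channels merged into one layer."""
    # Merge all chunks of the layer's channels and sort them by start position.
    merged = sorted(
        (start, end, name)
        for name in layer
        for start, end in channels.get(name, [])
    )
    conflicts = []
    for i, (s1, e1, n1) in enumerate(merged):
        for s2, e2, n2 in merged[i + 1:]:
            if s2 > e1:
                break  # sorted by start, so no later chunk can overlap this one
            conflicts.append(((n1, s1, e1), (n2, s2, e2)))
    return conflicts


if __name__ == "__main__":
    # Made-up annotation of one sentence, for illustration only.
    sentence_channels = {
        "chunk_agp": [(0, 1)],
        "chunk_np": [(0, 3)],
        "chunk_vp": [(3, 5)],
        "chunk_adjp": [(6, 6)],
    }
    for layer_name, layer_channels in LAYERS.items():
        print(layer_name, overlapping_chunks(sentence_channels, layer_channels))
```

In this made-up example chunk_agp and chunk_np share tokens, which is acceptable because they belong to different layers, whereas chunk_np and chunk_vp both cover token 3 within layer2 and are therefore reported as a conflict.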