iobber
    IOBBER -- a chunker for Slavic languages based on CRF++ and WCCL.
    2012, Adam Radziszewski, Wrocław University of Technology
    
    The chunker reads input file(s) and adds chunk annotation. By default, the
    input and output format is assumed to be CCL. This may be altered by using -i
    and -o options. The following formats are supported:
    * xces -- morphosyntactically annotated document divided into sentences,
    tokens and usually paragraphs (by default this division is assumed; if the
    input is not divided into paragraphs or the existing division should be
    ignored, use --sent-only); this is the XCES variant used in the IPI PAN
    Corpus of Polish (korpus.pl);
    * ccl -- a simple modification of the above format that allows chunk-style
    annotations to be included; the specs may be found at:
    http://nlp.pwr.wroc.pl/redmine/projects/corpus2/wiki/CCL_format
    * iob-chan -- a very simple format that stores morphosyntactic annotation
    (limited to one lemma,tag pair per token) and chunk-style annotation per
    "channel".
    
    NOTE: the other formats defined in corpus2 should theoretically work, but
    if you run into any trouble, it is safer to use maca-convert to pre-convert
    the input to one of the above formats.
    
    
    The formats above that support chunk annotation (ccl and iob-chan) assume
    that there may be a number of independent "channels". A channel defines the
    chunking for one type of phrase. This makes it possible to annotate
    differently defined phrases in one file, even if some of the chunks overlap.
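
To illustrate the idea of independent channels, here is a minimal Python sketch. It is not Iobber's or corpus2's actual data model: the per-token "B"/"I"/"O" tag lists, the helper name iob_to_spans and the sample sentence are all assumptions made for this example. Each channel is decoded on its own, so chunks from different channels are free to overlap.

```python
from typing import Dict, List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Turn per-token IOB tags into (start, end) chunk spans, end exclusive.

    Hypothetical helper for illustration; not part of Iobber.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open chunk began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))  # close the previous chunk
            start = i
        elif tag == "I":
            if start is None:
                start = i  # tolerate a chunk that starts with "I"
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        else:
            raise ValueError(f"unexpected tag {tag!r} at token {i}")
    if start is not None:
        spans.append((start, len(tags)))  # chunk runs to the end
    return spans


# Two channels over the same four tokens (made-up annotation).
channels: Dict[str, List[str]] = {
    "chunk_np": ["B", "I", "I", "O"],
    "chunk_agp": ["B", "I", "B", "O"],
}

# Each channel is decoded independently of the others.
spans = {name: iob_to_spans(tags) for name, tags in channels.items()}
print(spans)
```

In this made-up sentence the chunk_np span covers tokens 0 to 2, while chunk_agp splits the same stretch into two chunks. That is allowed because the two channels do not constrain each other.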
    
    Although the data formats (and internal data representation) treat each chunk
    type as a separate channel, Iobber may treat several chunk types as one
    "layer", effectively treating them as one channel. This means that no chunks
    from a given layer may overlap.
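
The effect of grouping several channels into one layer can be sketched as follows. This is only an illustration under assumed conventions, not Iobber's implementation: the function name merge_channels, the "B-name"/"I-name" label scheme and the sample tags are hypothetical. The sketch collapses the channels of one layer into a single tag sequence and rejects input in which two channels claim the same token.

```python
from typing import Dict, List


def merge_channels(channels: Dict[str, List[str]]) -> List[str]:
    """Merge the IOB channels of one layer into a single tag sequence.

    Hypothetical helper for illustration; raises ValueError if two
    channels mark the same token, since a layer allows no overlap.
    """
    lengths = {len(tags) for tags in channels.values()}
    if len(lengths) > 1:
        raise ValueError("all channels must cover the same tokens")
    n_tokens = lengths.pop() if lengths else 0

    layer = ["O"] * n_tokens
    for name, tags in channels.items():
        for i, tag in enumerate(tags):
            if tag == "O":
                continue
            if layer[i] != "O":
                raise ValueError(f"overlapping chunks at token {i}")
            layer[i] = f"{tag}-{name}"  # e.g. "B-chunk_np"
    return layer


# Made-up annotation for a four-token sentence.
layer2 = merge_channels({
    "chunk_np": ["B", "I", "O", "O"],
    "chunk_adjp": ["O", "O", "O", "O"],
    "chunk_vp": ["O", "O", "B", "O"],
})
print(layer2)
```

Merging the channels into one sequence is what makes the no-overlap restriction necessary: each token can carry only one label within a layer.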
    
    The kpwr.ini config (and its trained model -- model-kpwr03) defines two layers:
    * layer1 with simple agreement-based noun/adjective phrases: chunk_agp,
    * layer2 with phrases based on predicate-argument structure: chunk_np,
    chunk_adjp and chunk_vp.
    
    
    NOTE: the current version of iobber does not recognise chunks' syntactic heads.
    It is also unable to recognise discontinuous chunks. Both types of information
    may be included in the ccl format, and they are likely to be supported in the
    future.