WCRFT, Wrocław CRF Tagger
(C) 2012 Adam Radziszewski (name.surname at pwr.wroc.pl)
Institute of Informatics, Wrocław University of Technology
http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki
Dependencies:
* Python 2.6 with headers
* SWIG
* CRF++ with Python support (install CRF++ itself first, then enter the `python' subdir and install Python wrappers); http://crfpp.googlecode.com/svn/trunk/doc/index.html
* corpus2 library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) installed with Python support
* wccl library (http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki) installed with Python support
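Before running the tagger, it can help to confirm that the Python bindings listed above are importable. The following is a minimal sketch, not part of WCRFT. The module names are assumptions: CRFPP is the module built from the CRF++ `python' subdir, while corpus2 and wccl are assumed to be named after their libraries.

```python
# Assumed module names (not confirmed by this README):
# CRFPP is the CRF++ Python wrapper; corpus2 and wccl are assumed
# to match the names of the libraries they wrap.
REQUIRED_MODULES = ["CRFPP", "corpus2", "wccl"]


def missing_modules(names):
    """Return the names from the given list that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All required Python modules can be imported.")
```

If a module is reported missing, the corresponding library was most likely built without Python support or installed for a different Python version.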
The tagger is able to tag morphologically analysed input (sentences divided into tokens, tokens assigned lists of candidate interpretations).
If you need to tag plain text, it is recommended to use MACA for the analysis (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki).
Installation:
Currently there is no setup script; please use the wcrft.py module directly. See wcrft.py -h for details. A standard setup.py script will be included in future versions.
Basic usage:
The package comes with a ready-made configuration for tagging with the NCP (nkjp.pl) tagset, stored in config/nkjp.ini. A configuration specifies parameter values and points to a file with the features used for the different layers. To get a working tagger, a TRAINED MODEL is also needed. You can obtain one by training the tagger on a reference corpus and storing the model in a given directory, for instance:
wcrft/wcrft.py -d path/to/nkjp_model config/nkjp.ini --train path/to/training-corpus.xml -i xces
Note: for best results, it is highly recommended to re-analyse the training data using the same version of the morphological analyser (e.g. the same MACA config) as will be used when tagging. The model available for download at the WCRFT wiki page was trained on data prepared this way.
To use the trained model to tag a single file:
wcrft/wcrft.py -d path/to/nkjp_model config/nkjp-k11.ini input.xml -O tagged.xml
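To tag several files with the same model, the command above can be wrapped in a short script. This is only a sketch: the helper names tag_command and tag_directory are hypothetical, it uses only the options shown above (-d and -O), and it assumes wcrft/wcrft.py is executable from the current directory. It starts the tagger once per file, so check wcrft.py -h for a built-in alternative before relying on it.

```python
import os
import subprocess


def tag_command(model_dir, config, input_path, output_path,
                wcrft="wcrft/wcrft.py"):
    """Build the argument list for tagging one file (same form as the command above)."""
    return [wcrft, "-d", model_dir, config, input_path, "-O", output_path]


def tag_directory(model_dir, config, in_dir, out_dir):
    """Tag every .xml file in in_dir; write each result under the same name in out_dir."""
    for name in sorted(os.listdir(in_dir)):
        if name.endswith(".xml"):
            cmd = tag_command(model_dir, config,
                              os.path.join(in_dir, name),
                              os.path.join(out_dir, name))
            # check_call raises CalledProcessError if the tagger exits with an error
            subprocess.check_call(cmd)
```

For example, tag_directory("path/to/nkjp_model", config_path, "in", "out") tags each XML file in in/ and writes the results to out/, where config_path is the same configuration file you would pass on the command line. The out/ directory must already exist.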
For more details, see wcrft.py -h and the project wiki.