Commit 3d1f363f authored by Radosław Warzocha

Updated README and help text

parent 4c5ccf31
@@ -4,15 +4,19 @@ Institute of Informatics, Wrocław University of Technology
 http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki
 Dependencies:
-* Python 2.6 with headers
-* SWIG
-* CRF++ with Python support (install CRF++ itself first, then enter the `python' subdir and install Python wrappers); http://crfpp.googlecode.com/svn/trunk/doc/index.html
-* corpus2 library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) installed with Python support
-* MACA library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) installed with Python support
+* g++ 4.6.3
+* CRF++ - http://crfpp.googlecode.com/svn/trunk/doc/index.html
+* Corpus2 library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki)
+* MACA library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki)
 * Morfeusz SGJP (http://sgjp.pl/morfeusz/index.html), please install it before installing MACA so that it also builds Morfeusz plugin
-* wccl library (http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki) installed with Python support
+* WCCL library (http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki)
+WCRFT (Wrocław CRF Tagger) is a simple morpho-syntactic tagger for Polish.
+The tagger combines tiered tagging, conditional random fields (CRF) and features tailored for inflective languages written in WCCL. The algorithm and code are inspired by Wrocław Memory-Based Tagger. WCRFT uses CRF++ API as the underlying CRF implementation.
+Tiered tagging is assumed. Grammatical class is disambiguated first, then subsequent attributes (as defined in a config file) are taken care of. Each attribute is treated with a separate CRF and may be supplied a different set of feature templates.
 The tagger is able to tag morphologically analysed input (sentences divided into tokens, tokens assigned lists of candidate interpretations).
 If you need to tag plain text, it is recommended to use MACA for the analysis (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki).
@@ -32,13 +36,12 @@ There are two possibilities with respect to placement of the model:
 Basic usage:
 The package comes with ready-made configuration for tagging (NCP, nkjp.pl) tagset. The configuration is config/nkjp.ini. A configuration specifies parameter values and points to a file with features used for different layers. To get a working tagger, a TRAINED MODEL is also needed. You can obtain one by training the tagger with a reference corpus and storing the model to a given directory, for instance:
-wcrft/wcrft.py -d path/to/nkjp_model config/nkjp_s2.ini --train path/to/training-corpus.xml -i xces
+wcrft-app -d path/to/nkjp_model config/nkjp_s2.ini --train path/to/training-corpus.xml -i xces
 Note: for best results it is highly recommended to re-analyse the training data using the same version of morphological analyser (e.g. the same MACA config) as will be used when tagging. The model available for download at the WCRFT wiki page already includes this.
 To use the trained model to tag a single file:
-wcrft/wcrft.py -d path/to/nkjp_model config/nkjp_s2.ini input.xml -O tagged.xml
+wcrft-app -d path/to/nkjp_model config/nkjp_s2.ini input.xml -O tagged.xml
-For more details, see wcrft.py -h and the project wiki.
+For more details, see wcrft-app -h and the project wiki.
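The tiered scheme described in the README text of this commit (grammatical class first, then each further attribute with its own model) can be illustrated with a small std-only C++ sketch. This is not WCRFT's implementation: `Interp`, `Scorer` and `disambiguate_tiered` are hypothetical names, the scorer is only a stand-in for a per-attribute CRF, and each token is handled in isolation, whereas a CRF scores whole sentences.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// One candidate interpretation of a token: attribute name -> value,
// e.g. {"class" -> "subst", "case" -> "nom"}. Hypothetical representation.
using Interp = std::map<std::string, std::string>;

// Hypothetical stand-in for a per-tier CRF: scores one value of one attribute.
using Scorer =
    std::function<double(const std::string& attr, const std::string& value)>;

// Narrow the candidates tier by tier. For each attribute, pick the
// best-scoring value still present among the candidates, then keep only the
// interpretations that carry exactly that value. A tier that no remaining
// candidate defines is skipped.
std::vector<Interp> disambiguate_tiered(std::vector<Interp> cands,
                                        const std::vector<std::string>& tiers,
                                        const Scorer& score) {
    for (const std::string& attr : tiers) {
        bool have_best = false;
        std::string best;
        double best_score = 0.0;
        for (const Interp& c : cands) {
            auto it = c.find(attr);
            if (it == c.end()) continue;
            const double s = score(attr, it->second);
            if (!have_best || s > best_score) {
                have_best = true;
                best = it->second;
                best_score = s;
            }
        }
        if (!have_best) continue;

        std::vector<Interp> kept;
        for (const Interp& c : cands) {
            auto it = c.find(attr);
            if (it != c.end() && it->second == best) kept.push_back(c);
        }
        cands.swap(kept);
    }
    return cands;
}
```

Calling it with tiers `{"class", "case"}` resolves the class first, so a case value is only ever compared among interpretations that already agree on the class.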
@@ -18,6 +18,7 @@
 #include <utility>
 #include <vector>
+#include <boost/algorithm/string/replace.hpp>
 #include <boost/foreach.hpp>
 #include <boost/program_options.hpp>
@@ -27,15 +28,23 @@
 #include "program_options.h"
-const std::string ADDITIONAL_FORMAT_INFO = "INFO: formats: txt premorph; require installed Maca and Morfeusz\n" + Wcrft::FORMAT_HELP;
+const std::string ADDITIONAL_FORMAT_INFO =
+    "Supported I/O formats: txt premorph; require installed Maca and Morfeusz\n" +
+    boost::algorithm::replace_all_copy(Wcrft::FORMAT_HELP, "input formats: ccl",
+                                       "input formats: txt premorph ccl");
 const std::string DESCRIPTION = "wcrft [options] CONFIGFILE [INPUT...]\n\
 \n\
 WCRFT, Wroclaw CRF Tagger\n\
 (C) 2012, Wroclaw University of Technology\n\
 \n\
-Tags input file(s) using the selected configuration. Use -d to specify where to\n\
-look for a trained tagger model (or where to store a model when training).\n\
+Tags input file(s) using the selected configuration (e.g. nkjp_e2.ini).\n\
+Configurations may provide a default name of a trained tagger model\n\
+(the standard configurations do), so if both the tagger and the model\n\
+are installed properly, you don't have to worry about the trained tagger model.\n\
+Otherwise, you can use -d to specify where to look for a trained tagger model.\n\
+This may also be used to override the default model dir. When training,\n\
+use -d to specify a directory where the trained model should be saved.\n\
 \n\
 Use -O to specify output path (by default will write to stdout).\n\
 Use - to tag stdin to stdout.\n\
......
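The new `ADDITIONAL_FORMAT_INFO` in the hunk above patches the library's help text by substring replacement. The sketch below shows the same substitution with a std-only helper, so it builds without Boost. `replace_all_copy_std` is a hypothetical stand-in for `boost::algorithm::replace_all_copy`, and `FORMAT_HELP_EXAMPLE` is an invented sample, since the real contents of `Wcrft::FORMAT_HELP` are not shown in this diff.

```cpp
#include <cstddef>
#include <string>

// Std-only stand-in for boost::algorithm::replace_all_copy: returns a copy of
// `input` in which every occurrence of `from` is replaced by `to`.
std::string replace_all_copy_std(std::string input, const std::string& from,
                                 const std::string& to) {
    if (from.empty()) return input;  // nothing sensible to search for
    std::size_t pos = 0;
    while ((pos = input.find(from, pos)) != std::string::npos) {
        input.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return input;
}

// Invented sample text; the real Wcrft::FORMAT_HELP is an assumption here.
const std::string FORMAT_HELP_EXAMPLE =
    "input formats: ccl xces\noutput formats: ccl xces\n";

// Mirrors the replacement made in the diff, applied to the sample text.
std::string extended_format_help() {
    return replace_all_copy_std(FORMAT_HELP_EXAMPLE, "input formats: ccl",
                                "input formats: txt premorph ccl");
}
```

One property of this approach: if the library's help text ever stops containing the exact substring `input formats: ccl`, the replacement does nothing and the extra formats are not advertised.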
@@ -32,10 +32,10 @@ prog_opts::options_description create_options_description(const std::string& des
 ("input-format,i", prog_opts::value<std::string>()->default_value("xces"), "set the input format")
 ("output-format,o", prog_opts::value<std::string>()->default_value("xces"), "set the output format")
 ("output-file,O", prog_opts::value<std::string>()->default_value(""), "set output filename (do not write to stdout)")
-("data-dir,d", prog_opts::value<std::string>(), "assume WCCL and trained model to sit in the given dir")
+("data-dir,d", prog_opts::value<std::string>(), "search for trained model in the given dir")
 ("maca-config,mc", prog_opts::value<std::string>()->default_value("morfeusz_nkjp"), "overrides maca config file")
 ("ambiguity,A", prog_opts::value<bool>()->default_value(false), "preserve non-disamb interpretations after tagging")
-("chunks,C", prog_opts::value<bool>()->default_value(true), "preserve input paragraph chunks")
+("chunks,C", prog_opts::value<bool>()->default_value(false), "preserve input paragraph chunks (the default is to read sentences only)")
 ("verbose,v", prog_opts::value<bool>()->default_value(false), "verbose mode")
 ("train", prog_opts::value<bool>()->default_value(false), "train the tagger")
 ("batch", prog_opts::value<bool>()->default_value(false), "treat arguments as lists of paths to files")
......