Updated README and help text

Commit 3d1f363f authored May 27, 2014 by Radosław Warzocha
--- a/README
+++ b/README
@@ -4,15 +4,19 @@ Istitute of Informatics, Wrocław University of Technology
 http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki
 Dependencies:
-* Python 2.6 with headers
+* g++ 4.6.3
-* SWIG
+* CRF++ - http://crfpp.googlecode.com/svn/trunk/doc/index.html
-* CRF++ with Python support (install CRF++ itself first, then enter the `python' subdir and install Python wrappers); http://crfpp.googlecode.com/svn/trunk/doc/index.html
+* Corpus2 library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki)
-* corpus2 library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) installed with Python support
+* MACA library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki)
-* MACA library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) installed with Python support
 * Morfeusz SGJP (http://sgjp.pl/morfeusz/index.html), please install it before installing MACA so that it also builds Morfeusz plugin
-* wccl library (http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki) installed with Python support
+* WCCL library (http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki)
+WCRFT (Wrocław CRF Tagger) is a simple morpho-syntactic tagger for Polish.
+The tagger combines tiered tagging, conditional random fields (CRF) and features tailored for inflective languages written in WCCL. The algorithm and code are inspired by Wrocław Memory-Based Tagger. WCRFT uses CRF++ API as the underlying CRF implementation.
+Tiered tagging is assumed. Grammatical class is disambiguated first, then subsequent attributes (as defined in a config file) are taken care of. Each attribute is treated with a separate CRF and may be supplied a different set of feature templates.
 The tagger is able to tag morphologically analysed input (sentences divided into tokens, tokens assigned lists of candidate interpretations).
 If you need to tag plain text, it is recommended to use MACA for the analysis (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki).
@@ -32,13 +36,12 @@ There are two possibilities with respect to placement of the model:
 Basic usage:
 The package comes with ready-made configuration for tagging (NCP, nkjp.pl) tagset. The configuration is config/nkjp.ini. A configuration specifies parameter values and points to a file with features used for different layers. To get a working tagger, a TRAINED MODEL is also needed. You can obtain one by training the tagger with a reference corpus and storing the model to a given directory, for instance:
-wcrft/wcrft.py -d path/to/nkjp_model config/nkjp_s2.ini --train path/to/training-corpus.xml -i xces
+wcrft-app -d path/to/nkjp_model config/nkjp_s2.ini --train path/to/training-corpus.xml -i xces
 Note: for best results it is highly recommended to re-analyse the training data using the same version of the morphological analyser (e.g. the same MACA config) that will be used during tagging. The model available for download at the WCRFT wiki page already includes this.
 To use the trained model to tag a single file:
-wcrft/wcrft.py -d path/to/nkjp_model config/nkjp_s2.ini input.xml -O tagged.xml
+wcrft-app -d path/to/nkjp_model config/nkjp_s2.ini input.xml -O tagged.xml
-For more details, see wcrft.py -h and the project wiki.
+For more details, see wcrft-app -h and the project wiki.
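The tiered scheme described in the README text above (grammatical class first, then the remaining attributes, one model per attribute) can be sketched roughly as follows. This is only an illustration, not WCRFT's code: `Interpretation`, `TierChooser`, `tag_tiered` and `pick_first` are hypothetical names, and the chooser callback stands in for the per-attribute CRF, whose real interface is not shown in this commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// One candidate interpretation of a token: attribute name -> value,
// e.g. {"class": "subst", "case": "nom"}. (Hypothetical representation.)
typedef std::map<std::string, std::string> Interpretation;

// Hypothetical stand-in for a per-attribute CRF: given the attribute name and
// the values still possible for it, return the value to keep.
typedef std::function<std::string(const std::string&,
                                  const std::vector<std::string>&)> TierChooser;

// Value of `attr` in `interp`, or "" when the attribute is absent.
std::string value_of(const Interpretation& interp, const std::string& attr) {
    Interpretation::const_iterator it = interp.find(attr);
    return it == interp.end() ? std::string() : it->second;
}

// Trivial placeholder chooser: always keeps the first value offered.
std::string pick_first(const std::string&, const std::vector<std::string>& values) {
    return values.front();
}

// Disambiguate tier by tier: for each attribute in `tiers` (grammatical class
// is expected to come first), pick one value and drop the candidates that
// disagree with it.
std::vector<Interpretation> tag_tiered(std::vector<Interpretation> candidates,
                                       const std::vector<std::string>& tiers,
                                       const TierChooser& choose) {
    for (std::size_t t = 0; t < tiers.size(); ++t) {
        const std::string& attr = tiers[t];

        // Collect the distinct values still possible for this attribute.
        std::vector<std::string> values;
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            const std::string v = value_of(candidates[i], attr);
            if (std::find(values.begin(), values.end(), v) == values.end())
                values.push_back(v);
        }
        if (values.size() < 2) continue;  // nothing left to decide on this tier

        const std::string chosen = choose(attr, values);
        std::vector<Interpretation> kept;
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            if (value_of(candidates[i], attr) == chosen)
                kept.push_back(candidates[i]);
        }
        assert(!kept.empty());  // the chooser must return one of the offered values
        candidates.swap(kept);
    }
    return candidates;
}
```

Calling `tag_tiered(candidates, tiers, pick_first)` with tiers `class` then `case` narrows the candidate list one attribute at a time; a real tagger would replace `pick_first` with a model that looks at the sentence context.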
--- a/wcrft-app/main.cpp
+++ b/wcrft-app/main.cpp
@@ -18,6 +18,7 @@
 #include <utility>
 #include <vector>
+#include <boost/algorithm/string/replace.hpp>
 #include <boost/foreach.hpp>
 #include <boost/program_options.hpp>
@@ -27,15 +28,23 @@
 #include "program_options.h"
-const std::string ADDITIONAL_FORMAT_INFO = "INFO: formats: txt premorph; require installed Maca and Morfeusz\n" + Wcrft::FORMAT_HELP;
+const std::string ADDITIONAL_FORMAT_INFO = 
+"Supported I/O formats: txt premorph; require installed Maca and Morfeusz\n" +
+boost::algorithm::replace_all_copy(Wcrft::FORMAT_HELP, "input formats: ccl", 
+								   "input formats: txt premorph ccl");
 const std::string DESCRIPTION = "wcrft [options] CONFIGFILE [INPUT...]\n\
 \n\
 WCRFT, Wroclaw CRF Tagger\n\
 (C) 2012, Wroclaw University of Technology\n\
 \n\
-Tags input file(s) using the selected configuration. Use -d to specify where to\n\
+Tags input file(s) using the selected configuration (e.g. nkjp_e2.ini).\n\
-look for a trained tagger model (or where to store a model when training).\n\
+Configurations may provide the default name of a trained tagger model\n\
+(the standard configurations do), so if both the tagger and the model\n\
+are installed properly, you don't have to specify the trained tagger model.\n\
+Otherwise, use -d to specify where to look for a trained tagger model.\n\
+This may also be used to override the default model dir. When training,\n\
+use -d to specify the directory where the trained model should be saved.\n\
 \n\
 Use -O to specify output path (by default will write to stdout).\n\
 Use - to tag stdin to stdout.\n\
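The change to ADDITIONAL_FORMAT_INFO rewrites the library's format help so that the extra input formats appear in the list. A minimal standard-library sketch of that substitution is given below; `replace_all_std` and `extend_format_help` are hypothetical names, the commit itself uses boost::algorithm::replace_all_copy, and the actual contents of Wcrft::FORMAT_HELP are not shown here, so any sample help string is an assumption.

```cpp
#include <cassert>
#include <string>

// Returns a copy of `text` with every occurrence of `from` replaced by `to`.
// Plain-std stand-in for the Boost call used in the diff.
std::string replace_all_std(std::string text,
                            const std::string& from,
                            const std::string& to) {
    assert(!from.empty());  // an empty pattern would never advance
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}

// Adds the formats handled by the application ("txt", "premorph") to the
// input-format list of a help string supplied by the library.
std::string extend_format_help(const std::string& format_help) {
    return replace_all_std(format_help,
                           "input formats: ccl",
                           "input formats: txt premorph ccl");
}
```

One property of this approach is worth noting: if the library's help text ever stops containing the exact phrase `input formats: ccl`, the substitution silently does nothing and the extra formats disappear from the help output.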

--- a/wcrft-app/program_options.cpp
+++ b/wcrft-app/program_options.cpp
@@ -32,10 +32,10 @@ prog_opts::options_description create_options_description(const std::string& des
 		("input-format,i",    prog_opts::value<std::string>()->default_value("xces"), "set the input format")
 		("output-format,o",   prog_opts::value<std::string>()->default_value("xces"), "set the output format")
 		("output-file,O",     prog_opts::value<std::string>()->default_value(""),	  "set output filename (do not write to stdout)")
-		("data-dir,d",        prog_opts::value<std::string>(), "assume WCCL and trained model to sit in the given dir")
+		("data-dir,d",        prog_opts::value<std::string>(), "search for trained model in the given dir")
 		("maca-config,mc",    prog_opts::value<std::string>()->default_value("morfeusz_nkjp"), "overrides maca config file")
 		("ambiguity,A",   prog_opts::value<bool>()->default_value(false),  "preserve non-disamb interpretations after tagging")
-		("chunks,C",	  prog_opts::value<bool>()->default_value(true),  "preserve input paragraph chunks")
+		("chunks,C",	  prog_opts::value<bool>()->default_value(false),  "preserve input paragraph chunks (the default is to read sentences only)")
 		("verbose,v",	  prog_opts::value<bool>()->default_value(false), "verbose mode")
 		("train",	 prog_opts::value<bool>()->default_value(false),  "train the tagger")
 		("batch",	 prog_opts::value<bool>()->default_value(false), "treat arguments as lists of paths to files")
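The new help text says that a configuration may name a default model directory and that -d overrides it. One way to express that precedence is sketched below. The function name `resolve_model_dir` and the convention that an empty string means "not set" are assumptions made for illustration; how wcrft-app actually resolves the directory is not part of this diff.

```cpp
#include <stdexcept>
#include <string>

// Picks the directory to load the trained model from.
// `cli_data_dir` is the value given with -d ("" when the option is absent);
// `config_default_dir` is the default named by the configuration ("" if none).
std::string resolve_model_dir(const std::string& cli_data_dir,
                              const std::string& config_default_dir) {
    if (!cli_data_dir.empty())
        return cli_data_dir;        // an explicit -d always wins
    if (!config_default_dir.empty())
        return config_default_dir;  // otherwise fall back to the config default
    throw std::runtime_error(
        "no model directory: pass -d or set a default in the configuration");
}
```

Failing loudly when neither source provides a directory keeps a missing model from surfacing later as a less obvious error.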