Commit 5f359ae5 authored by Adam Radziszewski

First step towards merge between rewrite-c-newconfig and new stuff in master (explicit CLASS layer).
Merging in changes to models and configs, README and tagger app.
Merge commit '3d1f3' into rewrite-c-newconfig.
parents 8b1f9af2 3d1f363f
Showing changes with 639 additions and 17 deletions
@@ -4,15 +4,19 @@ Istitute of Informatics, Wrocław University of Technology
 http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki
 Dependencies:
-* Python 2.6 with headers
-* SWIG
-* CRF++ with Python support (install CRF++ itself first, then enter the `python' subdir and install Python wrappers); http://crfpp.googlecode.com/svn/trunk/doc/index.html
-* corpus2 library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) installed with Python support
-* MACA library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) installed with Python support
+* g++ 4.6.3
+* CRF++ - http://crfpp.googlecode.com/svn/trunk/doc/index.html
+* Corpus2 library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki)
+* MACA library (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki)
 * Morfeusz SGJP (http://sgjp.pl/morfeusz/index.html), please install it before installing MACA so that it also builds Morfeusz plugin
-* wccl library (http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki) installed with Python support
+* WCCL library (http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki)
+WCRFT (Wrocław CRF Tagger) is a simple morpho-syntactic tagger for Polish.
+The tagger combines tiered tagging, conditional random fields (CRF) and features tailored for inflective languages written in WCCL. The algorithm and code are inspired by Wrocław Memory-Based Tagger. WCRFT uses CRF++ API as the underlying CRF implementation.
+Tiered tagging is assumed. Grammatical class is disambiguated first, then subsequent attributes (as defined in a config file) are taken care of. Each attribute is treated with a separate CRF and may be supplied a different set of feature templates.
 The tagger is able to tag morphologically analysed input (sentences divided into tokens, tokens assigned lists of candidate interpretations).
 If you need to tag plain text, it is recommended to use MACA for the analysis (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki).
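The tiered scheme described in the README text of this hunk (class first, then each further attribute with its own model) can be illustrated with a minimal Python sketch. This is not WCRFT's code: the `choose_value` callback is a hypothetical stand-in for the per-attribute CRF, and representing interpretations as dictionaries is an assumption made purely for illustration.

```python
def tiered_disambiguate(candidates, attrs, choose_value):
    """Narrow one token's candidate interpretations tier by tier.

    candidates   -- list of dicts, e.g. {"CLASS": "subst", "cas": "nom"}
    attrs        -- attribute names in tier order (CLASS first)
    choose_value -- hypothetical stand-in for the per-attribute CRF;
                    called as choose_value(attr, possible_values)
    """
    remaining = list(candidates)
    for attr in attrs:
        # Values of this attribute that survived the earlier tiers.
        values = sorted({c[attr] for c in remaining if attr in c})
        if len(values) <= 1:
            continue  # nothing left to decide on this tier
        chosen = choose_value(attr, values)
        # Drop interpretations that contradict the decision.
        remaining = [c for c in remaining if c.get(attr) == chosen]
    return remaining


# Toy data (invented): three candidate interpretations of one token.
candidates = [
    {"CLASS": "subst", "nmb": "sg", "cas": "nom"},
    {"CLASS": "subst", "nmb": "sg", "cas": "acc"},
    {"CLASS": "adj", "nmb": "sg", "cas": "nom"},
]
# A dummy chooser that always takes the alphabetically last value.
result = tiered_disambiguate(candidates, ["CLASS", "nmb", "cas"],
                             lambda attr, values: values[-1])
print(result)
```

In the sketch each decision only removes candidates, so later tiers work on whatever the earlier tiers left; a real model would also look at the neighbouring tokens rather than at one token in isolation.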
@@ -32,13 +36,12 @@ There are two possibilities with respect to placement of the model:
 Basic usage:
 The package comes with ready-made configuration for tagging (NCP, nkjp.pl) tagset. The configuration is config/nkjp.ini. A configuration specifies parameter values and points to a file with features used for different layers. To get a working tagger, a TRAINED MODEL is also needed. You can obtain one by training the tagger with a reference corpus and storing the model to a given directory, for instance:
-wcrft/wcrft.py -d path/to/nkjp_model config/nkjp_s2.ini --train path/to/training-corpus.xml -i xces
+wcrft-app -d path/to/nkjp_model config/nkjp_s2.ini --train path/to/training-corpus.xml -i xces
 Note: for best results it is highly recommended to re-analyse the training data using the same version of morphological analyser (e.g. the same MACA config) as will be using during tagger usage. The model available for download at the WCRFT wiki page already includes this.
 To use the trained model to tag a single file:
-wcrft/wcrft.py -d path/to/nkjp_model config/nkjp_s2.ini input.xml -O tagged.xml
+wcrft-app -d path/to/nkjp_model config/nkjp_s2.ini input.xml -O tagged.xml
-For more details, see wcrft.py -h and the project wiki.
+For more details, see wcrft-app -h and the project wiki.
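For scripted use, the two `wcrft-app` invocations from the README hunk above can be assembled as argument lists. The sketch below only builds and prints the commands; the helper names `train_command` and `tag_command` are hypothetical, and only the options that appear in the README (`-d`, `--train`, `-i`, `-O`) are used.

```python
import shlex


def train_command(model_dir, config, corpus, input_format="xces"):
    """Argument list for training, mirroring the README example."""
    return ["wcrft-app", "-d", model_dir, config,
            "--train", corpus, "-i", input_format]


def tag_command(model_dir, config, input_file, output_file):
    """Argument list for tagging a single file, mirroring the README example."""
    return ["wcrft-app", "-d", model_dir, config, input_file, "-O", output_file]


print(shlex.join(train_command("path/to/nkjp_model", "config/nkjp_s2.ini",
                               "path/to/training-corpus.xml")))
print(shlex.join(tag_command("path/to/nkjp_model", "config/nkjp_s2.ini",
                             "input.xml", "tagged.xml")))
```

If `wcrft-app` is installed, either list could be passed to `subprocess.run(..., check=True)`; keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.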
@@ -88,4 +88,5 @@ message(STATUS "Model directory is in ${libwcrft_SRC_MODEL_DIR}")
 install(DIRECTORY ${libwcrft_SRC_MODEL_DIR}/
 DESTINATION ${libwcrft_INSTALL_DATA_DIR}
 FILES_MATCHING PATTERN model
-PATTERN model/*)
+PATTERN model/*
+PATTERN model/*/*)
-; NKJP tagset with unknown word treatment. This is the recommended config for NKJP.
-;
+; NKJP tagset with unknown word treatment.
+; This is an OUTDATED config for NKJP.
+; Use nkjp_s2 (slightly better)
+; or nkjp_e2 (much smaller and somewhat faster, works just slightly
+; worse than nkjp_s2).
 [general]
 tagset = nkjp
 ; all the attrs
-attrs = nmb,cas,gnd,per,deg,asp,ngt,acm,acn,ppr,agg,vcl,dot
+attrs = CLASS,nmb,cas,gnd,per,deg,asp,ngt,acm,acn,ppr,agg,vcl,dot
 macacfg = morfeusz-nkjp-official
 [lexicon]
......
# Unigram
# orth
U00:%x[-2,1]
U01:%x[-1,1]
U02:%x[0,1]
U03:%x[1,1]
U04:%x[2,1]
U05:%x[-1,0]
U06:%x[0,0]
U07:%x[1,0]
# class
U10:%x[-2,2]
U11:%x[-1,2]
U12:%x[0,2]
U13:%x[1,2]
U14:%x[2,2]
# cas
U20:%x[-2,3]
U21:%x[-1,3]
U22:%x[0,3]
U23:%x[1,3]
U24:%x[2,3]
# gnd
U30:%x[-2,4]
U31:%x[-1,4]
U32:%x[0,4]
U33:%x[1,4]
U34:%x[2,4]
# nmb
U40:%x[-2,5]
U41:%x[-1,5]
U42:%x[0,5]
U43:%x[1,5]
U44:%x[2,5]
# regex feats
U61:%x[0,8]/%x[0,9]
# Bigram
B
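The template lines above use CRF++'s `%x[row,col]` macro, where `row` is relative to the current token and `col` is an absolute column of the feature matrix. A minimal Python sketch of that expansion follows. It is an illustration rather than CRF++'s implementation, the toy rows are invented, and the `_B-1` / `_B+1` boundary placeholders follow CRF++'s convention as best recalled, so they should be checked against the CRF++ version in use.

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, pos):
    """Expand every %x[row,col] macro in `template` for the token at `pos`."""
    def lookup(match):
        offset, col = int(match.group(1)), int(match.group(2))
        idx = pos + offset
        if idx < 0:
            return "_B%d" % idx                      # before sentence start
        if idx >= len(rows):
            return "_B+%d" % (idx - len(rows) + 1)   # past sentence end
        return rows[idx][col]
    return MACRO.sub(lookup, template)


# Toy sentence "Ala ma kota"; columns: 0 = prefix, 1 = suffix, 2 = class.
rows = [
    ["ala", "ala", "subst"],
    ["ma", "ma", "fin"],
    ["kot", "ota", "subst"],
]
print(expand("U02:%x[0,1]", rows, 2))
print(expand("U00:%x[-2,1]", rows, 0))
print(expand("U16:%x[-1,2]/%x[0,2]", rows, 1))
```

The `Uxx:` prefix keeps features from different templates distinct even when they expand to the same value, and the lone `B` line asks CRF++ for label bigram features.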
# Unigram
# orth
U00:%x[-2,1]
U01:%x[-1,1]
U02:%x[0,1]
U03:%x[1,1]
U04:%x[2,1]
U05:%x[-1,0]
U06:%x[0,0]
U07:%x[1,0]
# class
U10:%x[-2,2]
U11:%x[-1,2]
U12:%x[0,2]
U13:%x[1,2]
U14:%x[2,2]
# cas
U21:%x[-1,3]
U22:%x[0,3]
U23:%x[1,3]
# gnd
U32:%x[0,4]
# nmb
U41:%x[-1,5]
U42:%x[0,5]
U43:%x[1,5]
# regex feats
U61:%x[0,8]/%x[0,9]
# Bigram
B
# Unigram
# orth
U00:%x[-2,1]
U01:%x[-1,1]
U02:%x[0,1]
U03:%x[1,1]
U04:%x[2,1]
U05:%x[-1,0]
U06:%x[0,0]
U07:%x[1,0]
U08:%x[-1,1]/%x[0,1]
U09:%x[0,1]/%x[1,1]
# class
U10:%x[-2,2]
U11:%x[-1,2]
U12:%x[0,2]
U13:%x[1,2]
U14:%x[2,2]
U15:%x[-2,2]/%x[-1,2]
U16:%x[-1,2]/%x[0,2]
U17:%x[0,2]/%x[1,2]
U18:%x[1,2]/%x[2,2]
# cas
U20:%x[-2,3]
U21:%x[-1,3]
U22:%x[0,3]
U23:%x[1,3]
U24:%x[2,3]
# gnd
U30:%x[-2,4]
U31:%x[-1,4]
U32:%x[0,4]
U33:%x[1,4]
U34:%x[2,4]
# nmb
U40:%x[-2,5]
U41:%x[-1,5]
U42:%x[0,5]
U43:%x[1,5]
U44:%x[2,5]
# agr
U50:%x[-1,6] # agr(0,2) -> agr(-1,0)
U51:%x[0,6] # agr(0,2)
U52:%x[-1,7] # agr..(-1,2) -> agr(-2,0)
U53:%x[0,7] # (-1,2)
U54:%x[1,7] # ... -> (0,3)
# regex feats
U61:%x[0,8]/%x[0,9]
# Bigram
B
# Unigram
# orth
U00:%x[-2,1]
U01:%x[-1,1]
U02:%x[0,1]
U03:%x[1,1]
U04:%x[2,1]
U05:%x[-1,0]
U06:%x[0,0]
U07:%x[1,0]
U08:%x[-1,1]/%x[0,1]
U09:%x[0,1]/%x[1,1]
# class
U10:%x[-2,2]
U11:%x[-1,2]
U12:%x[0,2]
U13:%x[1,2]
U14:%x[2,2]
U15:%x[-2,2]/%x[-1,2]
U16:%x[-1,2]/%x[0,2]
U17:%x[0,2]/%x[1,2]
U18:%x[1,2]/%x[2,2]
# cas
U20:%x[-2,3]
U21:%x[-1,3]
U22:%x[0,3]
U23:%x[1,3]
U24:%x[2,3]
# gnd
U30:%x[-2,4]
U31:%x[-1,4]
U32:%x[0,4]
U33:%x[1,4]
U34:%x[2,4]
# nmb
U40:%x[-2,5]
U41:%x[-1,5]
U42:%x[0,5]
U43:%x[1,5]
U44:%x[2,5]
# agr
U50:%x[-1,6] # agr(0,2) -> agr(-1,0)
U51:%x[0,6] # agr(0,2)
U52:%x[-1,7] # agr..(-1,2) -> agr(-2,0)
U53:%x[0,7] # (-1,2)
U54:%x[1,7] # ... -> (0,3)
# regex feats
U61:%x[0,8]/%x[0,9]
# Bigram
B
# Unigram
# orth
U00:%x[-2,1]
U01:%x[-1,1]
U02:%x[0,1]
U03:%x[1,1]
U04:%x[2,1]
U05:%x[-1,0]
U06:%x[0,0]
U07:%x[1,0]
U08:%x[-1,1]/%x[0,1]
U09:%x[0,1]/%x[1,1]
# class
U10:%x[-2,2]
U11:%x[-1,2]
U12:%x[0,2]
U13:%x[1,2]
U14:%x[2,2]
U15:%x[-2,2]/%x[-1,2]
U16:%x[-1,2]/%x[0,2]
U17:%x[0,2]/%x[1,2]
U18:%x[1,2]/%x[2,2]
# cas
U20:%x[-2,3]
U21:%x[-1,3]
U22:%x[0,3]
U23:%x[1,3]
U24:%x[2,3]
# gnd
U30:%x[-2,4]
U31:%x[-1,4]
U32:%x[0,4]
U33:%x[1,4]
U34:%x[2,4]
# nmb
U40:%x[-2,5]
U41:%x[-1,5]
U42:%x[0,5]
U43:%x[1,5]
U44:%x[2,5]
# agr
U50:%x[-1,6] # agr(0,2) -> agr(-1,0)
U51:%x[0,6] # agr(0,2)
U52:%x[-1,7] # agr..(-1,2) -> agr(-2,0)
U53:%x[0,7] # (-1,2)
U54:%x[1,7] # ... -> (0,3)
# regex feats
U61:%x[0,8]/%x[0,9]
# Bigram
B
@ "default" (
affix(lower(orth[0]), 3); // 0
affix(lower(orth[0]), -3); // 1
class[0]; // 2
cas[0]; // 3
gnd[0]; // 4
nmb[0]; // 5
agrpp(0,1,{nmb,gnd,cas}); // 6
and(inside(-1), inside(1), wagr(-1,1,{nmb,gnd,cas})); // 7
regex(orth[0], "\\P{Ll}.*"); regex(orth[0], "\\P{Lu}.*") // 8, 9
)
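The two `regex` features in the WCCL block above test the first character of the word form: `\P{Ll}.*` holds when it is not a lowercase letter and `\P{Lu}.*` when it is not an uppercase letter. A rough Python equivalent is sketched below. Python's `re` module has no `\P{...}` classes, so the sketch uses `unicodedata` instead; it also assumes that the WCCL regex must match the whole word form and that `Ll` and `Lu` are the Unicode general categories.

```python
import unicodedata


def first_char_not_in(orth, category):
    """True if orth is non-empty and its first character is outside `category`."""
    return bool(orth) and unicodedata.category(orth[0]) != category


def regex_features(orth):
    """Approximate the pair of regex features (not-lowercase, not-uppercase)."""
    return (first_char_not_in(orth, "Ll"), first_char_not_in(orth, "Lu"))


for word in ["Wrocław", "kota", "2014", "Łódź"]:
    print(word, regex_features(word))
```

Taken together, as in the `U61` template that joins both columns, the pair distinguishes capitalised words, lowercase words, and tokens that start with neither kind of letter, such as numbers or punctuation.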
; NKJP tagset with unknown word treatment, reduced feature set.
; Got rid of agreement features.
; For layers other than CLASS,nmb,gnd,cas reduced context to 3
;
[general]
tagset = nkjp
; all the attrs
attrs = CLASS,nmb,cas,gnd,asp
; acm,dot could be useful for uknown
macacfg = morfeusz-nkjp-official
defaultmodel = model_nkjp10_wcrft_e2
[lexicon]
; currently lexicon itself is not used, but unk tag list is
casesens = no
minfreq = 10
maxentries = 500
[lemmatiser]
; if lemmatiser outputs a lemma not present in morpho analysis
; --- should the lemma be ignored (forcelemma = no)
; or used to overwrite lemmas of each possible interpretation (yes)
forcelemma = yes
[crf]
params = -a CRF-L2 -f5
[unknown]
guess = yes
unktagfreq = 1
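The configuration above is an INI-style file, so its settings can be inspected with Python's standard `configparser`. The sketch below parses an excerpt of the file shown above and extracts the tier order and the CRF parameters. It is only an illustration: the tagger in this branch is C++ and its own parser may treat details differently, and `load_config` is a hypothetical helper name.

```python
import configparser

# Excerpt of the config shown above.
SAMPLE = """\
[general]
tagset = nkjp
; all the attrs
attrs = CLASS,nmb,cas,gnd,asp
macacfg = morfeusz-nkjp-official
defaultmodel = model_nkjp10_wcrft_e2
[crf]
params = -a CRF-L2 -f5
[unknown]
guess = yes
unktagfreq = 1
"""


def load_config(text):
    """Return a few settings of interest from an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    attrs = parser.get("general", "attrs")
    return {
        # Tier order: CLASS first, then the remaining attributes.
        "attrs": [a.strip() for a in attrs.split(",") if a.strip()],
        "crf_params": parser.get("crf", "params").split(),
        "guess_unknown": parser.getboolean("unknown", "guess"),
    }


cfg = load_config(SAMPLE)
print(cfg)
```

Lines starting with `;` are treated as comments by `configparser`, which matches how they are used in this file.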
@@ -6,8 +6,9 @@
 [general]
 tagset = nkjp
 ; all the attrs
-attrs = nmb,cas,gnd,per,deg,asp,ngt,acm,acn,ppr,agg,vcl,dot
+attrs = CLASS,nmb,cas,gnd,per,deg,asp,ngt,acm,acn,ppr,agg,vcl,dot
 macacfg = morfeusz-nkjp-official
+defaultmodel = model_nkjp10_wcrft_s2
 [lexicon]
 ; currently lexicon itself is not used, but unk tag list is
......
@@ -6,7 +6,7 @@
 [general]
 tagset = nkjp
 ; all the attrs
-attrs =
+attrs = CLASS
 macacfg = morfeusz-nkjp-official
 [lexicon]
......
# Unigram
# orth
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# class
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
# cas
U20:%x[-2,2]
U21:%x[-1,2]
U22:%x[0,2]
U23:%x[1,2]
U24:%x[2,2]
# gnd
U30:%x[-2,3]
U31:%x[-1,3]
U32:%x[0,3]
U33:%x[1,3]
U34:%x[2,3]
# nmb
U40:%x[-2,4]
U41:%x[-1,4]
U42:%x[0,4]
U43:%x[1,4]
U44:%x[2,4]
# regex feats
#U60:%x[-1,7]/%x[-1,8]
U61:%x[0,7]/%x[0,8]
#U62:%x[1,7]/%x[1,8]
# Bigram
B
# Unigram
# orth
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# class
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
# cas
U21:%x[-1,2]
U22:%x[0,2]
U23:%x[1,2]
# gnd
U32:%x[0,3]
# nmb
U41:%x[-1,4]
U42:%x[0,4]
U43:%x[1,4]
# regex feats
U61:%x[0,7]/%x[0,8]
# Bigram
B
# Unigram
# orth
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
# class
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
U15:%x[-2,1]/%x[-1,1]
U16:%x[-1,1]/%x[0,1]
U17:%x[0,1]/%x[1,1]
U18:%x[1,1]/%x[2,1]
# cas
U20:%x[-2,2]
U21:%x[-1,2]
U22:%x[0,2]
U23:%x[1,2]
U24:%x[2,2]
# gnd
U30:%x[-2,3]
U31:%x[-1,3]
U32:%x[0,3]
U33:%x[1,3]
U34:%x[2,3]
# nmb
U40:%x[-2,4]
U41:%x[-1,4]
U42:%x[0,4]
U43:%x[1,4]
U44:%x[2,4]
# agr
U50:%x[-1,5] # agr(0,1) -> agr(-1,0)
U51:%x[0,5] # agr(0,1)
U52:%x[-1,6] # agr..(-1,1) -> agr(-2,0)
U53:%x[0,6] # (-1,1)
U54:%x[1,6] # ... -> (0,2)
# regex feats
#U60:%x[-1,7]/%x[-1,8]
U61:%x[0,7]/%x[0,8]
#U62:%x[1,7]/%x[1,8]
# wordclass trigrams
#U80:%x[-2,1]/%x[-1,1]/%x[0,1]
#U81:%x[-1,1]/%x[0,1]/%x[1,1]
#U82:%x[0,1]/%x[1,1]/%x[2,1]
# Bigram
B
# Unigram
# orth
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
# class
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
U15:%x[-2,1]/%x[-1,1]
U16:%x[-1,1]/%x[0,1]
U17:%x[0,1]/%x[1,1]
U18:%x[1,1]/%x[2,1]
# cas
U20:%x[-2,2]
U21:%x[-1,2]
U22:%x[0,2]
U23:%x[1,2]
U24:%x[2,2]
# gnd
U30:%x[-2,3]
U31:%x[-1,3]
U32:%x[0,3]
U33:%x[1,3]
U34:%x[2,3]
# nmb
U40:%x[-2,4]
U41:%x[-1,4]
U42:%x[0,4]
U43:%x[1,4]
U44:%x[2,4]
# agr
U50:%x[-1,5] # agr(0,1) -> agr(-1,0)
U51:%x[0,5] # agr(0,1)
U52:%x[-1,6] # agr..(-1,1) -> agr(-2,0)
U53:%x[0,6] # (-1,1)
U54:%x[1,6] # ... -> (0,2)
# regex feats
#U60:%x[-1,7]/%x[-1,8]
U61:%x[0,7]/%x[0,8]
#U62:%x[1,7]/%x[1,8]
# wordclass trigrams
#U80:%x[-2,1]/%x[-1,1]/%x[0,1]
#U81:%x[-1,1]/%x[0,1]/%x[1,1]
#U82:%x[0,1]/%x[1,1]/%x[2,1]
# Bigram
B
# Unigram
# orth
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
# class
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
U15:%x[-2,1]/%x[-1,1]
U16:%x[-1,1]/%x[0,1]
U17:%x[0,1]/%x[1,1]
U18:%x[1,1]/%x[2,1]
# cas
U20:%x[-2,2]
U21:%x[-1,2]
U22:%x[0,2]
U23:%x[1,2]
U24:%x[2,2]
# gnd
U30:%x[-2,3]
U31:%x[-1,3]
U32:%x[0,3]
U33:%x[1,3]
U34:%x[2,3]
# nmb
U40:%x[-2,4]
U41:%x[-1,4]
U42:%x[0,4]
U43:%x[1,4]
U44:%x[2,4]
# agr
U50:%x[-1,5] # agr(0,1) -> agr(-1,0)
U51:%x[0,5] # agr(0,1)
U52:%x[-1,6] # agr..(-1,1) -> agr(-2,0)
U53:%x[0,6] # (-1,1)
U54:%x[1,6] # ... -> (0,2)
# regex feats
#U60:%x[-1,7]/%x[-1,8]
U61:%x[0,7]/%x[0,8]
#U62:%x[1,7]/%x[1,8]
# wordclass trigrams
#U80:%x[-2,1]/%x[-1,1]/%x[0,1]
#U81:%x[-1,1]/%x[0,1]/%x[1,1]
#U82:%x[0,1]/%x[1,1]/%x[2,1]
# Bigram
B
@ "default" (
orth[0]; // 0
class[0]; // 1
cas[0]; // 2
gnd[0]; // 3
nmb[0]; // 4
agrpp(0,1,{nmb,gnd,cas}); // 5
and(inside(-1), inside(1), wagr(-1,1,{nmb,gnd,cas})); // 6
regex(orth[0], "\\P{Ll}.*"); regex(orth[0], "\\P{Lu}.*") // 7, 8
)
; NKJP tagset with unknown word treatment, reduced feature set.
; Generates quite small models and works almost as accurately
; as nkjp_s2.
[general]
tagset = nkjp
; all the attrs
attrs = CLASS,nmb,cas,gnd,asp
; acm,dot could be useful for uknown
macacfg = morfeusz-nkjp-official
defaultmodel = model_nkjp10_wcrft_s6
[lexicon]
; currently lexicon itself is not used, but unk tag list is
casesens = no
minfreq = 10
maxentries = 500
[lemmatiser]
; if lemmatiser outputs a lemma not present in morpho analysis
; --- should the lemma be ignored (forcelemma = no)
; or used to overwrite lemmas of each possible interpretation (yes)
forcelemma = yes
[crf]
params = -a CRF-L2 -f5
[unknown]
guess = yes
unktagfreq = 1
Model trained on full NKJP 1.0.
To be used with nkjp_e2.ini config.
Trained with WCRFT 0.9.5, 1 April 2014.
time wcrft --train nkjp_e2.ini -d model_nkjp10_wcrft_e2/ nkjp10-merged-ng-rea.xml -v
real 262m53.201s
user 2475m4.391s
sys 0m45.750s
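The `time` figures above can be turned into more readable numbers with a few lines of Python. The sketch converts the `real` and `user` stamps to seconds and reports the wall-clock duration in hours together with the user/real ratio; `to_seconds` is a hypothetical helper written for this note.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '262m53.201s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError("unexpected time format: %r" % stamp)
    return int(match.group(1)) * 60 + float(match.group(2))


real = to_seconds("262m53.201s")
user = to_seconds("2475m4.391s")
hours = round(real / 3600, 2)
ratio = round(user / real, 1)
print(hours, ratio)
```

Training took roughly 4.4 hours of wall-clock time, and the user/real ratio of about 9.4 indicates that the work was spread over several CPU cores.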