Overview
Package that facilitates using corpus2mwe to annotate CCL documents with WCCL dictionaries. Supports running with many WCCL dicts and/or many CCL documents.
This tool can be used as a standalone tool (wccl_annotator) or as a module.
Filtering annotations
Available filters
The tool can erase document annotations based on specified criteria. Criteria are specified with the following filters:
- annotations_exclusions: For tokens described by two or more annotations, erases the additional annotations based on the exclusion dependency graph.
- capital_letter_consistency: Erases annotations whose base lemma starts with an uppercase letter when neither the lemma (orth) nor the lexeme starts with a capital letter.
- exact_match_terms_list: Erases annotations whose base lemma is on the given list when the lemma (original document form) does not match the annotation base lemma.
- exact_match_by_category: Erases annotations of the specified names whose lemma (original document form) does not match the annotation base lemma; conceptually similar to exact_match_terms_list.
- retain_single_occurence: If the document contains more than one annotation of a given type, keeps the first and erases the others.
- token_stoplist: Applies a stoplist to annotations: erases single-word annotations whose base lemma is on the stoplist. By default, this filter is case-insensitive. Does not erase multiword annotations that contain words from the stoplist.
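The token_stoplist behaviour described above can be illustrated with a minimal sketch. This is not the package's implementation: the `Annotation` class, the `apply_token_stoplist` function, and the `case_sensitive` parameter are hypothetical names introduced only for this example, and a real filter would operate on CCL document objects rather than plain Python lists.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Annotation:
    """Hypothetical stand-in for one annotation in a CCL document."""
    name: str          # annotation name
    base_lemma: str    # annotation base lemma
    orths: List[str]   # original document forms of the covered tokens


def apply_token_stoplist(
    annotations: Iterable[Annotation],
    stoplist: Iterable[str],
    case_sensitive: bool = False,  # assumption: comparison is case-insensitive by default
) -> List[Annotation]:
    """Return the annotations that survive the stoplist.

    Single-word annotations whose base lemma is on the stoplist are dropped.
    Multiword annotations are always kept, even if they contain stoplisted words.
    """
    def norm(text: str) -> str:
        return text if case_sensitive else text.lower()

    blocked = {norm(word) for word in stoplist}
    return [
        ann for ann in annotations
        if len(ann.orths) > 1 or norm(ann.base_lemma) not in blocked
    ]


annotations = [
    Annotation("term", "Data", ["data"]),
    Annotation("term", "data base", ["data", "base"]),
    Annotation("term", "corpus", ["corpus"]),
]
kept = apply_token_stoplist(annotations, ["data"])
print([ann.base_lemma for ann in kept])
```

In this sketch the single-word annotation with base lemma "Data" is removed because the comparison ignores case, while the multiword annotation "data base" is kept.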
Description of filters config
The config is specified in a JSON file.
- enabled_filters: list of enabled filters (actually a list of lists, where the first element is the filter name and the rest are filter arguments). The options below apply to specific filters, and the names of those filters must appear in this list. Filters not on this list are not applied. In addition to the filter arguments, the entire filters config is passed to the filter function.
- annotations_exclusion: dict specifying which annotations should be discarded (if present) when the token(s) are already annotated with a given annotation.
- exact_match_terms_list: list of dict terms that are accepted only in the provided original text form (the text lemma (orth) is taken into consideration).
- exact_match_terms_list_letter_case: same as exact_match_terms_list, with an additional match on letter case; useful for abbreviations.
- lemma_stoplist: list of lemmas to exclude when annotated as a single-word annotation.
- ...
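As an illustration of the structure described above, the following sketch builds a filters config in Python and serializes it to JSON. Only the key names come from this README; every value (annotation names, terms, lemmas), the exact shape of the annotations_exclusion dict, and the argument passed to exact_match_by_category are assumptions made for the example, as is the helper `enabled_filter_names`.

```python
import json

# Illustrative values only: annotation names, terms, and lemmas are made up.
filters_config = {
    # List of lists: first element is the filter name, the rest are its arguments.
    "enabled_filters": [
        ["token_stoplist"],
        ["exact_match_terms_list"],
        ["exact_match_by_category", "example_annotation"],  # argument is an assumption
    ],
    # Assumed shape: annotation already present -> annotations to discard.
    "annotations_exclusion": {
        "example_annotation": ["other_annotation"],
    },
    # Terms accepted only in their original text form (assumed to be strings).
    "exact_match_terms_list": ["example term"],
    # Same as above, but letter case must match as well.
    "exact_match_terms_list_letter_case": ["NLP"],
    # Lemmas excluded when annotated as a single-word annotation.
    "lemma_stoplist": ["example"],
}


def enabled_filter_names(config: dict) -> list:
    """Hypothetical helper: return filter names in the order they are listed."""
    return [entry[0] for entry in config["enabled_filters"]]


config_text = json.dumps(filters_config, indent=2, ensure_ascii=False)
print(config_text)
print(enabled_filter_names(filters_config))
```

The resulting text could be saved to a file and used as the filters config; the way that file is passed to the tool is not covered by this sketch.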
Usage
As a command-line tool
usage: wccl_annotator [-h] -c CCL_FILE [-o OUT_FILE] [-m] [-t TAGSET]
                      [-a ANNOTATION] [-d WCCL_DICT] [-D WCCL_DICTS_LIST] [-b]
                      [-v] [-s SEPARATOR]

optional arguments:
  -h, --help            show this help message and exit
  -c CCL_FILE, --ccl CCL_FILE
                        CCL file or text file with list of paths to CCL files
                        (for batch mode)
  -o OUT_FILE, --output OUT_FILE
                        Required when processing single document. If used with
                        '--batch-mode', then list of output files will be
                        stored under given path.
  -m, --mwe_merged
  -t TAGSET, --tagset TAGSET
  -a ANNOTATION, --annotation ANNOTATION
                        Name of annotation to set
  -d WCCL_DICT, --dict WCCL_DICT
                        WCCL dict with terms to annotate
  -D WCCL_DICTS_LIST, --dicts-list WCCL_DICTS_LIST
                        Tabular file with annotations and paths to related
                        WCCL dicts to use. Use "--separator" to specify
                        separator in this file.
  -b, --batch-mode      If enabled, then input file is treated as list of ccl
                        files. If output path is present, then list of created
                        files will be stored there. Processed files will have
                        '.mwe' suffix added.
  -v, --verbose
  -s SEPARATOR, --separator SEPARATOR
                        Only applicable, when using "--dicts-list". Specifies
                        separator.
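The batch-mode options above can be combined as in the following sketch, which prepares a dicts-list file and assembles the argument list for a batch run. The flags are the ones listed in the help text; the helper names `write_dicts_list` and `build_batch_command`, the file paths, and the column order in the tabular file (annotation name first, dict path second) are assumptions made for illustration. The command is only built here, not executed.

```python
import tempfile
from pathlib import Path


def write_dicts_list(path, ann_to_dict, separator="\t"):
    """Write one 'annotation<separator>dict path' row per line.

    The column order is an assumption; check it against your setup.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for annotation, dict_path in ann_to_dict.items():
            handle.write(f"{annotation}{separator}{dict_path}\n")


def build_batch_command(ccl_list, out_list, dicts_list, separator="\t"):
    """Assemble the wccl_annotator argument list for batch mode."""
    return [
        "wccl_annotator",
        "-c", str(ccl_list),    # text file with a list of paths to CCL files
        "-b",                   # treat the input file as a list
        "-o", str(out_list),    # list of created files is stored here
        "-D", str(dicts_list),  # tabular file: annotations and WCCL dict paths
        "-s", separator,        # separator used in the dicts-list file
    ]


with tempfile.TemporaryDirectory() as tmp:
    dicts_list = Path(tmp) / "dicts_list.tsv"
    # Hypothetical annotation names and dict paths.
    write_dicts_list(dicts_list, {"example_annotation": "dicts/example_dict.xml"})
    command = build_batch_command("ccl_files.txt", "created_files.txt", dicts_list)
    print(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where wccl_annotator is installed.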
As a Python module
from wccl_annotator.wccl_annotator import WcclAnnotator
annotator = WcclAnnotator()
...
annotator.process(input_file, output_file, ann_2_wccl_dict=selected_dicts_set)
Installation
This package is installed together with corpus2mwe; no additional action is required.
Tests
You can run manual tests (they require verifying the console output) by calling the script corpus2mwe/src/cclmwe/tests/custom_annotations/test.sh, which uses a prepared container to run the tool for different cases.