wccl-annotator-standalone

  • Overview

    A package that simplifies using corpus2mwe to annotate CCL documents with WCCL dictionaries. It supports running with multiple WCCL dictionaries and/or multiple CCL documents. It can be used as a standalone command-line tool (wccl_annotator) or as a Python module.

    Filtering annotations

    Available filters

    The tool can erase document annotations based on specified criteria. The criteria are expressed with the following filters:

    • annotations_exclusions: Erases additional annotations from tokens that carry two or more annotations, based on an exclusion dependency graph.
    • capital_letter_consistency: Erases annotations whose base lemma starts with an uppercase letter when neither the lemma (orth) nor the lexeme starts with a capital letter.
    • exact_match_terms_list: Erases annotations whose base lemma is on the given list when the lemma (original document form) does not match the annotation base lemma.
    • exact_match_by_category: Erases annotations of the specified names whose lemma (original document form) does not match the annotation base lemma; conceptually similar to exact_match_terms_list.
    • retain_single_occurence: If the document contains more than one annotation of a given type, keeps the first one and erases the others.
    • token_stoplist: Applies a stoplist to annotations: erases single-word annotations whose base lemma is on the stoplist. Matching is case-insensitive by default. Multiword annotations are not erased, even if they contain words from the stoplist.
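
    The token_stoplist rule can be illustrated with a minimal sketch. This is not the package's implementation: the Annotation class, its fields and the apply_token_stoplist function are hypothetical names introduced only to show the described behaviour (single-word annotations are dropped, multiword ones are kept, matching is case-insensitive by default).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Annotation:
    """Hypothetical stand-in for a document annotation."""
    name: str          # annotation type
    base_lemma: str    # base lemma taken from the dictionary
    tokens: List[str]  # original document forms (orth) covered by the annotation


def apply_token_stoplist(annotations: Iterable[Annotation],
                         stoplist: Iterable[str],
                         case_sensitive: bool = False) -> List[Annotation]:
    """Return the annotations that survive the stoplist."""
    normalize = (lambda s: s) if case_sensitive else str.lower
    blocked = {normalize(word) for word in stoplist}
    kept = []
    for ann in annotations:
        is_single_word = len(ann.tokens) == 1
        # Only single-word annotations are erased; multiword ones are kept
        # even when one of their words is on the stoplist.
        if is_single_word and normalize(ann.base_lemma) in blocked:
            continue
        kept.append(ann)
    return kept
```

    With a stoplist of ["bank"], a single-word annotation with base lemma "Bank" would be erased under the default settings, while a two-word annotation such as "bank centralny" would be kept.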

    Description of filters config

    The config is specified in a JSON file.

    1. enabled_filters: list of enabled filters. Each entry is itself a list: the first element is the filter name and the remaining elements are the filter's arguments. The options below apply to individual filters and take effect only if the corresponding filter is named in this list; filters not on this list are not applied. In addition to its arguments, each filter function receives the entire filters config.
    2. annotations_exclusion: dict specifying which annotations should be discarded (if present) when the token(s) are already annotated with a given annotation.
    3. exact_match_terms_list: list of dictionary terms that are accepted only in the provided original text form (the text lemma (orth) is compared).
    4. exact_match_terms_list_letter_case: same as exact_match_terms_list, but the match is also letter-case sensitive; useful for abbreviations.
    5. lemma_stoplist: list of lemmas to exclude when annotated as a single-word annotation.
    6. ...
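
    As a rough illustration of the layout described above, the sketch below builds such a config in Python and serializes it to JSON. Only the key names and the list-of-lists shape of enabled_filters come from the description; every value, the shape of the annotations_exclusion dict, the argument given to exact_match_by_category and the helper dump_filters_config are assumptions made for the example. Filter and key names are spelled as they appear in the lists above.

```python
import json

# All values are hypothetical placeholders.
filters_config = {
    "enabled_filters": [
        ["annotations_exclusions"],
        ["exact_match_terms_list"],
        # Assumed: the arguments of this filter are annotation names.
        ["exact_match_by_category", "category_a"],
        ["token_stoplist"],
    ],
    # Assumed shape: annotation already present -> annotations to discard.
    "annotations_exclusion": {"category_a": ["category_b"]},
    "exact_match_terms_list": ["example term"],
    "exact_match_terms_list_letter_case": ["ABC"],
    "lemma_stoplist": ["example"],
}


def dump_filters_config(config: dict) -> str:
    """Serialize the config to the JSON text that would be saved to a file."""
    return json.dumps(config, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(dump_filters_config(filters_config))
```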

    Usage

    As command line tool

    usage: wccl_annotator [-h] -c CCL_FILE [-o OUT_FILE] [-m] [-t TAGSET]
                          [-a ANNOTATION] [-d WCCL_DICT] [-D WCCL_DICTS_LIST] [-b]
                          [-v] [-s SEPARATOR]
    
    optional arguments:
      -h, --help            show this help message and exit
      -c CCL_FILE, --ccl CCL_FILE
                            CCL file or text file with list of paths to CCL files
                            (for batch mode)
      -o OUT_FILE, --output OUT_FILE
                            Required when processing single document. If used with
                            '--batch-mode', then list of output files will be
                            stored under given path.
      -m, --mwe_merged
      -t TAGSET, --tagset TAGSET
      -a ANNOTATION, --annotation ANNOTATION
                            Name of annotation to set
      -d WCCL_DICT, --dict WCCL_DICT
                            WCCL dict with terms to annotate
      -D WCCL_DICTS_LIST, --dicts-list WCCL_DICTS_LIST
                            Tabular file with annotations and paths to related
                            WCCL dicts to use. Use "--separator" to specify
                            separator in this file.
      -b, --batch-mode      If enabled, then input file is treated as list of ccl
                            files. If output path is present, then list of created
                            files will be stored there. Processed files will have
                            '.mwe' suffix added.
      -v, --verbose
      -s SEPARATOR, --separator SEPARATOR
                            Only applicable, when using "--dicts-list". Specifies
                            separator.
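
    A batch run with several dictionaries needs two helper files: the list of CCL paths and the tabular dicts list. The sketch below prepares both and assembles the argument list from the options shown in the help text. It rests on assumptions that the help text does not confirm: the CCL list is taken to hold one path per line, and each dicts-list row is taken to be the annotation name followed by the dictionary path. The function build_batch_command and all file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def build_batch_command(ccl_paths: List[str], ann_to_dict: Dict[str, str],
                        workdir: Path, separator: str = "\t") -> List[str]:
    """Write the two helper files and return the wccl_annotator argument list."""
    # Assumption: the batch list holds one CCL path per line.
    ccl_list = workdir / "ccl_files.txt"
    ccl_list.write_text("\n".join(ccl_paths) + "\n", encoding="utf-8")

    # Assumption: each row is "<annotation><separator><path to WCCL dict>".
    dicts_list = workdir / "dicts.tsv"
    rows = [f"{ann}{separator}{path}" for ann, path in ann_to_dict.items()]
    dicts_list.write_text("\n".join(rows) + "\n", encoding="utf-8")

    return [
        "wccl_annotator",
        "--ccl", str(ccl_list),
        "--batch-mode",
        # In batch mode the list of created files is stored under this path.
        "--output", str(workdir / "processed_files.txt"),
        "--dicts-list", str(dicts_list),
        "--separator", separator,
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cmd = build_batch_command(
            ["docs/a.xml", "docs/b.xml"],   # hypothetical CCL documents
            {"term": "dicts/terms.xml"},    # hypothetical annotation -> dict path
            Path(tmp),
        )
        print(cmd)
```

    Where the tool is installed, the returned list could be handed to subprocess.run(cmd, check=True); the sketch only prints it.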

    As python module

    from wccl_annotator.wccl_annotator import WcclAnnotator
    annotator = WcclAnnotator()
    ...
    annotator.process(input_file, output_file, ann_2_wccl_dict=selected_dicts_set)

    Installation

    This package is installed together with corpus2mwe; no additional action is required.

    Tests

    Manual tests (which require checking the console output by hand) can be run with the script corpus2mwe/src/cclmwe/tests/custom_annotations/test.sh, which runs the different test cases in a prepared container.