Konrad Wojtasik authored
c96f85c5

Easymatcher worker

Worker that matches dictionary phrases in text using the easymatcher tool: https://gitlab.clarin-pl.eu/knowledge-extraction/tools/easymatcher

task_options

labels_path: str - path to a JSON file with labels, given under the 'labels_path' key,
n_workers: Optional[int] - number of workers used in the matcher,
sim_threshold: Optional[float] - cosine similarity threshold used in the matcher.
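
As a rough illustration, the options above could be put together and type-checked as in the sketch below. The helper name check_task_options and all values (the path, 4 workers, a 0.9 threshold) are assumptions made for this example; they are not defaults of the worker or of easymatcher.

```python
import json
from typing import Any, Dict


def check_task_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: check the option types listed in this README."""
    if not isinstance(options.get("labels_path"), str):
        raise ValueError("labels_path is required and must be a string")

    n_workers = options.get("n_workers")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if n_workers is not None and (
        isinstance(n_workers, bool) or not isinstance(n_workers, int)
    ):
        raise ValueError("n_workers must be an integer when given")

    sim_threshold = options.get("sim_threshold")
    if sim_threshold is not None and (
        isinstance(sim_threshold, bool)
        or not isinstance(sim_threshold, (int, float))
    ):
        raise ValueError("sim_threshold must be a number when given")

    return options


task_options = check_task_options(
    {
        "labels_path": "labels.json",  # illustrative path, not a real default
        "n_workers": 4,  # illustrative value
        "sim_threshold": 0.9,  # illustrative value
    }
)
print(json.dumps(task_options, indent=4))
```

Only labels_path is mandatory here; the two Optional options may be left out.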

Input file structure

input_path should point to a file or folder with text documents, from which the worker prepares JSONL data in the following format:

{"text": "Example text1"}
{"text": "Example text2"}
...
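
The conversion to this format is done by the worker itself. The sketch below only illustrates what such a step might look like, assuming one document per file and UTF-8 encoding; the function names iter_documents and to_jsonl are hypothetical and do not come from the worker's code.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(input_path: str) -> Iterator[str]:
    """Yield the text of a single file, or of every file in a folder."""
    path = Path(input_path)
    if path.is_dir():
        files = sorted(p for p in path.rglob("*") if p.is_file())
    else:
        files = [path]
    for file in files:
        # Assumption: each file holds exactly one UTF-8 encoded document.
        yield file.read_text(encoding="utf-8")


def to_jsonl(input_path: str, output_path: str) -> int:
    """Write one {"text": ...} object per line and return the line count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for text in iter_documents(input_path):
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The output file should be written outside the input folder, otherwise it would be picked up as one of the documents.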

Labels file structure

labels_path should point to a JSON file with the following structure:

{
    "labels": {
        "Example label1": ["example1", "example2"],
        "Example label2": ["example1", "example2"],
        ...
        }
}
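
A labels file can be checked against this structure before it is passed to the worker. The snippet below is a minimal sketch of such a check; parse_labels is a hypothetical helper written for this example and is not part of the worker or of easymatcher.

```python
import json
from typing import Dict, List


def parse_labels(raw: str) -> Dict[str, List[str]]:
    """Parse labels JSON and check that every label maps to a list of strings."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("labels"), dict):
        raise ValueError('expected a top-level "labels" object')
    labels = data["labels"]
    for name, phrases in labels.items():
        if not isinstance(phrases, list) or not all(
            isinstance(phrase, str) for phrase in phrases
        ):
            raise ValueError(f"label {name!r} must map to a list of strings")
    return labels


# Same shape as the structure shown above, without the elided entries.
example = json.dumps(
    {
        "labels": {
            "Example label1": ["example1", "example2"],
            "Example label2": ["example1", "example2"],
        }
    },
    indent=4,
)
labels = parse_labels(example)
print(sorted(labels))
```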

Output file structure

In the output file, every text gets a list of labels, each with the start and end indices of the detected label in the document.

{"text": "Example text1", "label": [(1, 9, "label1"), (21, 27, "label2")]}
{"text": "Example text2", "label": [(4, 14, "label42")]}
...
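
The sample lines above use parentheses for the label triples, which is Python tuple syntax rather than strict JSON. Assuming the file is written exactly as shown, each line can be read with ast.literal_eval from the standard library; the same call also accepts square brackets, in case the actual output uses JSON arrays. The helper name parse_output_line is hypothetical.

```python
import ast
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), as in the sample output


def parse_output_line(line: str) -> Tuple[str, List[Span]]:
    """Parse one output line into its text and a list of label spans."""
    # literal_eval only evaluates Python literals, so it is safe on file input.
    record = ast.literal_eval(line.strip())
    spans = [
        (int(start), int(end), str(label))
        for start, end, label in record["label"]
    ]
    return record["text"], spans


line = '{"text": "Example text1", "label": [(1, 9, "label1"), (21, 27, "label2")]}'
text, spans = parse_output_line(line)
for start, end, label in spans:
    print(f"{label}: {start}-{end}")
```

This README does not state whether the end index is inclusive, so the sketch does not slice the text with it.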

Blacklist words added to the labels file are not compatible with the current version of easymatcher.
The only matcher currently available in this tool is MatrixMatcher, which can be found under the link at the top of this README.