Easymatcher worker
Worker that matches dictionary phrases in text using the easymatcher tool - https://gitlab.clarin-pl.eu/knowledge-extraction/tools/easymatcher
task_options
labels_path
: str - path to the JSON file with labels under the 'labels_path' key,
n_workers
: Optional[int] - number of workers used in the matcher,
sim_threshold
: Optional[float] - cosine similarity threshold used in the matcher.
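As a rough illustration of how these options fit together, the sketch below builds a task_options object in Python and serializes it to JSON. Only the three key names come from this README; the file name and the numeric values are made-up placeholders, and how the options are actually passed to the worker is not covered here.

```python
import json

# Hypothetical values; only the key names are taken from this README.
task_options = {
    "labels_path": "labels.json",  # assumed file name
    "n_workers": 4,                # optional, placeholder value
    "sim_threshold": 0.9,          # optional, placeholder value
}

# Serialize the options, e.g. to embed them in a task request.
task_options_json = json.dumps(task_options)
print(task_options_json)
```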
Input file structure
input_path
should point to a file or folder with text documents, from which
the worker prepares JSONL data in the following format:
{"text": "Example text1"}
{"text": "Example text2"}
...
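To make the JSONL format concrete, here is a minimal sketch of how a file or folder of documents could be turned into such records. It is not the worker's own code: the function name texts_to_jsonl, the .txt extension filter, and the UTF-8 encoding are all assumptions made for this example.

```python
import json
from pathlib import Path


def texts_to_jsonl(input_path, output_path):
    """Write one {"text": ...} JSON object per line for each input document."""
    src = Path(input_path)
    # Assumption: documents in a folder are plain .txt files.
    files = sorted(src.glob("*.txt")) if src.is_dir() else [src]
    with open(output_path, "w", encoding="utf-8") as out:
        for path in files:
            text = path.read_text(encoding="utf-8")
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return len(files)
```

Calling texts_to_jsonl("docs", "input.jsonl") on a folder with two .txt files would produce a two-line file, one JSON object per document.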
Labels file structure
labels_path
should point to a JSON file with the following structure:
{
"labels": {
"Example label1": ["example1", "example2"],
"Example label2": ["example1", "example2"],
...
}
}
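Before submitting a task, it may help to check that a labels file has this shape. The sketch below is a hypothetical helper (validate_labels is not part of easymatcher or the worker) that only verifies the structure shown above: a top-level "labels" object mapping each label name to a list of phrase strings.

```python
import json


def validate_labels(data):
    """Return the label mapping if `data` has the expected structure."""
    labels = data.get("labels") if isinstance(data, dict) else None
    if not isinstance(labels, dict):
        raise ValueError('expected a top-level "labels" object')
    for name, phrases in labels.items():
        is_str_list = isinstance(phrases, list) and all(
            isinstance(p, str) for p in phrases
        )
        if not is_str_list:
            raise ValueError(f"label {name!r} must map to a list of strings")
    return labels


# Example content mirroring the structure above.
raw = '{"labels": {"Example label1": ["example1", "example2"]}}'
labels = validate_labels(json.loads(raw))
print(sorted(labels))
```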
Output file structure
In the output file, every text gets a list of labels, each with the start and end indices of the detected label in the document.
{"text": "Example text1", "label": [(1, 9, "label1"), (21, 27, "label2")]}
{"text": "Example text2", "label": [(4, 14, "label42")]}
...
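The indices can be used to recover the matched fragments from the text. The sketch below assumes that the indices are character offsets with an exclusive end, as in Python slicing; the README does not state this, so verify it against real output. The function name matched_spans and the sample record are invented for this example.

```python
def matched_spans(record):
    """Pair each label with the text fragment its indices point to."""
    text = record["text"]
    # Assumption: (start, end, label) with character offsets, end exclusive.
    return [(label, text[start:end]) for start, end, label in record["label"]]


# Hypothetical record following the output structure above.
record = {
    "text": "Warsaw is in Poland",
    "label": [(0, 6, "city"), (13, 19, "country")],
}
print(matched_spans(record))
```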
Currently, blacklist words added to the labels file are not compatible with the current version of easymatcher.
Currently, the only matcher available in this tool is MatrixMatcher,
which can be found under the link at the top of this README.