
easymatcher

  • Easymatcher worker

    Worker for matching dictionary phrases in text using the easymatcher tool - https://gitlab.clarin-pl.eu/knowledge-extraction/tools/easymatcher

    task_options

    labels_path: str - path to a JSON file with labels, under the 'labels_path' key,
    n_workers: Optional[int] - number of workers used in the matcher,
    sim_threshold: Optional[float] - cosine similarity threshold used in the matcher.
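A minimal sketch of how the options above could be checked before a task is submitted. Only the three key names and their types come from the list above; the helper name `check_task_options` and the example values are hypothetical and are not part of the worker.

```python
from typing import Any, Dict


def check_task_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Check option types against the list above (hypothetical helper)."""
    # labels_path is the only option not marked Optional in the list.
    if not isinstance(options.get("labels_path"), str):
        raise ValueError("labels_path must be a string path to a JSON file")

    # n_workers: Optional[int] (bool is rejected although it subclasses int).
    n_workers = options.get("n_workers")
    if n_workers is not None and (
        isinstance(n_workers, bool) or not isinstance(n_workers, int)
    ):
        raise ValueError("n_workers must be an int when given")

    # sim_threshold: Optional[float] (an int such as 1 is accepted as well).
    sim_threshold = options.get("sim_threshold")
    if sim_threshold is not None and (
        isinstance(sim_threshold, bool)
        or not isinstance(sim_threshold, (int, float))
    ):
        raise ValueError("sim_threshold must be a number when given")

    return options


# Example values are made up for illustration.
task_options = check_task_options(
    {"labels_path": "labels.json", "n_workers": 4, "sim_threshold": 0.8}
)
```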

    Input file structure

    input_path should point to a file or folder with text documents, from which the worker prepares JSONL data in the following format:

    {"text": "Example text1"}
    {"text": "Example text2"}
    ...
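The conversion described above can be sketched roughly as follows. This is not the worker's actual code: it assumes one UTF-8 plain-text document per file, and the function names `iter_documents` and `to_jsonl` are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_documents(input_path: str) -> Iterator[str]:
    """Yield the text of a single file, or of every file under a folder."""
    path = Path(input_path)
    if path.is_dir():
        # Sorted so that the output order is deterministic.
        files = sorted(p for p in path.rglob("*") if p.is_file())
    else:
        files = [path]
    for file in files:
        # Assumption: each file holds exactly one UTF-8 document.
        yield file.read_text(encoding="utf-8")


def to_jsonl(input_path: str) -> str:
    """Build JSONL with one {"text": ...} object per document."""
    return "\n".join(
        json.dumps({"text": text}, ensure_ascii=False)
        for text in iter_documents(input_path)
    )
```

Calling `to_jsonl` on a folder returns one line per file; calling it on a single file returns a single line.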

    Labels file structure

    labels_path should point to a JSON file with the following structure:

    {
        "labels": {
            "Example label1": ["example1", "example2"],
            "Example label2": ["example1", "example2"],
            ...
            }
    }
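A short sketch of loading a labels file and checking that it has the structure shown above. The helper name `load_labels` is hypothetical; the only thing taken from this README is the top-level "labels" key mapping each label name to a list of phrases.

```python
import json
from typing import Dict, List


def load_labels(labels_path: str) -> Dict[str, List[str]]:
    """Read the labels file and verify the label -> phrases structure."""
    with open(labels_path, encoding="utf-8") as f:
        data = json.load(f)

    if not isinstance(data, dict) or not isinstance(data.get("labels"), dict):
        raise ValueError('expected a top-level "labels" object')

    labels = data["labels"]
    for name, phrases in labels.items():
        # Every label must map to a list of phrase strings.
        if not isinstance(phrases, list) or not all(
            isinstance(phrase, str) for phrase in phrases
        ):
            raise ValueError(f"label {name!r} must map to a list of strings")
    return labels
```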

    Output file structure

    In the output file, every text gets a list of labels with the start and end indices of each detected label in the document.

    {"text": "Example text1", "label": [(1, 9, "label1"), (21, 27, "label2")]}
    {"text": "Example text2", "label": [(4, 14, "label42")]}
    ...
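As an illustration of how the (start, end, label) triples can be used, the sketch below slices the matched fragments out of a text. It assumes that the indices are character offsets and that the end index is exclusive; this README does not state either, so check against real output. The record and the helper name `matched_fragments` are made up for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]


def matched_fragments(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (label, fragment) pairs for each (start, end, label) triple."""
    # Assumption: start/end are character offsets, end is exclusive.
    return [(label, text[start:end]) for start, end, label in spans]


# Hypothetical record in the shape shown above.
record = {
    "text": "Warsaw is the capital of Poland",
    "label": [(0, 6, "city"), (25, 31, "country")],
}

fragments = matched_fragments(record["text"], record["label"])
```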

    Currently, blacklist words added to the labels file aren't compatible with the current version of easymatcher.
    Currently, the only matcher available in this tool is MatrixMatcher, which can be found at the link at the top of this README.