Anonymizer

Service that automatically anonymizes text in Polish.

Anonymizer works in 3 modes. When sensitive data is detected, it can perform one of the following operations:

  • delete - sensitive data is deleted
  • tag - sensitive data is replaced by the category tag it belongs to
  • pseudo (pseudonymization) - sensitive data is replaced by another object in the same category
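
The three modes can be illustrated with a minimal sketch. It is not the project's implementation: the detection format (character spans with a category), the `anonymize` function and the `PSEUDONYMS` table are assumptions made for illustration. Whether a multi-word name arrives as one detection or several depends on the detector, so the sketch simply treats the whole name as one span.

```python
from typing import List, Tuple

# Hypothetical detection format: (start, end, category) character spans.
Detection = Tuple[int, int, str]

# Hypothetical pseudonym table. A real replacer would pick from a larger pool
# and match the grammatical case of the original text.
PSEUDONYMS = {"OSOBA": "Stefanem Michlem"}


def anonymize(text: str, detections: List[Detection], method: str) -> str:
    """Apply one of the three modes to non-overlapping detections."""
    result = text
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, category in sorted(detections, reverse=True):
        if method == "delete":
            replacement = ""
        elif method == "tag":
            replacement = f"[{category}]"
        elif method == "pseudo":
            # Fall back to the tag when no pseudonym is known for the category.
            replacement = PSEUDONYMS.get(category, f"[{category}]")
        else:
            raise ValueError(f"unknown method: {method}")
        result = result[:start] + replacement + result[end:]
    return result


text = "Spotkałem się dzisiaj z Janem Kowalskim."
name = "Janem Kowalskim"
start = text.index(name)
detections = [(start, start + len(name), "OSOBA")]

for method in ("delete", "tag", "pseudo"):
    print(method, "->", anonymize(text, detections, method))
```

Replacing spans from right to left avoids having to recompute offsets after each substitution.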

Running from the CLI

python3 cli.py example_inputs/wiktorner_jsonl.jsonl output.json --configuration wiktorner_jsonl --language pl --replace-method tag

How does it work?

Anonymizer is a pipeline of modules. The overall pipeline is as follows:

  1. Text is loaded from a file by the input_parser module. The role of this module is to read the data from the file and output the text and its annotations in a standardized format.
  2. A series of detector modules is run against the text and annotations from the previous step. Each detector module is responsible for detecting a specific type of sensitive data. The output of a detector is a list of parsed detections. At the end, detections from all detectors are merged into one list.
  3. Multiple detector modules can detect sensitive data in the same or overlapping spans (e.g. 523-612-298 will be detected as a phone number, but also as multiple numbers). The role of a suppressor is to select which detections should be kept and which should be removed. The simplest suppressor is order-based: on overlap, it keeps the detection that appears first in the list (that is, the detection created by the detector module that is higher on the list of detectors).
  4. A series of replacer modules is run against the text and detections from the previous step. Each replacer module is responsible for replacing a specific type of sensitive data. The output of a replacer is a list of parsed replacements (the entries that were handled by that replacer) and a list of unhandled detections (the detections that were not handled by that replacer). All unhandled detections are passed to the next replacer module. It's usually a good idea to put the most general replacer at the end of the list of replacers (i.e. the one that is able to produce a generic replacement for every possible detection).
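
The order-based suppressor from step 3 can be sketched as follows. This is an illustrative approximation, not the project's code: the `Detection` type, the category names and the `suppress_by_order` function are hypothetical.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    start: int     # inclusive character offset
    end: int       # exclusive character offset
    category: str  # hypothetical category label


def overlaps(a: Detection, b: Detection) -> bool:
    """Two half-open spans overlap when each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def suppress_by_order(detections: List[Detection]) -> List[Detection]:
    """Keep a detection only if it does not overlap one kept earlier."""
    kept: List[Detection] = []
    for candidate in detections:
        if not any(overlaps(candidate, earlier) for earlier in kept):
            kept.append(candidate)
    return kept


# "523-612-298" seen by a phone detector (listed first) and a number detector.
detections = [
    Detection(0, 11, "phone_number"),
    Detection(0, 3, "number"),
    Detection(4, 7, "number"),
    Detection(8, 11, "number"),
    Detection(20, 25, "number"),  # unrelated number elsewhere in the text
]
kept = suppress_by_order(detections)
print(kept)
```

Because the merged list preserves detector order, placing the more specific detector earlier in the list is what makes the phone number win over the three partial numbers.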

All of these steps are managed by the pipeline module.
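
The hand-off between replacers described in step 4 could look roughly like the sketch below. All names here (`tag_replacer_for`, `generic_replacer`, `run_replacers`, the `[REDACTED]` placeholder and the tuple formats) are assumptions for illustration and are not taken from the project.

```python
from typing import Callable, List, Tuple

Detection = Tuple[int, int, str]    # (start, end, category), hypothetical
Replacement = Tuple[int, int, str]  # (start, end, new_text), hypothetical
Replacer = Callable[
    [str, List[Detection]], Tuple[List[Replacement], List[Detection]]
]


def tag_replacer_for(category: str) -> Replacer:
    """Build a replacer that handles exactly one category."""
    def replacer(text: str, detections: List[Detection]):
        handled = [(s, e, f"[{c}]") for s, e, c in detections if c == category]
        unhandled = [d for d in detections if d[2] != category]
        return handled, unhandled
    return replacer


def generic_replacer(text: str, detections: List[Detection]):
    """Catch-all replacer: handles every detection it receives."""
    return [(s, e, "[REDACTED]") for s, e, _ in detections], []


def run_replacers(text: str, detections: List[Detection],
                  replacers: List[Replacer]) -> Tuple[str, List[Detection]]:
    replacements: List[Replacement] = []
    remaining = list(detections)
    for replacer in replacers:
        # Each replacer sees only what the previous ones left unhandled.
        handled, remaining = replacer(text, remaining)
        replacements.extend(handled)
    # Apply from the end of the text so earlier offsets stay valid.
    for start, end, new_text in sorted(replacements, reverse=True):
        text = text[:start] + new_text + text[end:]
    return text, remaining


text = "Jan, tel. 523-612-298"
phone = "523-612-298"
phone_start = text.index(phone)
detections = [
    (0, 3, "OSOBA"),
    (phone_start, phone_start + len(phone), "phone_number"),
]
# The most general replacer goes last, as recommended above.
result, leftover = run_replacers(
    text, detections, [tag_replacer_for("OSOBA"), generic_replacer]
)
print(result)
```

With the catch-all replacer at the end of the list, no detection is left unhandled, so `leftover` is empty.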

Configuration

The project uses hydra for configuration. The configuration files are in config. The project is structured so that different configurations of the software are placed in config/configuration. For example, the ccl.yaml configuration found there sets up the anonymizer to work on single CCL files with n5 NER.

Examples (the input sentence means "I met with Jan Kowalski today."):

  • Delete
    • Input: Spotkałem się dzisiaj z Janem Kowalskim.
    • Output: Spotkałem się dzisiaj z .
  • Tag
    • Input: Spotkałem się dzisiaj z Janem Kowalskim.
    • Output: Spotkałem się dzisiaj z [OSOBA] [OSOBA].
  • Pseudonymization
    • Input: Spotkałem się dzisiaj z Janem Kowalskim.
    • Output: Spotkałem się dzisiaj z Stefanem Michlem.

Liner2 should use the 5nam model. Processing chain: text -> any2txt -> morphodita -> liner2 -> anonymizer