Punctuator

A service that automatically adds punctuation to a raw word stream (e.g. speech-to-text output) in the Polish language.

Example input:

według webometrycznego rankingu uniwersytetów świata ze stycznia 2019 pokazującego zaangażowanie instytucji akademickich w internecie uczelnia zajmuje 5 miejsce w polsce wśród uczelni technicznych a na świecie 964 wśród wszystkich typów uczelni w rankingu szkół wyższych perspektyw politechnika wrocławska zajęła w 2019 roku 3 miejsce wśród uczelni technicznych oraz 6 miejsce spośród wszystkich uczelni akademickich w polsce

Output:

Według webometrycznego rankingu uniwersytetów świata ze stycznia 2019, pokazującego zaangażowanie instytucji akademickich w Internecie, uczelnia zajmuje 5. miejsce w Polsce wśród uczelni technicznych, a na świecie 964. Wśród wszystkich typów uczelni w rankingu szkół wyższych perspektyw Politechnika Wrocławska zajęła w 2019 roku 3. miejsce wśród uczelni technicznych oraz 6. miejsce spośród wszystkich uczelni akademickich w Polsce

Models

Action-Based

  1. actions_base: A simple model architecturally based on BERT. It is trained to predict an "action" for each token in the sentence. An action either uppercases the token or adds a punctuation mark at the end of the token.

  2. actions_restricted: Nearly identical to actions_base, except that it predicts punctuation as a categorical distribution, so punctuation marks are mutually exclusive at training time. The idea is to better differentiate between individual punctuation marks.

  3. actions_mixed: A model based on the full transformer (encoder + decoder) architecture. It is much slower, as it predicts actions for only one word at a time. However, it can model action probabilities conditioned on both the input and the output predicted so far. As a result, it is much less likely to leave the first letter of a new sentence in lowercase or to place multiple punctuation marks close together.
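
The notion of a per-token action can be illustrated with a minimal sketch. It is not the project's implementation: the `Action` class, the `PUNCTUATION` table, and the label names in it are hypothetical, and "uppercasing" is assumed here to mean capitalizing the first letter of the token only.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label set; the labels used by the real models may differ.
PUNCTUATION = {"dot": ".", "comma": ",", "question_mark": "?"}


@dataclass
class Action:
    """One predicted action for one token (illustrative structure only)."""
    upper_case: bool = False
    punctuation: Optional[str] = None  # a key of PUNCTUATION, or None


def apply_actions(tokens: List[str], actions: List[Action]) -> str:
    """Apply one action per token and join the result into a sentence."""
    if len(tokens) != len(actions):
        raise ValueError("expected exactly one action per token")
    words = []
    for token, action in zip(tokens, actions):
        # Assumption: uppercasing affects only the first letter.
        word = token[:1].upper() + token[1:] if action.upper_case else token
        if action.punctuation is not None:
            word += PUNCTUATION[action.punctuation]
        words.append(word)
    return " ".join(words)


tokens = ["uczelnia", "zajmuje", "5", "miejsce", "w", "polsce"]
actions = [
    Action(upper_case=True),
    Action(),
    Action(punctuation="dot"),
    Action(),
    Action(),
    Action(upper_case=True, punctuation="dot"),
]
print(apply_actions(tokens, actions))
```

Because `punctuation` holds at most one label per token, this sketch mirrors the mutually exclusive punctuation of actions_restricted rather than independent per-mark predictions.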

Translation

  1. translation (Deprecated): A full encoder-decoder stack that takes the input (unpunctuated text) and the output produced so far to predict the next token. The main difference from the action-based models is that it is a full text-to-text model with no restriction on output tokens. In theory, it can therefore represent more cases (e.g. all-uppercase words, partially uppercased words, dashes, ellipses) than a few explicitly defined actions can. However, the lack of constraints makes it much harder to train (in terms of both performance and data size).

Usage

To test the model locally, you can use the punctuate.py script.

punctuate.py [-h] -a {base,restricted,mixed} -d DIRECTORY -i INPUT [-m MODEL] [-l {upper_case,dot,colon,question_mark,none}] [-dv DEVICE]

Evaluate actions model

optional arguments:
  -h, --help            show this help message and exit
  -a {base,restricted,mixed}, --architecture {base,restricted,mixed}
                        Model architecture
  -d DIRECTORY, --directory DIRECTORY
                        Directory where trained model is located, relative to project root
  -i INPUT, --input INPUT
                        Input text file
  -m MODEL, --model MODEL
                        Pretrained model name
  -l {upper_case,dot,colon,question_mark,none}, --highlight {upper_case,dot,colon,question_mark,none}
                        Highlight prediction confidence of selected action per-word
  -dv DEVICE, --device DEVICE
                        Device on which inference will be made

For example, if you place your model named "production" in punctuator/deploy/actions_mixed/ and an example unpunctuated text file at punctuator/test_data/text.txt, you can call:

python3 punctuate.py -a mixed -d /deploy/actions_mixed -i test_data/text.txt -m production -dv cuda:0
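
For reference, the command-line interface described by the help text can be approximated with the standard argparse module. This is a sketch reconstructed from the help output alone, not the actual source of punctuate.py; the `build_parser` function and the `None` defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="punctuate.py", description="Evaluate actions model"
    )
    parser.add_argument("-a", "--architecture", required=True,
                        choices=["base", "restricted", "mixed"],
                        help="Model architecture")
    parser.add_argument("-d", "--directory", required=True,
                        help="Directory where trained model is located, "
                             "relative to project root")
    parser.add_argument("-i", "--input", required=True, help="Input text file")
    # Defaults below are assumptions; the real script may use other values.
    parser.add_argument("-m", "--model", default=None,
                        help="Pretrained model name")
    parser.add_argument("-l", "--highlight", default=None,
                        choices=["upper_case", "dot", "colon",
                                 "question_mark", "none"],
                        help="Highlight prediction confidence of selected "
                             "action per-word")
    parser.add_argument("-dv", "--device", default=None,
                        help="Device on which inference will be made")
    return parser


# Parse the same arguments as the example call.
args = build_parser().parse_args([
    "-a", "mixed",
    "-d", "/deploy/actions_mixed",
    "-i", "test_data/text.txt",
    "-m", "production",
    "-dv", "cuda:0",
])
print(args.architecture, args.directory, args.input, args.model, args.device)
```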

Config

[deployment]
device = cpu ; Device on which inference will be made (e.g. cpu, cuda:0)
models_dir = deploy ; Relative path to the directory where models will be placed
models_enabled = actions_base,actions_mixed,actions_restricted ; Models that are available
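
A file in this format can be read with Python's standard configparser module, as in the sketch below. Whether the service itself loads its configuration this way is an assumption; the `CONFIG_TEXT` string simply repeats the example in shortened form. Inline `;` comments are not stripped by default, so `inline_comment_prefixes` has to be set explicitly.

```python
import configparser

# Shortened copy of the example configuration, for illustration only.
CONFIG_TEXT = """
[deployment]
device = cpu ; Device on which inference will be made
models_dir = deploy ; Relative path to the models directory
models_enabled = actions_base,actions_mixed,actions_restricted ; Available models
"""

# Without inline_comment_prefixes the "; ..." text would become part of each value.
parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
parser.read_string(CONFIG_TEXT)

deployment = parser["deployment"]
device = deployment["device"]
models_dir = deployment["models_dir"]
# Split the comma-separated list and drop surrounding whitespace and empty entries.
models_enabled = [
    name.strip() for name in deployment["models_enabled"].split(",") if name.strip()
]

print(device, models_dir, models_enabled)
```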

LPMN

filedir(/users/michal.pogoda)|any2txt|punctuator_test

or

filedir(/users/michal.pogoda)|any2txt|punctuator_test({"model":"model_name"})

where model_name is one of the models listed in models_enabled. If no model is provided or the requested model is unavailable, actions_base is used.
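
The two LPMN forms and the fallback rule can be sketched as follows. The helper names `resolve_model` and `build_lpmn` are hypothetical and only mirror the behaviour described above; they are not part of the service.

```python
import json
from typing import List, Optional

DEFAULT_MODEL = "actions_base"


def resolve_model(requested: Optional[str], enabled: List[str]) -> str:
    """Return the requested model if it is enabled, otherwise the default."""
    if requested is not None and requested in enabled:
        return requested
    return DEFAULT_MODEL


def build_lpmn(input_dir: str, model: Optional[str] = None) -> str:
    """Build an LPMN query string, optionally selecting a model."""
    task = "punctuator_test"
    if model is not None:
        # Compact separators reproduce the {"model":"model_name"} form.
        options = json.dumps({"model": model}, separators=(",", ":"))
        task += "(" + options + ")"
    return "filedir(" + input_dir + ")|any2txt|" + task


enabled = ["actions_base", "actions_mixed", "actions_restricted"]
print(build_lpmn("/users/michal.pogoda"))
print(build_lpmn("/users/michal.pogoda", "actions_mixed"))
print(resolve_model("translation", enabled))
```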

Mountpoints

The directory to which the model will be downloaded (~500 MB) needs to be mounted at /punctuator/deploy.