Punctuator
A service that automatically adds punctuation to a raw word stream (e.g. from speech-to-text) for the Polish language.
Example input:
według webometrycznego rankingu uniwersytetów świata ze stycznia 2019 pokazującego zaangażowanie instytucji akademickich w internecie uczelnia zajmuje 5 miejsce w polsce wśród uczelni technicznych a na świecie 964 wśród wszystkich typów uczelni w rankingu szkół wyższych perspektyw politechnika wrocławska zajęła w 2019 roku 3 miejsce wśród uczelni technicznych oraz 6 miejsce spośród wszystkich uczelni akademickich w polsce
Output:
Według webometrycznego rankingu uniwersytetów świata ze stycznia 2019, pokazującego zaangażowanie instytucji akademickich w Internecie, uczelnia zajmuje 5. miejsce w Polsce wśród uczelni technicznych, a na świecie 964. Wśród wszystkich typów uczelni w rankingu szkół wyższych perspektyw Politechnika Wrocławska zajęła w 2019 roku 3. miejsce wśród uczelni technicznych oraz 6. miejsce spośród wszystkich uczelni akademickich w Polsce
Models
Action-Based
- actions_base: A simple model, architecturally based on BERT. It is trained to predict an "Action" for each token in the sentence, where an action is either uppercasing the token or appending a punctuation sign to it.
- actions_restricted: A model nearly identical to actions_base, except that it predicts punctuation as a categorical distribution (so punctuation signs are mutually exclusive at training time). The idea is to better differentiate between punctuation signs.
- actions_mixed: A model based on the full transformer (encoder + decoder) architecture. It is much slower, as it predicts actions for only one word at a time. However, it can model action probabilities conditioned on both the input and the output predicted so far. Because of that, it is much less prone to missing an uppercase letter at the start of a new sentence or to placing multiple punctuation signs in close proximity.
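The action formulation above can be illustrated with a minimal sketch. The `Action` class and `apply_actions` helper below are hypothetical names used only for illustration; they show how per-token actions (uppercase and/or appended punctuation) reconstruct punctuated text from a raw word stream.

```python
# Hypothetical illustration of the "Action" formulation used by the
# action-based models: each token gets an action saying whether to
# uppercase it and which punctuation sign (if any) to append.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Action:
    uppercase: bool             # capitalize the token
    punctuation: Optional[str]  # sign appended after the token, e.g. "." or ","

def apply_actions(tokens: List[str], actions: List[Action]) -> str:
    """Reconstruct punctuated text from raw tokens and predicted actions."""
    out = []
    for token, action in zip(tokens, actions):
        word = token.capitalize() if action.uppercase else token
        if action.punctuation:
            word += action.punctuation
        out.append(word)
    return " ".join(out)

tokens = ["uczelnia", "zajmuje", "5", "miejsce", "w", "polsce"]
actions = [Action(True, None), Action(False, None), Action(False, "."),
           Action(False, None), Action(False, None), Action(True, ".")]
print(apply_actions(tokens, actions))  # Uczelnia zajmuje 5. miejsce w Polsce.
```

Note how the ordinal "5." and the capitalized proper noun "Polsce" both fall out of the same per-token action scheme, which is what keeps the output space small compared to a free text-to-text model.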
Translation
- translation (deprecated): A full encoder-decoder stack that takes the input (unpunctuated text) and the output produced so far to predict the next token. The main difference from the action-based models is that it is a full text-to-text model with no restriction on tokens. Because of that, in theory, it can represent more cases (e.g. all-caps, mixed case, dashes, ellipses, etc.), as opposed to only a few explicitly defined actions. However, the lack of constraints makes it much harder to train, in terms of both compute and data size.
Usage
To test the model locally, you can use the punctuate.py script.
punctuate.py [-h] -a {base,restricted,mixed} -d DIRECTORY -i INPUT [-m MODEL] [-l {upper_case,dot,colon,question_mark,none}] [-dv DEVICE]
Evaluate actions model
optional arguments:
-h, --help show this help message and exit
-a {base,restricted,mixed}, --architecture {base,restricted,mixed}
Model architecture
-d DIRECTORY, --directory DIRECTORY
Directory where trained model is located, relative to project root
-i INPUT, --input INPUT
Input text file
-m MODEL, --model MODEL
Pretrained model name
-l {upper_case,dot,colon,question_mark,none}, --highlight {upper_case,dot,colon,question_mark,none}
Highlight prediction confidence of selected action per-word
-dv DEVICE, --device DEVICE
Device on which inference will be made
E.g. if you place your model named "production" at punctuator/checkpoints/actions_base/
and an example unpunctuated text file at punctuator/test_data/test.txt,
you can call
python3 punctuate.py -a base -d checkpoints/actions_base -i test_data/test.txt -m production -dv cuda:0
Config
[deployment]
device = cpu ; Device on which inference will be made (eg. cpu, cuda:0 etc)
models_dir = deploy ; Relative path to directory, where models will be placed
models_enabled = actions_base,actions_mixed,actions_restricted ; Which models are available
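The fragment above is standard ini syntax, so it can be read with Python's built-in configparser. The sketch below is an assumption about how the service consumes it (the file name and section handling are illustrative); note that the `;` inline comments require `inline_comment_prefixes` to be stripped from values.

```python
# Reading the [deployment] section with Python's standard configparser.
# inline_comment_prefixes makes "; ..." comments part of the syntax
# rather than part of the value.
import configparser

config = configparser.ConfigParser(inline_comment_prefixes=(";",))
config.read_string("""
[deployment]
device = cpu ; Device on which inference will be made (eg. cpu, cuda:0 etc)
models_dir = deploy ; Relative path to directory, where models will be placed
models_enabled = actions_base,actions_mixed,actions_restricted ; Which models are available
""")

deployment = config["deployment"]
device = deployment.get("device", "cpu")
models_enabled = [m.strip() for m in deployment["models_enabled"].split(",")]
print(device, models_enabled)
```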
LPMN
filedir(/users/michal.pogoda)|any2txt|punctuator_test
or
filedir(/users/michal.pogoda)|any2txt|punctuator_test({"model":"model_name"})
where model_name is one of the models specified in models_enabled. If no model is provided, or the requested model is unavailable, actions_base will be used.
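The fallback rule above can be sketched as a tiny helper. The function name `select_model` is hypothetical, used only to make the selection logic concrete:

```python
# Hypothetical helper illustrating the fallback rule: the requested
# model is used only when it appears in models_enabled, otherwise
# actions_base is returned.
def select_model(requested, models_enabled, default="actions_base"):
    """Return the requested model if enabled, otherwise the default."""
    if requested in models_enabled:
        return requested
    return default

enabled = ["actions_base", "actions_mixed", "actions_restricted"]
print(select_model("actions_mixed", enabled))  # actions_mixed
print(select_model(None, enabled))             # actions_base
print(select_model("unknown", enabled))        # actions_base
```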
Mountpoints
The directory where the model will be downloaded (~500 MB) needs to be mounted at /punctuator/deploy.