Szymon Ciombor authored
Fixed multiple punctuation signs being predicted at once

See merge request !5
19344621

Punctuator

A service that automatically adds punctuation to a raw word stream (e.g. the output of a speech-to-text system).

Approaches

  1. Token classification (actions): Each token is classified with 4 labels: uppercase, dot, colon, question mark. The model is based on the stacked encoder part of the transformer architecture (BERT), followed by a fully connected layer that transforms the output into per-token multilabel binary classifications. For now, nothing prevents the dot, question_mark and colon labels from being predicted simultaneously for the same token, so that is an area for improvement (hierarchical, multilabel classification).

  2. Sequence-to-sequence (translations): A full encoder-decoder stack that takes the input (unpunctuated text) and the output produced so far to predict the next token. In theory, this model should be able to represent many more cases (e.g. all uppercase, partial uppercase, dashes, ellipses, etc.) without defining each of them explicitly. However, the lack of constraints makes it much harder to train.
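
To illustrate the token classification approach, here is a minimal sketch of how per-token label probabilities could be turned back into text while allowing at most one punctuation sign per token. It is not the project's actual code: the function name `apply_actions`, the 0.5 threshold, the label-to-symbol mapping and the "pick the most probable sign" rule are all assumptions made for illustration. Only the four label names come from the description above.

```python
from typing import Dict, List

# Label names follow the README; the symbol mapping and threshold are assumptions.
PUNCTUATION = {"dot": ".", "colon": ":", "question_mark": "?"}
THRESHOLD = 0.5


def apply_actions(tokens: List[str], probs: List[Dict[str, float]]) -> str:
    """Rebuild punctuated text from per-token label probabilities.

    probs[i] maps each label ("uppercase", "dot", "colon", "question_mark")
    to the independent probability predicted for tokens[i].
    Missing labels are treated as probability 0.
    """
    words = []
    for token, p in zip(tokens, probs):
        word = token
        if p.get("uppercase", 0.0) >= THRESHOLD:
            # Uppercase only the first character, leave the rest untouched.
            word = token[:1].upper() + token[1:]
        # Allow at most one punctuation sign: the most probable one,
        # and only if it clears the threshold.
        best = max(PUNCTUATION, key=lambda name: p.get(name, 0.0))
        if p.get(best, 0.0) >= THRESHOLD:
            word += PUNCTUATION[best]
        words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    tokens = ["how", "are", "you"]
    probs = [
        {"uppercase": 0.9},
        {},
        {"question_mark": 0.8, "dot": 0.6},  # both above threshold, only "?" is kept
    ]
    print(apply_actions(tokens, probs))  # How are you?
```

Picking the single most probable sign is the simplest way to keep the labels mutually exclusive at decoding time. A hierarchical classifier, as suggested above, would instead enforce this inside the model.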

Mountpoints

The directory where the model will be downloaded (~500 MB) must be mounted at /punctuator/deploy.