nlpworkers
punctuator

Repository



Punctuator
A service that automatically adds punctuation to raw word-stream (eg. from speech2text).

Approaches


Token classification (actions): Each token is classified with 4 labels: Uppercase, dot, colon, question mark. The model is based on the stacked encoder part of transformer architecture (Bert), followed by FC-layer that transforms the output into per-token multilabel binary classifications. For now, there is no restriction for taking dot, question_mark and colon labels simultaneously, so that's the are of improvement  (hierarchical, multilabel classification)


Sequence-to-Sequence (translations): Full encoder-decoder stack that takes input (unpunctuated text) and the output produced so far to predict the next token. In theory, this model should be able to represent many more cases (eg. all upper, some upper, dashes, ellipsis etc...) without explicit defines. However, the lack of constraints makes it much harder to train.


Mountpoints
Directory where model will be downloaded (~500Mb) needs to be mounted at /punctuator/deploy