LAMBO segmenter
LAMBO (Layered Approach to Multi-level BOundary identification) is a segmentation tool that divides text at several levels:
- Dividing the original text into turns according to the provided list of separators. Turns can correspond to separate utterances in a dialogue, paragraphs in a continuous text, etc.
- Splitting each turn into sentences.
- Finding tokens in sentences. Most tokens correspond to words, but multi-word tokens are also detected. LAMBO also supports special tokens that should be kept separate regardless of context, such as emojis and pause markers.
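The three levels above can be pictured with a small rule-based toy. This is only an illustrative sketch and not how LAMBO works internally: LAMBO learns sentence and token boundaries with a neural model, whereas the code below uses naive regular expressions. The function name `toy_segment`, its parameters and the default special tokens are hypothetical and are not part of the LAMBO API.

```python
import re

def toy_segment(text, turn_separators=("\n",), special_tokens=(":)", "(pause)")):
    """Hypothetical helper: returns a list of turns, each a list of
    sentences, each a list of token strings. Not LAMBO's algorithm."""
    # Level 1: split the text into turns on the given separators.
    if turn_separators:
        sep_pattern = "|".join(re.escape(s) for s in turn_separators)
        raw_turns = re.split(sep_pattern, text)
    else:
        raw_turns = [text]

    # Special tokens are tried first (longest first), so they stay whole;
    # otherwise a token is a run of word characters or one punctuation mark.
    specials = "|".join(
        re.escape(s) for s in sorted(special_tokens, key=len, reverse=True)
    )
    token_pattern = (specials + "|" if specials else "") + r"\w+|[^\w\s]"

    turns = []
    for raw_turn in raw_turns:
        raw_turn = raw_turn.strip()
        if not raw_turn:
            continue
        sentences = []
        # Level 2: naive sentence split after '.', '!' or '?' plus whitespace.
        for raw_sentence in re.split(r"(?<=[.!?])\s+", raw_turn):
            # Level 3: find the tokens inside the sentence.
            sentences.append(re.findall(token_pattern, raw_sentence))
        turns.append(sentences)
    return turns

if __name__ == "__main__":
    for turn in toy_segment("Hello there! How are you :)\nFine, thanks."):
        print(turn)
```

A toy like this breaks on abbreviations, multi-word tokens and unpunctuated text, which is the kind of case a model trained on real-world text is meant to handle.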
LAMBO is a machine learning model, which means it was trained to recognise boundaries of tokens and sentences from real-world text. It is implemented as a PyTorch deep neural network, including embeddings and recurrent layers operating at the character level. At the same time, LAMBO contains rule-based elements so that users can easily adapt it to their needs, e.g. by adding custom special tokens or turn division markers.
LAMBO was developed in the context of dependency parsing. Thus, it includes models trained on Universal Dependencies treebanks, uses .conllu as the training data format and supports integration with COMBO, a state-of-the-art system for dependency parsing and more. However, you can use LAMBO as the first stage of any NLP process.
LAMBO currently includes models trained on 98 corpora in 53 languages. The full list is available in languages.txt. For each of these, two model variants are available:
- simple LAMBO, trained on the UD corpus
- pretrained LAMBO, same as above, but starting from weights pre-trained on unsupervised masked character prediction using multilingual corpora from OSCAR.
Installation
Installation of LAMBO is easy.
- First, prepare an environment with Python 3.6.9 or later.
- Then, download LAMBO from this repository:
git clone https://gitlab.clarin-pl.eu/syntactic-tools/lambo.git
- Install LAMBO:
pip install ./lambo
You now have LAMBO installed in your environment.
Using LAMBO
To use LAMBO, you first need to import it:
from lambo.segmenter.lambo import Lambo
Now create a segmenter by providing the language your text is in, e.g. English:
lambo = Lambo.get('English')
This will (if necessary) download the appropriate model from the online repository and load it. Note that you can use any language name (e.g. Ancient_Greek) or ISO 639-1 code (e.g. fi) from languages.txt.