Commit b8ddab9a authored by piotrmp's avatar piotrmp

Updated README.

parent de8b92a5
Pipeline #17017 passed
@@ -10,11 +10,9 @@ LAMBO is a machine learning model, which means it was trained to recognise bound
 LAMBO was developed in the context of dependency parsing. Thus, it includes models trained on [Universal Dependencies treebanks](https://universaldependencies.org/#language-), uses `.conllu` as the training [data format](https://universaldependencies.org/conll18/evaluation.html) and supports integration with [COMBO](https://gitlab.clarin-pl.eu/syntactic-tools/combo), a state-of-the-art system for dependency parsing and more. However, you can use LAMBO as the first stage of any NLP process.
-LAMBO currently includes models trained on 98 corpora in 53 languages. The full list is available in [`languages.txt`](src/lambo/resources/languages.txt). For each of these, two model variants are available:
-- simple LAMBO, trained on the UD corpus
-- pretrained LAMBO, same as above, but starting from weights pre-trained on unsupervised masked character prediction using multilingual corpora from [OSCAR](https://oscar-corpus.com/).
+LAMBO currently includes models trained on 130 corpora in 67 languages. The full list is available in [`languages.txt`](src/lambo/resources/languages.txt). Most of these are pre-trained on unsupervised masked character prediction using multilingual corpora from [OSCAR](https://oscar-corpus.com/) and fine-tuned on the UD 2.13 corpus.
-For 49 of the corpora, a subword splitting model is available. Note that different types of multi-word tokens exist in different languages:
+For 54 of the corpora, a subword splitting model is available. Note that different types of multi-word tokens exist in different languages:
 - those that are a concatenation of their subwords, as in English: *don't* = *do* + *n't*
 - those that differ from their subwords, as in Spanish: *al* = *a* + *el*
......
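The multi-word tokens described above have a concrete representation in the `.conllu` training format the README references: a range ID such as `1-2` introduces the surface token, and the immediately following lines give its syntactic subwords. The sketch below is not LAMBO's API; it is a minimal, hypothetical reader for a tiny CoNLL-U fragment (Spanish *al* = *a* + *el*) to show how that structure is encoded.

```python
# Minimal sketch (not LAMBO's API): how multi-word tokens appear in CoNLL-U.
# A range ID like "1-2" marks the surface token; the lines that follow it
# carry its syntactic subwords. Columns beyond ID and FORM are elided here.

conllu = """\
1-2\tal\t_
1\ta\ta
2\tel\tel
3\tcine\tcine
"""

def read_tokens(text):
    """Group CoNLL-U token lines into surface tokens with their subwords."""
    tokens = []
    pending = 0  # subword lines still owed to the last multi-word token
    for line in text.splitlines():
        tid, form = line.split("\t")[:2]
        if "-" in tid:                      # multi-word token, e.g. "1-2"
            start, end = map(int, tid.split("-"))
            tokens.append({"form": form, "subwords": []})
            pending = end - start + 1
        elif pending:                       # subword of the previous token
            tokens[-1]["subwords"].append(form)
            pending -= 1
        else:                               # ordinary single-word token
            tokens.append({"form": form, "subwords": [form]})
    return tokens

tokens = read_tokens(conllu)
print(tokens)
# → [{'form': 'al', 'subwords': ['a', 'el']}, {'form': 'cine', 'subwords': ['cine']}]
```

The English case (*don't* = *do* + *n't*) is encoded the same way; the two cases differ only in whether the surface form equals the concatenation of its subwords, which is what LAMBO's subword splitting model has to learn per language.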