Commit b8ddab9a authored by piotrmp's avatar piotrmp

Updated README.

parent de8b92a5
Pipeline #17017 passed
@@ -10,11 +10,9 @@ LAMBO is a machine learning model, which means it was trained to recognise bound
 LAMBO was developed in the context of dependency parsing. Thus, it includes models trained on [Universal Dependencies treebanks](https://universaldependencies.org/#language-), uses `.conllu` as the training [data format](https://universaldependencies.org/conll18/evaluation.html) and supports integration with [COMBO](https://gitlab.clarin-pl.eu/syntactic-tools/combo), a state-of-the-art system for dependency parsing and more. However, you can use LAMBO as the first stage of any NLP process.
-LAMBO currently includes models trained on 98 corpora in 53 languages. The full list is available in [`languages.txt`](src/lambo/resources/languages.txt). For each of these, two model variants are available:
-- simple LAMBO, trained on the UD corpus
-- pretrained LAMBO, same as above, but starting from weights pre-trained on unsupervised masked character prediction using multilingual corpora from [OSCAR](https://oscar-corpus.com/).
+LAMBO currently includes models trained on 130 corpora in 67 languages. The full list is available in [`languages.txt`](src/lambo/resources/languages.txt). Most of these are pre-trained on unsupervised masked character prediction using multilingual corpora from [OSCAR](https://oscar-corpus.com/) and fine-tuned on the UD 2.13 corpus.
-For 49 of the corpora, a subword splitting model is available. Note that different types of multi-word tokens exist in different languages:
+For 54 of the corpora, a subword splitting model is available. Note that different types of multi-word tokens exist in different languages:
 - those that are a concatenation of their subwords, as in English: *don't* = *do* + *n't*
 - those that differ from their subwords, as in Spanish: *al* = *a* + *el*
......
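The multi-word tokens described above have a concrete representation in the `.conllu` training format the README references: a range ID such as `1-2` introduces the surface token, and the immediately following lines give its syntactic subwords. The sketch below is not LAMBO's API; it is a minimal, hypothetical reader for a tiny CoNLL-U fragment (Spanish *al* = *a* + *el*) to show how that structure is encoded.

```python
# Minimal sketch (not LAMBO's API): how multi-word tokens appear in CoNLL-U.
# A range ID like "1-2" marks the surface token; the lines that follow it
# carry its syntactic subwords. Columns beyond ID and FORM are elided here.

conllu = """\
1-2\tal\t_
1\ta\ta
2\tel\tel
3\tcine\tcine
"""

def read_tokens(text):
    """Group CoNLL-U token lines into surface tokens with their subwords."""
    tokens = []
    pending = 0  # subword lines still owed to the last multi-word token
    for line in text.splitlines():
        tid, form = line.split("\t")[:2]
        if "-" in tid:                      # multi-word token, e.g. "1-2"
            start, end = map(int, tid.split("-"))
            tokens.append({"form": form, "subwords": []})
            pending = end - start + 1
        elif pending:                       # subword of the previous token
            tokens[-1]["subwords"].append(form)
            pending -= 1
        else:                               # ordinary single-word token
            tokens.append({"form": form, "subwords": [form]})
    return tokens

tokens = read_tokens(conllu)
print(tokens)
# → [{'form': 'al', 'subwords': ['a', 'el']}, {'form': 'cine', 'subwords': ['cine']}]
```

The English case (*don't* = *do* + *n't*) is encoded the same way; the two cases differ only in whether the surface form equals the concatenation of its subwords, which is what LAMBO's subword splitting model has to learn per language.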