diff --git a/README.md b/README.md
index 76baafd5d5ae99e39b70f8a7302c189855a61612..d192e1ae7c93e306da62c393b516df6bb4ab9596 100644
--- a/README.md
+++ b/README.md
@@ -10,11 +10,9 @@ LAMBO is a machine learning model, which means it was trained to recognise bound
 
 LAMBO was developed in context of dependency parsing. Thus, it includes models trained on [Universal Dependencies treebanks](https://universaldependencies.org/#language-), uses `.conllu` as the training [data format](https://universaldependencies.org/conll18/evaluation.html) and supports integration with [COMBO](https://gitlab.clarin-pl.eu/syntactic-tools/combo), a state-of-the-art system for dependency parsing and more. However, you can use LAMBO as the first stage of any NLP process.
 
-LAMBO currently includes models trained on 98 corpora in 53 languages. The full list is available in [`languages.txt`](src/lambo/resources/languages.txt). For each of these, two model variants are available:
-- simple LAMBO, trained on the UD corpus
-- pretrained LAMBO, same as above, but starting from weights pre-trained on unsupervised masked character prediction using multilingual corpora from [OSCAR](https://oscar-corpus.com/).
+LAMBO currently includes models trained on 130 corpora in 67 languages. The full list is available in [`languages.txt`](src/lambo/resources/languages.txt). Most of these are pretrained on unsupervised masked character prediction using multilingual corpora from [OSCAR](https://oscar-corpus.com/) and fine-tuned on the UD 2.13 corpora.
 
-For 49 of the corpora, a subword splitting model is available. Note that different types of multi-word tokens exist in different languages:
+For 54 of the corpora, a subword splitting model is available. Note that different types of multi-word tokens exist in different languages:
 - those that are a concatenation of their subwords, as in English: *don't* = *do* + *n't*
 - those that differ from their subwords, as in Spanish: *al* = *a* + *el*
 
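
The distinction the new README text draws between the two kinds of multi-word tokens can be sketched in a few lines. This is an illustrative snippet only, not LAMBO's API: the function name `is_concatenative` is hypothetical, and the examples are the two from the README.

```python
# Illustrative sketch (not LAMBO's API): distinguishing the two kinds of
# multi-word tokens the README describes.
def is_concatenative(token: str, subwords: list[str]) -> bool:
    """True when the surface token is exactly the concatenation of its subwords."""
    return token == "".join(subwords)

# English: the surface token equals its joined subwords.
print(is_concatenative("don't", ["do", "n't"]))  # True
# Spanish: the surface token differs from its subwords.
print(is_concatenative("al", ["a", "el"]))       # False
```

In `.conllu` files both cases are encoded the same way, with a range ID line (e.g. `1-2`) for the surface token followed by one line per subword; a splitting model only has to *generate* the subwords in the second case, where they cannot be read off the surface form.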