diff --git a/README.md b/README.md
index f11a6638ce646ead39a2af97a096ddd2aa60c112..987fc04d8465b6b5968a0fe4b52770b2f11a1572 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ LAMBO is a machine learning model, which means it was trained to recognise bound
 
 LAMBO was developed in context of dependency parsing. Thus, it includes models trained on [Universal Dependencies treebanks](https://universaldependencies.org/#language-), uses `.conllu` as the training [data format](https://universaldependencies.org/conll18/evaluation.html) and supports integration with [COMBO](https://gitlab.clarin-pl.eu/syntactic-tools/combo), a state-of-the-art system for dependency parsing and more. However, you can use LAMBO as the first stage of any NLP process.
 
-LAMBO currently includes models trained on 98 corpora in 53 languages. The full list is available in [languages.txt](src/lambo/resources/languages.txt). For each of these, two model variants are available:
+LAMBO currently includes models trained on 98 corpora in 53 languages. The full list is available in `[languages.txt](src/lambo/resources/languages.txt)`. For each of these, two model variants are available:
 - simple LAMBO, trained on the UD corpus
 - pretrained LAMBO, same as above, but starting from weights pre-trained on unsupervised masked character prediction using multilingual corpora from [OSCAR](https://oscar-corpus.com/).
 
@@ -43,9 +43,9 @@ Now you need to create a segmenter by providing the language your text is in, e.
 ```
 lambo = Lambo.get('English')
 ```
-This will (if necessary) download the appropriate model from the online repository and load it. Note that you can use any language name (e.g. `Ancient_Greek`) or ISO 639-1 code (e.g. `fi`) from [languages.txt](src/lambo/resources/languages.txt).
+This will (if necessary) download the appropriate model from the online repository and load it. Note that you can use any language name (e.g. `Ancient_Greek`) or ISO 639-1 code (e.g. `fi`) from `[languages.txt](src/lambo/resources/languages.txt)`.
 
-Alternatively, you can select a specific model by defining LAMBO variant (`LAMBO` or `LAMBO_no_pretraining`) and training dataset from [languages.txt](src/lambo/resources/languages.txt):
+Alternatively, you can select a specific model by defining LAMBO variant (`LAMBO` or `LAMBO_no_pretraining`) and training dataset from `[languages.txt](src/lambo/resources/languages.txt)`:
 ```
 lambo = Lambo.get('LAMBO-UD_Polish-PDB')
 ```
@@ -113,13 +113,19 @@ print("{:5} {:15} {:15} {:10} {:10} {:10}".format('ID', 'TOKEN', 'LEMMA', 'UPOS'
 
 ## Extending LAMBO
 
-You don't have to rely on the models trained so far in COMBO. You can use the included code to train on new corpora and languages, tune to specific usecases or simply retrain larger models with more resources. The scripts in [examples](src/lambo/examples) include examples on how to do that:
+You don't have to rely on the models trained so far in COMBO. You can use the included code to train on new corpora and languages, tune to specific usecases or simply retrain larger models with more resources. The scripts in `[examples](src/lambo/examples)` include examples on how to do that:
 - `run_training.py` -- train simple LAMBO models. This script was used with [UD treebanks](https://universaldependencies.org/#language-) to generate `LAMBO_no_pretraining` models.
 - `run_pretraining.py` -- pretrain unsupervised LAMBO models. This script was used with [OSCAR](https://oscar-corpus.com/).
 - `run_training_pretrained.py` -- train LAMBO models on UD training data, starting from pretrained models. This script was used to generate `LAMBO` models.
 - `run_tuning.py` -- tune existing LAMBO model to fit new data.
 - `run_evaluation.py` -- evaluate existing models using UD gold standard.
 
+Note that you can also extend LAMBO by modifying the data files that specify strings that will be treated specially:
+- [emoji.tab](src/lambo/resources/emoji.tab) includes a list of emojis (they will always be treated as separate tokens),
+- [pauses.txt](src/lambo/resources/pauses.txt) includes a list of verbal pauses (they will also be separated, but not split),
+- [turn_regexp.txt](src/lambo/resources/turn_regexp.txt) enumerates regular expressions used to split turns (such as double newline).
+
+
 ## Credits
 
 If you use LAMBO in your research, please cite it as software:
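
For context, the segmenter configured in the hunks above is used roughly as follows. This is a minimal sketch: the `Lambo.get(...)` call is taken directly from the diff, while the import path and the `segment()`/`turns`/`sentences`/`tokens` API are assumed from the LAMBO README rather than shown in this change.
```
# Minimal usage sketch (assumed API; only Lambo.get(...) appears in this diff)
from lambo.segmenter.lambo import Lambo  # import path assumed from the LAMBO README

lambo = Lambo.get('English')  # downloads the model on first use, then loads it

# Segment raw text into turns, sentences and tokens (attribute names assumed)
document = lambo.segment("A first sentence. And a second one, with an emoji ❤️.")
for turn in document.turns:
    for sentence in turn.sentences:
        print([token.text for token in sentence.tokens])
```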