diff --git a/README.md b/README.md
index affa601c0c63be1b9fef44c09e37b182594bc67d..7c41fe8bb6fb3d166a58b8f2ed2843a56d832e67 100644
--- a/README.md
+++ b/README.md
@@ -28,6 +28,9 @@ print("{:5} {:15} {:15} {:10} {:10} {:10}".format('ID', 'TOKEN', 'LEMMA', 'UPOS'
 for token in sentence.tokens:
     print("{:5} {:15} {:15} {:10} {:10} {:10}".format(str(token.id), token.token, token.lemma, token.upostag, str(token.head), token.deprel))
 ```
+## COMBO tutorial
+
+We encourage you to try the [beginner's tutorial](https://colab.research.google.com/drive/1D1P4AiE40Cc_4SF3HY-Mz06JY0XMiEFs#scrollTo=6Teza7or_Qvw) (a Colab notebook).
 
 ## Details
 
diff --git a/docs/installation.md b/docs/installation.md
index 98d13b6b1df91bccca1c54270089d06a133d24b9..7cf539b5550704747e4bbeac58f9e8f021b4db56 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -14,6 +14,14 @@ pip install -U pip setuptools wheel
 pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
 ```
+### Conda example:
+```bash
+conda create -n combo python=3.8
+conda activate combo
+pip install -U pip setuptools wheel
+pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
+```
+
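+To verify that the installation succeeded, a quick optional sanity check is to import the predictor class used throughout these docs (the printed message is only illustrative):
+
+```bash
+# should print the message without raising ImportError
+python -c "from combo.predict import COMBO; print('COMBO installed correctly')"
+```
+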
 ## Problems & solutions
 * **jsonnet** installation error
diff --git a/docs/models.md b/docs/models.md
index 96bd7e93d890f80643adf9f77d5229c6ce517aca..06ef9110c87be7c97ef346a3ced57dd3a14a9cf6 100644
--- a/docs/models.md
+++ b/docs/models.md
@@ -1,35 +1,52 @@
 # Models
 
 COMBO provides pre-trained models for:
-- morphosyntactic prediction (i.e. part-of-speech tagging, morphosyntactic analysis, lemmatisation and dependency parsing) trained on the treebanks from [Universal Dependencies repository](https://universaldependencies.org),
-- enhanced dependency parsing trained on IWPT 2020 shared task [data](https://universaldependencies.org/iwpt20/data.html).
+- morphosyntactic prediction (i.e. part-of-speech tagging, morphosyntactic analysis, lemmatisation and dependency parsing) trained on treebanks from the [Universal Dependencies repository](https://universaldependencies.org) ([Zeman et al. 2020](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424)),
+- enhanced dependency parsing trained on the IWPT 2020 shared task [data](https://universaldependencies.org/iwpt20/data.html) ([Bouma et al. 2020](https://www.aclweb.org/anthology/2020.iwpt-1.16.pdf)).
 
 ## Pre-trained models
-**Pre-trained models** list with the **evaluation results** is available in the [spreadsheet](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing)
-Please notice that the name in the brackets matches the name used in [Automatic Download](models.md#Automatic download).
+All **pre-trained models** for different languages, together with their **evaluation results**, are listed in two spreadsheets: [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
 
 ### License
-Models are licensed on the same license as data used to train.
+Models are distributed under the same license as the datasets used for their training.
+See the [Universal Dependencies v2.7 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7) and the [Universal Dependencies v2.5 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5) for details.
+
+## Automatic download
+The pre-trained models can be downloaded automatically in Python mode with the `from_pretrained` method. Select a model name from the pre-trained model lists (see the **Model name** column in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324)) and pass the name as an argument to `from_pretrained`:
+
+```python
+from combo.predict import COMBO
+
+nlp = COMBO.from_pretrained("polish-herbert-base")
+```
+If the model name doesn't match any model on the pre-trained model lists, COMBO looks for a model in the local environment.
+
 ## Manual download
-The pre-trained models can be downloaded from [here](http://mozart.ipipan.waw.pl/~mklimaszewski/models/).
+The pre-trained models can be downloaded manually to a local disk, e.g. with `wget`. You need to download a pre-trained model manually if you want to use COMBO in command-line mode. The links to the pre-trained models are listed in the **Model link** column in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
+
 ```bash
 wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
 ```
-The downloaded model should be passed as a parameter for COMBO (see [prediction doc](prediction.md)).
-
-## Automatic download
-The pre-trained models can be downloaded automatically with the Python `from_pretrained` method. Select a model name (without the extension .tar.gz) from the list of [pre-trained models](http://mozart.ipipan.waw.pl/~mklimaszewski/models/) and pass the name as the attribute to `from_pretrained` method:
-```python
-from combo.predict import COMBO
-
-nlp = COMBO.from_pretrained("polish-herbert-base")
-```
-If the model name doesn't match any model on the list of [pre-trained models](http://mozart.ipipan.waw.pl/~mklimaszewski/models/), COMBO looks for a model in local env.
+The path to the downloaded model should then be passed to COMBO as a CLI parameter (see the [prediction doc](prediction.md)).
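+
+For example, you can download the Polish model and then run CoNLL-U prediction with it. This is only a sketch: the input and output file names are placeholders, and the `combo` invocation is the one documented in the [prediction doc](prediction.md):
+
+```bash
+# download the model archive, then pass its path to the combo CLI
+wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
+combo --mode predict --model_path polish-herbert-base.tar.gz \
+      --input_file input.conllu --output_file output.conllu --silent
+```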
diff --git a/docs/prediction.md b/docs/prediction.md
index 125c1298c0df6b4a8a9b0ce0544e90422f4ae155..b359bdf041d503a3f2cf5ca4cbf8b9614b0bdc06 100644
--- a/docs/prediction.md
+++ b/docs/prediction.md
@@ -1,20 +1,41 @@
 # Prediction
-## ConLLU file prediction:
-Input and output are both in `*.conllu` format.
-```bash
-combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
+## COMBO as a Python library
+The pre-trained models can be downloaded automatically with the `from_pretrained` method. Select a model name from the lists [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324), and pass it as an argument to `from_pretrained`.
+```python
+from combo.predict import COMBO
+
+nlp = COMBO.from_pretrained("polish-herbert-base")
+sentence = nlp("Sentence to parse.")
 ```
-## Console
-Works for models where input was text-based only.
+You can also load your own COMBO model:
 
-Interactive testing in console (load model and just type sentence in console).
+```python
+from combo.predict import COMBO
+
+model_path = "your_model.tar.gz"
+nlp = COMBO.from_pretrained(model_path)
+sentence = nlp("Sentence to parse.")
+```
+
+COMBO also accepts pre-segmented sentences (or texts):
+```python
+from combo.predict import COMBO
+
+model_path = "your_model.tar.gz"
+nlp = COMBO.from_pretrained(model_path)
+tokenized_sentence = ["Sentence", "to", "parse", "."]
+sentence = nlp([tokenized_sentence])
+```
+## COMBO as a command-line interface
+### CoNLL-U file prediction:
+Input and output are both in `*.conllu` format.
 ```bash
-combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
+combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
 ```
-## Raw text
+### Raw text prediction:
 Works for models where input was text-based only.
 
 Input: one sentence per line.
@@ -24,27 +45,20 @@ Output: List of token jsons.
 ```bash
 combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format
 ```
-### Advanced
-There are 2 tokenizers: whitespace and spacy-based (`en_core_web_sm` model).
-Use either `--predictor_name combo` or `--predictor_name combo-spacy`.
+### Console prediction:
+Works for models where input was text-based only.
 
-## Python
-```python
-from combo.predict import COMBO
+Interactive testing in the console: load the model once, then simply type sentences into the console.
 
-model_path = "your_model.tar.gz"
-nlp = COMBO.from_pretrained(model_path)
-sentence = nlp("Sentence to parse.")
+```bash
+combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
 ```
-Using your own tokenization:
-```python
-from combo.predict import COMBO
+
+### Advanced
+
+There are two tokenizers: whitespace-based and spaCy-based (the `en_core_web_sm` model).
+
+Use either `--predictor_name combo` (whitespace tokenizer) or `--predictor_name combo-spacy` (the default).
-model_path = "your_model.tar.gz"
-nlp = COMBO.from_pretrained(model_path)
-tokenized_sentence = ["Sentence", "to", "parse", "."]
-sentence = nlp([tokenized_sentence])
-```
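+
+For example, a raw-text run that selects the spaCy-based tokenizer explicitly might look as follows (a sketch only; `input.txt` and `output.json` are placeholder file names, and the flags are the ones documented above):
+
+```bash
+# one sentence per line in input.txt; a list of token JSONs is written to output.json
+combo --mode predict --model_path your_model_tar_gz \
+      --input_file input.txt --output_file output.json \
+      --silent --noconllu_format --predictor_name combo-spacy
+```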
diff --git a/docs/training.md b/docs/training.md
index d3f69e0913c59681279b1fd966be0f4901ade11e..7d7b0a8256e0c7896bf1372f817ab4e263b4f45e 100644
--- a/docs/training.md
+++ b/docs/training.md
@@ -44,9 +44,19 @@ Examples (for clarity without training/validation data paths):
 combo --mode train --targets head,deprel --features token,char,upostag
 ```
-## Enhanced UD
+## Enhanced Dependencies
+
+Enhanced Dependencies are described [here](https://universaldependencies.org/u/overview/enhanced-syntax.html). Training an enhanced graph prediction model **requires** data pre-processing.
+
+### Data pre-processing
+The organisers of the [IWPT 2020 shared task](https://universaldependencies.org/iwpt20/data.html) distributed the data sets together with a pre-processing script, `enhanced_collapse_empty_nodes.pl`. If you wish to train a model on the IWPT 2020 data, apply this script to both the training and validation data sets before training the COMBO EUD model.
+
+```bash
+perl enhanced_collapse_empty_nodes.pl training.conllu > training.fixed.conllu
+perl enhanced_collapse_empty_nodes.pl validation.conllu > validation.fixed.conllu
+```
+
+### Training the EUD model
-Training a model with Enhanced UD prediction **requires** data pre-processing.
 
 ```bash
 combo --mode train \
@@ -55,14 +65,7 @@ combo --mode train \
 --targets feats,upostag,xpostag,head,deprel,lemma,deps \
 --config_path config.graph.template.jsonnet
 ```
-### Data pre-processing
-Download data from [IWPT20 Shared Task](https://universaldependencies.org/iwpt20/data.html).
-It contains `enhanced_collapse_empty_nodes.pl` script which is required as pre-processing step.
-Apply this script to training and validation data.
-```bash
-perl enhanced_collapse_empty_nodes.pl training.conllu > training.fixed.conllu
-```
 ## Configuration