Commit 196187ac authored by Alina Wróblewska

Documentation updated
```python
print("{:5} {:15} {:15} {:10} {:10} {:10}".format('ID', 'TOKEN', 'LEMMA', 'UPOS', 'HEAD', 'DEPREL'))
for token in sentence.tokens:
    print("{:5} {:15} {:15} {:10} {:10} {:10}".format(str(token.id), token.token, token.lemma, token.upostag, str(token.head), token.deprel))
```
## COMBO tutorial
We encourage you to use the [beginner's tutorial](https://colab.research.google.com/drive/1D1P4AiE40Cc_4SF3HY-Mz06JY0XMiEFs#scrollTo=6Teza7or_Qvw) (Colab notebook).
## Details
# Installation
```bash
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
```
### Conda example:
```bash
conda create -n combo python=3.8
conda activate combo
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
```
## Problems & solutions
* **jsonnet** installation error
# Models
COMBO provides pre-trained models for:
- morphosyntactic prediction (i.e. part-of-speech tagging, morphosyntactic analysis, lemmatisation and dependency parsing) trained on treebanks from the [Universal Dependencies repository](https://universaldependencies.org) ([Zeman et al. 2020](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424)),
- enhanced dependency parsing trained on IWPT 2020 shared task [data](https://universaldependencies.org/iwpt20/data.html) ([Bouma et al. 2020](https://www.aclweb.org/anthology/2020.iwpt-1.16.pdf)).
## Pre-trained models
All **pre-trained models** for different languages and their **evaluation results** are listed in the spreadsheets: [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
### License
Models are distributed under the same license as datasets used for their training.
See [Universal Dependencies v2.7 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7) and [Universal Dependencies v2.5 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5) for details.
## Automatic download
The pre-trained models can be automatically downloaded with the `from_pretrained` method in Python mode. Select a model name from the pre-trained model lists (see the column **Model name** in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324)) and pass the name as an argument to the `from_pretrained` method:
```python
from combo.predict import COMBO
nlp = COMBO.from_pretrained("polish-herbert-base")
```
If the model name doesn't match any model on the pre-trained model lists, COMBO looks for a model in the local environment.
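In that case the name is treated as a path to a local model archive, as in the following minimal sketch (the archive name `your_model.tar.gz` is a placeholder):
```python
from combo.predict import COMBO

# Load a locally stored model archive, e.g. one downloaded manually
# ("your_model.tar.gz" is a placeholder file name)
nlp = COMBO.from_pretrained("your_model.tar.gz")
```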
## Manual download
Pre-trained models can be downloaded manually to a local disk, e.g. with `wget`. You need to download a pre-trained model manually if you want to use COMBO in command-line mode. The links to the pre-trained models are listed in the column **Model link** in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
```bash
wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
```
The path to the downloaded model should be passed as a parameter to COMBO in the CLI (see [prediction doc](prediction.md)).
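For example, a sketch of such a CLI call with the model downloaded above (input and output file names are placeholders; see [prediction doc](prediction.md) for all flags):
```bash
combo --mode predict --model_path polish-herbert-base.tar.gz --input_file your_file.conllu --output_file your_output.conllu --silent
```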
# Prediction
## COMBO as a Python library
The pre-trained models can be automatically downloaded with the `from_pretrained` method. Select a model name from the lists of [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324), and pass it as an argument to `from_pretrained`.
```python
from combo.predict import COMBO
nlp = COMBO.from_pretrained("polish-herbert-base")
sentence = nlp("Sentence to parse.")
```
You can also load your own COMBO model:
```python
from combo.predict import COMBO
model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
sentence = nlp("Sentence to parse.")
```
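The returned object exposes the predicted annotations; here is a minimal sketch of inspecting them, using the token fields (`id`, `token`, `lemma`, `upostag`, `head`, `deprel`) shown in the README example:
```python
# Print the predicted annotations token by token
for token in sentence.tokens:
    print(token.id, token.token, token.lemma, token.upostag, token.head, token.deprel)
```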
COMBO also accepts pre-segmented sentences (or texts):
```python
from combo.predict import COMBO
model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
tokenized_sentence = ["Sentence", "to", "parse", "."]
sentence = nlp([tokenized_sentence])
```
## COMBO as a command-line interface
### CoNLL-U file prediction:
Input and output are both in `*.conllu` format.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
```
### Raw text prediction:
Works for models where input was text-based only.
Input: one sentence per line.
Output: list of token JSONs.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format
```
### Console prediction:
Works for models where input was text-based only.
Interactive testing in console (load model and just type sentence in console).
```bash
combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
```
### Advanced
There are 2 tokenizers: whitespace and spacy-based (`en_core_web_sm` model).
Use either `--predictor_name combo` or `--predictor_name combo-spacy` (default tokenizer).
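For instance, a sketch of raw text prediction with the spacy-based tokenizer (file names are placeholders):
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format --predictor_name combo-spacy
```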
# Training
Examples (for clarity without training/validation data paths):
```bash
combo --mode train --targets head,deprel --features token,char,upostag
```
## Enhanced Dependencies
Enhanced Dependencies are described [here](https://universaldependencies.org/u/overview/enhanced-syntax.html). Training an enhanced graph prediction model **requires** data pre-processing.
### Data pre-processing
The organisers of the [IWPT20 shared task](https://universaldependencies.org/iwpt20/data.html) distributed the data sets together with the pre-processing script `enhanced_collapse_empty_nodes.pl`. If you wish to train a model on the IWPT20 data, apply this script to the training and validation data sets before training the COMBO EUD model.
```bash
perl enhanced_collapse_empty_nodes.pl training.conllu > training.fixed.conllu
perl enhanced_collapse_empty_nodes.pl validation.conllu > validation.fixed.conllu
```
### Training EUD model
```bash
combo --mode train \
--targets feats,upostag,xpostag,head,deprel,lemma,deps \
--config_path config.graph.template.jsonnet
```
## Configuration