Commit 196187ac authored by Alina Wróblewska's avatar Alina Wróblewska

Documentation updated

parent 320b4a96
...@@ -28,6 +28,9 @@ print("{:5} {:15} {:15} {:10} {:10} {:10}".format('ID', 'TOKEN', 'LEMMA', 'UPOS'
for token in sentence.tokens:
    print("{:5} {:15} {:15} {:10} {:10} {:10}".format(str(token.id), token.token, token.lemma, token.upostag, str(token.head), token.deprel))
```
## COMBO tutorial
We encourage you to try the [beginner's tutorial](https://colab.research.google.com/drive/1D1P4AiE40Cc_4SF3HY-Mz06JY0XMiEFs#scrollTo=6Teza7or_Qvw) (a Colab notebook).
## Details
...
...@@ -14,6 +14,14 @@ pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
```
### Conda example:
```bash
conda create -n combo python=3.8
conda activate combo
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
```
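After either installation route, a quick sanity check is to import the predictor class used throughout these docs (this check is our suggestion, not part of the official instructions):
```bash
# Should print the message if COMBO installed correctly
python -c "from combo.predict import COMBO; print('COMBO imported successfully')"
```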
## Problems & solutions
* **jsonnet** installation error
...
# Models
COMBO provides pre-trained models for:
- morphosyntactic prediction (i.e. part-of-speech tagging, morphosyntactic analysis, lemmatisation and dependency parsing) trained on the treebanks from the [Universal Dependencies repository](https://universaldependencies.org) ([Zeman et al. 2020](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424)),
- enhanced dependency parsing trained on the IWPT 2020 shared task [data](https://universaldependencies.org/iwpt20/data.html) ([Bouma et al. 2020](https://www.aclweb.org/anthology/2020.iwpt-1.16.pdf)).
## Pre-trained models
All **pre-trained models** for different languages and their **evaluation results** are listed in the spreadsheets: [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
Note that the name in brackets matches the name used in [Automatic Download](models.md#automatic-download).
### License
Models are distributed under the same license as datasets used for their training.
See [Universal Dependencies v2.7 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7) and [Universal Dependencies v2.5 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5) for details.
## Automatic download
The pre-trained models can be automatically downloaded with the `from_pretrained` method in Python mode. Select a model name from the pre-trained model lists (see the column **Model name** in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324)) and pass the name as an argument to the `from_pretrained` method:
```python
from combo.predict import COMBO
nlp = COMBO.from_pretrained("polish-herbert-base")
```
If the model name does not match any model on the pre-trained model lists, COMBO looks for a model in the local environment.
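For instance, the following sketch loads a locally stored model archive instead of a pre-trained one (the path is a placeholder):
```python
from combo.predict import COMBO

# Placeholder path: when the name matches no pre-trained model,
# COMBO resolves it in the local environment.
nlp = COMBO.from_pretrained("path/to/your_model.tar.gz")
sentence = nlp("Sentence to parse.")
```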
## Manual download
The pre-trained models can be manually downloaded to a local disk, e.g. with `wget`. You need to manually download a pre-trained model if you want to use COMBO in command-line mode. The links to the pre-trained models are listed in the column **Model link** in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
```bash
wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
```
The path to the downloaded model should then be passed to COMBO as a command-line parameter (see the [prediction doc](prediction.md)).
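For example, a prediction run with the downloaded archive might look like this (input and output file names are placeholders; the flags are the same as in the prediction doc):
```bash
# Placeholder file names; --model_path points to the downloaded archive
combo --mode predict --model_path polish-herbert-base.tar.gz \
      --input_file your_conllu_file --output_file your_output_file --silent
```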
# Prediction
## COMBO as a Python library
The pre-trained models can be automatically downloaded with the `from_pretrained` method. Select a model name from the lists: [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324), and pass it as an argument to `from_pretrained`.
```python
from combo.predict import COMBO
nlp = COMBO.from_pretrained("polish-herbert-base")
sentence = nlp("Sentence to parse.")
```
You can also load your own COMBO model:
```python
from combo.predict import COMBO
model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
sentence = nlp("Sentence to parse.")
```
COMBO also allows you to enter pre-segmented sentences (or texts):
```python
from combo.predict import COMBO
model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
tokenized_sentence = ["Sentence", "to", "parse", "."]
sentence = nlp([tokenized_sentence])
```
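For a plain-text call, the returned object can be inspected token by token using the same attributes as in the quick-start snippet; a minimal sketch (the model path is a placeholder):
```python
from combo.predict import COMBO

nlp = COMBO.from_pretrained("your_model.tar.gz")  # placeholder model path
sentence = nlp("Sentence to parse.")

# Each token carries the predicted annotations
for token in sentence.tokens:
    print(token.id, token.token, token.lemma, token.upostag, token.head, token.deprel)
```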
## COMBO as a command-line interface
### CoNLL-U file prediction:
Input and output are both in `*.conllu` format.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
```
### Raw text prediction:
Works for models where input was text-based only.
Input: one sentence per line.
...@@ -24,27 +45,20 @@ Output: List of token jsons.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format
```
### Console prediction:
Works for models where input was text-based only.
Interactive testing in the console (load the model and just type a sentence in the console).

```bash
combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
```

### Advanced

There are two tokenizers: whitespace-based and spaCy-based (the `en_core_web_sm` model).
Use either `--predictor_name combo` or `--predictor_name combo-spacy` (default tokenizer).
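For instance, to run raw text prediction with the spaCy-based tokenizer, combine the flag with the command shown above (file names are placeholders):
```bash
# Placeholder file names; --predictor_name selects the tokenizer
combo --mode predict --predictor_name combo-spacy \
      --model_path your_model_tar_gz \
      --input_file your_text_file --output_file your_output_file \
      --silent --noconllu_format
```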
...@@ -44,9 +44,19 @@ Examples (for clarity without training/validation data paths):
combo --mode train --targets head,deprel --features token,char,upostag
```
## Enhanced Dependencies
Enhanced Dependencies are described [here](https://universaldependencies.org/u/overview/enhanced-syntax.html). Training an enhanced graph prediction model **requires** data pre-processing.
### Data pre-processing
The organisers of the [IWPT 2020 shared task](https://universaldependencies.org/iwpt20/data.html) distributed the data sets together with the data pre-processing script `enhanced_collapse_empty_nodes.pl`. If you wish to train a model on IWPT 2020 data, apply this script to the training and validation data sets before training the COMBO EUD model.
```bash
perl enhanced_collapse_empty_nodes.pl training.conllu > training.fixed.conllu
```
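The validation file needs the same treatment (the file name below is a placeholder):
```bash
perl enhanced_collapse_empty_nodes.pl validation.conllu > validation.fixed.conllu
```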
### Training EUD model
```bash
combo --mode train \
...@@ -55,14 +65,7 @@ combo --mode train \
--targets feats,upostag,xpostag,head,deprel,lemma,deps \
--config_path config.graph.template.jsonnet
```
## Configuration
...