Commit 196187ac authored by Alina Wróblewska

Documentation updated
```python
print("{:5} {:15} {:15} {:10} {:10} {:10}".format('ID', 'TOKEN', 'LEMMA', 'UPOS', 'HEAD', 'DEPREL'))
for token in sentence.tokens:
    print("{:5} {:15} {:15} {:10} {:10} {:10}".format(str(token.id), token.token, token.lemma, token.upostag, str(token.head), token.deprel))
```
## COMBO tutorial
We encourage you to use the [beginner's tutorial](https://colab.research.google.com/drive/1D1P4AiE40Cc_4SF3HY-Mz06JY0XMiEFs#scrollTo=6Teza7or_Qvw) (Colab notebook).
## Details
# Installation
```bash
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
```
### Conda example:
```bash
conda create -n combo python=3.8
conda activate combo
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.1
```
## Problems & solutions
* **jsonnet** installation error
# Models
COMBO provides pre-trained models for:
- morphosyntactic prediction (i.e. part-of-speech tagging, morphosyntactic analysis, lemmatisation and dependency parsing) trained on treebanks from the [Universal Dependencies repository](https://universaldependencies.org) ([Zeman et al. 2020](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424)),
- enhanced dependency parsing trained on IWPT 2020 shared task [data](https://universaldependencies.org/iwpt20/data.html) ([Bouma et al. 2020](https://www.aclweb.org/anthology/2020.iwpt-1.16.pdf)).
## Pre-trained models
All **pre-trained models** for different languages and their **evaluation results** are listed in the spreadsheets: [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
### License
Models are distributed under the same license as datasets used for their training.
See [Universal Dependencies v2.7 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7) and [Universal Dependencies v2.5 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5) for details.
## Automatic download
The pre-trained models can be automatically downloaded with the `from_pretrained` method in Python mode. Select a model name from the pre-trained model lists (see the column **Model name** in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324)) and pass the name as an argument to the `from_pretrained` method:
```python
from combo.predict import COMBO
nlp = COMBO.from_pretrained("polish-herbert-base")
```
If the model name doesn't match any model on the pre-trained model lists, COMBO looks for a model in the local environment.
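In that case the name is treated as a path to a local model archive, as in the following minimal sketch (the archive name `your_model.tar.gz` is a placeholder):
```python
from combo.predict import COMBO

# Load a locally stored model archive, e.g. one downloaded manually
# ("your_model.tar.gz" is a placeholder file name)
nlp = COMBO.from_pretrained("your_model.tar.gz")
```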
## Manual download
Pre-trained models can be downloaded manually to a local disk, e.g. with `wget`. You need to download a pre-trained model manually if you want to use COMBO in command-line mode. The links to the pre-trained models are listed in the column **Model link** in [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
```bash
wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
```
The path to the downloaded model should be passed as a parameter to COMBO in the CLI (see [prediction doc](prediction.md)).
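For example, a sketch of such a CLI call with the model downloaded above (input and output file names are placeholders; see [prediction doc](prediction.md) for all flags):
```bash
combo --mode predict --model_path polish-herbert-base.tar.gz --input_file your_file.conllu --output_file your_output.conllu --silent
```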
# Prediction
## COMBO as a Python library
The pre-trained models can be automatically downloaded with the `from_pretrained` method. Select a model name from the lists of [UD-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit?usp=sharing) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324), and pass it as an argument to `from_pretrained`.
```python
from combo.predict import COMBO
nlp = COMBO.from_pretrained("polish-herbert-base")
sentence = nlp("Sentence to parse.")
```
You can also load your own COMBO model:
```python
from combo.predict import COMBO
model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
sentence = nlp("Sentence to parse.")
```
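The returned object exposes the predicted annotations; here is a minimal sketch of inspecting them, using the token fields (`id`, `token`, `lemma`, `upostag`, `head`, `deprel`) shown in the README example:
```python
# Print the predicted annotations token by token
for token in sentence.tokens:
    print(token.id, token.token, token.lemma, token.upostag, token.head, token.deprel)
```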
COMBO also accepts pre-segmented sentences (or texts):
```python
from combo.predict import COMBO
model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
tokenized_sentence = ["Sentence", "to", "parse", "."]
sentence = nlp([tokenized_sentence])
```
## COMBO as a command-line interface
### CoNLL-U file prediction:
Input and output are both in `*.conllu` format.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
```
### Raw text prediction:
Works for models where input was text-based only.
Input: one sentence per line.
Output: list of token JSONs.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format
```
### Console prediction:
Works for models where input was text-based only.
Interactive testing in console (load model and just type sentence in console).
```bash
combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
```
### Advanced
There are 2 tokenizers: whitespace and spacy-based (`en_core_web_sm` model).
Use either `--predictor_name combo` or `--predictor_name combo-spacy` (default tokenizer).
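For instance, a sketch of raw text prediction with the spacy-based tokenizer (file names are placeholders):
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format --predictor_name combo-spacy
```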
# Training
Examples (for clarity without training/validation data paths):
```bash
combo --mode train --targets head,deprel --features token,char,upostag
```
## Enhanced Dependencies
Enhanced Dependencies are described [here](https://universaldependencies.org/u/overview/enhanced-syntax.html). Training an enhanced graph prediction model **requires** data pre-processing.
### Data pre-processing
The organisers of the [IWPT20 shared task](https://universaldependencies.org/iwpt20/data.html) distributed the data sets together with the pre-processing script `enhanced_collapse_empty_nodes.pl`. If you wish to train a model on the IWPT20 data, apply this script to the training and validation data sets before training the COMBO EUD model.
```bash
perl enhanced_collapse_empty_nodes.pl training.conllu > training.fixed.conllu
perl enhanced_collapse_empty_nodes.pl validation.conllu > validation.fixed.conllu
```
### Training EUD model
```bash
combo --mode train \
--targets feats,upostag,xpostag,head,deprel,lemma,deps \
--config_path config.graph.template.jsonnet
```
## Configuration