Commit aaea8f55 authored by Mateusz Klimaszewski

Split documentation into multiple markdown files.

parent 8fba2edd
# COMBO
<p align="center">
A language-independent NLP system for dependency parsing, part-of-speech tagging, lemmatisation and more, built on top of PyTorch and AllenNLP.
</p>
<hr/>
[Pre-trained models!](http://mozart.ipipan.waw.pl/~mklimaszewski/models/)
## Quick start
Clone this repository and install COMBO (we suggest using virtualenv/conda with Python 3.6+):
```bash
git clone https://github.com/ipipan/combo.git
cd combo
python setup.py develop
```
Run the following lines in your Python console to make predictions with a pre-trained model:
```python
import combo.predict as predict
nlp = predict.SemanticMultitaskPredictor.from_pretrained("polish-herbert-base")
sentence = nlp("Moje zdanie.")
print(sentence.tokens)
```
## Details
- [**Installation**](docs/installation.md)
- [**Pre-trained models**](docs/models.md)
- [**Training**](docs/training.md)
- [**Prediction**](docs/prediction.md)
# Installation
Clone this repository and install COMBO (we suggest using virtualenv/conda with Python 3.6+):
```bash
git clone https://github.com/ipipan/combo.git
cd combo
python setup.py develop
# verify the installation and list all available flags
combo --helpfull
```
## Problems & solutions
* **jsonnet** installation error: use `conda install -c conda-forge jsonnet=0.15.0`
# Models
Pre-trained models are available [here](http://mozart.ipipan.waw.pl/~mklimaszewski/models/).
## Automatic download
The Python `from_pretrained` method downloads the pre-trained model automatically if the provided name (without the `.tar.gz` extension) matches one of the names listed [here](http://mozart.ipipan.waw.pl/~mklimaszewski/models/).
```python
import combo.predict as predict
nlp = predict.SemanticMultitaskPredictor.from_pretrained("polish-herbert-base")
```
Otherwise, it looks for the model locally.
## Console prediction/Local model
If you want to use the console version of COMBO, you need to download a pre-trained model manually:
```bash
wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
```
and pass it as a parameter (see [prediction doc](prediction.md)).
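Putting the two steps together, a typical session might look like the sketch below (input and output paths are placeholders; the prediction flags are documented in the [prediction doc](prediction.md)):
```bash
# download the model archive once
wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz

# run prediction with the local archive
combo --mode predict \
      --model_path polish-herbert-base.tar.gz \
      --input_file your_conllu_file \
      --output_file your_output_file \
      --silent
```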
# Prediction
## CoNLL-U file prediction
Input and output are both in `*.conllu` format.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
```
## Console
Works only for models trained on text-based input.
Interactive testing in the console: load the model and simply type a sentence.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
```
## Raw text
Works only for models trained on text-based input.
Input: one sentence per line.
Output: a list of token JSONs.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format
```
### Advanced
There are two tokenizers: whitespace-based and spaCy-based (`en_core_web_sm` model).
Select one with either `--predictor_name semantic-multitask-predictor` (whitespace) or `--predictor_name semantic-multitask-predictor-spacy`.
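For example, a raw-text run with the spaCy-based tokenizer might look like the following sketch (file paths are placeholders):
```bash
combo --mode predict \
      --model_path your_model_tar_gz \
      --input_file your_text_file \
      --output_file your_output_file \
      --predictor_name semantic-multitask-predictor-spacy \
      --silent
```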
## Python
```python
import combo.predict as predict

# path to a manually downloaded model archive (see models.md)
model_path = "your_model.tar.gz"
nlp = predict.SemanticMultitaskPredictor.from_pretrained(model_path)
sentence = nlp("Sentence to parse.")
print(sentence.tokens)
```
# Training
Command:
```bash
combo --mode train \
--training_data_path your_training_path \
--validation_data_path your_validation_path
```
Options:
```bash
combo --helpfull
```
Examples (training and validation data paths omitted for clarity; a combined command is sketched after the list):
* train on GPU 0:
```bash
combo --mode train --cuda_device 0
```
* use pretrained embeddings:
```bash
combo --mode train --pretrained_tokens your_pretrained_embeddings_path --embedding_dim your_embeddings_dim
```
* use pretrained transformer embeddings:
```bash
combo --mode train --pretrained_transformer_name your_chosen_pretrained_transformer
```
* predict only the dependency tree:
```bash
combo --mode train --targets head,deprel
```
* use part-of-speech tags for predicting only the dependency tree:
```bash
combo --mode train --targets head,deprel --features token,char,upostag
```
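These flags can be combined; for example, a single command that trains on GPU 0 and predicts only the dependency tree from word, character and part-of-speech features might look like the following sketch (data paths are placeholders):
```bash
combo --mode train \
      --training_data_path your_training_path \
      --validation_data_path your_validation_path \
      --cuda_device 0 \
      --targets head,deprel \
      --features token,char,upostag
```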
## Configuration
### Advanced
The config template [config.template.jsonnet](config.template.jsonnet) follows the `allennlp` configuration format, so you can modify it freely.
It contains the configuration of all training and model parameters (learning rates, number of epochs, etc.).
Some of them use `jsonnet` syntax to take their values from the command-line flags, but most can also be edited directly in the file.
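A typical workflow, sketched below, is to copy the template, edit the parameters you need, and point training at the modified file; the `--config_path` flag name used here is an assumption (it is not documented on this page), so check `combo --helpfull` for the exact name.
```bash
# copy the template and edit the parameters you need
cp config.template.jsonnet my_config.jsonnet

# train with the customised configuration
# NOTE: --config_path is an assumed flag name; verify it with `combo --helpfull`
combo --mode train \
      --config_path my_config.jsonnet \
      --training_data_path your_training_path \
      --validation_data_path your_validation_path
```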