# COMBO
<p align="center">
A language-independent NLP system for dependency parsing, part-of-speech tagging, lemmatisation and more, built on top of PyTorch and AllenNLP and released under the GPL-3.0 license.
</p>
<hr/>
[Pre-trained models!](http://mozart.ipipan.waw.pl/~mklimaszewski/models/)
```python
import combo.predict as predict
nlp = predict.SemanticMultitaskPredictor.from_pretrained("polish-herbert-base")
sentence = nlp("Moje zdanie.")
print(sentence.tokens)
```
## Quick start
Clone this repository and install COMBO (we suggest using virtualenv/conda with Python 3.6+):
```bash
git clone https://github.com/ipipan/combo.git
cd combo
python setup.py develop
```
## Details
- [**Installation**](docs/installation.md)
- [**Pre-trained models**](docs/models.md)
- [**Training**](docs/training.md)
- [**Prediction**](docs/prediction.md)
# Installation
Clone this repository and install COMBO (we suggest using virtualenv/conda with Python 3.6+):
```bash
git clone https://github.com/ipipan/combo.git
cd combo
python setup.py develop
combo --helpfull  # prints the full list of configuration flags (verifies the installation)
```
## Problems & solutions
* **jsonnet** installation error: use `conda install -c conda-forge jsonnet=0.15.0`
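For example, a complete installation into a fresh conda environment could look roughly like this (the environment name and Python version are assumptions; the jsonnet pin is only needed if its build fails):
```bash
# assumed environment name and Python version; adjust to your setup
conda create -n combo python=3.6
conda activate combo
conda install -c conda-forge jsonnet=0.15.0   # only needed if the jsonnet build fails
git clone https://github.com/ipipan/combo.git
cd combo
python setup.py develop
```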
# Models
Pre-trained models are available [here](http://mozart.ipipan.waw.pl/~mklimaszewski/models/).
## Automatic download
The Python `from_pretrained` method downloads the pre-trained model automatically if the provided name (without the `.tar.gz` extension) matches one of the models listed [here](http://mozart.ipipan.waw.pl/~mklimaszewski/models/).
```python
import combo.predict as predict
nlp = predict.SemanticMultitaskPredictor.from_pretrained("polish-herbert-base")
```
Otherwise, it looks for the model locally (e.g. a path to a downloaded `.tar.gz` archive).
## Console prediction/Local model
If you want to use the console version of COMBO, you need to download a pre-trained model manually
```bash
wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
```
and pass it as a parameter (see [prediction doc](prediction.md)).
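The downloaded archive can then be passed via `--model_path`, for example for interactive console prediction (flags as in the [prediction doc](prediction.md)):
```bash
combo --mode predict --model_path polish-herbert-base.tar.gz --input_file "-" --nosilent
```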
# Prediction
## CoNLL-U file prediction
Input and output are both in `*.conllu` format.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
```
## Console
Works only for models whose input is raw text.
Interactive testing in the console: the model is loaded once and you type one sentence per line.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
```
## Raw text
Works only for models whose input is raw text.
Input: one sentence per line.
Output: a list of token JSONs for each sentence.
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format
```
### Advanced
There are two tokenizers: a whitespace tokenizer and a spaCy-based one (using the `en_core_web_sm` model).
Select one with either `--predictor_name semantic-multitask-predictor` (whitespace) or `--predictor_name semantic-multitask-predictor-spacy` (spaCy).
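For example, a raw-text prediction using the spaCy-based tokenizer differs from the default command only by the added `--predictor_name` flag:
```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --predictor_name semantic-multitask-predictor-spacy
```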
## Python
Run the following lines in your Python console to make predictions with a pre-trained model:
```python
import combo.predict as predict

model_path = "your_model.tar.gz"  # a local archive or a pre-trained model name, e.g. "polish-herbert-base"
nlp = predict.SemanticMultitaskPredictor.from_pretrained(model_path)
sentence = nlp("Sentence to parse.")
print(sentence.tokens)
```
# Training
Command:
```bash
combo --mode train \
--training_data_path your_training_path \
--validation_data_path your_validation_path
```
Options:
```bash
combo --helpfull
```
Examples (training/validation data paths omitted for clarity; a combined command is shown after the list):
* train on GPU 0:
```bash
combo --mode train --cuda_device 0
```
* use pretrained embeddings:
```bash
combo --mode train --pretrained_tokens your_pretrained_embeddings_path --embedding_dim your_embeddings_dim
```
* use pretrained transformer embeddings:
```bash
combo --mode train --pretrained_transformer_name your_chosen_pretrained_transformer
```
* predict only the dependency tree:
```bash
combo --mode train --targets head,deprel
```
* use part-of-speech tags as an input feature when predicting only the dependency tree:
```bash
combo --mode train --targets head,deprel --features token,char,upostag
```
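The flags above can be combined. As a sketch (all values are placeholders), the following trains a transformer-based model on GPU 0 that uses part-of-speech features and predicts only the dependency tree:
```bash
combo --mode train \
      --training_data_path your_training_path \
      --validation_data_path your_validation_path \
      --pretrained_transformer_name your_chosen_pretrained_transformer \
      --features token,char,upostag \
      --targets head,deprel \
      --cuda_device 0
```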
## Configuration
### Advanced
The config template [config.template.jsonnet](config.template.jsonnet) follows the `allennlp` configuration format, so you can modify it freely.
It exposes all training and model parameters (learning rates, number of epochs, etc.).
Some of them use `jsonnet` syntax to take their values from the command-line flags, but most can be edited directly in the file.
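Assuming your COMBO version exposes a config-file flag (the `--config_path` name below is an assumption; check `combo --helpfull` for the exact flag), a modified copy of the template could be plugged into training like this:
```bash
# hypothetical usage of a customised config; the --config_path flag name is an assumption
cp config.template.jsonnet my_config.jsonnet   # edit learning rate, number of epochs, etc. in the copy
combo --mode train \
      --config_path my_config.jsonnet \
      --training_data_path your_training_path \
      --validation_data_path your_validation_path
```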