# Training
Basic command:
```bash
combo --mode train \
--training_data_path your_training_path \
--validation_data_path your_validation_path
```
Options:
```bash
combo --helpfull
```
Examples (the training/validation data paths are omitted for brevity):
* train on GPU 0:
```bash
combo --mode train --cuda_device 0
```
* use pretrained embeddings:
```bash
combo --mode train --pretrained_tokens your_pretrained_embeddings_path --embedding_dim your_embeddings_dim
```
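For instance, with locally stored 100-dimensional GloVe vectors (the file name here is only an example; `--embedding_dim` must match the dimensionality of the vectors in the file):

```bash
EMBEDDING_DIM=100                      # must match the vector size in the embeddings file
combo --mode train \
    --pretrained_tokens glove.6B.100d.txt \
    --embedding_dim "$EMBEDDING_DIM"
```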
* use pretrained transformer embeddings:
```bash
combo --mode train --pretrained_transformer_name your_chosen_pretrained_transformer
```
* train only a dependency parser:
```bash
combo --mode train --targets head,deprel
```
* use additional features (e.g. part-of-speech tags) when training a dependency parser (`token` and `char` are the default features):
```bash
combo --mode train --targets head,deprel --features token,char,upostag
```
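The flags above can be freely combined. A sketch of a full invocation (the data paths are hypothetical; it assumes `combo` is on your `PATH` and GPU 0 is available):

```bash
TRAIN=en_ewt-ud-train.conllu           # hypothetical training data path
DEV=en_ewt-ud-dev.conllu               # hypothetical validation data path

combo --mode train \
    --training_data_path "$TRAIN" \
    --validation_data_path "$DEV" \
    --cuda_device 0 \
    --targets head,deprel \
    --features token,char,upostag
```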
## Enhanced Dependencies
Enhanced Dependencies are described [here](https://universaldependencies.org/u/overview/enhanced-syntax.html). Training an enhanced graph prediction model **requires** data pre-processing.
### Data pre-processing
The organisers of the [IWPT20 shared task](https://universaldependencies.org/iwpt20/data.html) distributed the data sets together with a data pre-processing script, `enhanced_collapse_empty_nodes.pl`. If you wish to train a model on the IWPT20 data, apply this script to both the training and validation data sets before training the COMBO EUD model.
```bash
perl enhanced_collapse_empty_nodes.pl training.conllu > training.fixed.conllu
```
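Since the script has to be applied to both splits, a small loop keeps the two invocations consistent (the file names are examples, and the script is assumed to sit in the current directory):

```bash
for infile in training.conllu validation.conllu; do
  outfile="${infile%.conllu}.fixed.conllu"   # e.g. training.fixed.conllu
  perl enhanced_collapse_empty_nodes.pl "$infile" > "$outfile"
done
```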
### Training EUD model
```bash
combo --mode train \
--training_data_path your_preprocessed_training_path \
--validation_data_path your_preprocessed_validation_path \
--targets feats,upostag,xpostag,head,deprel,lemma,deps \
--config_path config.graph.template.jsonnet
```
## Configuration
### Advanced
The config template [config.template.jsonnet](config.template.jsonnet) follows the `allennlp` configuration format, so you can modify it freely.
It defines all of the training/model parameters (learning rates, number of epochs, etc.).
Some values are pulled from command-line flags via `jsonnet` syntax, but most can be edited directly in the template.
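To customise parameters that have no command-line flag, one common pattern is to edit a copy of the template and point `combo` at it via `--config_path` (the name of the copy is an example):

```bash
CONFIG=my_config.jsonnet               # your edited copy of the template
cp config.template.jsonnet "$CONFIG"
combo --mode train \
    --training_data_path your_training_path \
    --validation_data_path your_validation_path \
    --config_path "$CONFIG"
```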