# Training
Basic command:
```bash
combo --mode train \
--training_data_path your_training_path \
--validation_data_path your_validation_path
```
Options:
```bash
combo --helpfull
```
Examples (the training/validation data paths are omitted for brevity):
* train on GPU 0:
```bash
combo --mode train --cuda_device 0
```
* use pretrained embeddings:
```bash
combo --mode train --pretrained_tokens your_pretrained_embeddings_path --embedding_dim your_embeddings_dim
```
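For instance, with locally stored 100-dimensional GloVe vectors (the file name here is only an example; `--embedding_dim` must match the dimensionality of the vectors in the file):

```bash
EMBEDDING_DIM=100                      # must match the vector size in the embeddings file
combo --mode train \
    --pretrained_tokens glove.6B.100d.txt \
    --embedding_dim "$EMBEDDING_DIM"
```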
* use pretrained transformer embeddings:
```bash
combo --mode train --pretrained_transformer_name your_chosen_pretrained_transformer
```
* train only a dependency parser:
```bash
combo --mode train --targets head,deprel
```
* use additional features (e.g. part-of-speech tags) when training a dependency parser (`token` and `char` are the default features):
```bash
combo --mode train --targets head,deprel --features token,char,upostag
```
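The flags above can be freely combined. A sketch of a full invocation (the data paths are hypothetical; it assumes `combo` is on your `PATH` and GPU 0 is available):

```bash
TRAIN=en_ewt-ud-train.conllu           # hypothetical training data path
DEV=en_ewt-ud-dev.conllu               # hypothetical validation data path

combo --mode train \
    --training_data_path "$TRAIN" \
    --validation_data_path "$DEV" \
    --cuda_device 0 \
    --targets head,deprel \
    --features token,char,upostag
```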
## Enhanced Dependencies
Enhanced Dependencies are described [here](https://universaldependencies.org/u/overview/enhanced-syntax.html). Training an enhanced graph prediction model **requires** data pre-processing.
### Data pre-processing
The organisers of the [IWPT20 shared task](https://universaldependencies.org/iwpt20/data.html) distributed the data sets together with a data pre-processing script, `enhanced_collapse_empty_nodes.pl`. If you wish to train a model on the IWPT20 data, apply this script to both the training and validation data sets before training the COMBO EUD model.
```bash
perl enhanced_collapse_empty_nodes.pl training.conllu > training.fixed.conllu
```
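Since the script has to be applied to both splits, a small loop keeps the two invocations consistent (the file names are examples, and the script is assumed to sit in the current directory):

```bash
for infile in training.conllu validation.conllu; do
  outfile="${infile%.conllu}.fixed.conllu"   # e.g. training.fixed.conllu
  perl enhanced_collapse_empty_nodes.pl "$infile" > "$outfile"
done
```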
### Training EUD model
```bash
combo --mode train \
--training_data_path your_preprocessed_training_path \
--validation_data_path your_preprocessed_validation_path \
--targets feats,upostag,xpostag,head,deprel,lemma,deps \
--config_path config.graph.template.jsonnet
```
## Configuration
### Advanced
The config template [config.template.jsonnet](config.template.jsonnet) follows the `allennlp` configuration format, so you can modify it freely.
It defines all of the training/model parameters (learning rates, number of epochs, etc.).
Some values are pulled from command-line flags via `jsonnet` syntax, but most can be edited directly in the template.
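To customise parameters that have no command-line flag, one common pattern is to edit a copy of the template and point `combo` at it via `--config_path` (the name of the copy is an example):

```bash
CONFIG=my_config.jsonnet               # your edited copy of the template
cp config.template.jsonnet "$CONFIG"
combo --mode train \
    --training_data_path your_training_path \
    --validation_data_path your_validation_path \
    --config_path "$CONFIG"
```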