Description

The script train_ud_version.py trains multiple combo models on a specific UD treebank version. The script requires three parameters:

  • output_directory - path to the location where training results will be saved
  • treebank_id
  • treebank_version

To find the treebank_id and treebank_version, visit https://universaldependencies.org/#download. The treebank_version is the version number given at the start of the UD release entry, and the treebank_id is the value at the end of the link used to download that release. See the attached image, where both values for UD 2.11 are highlighted in yellow.

Where to find treebank_id and treebank_version

The script will automatically download and extract the UD data into the folder output_directory/ud_treebanks-treebank_version. Then, it creates a subfolder output_directory/results containing:

  • serialization_directories - folder with training results
  • completed_training.txt - a text file with the names of UD treebanks on which training was successfully completed
  • skipped_training.csv - a CSV file with two columns: the first contains the names of UD treebanks, the second lists the reason why training was skipped or failed. Possible reasons include:
    • Dev or test or train file missing - the UD treebank directory is expected to contain .conllu files with train, dev, and test in their names. If any of them is missing, this error is reported.
    • Training file less than 1000 bytes - if the training file is smaller than 1000 bytes, training is skipped.
    • Training file corrupted - the training file has fewer than 10 columns.
    • Specify transformer model for language code: <lang_code> - No BERT model was assigned to the specified language. To address this, modify the LANG2TRANSFORMER variable in the file constants.
    • Command ... returned non-zero exit status 1 - An error was thrown during the training process. You need to examine logs from this particular training to understand what happened.
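
The three file checks listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names find_split and skip_reason are hypothetical, and it assumes the column check applies to tab-separated token lines (ignoring comment and blank lines).

```python
from pathlib import Path
from typing import Optional

MIN_TRAIN_BYTES = 1000  # size threshold stated in the list above
CONLLU_COLUMNS = 10     # CoNLL-U token lines have 10 tab-separated fields


def find_split(treebank_dir: Path, split: str) -> Optional[Path]:
    """Return the first .conllu file whose name contains the split name."""
    matches = sorted(p for p in treebank_dir.glob("*.conllu") if split in p.name)
    return matches[0] if matches else None


def skip_reason(treebank_dir: Path) -> Optional[str]:
    """Return a reason to skip the treebank, or None if it looks trainable."""
    splits = {name: find_split(treebank_dir, name) for name in ("train", "dev", "test")}
    if any(path is None for path in splits.values()):
        return "Dev or test or train file missing"

    train = splits["train"]
    if train.stat().st_size < MIN_TRAIN_BYTES:
        return "Training file less than 1000 bytes"

    with train.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Comment lines start with '#'; blank lines separate sentences.
            if not line or line.startswith("#"):
                continue
            if len(line.split("\t")) < CONLLU_COLUMNS:
                return "Training file corrupted"
    return None
```

Running a check like this on a treebank folder before starting a long job can help predict which entries would end up in skipped_training.csv.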

If the script is interrupted at some point, you can rerun it with the same command. Based on the entries in completed_training.txt, the rerun script will omit training on UD treebanks that already have a model.
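
The resume behaviour can be pictured with a short sketch. It assumes that completed_training.txt holds one treebank name per line, which the description above does not confirm; the helper names load_completed, pending_treebanks, and mark_completed are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Set

COMPLETED_FILE = "completed_training.txt"


def load_completed(results_dir: Path) -> Set[str]:
    """Read names of already trained treebanks (assumed: one name per line)."""
    path = results_dir / COMPLETED_FILE
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_treebanks(all_treebanks: Iterable[str], results_dir: Path) -> List[str]:
    """Keep only the treebanks that are not yet listed as completed."""
    done = load_completed(results_dir)
    return [name for name in all_treebanks if name not in done]


def mark_completed(results_dir: Path, treebank: str) -> None:
    """Append a treebank name once its training has finished successfully."""
    results_dir.mkdir(parents=True, exist_ok=True)
    with (results_dir / COMPLETED_FILE).open("a", encoding="utf-8") as handle:
        handle.write(treebank + "\n")
```

Appending a name only after training succeeds means that an interrupted run leaves the unfinished treebank out of the file, so it is picked up again on the next run.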

Some models need an adjusted value of word_batch_size. The default value is used unless you specify a <word_batch_size> pair in the UD_2_BATCH_SIZE constant in the file constants.
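
As an illustration only, the two constants mentioned in this README could be used as plain dictionaries with a fallback. The dictionary shapes, the example entries, the default value 2500, and the helper names transformer_for and word_batch_size_for are all assumptions made for this sketch; check the file constants for the real structure.

```python
from typing import Dict

# Assumed shape: language code -> transformer model name (illustrative entries).
LANG2TRANSFORMER: Dict[str, str] = {
    "en": "bert-base-cased",
    "pl": "allegro/herbert-base-cased",
}

# Assumed shape: treebank name -> word_batch_size (illustrative value).
UD_2_BATCH_SIZE: Dict[str, int] = {
    "UD_Polish-PDB": 1500,
}

DEFAULT_WORD_BATCH_SIZE = 2500  # hypothetical default, not taken from the script


def transformer_for(lang_code: str) -> str:
    """Look up the model for a language, failing with the message quoted above."""
    try:
        return LANG2TRANSFORMER[lang_code]
    except KeyError:
        raise ValueError(
            f"Specify transformer model for language code: {lang_code}"
        ) from None


def word_batch_size_for(treebank: str) -> int:
    """Use the per-treebank override if present, else the default."""
    return UD_2_BATCH_SIZE.get(treebank, DEFAULT_WORD_BATCH_SIZE)
```

Under these assumptions, adding a language or a batch-size override is a one-line change to the corresponding dictionary.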

Example usage

Terminal command:

python train_ud_version.py --treebank_id 1-5287 --treebank_version 2.13 --output_directory C:\Users\abc\Desktop