COMBO

A GPL-3.0 system, built on top of PyTorch and AllenNLP, for morphosyntactic analysis.


Pre-trained models!

import combo.predict as predict
nlp = predict.SemanticMultitaskPredictor.from_pretrained("polish-herbert-base")
sentence = nlp("Moje zdanie.")
print(sentence.tokens)

Installation

Clone this repository and run:

python setup.py develop
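
For example, a complete installation might look like this (<repository_url> is a placeholder for this repository's clone URL):

git clone <repository_url> combo
cd combo
python setup.py develop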

Problems & solutions

  • jsonnet installation error

use conda install -c conda-forge jsonnet=0.15.0

Training

Command:

combo --mode train \
      --training_data_path your_training_path \
      --validation_data_path your_validation_path

Options:

combo --helpfull

Examples (training/validation data paths are omitted for clarity; a combined command follows the list):

  • train on GPU 0:

    combo --mode train --cuda_device 0
  • use pretrained embeddings:

    combo --mode train --pretrained_tokens your_pretrained_embeddings_path --embedding_dim your_embeddings_dim
  • use pretrained transformer embeddings:

    combo --mode train --pretrained_transformer_name your_chosen_pretrained_transformer
  • predict only the dependency tree:

    combo --mode train --targets head,deprel
  • use part-of-speech tags when predicting only the dependency tree:

    combo --mode train --targets head,deprel --features token,char,upostag
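
The flags above can be combined in a single command, for example to train on GPU 0 with pretrained transformer embeddings and predict only the dependency tree (paths and the transformer name are placeholders):

combo --mode train \
      --training_data_path your_training_path \
      --validation_data_path your_validation_path \
      --cuda_device 0 \
      --pretrained_transformer_name your_chosen_pretrained_transformer \
      --targets head,deprel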

Advanced configuration: see the Configuration section below.

Prediction

CoNLL-U file prediction:

Input and output are both in *.conllu format.

combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent

Console

Works only for models whose input is raw text.

Interactive testing in the console (the model is loaded and you simply type sentences to parse).

combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent

Raw text

Works only for models whose input is raw text.

Input: one sentence per line.

Output: a list of token JSONs for each sentence.

combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent

Advanced

There are two tokenizers: whitespace and spaCy-based (the en_core_web_sm model).

Use either --predictor_name semantic-multitask-predictor or --predictor_name semantic-multitask-predictor-spacy to select one, as in the example below.
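
For example, raw text prediction with the spaCy-based tokenizer combines the flags documented above (file paths are placeholders):

combo --mode predict \
      --model_path your_model_tar_gz \
      --input_file your_text_file \
      --output_file your_output_file \
      --predictor_name semantic-multitask-predictor-spacy \
      --silent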

Python

import combo.predict as predict

model_path = "your_model.tar.gz"
nlp = predict.SemanticMultitaskPredictor.from_pretrained(model_path)
sentence = nlp("Sentence to parse.")

Configuration

Advanced

The config template config.template.jsonnet follows the AllenNLP configuration format, so you can modify it freely. It contains settings for all training/model parameters (learning rates, number of epochs, etc.). Some of them use jsonnet syntax to take their values from the configuration flags, but most can be edited directly in the file.
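
Training with a modified template then comes down to pointing the CLI at your config file. The --config_path flag in the sketch below is an assumption (it is not listed above); verify the exact flag name with combo --helpfull:

combo --mode train --config_path your_modified_config.jsonnet   # --config_path is assumed; confirm with combo --helpfull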