# Prediction
## COMBO as a Python library
The pre-trained models can be downloaded automatically with the `from_pretrained` method. Select a model name from the lists of UD-trained COMBO models and enhanced COMBO models, and pass it as the argument of `from_pretrained`:
```python
from combo.predict import COMBO

nlp = COMBO.from_pretrained("polish-herbert-base-ud29")
sentence = nlp("Sentence to parse.")
```
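The returned object holds the predicted annotations for each token. A minimal sketch of inspecting them; the field names (`tokens`, `lemma`, `upostag`, `head`, `deprel`) follow COMBO's token model, but may differ between versions:

```python
# Print the surface form, lemma, UPOS tag, head id and
# dependency relation predicted for each token.
for token in sentence.tokens:
    print(token.token, token.lemma, token.upostag, token.head, token.deprel)
```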
You can also load your own COMBO model:
```python
from combo.predict import COMBO

model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
sentence = nlp("Sentence to parse.")
```
COMBO also accepts pre-segmented sentences (or texts):
```python
from combo.predict import COMBO

model_path = "your_model.tar.gz"
nlp = COMBO.from_pretrained(model_path)
tokenized_sentence = ["Sentence", "to", "parse", "."]
sentence = nlp([tokenized_sentence])
```
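Several pre-segmented sentences can be passed in the same way, as a list of token lists; a minimal sketch, assuming the call returns one parsed object per input sentence (the example sentences are illustrative):

```python
# Each inner list is one pre-segmented sentence;
# the parser processes the whole batch in one call.
batch = [
    ["Sentence", "to", "parse", "."],
    ["Another", "pre-segmented", "sentence", "."],
]
sentences = nlp(batch)
```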
You can use COMBO with the LAMBO tokenizer (note: LAMBO must be installed first, see LAMBO installation).
```python
# Import COMBO and the LAMBO tokenizer
from combo.predict import COMBO
from combo.utils import lambo_tokenizer

# Download the model and attach the LAMBO tokenizer
nlp = COMBO.from_pretrained("english-bert-base-ud29", tokenizer=lambo_tokenizer.LamboTokenizer("en"))
sentences = nlp("This is the first sentence. This is the second sentence to parse.")
```
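Since LAMBO also splits the text into sentences, the result covers both sentences from the example above. A sketch of iterating over them, assuming the same `Sentence` objects as in the earlier examples:

```python
# One parsed Sentence per segment detected by LAMBO
for sentence in sentences:
    print([token.token for token in sentence.tokens])
```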
## COMBO as a command-line interface
CoNLL-U file prediction:

Input and output are both in the `*.conllu` format.

```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_conllu_file --output_file your_output_file --silent
```
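For reference, a minimal CoNLL-U input for the sentence above looks like this (ten tab-separated columns per token, with `_` for the fields the model fills in):

```
# text = Sentence to parse.
1	Sentence	_	_	_	_	_	_	_	_
2	to	_	_	_	_	_	_	_	_
3	parse	_	_	_	_	_	_	_	_
4	.	_	_	_	_	_	_	_	_
```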
Raw text prediction:

Works only for models trained on text-based input.

Input: one sentence per line.
Output: a list of token JSONs for each sentence.

```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --silent --noconllu_format
```
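The corresponding input file is plain text with one sentence per line, for example:

```
This is the first sentence.
This is the second sentence to parse.
```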
Console prediction:

Works only for models trained on text-based input.

Interactive testing in the console: the model is loaded once, and you simply type sentences at the prompt.

```bash
combo --mode predict --model_path your_model_tar_gz --input_file "-" --nosilent
```
## Advanced
There are two tokenizers: a whitespace tokenizer and a spaCy-based one (using the `en_core_web_sm` model). Select the whitespace tokenizer with `--predictor_name combo`, or the spaCy-based one with `--predictor_name combo-spacy` (the default).
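For example, to run raw text prediction with the whitespace tokenizer instead of the default (reusing the placeholder paths from the commands above):

```bash
combo --mode predict --model_path your_model_tar_gz --input_file your_text_file --output_file your_output_file --predictor_name combo --silent --noconllu_format
```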