PolDeepNer2
About
PolDeepNer2 is a tool for sequence labeling tasks based on the RoBERTa transformer. It ships with a set of pretrained models for Polish. The main features are:
- handles nested annotations (nkjp models),
- processes plain text using one of the following tokenizers: spaCy, NLTK, KRNNT,
- squeeze mode, in which multiple sentences from a single document are packed into the same batch,
- a set of NER models for Polish trained on the NKJP and KPWr corpora.
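The squeeze mode listed above can be illustrated with a short sketch. The greedy token-budget packing below is a simplified assumption for illustration, not the tool's actual implementation:

```python
def squeeze_batches(sentences, max_tokens):
    """Greedily pack consecutive sentences of one document into shared
    batch entries, up to a token budget (simplified illustration)."""
    batches, current, used = [], [], 0
    for sent in sentences:
        n = len(sent)
        # start a new batch entry when the budget would be exceeded
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(sent)
        used += n
    if current:
        batches.append(current)
    return batches
```

Packing short neighboring sentences together reduces padding and the number of forward passes, which is where the speedup of the sq models comes from.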
Authors
- Michał Marcińczuk
- Jarema Radom
Credits
This code is based on xlm-roberta-ner by mohammadKhalifa.
Setting up
Requirements
- Python 3.6
- CUDA 10.0+
- PyTorch
Virtual environment
sudo apt-get install python3-pip python3-dev python-virtualenv
sudo pip install -U pip
virtualenv -p python3.6 venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
spaCy
Required for the spacy and spacy-ext tokenizers.
python -m spacy download pl_core_news_sm
or
python -m pip install pl_core_news_sm-2.3.0.tar.gz
KRNNT
Required for the krnnt tokenizer.
docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0
Polish RoBERTa models
Download the Polish RoBERTa base model.
mkdir -p models/roberta_base_fairseq
wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_base_fairseq.zip
unzip roberta_base_fairseq.zip -d models/roberta_base_fairseq
rm roberta_base_fairseq.zip
Download the Polish RoBERTa large model.
mkdir -p models/roberta_large_fairseq
wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_large_fairseq.zip
unzip roberta_large_fairseq.zip -d models/roberta_large_fairseq
rm roberta_large_fairseq.zip
Pre-trained models
https://minio.clarin-pl.eu/minio/public/models/poldeepner2/
- base and large refer to roberta_base_fairseq and roberta_large_fairseq, respectively,
- sq indicates that the model should be used with the --squeeze option.
wget "https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip" -O models/nkjp_base_sq.zip
unzip models/nkjp_base_sq.zip -d models
N82 models
KPWr
Results on the test part of the KPWr n82 corpus.
Model | Precision | Recall | F1 | Time | Memory usage | GPU memory | Embeddings size |
---|---|---|---|---|---|---|---|
kpwr_n82_large | 77.05 | 78.79 | 77.91 | ~ 3.3 m | 3.0 GB | 3.8 GB | 0.71 GB + 1.40 GB |
kpwr_n82_base | 75.02 | 77.67 | 76.32 | ~ 1.5 m | 3.0 GB | 2.0 GB | 0.25 GB + 0.50 GB |
PolDeepNer (n82-elmo-kgr10) | 73.97 | 75.49 | 74.72 | ~ 4.0 m | 4.5 GB | - | 0.4 GB |
See detailed results.
N82 Summary (KPWr, CEN)
Model | Eval | Precision | Recall | F-measure | Support |
---|---|---|---|---|---|
kpwr_n82_base | KPWr | 75.02 | 77.67 | 76.32 | 4430 |
kpwr_n82_large | KPWr | 77.05 | 78.79 | 77.91 | 4430 |
cen_n82_base | CEN | 84.64 | 85.95 | 85.29 | 1423 |
cen_n82_large | CEN | 86.94 | 88.40 | 87.67 | 1423 |
Cross-corpora evaluation
Model | Eval | Precision | Recall | F-measure | Support |
---|---|---|---|---|---|
kpwr_n82_base | CEN | 80.90 | 81.87 | 81.38 | 1423 |
kpwr_n82_large | CEN | 80.16 | 82.08 | 81.11 | 1423 |
cen_n82_base | KPWr | 58.58 | 64.79 | 61.53 | 4430 |
cen_n82_large | KPWr | 61.38 | 66.66 | 63.91 | 4430 |
PolEval 2018
Download the gold dataset
wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O data/POLEVAL-NER_GOLD.json
Performance
Model | Score | Exact | Overlap | Score main | Test Time | Source |
---|---|---|---|---|---|---|
PolDeepNer2 (nkjp_base_sq, spacy-ext) | 91.4 | 89.9 | 92.7 | 94.00 | 2m 13s | |
PolDeepNer2 (nkjp_base, pre) | 90.0 | 87.7 | 90.5 | 92.40 | *6m 44s | |
PolDeepNer2 (nkjp_base, spacy-ext) | 89.8 | 87.4 | 90.4 | 92.20 | 8m 10s | |
Dadas and Protasiewicz, 2020 | 88.6 | 87.0 | 89.0 | - | link | |
Polish RoBERTa large | - | - | - | 89.98 | link | |
Polish RoBERTa base | - | - | - | 87.94 | link |
* Does not include tokenization time.
Evaluation
Evaluate on a pre-tokenized dataset (nkjp_base, pre)
time python process_poleval_pretokenized.py \
--input data/poleval2018ner-data/index.list \
--output poleval2018-predictions-pretokenized.json \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base \
--max_seq_length 256 \
--device cuda:0
python poleval_ner_test.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-pretokenized.json
Evaluate on raw text using spaCy tokenizer (nkjp_base, spaCy)
time python process_poleval.py \
--input data/POLEVAL-NER_GOLD.json \
--output poleval2018-predictions-spacy.json \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base \
--max_seq_length 256 \
--tokenization spacy-ext \
--device cuda:0
python poleval_ner_test.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-spacy.json
Evaluate on raw text using spaCy tokenizer (nkjp_base_sq, spaCy)
time python process_poleval.py \
--input data/POLEVAL-NER_GOLD.json \
--output poleval2018-predictions-spacy-sq.json \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base_sq \
--max_seq_length 256 \
--tokenization spacy-ext \
--squeeze \
--device cuda:0
Compute Score, Exact and Overlap:
python poleval_ner_test.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-spacy-sq.json
Compute Score main:
python poleval_ner_test_v2.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-spacy-sq.json \
--categories-main
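The two scoring scripts above compare predicted and gold annotation spans under an exact and an overlap criterion. A minimal sketch of the two matching rules (an illustration only, not the official PolEval scorer, which additionally weights the two scores and handles duplicate annotations):

```python
def exact_match(gold, pred):
    """Spans are (begin, end, label) triples; exact requires identity."""
    return gold == pred

def overlap_match(gold, pred):
    """Overlap requires the same label and at least one shared offset."""
    gb, ge, gl = gold
    pb, pe, pl = pred
    return gl == pl and max(gb, pb) < min(ge, pe)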
Training
Sample usage
Command:
python sample.py
Expected output:
--------------------
Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej.
0:11 nam_liv_person Marek Nowak
14:39 nam_org_organization Politechniki Wrocławskiej
57:66 nam_fac_road Sądeckiej
--------------------
#PoselAdamNowak Co Pan myśli na temat fuzji Orlenu i Lotosu?
6:15 nam_liv_person AdamNowak
44:50 nam_org_group_team Orlenu
53:59 nam_org_group_team Lotosu
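Each annotation line printed by sample.py follows a begin:end label text layout. A small parser sketch (the column layout is inferred from the sample output above):

```python
def parse_annotation(line):
    """Parse 'begin:end label text' into (begin, end, label, text)."""
    span, label, text = line.split(" ", 2)
    begin, end = span.split(":")
    return int(begin), int(end), label, text
```

The begin and end values are character offsets into the input sentence, so text == sentence[begin:end].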
Flask server
Run
python server.py \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base_sq/ \
--tokenization spacy-ext \
--device cuda:0 \
--squeeze
Process
curl -XPOST localhost:8000/predict -d "Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej."
Expected output:
{
"entities": [
{
"begin": 0,
"end": 11,
"label": "persName",
"text": "Marek Nowak"
},
{
"begin": 0,
"end": 5,
"label": "persName_forename",
"text": "Marek"
},
{
"begin": 6,
"end": 11,
"label": "persName_surname",
"text": "Nowak"
},
{
"begin": 14,
"end": 39,
"label": "orgName",
"text": "Politechniki Wroc\u0142awskiej"
},
{
"begin": 27,
"end": 39,
"label": "placeName_settlement",
"text": "Wroc\u0142awskiej"
},
{
"begin": 53,
"end": 67,
"label": "geogName",
"text": "ul. S\u0105deckiej."
}
],
"text": "Marek Nowak z Politechniki Wroc\u0142awskiej mieszka przy ul. S\u0105deckiej."
}
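The JSON response above can be consumed programmatically. A client sketch using only the standard library, assuming the server is running on localhost:8000 as started above:

```python
import json
from urllib import request

def predict(text, url="http://localhost:8000/predict"):
    """POST raw text to the PolDeepNer2 Flask server and parse the reply."""
    req = request.Request(url, data=text.encode("utf-8"), method="POST")
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def entity_spans(response):
    """Extract (begin, end, label) triples from a /predict response."""
    return [(e["begin"], e["end"], e["label"]) for e in response["entities"]]
```

The begin/end offsets index into the returned text field, so response["text"][begin:end] recovers each entity's surface form.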
Training
The code expects the directory passed via --data_dir to contain three dataset splits: train.txt, valid.txt and test.txt.
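A quick layout check before launching a long training run can catch a missing split early. A minimal sketch (the split filenames come from the paragraph above; everything else is illustrative):

```python
import os

def missing_splits(data_dir):
    """Return the names of expected split files absent from data_dir."""
    expected = ["train.txt", "valid.txt", "test.txt"]
    return [f for f in expected
            if not os.path.isfile(os.path.join(data_dir, f))]
```

An empty return value means the directory is ready for main.py.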
KPWr
time python main.py \
--data_dir=data/kpwr_n82/ \
--task_name=ner \
--output_dir=models/kpwr_n82_base/ \
--max_seq_length=128 \
--num_train_epochs 30 \
--do_eval \
--warmup_proportion=0.0 \
--pretrained_path models/roberta_base_fairseq \
--learning_rate 6e-5 \
--gradient_accumulation_steps 4 \
--do_train \
--eval_on test \
--train_batch_size 32 \
--dropout 0.3
CEN
time python main.py \
--data_dir=data/cen_n82/ \
--task_name=ner \
--output_dir=models/cen_n82_large/ \
--max_seq_length=512 \
--num_train_epochs 30 \
--do_eval \
--warmup_proportion=0.0 \
--pretrained_path models/roberta_large_fairseq \
--learning_rate 6e-5 \
--gradient_accumulation_steps 4 \
--do_train \
--eval_on test \
--train_batch_size 32 \
--dropout 0.3
PolEval 2018
time python main.py \
--data_dir=data/nkjp-nested-full-aug/ \
--task_name=ner \
--output_dir=models/nkjp_base_sq/ \
--max_seq_length=256 \
--num_train_epochs 10 \
--do_eval \
--warmup_proportion=0.0 \
--pretrained_path models/roberta_base_fairseq \
--learning_rate 6e-5 \
--gradient_accumulation_steps 4 \
--do_train \
--eval_on test \
--train_batch_size 32 \
--dropout 0.3 \
--squeeze
Docker
To build the base image:
docker build -f Dockerfiles/base/Dockerfile . --tag poldeepner2
To build a specific model image on top of the base image:
docker build -f Dockerfiles/nkjp_base_sq/Dockerfile . --tag poldeepner2_nkjp_base_sq
To run a container with the chosen model:
docker run --publish 8000:8000 poldeepner2_nkjp_base_sq
HerBERT
time python main.py \
--data_dir=data/nkjp-nested-full-aug/ \
--task_name=ner \
--output_dir=models/nkjp_base_sq/ \
--max_seq_length=256 \
--num_train_epochs 10 \
--do_eval \
--warmup_proportion=0.0 \
--pretrained_path models/roberta_base_fairseq \
--learning_rate 6e-5 \
--gradient_accumulation_steps 4 \
--do_train \
--eval_on test \
--train_batch_size 32 \
--dropout 0.3 \
--model=Herbert \
--squeeze