PolDeepNer2
About
PolDeepNer2 is a tool for sequence labeling tasks based on the RoBERTa transformer. It offers a set of pretrained models for Polish. The main features are:
- handles nested annotations (nkjp models),
- can process plain text using one of the following tokenizers: spaCy, KRNNT, and two custom split-based tokenizers (spaces and fast),
- squeeze mode: multiple sentences from a single document can be packed into the same batch,
- a set of NER models for Polish trained on the NKJP and KPWr corpora.
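The squeeze mode listed above can be illustrated with a short sketch: greedily pack consecutive tokenized sentences from one document into shared chunks that stay under the sequence-length limit. This is a conceptual illustration only, not the toolkit's actual implementation (which also has to account for subword tokenization and special tokens).

```python
def squeeze(sentences, max_seq_length):
    """Greedily group tokenized sentences into chunks whose total
    token count does not exceed max_seq_length (illustration only)."""
    chunks, current, length = [], [], 0
    for sentence in sentences:
        if current and length + len(sentence) > max_seq_length:
            chunks.append(current)
            current, length = [], 0
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(current)
    return chunks

# Three 100-token sentences with a 256-token limit: the first two
# share a chunk, the third starts a new one.
sents = [["tok"] * 100, ["tok"] * 100, ["tok"] * 100]
print([len(chunk) for chunk in squeeze(sents, 256)])  # [2, 1]
```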
Authors
- Michał Marcińczuk (michal.marcinczuk@pwr.edu.pl, marcinczuk@gmail.com)
- Jarema Radom
Setting up
Requirements
- Python 3.8
- CUDA 10.0+
- PyTorch 1.9
Virtual environment
venv
sudo apt-get install python3-pip python3-dev python-virtualenv
sudo pip install -U pip
virtualenv -p python3.8 venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
Conda
conda create -n pdn2 python=3.8
conda activate pdn2
conda install -c anaconda cudatoolkit=10.2
conda install -c anaconda cudnn
pip install -r requirements.txt
Tokenization methods
spaCy
Required for the spacy and spacy-ext tokenizers.
python -m spacy download pl_core_news_sm
or
python -m pip install pl_core_news_sm-2.3.0.tar.gz
KRNNT
Required for the krnnt tokenizer.
docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0
Polish RoBERTa models
Download the Polish RoBERTa base model.
mkdir models/roberta_base_fairseq -p
wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_base_fairseq.zip
unzip roberta_base_fairseq.zip -d models/roberta_base_fairseq
rm roberta_base_fairseq.zip
Download the Polish RoBERTa large model.
mkdir models/roberta_large_fairseq -p
wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_large_fairseq.zip
unzip roberta_large_fairseq.zip -d models/roberta_large_fairseq
rm roberta_large_fairseq.zip
Lemmatization
The lemmatization module requires the KRNNT and Polem web services to be up and running.
Setup Polem WS
ToDo
Usage
ToDo
Pre-trained models
https://minio.clarin-pl.eu/minio/public/models/poldeepner2/
- base and large refer to roberta_base_fairseq and roberta_large_fairseq, respectively,
- sq indicates that the model should be used with the --squeeze option.
wget "https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip" -O models/nkjp_base_sq.zip
unzip models/nkjp_base_sq.zip -d models
N82 models
KPWr
Results on the test part of the KPWr n82 corpus.
Model | Precision | Recall | F1 | Time | Memory usage | GPU memory | Embeddings size |
---|---|---|---|---|---|---|---|
kpwr_n82_large | 77.05 | 78.79 | 77.91 | ~ 3.3 m | 3.0 GB | 3.8 GB | 0.71 GB + 1.40 GB |
kpwr_n82_base | 75.02 | 77.67 | 76.32 | ~ 1.5 m | 3.0 GB | 2.0 GB | 0.25 GB + 0.50 GB |
PolDeepNer (n82-elmo-kgr10) | 73.97 | 75.49 | 74.72 | ~ 4.0 m | 4.5 GB | - | 0.4 GB |
See detailed results.
N82 Summary (KPWr, CEN)
Model | Eval | Precision | Recall | F-measure | Support |
---|---|---|---|---|---|
kpwr_n82_base | KPWr | 75.02 | 77.67 | 76.32 | 4430 |
kpwr_n82_large | KPWr | 77.05 | 78.79 | 77.91 | 4430 |
cen_n82_base | CEN | 84.64 | 85.95 | 85.29 | 1423 |
cen_n82_large | CEN | 86.94 | 88.40 | 87.67 | 1423 |
Cross-corpora evaluation
Model | Eval | Precision | Recall | F-measure | Support |
---|---|---|---|---|---|
kpwr_n82_base | CEN | 80.90 | 81.87 | 81.38 | 1423 |
kpwr_n82_large | CEN | 80.16 | 82.08 | 81.11 | 1423 |
cen_n82_base | KPWr | 58.58 | 64.79 | 61.53 | 4430 |
cen_n82_large | KPWr | 61.38 | 66.66 | 63.91 | 4430 |
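For reference, the F-measure column in the tables above is the harmonic mean of precision and recall:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

# Reproduces the kpwr_n82 rows above:
print(round(f_measure(77.05, 78.79), 2))  # 77.91
print(round(f_measure(75.02, 77.67), 2))  # 76.32
```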
PolEval 2018
Download the gold dataset
wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O data/POLEVAL-NER_GOLD.json
Performance
Model | Score | Exact | Overlap | Score main | Test Time | Source |
---|---|---|---|---|---|---|
PolDeepNer2 (nkjp_base_sq, spacy-ext) | 91.4 | 89.9 | 92.7 | 94.00 | 2m 13s | |
PolDeepNer2 (nkjp_base, pre) | 90.0 | 87.7 | 90.5 | 92.40 | *6m 44s | |
PolDeepNer2 (nkjp_base, spacy-ext) | 89.8 | 87.4 | 90.4 | 92.20 | 8m 10s | |
Dadas and Protasiewicz, 2020 | 88.6 | 87.0 | 89.0 | - | link | |
Polish RoBERTa large | - | - | - | 89.98 | link | |
Polish RoBERTa base | - | - | - | 87.94 | link |
* Does not include tokenization time.
Evaluation
Evaluate on a pre-tokenized dataset (nkjp_base, pre)
time python process_poleval_pretokenized.py \
--input data/poleval2018ner-data/index.list \
--output poleval2018-predictions-pretokenized.json \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base \
--max_seq_length 256 \
--device cuda:0
python poleval_ner_test.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-pretokenized.json
Evaluate on raw text using spaCy tokenizer (nkjp_base, spaCy)
time python process_poleval.py \
--input data/POLEVAL-NER_GOLD.json \
--output poleval2018-predictions-spacy.json \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base \
--max_seq_length 256 \
--tokenization spacy-ext \
--device cuda:0
python poleval_ner_test.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-spacy.json
Evaluate on raw text using spaCy tokenizer (nkjp_base_sq, spaCy)
time python process_poleval.py \
--input data/POLEVAL-NER_GOLD.json \
--output poleval2018-predictions-spacy-sq.json \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base_sq \
--max_seq_length 256 \
--tokenization spacy-ext \
--squeeze \
--device cuda:0
Score, Exact, and Overlap:
python poleval_ner_test.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-spacy-sq.json
Score main:
python poleval_ner_test_v2.py \
--goldfile data/POLEVAL-NER_GOLD.json \
--userfile poleval2018-predictions-spacy-sq.json \
--categories-main
Usage
Sample usage
Command:
python sample.py
Expected output:
--------------------
Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej.
0:11 nam_liv_person Marek Nowak
14:39 nam_org_organization Politechniki Wrocławskiej
57:66 nam_fac_road Sądeckiej
--------------------
#PoselAdamNowak Co Pan myśli na temat fuzji Orlenu i Lotosu?
6:15 nam_liv_person AdamNowak
44:50 nam_org_group_team Orlenu
53:59 nam_org_group_team Lotosu
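Each annotation line in the output has the form `<begin>:<end> <label> <text>`, where begin and end are character offsets into the input sentence. A minimal parser for this format (a hypothetical helper, not part of the toolkit) could look like:

```python
def parse_annotation(line):
    """Split an output line into (begin, end, label, text)."""
    span, label, text = line.split(" ", 2)
    begin, end = (int(offset) for offset in span.split(":"))
    return begin, end, label, text

print(parse_annotation("0:11 nam_liv_person Marek Nowak"))
# (0, 11, 'nam_liv_person', 'Marek Nowak')
```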
Docker
To build the base image:
docker build -f Dockerfiles/base/Dockerfile . --tag poldeepner2
To build a specific model on top of the base image:
docker build -f Dockerfiles/nkjp_base_sq/Dockerfile . --tag poldeepner2_nkjp_base_sq
To run a container with the chosen model:
docker run --publish 8000:8000 poldeepner2_nkjp_base_sq
To build and run the GPU image with lemmatization (cen_n82_herbert_large_polem_gpu):
docker build -f Dockerfiles/cen_n82_herbert_large_polem_gpu/Dockerfile . --tag poldeepner2:cen_n82_herbert_large_polem_gpu
docker run -p 8001:8001 --gpus all --network host mczuk/poldeepner2:cen_n82_herbert_large_polem_gpu
Flask server
Run
python server.py \
--pretrained_path models/roberta_base_fairseq \
--model models/nkjp_base_sq/ \
--tokenization spacy-ext \
--device cuda:0 \
--squeeze
Process endpoint
curl -XPOST localhost:8001/predict -d \
'{"text": "Poznałem Marka Nowaka z Politechniki Wrocławskiej, który mieszka przy ul. Sądeckiej."}'
Expected output:
{
"entities": [
{
"begin": 0,
"end": 11,
"label": "persName",
"text": "Marek Nowak"
},
{
"begin": 0,
"end": 5,
"label": "persName_forename",
"text": "Marek"
},
{
"begin": 6,
"end": 11,
"label": "persName_surname",
"text": "Nowak"
},
{
"begin": 14,
"end": 39,
"label": "orgName",
"text": "Politechniki Wroc\u0142awskiej"
},
{
"begin": 27,
"end": 39,
"label": "placeName_settlement",
"text": "Wroc\u0142awskiej"
},
{
"begin": 53,
"end": 67,
"label": "geogName",
"text": "ul. S\u0105deckiej."
}
],
"text": "Marek Nowak z Politechniki Wroc\u0142awskiej mieszka przy ul. S\u0105deckiej."
}
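The begin/end fields are character offsets into the top-level text, and entities may nest (here persName contains a forename and a surname, matching the nested-annotation support listed in the features). A quick sanity check of the response shown above, assuming those offset semantics:

```python
import json

# The server response from the example above, embedded as a literal.
response = json.loads("""
{
  "text": "Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej.",
  "entities": [
    {"begin": 0, "end": 11, "label": "persName", "text": "Marek Nowak"},
    {"begin": 0, "end": 5, "label": "persName_forename", "text": "Marek"},
    {"begin": 6, "end": 11, "label": "persName_surname", "text": "Nowak"},
    {"begin": 14, "end": 39, "label": "orgName", "text": "Politechniki Wrocławskiej"},
    {"begin": 27, "end": 39, "label": "placeName_settlement", "text": "Wrocławskiej"},
    {"begin": 53, "end": 67, "label": "geogName", "text": "ul. Sądeckiej."}
  ]
}
""")

# Every entity's text equals the corresponding slice of the input text.
for entity in response["entities"]:
    assert response["text"][entity["begin"]:entity["end"]] == entity["text"]
print("all offsets consistent")
```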
Training
See docs/training.md
NER with lemmatization for Polish
Requirements:
- KRNNT tagger
docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0
- Polem
docker run -d -p 8000:8000 mczuk/polem:1.0.0
Install PolDeepNer2:
pip install -r requirements.txt
Run sample code:
python sample_polem.py
Expected output:
--------------------
Spotkałem Marka Nowaka na Politechnice Wrocławskiej, który pracuje w Intelu.
2:4 10:22 nam_liv_person Marka Nowaka Marek Nowak
5:7 26:51 nam_org_organization Politechnice Wrocławskiej Politechnika Wrocławska
11:12 69:75 nam_org_company Intelu Intel
--------------------
Wczoraj mieliśmy kontrolę Naczelnej Izby Skarbowej.
4:7 26:50 nam_org_institution Naczelnej Izby Skarbowej Naczelna Izba Skarbowa
(...)
Credits
- This code is based on xlm-roberta-ner by mohammadKhalifa.
- Language models for Polish:
- KRNNT Tagger: https://github.com/kwrobel-nlp/krnnt