PolDeepNer2

About

PolDeepNer2 is a tool for sequence labeling tasks based on the RoBERTa transformer. It offers a set of pretrained models for Polish. The main features are:

  • handling of nested annotations (nkjp models),
  • processing of plain text with one of the following tokenizers: spaCy, KRNNT, and two custom split-based tokenizers (spaces and fast),
  • squeeze mode: multiple sentences from a single document can be put into the same batch,
  • a set of NER models for Polish trained on the NKJP and KPWr corpora.
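
Squeeze mode can be pictured as greedy packing: consecutive sentences from one document are concatenated into a single sequence for as long as a token budget allows. The sketch below only illustrates that idea under this assumption. `squeeze_sentences` and `max_tokens` are hypothetical names, not part of the PolDeepNer2 API, and the sketch counts words, whereas the real limit applies to subword tokens.

```python
from typing import List


def squeeze_sentences(sentences: List[List[str]], max_tokens: int) -> List[List[str]]:
    """Greedily pack consecutive sentences into sequences of at most max_tokens tokens.

    Hypothetical helper that illustrates the idea of squeeze mode.
    A sentence longer than max_tokens is kept alone in its own sequence (no truncation).
    """
    sequences: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        # Start a new sequence when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) > max_tokens:
            sequences.append(current)
            current = []
        current = current + sentence
    if current:
        sequences.append(current)
    return sequences


document = [
    ["Marek", "Nowak", "mieszka", "we", "Wrocławiu", "."],
    ["Pracuje", "na", "Politechnice", "."],
    ["Lubi", "góry", "."],
]
packed = squeeze_sentences(document, max_tokens=10)
print([len(seq) for seq in packed])  # [10, 3]
```

Packing several sentences into one sequence gives the model more document context per forward pass and reduces the number of mostly padded sequences.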

Authors

Setting up

Requirements

  • Python 3.8
  • CUDA 10.0+
  • PyTorch 1.9

Virtual environment

venv

sudo apt-get install python3-pip python3-dev python-virtualenv
sudo pip install -U pip
virtualenv -p python3.8 venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt

Conda

conda create -n pdn2 python=3.8
conda activate pdn2
conda install -c anaconda cudatoolkit=10.2
conda install -c anaconda cudnn
pip install -r requirements.txt

Tokenization methods

spaCy

Required for the spacy and spacy-ext tokenizers.

python -m spacy download pl_core_news_sm

or

python -m pip install pl_core_news_sm-2.3.0.tar.gz

KRNNT

Required for the krnnt tokenizer.

docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0
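
Besides spaCy and KRNNT, the feature list mentions custom tokenizers based on splitting. A space-based tokenizer that also keeps character offsets could look like the sketch below. This is an illustration only; `split_with_offsets` is a hypothetical name and the tool's own implementation may differ.

```python
import re
from typing import List, Tuple


def split_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Split on whitespace and return (token, begin, end) triples; end is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


tokens = split_with_offsets("Marek Nowak z Politechniki Wrocławskiej")
for token, begin, end in tokens:
    print(f"{begin}:{end}\t{token}")
```

Keeping the offsets next to each token makes it possible to map token-level labels back to character spans in the original text.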

Polish RoBERTa models

Download the Polish RoBERTa base model.

mkdir models/roberta_base_fairseq -p
wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_base_fairseq.zip
unzip roberta_base_fairseq.zip -d models/roberta_base_fairseq
rm roberta_base_fairseq.zip

Download the Polish RoBERTa large model.

mkdir models/roberta_large_fairseq -p
wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_large_fairseq.zip
unzip roberta_large_fairseq.zip -d models/roberta_large_fairseq
rm roberta_large_fairseq.zip

Lemmatization

The lemmatization module requires the KRNNT and Polem web services to be up and running.

Setup Polem WS

ToDo

Usage

ToDo

Pre-trained models

https://minio.clarin-pl.eu/minio/public/models/poldeepner2/

Model           Path
cen_n82_base    https://minio.clarin-pl.eu/public/models/poldeepner2/cen_n82_base.zip
cen_n82_large   https://minio.clarin-pl.eu/public/models/poldeepner2/cen_n82_large.zip
kpwr_n82_base   https://minio.clarin-pl.eu/public/models/poldeepner2/kpwr_n82_base.zip
kpwr_n82_large  https://minio.clarin-pl.eu/public/models/poldeepner2/kpwr_n82_large.zip
nkjp_base       https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base.zip
nkjp_base_sq    https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip

  • base and large refer to roberta_base_fairseq and roberta_large_fairseq, respectively,
  • sq indicates that the model should be used with the --squeeze option.

wget "https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip" -O models/nkjp_base_sq.zip
unzip models/nkjp_base_sq.zip -d models
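
The same download-and-unpack step can be scripted in Python, for example when wget or unzip is not available. The sketch below uses only the standard library; `download_and_unpack` is a hypothetical helper name, and the example call is commented out because it needs network access.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_and_unpack(url: str, target_dir: str) -> list:
    """Download a ZIP archive from url and extract it into target_dir.

    Returns the list of member names found in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "model.zip"
        # The archive is stored in a temporary directory and removed afterwards.
        urllib.request.urlretrieve(url, str(archive))
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            return zf.namelist()


# Example call (requires network access):
# download_and_unpack(
#     "https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip",
#     "models",
# )
```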

N82 models

KPWr

Results on the test part of the KPWr n82 corpus.

Model                        Precision  Recall  F1     Time     Memory usage  GPU memory  Embeddings size
kpwr_n82_large               77.05      78.79   77.91  ~ 3.3 m  3.0 GB        3.8 GB      0.71 GB + 1.40 GB
kpwr_n82_base                75.02      77.67   76.32  ~ 1.5 m  3.0 GB        2.0 GB      0.25 GB + 0.50 GB
PolDeepNer (n82-elmo-kgr10)  73.97      75.49   74.72  ~ 4.0 m  4.5 GB        -           0.4 GB

See detailed results.

N82 Summary (KPWr, CEN)

Model           Eval  Precision  Recall  F-measure  Support
kpwr_n82_base   KPWr  75.02      77.67   76.32      4430
kpwr_n82_large  KPWr  77.05      78.79   77.91      4430
cen_n82_base    CEN   84.64      85.95   85.29      1423
cen_n82_large   CEN   86.94      88.40   87.67      1423
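
The F-measure column is consistent with the usual F1 definition, the harmonic mean of precision and recall. The short sketch below recomputes it for one row. Because the tabulated precision and recall are already rounded, recomputed values for other rows may differ in the last digit.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1), on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of kpwr_n82_base on KPWr, taken from the table above.
print(round(f_measure(75.02, 77.67), 2))  # 76.32
```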

Cross-corpora evaluation

Model           Eval  Precision  Recall  F-measure  Support
kpwr_n82_base   CEN   80.90      81.87   81.38      1423
kpwr_n82_large  CEN   80.16      82.08   81.11      1423
cen_n82_base    KPWr  58.58      64.79   61.53      4430
cen_n82_large   KPWr  61.38      66.66   63.91      4430

PolEval 2018

Unpack datasets

wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O data/POLEVAL-NER_GOLD.json

Performance

Model                                  Score  Exact  Overlap  Score main  Test time  Source
PolDeepNer2 (nkjp_base_sq, spacy-ext)  91.4   89.9   92.7     94.00       2m 13s
PolDeepNer2 (nkjp_base, pre)           90.0   87.7   90.5     92.40       *6m 44s
PolDeepNer2 (nkjp_base, spacy-ext)     89.8   87.4   90.4     92.20       8m 10s
Dadas and Protasiewicz, 2020           88.6   87.0   89.0     -                      link
Polish RoBERTa large                   -      -      -        89.98                  link
Polish RoBERTa base                    -      -      -        87.94                  link

* Does not include tokenization time.

Evaluation

Evaluate on a pre-tokenized dataset (nkjp_base, pre)

time python process_poleval_pretokenized.py \
  --input data/poleval2018ner-data/index.list \
  --output poleval2018-predictions-pretokenized.json \
  --pretrained_path models/roberta_base_fairseq \
  --model models/nkjp_base \
  --max_seq_length 256 \
  --device cuda:0
python poleval_ner_test.py \
  --goldfile data/POLEVAL-NER_GOLD.json \
  --userfile poleval2018-predictions-pretokenized.json

Evaluate on raw text using spaCy tokenizer (nkjp_base, spaCy)

time python process_poleval.py \
  --input data/POLEVAL-NER_GOLD.json \
  --output poleval2018-predictions-spacy.json \
  --pretrained_path models/roberta_base_fairseq \
  --model models/nkjp_base \
  --max_seq_length 256 \
  --tokenization spacy-ext \
  --device cuda:0
python poleval_ner_test.py \
  --goldfile data/POLEVAL-NER_GOLD.json \
  --userfile poleval2018-predictions-spacy.json

Evaluate on raw text using spaCy tokenizer (nkjp_base_sq, spaCy)

time python process_poleval.py \
  --input data/POLEVAL-NER_GOLD.json \
  --output poleval2018-predictions-spacy-sq.json \
  --pretrained_path models/roberta_base_fairseq \
  --model models/nkjp_base_sq \
  --max_seq_length 256 \
  --tokenization spacy-ext \
  --squeeze \
  --device cuda:0

Score, Exact and Overlap:

python poleval_ner_test.py \
  --goldfile data/POLEVAL-NER_GOLD.json \
  --userfile poleval2018-predictions-spacy-sq.json

Score main:

python poleval_ner_test_v2.py \
  --goldfile data/POLEVAL-NER_GOLD.json \
  --userfile poleval2018-predictions-spacy-sq.json \
  --categories-main

Usage

Sample usage

Command:

python sample.py

Expected output:

--------------------
Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej.
0:11     nam_liv_person       Marek Nowak
14:39    nam_org_organization Politechniki Wrocławskiej
57:66    nam_fac_road         Sądeckiej
--------------------
#PoselAdamNowak Co Pan myśli na temat fuzji Orlenu i Lotosu?
6:15     nam_liv_person       AdamNowak
44:50    nam_org_group_team   Orlenu
53:59    nam_org_group_team   Lotosu

Docker

To build the base image:

docker build -f Dockerfiles/base/Dockerfile . --tag poldeepner2

To build a specific model image on top of the base image:

docker build -f Dockerfiles/nkjp_base_sq/Dockerfile . --tag poldeepner2_nkjp_base_sq

To run a container with the chosen model:

docker run --publish 8000:8000 poldeepner2_nkjp_base_sq

docker build -f Dockerfiles/cen_n82_herbert_large_polem_gpu/Dockerfile . --tag poldeepner2:cen_n82_herbert_large_polem_gpu

docker run -p 8001:8001 --gpus all --network host mczuk/poldeepner2:cen_n82_herbert_large_polem_gpu

Flask server

Run

python server.py \
   --pretrained_path models/roberta_base_fairseq \
   --model models/nkjp_base_sq/ \
   --tokenization spacy-ext \
   --device cuda:0 \
   --squeeze

Process endpoint

curl -XPOST localhost:8001/predict -d \
     '{"text": "Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej."}'

Expected output:

{
  "entities": [
    {
      "begin": 0, 
      "end": 11, 
      "label": "persName", 
      "text": "Marek Nowak"
    }, 
    {
      "begin": 0, 
      "end": 5, 
      "label": "persName_forename", 
      "text": "Marek"
    }, 
    {
      "begin": 6, 
      "end": 11, 
      "label": "persName_surname", 
      "text": "Nowak"
    }, 
    {
      "begin": 14, 
      "end": 39, 
      "label": "orgName", 
      "text": "Politechniki Wroc\u0142awskiej"
    }, 
    {
      "begin": 27, 
      "end": 39, 
      "label": "placeName_settlement", 
      "text": "Wroc\u0142awskiej"
    }, 
    {
      "begin": 53, 
      "end": 67, 
      "label": "geogName", 
      "text": "ul. S\u0105deckiej."
    }
  ], 
  "text": "Marek Nowak z Politechniki Wroc\u0142awskiej mieszka przy ul. S\u0105deckiej."
}
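
The response above contains nested annotations: persName_forename and persName_surname lie inside persName, and placeName_settlement lies inside orgName. The sketch below shows one way to build the request in Python and to list such nested pairs. `build_predict_request` and `nested_pairs` are hypothetical helpers, and sending a JSON content type is an assumption, since the curl call sets no explicit header.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def build_predict_request(text: str, url: str = "http://localhost:8001/predict") -> urllib.request.Request:
    """Build (but do not send) a POST request with the JSON body used in the curl example.

    Assumption: the server accepts a JSON content type.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def nested_pairs(entities: List[Dict]) -> List[Tuple[str, str]]:
    """Return (outer_label, inner_label) for every entity strictly contained in another one."""
    pairs = []
    for outer in entities:
        for inner in entities:
            contained = outer["begin"] <= inner["begin"] and inner["end"] <= outer["end"]
            smaller = (inner["end"] - inner["begin"]) < (outer["end"] - outer["begin"])
            if contained and smaller:
                pairs.append((outer["label"], inner["label"]))
    return pairs


# A shortened version of the response shown above.
response = {
    "entities": [
        {"begin": 0, "end": 11, "label": "persName", "text": "Marek Nowak"},
        {"begin": 0, "end": 5, "label": "persName_forename", "text": "Marek"},
        {"begin": 6, "end": 11, "label": "persName_surname", "text": "Nowak"},
        {"begin": 14, "end": 39, "label": "orgName", "text": "Politechniki Wrocławskiej"},
        {"begin": 27, "end": 39, "label": "placeName_settlement", "text": "Wrocławskiej"},
    ]
}
for outer, inner in nested_pairs(response["entities"]):
    print(f"{inner} is nested in {outer}")

# To query a running server (not executed here):
# request = build_predict_request("Marek Nowak z Politechniki Wrocławskiej")
# with urllib.request.urlopen(request) as resp:
#     entities = json.load(resp)["entities"]
```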

Training

See docs/training.md

NER with lemmatization for Polish

Requirements:

  • KRNNT tagger
    docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0
  • Polem
    docker run -d -p 8000:8000 mczuk/polem:1.0.0
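
Before running the sample it can help to confirm that both services accept connections. The sketch below is a minimal TCP check using the ports from the docker commands above; `is_port_open` is a hypothetical helper, and `localhost` assumes the containers run on the local machine. An open port does not guarantee that the service has finished loading.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the docker commands above; the host name is an assumption.
services = {"KRNNT": 9003, "Polem": 8000}
for name, port in services.items():
    status = "up" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} (port {port}): {status}")
```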

Install PolDeepNer2:

pip install -r requirements.txt

Run sample code:

python sample_polem.py

Expected output:

--------------------
Spotkałem Marka Nowaka na Politechnice Wrocławskiej, który pracuje w Intelu.
2:4      10:22        nam_liv_person            Marka Nowaka               Marek Nowak
5:7      26:51        nam_org_organization      Politechnice Wrocławskiej  Politechnika Wrocławska
11:12    69:75        nam_org_company           Intelu                     Intel

--------------------
Wczoraj mieliśmy kontrolę Naczelnej Izby Skarbowej.
4:7      26:50        nam_org_institution       Naczelnej Izby Skarbowej   Naczelna Izba Skarbowa

(...)

Credits