    PolDeepNer2

    About

PolDeepNer2 is a tool for sequence labeling tasks based on the RoBERTa transformer. It ships with a set of pretrained models for Polish. The main features are:

    • handles nested annotations (nkjp models),
    • processes plain text using one of the following tokenizers: spaCy, NLTK, KRNNT,
    • squeeze mode — multiple sentences from a single document can be packed into the same batch,
    • a set of NER models for Polish trained on the NKJP and KPWr corpora.
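    The squeeze mode can be pictured as a greedy packing step: consecutive sentences from one document are concatenated into a single sequence as long as they stay within the model's maximum length. A minimal illustrative sketch (the function name and the token-count threshold are our assumptions, not the tool's actual API):

    ```python
    def squeeze(sentences, max_seq_length=256):
        """Greedily pack consecutive sentences (lists of tokens) from one
        document into sequences of at most max_seq_length tokens."""
        batches, current = [], []
        for sent in sentences:
            # start a new sequence when the next sentence would overflow
            if current and len(current) + len(sent) > max_seq_length:
                batches.append(current)
                current = []
            current = current + sent
        if current:
            batches.append(current)
        return batches

    doc = [["a", "b", "c"], ["d", "e", "f"], ["g"]]
    print(squeeze(doc, max_seq_length=4))
    # → [['a', 'b', 'c'], ['d', 'e', 'f', 'g']]
    ```

    With fewer, longer sequences per batch, the model makes fewer forward passes over short padded inputs, which is where the speed-up reported for the sq models comes from.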

    Authors

    • Michał Marcińczuk
    • Jarema Radom

    Credits

    This code is based on xlm-roberta-ner by mohammadKhalifa.

    Setting up

    Requirements

    • Python 3.6
    • CUDA 10.0+
    • PyTorch

    Virtual environment

    sudo apt-get install python3-pip python3-dev python-virtualenv
    sudo pip install -U pip
    virtualenv -p python3.6 venv
    source venv/bin/activate
    pip install -U pip
    pip install -r requirements.txt

    spaCy

    Required for the spacy and spacy-ext tokenizers.

    python -m spacy download pl_core_news_sm

    or

    python -m pip install pl_core_news_sm-2.3.0.tar.gz

    KRNNT

    Required for the krnnt tokenizer.

    docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0

    Polish RoBERTa models

    Download the Polish RoBERTa base model.

    mkdir models/roberta_base_fairseq -p
    wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_base_fairseq.zip
    unzip roberta_base_fairseq.zip -d models/roberta_base_fairseq
    rm roberta_base_fairseq.zip

    Download the Polish RoBERTa large model.

    mkdir models/roberta_large_fairseq -p
    wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_large_fairseq.zip
    unzip roberta_large_fairseq.zip -d models/roberta_large_fairseq
    rm roberta_large_fairseq.zip

    Pre-trained models

    https://minio.clarin-pl.eu/minio/public/models/poldeepner2/

    | Model          | Path                                                                  |
    | -------------- | --------------------------------------------------------------------- |
    | cen_n82_base   | https://minio.clarin-pl.eu/public/models/poldeepner2/cen_n82_base.zip |
    | cen_n82_large  | https://minio.clarin-pl.eu/public/models/poldeepner2/cen_n82_large.zip |
    | kpwr_n82_base  | https://minio.clarin-pl.eu/public/models/poldeepner2/kpwr_n82_base.zip |
    | kpwr_n82_large | https://minio.clarin-pl.eu/public/models/poldeepner2/kpwr_n82_large.zip |
    | nkjp_base      | https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base.zip    |
    | nkjp_base_sq   | https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip |
    • base and large refer to roberta_base_fairseq and roberta_large_fairseq respectively,
    • sq indicates that the model should be used with the --squeeze option.
    wget "https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip" -O models/nkjp_base_sq.zip
    unzip models/nkjp_base_sq.zip -d models 

    N82 models

    KPWr

    Results on the test part of the KPWr n82 corpus.

    | Model                       | Precision | Recall | F1    | Time   | Memory usage | GPU memory | Embeddings size   |
    | --------------------------- | --------- | ------ | ----- | ------ | ------------ | ---------- | ----------------- |
    | kpwr_n82_large              | 77.05     | 78.79  | 77.91 | ~3.3 m | 3.0 GB       | 3.8 GB     | 0.71 GB + 1.40 GB |
    | kpwr_n82_base               | 75.02     | 77.67  | 76.32 | ~1.5 m | 3.0 GB       | 2.0 GB     | 0.25 GB + 0.50 GB |
    | PolDeepNer (n82-elmo-kgr10) | 73.97     | 75.49  | 74.72 | ~4.0 m | 4.5 GB       | -          | 0.4 GB            |

    See detailed results.

    N82 Summary (KPWr, CEN)

    | Model          | Eval | Precision | Recall | F-measure | Support |
    | -------------- | ---- | --------- | ------ | --------- | ------- |
    | kpwr_n82_base  | KPWr | 75.02     | 77.67  | 76.32     | 4430    |
    | kpwr_n82_large | KPWr | 77.05     | 78.79  | 77.91     | 4430    |
    | cen_n82_base   | CEN  | 84.64     | 85.95  | 85.29     | 1423    |
    | cen_n82_large  | CEN  | 86.94     | 88.40  | 87.67     | 1423    |

    Cross-corpora evaluation

    | Model          | Eval | Precision | Recall | F-measure | Support |
    | -------------- | ---- | --------- | ------ | --------- | ------- |
    | kpwr_n82_base  | CEN  | 80.90     | 81.87  | 81.38     | 1423    |
    | kpwr_n82_large | CEN  | 80.16     | 82.08  | 81.11     | 1423    |
    | cen_n82_base   | KPWr | 58.58     | 64.79  | 61.53     | 4430    |
    | cen_n82_large  | KPWr | 61.38     | 66.66  | 63.91     | 4430    |

    PolEval 2018

    Download the gold dataset

    wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O data/POLEVAL-NER_GOLD.json

    Performance

    | Model                                 | Score | Exact | Overlap | Score main | Test Time | Source |
    | ------------------------------------- | ----- | ----- | ------- | ---------- | --------- | ------ |
    | PolDeepNer2 (nkjp_base_sq, spacy-ext) | 91.4  | 89.9  | 92.7    | 94.00      | 2m 13s    | -      |
    | PolDeepNer2 (nkjp_base, pre)          | 90.0  | 87.7  | 90.5    | 92.40      | 6m 44s*   | -      |
    | PolDeepNer2 (nkjp_base, spacy-ext)    | 89.8  | 87.4  | 90.4    | 92.20      | 8m 10s    | -      |
    | Dadas and Protasiewicz, 2020          | 88.6  | 87.0  | 89.0    | -          | -         | link   |
    | Polish RoBERTa large                  | -     | -     | -       | 89.98      | -         | link   |
    | Polish RoBERTa base                   | -     | -     | -       | 87.94      | -         | link   |

    * Does not include tokenization time.

    Evaluation

    Evaluate on a pre-tokenized dataset (nkjp_base, pre)

    time python process_poleval_pretokenized.py \
      --input data/poleval2018ner-data/index.list \
      --output poleval2018-predictions-pretokenized.json \
      --pretrained_path models/roberta_base_fairseq \
      --model models/nkjp_base \
      --max_seq_length 256 \
      --device cuda:0
    python poleval_ner_test.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-pretokenized.json

    Evaluate on raw text using spaCy tokenizer (nkjp_base, spaCy)

    time python process_poleval.py \
      --input data/POLEVAL-NER_GOLD.json \
      --output poleval2018-predictions-spacy.json \
      --pretrained_path models/roberta_base_fairseq \
      --model models/nkjp_base \
      --max_seq_length 256 \
      --tokenization spacy-ext \
      --device cuda:0
    python poleval_ner_test.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-spacy.json

    Evaluate on raw text using spaCy tokenizer (nkjp_base_sq, spaCy)

    time python process_poleval.py \
      --input data/POLEVAL-NER_GOLD.json \
      --output poleval2018-predictions-spacy-sq.json \
      --pretrained_path models/roberta_base_fairseq \
      --model models/nkjp_base_sq \
      --max_seq_length 256 \
      --tokenization spacy-ext \
      --squeeze \
      --device cuda:0

    Score, Exact and Overlap:

    python poleval_ner_test.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-spacy-sq.json

    Score main:

    python poleval_ner_test_v2.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-spacy-sq.json \
      --categories-main

    Usage

    Sample usage

    Command:

    python sample.py

    Expected output:

    --------------------
    Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej.
    0:11     nam_liv_person       Marek Nowak
    14:39    nam_org_organization Politechniki Wrocławskiej
    57:66    nam_fac_road         Sądeckiej
    --------------------
    #PoselAdamNowak Co Pan myśli na temat fuzji Orlenu i Lotosu?
    6:15     nam_liv_person       AdamNowak
    44:50    nam_org_group_team   Orlenu
    53:59    nam_org_group_team   Lotosu
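
    Each annotation line in the output is a begin:end character span into the original sentence, half-open as in Python slicing. A quick standard-library check of the spans from the first sentence above (illustrative only, not part of the tool):

    ```python
    text = "Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej."

    # (begin, end, label) triples copied from the sample output above
    annotations = [
        (0, 11, "nam_liv_person"),
        (14, 39, "nam_org_organization"),
        (57, 66, "nam_fac_road"),
    ]

    for begin, end, label in annotations:
        # text[begin:end] recovers exactly the annotated surface form
        print(f"{begin}:{end}\t{label}\t{text[begin:end]}")
    ```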

    Flask server

    Run

    python server.py \
       --pretrained_path models/roberta_base_fairseq \
       --model models/nkjp_base_sq/ \
       --tokenization spacy-ext \
       --device cuda:0 \
       --squeeze

    Process

    curl -XPOST localhost:8000/predict -d "Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej."

    Expected output:

    {
      "entities": [
        {
          "begin": 0, 
          "end": 11, 
          "label": "persName", 
          "text": "Marek Nowak"
        }, 
        {
          "begin": 0, 
          "end": 5, 
          "label": "persName_forename", 
          "text": "Marek"
        }, 
        {
          "begin": 6, 
          "end": 11, 
          "label": "persName_surname", 
          "text": "Nowak"
        }, 
        {
          "begin": 14, 
          "end": 39, 
          "label": "orgName", 
          "text": "Politechniki Wroc\u0142awskiej"
        }, 
        {
          "begin": 27, 
          "end": 39, 
          "label": "placeName_settlement", 
          "text": "Wroc\u0142awskiej"
        }, 
        {
          "begin": 53, 
          "end": 67, 
          "label": "geogName", 
          "text": "ul. S\u0105deckiej."
        }
      ], 
      "text": "Marek Nowak z Politechniki Wroc\u0142awskiej mieszka przy ul. S\u0105deckiej."
    }
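
    The endpoint can also be called from Python. A minimal client sketch using only the standard library — the helper names are ours; only the /predict route and the response fields shown above come from the server:

    ```python
    import json
    import urllib.request

    def extract_entities(response):
        """Turn the server's JSON payload into (begin, end, label, text) tuples."""
        return [(e["begin"], e["end"], e["label"], e["text"])
                for e in response["entities"]]

    def predict(text, url="http://localhost:8000/predict"):
        """POST raw text to a running PolDeepNer2 Flask server and parse the reply."""
        req = urllib.request.Request(url, data=text.encode("utf-8"), method="POST")
        with urllib.request.urlopen(req) as resp:
            return extract_entities(json.load(resp))

    if __name__ == "__main__":
        sentence = "Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej."
        for begin, end, label, surface in predict(sentence):
            print(f"{begin}:{end}\t{label}\t{surface}")
    ```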

    Training

    The code expects the data directory to contain three dataset splits: train.txt, valid.txt, and test.txt.

    KPWr

    time python main.py  \
          --data_dir=data/kpwr_n82/  \
          --task_name=ner \
          --output_dir=models/kpwr_n82_base/   \
          --max_seq_length=128   \
          --num_train_epochs 30  \
          --do_eval \
          --warmup_proportion=0.0 \
          --pretrained_path models/roberta_base_fairseq \
          --learning_rate 6e-5 \
          --gradient_accumulation_steps 4 \
          --do_train \
          --eval_on test \
          --train_batch_size 32 \
          --dropout 0.3

    CEN

    time python main.py  \
          --data_dir=data/cen_n82/  \
          --task_name=ner \
          --output_dir=models/cen_n82_large/   \
          --max_seq_length=512   \
          --num_train_epochs 30  \
          --do_eval \
          --warmup_proportion=0.0 \
          --pretrained_path models/roberta_large_fairseq \
          --learning_rate 6e-5 \
          --gradient_accumulation_steps 4 \
          --do_train \
          --eval_on test \
          --train_batch_size 32 \
          --dropout 0.3

    PolEval 2018

    time python main.py  \
          --data_dir=data/nkjp-nested-full-aug/  \
          --task_name=ner \
          --output_dir=models/nkjp_base_sq/   \
          --max_seq_length=256   \
          --num_train_epochs 10  \
          --do_eval \
          --warmup_proportion=0.0 \
          --pretrained_path models/roberta_base_fairseq \
          --learning_rate 6e-5 \
          --gradient_accumulation_steps 4 \
          --do_train \
          --eval_on test \
          --train_batch_size 32 \
          --dropout 0.3 \
          --squeeze

    Docker

    To build the base image:

    docker build -f Dockerfiles/base/Dockerfile . --tag poldeepner2

    To build a specific model image on top of the base image:

    docker build -f Dockerfiles/nkjp_base_sq/Dockerfile . --tag poldeepner2_nkjp_base_sq

    To run a container with the chosen model:

    docker run --publish 8000:8000 poldeepner2_nkjp_base_sq

    HerBERT

    time python main.py  \
          --data_dir=data/nkjp-nested-full-aug/  \
          --task_name=ner \
          --output_dir=models/nkjp_base_sq/   \
          --max_seq_length=256   \
          --num_train_epochs 10  \
          --do_eval \
          --warmup_proportion=0.0 \
          --pretrained_path models/roberta_base_fairseq \
          --learning_rate 6e-5 \
          --gradient_accumulation_steps 4 \
          --do_train \
          --eval_on test \
          --train_batch_size 32 \
          --dropout 0.3 \
          --model=Herbert \
          --squeeze