  • PolDeepNer2

    About

    PolDeepNer2 is a tool for sequence labeling tasks based on the RoBERTa transformer. It ships with a set of pretrained models for Polish. The main features are:

    • handles nested annotations (nkjp models),
    • processes plain text using one of the following tokenizers: spaCy, KRNNT, or two custom split-based tokenizers (spaces and fast),
    • squeeze mode — multiple sentences from a single document can be packed into the same batch,
    • a set of pretrained NER models for Polish, trained on the NKJP and KPWr corpora.
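    The squeeze mode can be pictured as greedy packing of tokenized sentences into batch items bounded by the maximum sequence length. A simplified sketch, not the tool's actual implementation (the function name and signature are illustrative):

```python
def squeeze(sentences, max_len):
    """Greedily pack tokenized sentences from one document into
    batch items of at most max_len tokens (simplified sketch).

    A single sentence longer than max_len still becomes its own
    oversized item here; the real tool also has to split such cases.
    """
    batches, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            batches.append(current)
            current = []
        current = current + sent
    if current:
        batches.append(current)
    return batches
```

    With a limit of 4 tokens, two short sentences share one batch item while the next one starts a new item.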

    Authors

    Setting up

    Requirements

    • Python 3.8
    • CUDA 10.0+
    • PyTorch 1.9

    Virtual environment

    venv

    sudo apt-get install python3-pip python3-dev python-virtualenv
    sudo pip install -U pip
    virtualenv -p python3.8 venv
    source venv/bin/activate
    pip install -U pip
    pip install -r requirements.txt

    Conda

    conda create -n pdn2 python=3.8
    conda activate pdn2
    conda install -c anaconda cudatoolkit=10.2
    conda install -c anaconda cudnn
    pip install -r requirements.txt

    Tokenization methods

    spaCy

    Required for the spacy and spacy-ext tokenizers.

    python -m spacy download pl_core_news_sm

    or

    python -m pip install pl_core_news_sm-2.3.0.tar.gz

    KRNNT

    Required for the krnnt tokenizer.

    docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0

    Polish RoBERTa models

    Download the Polish RoBERTa base model.

    mkdir models/roberta_base_fairseq -p
    wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_base_fairseq.zip
    unzip roberta_base_fairseq.zip -d models/roberta_base_fairseq
    rm roberta_base_fairseq.zip

    Download the Polish RoBERTa large model.

    mkdir models/roberta_large_fairseq -p
    wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_large_fairseq.zip
    unzip roberta_large_fairseq.zip -d models/roberta_large_fairseq
    rm roberta_large_fairseq.zip

    Lemmatization

    The lemmatization module requires the KRNNT and Polem web services to be up and running.

    Setup Polem WS

    ToDo

    Usage

    ToDo

    Pre-trained models

    All pre-trained models are available at https://minio.clarin-pl.eu/minio/public/models/poldeepner2/

    Model Path
    cen_n82_base https://minio.clarin-pl.eu/public/models/poldeepner2/cen_n82_base.zip
    cen_n82_large https://minio.clarin-pl.eu/public/models/poldeepner2/cen_n82_large.zip
    kpwr_n82_base https://minio.clarin-pl.eu/public/models/poldeepner2/kpwr_n82_base.zip
    kpwr_n82_large https://minio.clarin-pl.eu/public/models/poldeepner2/kpwr_n82_large.zip
    nkjp_base https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base.zip
    nkjp_base_sq https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip
    • base and large refer to roberta_base_fairseq and roberta_large_fairseq respectively,
    • sq indicates that the model should be used with the --squeeze option.
    For example, to download and unpack a model:

    wget "https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip" -O models/nkjp_base_sq.zip
    unzip models/nkjp_base_sq.zip -d models
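    The download-and-unpack step can also be scripted. A sketch using only the Python standard library; `fetch_model` and `model_name` are illustrative helpers, not part of the tool:

```python
import os
import urllib.request
import zipfile

def model_name(url):
    """Derive the model directory name from an archive URL."""
    return os.path.basename(url).rsplit(".zip", 1)[0]

def fetch_model(url, dest_dir="models"):
    """Download a model archive and unpack it under dest_dir
    (mirrors the wget/unzip commands above)."""
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
    os.remove(archive)
    return os.path.join(dest_dir, model_name(url))

if __name__ == "__main__":
    fetch_model("https://minio.clarin-pl.eu/public/models/poldeepner2/nkjp_base_sq.zip")
```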

    N82 models

    KPWr

    Results on the test part of the KPWr n82 corpus.

    Model Precision Recall F1 Time Memory usage GPU memory Embeddings size
    kpwr_n82_large 77.05 78.79 77.91 ~ 3.3 m 3.0 GB 3.8 GB 0.71 GB + 1.40 GB
    kpwr_n82_base 75.02 77.67 76.32 ~ 1.5 m 3.0 GB 2.0 GB 0.25 GB + 0.50 GB
    PolDeepNer (n82-elmo-kgr10) 73.97 75.49 74.72 ~ 4.0 m 4.5 GB - 0.4 GB

    See detailed results.

    N82 Summary (KPWr, CEN)

    Model Eval Precision Recall F-measure Support
    kpwr_n82_base KPWr 75.02 77.67 76.32 4430
    kpwr_n82_large KPWr 77.05 78.79 77.91 4430
    cen_n82_base CEN 84.64 85.95 85.29 1423
    cen_n82_large CEN 86.94 88.40 87.67 1423
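    F-measure here is the standard harmonic mean of precision and recall, so the table values can be reproduced from the P and R columns (up to last-digit rounding, since the printed P and R are themselves rounded):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (F-measure)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the kpwr_n82 rows of the table above.
print(round(f1(75.02, 77.67), 2))  # 76.32
print(round(f1(77.05, 78.79), 2))  # 77.91
```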

    Cross-corpora evaluation

    Model Eval Precision Recall F-measure Support
    kpwr_n82_base CEN 80.90 81.87 81.38 1423
    kpwr_n82_large CEN 80.16 82.08 81.11 1423
    cen_n82_base KPWr 58.58 64.79 61.53 4430
    cen_n82_large KPWr 61.38 66.66 63.91 4430

    PolEval 2018

    Download the gold dataset

    wget http://mozart.ipipan.waw.pl/~axw/poleval2018/POLEVAL-NER_GOLD.json -O data/POLEVAL-NER_GOLD.json

    Performance

    Model Score Exact Overlap Score main Test Time Source
    PolDeepNer2 (nkjp_base_sq, spacy-ext) 91.4 89.9 92.7 94.00 2m 13s
    PolDeepNer2 (nkjp_base, pre) 90.0 87.7 90.5 92.40 *6m 44s
    PolDeepNer2 (nkjp_base, spacy-ext) 89.8 87.4 90.4 92.20 8m 10s
    Dadas and Protasiewicz, 2020 88.6 87.0 89.0 - link
    Polish RoBERTa large - - - 89.98 link
    Polish RoBERTa base - - - 87.94 link
    * Does not include tokenization time.

    Evaluation

    Evaluate on a pre-tokenized dataset (nkjp_base, pre)

    time python process_poleval_pretokenized.py \
      --input data/poleval2018ner-data/index.list \
      --output poleval2018-predictions-pretokenized.json \
      --pretrained_path models/roberta_base_fairseq \
      --model models/nkjp_base \
      --max_seq_length 256 \
      --device cuda:0
    python poleval_ner_test.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-pretokenized.json

    Evaluate on raw text using spaCy tokenizer (nkjp_base, spaCy)

    time python process_poleval.py \
      --input data/POLEVAL-NER_GOLD.json \
      --output poleval2018-predictions-spacy.json \
      --pretrained_path models/roberta_base_fairseq \
      --model models/nkjp_base \
      --max_seq_length 256 \
      --tokenization spacy-ext \
      --device cuda:0
    python poleval_ner_test.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-spacy.json

    Evaluate on raw text using spaCy tokenizer (nkjp_base_sq, spaCy)

    time python process_poleval.py \
      --input data/POLEVAL-NER_GOLD.json \
      --output poleval2018-predictions-spacy-sq.json \
      --pretrained_path models/roberta_base_fairseq \
      --model models/nkjp_base_sq \
      --max_seq_length 256 \
      --tokenization spacy-ext \
      --squeeze \
      --device cuda:0

    Score, Exact and Overlap:

    python poleval_ner_test.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-spacy-sq.json

    Score main:

    python poleval_ner_test_v2.py \
      --goldfile data/POLEVAL-NER_GOLD.json \
      --userfile poleval2018-predictions-spacy-sq.json \
      --categories-main

    Usage

    Sample usage

    Command:

    python sample.py

    Expected output:

    --------------------
    Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej.
    0:11     nam_liv_person       Marek Nowak
    14:39    nam_org_organization Politechniki Wrocławskiej
    57:66    nam_fac_road         Sądeckiej
    --------------------
    #PoselAdamNowak Co Pan myśli na temat fuzji Orlenu i Lotosu?
    6:15     nam_liv_person       AdamNowak
    44:50    nam_org_group_team   Orlenu
    53:59    nam_org_group_team   Lotosu
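    Each annotation line has the form `begin:end label text`, where begin and end are character offsets into the input sentence. A minimal parser for this output format (illustrative, not part of the tool):

```python
import re

def parse_annotations(block):
    """Parse 'begin:end  label  text' lines printed by sample.py
    into (begin, end, label, text) tuples (sketch)."""
    entities = []
    for line in block.strip().splitlines():
        m = re.match(r"(\d+):(\d+)\s+(\S+)\s+(.*)", line)
        if m:
            entities.append((int(m.group(1)), int(m.group(2)),
                             m.group(3), m.group(4)))
    return entities
```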

    Docker

    To build the base image:

    docker build -f Dockerfiles/base/Dockerfile . --tag poldeepner2

    To build a specific model image on top of the base image:

    docker build -f Dockerfiles/nkjp_base_sq/Dockerfile . --tag poldeepner2_nkjp_base_sq

    To run a container with the chosen model:

    docker run --publish 8000:8000 poldeepner2_nkjp_base_sq

    To build and run a GPU-enabled image (HerBERT large model with Polem):

    docker build -f Dockerfiles/cen_n82_herbert_large_polem_gpu/Dockerfile . --tag poldeepner2:cen_n82_herbert_large_polem_gpu
    docker run -p 8001:8001 --gpus all --network host mczuk/poldeepner2:cen_n82_herbert_large_polem_gpu

    Flask server

    Run

    python server.py \
       --pretrained_path models/roberta_base_fairseq \
       --model models/nkjp_base_sq/ \
       --tokenization spacy-ext \
       --device cuda:0 \
       --squeeze

    Process endpoint

    curl -XPOST localhost:8001/predict -d \
         '{"text": "Poznałem Marka Nowaka z Politechniki Wrocławskiej, który mieszka przy ul. Sądeckiej."}'
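    The same request can be issued from Python. A sketch using only the standard library; it assumes the server from the previous section is listening on port 8001:

```python
import json
import urllib.request

def build_payload(text):
    """Encode the JSON request body expected by the /predict endpoint."""
    return json.dumps({"text": text}).encode("utf-8")

def predict(text, url="http://localhost:8001/predict"):
    """POST text to a running PolDeepNer2 server and return the
    parsed JSON response (requires a running server)."""
    req = urllib.request.Request(
        url, data=build_payload(text),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(predict("Poznałem Marka Nowaka z Politechniki Wrocławskiej."))
```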

    Expected output:

    {
      "entities": [
        {
          "begin": 0, 
          "end": 11, 
          "label": "persName", 
          "text": "Marek Nowak"
        }, 
        {
          "begin": 0, 
          "end": 5, 
          "label": "persName_forename", 
          "text": "Marek"
        }, 
        {
          "begin": 6, 
          "end": 11, 
          "label": "persName_surname", 
          "text": "Nowak"
        }, 
        {
          "begin": 14, 
          "end": 39, 
          "label": "orgName", 
          "text": "Politechniki Wroc\u0142awskiej"
        }, 
        {
          "begin": 27, 
          "end": 39, 
          "label": "placeName_settlement", 
          "text": "Wroc\u0142awskiej"
        }, 
        {
          "begin": 53, 
          "end": 67, 
          "label": "geogName", 
          "text": "ul. S\u0105deckiej."
        }
      ], 
      "text": "Marek Nowak z Politechniki Wroc\u0142awskiej mieszka przy ul. S\u0105deckiej."
    }
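    The begin and end fields are character offsets into the returned text, so each entity's span can be verified by slicing. A small check using two entities from the response above:

```python
# Two entities from the sample response above; offsets are character
# positions into the "text" field.
response = {
    "text": "Marek Nowak z Politechniki Wrocławskiej mieszka przy ul. Sądeckiej.",
    "entities": [
        {"begin": 0, "end": 11, "label": "persName", "text": "Marek Nowak"},
        {"begin": 27, "end": 39, "label": "placeName_settlement", "text": "Wrocławskiej"},
    ],
}

def check_offsets(resp):
    """Return True if every entity's [begin:end) slice of the document
    text equals the entity's own text field."""
    return all(resp["text"][e["begin"]:e["end"]] == e["text"]
               for e in resp["entities"])
```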

    Training

    See docs/training.md

    NER with lemmatization for Polish

    Requirements:

    • KRNNT tagger
      docker run -d -p 9003:9003 -it djstrong/krnnt:1.0.0
    • Polem
      docker run -d -p 8000:8000 mczuk/polem:1.0.0

    Install PolDeepNer2:

    pip install -r requirements.txt

    Run sample code:

    python sample_polem.py

    Expected output:

    --------------------
    Spotkałem Marka Nowaka na Politechnice Wrocławskiej, który pracuje w Intelu.
    2:4      10:22        nam_liv_person            Marka Nowaka               Marek Nowak
    5:7      26:51        nam_org_organization      Politechnice Wrocławskiej  Politechnika Wrocławska
    11:12    69:75        nam_org_company           Intelu                     Intel
    
    --------------------
    Wczoraj mieliśmy kontrolę Naczelnej Izby Skarbowej.
    4:7      26:50        nam_org_institution       Naczelnej Izby Skarbowej   Naczelna Izba Skarbowa
    
    (...)
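    Each output line carries five whitespace-aligned columns: token range, character range, label, surface form, and lemma. A small parser for one such line (a sketch; it assumes columns are separated by at least two spaces, since surface forms may themselves contain single spaces):

```python
import re

def parse_lemma_line(line):
    """Split a sample_polem.py output line into its five columns
    (sketch; assumes columns are separated by two or more spaces)."""
    tok, char, label, orth, lemma = re.split(r"\s{2,}", line.strip())
    return {"tokens": tok, "chars": char, "label": label,
            "orth": orth, "lemma": lemma}
```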

    Credits