Commit 4294c097 authored by Wiktor Walentynowicz

Conflicts resolved.

parents 86ce17f0 1c6fbb26
1 merge request: !39 Version 0.7.0
Pipeline #4748 passed
Showing with 390 additions and 80093 deletions
poldeepner2/__pycache__/*
data/POLEVAL-NER_GOLD.json
dist
+/.resources/
@@ -3,7 +3,7 @@ image: "python:3.6"
before_script:
  - python --version
  - pip install -r requirements.txt
-  - python -m spacy download pl_core_news_sm
+  - pip install -r requirements-dev.txt
stages:
  - test
...
@@ -10,7 +10,7 @@ ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
-# Python 3.6
+# Python 3.8
#RUN apt-get install -y software-properties-common vim
RUN apt-get install -y python3.8 python3-pip
RUN python3.8 --version
...
@@ -8,4 +8,4 @@ RUN rm kpwr_n82_base.zip
EXPOSE 8000
-CMD python3.6 server.py --model models/kpwr_n82_base/kpwr_n82_base --pretrained_path xlmr:models/roberta_base_fairseq
+CMD python3.8 server.py --model models/kpwr_n82_base/kpwr_n82_base --pretrained_path xlmr:models/roberta_base_fairseq
@@ -8,4 +8,4 @@ RUN rm roberta_large_fairseq.zip
EXPOSE 8000
-CMD python3.6 server.py --model models/kpwr_n82_large/kpwr_n82_large --pretrained_path xlmr:models/roberta_base_fairseq
+CMD python3.8 server.py --model models/kpwr_n82_large/kpwr_n82_large --pretrained_path xlmr:models/roberta_base_fairseq
@@ -10,9 +10,9 @@ ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
-# Python 3.6
+# Python 3.8
RUN apt-get install -y software-properties-common vim
-RUN apt-get install -y python3.6 python3-pip
+RUN apt-get install -y python3.8 python3-pip
# update pip
RUN pip3 install pip --upgrade
@@ -22,7 +22,7 @@ RUN pip3 install wheel
WORKDIR "/poldeepner2"
ADD ./requirements.txt /poldeepner2/requirements.txt
RUN pip3 install -r requirements.txt
-RUN python3.6 -m spacy download pl_core_news_sm
+RUN python3.8 -m spacy download pl_core_news_sm
RUN apt-get install -y wget
RUN apt-get install -y unzip
@@ -43,4 +43,4 @@ COPY . .
EXPOSE 8000
-CMD python3.6 server.py --model models/kpwr_n82_base/kpwr_n82_base --pretrained_path xlmr:models/roberta_base_fairseq
+CMD python3.8 server.py --model models/kpwr_n82_base/kpwr_n82_base --pretrained_path xlmr:models/roberta_base_fairseq
@@ -10,9 +10,9 @@ ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
-# Python 3.6
+# Python 3.8
RUN apt-get install -y software-properties-common vim
-RUN apt-get install -y python3.6 python3-pip
+RUN apt-get install -y python3.8 python3-pip
# update pip
RUN pip3 install pip --upgrade
@@ -22,7 +22,7 @@ RUN pip3 install wheel
WORKDIR "/poldeepner2"
ADD ./requirements.txt /poldeepner2/requirements.txt
RUN pip3 install -r requirements.txt
-RUN python3.6 -m spacy download pl_core_news_sm
+RUN python3.8 -m spacy download pl_core_news_sm
RUN apt-get install -y wget
RUN apt-get install -y unzip
@@ -43,4 +43,4 @@ COPY . .
EXPOSE 8000
-CMD python3.6 server.py --model models/kpwr_n82_large/kpwr_n82_large --pretrained_path xlmr:models/roberta_base_fairseq
+CMD python3.8 server.py --model models/kpwr_n82_large/kpwr_n82_large --pretrained_path xlmr:models/roberta_base_fairseq
@@ -8,4 +8,4 @@ RUN rm nkjp_base.zip
EXPOSE 8000
-CMD python3.6 server.py --model models/nkjp_base/nkjp_base --pretrained_path xlmr:models/roberta_base_fairseq
+CMD python3.8 server.py --model models/nkjp_base/nkjp_base --pretrained_path xlmr:models/roberta_base_fairseq
@@ -18,9 +18,9 @@ It offers a set of pretrained models for Polish. The main features are:
### Requirements
-* Python 3.6
+* Python 3.8
* CUDA 10.0+
-* PyTorch 1.7
+* PyTorch 1.9
### Virtual environment
@@ -29,7 +29,7 @@ It offers a set of pretrained models for Polish. The main features are:
```
sudo apt-get install python3-pip python3-dev python-virtualenv
sudo pip install -U pip
-virtualenv -p python3.6 venv
+virtualenv -p python3.8 venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
@@ -38,9 +38,9 @@ pip install -r requirements.txt
#### Conda
```
-conda create -n pdn2 python=3.6
+conda create -n pdn2 python=3.8
conda activate pdn2
-conda install -c anaconda cudatoolkit=10.1
+conda install -c anaconda cudatoolkit=10.2
conda install -c anaconda cudnn
pip install -r requirements.txt
```
...
"""A message of shame -- documentation must be completed."""
from __future__ import absolute_import, division, print_function from __future__ import absolute_import, division, print_function
import argparse import argparse
...@@ -9,12 +11,26 @@ from poldeepner2.utils.data_utils import read_tsv ...@@ -9,12 +11,26 @@ from poldeepner2.utils.data_utils import read_tsv
def write_sentence(fout: str, tokens: List[str], labels: List[str]): def write_sentence(fout: str, tokens: List[str], labels: List[str]):
"""A message of shame -- documentation must be completed.
Args:
fout: str
tokens: List[str]
labels: List[str]
"""
for token, label in zip(tokens, labels): for token, label in zip(tokens, labels):
fout.write("%s\t%s\n" % (token, label)) fout.write("%s\t%s\n" % (token, label))
fout.write("\n") fout.write("\n")
def main(args): def main(args):
"""A message of shame -- documentation must be completed.
Args:
args:A message of shame -- documentation must be completed.
"""
sentences_labels = read_tsv(args.input, True) sentences_labels = read_tsv(args.input, True)
with codecs.open(args.output, "w", "utf8") as fout: with codecs.open(args.output, "w", "utf8") as fout:
for sentence, labels in sentences_labels: for sentence, labels in sentences_labels:
...@@ -23,22 +39,33 @@ def main(args): ...@@ -23,22 +39,33 @@ def main(args):
if args.upper: if args.upper:
logging.info("Augment data — upper case") logging.info("Augment data — upper case")
for sentence, labels in sentences_labels: for sentence, labels in sentences_labels:
write_sentence(fout, [token.upper() for token in sentence], labels) write_sentence(fout, [token.upper() for token in sentence],
labels)
if args.lower: if args.lower:
logging.info("Augment data — lower case") logging.info("Augment data — lower case")
for sentence, labels in sentences_labels: for sentence, labels in sentences_labels:
write_sentence(fout, [token.lower() for token in sentence], labels) write_sentence(fout, [token.lower() for token in sentence],
labels)
def parse_args(): def parse_args():
"""A message of shame -- documentation must be completed.
Returns: parser.parse_args()
"""
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description='Process a single TSV with a NER model') description='Process a single TSV with a NER model')
parser.add_argument('--input', required=True, metavar='PATH', help='path to a TSV file') parser.add_argument('--input', required=True, metavar='PATH',
parser.add_argument('--output', required=True, metavar='PATH', help='path to save the augmented dataset') help='path to a TSV file')
parser.add_argument('--lower', required=False, default=False, action="store_true", parser.add_argument('--output', required=True, metavar='PATH',
help='path to save the augmented dataset')
parser.add_argument('--lower', required=False, default=False,
ction="store_true",
help='augment lower-case data') help='augment lower-case data')
parser.add_argument('--upper', required=False, default=False, action="store_true", parser.add_argument('--upper', required=False, default=False,
action="store_true",
help='augment upper-case data') help='augment upper-case data')
return parser.parse_args() return parser.parse_args()
......
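The augmentation itself is just token-wise case mapping. Below is a minimal, self-contained sketch of that logic; the example sentence and labels are invented for illustration, whereas the real script streams TSV data through read_tsv and write_sentence.

```python
# Sketch of the --upper / --lower augmentation shown above.
# The example sentence is made up; the script reads data from a TSV file.
sentences_labels = [(["Jan", "z", "Warszawy"], ["B-PER", "O", "B-LOC"])]

augmented = list(sentences_labels)
for tokens, labels in sentences_labels:
    augmented.append(([t.upper() for t in tokens], labels))  # --upper
    augmented.append(([t.lower() for t in tokens], labels))  # --lower

for tokens, labels in augmented:
    for token, label in zip(tokens, labels):
        print(f"{token}\t{label}")  # same token<TAB>label layout as write_sentence
    print()  # blank line between sentences
```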
[model]
device = cpu
gpu_num = 0
path = /mnt/sda/pdn2scripts/nkjp_base
pretrained_path = /mnt/sda/pdn2scripts/roberta_base
[predict]
device = cpu
save_to_file = true
path = /mnt/sda/pdn2scripts/roberta_base
max_seq_len = 100
path_to_save = predict_res.txt
[evaluate]
device = cpu
gpu_num = 0
path = E:/ClarinProjects/nkjp_base
pretrained_path = ./roberta_base
squeeze = false
max_seq_len = 100
hidden_size = 32
dropout = 0.05
[data]
tag_column_index = 3
eval_path = data/coNLL-2003/test.txt
pred_path = tests/resources/text_krakow.txt
[train]
adam_epsilon = 0.1
data_test = data/coNLL-2003/test.txt
data_train = data/coNLL-2003/train.txt
data_tune = data/coNLL-2003/valid.txt
device = cuda
dropout = 0.05
epoch_save_model = True
eval_batch_size = 16
fp16 = false
fp16_opt_level = ''
freeze_model = True
gradient_accumulation_steps = 5
hidden_size = 32
learning_rate = 0.001
max_grad_norm = 5
max_seq_length = 32
num_train_epochs = 100
output_dir = test_res
pretrained_path = /mnt/sda/pdn2scripts/roberta_base
seed = 42
squeeze = true
train_batch_size = 16
training_mix = False
transfer = None
warmup_proportion = 0.3
weight_decay = 0.1
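The values above are consumed with Python's standard configparser, as the evaluation script further down does. A minimal reading sketch using the sections defined above:

```python
import configparser

config = configparser.ConfigParser()
config.read("config.cfg")

# Typed accessors parse the raw strings from the sections above.
device = config["evaluate"]["device"]                          # "cpu"
squeeze = config.getboolean("evaluate", "squeeze")             # False
max_seq_len = config.getint("evaluate", "max_seq_len")         # 100
dropout = config.getfloat("train", "dropout")                  # 0.05
tag_column_index = config.getint("data", "tag_column_index")   # 3

print(device, squeeze, max_seq_len, dropout, tag_column_index)
```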
"""A message of shame -- documentation must be completed."""
import codecs import codecs
import os import os
import torch import torch
# import tqdm NOT USED
from torch.utils.data.dataloader import DataLoader from torch.utils.data.dataloader import DataLoader
from poldeepner2.model.xlmr_for_token_classification import XLMRForTokenClassification from core.model.xlmr_for_token_classification import XLMRForTokenClassification
from poldeepner2.pipeline.tokenization import TokenizerSpaces from core.utils.data_utils import InputExample, convert_examples_to_features, \
from poldeepner2.utils.data_utils import read_params, InputExample, create_dataset, wrap_annotations, \ create_dataset, read_params, wrap_annotations, align_tokens_with_text
align_tokens_with_text from core.utils.tokenization import TokenizerSpaces
from poldeepner2.utils.sequences import convert_examples_to_features
class PolDeepNer2: class PolDeepNer2:
"""A message of shame -- documentation must be completed."""
def __init__(self, model_path, pretrained_path, def __init__(self, model_path, pretrained_path,
device="cpu", squeeze=False, max_seq_length=256, tokenizer=TokenizerSpaces()): device="cpu", squeeze=False, max_seq_length=256,
tokenizer=TokenizerSpaces()):
"""A message of shame -- documentation must be completed.
Args:
model_path:A message of shame -- documentation must be completed.
pretrained_path:A message of shame -- documentation must be
completed.
device:A message of shame -- documentation must be completed.
squeeze:A message of shame -- documentation must be completed.
max_seq_length:A message of shame -- documentation must be
completed.
tokenizer:A message of shame -- documentation must be completed.
"""
if not os.path.exists(model_path): if not os.path.exists(model_path):
raise ValueError("Model not found on path '%s'" % model_path) raise ValueError("Model not found on path '%s'" % model_path)
if not os.path.exists(pretrained_path): if not os.path.exists(pretrained_path):
raise ValueError("RoBERTa language model not found on path '%s'" % pretrained_path) raise ValueError("RoBERTa language model not found on path '%s'"
% pretrained_path)
dropout, num_labels, label_list = read_params(model_path) dropout, num_labels, label_list = read_params(model_path)
self.label_list = label_list self.label_list = label_list
...@@ -26,8 +45,11 @@ class PolDeepNer2: ...@@ -26,8 +45,11 @@ class PolDeepNer2:
n_labels=len(self.label_list) + 1, n_labels=len(self.label_list) + 1,
dropout_p=dropout, dropout_p=dropout,
device=device, device=device,
hidden_size=768 if 'base' in pretrained_path else 1024) hidden_size=768
state_dict = torch.load(open(os.path.join(model_path, 'model.pt'), 'rb')) if 'base' in pretrained_path
else 1024)
state_dict = torch.load(
open(os.path.join(model_path, 'model.pt'), 'rb'))
model.load_state_dict(state_dict) model.load_state_dict(state_dict)
model.eval() model.eval()
model.to(device) model.to(device)
...@@ -39,23 +61,40 @@ class PolDeepNer2: ...@@ -39,23 +61,40 @@ class PolDeepNer2:
@staticmethod @staticmethod
def load_labels(path): def load_labels(path):
return [line.strip() for line in codecs.open(path, "r", "utf8").readlines() if len(line.strip()) > 0] """A message of shame -- documentation must be completed.
Args:
path:A message of shame -- documentation must be completed.
Returns:A message of shame -- documentation must be completed.
def process(self, sentences):
""" """
@param sentences -- array of array of words, [['Jan', 'z', 'Warszawy'], ['IBM', 'i', 'Apple']] return [line.strip() for line in codecs.open(
@param max_seq_length -- the maximum total input sequence length after WordPiece tokenization path, "r", "utf8").readlines() if len(line.strip()) > 0]
@param squeeze -- boolean enabling squeezing multiple sentences into one Input Feature
def process(self, sentences):
"""A message of shame -- documentation must be completed.
@param sentences -- array of array of words,
[['Jan', 'z', 'Warszawy'], ['IBM', 'i', 'Apple']]
@param max_seq_length -- the maximum total input sequence length after
WordPiece tokenization
@param squeeze -- boolean enabling squeezing multiple sentences into
one Input Feature
""" """
examples = [] examples = []
for idx, tokens in enumerate(sentences): for idx, tokens in enumerate(sentences):
guid = str(idx) guid = str(idx)
text_a = ' '.join(tokens) text_a = ' '.join(tokens)
label = ["O"] * len(tokens) label = ["O"] * len(tokens)
examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) examples.append(InputExample(guid=guid, text_a=text_a,
text_b=None, label=label))
eval_features = convert_examples_to_features(examples, self.label_list, self.max_seq_length,
self.model.encode_word, self.squeeze) eval_features = convert_examples_to_features(examples,
self.label_list,
self.max_seq_length,
self.model.encode_word,
self.squeeze)
eval_dataset = create_dataset(eval_features) eval_dataset = create_dataset(eval_features)
eval_dataloader = DataLoader(eval_dataset, batch_size=1) eval_dataloader = DataLoader(eval_dataset, batch_size=1)
...@@ -69,7 +108,8 @@ class PolDeepNer2: ...@@ -69,7 +108,8 @@ class PolDeepNer2:
valid_ids = valid_ids.to(self.device) valid_ids = valid_ids.to(self.device)
with torch.no_grad(): with torch.no_grad():
logits = self.model(input_ids, labels=None, labels_mask=None, valid_mask=valid_ids) logits = self.model(input_ids, labels=None,
labels_mask=None, valid_mask=valid_ids)
logits = torch.argmax(logits, dim=2) logits = torch.argmax(logits, dim=2)
logits = logits.detach().cpu().numpy() logits = logits.detach().cpu().numpy()
...@@ -93,11 +133,13 @@ class PolDeepNer2: ...@@ -93,11 +133,13 @@ class PolDeepNer2:
return y_pred return y_pred
def process_text(self, text: str): def process_text(self, text: str):
""" """A message of shame -- documentation must be completed.
@texts: Array of sentences. Each sentence is a string. @texts: Array of sentences. Each sentence is a string.
"John lives in New York. Mary lives in Chicago" "John lives in New York. Mary lives in Chicago"
return:[(PER, 0, 4, "John"), (LOC, 14, 22, "New York"), (PER, 24, 28, "Mary"), (LOC, 38, 45, "Chicago")]] return:[(PER, 0, 4, "John"), (LOC, 14, 22, "New York"),
(PER, 24, 28, "Mary"), (LOC, 38, 45, "Chicago")]]
""" """
sentences = self.tokenizer.tokenize([text]) sentences = self.tokenizer.tokenize([text])
predictions = self.process(sentences) predictions = self.process(sentences)
...@@ -105,11 +147,14 @@ class PolDeepNer2: ...@@ -105,11 +147,14 @@ class PolDeepNer2:
return align_tokens_with_text(text, sentences, annotations) return align_tokens_with_text(text, sentences, annotations)
def process_tokenized(self, tokens: [[str]], text: str): def process_tokenized(self, tokens: [[str]], text: str):
""" """A message of shame -- documentation must be completed.
@tokens: Array of sentences. Each sentence is an array of words. @tokens: Array of sentences. Each sentence is an array of words.
[["John", "lives", "in", "New", "York"], ["Mary", "lives", "in", "Chicago"]] [["John", "lives", "in", "New", "York"],
["Mary", "lives", "in", "Chicago"]]
return: [["B-PER", "O", "O", "B-LOC", "I-LOC"], ["B-PER", "O", "O", "B-LOC"]] return: [["B-PER", "O", "O", "B-LOC", "I-LOC"],
["B-PER", "O", "O", "B-LOC"]]
""" """
predictions = self.process(tokens) predictions = self.process(tokens)
annotations = wrap_annotations(predictions) annotations = wrap_annotations(predictions)
......
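Based on the constructor and docstrings above, a rough usage sketch. The model and RoBERTa paths are placeholders (borrowed from the Dockerfile commands earlier in this commit), and the return values follow the docstrings rather than a verified run.

```python
from poldeepner2.models import PolDeepNer2

# Placeholder paths -- point these at an unpacked NER model and the
# pretrained RoBERTa it was trained with.
ner = PolDeepNer2(model_path="models/kpwr_n82_base/kpwr_n82_base",
                  pretrained_path="models/roberta_base_fairseq",
                  device="cpu", squeeze=False, max_seq_length=256)

# process() takes pre-tokenized sentences: an array of arrays of words.
labels = ner.process([["Jan", "z", "Warszawy"], ["IBM", "i", "Apple"]])

# process_text() takes raw text and returns (type, begin, end, text) tuples.
annotations = ner.process_text("Jan z Warszawy pracuje w IBM.")
```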
This diff is collapsed.
"""A message of shame -- documentation must be completed."""
from __future__ import absolute_import, division, print_function from __future__ import absolute_import, division, print_function
import argparse import argparse
import os import os
from time import time
import time import time
# from time import time F811 redefinition of unused 'time'
from poldeepner2.models import PolDeepNer2 import poldeepner2
from poldeepner2.utils.data_utils import read_tsv from poldeepner2.utils.data_utils import read_tsv
from poldeepner2.utils.seed import setup_seed
from poldeepner2.utils.sequence_labeling import classification_report from poldeepner2.utils.sequence_labeling import classification_report
def main(args): def main(args):
"""A message of shame -- documentation must be completed.
Args:
args:A message of shame -- documentation must be completed.
"""
print("Loading the NER model ...") print("Loading the NER model ...")
ner = PolDeepNer2.load( ner = poldeepner2.load(args.model, device=args.device)
model=args.model,
pretrained_path=args.pretrained_path, for param in ["device", "max_seq_length", "squeeze"]:
device=args.device, value = args.__dict__.get(param, None)
max_seq_length=args.max_seq_length, if value is not None:
squeeze=args.squeeze, value_default = ner.model.config.__dict__.get(param)
seed=args.seed if str(value) != str(value_default):
) print(f"Forced change of the parameter: {param} '{value_default}' => '{value}'")
ner.model.config.__dict__[param] = value
if args.seed is not None:
setup_seed(args.seed)
print("Processing ...") print("Processing ...")
sentences_labels = read_tsv(os.path.join(args.input), True) sentences_labels = read_tsv(os.path.join(args.input), True)
...@@ -42,22 +55,27 @@ def main(args): ...@@ -42,22 +55,27 @@ def main(args):
print(f"Total time : {time_processing:>8.4} second(s)") print(f"Total time : {time_processing:>8.4} second(s)")
print(f"Data size: : {data_size/1000000:>8.4} M characters") print(f"Data size: : {data_size/1000000:>8.4} M characters")
print(f"Speed: : {data_size / 1000000 / (time_processing/60):>8.4} M characters/minute") print(f"Speed: : {data_size / 1000000 / (time_processing/60):>8.4} M characters/minute")
print(f"Number of token labels : {len(ner.label_list):>8} ") print(f"Number of token labels : {len(ner.model.config.labels):>8} ")
def parse_args(): def parse_args():
"""A message of shame -- documentation must be completed.
Returns: parser.parse_args()
"""
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description='Process a single TSV with a NER model') description='Process a single TSV with a NER model')
parser.add_argument('--input', required=True, metavar='PATH', help='path to a file with a list of files') parser.add_argument('--input', required=True, metavar='PATH', help='path to a file with a list of files')
parser.add_argument('--model', required=True, metavar='PATH', help='path to NER model') parser.add_argument('--model', required=True, metavar='PATH', help='path or name of the model')
parser.add_argument('--pretrained_path', required=False, metavar='PATH', help='pretrained XLM-Roberta model path') parser.add_argument('--max_seq_length', required=False, default=None, metavar='N', type=int,
parser.add_argument('--max_seq_length', required=False, default=512, metavar='N', type=int, help='override default values of the max_seq_length')
help='the maximum total input sequence length after WordPiece tokenization.') parser.add_argument('--device', default=None, metavar='cpu|cuda',
parser.add_argument('--device', required=False, default="cpu", metavar='cpu|cuda', help='override default value of the device')
help='device type used for processing') group = parser.add_mutually_exclusive_group(required=False)
parser.add_argument('--squeeze', required=False, default=False, action="store_true", group.add_argument("--squeeze", dest="squeeze", default=None, action='store_true')
help='try to squeeze multiple examples into one Input Feature') group.add_argument("--no-squeeze", dest="squeeze", default=None, action='store_false')
parser.add_argument('--seed', required=False, default=377, metavar='N', type=int, parser.add_argument('--seed', required=False, default=None, metavar='N', type=int,
help='a seed used to initialize a number generator') help='a seed used to initialize a number generator')
return parser.parse_args() return parser.parse_args()
......
"""Script for evaluating models on a pre-defined set of data."""
import configparser
import os
import time
from poldeepner2.utils.data_utils import NerProcessor, create_dataset, \
convert_examples_to_features
from poldeepner2.utils.train_utils import evaluate_model
def main():
config_file = "config.cfg"
config = configparser.ConfigParser()
config.read(config_file)
pretrained_model = config['evaluate']['pretrained_path']
device = config['evaluate']['device']
squeeze = config.getboolean('evaluate', 'squeeze')
tag_column_index = config.getint('data', 'tag_column_index')
processor = NerProcessor()
data_path = config['data']['eval_path']
datasets = [data_path]
labels_list = \
processor.get_labels(datasets, config.getint('data',
'tag_column_index'))
num_labels = len(labels_list) + 1
hidden_size = config.getint('evaluate', 'hidden_size')
dropout = config.getfloat('train', 'dropout')
hidden_size = 1024 if 'large' in pretrained_model \
else (768 if 'base' in pretrained_model else hidden_size)
device = device
pretrained_path = config['model']['pretrained_path']
if pretrained_path.startswith("hf:"):
from poldeepner2.model.hf_for_token_calssification \
import HfModelForTokenClassification
pretrained_dir = pretrained_path.split(':')[1]
model = HfModelForTokenClassification(
pretrained_path=pretrained_dir, n_labels=num_labels,
hidden_size=hidden_size, dropout_p=dropout,
device=device)
elif pretrained_path.startswith("mt5:"):
from poldeepner2.model.mt5_for_token_calssification \
import Mt5ModelForTokenClassification
variant = pretrained_path.split(':')[1]
model = Mt5ModelForTokenClassification(
variant=variant, n_labels=num_labels,
hidden_size=hidden_size, dropout_p=dropout, device=device)
else:
from poldeepner2.model.xlmr_for_token_classification \
import XLMRForTokenClassification
pretrained_dir = pretrained_path
if ":" in pretrained_dir:
pretrained_dir = pretrained_dir.split(':')[1]
if not os.path.exists(pretrained_dir):
raise ValueError("RoBERTa language model not found on path '%s'"
% pretrained_dir)
model = XLMRForTokenClassification(
pretrained_path=pretrained_dir, n_labels=num_labels,
hidden_size=hidden_size, dropout_p=dropout,
device=device)
max_seq_len = config.getint('evaluate', 'max_seq_len')
eval_examples = processor.get_examples(datasets[0], tag_column_index,
'eval')
eval_features = convert_examples_to_features(
eval_examples, labels_list, max_seq_len, model.encode_word,
squeeze=squeeze)
eval_data = create_dataset(eval_features)
time_start = time.time()
f1, report = evaluate_model(model, eval_data, labels_list, 16, device)
time_end = time.time()
print(f' f1: {f1}')
print(f' report {report}')
print(f'time {time_end - time_start}')
if __name__ == "__main__":
main()
import os
from pathlib import Path

from poldeepner2.models import PolDeepNer2
from poldeepner2.utils.file_utils import download_file

resources = {
    "pdn2-v07-kpwr-n82-base-01": {
        "url": "https://s3.clarin-pl.eu/users/czuk/_public/pdn2/v07/pdn2-v07-kpwr-n82-base-01.zip",
        "compression": "zip",
        "extractToSubfolder": False
    },
    "pdn2-v07-cen-n82-base-01": {
        "url": "https://s3.clarin-pl.eu/users/czuk/_public/pdn2/v07/pdn2-v07-cen-n82-base-01.zip",
        "compression": "zip",
        "extractToSubfolder": False
    },
}


def load(path_or_name: str, device: str = None, resources_path: str = ".resources") -> PolDeepNer2:
    if Path(path_or_name).exists():
        path = path_or_name
    else:
        path = os.path.join(resources_path, path_or_name)
        if not os.path.exists(path):
            if path_or_name in resources:
                extract_to_subfolder = resources[path_or_name].get("extractToSubfolder", False)
                download_file(resources[path_or_name]["url"], path, resources[path_or_name]["compression"],
                              extract_to_subfolder)
            else:
                raise ValueError(f"Unknown resource name or invalid path: {path_or_name}")
    return PolDeepNer2(path, device=device)
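A short usage sketch for the load helper above. The key is one of the entries in the resources dictionary; on first use the archive is downloaded from the CLARIN-PL S3 bucket into .resources/. The processing methods on the returned object follow the PolDeepNer2 class shown earlier.

```python
import poldeepner2

# Either a local model directory or one of the keys from `resources`.
ner = poldeepner2.load("pdn2-v07-kpwr-n82-base-01", device="cpu")

# The loader returns a PolDeepNer2 instance, so the processing methods
# shown earlier (process, process_text, process_tokenized) are available.
print(ner.process_text("Jan z Warszawy pracuje w IBM."))
```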
"""A message of shame -- documentation must be completed."""
from poldeepner2.data.span import Span from poldeepner2.data.span import Span
from poldeepner2.data.token import Token from poldeepner2.data.token import Token
from poldeepner2.utils.annotation import Annotation from poldeepner2.utils.annotation import Annotation
class Document: class Document:
"""A message of shame -- documentation must be completed."""
def __init__(self, content: str,
tokens: [Token] = [], sentences: [Span] = [],
annotations: [Annotation] = []):
"""A message of shame -- documentation must be completed.
Args:
content:A message of shame -- documentation must be completed.
tokens:A message of shame -- documentation must be completed.
sentences:A message of shame -- documentation must be completed.
annotations:A message of shame -- documentation must be completed.
def __init__(self, content: str, tokens: [Token] = [], sentences: [Span] = [], annotations: [Annotation] = []): """
self.content = content self.content = content
self.tokens = tokens self.tokens = tokens
self.annotations = annotations self.annotations = annotations
......
"""A message of shame -- documentation must be completed."""
from dataclasses import dataclass from dataclasses import dataclass
@dataclass @dataclass
class Span: class Span:
""" """A message of shame -- documentation must be completed.
Args: Args:
orth (str): orth (str):A message of shame -- documentation must be completed.
start (int): Index of the first token. start (int): Index of the first token.
end (int): Index of the last token +1. end (int): Index of the last token +1.
""" """
start: int start: int
end: int end: int
def __str__(self): def __str__(self):
"""A message of shame -- documentation must be completed.
Returns:A message of shame -- documentation must be completed.
"""
return f"Span(begin={self.begin},end={self.end})" return f"Span(begin={self.begin},end={self.end})"
"""A message of shame -- documentation must be completed."""
from dataclasses import dataclass from dataclasses import dataclass
@dataclass @dataclass
class Token: class Token:
""" """A message of shame -- documentation must be completed.
Args: Args:
orth (str): orth (str):
start (int): Index of the first orth character in the original text. start (int): Index of the first orth character in the original text.
...@@ -12,7 +15,9 @@ class Token: ...@@ -12,7 +15,9 @@ class Token:
ws (str): White spaces after the token in the original text. ws (str): White spaces after the token in the original text.
morph (str): morph (str):
eos (str): True if the token ends a sentence. eos (str): True if the token ends a sentence.
""" """
orth: str orth: str
start: int start: int
end: int end: int
...@@ -22,4 +27,9 @@ class Token: ...@@ -22,4 +27,9 @@ class Token:
eos: bool = False eos: bool = False
def __str__(self): def __str__(self):
"""A message of shame -- documentation must be completed.
Returns:A message of shame -- documentation must be completed.
"""
return f"Token(orth={self.orth},lemma={self.lemma},morph={self.morph})" return f"Token(orth={self.orth},lemma={self.lemma},morph={self.morph})"
"""A message of shame -- documentation must be completed."""
import logging import logging
def debug_tokens_and_labels(tokenized_sentences, predictions): def debug_tokens_and_labels(tokenized_sentences, predictions):
"""A message of shame -- documentation must be completed.
Args:
tokenized_sentences:A message of shame -- documentation must be
completed.
predictions:A message of shame -- documentation must be completed.
"""
for tokens, labels in zip(tokenized_sentences, predictions): for tokens, labels in zip(tokenized_sentences, predictions):
for token, label in zip(tokens, labels): for token, label in zip(tokens, labels):
logging.debug(f"TOKENIZATION: {token}\t{label}") logging.debug(f"TOKENIZATION: {token}\t{label}")
......