Table of Contents

  1. Data format
  2. Choose architecture
  3. Config files
  4. Scripts

Data format

To train a model or evaluate an existing one, the expected input consists of text files:

  • train.txt
  • dev.txt
  • test.txt

These files should be located in the same directory and will be fed into the model during training.

The expected file format is as follows:

  • Each line consists of a pair <token> <tag>, separated by a space.
  • Consecutive lines represent the entities/words to be tagged in the same sentence, in their original order.
  • The beginning of a new sentence is indicated by a metadata line: # sent_id = <n>, where <n> is the unique identifier for the sentence in the file.
  • An empty line marks the end of each sentence.
  • Currently supported tags follow the IOB format.

Example of train.txt file:

# sent_id = 0
SOCCER O
- O
JAPAN B-LOC
GET O
LUCKY O
WIN O
, O
CHINA B-LOC
IN O
SURPRISE O
DEFEAT O
. O

# sent_id = 1
Nadim B-PER
Ladki I-PER
...

Example training files can also be found in the repository under /notebooks/example_data.
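
For orientation, the following minimal sketch (plain Python, not part of the toolkit) shows one way to parse a file in this format into sentences of (token, tag) pairs:

import sys

def read_iob_file(path, encoding="utf-8"):
    """Parse a train/dev/test file into (sent_id, [(token, tag), ...]) entries."""
    sentences = []
    sent_id, current = None, []
    with open(path, encoding=encoding) as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line.startswith("# sent_id"):
                # A metadata line starts a new sentence.
                sent_id = line.split("=", 1)[1].strip()
                current = []
            elif line == "":
                # An empty line ends the current sentence.
                if current:
                    sentences.append((sent_id, current))
                    current = []
            else:
                # Each remaining line is "<token> <tag>".
                token, tag = line.rsplit(" ", 1)
                current.append((token, tag))
    if current:
        sentences.append((sent_id, current))
    return sentences

if __name__ == "__main__":
    for sid, pairs in read_iob_file(sys.argv[1]):
        print(sid, pairs)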

Choose architecture

When training a new model from scratch, the user is expected to choose the model architecture. However, the following architecture is recommended:

  • BertEmbedder = Bert Model + Mean Token Pooling
  • CharacterEmbedder = Combo character embedder
  • Classifier = CRF + BiLSTM

It can be created using the following part of the config:

"data": {
    "use_char_level_embeddings": true, 
    "use_start_end_token": true, 
    "tokenize_entities": true, 
    ... # other parameters
        }
...  
"model": {
    "bert_embedder": {
        "pretrained_model_name": "allegro/herbert-base-cased", # or any other mode from huggingface
        "pretrained_model_type": "AutoModel", 
        "projection_dimension": None,
        "freeze_bert": True,
        "token_pooling": True,
        "pooling_strategy": "max"
                     },
    "char_embedder": {"type" : "combo",
                      "char_embedding_dim":  64
                     },
    "classifier": {"type" : "crf",
                   "to_tag_space" :  "bilstm"},
    "dropout": 0.1
            },
...

(Figure: NerBuilder.png - overview of the possible NER architectures)

The image above provides an overview of possible architectures for NER. Each architecture comprises four main components that can be customized either through configuration files or directly during instantiation:

NerTokenizer A module for preparing data for training. When creating an instance, specify:

  • the backbone tokenizer corresponding to the BERT model in the BERT embedder
  • the mappings between characters and ids and between tags and ids (an illustrative sketch of these mappings follows this list).
  • whether start and end tokens should be added to each sentence, which is required if the CRF layer is used.
  • whether entities should be tokenised before being passed to the backbone BERT model, i.e. whether each entity used as input to the BERT model is additionally split into subword tokens. If so, the token pooling strategy should be defined for the BERT embedder.
  • whether character-level embeddings will be used.
  • the language used for string handling. This is the language of the LAMBO model used to segment the input string.
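
The character and tag mappings are plain dictionaries from symbols to integer ids, stored as char_to_id.json and label_to_id.json. Their exact content depends on the training data; the sketch below only illustrates their shape (the tag set matches the train.txt example above and is otherwise hypothetical):

# Illustrative only: the real mappings are built from the training data.
label_to_id = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-LOC": 3,
    "I-LOC": 4,
}

char_to_id = {
    "a": 0,
    "b": 1,
    "c": 2,
    # ... one entry per character seen in the training data
}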

Character Embedder A module for creating entity representations based on characters. Currently, the character embedder used in the COMBO model is supported.

Bert Embedder A module that creates entity vector representations of a given dimension using the last layer of the BERT model. When additional tokenisation of entities is requested, this module aggregates token vectors to the entity level via max pooling or mean pooling.

Classifier A module that concatenates vector representations from BERT and optionally Character Embedder through the BiLSTM or Transformer layer and transforms them to tag space using the Linear layer or CRF layer.
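
To make the pooling step concrete, the following stand-alone sketch (plain PyTorch, not the toolkit's own code) aggregates the subword vectors of one entity back into a single entity vector with mean or max pooling, which is what the pooling_strategy option controls:

import torch

def pool_subword_vectors(token_vectors: torch.Tensor, strategy: str = "max") -> torch.Tensor:
    """Aggregate subword vectors of one entity (shape [n_subwords, dim]) into one vector (shape [dim])."""
    if strategy == "mean":
        return token_vectors.mean(dim=0)
    if strategy == "max":
        return token_vectors.max(dim=0).values
    raise ValueError(f"Unknown pooling strategy: {strategy}")

# Example: an entity split into 3 subword tokens, each with a 768-dimensional
# vector taken from the last BERT layer.
subwords = torch.randn(3, 768)
print(pool_subword_vectors(subwords, strategy="max").shape)  # torch.Size([768])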

Config files

For quick prototyping, model creation and training parameters can be set using configuration files in JSON format. A template for building config files is shown below:

{
 "data": {
   "path_data": "./data/pl/",
   "use_char_level_embeddings": true,
   "use_start_end_token": true,
   "tokenize_entities": true,
   "batch_size": 32,
   "encoding": "utf-8",
   "num_workers": 1
 },

 "model": {
   "bert_embedder": {
       "pretrained_model_name": "allegro/herbert-base-cased",
       "pretrained_model_type": "AutoModel",
       "projection_dimension": null,
       "freeze_bert": true,
       "token_pooling": true,
       "pooling_strategy": "max"
                    },
   "char_embedder": {"type" : "combo",
                     "char_embedding_dim":  64
                    },
   "classifier": {"type" : "crf",
                  "to_tag_space" :  "bilstm"},
   "dropout": 0.1
           },

 "loss": "ce",
 "learning_rate": 0.001,

 "callbacks": {"FixedProgressBar": true,
               "LearningRateMonitor": {"logging_interval":"epoch"},
               "ModelCheckpoint": {"monitor": "validation_f1",
                                   "mode": "max",
                                   "save_top_k": 1,
                                   "save_weights_only": true,
                                   "filename" : "best_model"},
               "EarlyStopping": {"monitor": "validation_f1",
                                 "mode": "max",
                                 "min_delta": 0.001,
                                 "patience": 6}},

 "trainer": {"devices": [0],
             "max_epochs": 50,
             "accelerator": "cuda",
             "log_every_n_steps": 10}
}

The configuration file should contain nested dictionaries detailing the parameters for the various modules and hyperparameters. Refer to the Config Files documentation for more information on how to build a valid config file.

Most parameters are self-explanatory, but some require additional notes:

  • "model"-"bert_embedder"-"pretrained_model_name" specifies the name of the model from the Hugging Face library.
  • "model"-"bert_embedder"-"pretrained_model_type" is the class used for loading the model. Currently, two types are supported: BertModel or AutoModel.
  • "model"-"bert_embedder"-"projection_dimension" specifies the desired dimension of the output vectors from the Bert Embedder.
  • "model"-"bert_embedder"-"pooling_strategy" can take either 'mean' or 'max' values. It defines the type of pooling applied to the output vectors from the Bert Embedder to obtain word representations. This requires both token_pooling and tokenize_entities to be set to true.
  • "model"-"char_embedder"-"type" defines how to obtain word representations based on characters. Currently, it takes either 'combo' or 'contextualized' values. In both cases, char_embedding_dim should be specified. For the 'contextualized' option, you'll also need to specify how many characters are considered by adjusting the context_window parameter.
  • "model"-"classifier"-"type" can take one of two values: crf or vanilla, which defines the last layer of the model. Additionally, you'll need to specify to_tag_space, which outlines additional layers in the classifier module. It can take values like transformer, bilstm, or linear.
  • "model"-"dropout" specifies the dropout value applied to the output vectors from the Bert Embedder.
  • loss takes either 'ce' for Cross Entropy Loss or 'Focal' for Focal Loss. If the classifier type is crf, this parameter is ignored.
  • callbacks are callbacks used during training along with their parameters. Supported callbacks are LearningRateMonitor, ModelCheckpoint, EarlyStopping, and FixedProgressBar, which fixes a bug related to the progress bar on some terminals.
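
As a rough illustration of the constraints listed above, the following minimal sketch (plain Python with the standard json module, not the project's actual --check_config implementation) loads a config file and verifies a few of the listed value ranges:

import json

def check_config(path):
    """Load a config file and check a few of the constraints listed above."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)

    bert = config["model"]["bert_embedder"]
    assert bert["pretrained_model_type"] in ("BertModel", "AutoModel")
    assert bert["pooling_strategy"] in ("mean", "max")
    if bert["token_pooling"]:
        # Pooling only makes sense if entities are split into subword tokens.
        assert config["data"]["tokenize_entities"], "token_pooling requires tokenize_entities"

    classifier = config["model"]["classifier"]
    assert classifier["type"] in ("crf", "vanilla")
    assert classifier["to_tag_space"] in ("transformer", "bilstm", "linear")

    # "loss" is ignored when the classifier type is "crf".
    if classifier["type"] != "crf":
        assert config["loss"] in ("ce", "Focal")

    return config

config = check_config("./configs/default_config.json")
print(config["learning_rate"])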

Scripts

Models can be trained and evaluated directly from the terminal. Three scripts are available:

  • find_lr.py - script for finding the optimal learning rate
Arguments

--config_path The path to the JSON configuration file that defines the model and training settings. This argument is required.

Example: --config_path="./config.json"

--data_path: The path to the data directory. If this argument is not provided, the data path is taken from the configuration file.

Example: --data_path="./data/"

--check_config A flag to enable additional configuration validation. If this flag is provided, the script will perform additional checks on the configuration settings.

Example: --check_config

Example usage
python find_lr.py --config_path="./config.json" --check_config

  • train.py - script for training the model
Arguments

--config_path Path to the configuration file for training the model.

Default: ./configs/default_config.json

Example: --config_path="./configs/my_config.json"

--n_reruns Number of times the model training should be rerun.

Default: 1

Example: --n_reruns=5

--data_path Path to the data directory. If not provided, it is taken from the configuration file.

Example: --data_path="./data/"

--serialization_dir Directory where the model should be saved.

Default: ./models/

Example: --serialization_dir="./my_models/"

--check_config A flag to indicate whether to check the configuration constraints.

Example: --check_config

--use_wandb_logger A flag to specify whether to use Weights and Biases (wandb) for logging. Otherwise, the TensorBoard logger will be used.

Example: --use_wandb_logger

--wandb_project_name Name of the Weights and Biases (wandb) project for logging.

Default: NER_ipi_pan3

Example: --wandb_project_name="My_NER_Project"

Example usage
python train.py --config_path="./configs/my_config.json" --n_reruns=3

  • eval.py - script for evaluating a model. The model path is expected to point to a directory representing the model as a whole. It should contain: best_model.ckpt (the model weights), char_to_id.json (if the model uses character embeddings), config.json (created during training), and label_to_id.json (the mapping from tags to ids).
Arguments

--config_path Path to the configuration file for training the model.

--model_path: The path to the folder containing the pre-trained NER model and its associated files.

Default: ./models/pl_example

Example: --model_path="./models/my_pretrained_model"

--data_file_path: The path to the dataset file for prediction.

Default: ./data/pl/test.txt

Example: --data_file_path="./data/custom_test.txt"

--device: The computational device for prediction. Use -1 for CPU. Any other integer will correspond to a specific CUDA device.

Default: 0

Example: --device=1

--batch_size: The number of data points processed in each batch during prediction.

Default: 24

Example: --batch_size=32

--encoding: The encoding used to read the dataset file.

Default: utf-8

Example: --encoding="ascii"

Example usage
python eval.py --model_path="./models/custom_model" --data_file_path="./data/custom_test.txt" --device=-1