# BERT Inspect
BERT-based interpretable feature extractor. This package combines BERT fine-tuning and Integrated Gradients-based analysis of the BERT model into a single NLP worker.
## Outputs

    ├── clouds
    │   ├── class1
    │   │   ├── keyword_1.png
    │   │   ├── keyword_1.json
    │   │   ├── keyword_2.png
    │   │   └── keyword_2.json
    │   └── class2
    │       ├── keyword_1.png
    │       ├── keyword_1.json
    │       ├── keyword_2.png
    │       └── keyword_2.json
    ├── positive.json
    └── negative.json
- `positive.json`: JSON file with the top average positive attributions of tokens for each class
- `negative.json`: JSON file with the top average negative attributions of tokens for each class
- `clouds/<classname>/*.png`: rendered word cloud of the words defining the context for each top token
- `clouds/<classname>/*.json`: JSON-formatted word cloud of the words defining the context for each top token
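The JSON outputs can be inspected with any standard tooling. A minimal sketch in Python is shown below; the `outputs` directory name and the example paths are assumptions, and the exact schema of each file is not documented here, so the snippet only pretty-prints whatever structure it finds.

```python
import json
from pathlib import Path

output_dir = Path("outputs")  # assumed location of the worker's results

# Top average positive attributions per class; the schema is not documented
# here, so we only pretty-print whatever structure the file contains.
with open(output_dir / "positive.json", encoding="utf-8") as f:
    positive = json.load(f)
print(json.dumps(positive, indent=2, ensure_ascii=False))

# JSON-formatted word cloud for one of the top tokens of class1.
with open(output_dir / "clouds" / "class1" / "keyword_1.json", encoding="utf-8") as f:
    cloud = json.load(f)
print(json.dumps(cloud, indent=2, ensure_ascii=False))
```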
## Mount directories

- `/workdir/cache`: place where the `base_model` will be stored
## Config

### Fine-tuning stage
- `base_model` - pretrained Hugging Face BERT model (default `dkleczek/bert-base-polish-cased-v1`)
- `pooling` - type of pooling applied to the token embeddings produced by BERT before the classification head (one of `mean`, `cls`, `max`)
- `layers_frozen` - how many BERT layers are frozen during fine-tuning (max 12, meaning that only the classification head will be trained)
- `max_epochs` - maximum number of training epochs
- `early_stopping` - if true, training is stopped if the F1 score does not improve for 3 epochs
- `truncation` - type of truncation applied to documents longer than 510 tokens (`end` / `front`)
- `valid_frac` - fraction of the data used for validation
- `learning_rate` - learning rate
- `classificator_size` - number of hidden neurons in the classification head
- `dropout` - dropout rate
- `batch_size` - batch size (~10 is the maximum on a 2080 Ti with no frozen layers)
- `num_workers` - number of CPU workers feeding the data from disk
- `weighted_sampling` - if true, data is sampled with replacement so that samples from every class are drawn with equal probability
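For illustration only, a fine-tuning configuration built from the parameters above might look like the sketch below, written here as a Python dict. The only documented default is `base_model`; every other value is an assumption chosen to show plausible settings, and the actual config file format used by the worker may differ.

```python
# Illustrative fine-tuning configuration; apart from base_model's documented
# default, all values are assumptions, not the package's defaults.
fine_tuning_config = {
    "base_model": "dkleczek/bert-base-polish-cased-v1",
    "pooling": "mean",           # one of: mean, cls, max
    "layers_frozen": 6,          # 12 trains only the classification head
    "max_epochs": 20,
    "early_stopping": True,      # stop if F1 does not improve for 3 epochs
    "truncation": "end",         # or "front", for documents longer than 510 tokens
    "valid_frac": 0.1,
    "learning_rate": 2e-5,
    "classificator_size": 256,
    "dropout": 0.1,
    "batch_size": 10,            # ~10 is the maximum on a 2080 Ti with no frozen layers
    "num_workers": 4,
    "weighted_sampling": True,
}
```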
### Analysis stage
- `num_steps` - number of Integrated Gradients steps
- `attention_layer_id` - layer from which attentions are extracted (0 is the most interpretable, as no word diffusion has taken place at that stage)
- `internal_batch` - batch size used for the Integrated Gradients steps
- `device_name` - name of the device on which the embedding & Integrated Gradients computations are performed (default `cuda:0`)
- `subword_tokens` - if set to true, subword tokens will be used (e.g. `Kowal ##ski`)
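To make the analysis parameters concrete, the sketch below shows how `num_steps`, `internal_batch`, and `device_name` typically map onto a Captum `LayerIntegratedGradients` call over a Hugging Face BERT model. This is only an illustration under the assumption that the analysis stage uses Captum-style Integrated Gradients; the package's actual implementation, model wiring, and baseline choice may differ.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device_name = "cuda:0" if torch.cuda.is_available() else "cpu"  # config: device_name
model_name = "dkleczek/bert-base-polish-cased-v1"               # config: base_model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device_name).eval()

def forward_fn(input_ids, attention_mask):
    # Return class logits so attributions can target a specific class index.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

encoded = tokenizer("Przykładowy dokument do analizy.", return_tensors="pt").to(device_name)
# Baseline: the same sequence with every token replaced by [PAD].
baseline_ids = torch.full_like(encoded["input_ids"], tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward_fn, model.bert.embeddings)
attributions = lig.attribute(
    inputs=encoded["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(encoded["attention_mask"],),
    target=0,                # class index to explain
    n_steps=50,              # config: num_steps
    internal_batch_size=8,   # config: internal_batch
)

# One score per (sub)word token, summed over the embedding dimension.
token_scores = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())
print(list(zip(tokens, token_scores.tolist())))
```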