# Bert Insights
## Usage
1. Compute representations on the dataset (this can take some time):
```bash
compute.sh --data_dir <path_to_dataset> --output_dir <output_path> --model <path to model.pt> --gpu_device <gpu to use> [additional options]
```
2. Use the precomputed data to generate the insights (a relatively fast process):
```bash
compute.sh --data_dir <path_to_dataset> --output_dir <output_path> --model <path to model.pt> --gpu_device <gpu to use> [additional options]
```

## Important Notes
The package depends on the non-public `bert_document_classifier` repository, which Docker must be able to access in order to embed the package into the image. This requires an SSH public key registered on `https://gitlab.clarin-pl.eu/`. By default the private key is taken from `~/.ssh/id_rsa`, but you can pass `--ssh_key path/to/file` to use a different key. You also need a Docker version that supports the `DOCKER_BUILDKIT` environment variable, since BuildKit is how the SSH key is securely passed into the build environment.
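For reference, a BuildKit-based build that forwards an SSH key looks roughly like this (the image tag and build context are placeholders, not part of this project; the repository's own build tooling may wrap this step for you):

```shell
# Enable BuildKit so the --ssh flag is available.
export DOCKER_BUILDKIT=1

# Forward the key to RUN steps in the Dockerfile that declare --mount=type=ssh.
docker build --ssh default=~/.ssh/id_rsa -t bert-insights .
```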

## Dataset format
The format is identical to the one used to train `bert_document_classifier` (please refer to https://gitlab.clarin-pl.eu/mipo57/bert_document_classifier), but only the single-label format is accepted. BERT Insights works with multi-label models without problems; it simply needs a single-label format to compute the representations. The easiest way to achieve that is to duplicate examples that have more than one label and remove examples without a label. E.g.:
```json
{
    "train": {
        "1.txt": ["A", "B"],
        "2.txt": []
    },
    "test": {
        "3.txt": ["A"]
    }
}
```
should be transformed to something like:
```json
{
    "train": {
        "1_1.txt": "A",
        "1_2.txt": "B"
    },
    "test": {
        "3.txt": "A"
    }
}
```
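The transformation above can be sketched in Python; note that the function name and the `_<index>` filename-suffix convention are assumptions taken from the example, not an API shipped with the package:

```python
def to_single_label(splits):
    """Convert a multi-label dataset dict to the single-label form.

    Examples with several labels are duplicated, one copy per label,
    with an index appended to the file stem; unlabeled examples are dropped.
    """
    out = {}
    for split, files in splits.items():
        converted = {}
        for name, labels in files.items():
            if not labels:
                continue  # drop examples without any label
            if len(labels) == 1:
                converted[name] = labels[0]
            else:
                stem, _, ext = name.rpartition(".")
                for i, label in enumerate(labels, start=1):
                    converted[f"{stem}_{i}.{ext}"] = label
        out[split] = converted
    return out
```

Run it over the parsed JSON and dump the result back to disk with `json.dump` before computing representations.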