@@ -43,9 +43,9 @@ Now you need to create a segmenter by providing the language your text is in, e.
...
```
lambo = Lambo.get('English')
```
This will (if necessary) download the appropriate model from the online repository and load it. Note that you can use any language name (e.g. `Ancient_Greek`) or ISO 639-1 code (e.g. `fi`) from [`languages.txt`](src/lambo/resources/languages.txt).
Alternatively, you can select a specific model by specifying the LAMBO variant (`LAMBO` or `LAMBO_no_pretraining`) and a training dataset from [`languages.txt`](src/lambo/resources/languages.txt):
```
lambo = Lambo.get('LAMBO-UD_Polish-PDB')
```
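The identifier in the example above appears to join the variant name and the dataset name with a hyphen. The following is a minimal sketch under that assumption; `model_name` and `VARIANTS` are hypothetical names introduced for illustration and are not part of LAMBO.

```python
# Hypothetical helper, not part of LAMBO. It assumes a model identifier
# has the form "<variant>-<dataset>", as in 'LAMBO-UD_Polish-PDB'.
VARIANTS = ("LAMBO", "LAMBO_no_pretraining")  # the two variants named above


def model_name(variant: str, dataset: str) -> str:
    """Join a LAMBO variant and a training dataset name into one identifier."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{variant}-{dataset}"


name = model_name("LAMBO", "UD_Polish-PDB")
print(name)  # LAMBO-UD_Polish-PDB
```

The resulting string could then be passed to `Lambo.get(name)`. Check `languages.txt` for the dataset names that are actually available.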
...
@@ -121,9 +121,9 @@ You don't have to rely on the models trained so far in COMBO. You can use the in
...
- `run_evaluation.py` -- evaluate existing models using UD gold standard.
Note that you can also extend LAMBO by modifying the data files that specify strings that will be treated specially:
- [`emoji.tab`](src/lambo/resources/emoji.tab) includes a list of emojis (they will always be treated as separate tokens),
- [`pauses.txt`](src/lambo/resources/pauses.txt) includes a list of verbal pauses (they will also be separated, but not split),
- [`turn_regexp.txt`](src/lambo/resources/turn_regexp.txt) enumerates regular expressions used to split turns (such as a double newline),