Commit 41374773 authored by Łukasz Pszenny, committed by Łukasz Pszenny

instruction adjustment after switching to UD_29 and download fix from new source

Update README.md

small fix in performance.md

small fix in performance.md - got rid of models that were trained on data with sentences of form "____"
parent 4dc8c92f
1 merge request: !44 Switching to UD 2.9
@@ -24,7 +24,7 @@ Run the following commands in your Python console to make predictions with a pre
 ```python
 from combo.predict import COMBO
-nlp = COMBO.from_pretrained("polish-herbert-base")
+nlp = COMBO.from_pretrained("polish-herbert-base-ud29")
 sentence = nlp("COVID-19 to ostra choroba zakaźna układu oddechowego wywołana zakażeniem wirusem SARS-CoV-2.")
 ```
 Predictions are accessible as a list of token attributes:
......
@@ -9,14 +9,21 @@ from requests import adapters, exceptions
 logger = logging.getLogger(__name__)
-_URL = "http://mozart.ipipan.waw.pl/~mklimaszewski/models/{name}.tar.gz"
+DATA_TO_PATH = {
+    "enhanced" : "iwpt_2020",
+    "iwpt2021" : "iwpt_2021",
+    "ud25" : "ud_25",
+    "ud27" : "ud_27",
+    "ud29" : "ud_29"}
+_URL = "http://s3.clarin-pl.eu/models/combo/{data}/{model}.tar.gz"
 _HOME_DIR = os.getenv("HOME", os.curdir)
 _CACHE_DIR = os.getenv("COMBO_DIR", os.path.join(_HOME_DIR, ".combo"))
 def download_file(model_name, force=False):
     _make_cache_dir()
-    url = _URL.format(name=model_name)
+    data = model_name.split("-")[-1]
+    url = _URL.format(name=model_name, data=DATA_TO_PATH[data])
     local_filename = url.split("/")[-1]
     location = os.path.join(_CACHE_DIR, local_filename)
     if os.path.exists(location) and not force:
......
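The `download_file` change above takes the last hyphen-separated part of the model name, maps it to a data directory, and fills both into the new URL template. The following is a minimal sketch of that mapping, not COMBO's own code: `build_model_url` and `URL_TEMPLATE` are hypothetical names, while the table and the template string are copied from the diff. Note that `str.format` fills a placeholder only from a keyword argument of the same name, so the sketch passes `model=` and `data=` to match `{model}` and `{data}`.

```python
# Illustrative sketch only; build_model_url is a hypothetical helper, not part of COMBO.
DATA_TO_PATH = {
    "enhanced": "iwpt_2020",
    "iwpt2021": "iwpt_2021",
    "ud25": "ud_25",
    "ud27": "ud_27",
    "ud29": "ud_29",
}
URL_TEMPLATE = "http://s3.clarin-pl.eu/models/combo/{data}/{model}.tar.gz"


def build_model_url(model_name: str) -> str:
    """Return the download URL for a name such as 'polish-herbert-base-ud29'."""
    # The data version is assumed to be the last hyphen-separated token.
    suffix = model_name.split("-")[-1]
    if suffix not in DATA_TO_PATH:
        raise ValueError(
            f"Unknown data suffix {suffix!r} in model name {model_name!r}"
        )
    # Keyword names must match the placeholders in the template.
    return URL_TEMPLATE.format(model=model_name, data=DATA_TO_PATH[suffix])
```

Under these assumptions, `build_model_url("polish-herbert-base-ud29")` gives the same address as the `wget` example in the models documentation, and a name without a known suffix (for example the old `polish-herbert-base`) fails with an explicit error rather than a bare `KeyError`.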
@@ -2,35 +2,35 @@
 COMBO provides pre-trained models for:
 - morphosyntactic prediction (i.e. part-of-speech tagging, morphosyntactic analysis, lemmatisation and dependency parsing) trained on the treebanks from [Universal Dependencies repository](https://universaldependencies.org) ([Zeman et al. 2020](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424)),
-- enhanced dependency parsing trained on IWPT 2020 shared task [data](https://universaldependencies.org/iwpt20/data.html) ([Bouma et al. 2020](https://www.aclweb.org/anthology/2020.iwpt-1.16.pdf)).
+- enhanced dependency parsing trained on IWPT 2020 shared task [data](https://universaldependencies.org/iwpt20/data.html) ([Bouma et al. 2020](https://www.aclweb.org/anthology/2020.iwpt-1.16.pdf)) and IWPT 2021 shared task [data](https://universaldependencies.org/iwpt21/data.html).
 ## Pre-trained models
-**Morphosyntactic prediction models** trained on the selected UD treebanks version 2.7 and their **evaluation results** are listed in [Model performance (UD2.7)](https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/blob/master/docs/performance.md) table.
+**Morphosyntactic prediction models** trained on the selected UD treebanks version 2.9 and their **evaluation results** are listed in [Model performance (UD2.9)](https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/blob/master/docs/performance.md) table.
-**Morphosyntactic prediction models** trained on the seleted UD treebanks version 2.5 and **enhanced parsing models** are listed in the spreadsheets: [UD2.5-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=0) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
+**Morphosyntactic prediction models** trained on the seleted UD treebanks version 2.5, version 2.7 and **enhanced parsing models** are listed in the spreadsheets: [UD2.7-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1459988845), [UD2.5-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=0) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
 ### License
 Models are distributed under the same license as datasets used for their training.
-See [Universal Dependencies v2.7 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7) and [Universal Dependencies v2.5 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5) for details.
+See [Universal Dependencies v2.9 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.9), [Universal Dependencies v2.7 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7) and [Universal Dependencies v2.5 License Agreement](https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5) for details.
 ## Automatic download
-The pre-trained models can be automatically downloaded with the `from_pretrained` method in the Python mode. Select the model name of a pre-trained model (see the column **Model name** in [Model performance (UD2.7)](https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/blob/master/docs/performance.md), [UD2.5-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=0) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324)) and pass the name as an attribute of the `from_pretrained` method:
+The pre-trained models can be automatically downloaded with the `from_pretrained` method in the Python mode. Select the model name of a pre-trained model (see the column **Model name** in [Model performance (UD2.9)](https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/blob/master/docs/performance.md), [UD2.7-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1459988845), [UD2.5-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=0) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324)) and pass the name as an attribute of the `from_pretrained` method:
 ```python
 from combo.predict import COMBO
-nlp = COMBO.from_pretrained("polish-herbert-base")
+nlp = COMBO.from_pretrained("polish-herbert-base-ud29")
 ```
 If the model name doesn't match any model on the pre-trained model lists, COMBO looks for a model in local env.
 ## Manual download
-If you want to use COMBO in the command-line mode, you need to manually download a pre-trained model. The pre-trained models can be manually downloaded to a local disk with the `wget` package. The links to the pre-trained models are listed in the column **Model name** in [Model performance (UD2.7)](https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/blob/master/docs/performance.md), or **Model link** in [UD2.5-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=0) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
+If you want to use COMBO in the command-line mode, you need to manually download a pre-trained model. The pre-trained models can be manually downloaded to a local disk with the `wget` package. The links to the pre-trained models are listed in the column **Model name** in [Model performance (UD2.9)](https://gitlab.clarin-pl.eu/syntactic-tools/combo/-/blob/master/docs/performance.md), or **Model link** in [UD2.7-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1459988845),[UD2.5-trained COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=0) and [enhanced COMBO models](https://docs.google.com/spreadsheets/d/1WFYc2aLRa1jw7le030HOacv9fc4zmtqiZtRQY6gl5mc/edit#gid=1757180324).
 ```bash
-wget http://mozart.ipipan.waw.pl/~mklimaszewski/models/polish-herbert-base.tar.gz
+wget http://s3.clarin-pl.eu/models/combo/ud_29/polish-herbert-base-ud29.tar.gz
 ```
 The path to the downloaded model should be passed as a parameter for COMBO in CLI (see [prediction doc](prediction.md)).
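Where `wget` is not available, the same archive can be fetched from Python. The following is a minimal sketch using only the standard library. `fetch_model` and its `cache_dir` and `force` parameters are hypothetical names chosen for illustration; this is not COMBO's own downloader. It skips the transfer when the archive already exists locally, unless `force` is set.

```python
import os
import shutil
import urllib.request


def fetch_model(url: str, cache_dir: str, force: bool = False) -> str:
    """Download `url` into `cache_dir` and return the local path (hypothetical helper)."""
    os.makedirs(cache_dir, exist_ok=True)
    # Name the local file after the last path segment of the URL.
    location = os.path.join(cache_dir, url.split("/")[-1])
    if os.path.exists(location) and not force:
        return location  # reuse the archive that is already on disk
    # Stream the response to disk instead of loading it into memory.
    with urllib.request.urlopen(url) as response, open(location, "wb") as out:
        shutil.copyfileobj(response, out)
    return location


# Example call (performs a network request, so it is left commented out):
# path = fetch_model(
#     "http://s3.clarin-pl.eu/models/combo/ud_29/polish-herbert-base-ud29.tar.gz",
#     os.path.join(os.path.expanduser("~"), ".combo"),
# )
```

The returned path can then be passed to COMBO in the command-line mode, as described in the prediction doc.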
@@ -5,7 +5,7 @@ The pre-trained models can be automatically downloaded with the `from_pretrained
 ```python
 from combo.predict import COMBO
-nlp = COMBO.from_pretrained(`polish-herbert-base`)
+nlp = COMBO.from_pretrained("polish-herbert-base-ud29")
 sentence = nlp("Sentence to parse.")
 ```
......