Commit 75ba8870 authored by piotrmp

Language list updated in line with UD 2.13.

parent f5151bf6
Pipeline #16989 passed with stage in 22 seconds
...@@ -28,6 +28,8 @@ if __name__=='__main__': ...@@ -28,6 +28,8 @@ if __name__=='__main__':
parts = line.split() parts = line.split()
model = parts[0] model = parts[0]
language = parts[1] language = parts[1]
if model != 'UD_Polish-PDB':
continue
if (outpath / (model + '.pth')).exists(): if (outpath / (model + '.pth')).exists():
continue continue
print(str(i) + '/' + str(len(lines)) + '========== ' + model + ' ==========') print(str(i) + '/' + str(len(lines)) + '========== ' + model + ' ==========')
......
# Format: <UD training corpus> <ISO 639-1 code (for OSCAR pretraining)> <Language name> <Recommended (chosen by size)> # Format: <UD training corpus> <ISO 639-1 code (for OSCAR pretraining)> <Language name> <Recommended (chosen by size)>
UD_Afrikaans-AfriBooms af Afrikaans UD_Afrikaans-AfriBooms af Afrikaans
UD_Ancient_Greek-PROIEL ? Ancient_Greek * UD_Ancient_Greek-PROIEL ? Ancient_Greek *
UD_Ancient_Greek-PTNK ? Ancient_Greek
UD_Ancient_Greek-Perseus ? Ancient_Greek UD_Ancient_Greek-Perseus ? Ancient_Greek
UD_Ancient_Hebrew-PTNK ? Ancient_Hebrew UD_Ancient_Hebrew-PTNK ? Ancient_Hebrew
#UD_Arabic-NYUAD ar Arabic #UD_Arabic-NYUAD ar Arabic
...@@ -24,7 +25,7 @@ UD_Danish-DDT da Danish ...@@ -24,7 +25,7 @@ UD_Danish-DDT da Danish
UD_Dutch-Alpino nl Dutch * UD_Dutch-Alpino nl Dutch *
UD_Dutch-LassySmall nl Dutch UD_Dutch-LassySmall nl Dutch
UD_English-Atis en English UD_English-Atis en English
UD_English-ESL en English UD_English-ESLSpok en English
UD_English-EWT en English * UD_English-EWT en English *
UD_English-GUM en English UD_English-GUM en English
UD_English-GUMReddit en English UD_English-GUMReddit en English
...@@ -35,29 +36,29 @@ UD_Estonian-EWT et Estonian ...@@ -35,29 +36,29 @@ UD_Estonian-EWT et Estonian
UD_Faroese-FarPaHC fo Faroese UD_Faroese-FarPaHC fo Faroese
UD_Finnish-FTB fi Finnish UD_Finnish-FTB fi Finnish
UD_Finnish-TDT fi Finnish * UD_Finnish-TDT fi Finnish *
UD_French-FTB fr French
UD_French-GSD fr French * UD_French-GSD fr French *
UD_French-ParTUT fr French UD_French-ParTUT fr French
UD_French-ParisStories fr French UD_French-ParisStories fr French
UD_French-Rhapsodie fr French UD_French-Rhapsodie fr French
UD_French-Sequoia fr French UD_French-Sequoia fr French
UD_Galician-CTG gl Galician UD_Galician-CTG gl Galician
UD_German-GSD de German UD_German-GSD de German *
UD_German-HDT de German * UD_German-HDT de German
UD_Gothic-PROIEL ? Gothic UD_Gothic-PROIEL ? Gothic
UD_Greek-GDT el Greek UD_Greek-GDT el Greek
UD_Hebrew-HTB he Hebrew * UD_Hebrew-HTB he Hebrew *
UD_Hebrew-IAHLTwiki he Hebrew UD_Hebrew-IAHLTwiki he Hebrew
UD_Hindi-HDTB hi Hindi UD_Hindi-HDTB hi Hindi
UD_Hindi_English-HIENCS ? Hindi_English
UD_Hungarian-Szeged hu Hungarian UD_Hungarian-Szeged hu Hungarian
UD_Icelandic-GC is Icelandic UD_Icelandic-GC is Icelandic
UD_Icelandic-IcePaHC is Icelandic * UD_Icelandic-IcePaHC is Icelandic *
UD_Icelandic-Modern is Icelandic UD_Icelandic-Modern is Icelandic
UD_Indonesian-GSD id Indonesian UD_Indonesian-GSD id Indonesian
UD_Irish-IDT ga Irish UD_Irish-IDT ga Irish *
UD_Irish-TwittIrish ga Irish
UD_Italian-ISDT it Italian * UD_Italian-ISDT it Italian *
UD_Italian-MarkIT it Italian UD_Italian-MarkIT it Italian
UD_Italian-Old it Italian
UD_Italian-ParTUT it Italian UD_Italian-ParTUT it Italian
UD_Italian-PoSTWITA it Italian UD_Italian-PoSTWITA it Italian
UD_Italian-TWITTIRO it Italian UD_Italian-TWITTIRO it Italian
...@@ -75,17 +76,18 @@ UD_Latin-UDante la Latin ...@@ -75,17 +76,18 @@ UD_Latin-UDante la Latin
UD_Latvian-LVTB lv Latvian UD_Latvian-LVTB lv Latvian
UD_Lithuanian-ALKSNIS lt Lithuanian * UD_Lithuanian-ALKSNIS lt Lithuanian *
UD_Lithuanian-HSE lt Lithuanian UD_Lithuanian-HSE lt Lithuanian
UD_Maghrebi_Arabic_French-Arabizi ? Maghrebi_Arabic_French
UD_Maltese-MUDT mt Maltese UD_Maltese-MUDT mt Maltese
UD_Marathi-UFAL mr Marathi UD_Marathi-UFAL mr Marathi
UD_Naija-NSC ? Naija UD_Naija-NSC ? Naija
UD_Norwegian-Bokmaal no Norwegian UD_Norwegian-Bokmaal no Norwegian
UD_Norwegian-Nynorsk nn Norwegian * UD_Norwegian-Nynorsk nn Norwegian *
UD_Norwegian-NynorskLIA nn Norwegian
UD_Old_Church_Slavonic-PROIEL ? Old_Church_Slavonic UD_Old_Church_Slavonic-PROIEL ? Old_Church_Slavonic
UD_Old_East_Slavic-Birchbark ? Old_East_Slavic UD_Old_East_Slavic-Birchbark ? Old_East_Slavic
UD_Old_East_Slavic-RNC ? Old_East_Slavic UD_Old_East_Slavic-RNC ? Old_East_Slavic
UD_Old_East_Slavic-Ruthenian ? Old_East_Slavic
UD_Old_East_Slavic-TOROT ? Old_East_Slavic * UD_Old_East_Slavic-TOROT ? Old_East_Slavic *
UD_Old_French-SRCMF ? Old_French UD_Old_French-PROFITEROLE ? Old_French
UD_Persian-PerDT fa Persian * UD_Persian-PerDT fa Persian *
UD_Persian-Seraji fa Persian UD_Persian-Seraji fa Persian
UD_Polish-LFG pl Polish UD_Polish-LFG pl Polish
...@@ -95,10 +97,12 @@ UD_Portuguese-Bosque pt Portuguese ...@@ -95,10 +97,12 @@ UD_Portuguese-Bosque pt Portuguese
UD_Portuguese-CINTIL pt Portuguese * UD_Portuguese-CINTIL pt Portuguese *
UD_Portuguese-GSD pt Portuguese UD_Portuguese-GSD pt Portuguese
UD_Portuguese-PetroGold pt Portuguese UD_Portuguese-PetroGold pt Portuguese
UD_Portuguese-Porttinari pt Portuguese
UD_Romanian-Nonstandard ro Romanian * UD_Romanian-Nonstandard ro Romanian *
UD_Romanian-RRT ro Romanian UD_Romanian-RRT ro Romanian
UD_Romanian-SiMoNERo ro Romanian UD_Romanian-SiMoNERo ro Romanian
UD_Russian-GSD ru Russian UD_Russian-GSD ru Russian
UD_Russian-Poetry ru Russian
UD_Russian-SynTagRus ru Russian * UD_Russian-SynTagRus ru Russian *
UD_Russian-Taiga ru Russian UD_Russian-Taiga ru Russian
UD_Scottish_Gaelic-ARCOSG gd Scottish_Gaelic UD_Scottish_Gaelic-ARCOSG gd Scottish_Gaelic
...@@ -109,7 +113,6 @@ UD_Spanish-AnCora es Spanish * ...@@ -109,7 +113,6 @@ UD_Spanish-AnCora es Spanish *
UD_Spanish-GSD es Spanish UD_Spanish-GSD es Spanish
UD_Swedish-LinES sv Swedish UD_Swedish-LinES sv Swedish
UD_Swedish-Talbanken sv Swedish * UD_Swedish-Talbanken sv Swedish *
UD_Swedish_Sign_Language-SSLC ? Swedish_Sign_Language
UD_Tamil-TTB ta Tamil UD_Tamil-TTB ta Tamil
UD_Telugu-MTG te Telugu UD_Telugu-MTG te Telugu
UD_Turkish-Atis tr Turkish UD_Turkish-Atis tr Turkish
......
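The format line at the top of languages.txt describes four whitespace-separated fields, with `#` marking comments or disabled treebanks, `?` standing in for a missing ISO code, and `*` marking the recommended corpus. A minimal sketch of a reader for that format follows; `LanguageEntry` and `parse_languages` are hypothetical names introduced for illustration and are not part of this commit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LanguageEntry:
    corpus: str              # e.g. 'UD_Dutch-Alpino'
    iso_code: Optional[str]  # None when the file has '?'
    language: str            # e.g. 'Dutch'
    recommended: bool        # True when the line ends with '*'


def parse_languages(text: str) -> List[LanguageEntry]:
    """Parse the content of a languages.txt-style file (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the format header and commented-out treebanks.
        if not line or line.startswith('#'):
            continue
        parts = line.split()
        if len(parts) < 3:
            continue  # malformed line; ignored in this sketch
        corpus, code, language = parts[:3]
        recommended = len(parts) > 3 and parts[3] == '*'
        entries.append(
            LanguageEntry(corpus, None if code == '?' else code, language, recommended)
        )
    return entries
```

Splitting on arbitrary whitespace mirrors the `line.split()` call in the training script, so the sketch works whether the real file uses spaces or tabs.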
""" """
Rough procedure to generate languages.txt from a UD folder. Includes all languages that have a test and dev portions. Rough procedure to generate languages.txt from a UD folder. Includes all languages that have a test and dev portions.
Uses the previous version of the file to translate language names to ISO codes. Selects the largest corpus as preferred Uses the previous version of the file to translate language names to ISO codes. Selects the largest corpus as preferred
for the language. May require manual adjustment to exclude abnormal treebanks (e.g. UD_Arabic-NYUAD) or add missing for the language. May require manual adjustment to exclude abnormal treebanks or add missing
ISO codes for new languages. ISO codes for new languages.
Manual changes in version 2.13:
- excluding UD_Arabic-NYUAD
- adding '?' as ISO code for the new languages outside the standard
- selected UD_German-GSD as default in place of UD_German-HDT, which lacks spacing information
- corrected language code for UD_Norwegian-Bokmaal from nn to no
""" """
from pathlib import Path from pathlib import Path
old_languages_txt = '' old_languages_txt = '/Users/piotr/projects/lambo/src/lambo/resources/languages.txt'
new_ud_treebanks = '' new_ud_treebanks = '/Users/piotr/data/ud-treebanks-v2.13'
codedict = {} codedict = {}
for line in open(old_languages_txt): for line in open(old_languages_txt):
...@@ -18,9 +23,9 @@ for line in open(old_languages_txt): ...@@ -18,9 +23,9 @@ for line in open(old_languages_txt):
code = parts[1] code = parts[1]
codedict[lang] = code codedict[lang] = code
ud11path = Path(new_ud_treebanks) udpath = Path(new_ud_treebanks)
subdirs = [x for x in ud11path.iterdir() if x.is_dir()] subdirs = [x for x in udpath.iterdir() if x.is_dir()]
subdirs.sort() subdirs.sort()
sizes = {} sizes = {}
......
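The remainder of the script is elided in this view. The procedure its docstring describes could be sketched roughly as below. This is not the author's code: `build_language_lines` and `_find_portion` are hypothetical names, the `*-ud-<portion>.conllu` file pattern follows the usual UD release layout, and measuring corpus size by the byte size of the training file is an assumption, as is the single-space separator.

```python
from pathlib import Path


def _find_portion(subdir, portion):
    """Return the first '*-ud-<portion>.conllu' file in subdir, or None."""
    matches = sorted(subdir.glob('*-ud-' + portion + '.conllu'))
    return matches[0] if matches else None


def build_language_lines(ud_root, codedict):
    """Build languages.txt-style lines from a UD release folder (sketch)."""
    by_language = {}
    for subdir in sorted(p for p in Path(ud_root).iterdir() if p.is_dir()):
        name = subdir.name
        if not name.startswith('UD_') or '-' not in name:
            continue
        # Keep only treebanks that have both a dev and a test portion.
        if _find_portion(subdir, 'dev') is None or _find_portion(subdir, 'test') is None:
            continue
        language = name[len('UD_'):].split('-', 1)[0]
        train = _find_portion(subdir, 'train')
        # Assumption: corpus size is the byte size of the training file.
        size = train.stat().st_size if train is not None else 0
        by_language.setdefault(language, []).append((name, size))

    lines = []
    for language, corpora in by_language.items():
        largest = max(corpora, key=lambda c: c[1])[0]
        code = codedict.get(language, '?')  # '?' when no ISO code is known
        for name, _ in corpora:
            # Mark a preferred corpus only when there is a choice to make.
            mark = ' *' if len(corpora) > 1 and name == largest else ''
            lines.append(name + ' ' + code + ' ' + language + mark)
    return lines
```

Output produced this way would still need the manual adjustments listed in the docstring, such as commenting out UD_Arabic-NYUAD or overriding the default German corpus.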