# cclutils

A convenient API, built on the Corpus2 library, for reading, writing, and processing
textual corpora represented as CCL (XML) documents.

###### Requirements
Python 3.6 and Corpus2.

To install Corpus2, first add the CLARIN-PL repository as a new APT source (the commands below need root privileges):

```bash
wget -q -O - http://apt.clarin-pl.eu/KEY.gpg | apt-key add -
echo 'deb https://apt.clarin-pl.eu/ /' > /etc/apt/sources.list.d/clarin.list

apt-get update && apt-get install corpus2-python3.6
```

Install
=======

```bash
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
```

IO
======

###### Read CCL file

```python
import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)
```

###### Read CCL with relations (REL file):

```python
import cclutils

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
```
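
The examples in this section pair `example.xml` with `example.rel.xml` and `out.xml` with `out.rel.xml`. If your files follow the same naming pattern, the REL path can be derived from the CCL path. Here is a minimal sketch using only the standard library; `rel_path_for` is a hypothetical helper and not part of cclutils:

```python
from pathlib import Path


def rel_path_for(ccl_path):
    """Return the sibling REL path, e.g. doc.xml -> doc.rel.xml."""
    # Assumption: the REL file sits next to the CCL file and only the
    # final suffix differs (.xml becomes .rel.xml).
    return Path(ccl_path).with_suffix('.rel.xml')


relpath = rel_path_for('./example.xml')
```

The result is a `Path` object; convert it with `str(relpath)` if a plain string is needed.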

###### Specify tagset

```python
document = cclutils.read(cclpath, relpath, 'nkjp')
```

###### Write CCL

```python
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
```

Or with relations:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
```

Or specify the tagset as well:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```

###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```

Document structure
==================
The CCL format specifies a basic segmentation structure: paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<token>```). To iterate over a document we
can use dedicated API functions:

```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```

We can also create a generator for iterating only the tokens in a more Pythonic way:

```python
document = cclutils.read('./example.xml')

# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
    for sentence in paragraph.sentences()
    for token in sentence.tokens())
```
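
The same traversal can be wrapped in a reusable generator function. The sketch below assumes only the `paragraphs()`, `sentences()`, and `tokens()` methods shown above; `iter_tokens` and `FakeNode` are hypothetical names, and `FakeNode` is a stand-in so the example runs without Corpus2 installed:

```python
def iter_tokens(document):
    """Yield every token of a document, in reading order."""
    for paragraph in document.paragraphs():
        for sentence in paragraph.sentences():
            yield from sentence.tokens()


class FakeNode:
    """Hypothetical stand-in for a document, paragraph, or sentence."""

    def __init__(self, children):
        self._children = list(children)

    def paragraphs(self):
        return iter(self._children)

    # In this stand-in, every level exposes its children the same way.
    sentences = paragraphs
    tokens = paragraphs


# Two paragraphs: one with two sentences, one with a single sentence.
fake_document = FakeNode([
    FakeNode([FakeNode(["Ala", "ma"]), FakeNode(["kota"])]),
    FakeNode([FakeNode(["Koniec"])]),
])

# Count tokens lazily, without building an intermediate list.
token_count = sum(1 for _ in iter_tokens(fake_document))
```

With a real document, `iter_tokens(cclutils.read('./example.xml'))` would be used in the same way.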

To avoid loading a large CCL document into RAM all at once (as DOM parsers do), we can
read it iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):

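```python
# Assumption: both iterators are exported at the package top level.
from cclutils import read_chunks_it, read_sentences_it

ccl_path = './example.xml'

# Stream the document one paragraph (chunk) at a time.
for paragraph in read_chunks_it(ccl_path):
    pass

# Or stream it one sentence at a time.
for sentence in read_sentences_it(ccl_path):
    pass
```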
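
To illustrate why streaming keeps memory usage low, here is a standard-library sketch of the same idea using `xml.etree.ElementTree.iterparse`. It is not how cclutils is implemented and it yields plain strings rather than Corpus2 objects. The sample markup (`<chunkList>`, `<tok>`, `<orth>`) is an assumption about the usual CCL layout, and `iter_sentence_words` is a hypothetical name:

```python
import io
import xml.etree.ElementTree as ET

# Tiny in-memory sample; the element names are assumed, see above.
SAMPLE = b"""<chunkList>
 <chunk id="ch1">
  <sentence id="s1">
   <tok><orth>Ala</orth></tok>
   <tok><orth>ma</orth></tok>
  </sentence>
  <sentence id="s2">
   <tok><orth>kota</orth></tok>
  </sentence>
 </chunk>
</chunkList>"""


def iter_sentence_words(source):
    """Yield the word forms of one sentence at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sentence":
            yield [orth.text for orth in elem.iter("orth")]
            # Drop the sentence's children once it has been consumed.
            elem.clear()


sentences = list(iter_sentence_words(io.BytesIO(SAMPLE)))
```

Each sentence is handed to the caller as soon as its closing tag is parsed, and its children are cleared afterwards, so the memory held for token data stays roughly constant.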

Token manipulation
==================

1. Get the part of speech (simple; returns the complete ```<ctag>```)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'
```

2. Get the part of speech (`main_only`; returns only the main part of ```<ctag>```)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
```
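
The two outputs above suggest how the values relate: the main part is the first colon-separated field of the full tag. The sketch below only illustrates that relationship on plain strings. It is not how `get_pos` is implemented, and `main_part` is a hypothetical name:

```python
def main_part(ctag):
    """Return the first colon-separated field of a tag string."""
    # Assumption: tags look like 'subst:pl:inst:f', main part first.
    return ctag.split(':', 1)[0]


full_tag = 'subst:pl:inst:f'
main = main_part(full_tag)
```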

3. Get coarse-grained PoS (NKJP only for now)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset)
'noun'
```

4. Convert to coarse-grained PoS (NKJP only for now)

```python
>>> convert_to_coarse_pos('subst')
'noun'
```

5. Get token lemma

```python
>>> get_lexeme_lemma(token)
'samolot'
```
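
A common next step is to count how often each lemma occurs. The sketch below takes the lemma function as a parameter, so `get_lexeme_lemma` could be passed in for real tokens. `lemma_frequencies` is a hypothetical helper, and the demonstration uses plain strings with `str.lower` as a stand-in so it runs without Corpus2:

```python
from collections import Counter


def lemma_frequencies(tokens, get_lemma):
    """Count lemmas over an iterable of tokens."""
    return Counter(get_lemma(token) for token in tokens)


# Stand-in data: strings instead of tokens, lowercasing instead of
# real lemmatization.
counts = lemma_frequencies(['Samolot', 'samolot', 'lot'], str.lower)
most_common_lemma, frequency = counts.most_common(1)[0]
```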