# cclutils

A convenient API built on the Corpus2 library for reading, writing, and processing
textual corpora represented as CCL (XML) documents.

###### Requirements
python3.6, corpus2

To install Corpus2, first add a new APT source:

```bash
wget -q -O - http://apt.clarin-pl.eu/KEY.gpg | apt-key add -
echo 'deb https://apt.clarin-pl.eu/ /' > /etc/apt/sources.list.d/clarin.list

apt-get update && apt-get install corpus2-python3.6
```

Install
=======

```bash
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
```

IO
======

###### Read CCL file

```python
import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)
```

###### Read CCL with relations (REL file):

```python
import cclutils

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
```

###### Specify tagset

```python
document = cclutils.read(cclpath, relpath, 'nkjp')
```

###### Write CCL

```python
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
```

Or with relations:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
```

Or with an explicit tagset:
```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```

###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```

Document structure
==================
The CCL format defines a basic segmentation structure: paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<token>```). To iterate over a document we
can use the corresponding accessor methods:

```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```

We can also create a generator for iterating only the tokens in a more Pythonic way:

```python
document = cclutils.read('./example.xml')

# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
    for sentence in paragraph.sentences()
    for token in sentence.tokens())
    
for token in tokens:
    ...
```

To avoid loading large CCL documents into RAM in full (as DOM parsers do), we can read them
iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):

```python
it = read_chunks_it(ccl_path)
for paragraph in it:
    pass
    
it = read_sentences_it(ccl_path)
for sentence in it:
    pass
```
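When a downstream step works on groups of sentences, the stream can be consumed in fixed-size batches without holding the whole document in memory. This is a minimal sketch: `batched` is a hypothetical helper (not a cclutils function), and `fake_sentences` is a stand-in for the iterator that `read_sentences_it(ccl_path)` would return.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, pulling lazily from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for a streaming reader; with a real file one would pass the
# iterator returned by read_sentences_it(ccl_path) instead (assumption).
fake_sentences = (f"sentence-{i}" for i in range(5))
batches = list(batched(fake_sentences, 2))
```

Only one batch is alive at a time when the result is consumed in a `for` loop, so peak memory depends on the batch size and not on the document size.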

Token manipulation
==================

1. Get Part-of-Speech (simple, returns the complete ```<ctag>```)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'

```

2. Get Part-of-Speech (main_only, returns only the main part of the ```<ctag>```)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
```

3. Get coarse-grained PoS (NKJP only for now)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset)
'noun'
```

4. Convert to coarse-grained PoS (NKJP only for now)

```python
>>> convert_to_coarse_pos('subst')
'noun'
```

5. Get token lemma

```python
>>> get_lexeme_lemma(token)
'samolot'
```

6. Check whether a token is preceded by whitespace, and set or clear that whitespace.

```python
>>> token.after_space()
True
>>> token.set_wa(False)
>>> token.after_space()
False
```
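To make the relation between the complete tag, its main part, and the coarse-grained class in items 1 to 4 concrete, here is a pure-Python sketch that works on tag strings alone. It is not the cclutils implementation: `main_pos`, `coarse_pos`, and `COARSE_POS` are hypothetical names, and apart from `'subst'` mapping to `'noun'` (taken from the examples above) the mapping entries are assumptions chosen for illustration.

```python
from collections import Counter

# Illustrative subset only; NOT the mapping shipped with cclutils.
# Only 'subst' -> 'noun' comes from the examples above, the rest is assumed.
COARSE_POS = {
    "subst": "noun",
    "adj": "adj",
    "adv": "adv",
    "fin": "verb",
}


def main_pos(ctag: str) -> str:
    """Return the main part of a tag, i.e. everything before the first ':'."""
    return ctag.split(":", 1)[0]


def coarse_pos(ctag: str, default: str = "other") -> str:
    """Map a complete tag to a coarse class, falling back to `default`."""
    return COARSE_POS.get(main_pos(ctag), default)


ctags = ["subst:pl:inst:f", "adj:sg:nom:m1:pos", "fin:sg:ter:imperf", "interp"]
coarse_counts = Counter(coarse_pos(ctag) for ctag in ctags)
```

Here `main_pos('subst:pl:inst:f')` gives `'subst'`, matching the `main_only=True` output in item 2, and tags outside the illustrative mapping fall into `'other'`.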