# cclutils

A convenient API based on the Corpus2 library for reading, writing, and processing
textual corpora represented as CCL (XML) documents.

###### Requirements
Python 3.6, Corpus2

To install Corpus2, you have to add a new APT source:

```bash
wget -q -O - http://apt.clarin-pl.eu/KEY.gpg | apt-key add -
echo 'deb https://apt.clarin-pl.eu/ /' > /etc/apt/sources.list.d/clarin.list

apt-get update && apt-get install corpus2-python3.6
```

It is also possible to use Docker:

```dockerfile
FROM clarinpl/python:3.6

RUN apt-get update && apt-get install -y \
    corpus2-python3.6

RUN pip install --upgrade pip && pip install cclutils
```

Install
=======

```bash
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
```
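After installation, it can help to confirm that both dependencies are visible to the interpreter. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and it assumes the Corpus2 bindings are importable under the module name `corpus2`.

```python
import importlib.util


def missing_modules(names=("corpus2", "cclutils")):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("corpus2 and cclutils are importable")
```

The check only looks the modules up; it does not import them, so it will not catch a broken native build of Corpus2.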

IO
======

###### Read CCL file

```python
import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)
```

###### Read CCL with relations (REL file):

```python

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
```

###### Specify tagset

```python
document = cclutils.read(cclpath, relpath, 'nkjp')
```
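The examples above pair `example.xml` with `example.rel.xml`. If your files follow the same naming pattern, the REL path can be derived from the CCL path. This is only a sketch: `rel_path_for` is a hypothetical helper, not part of cclutils, and the naming convention is an assumption taken from the examples.

```python
from pathlib import Path


def rel_path_for(ccl_path):
    """Derive 'name.rel.xml' from 'name.xml' (convention assumed from the examples)."""
    path = Path(ccl_path)
    if path.suffix != ".xml":
        raise ValueError(f"expected an .xml file, got: {ccl_path}")
    # Replace the final '.xml' suffix with '.rel.xml'.
    return path.with_suffix(".rel.xml")


cclpath = Path("./example.xml")
relpath = rel_path_for(cclpath)
# The pair could then be passed on, e.g. cclutils.read(str(cclpath), str(relpath)).
```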

###### Write CCL

```python
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
```

Or with relations:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
```

Or specify the tagset as well:
```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```

###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```

Document structure
==================
The CCL format specifies a basic segmentation structure, mainly paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<tok>```). To iterate over the document, we
can use the dedicated API functions:

```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```

We can also create a generator for iterating only the tokens in a more Pythonic way:

```python
document = cclutils.read('./example.xml')

# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
    for sentence in paragraph.sentences()
    for token in sentence.tokens())
    
for token in tokens:
    ...
```
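If the same flattening is needed in several places, the generator expression can be wrapped in a small function. The sketch below relies only on the `paragraphs()`, `sentences()`, and `tokens()` methods shown above; `iter_tokens` and the `Node` stand-in are hypothetical names used so that the example runs without Corpus2.

```python
def iter_tokens(document):
    """Yield every token of `document` in reading order."""
    for paragraph in document.paragraphs():
        for sentence in paragraph.sentences():
            yield from sentence.tokens()


class Node:
    """Hypothetical stand-in for a document, paragraph, or sentence (demo only)."""

    def __init__(self, children):
        self._children = list(children)

    def paragraphs(self):
        return iter(self._children)

    # In this stand-in every level simply iterates over its children.
    sentences = paragraphs
    tokens = paragraphs


# Two paragraphs: the first holds two sentences, the second holds one.
demo = Node([Node([Node(["Ala", "ma"]), Node(["kota"])]), Node([Node(["."])])])
print(list(iter_tokens(demo)))  # ['Ala', 'ma', 'kota', '.']
```

With a real document, `iter_tokens(cclutils.read('./example.xml'))` would be the equivalent call.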

To avoid loading large CCL documents into RAM in full (as DOM parsers do), we can
read them iteratively, chunk by chunk or sentence by sentence (SAX-based approach):

```python
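# Assumption: both readers are exposed at the top level of the cclutils package.
from cclutils import read_chunks_it, read_sentences_it

ccl_path = './example.xml'

# Stream the document one paragraph (chunk) at a time.
for paragraph in read_chunks_it(ccl_path):
    pass

# Stream the document one sentence at a time.
for sentence in read_sentences_it(ccl_path):
    pass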
```
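To illustrate the streaming idea independently of cclutils, the sketch below reads sentences incrementally with the standard library's `xml.etree.ElementTree.iterparse`. It is not how the readers above are implemented. The sample document is simplified, and it assumes that tokens are serialized as `<tok>` elements with an `<orth>` child; `stream_sentences` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET

# Simplified sample; real CCL tokens also carry lexeme information.
SAMPLE_CCL = """<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok><orth>Ala</orth></tok>
   <tok><orth>ma</orth></tok>
   <tok><orth>kota</orth></tok>
  </sentence>
  <sentence id="s2">
   <tok><orth>Kot</orth></tok>
   <tok><orth>mruczy</orth></tok>
  </sentence>
 </chunk>
</chunkList>"""


def stream_sentences(source):
    """Yield (sentence id, list of token orths), one sentence at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sentence":
            yield elem.get("id"), [tok.findtext("orth") for tok in elem.iter("tok")]
            # Drop the children of the processed sentence to limit memory use.
            elem.clear()


for sentence_id, orths in stream_sentences(io.BytesIO(SAMPLE_CCL.encode("utf-8"))):
    print(sentence_id, orths)
```

Each sentence is handed to the caller as soon as its closing tag is parsed, so only one sentence's tokens are held at a time.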

Token manipulation
==================

1. Get Part-of-Speech (simple, returns the complete `<ctag>`)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'

```

2. Get Part-of-Speech (main_only, returns only the main part of the `<ctag>`)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
```

3. Get coarse-grained PoS (NKJP only for now)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset)
'noun'
```

4. Convert to coarse-grained PoS (NKJP only for now)

```python
>>> convert_to_coarse_pos('subst')
'noun'
```

5. Get token lemma

```python
>>> get_lexeme_lemma(token)
'samolot'
```

6. Check whether a token is preceded by whitespace, and add or remove that whitespace.

```python
>>> token.after_space()
True
>>> token.set_wa(False)
>>> token.after_space()
False
```
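The outputs in items 1 and 2 suggest that a tag is a colon-separated string whose first field is the main part. The sketch below shows that relationship on plain strings; `split_ctag` is a hypothetical helper, not a cclutils function, and it does not consult a tagset.

```python
def split_ctag(ctag):
    """Split a colon-separated tag into (main part, list of remaining attributes)."""
    main, *attributes = ctag.split(":")
    return main, attributes


main, attributes = split_ctag("subst:pl:inst:f")
print(main)        # subst
print(attributes)  # ['pl', 'inst', 'f']
```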

Sentence manipulation
=====================

1. Print the sentences of a given document

```python
document = cclutils.read('./example.xml')

sentences = (sentence for paragraph in document.paragraphs()
    for sentence in paragraph.sentences())
    
for sentence in sentences:
    print(cclutils.sentence2str(sentence))
```
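Conceptually, turning a sentence into text means joining token forms while honouring the whitespace flag described under token manipulation. The sketch below shows that idea on plain `(orth, after_space)` pairs. It is an assumption about the general approach, not the implementation of `sentence2str`, and `join_tokens` is a hypothetical name.

```python
def join_tokens(tokens):
    """Build a string from (orth, after_space) pairs; no space before the first token."""
    parts = []
    for index, (orth, after_space) in enumerate(tokens):
        # Insert a space only when the token is marked as preceded by whitespace.
        if after_space and index > 0:
            parts.append(" ")
        parts.append(orth)
    return "".join(parts)


pairs = [("Ala", True), ("ma", True), ("kota", True), (".", False)]
print(join_tokens(pairs))  # Ala ma kota.
```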