# cclutils

A convenient API based on the Corpus2 library for reading, writing, and processing
textual corpora represented as CCL (XML) documents.

###### Requirements
Python 3.6 and Corpus2.

To install Corpus2, first add a new APT source:

```bash
wget -q -O - http://apt.clarin-pl.eu/KEY.gpg | apt-key add -
echo 'deb https://apt.clarin-pl.eu/ /' > /etc/apt/sources.list.d/clarin.list

apt-get update && apt-get install corpus2-python3.6
```

It is also possible to use a Docker image:

```
FROM clarinpl/python:3.6

RUN apt-get update && apt-get install -y \
    corpus2-python3.6

RUN pip install --upgrade pip && pip install cclutils
```

Install
=======

```bash
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
```

IO
======

###### Read CCL file

```python
import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)
```

###### Read CCL with relations (REL file):

```python
import cclutils

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
```

###### Specify tagset

```python
document = cclutils.read(cclpath, relpath, 'nkjp')
```

###### Write CCL

```python
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
```

or with relations:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
```

specify the tagset:
```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```

###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```

Document structure
==================
The CCL format specifies a basic segmentation structure: paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<tok>```). To iterate over the document we
can use dedicated API functions:

```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```

We can also create a generator for iterating only the tokens in a more Pythonic way:

```python
document = cclutils.read('./example.xml')

# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
    for sentence in paragraph.sentences()
    for token in sentence.tokens())
    
for token in tokens:
    ...
```
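
When the flattened token stream is needed in several places, the generator
expression can be wrapped in a small helper. The sketch below is only an
illustration: `iter_tokens` is a hypothetical name, not part of `cclutils`, and
the `Stub*` classes are stand-ins that mimic the `paragraphs()`, `sentences()`,
and `tokens()` methods so that the snippet runs without Corpus2.

```python
from itertools import chain


def iter_tokens(document):
    """Return a lazy iterator over all tokens in reading order (hypothetical helper)."""
    sentences = chain.from_iterable(p.sentences() for p in document.paragraphs())
    return chain.from_iterable(s.tokens() for s in sentences)


# Stand-in classes (assumptions) exposing the same three methods as above.
class StubSentence:
    def __init__(self, tokens):
        self._tokens = tokens

    def tokens(self):
        return self._tokens


class StubParagraph:
    def __init__(self, sentences):
        self._sentences = sentences

    def sentences(self):
        return self._sentences


class StubDocument:
    def __init__(self, paragraphs):
        self._paragraphs = paragraphs

    def paragraphs(self):
        return self._paragraphs


doc = StubDocument([
    StubParagraph([StubSentence(['Ala', 'ma', 'kota']), StubSentence(['.'])]),
    StubParagraph([StubSentence(['Koniec'])]),
])
flat_tokens = list(iter_tokens(doc))
```

With a real document, the same helper should accept the object returned by
`cclutils.read`, provided it exposes the three methods used above.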

To avoid loading large CCL documents entirely into RAM (as DOM parsers do), we can
read them iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):

```python
it = read_chunks_it(ccl_path)
for paragraph in it:
    pass
    
it = read_sentences_it(ccl_path)
for sentence in it:
    pass
```
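
The streaming idea can be illustrated with the standard library alone. The
sketch below is not how `read_sentences_it` is implemented; it is a minimal
stand-alone analogue that uses `xml.etree.ElementTree.iterparse` on a tiny
hand-written CCL-like document. The element names (`chunk`, `sentence`, `tok`,
`orth`), the sample text, and the name `stream_sentences` are assumptions made
for this example, and it yields plain lists of word forms rather than Corpus2
sentence objects.

```python
import io
import xml.etree.ElementTree as ET

# A tiny CCL-like document, hand-written for this example.
SAMPLE = b"""<chunkList>
  <chunk id="ch1">
    <sentence id="s1"><tok><orth>Ala</orth></tok><tok><orth>ma</orth></tok></sentence>
    <sentence id="s2"><tok><orth>kota</orth></tok></sentence>
  </chunk>
</chunkList>"""


def stream_sentences(source):
    """Yield (sentence id, list of orths) one sentence at a time."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'sentence':
            yield elem.get('id'), [orth.text for orth in elem.iter('orth')]
            # Drop the children of the finished sentence to keep memory low.
            elem.clear()


streamed = list(stream_sentences(io.BytesIO(SAMPLE)))
```

Only one sentence subtree is kept fully populated at a time, which is the
property that makes iterative reading attractive for large files.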

Token manipulation
==================

1. Get Part-of-Speech (simple, returns the complete `<ctag>`)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'
```

2. Get Part-of-Speech (`main_only`, returns only the main part of `<ctag>`)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
```
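
The relation between the two results can be read off the tag string itself: in
the outputs above, the complete tag is a colon-separated sequence whose first
field is the main part. The helper below is a hypothetical, string-only
illustration (`split_ctag` is not a `cclutils` function) and does not consult a
tagset the way `get_pos` does.

```python
def split_ctag(ctag):
    """Split a colon-separated tag into (main part, list of remaining fields)."""
    main, *attributes = ctag.split(':')
    return main, attributes


main_part, attributes = split_ctag('subst:pl:inst:f')
```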

3. Get coarse-grained PoS (NKJP only for now)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset)
'noun'
```

4. Convert to coarse-grained PoS (NKJP only for now)

```python
>>> convert_to_coarse_pos('subst')
'noun'
```

5. Get token lemma

```python
>>> get_lexeme_lemma(token)
'samolot'
```

6. Check if a token is preceded by whitespace. Add or remove a whitespace.

```python
>>> token.after_space()
True
>>> token.set_wa(False)
>>> token.after_space()
False
```

Sentence manipulation
=====================

1. Print the sentences of a given document

```python
document = cclutils.read('./example.xml')

sentences = (sentence for paragraph in document.paragraphs()
    for sentence in paragraph.sentences())
    
for sentence in sentences:
    print(cclutils.sentence2str(sentence))
```
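
To make the role of the whitespace flag concrete, the sketch below rebuilds a
sentence string from `(orth, after_space)` pairs, where `after_space` is `True`
when the token is preceded by whitespace. It is an illustration under stated
assumptions, not the implementation of `cclutils.sentence2str`: `join_tokens`
is a hypothetical name and plain tuples stand in for Corpus2 tokens.

```python
def join_tokens(tokens):
    """Concatenate orths, inserting a space before tokens flagged after_space."""
    parts = []
    for index, (orth, after_space) in enumerate(tokens):
        # Never emit a leading space before the first token.
        if after_space and index > 0:
            parts.append(' ')
        parts.append(orth)
    return ''.join(parts)


sample = [('Hotel', False), ('ze', True), ('śniadaniem', True), ('.', False)]
text = join_tokens(sample)
```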

Reading annotations
===================
Annotations can be extracted from a CCL document with the
``cclutils.extras.annotations`` module, which is built on top of the core
``cclutils`` functionality.

The main function of this module is ``get_document_annotations``, which reads
annotations from a CCL document (from a file or a ``corpus2.DocumentPtr`` object).

```python
from cclutils.extras.annotations import get_document_annotations
```

The annotations are organized using two classes:
1. ``AnnotatedExpression``: represents a single annotation (annotated expression)
    located in a specific paragraph and sentence. The module supports annotations
    describing both single words and multiword expressions (more than one token).
1. ``DocumentAnnotations``: keeps the annotations of an entire document and provides
   methods that facilitate gathering and accessing them.


#### Read annotations of a given document
1. Read all annotations
    ```python
    >>> anns = get_document_annotations(cclutils.read('tests/data/ccl02.xml'))
    >>> anns
    <DocumentAnnotations for 10 annotated expressions: [<AnnotatedExpression for
        annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla',
        'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>, <AnnotatedExpression for
        annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa',
        'osoba') at position: ch1>s1>t1,t2,t3>, <AnnotatedExpression for annotation
        'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>,
        <AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)';
        ('hotel',) at position: ch2>s2>t0>, <AnnotatedExpression for annotation
        'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
        <AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)';
        ('śniadanie',) at position: ch2>s2>t3>, <AnnotatedExpression for annotation
        'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>,
        <AnnotatedExpression for annotation 'designation': 'designation:('dla',
        'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>, <AnnotatedExpression
        for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position:
        ch2>s2>t13>, <AnnotatedExpression for annotation 'food': 'food:('pełnym',
        'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]>
    ```

1. Read only specified annotations
    ```python
    >>> anns = get_document_annotations(cclutils.read('tests/data/ccl02.xml'), annotations={'designation'})
    >>> anns
    <DocumentAnnotations for 2 annotated expressions: [<AnnotatedExpression for
        annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla',
        'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>, <AnnotatedExpression for
        annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko')
        at position: ch2>s2>t10,t11>]>
    ```

#### Get annotations in one of the preferred forms
1. Get the annotation index, which contains full information about the annotations
    * the key is a tuple containing the following values: (annotation channel name,
        sentence id, paragraph id, channel numeric value)
    ```python
    >>> anns.expressions_index
    defaultdict(list,
                {('designation',
                's1',
                'ch1',
                1): <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
                ('room_type',
                's1',
                'ch1',
                1): <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
                ('region',
                's1',
                'ch1',
                1): <AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>,
                ('attraction',
                's2',
                'ch2',
                1): <AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
                ('hotel_name',
                's2',
                'ch2',
                1): <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
                ('food',
                's2',
                'ch2',
                1): <AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>,
                ('room_type',
                's2',
                'ch2',
                1): <AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>,
                ('designation',
                's2',
                'ch2',
                1): <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>,
                ('attraction',
                's2',
                'ch2',
                2): <AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>,
                ('food',
                's2',
                'ch2',
                2): <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>})

    ```
1. Get annotations grouped by annotation channel name, in one of the following formats:
    * annotation object
    * orths
    * preferred lexemes
    * annotation base lemma
    ```python
    >>> anns.group_by_chan_name()
    defaultdict(list,
                {'designation': [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
                <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>],
                'room_type': [<AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
                <AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>],
                'region': [<AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>],
                'attraction': [<AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
                <AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>],
                'hotel_name': [<AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>],
                'food': [<AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>,
                <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]})

    >>> anns.group_by_chan_name(as_orths=True)
    defaultdict(list,
                {'designation': [('dla', 'dwóch', 'osób'), ('dla', 'dzieci')],
                'room_type': [('dla', 'dwóch', 'osób'), ('łazienką',)],
                'region': [('Gdańsk',)],
                'attraction': [('Hotel',), ('spa',)],
                'hotel_name': [('Hotel',)],
                'food': [('śniadaniem',), ('pełnym', 'wyżywieniem')]})

    >>> anns.group_by_chan_name(as_lexemes=True)
    defaultdict(list,
                {'designation': [('dla', 'dwa', 'osoba'), ('dla', 'dziecko')],
                'room_type': [('dla', 'dwa', 'osoba'), ('łazienka',)],
                'region': [('Gdańsk',)],
                'attraction': [('hotel',), ('spa',)],
                'hotel_name': [('hotel',)],
                'food': [('śniadanie',), ('pełny', 'wyżywienie')]})

    >>> anns.group_by_chan_name(as_ann_base=True)
    defaultdict(list,
                {'designation': ['dla dwóch osób', 'dla dziecka'],
                'room_type': ['dla dwóch osób', 'łazienka'],
                'region': [''],
                'attraction': ['hotel', 'spa'],
                'hotel_name': ['Hotel'],
                'food': ['śniadanie', 'pełne wyżywienie']})
    ```

1. Get annotations grouped by token (token position), in one of the following
formats (usage is the same as for the ``group_by_chan_name`` method):
    * annotation object
    * orths
    * preferred lexemes
    * annotation base lemma
    ```python
    >>> anns.group_by_token()
    {(1,
    's1',
    'ch1'): [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>, <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>],
    (2,
    's1',
    'ch1'): [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>, <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>],
    (3,
    's1',
    'ch1'): [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>, <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>],
    (4,
    's1',
    'ch1'): [<AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>],
    (0,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>, <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>],
    (3,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>],
    (7,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>],
    (10,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>],
    (11,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>],
    (13,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>],
    (17,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>],
    (18,
    's2',
    'ch2'): [<AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]}
    ```

1. Get annotations grouped by token, preserving the original document order
   (token order):
    ```python
    >>> anns.group_by_token(retain_order=True)
    OrderedDict([((1, 's1', 'ch1'),
                [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
                <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>]),
                ((2, 's1', 'ch1'),
                [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
                <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>]),
                ((3, 's1', 'ch1'),
                [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
                <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>]),
                ((4, 's1', 'ch1'),
                [<AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>]),
                ((0, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
                <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>]),
                ((3, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>]),
                ((7, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>]),
                ((10, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>]),
                ((11, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>]),
                ((13, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>]),
                ((17, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]),
                ((18, 's2', 'ch2'),
                [<AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>])])
    ```
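
The grouped views above are all re-indexings of the same list of annotated
expressions. The sketch below shows that idea with plain tuples instead of
``AnnotatedExpression`` objects. The `Ann` record and its field names are
assumptions made for this example, and the data is copied from three of the
expressions shown above; it does not reproduce how ``DocumentAnnotations``
builds its indexes.

```python
from collections import defaultdict, namedtuple

# Hypothetical record: channel name, paragraph id, sentence id, token indices, orths.
Ann = namedtuple('Ann', 'chan par sent tokens orths')

annotations = [
    Ann('designation', 'ch1', 's1', (1, 2, 3), ('dla', 'dwóch', 'osób')),
    Ann('room_type', 'ch1', 's1', (1, 2, 3), ('dla', 'dwóch', 'osób')),
    Ann('region', 'ch1', 's1', (4,), ('Gdańsk',)),
]

# Group orths by channel name.
by_chan = defaultdict(list)
for ann in annotations:
    by_chan[ann.chan].append(ann.orths)

# Group channel names by token position (token index, sentence id, paragraph id).
by_token = defaultdict(list)
for ann in annotations:
    for index in ann.tokens:
        by_token[(index, ann.sent, ann.par)].append(ann.chan)
```

A multiword expression appears once per covered token in the token-based view,
which matches the repeated entries in the ``group_by_token`` output.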

#### Get token by token position
1. When using the methods above, you may want to get the ``corpus2.Token`` object
    referenced by a position:
    ```python
    >>> anns.token_by_position_index[(17, 's2', 'ch2')]
    <corpus2.Token; proxy of <Swig Object of type 'Corpus2::Token *' at 0x7f71edfced80> >
    ```