# cclutils

A convenient API, based on the Corpus2 library, for reading, writing, and processing
textual corpora represented as CCL (XML) documents.

Install
=======

```bash
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
```

IO
======

###### Read CCL file

```python
import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)
```

###### Read CCL with relations (REL file):

```python

import cclutils

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
```
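
When a directory holds many documents, pairing each CCL file with its relation file by name keeps the `read` calls uniform. The following is a minimal sketch, assuming the `name.xml` / `name.rel.xml` naming used in the example above. `pair_ccl_with_rel` is a hypothetical helper and not part of cclutils:

```python
from pathlib import Path


def pair_ccl_with_rel(directory):
    """Yield (ccl_path, rel_path_or_None) for every CCL file in `directory`.

    Assumes relation files are named `<name>.rel.xml` next to `<name>.xml`.
    """
    for ccl_path in sorted(Path(directory).glob('*.xml')):
        if ccl_path.name.endswith('.rel.xml'):
            continue  # relation files are attached to their CCL file below
        rel_path = ccl_path.with_name(ccl_path.stem + '.rel.xml')
        yield ccl_path, (rel_path if rel_path.exists() else None)
```

Each pair can then be handed to `cclutils.read`, for example `cclutils.read(str(ccl_path), str(rel_path))` when a relation file was found and `cclutils.read(str(ccl_path))` otherwise.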

###### Specify tagset

```python
document = cclutils.read(cclpath, relpath, 'nkjp')
```

###### Write CCL

```python
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
```

or with relations:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
```

specify the tagset:
```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```

###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```

Document structure
==================
The CCL format specifies a basic segmentation structure: paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<token>```). To iterate over the document, we
can use dedicated API functions:

```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```

We can also create a generator for iterating only the tokens in a more Pythonic way:

```python
document = cclutils.read('./example.xml')

# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
    for sentence in paragraph.sentences()
    for token in sentence.tokens())
```

To avoid loading large CCL documents into RAM (as DOM parsers do), we can read them
iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):

```python
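# Assumes read_chunks_it and read_sentences_it have been imported from cclutils.
ccl_path = './example.xml'

for paragraph in read_chunks_it(ccl_path):
    ...  # process one paragraph at a time

for sentence in read_sentences_it(ccl_path):
    ...  # process one sentence at a time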
```
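
To illustrate why streaming keeps memory usage low, here is a sketch of the general technique using only the standard library's `xml.etree.ElementTree.iterparse`. It is not how cclutils implements `read_sentences_it`. The `iter_elements` helper and the simplified sample XML are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each complete `tag` element from an XML stream, one at a time."""
    for _event, element in ET.iterparse(source, events=('end',)):
        if element.tag == tag:
            yield element
            # Drop the element's content once the caller is done with it,
            # so already-processed elements do not accumulate in memory.
            element.clear()


# Hypothetical, heavily simplified document for illustration.
sample = io.BytesIO(
    b'<doc><chunk id="ch1">'
    b'<sentence id="s1"/><sentence id="s2"/>'
    b'</chunk></doc>'
)
sentence_ids = [s.get('id') for s in iter_elements(sample, 'sentence')]
print(sentence_ids)  # ['s1', 's2']
```

Because each element is cleared right after it is consumed, any data needed later has to be extracted inside the loop, as the list comprehension does here.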

Token manipulation
==================

1. Get the part of speech (simple, returns the complete `<ctag>`)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'
```

2. Get the part of speech (`main_only`, returns only the main part of `<ctag>`)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
```

3. Get coarse-grained PoS (NKJP only for now)

```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset, main_only=True)
'noun'
```

4. Convert to coarse-grained PoS (NKJP only for now)

```python
>>> convert_to_coarse_pos('subst')
'noun'
```

5. Get token lemma

```python
>>> get_lexeme_lemma(token)
'samolot'
```
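
The difference between the complete tag and its main part (items 1 and 2) comes down to the colon-separated layout of the tag string. Below is a sketch in plain Python, independent of cclutils. `split_ctag` is a hypothetical helper, and `COARSE_POS` lists only the single fine-to-coarse pair confirmed by the examples above, not the library's full mapping:

```python
def split_ctag(ctag):
    """Split a tag such as 'subst:pl:inst:f' into its main part and attributes."""
    main, *attributes = ctag.split(':')
    return main, attributes


# Illustrative only: the one fine-to-coarse pair shown in this README.
COARSE_POS = {'subst': 'noun'}

main, attributes = split_ctag('subst:pl:inst:f')
print(main)                  # subst
print(attributes)            # ['pl', 'inst', 'f']
print(COARSE_POS.get(main))  # noun
```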