# cclutils

A convenient API based on the Corpus2 library for reading, writing, and processing
textual corpora represented as CCL (XML) documents.

IO
======

###### Read CCL file

```python
import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)
```

###### Read CCL with relations (REL file):

```python

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
```
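
The examples in this README pair `example.xml` with `example.rel.xml` and `out.xml` with `out.rel.xml`. Assuming that naming pattern holds for your corpus (it is only inferred from these examples, not a documented rule), the REL path can be derived from the CCL path. `rel_path_for` is a hypothetical helper and not part of cclutils:

```python
from pathlib import Path


def rel_path_for(ccl_path):
    """Return the sibling REL path, e.g. doc.xml -> doc.rel.xml (assumed convention)."""
    # with_suffix replaces only the last suffix ('.xml') with '.rel.xml'.
    return Path(ccl_path).with_suffix('.rel.xml')


cclpath = './example.xml'
relpath = rel_path_for(cclpath)
```

The helper returns a `Path`; convert it with `str(relpath)` if the reader expects a plain string.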

###### Specify tagset

```python
document = cclutils.read(cclpath, relpath, 'nkjp')
```

###### Write CCL

```python
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
```

Or with relations:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
```

Or with relations and an explicit tagset:

```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```

###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```

Document structure
==================
The CCL format specifies a basic segmentation structure: paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<tok>```). To iterate over the document we
can use the corresponding API methods:

```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```

We can also create a generator for iterating only the tokens in a more Pythonic way:

```python
document = cclutils.read('./example.xml')

# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
    for sentence in paragraph.sentences()
    for token in sentence.tokens())
```
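
If the same traversal is needed in several places, the generator expression above can be wrapped in a small function. This is a minimal sketch: `iter_tokens` is a hypothetical helper, not part of cclutils, and it relies only on the `paragraphs()`, `sentences()`, and `tokens()` methods shown above.

```python
def iter_tokens(document):
    """Yield every token of a document in reading order."""
    for paragraph in document.paragraphs():
        for sentence in paragraph.sentences():
            # Delegate to the sentence's own token iterator.
            yield from sentence.tokens()


# Intended usage (requires cclutils and a CCL file):
# document = cclutils.read('./example.xml')
# for token in iter_tokens(document):
#     ...
```

Like the generator expression, the function is lazy: tokens are produced one at a time as the caller consumes them.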

To avoid loading a large CCL document into RAM all at once (as DOM parsers do), we can
read it iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):

```python
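# Assumes both iterators are exported at the package level.
from cclutils import read_chunks_it, read_sentences_it

ccl_path = './example.xml'

# Stream the document paragraph by paragraph.
for paragraph in read_chunks_it(ccl_path):
    pass

# Or stream it sentence by sentence.
for sentence in read_sentences_it(ccl_path):
    pass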
```
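
To illustrate the streaming idea itself, here is a sketch that uses only the Python standard library. It is not how cclutils implements its iterators. It assumes tokens are stored as `<tok>` elements with an `<orth>` child inside each `<sentence>`; `iter_sentence_orths` and the sample document are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def iter_sentence_orths(source):
    """Yield one list of token surface forms per <sentence> element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sentence":
            yield [tok.findtext("orth") for tok in elem.iter("tok")]
            # Drop the processed subtree so parsed tokens do not accumulate.
            elem.clear()


# Tiny hand-written sample; real CCL files carry more markup per token.
SAMPLE = b"""<chunkList>
  <chunk id="ch1">
    <sentence id="s1">
      <tok><orth>Ala</orth></tok>
      <tok><orth>ma</orth></tok>
    </sentence>
    <sentence id="s2">
      <tok><orth>kota</orth></tok>
    </sentence>
  </chunk>
</chunkList>"""

sentences = list(iter_sentence_orths(io.BytesIO(SAMPLE)))
```

Each sentence is handed to the caller as soon as its closing tag is parsed, so only one sentence's tokens need to be held in memory at a time.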

Token manipulation
==================

1. Get Part-of-Speech (simple)

```python
tagset = cclutils.get_tagset('nkjp')
...
pos = get_pos(token, tagset)
```