Commit fe947b88 authored by Arkadiusz Janz

Update README.md

parent 9a6169dc
@@ -3,7 +3,8 @@
A convenient API based on Corpus2 library for reading, writing, and processing
textual corpora represented as CCL (XML) documents.
#### IO
IO
======
###### Read CCL file
@@ -46,4 +47,49 @@ cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
specify the tagset:
```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```
###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```
Document structure
==================
The CCL format specifies a basic segmentation structure: paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<token>```). To iterate over a document,
we can use the following API functions:
```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```
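As a rough illustration of the nested iteration above, the sketch below counts paragraphs, sentences, and tokens. It assumes only that a document exposes `paragraphs()`, `sentences()`, and `tokens()` as shown; `count_units` and the `FakeNode` stand-in are hypothetical names that are not part of cclutils, used here so the example runs without Corpus2 installed.

```python
def count_units(document):
    """Count paragraphs, sentences and tokens (hypothetical helper, not in cclutils)."""
    counts = {"paragraphs": 0, "sentences": 0, "tokens": 0}
    for paragraph in document.paragraphs():
        counts["paragraphs"] += 1
        for sentence in paragraph.sentences():
            counts["sentences"] += 1
            for _token in sentence.tokens():
                counts["tokens"] += 1
    return counts


class FakeNode:
    """Stand-in for a document, paragraph or sentence (assumption, for the demo only)."""

    def __init__(self, children=()):
        self._children = list(children)

    def _iter(self):
        return iter(self._children)

    # The same accessor serves every level of the fake hierarchy.
    paragraphs = sentences = tokens = _iter


# Two paragraphs, three sentences, six tokens (plain strings as fake tokens).
demo_document = FakeNode([
    FakeNode([FakeNode(["a", "b"]), FakeNode(["c"])]),
    FakeNode([FakeNode(["d", "e", "f"])]),
])
demo_counts = count_units(demo_document)
```

With a real file, the same helper would presumably be called as `count_units(cclutils.read('./example.xml'))`.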
To iterate over the tokens only, we can build a generator expression, which is more Pythonic:
```python
document = cclutils.read('./example.xml')
# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
          for sentence in paragraph.sentences()
          for token in sentence.tokens())
```
Token manipulation
==================
1. Get Part-of-Speech (simple)
```python
tagset = cclutils.get_tagset('nkjp')
...
pos = get_pos(token, tagset)
```
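Once a part of speech can be obtained per token, a frequency table is a natural next step. The sketch below is an assumption-laden illustration: `pos_histogram` is a hypothetical helper, and the part-of-speech lookup is passed in as a callable so that the example does not depend on the exact signature of `get_pos`. The demo tokens are made-up `(orth, pos)` pairs, not output from a real corpus.

```python
from collections import Counter


def pos_histogram(tokens, pos_of):
    """Count how often each part of speech occurs (hypothetical helper).

    `pos_of` is any callable mapping a token to its part-of-speech label.
    """
    return Counter(pos_of(token) for token in tokens)


# Made-up demo data: (orth, pos) pairs standing in for real token objects.
demo_tokens = [("Ala", "subst"), ("ma", "fin"), ("kota", "subst"), (".", "interp")]
histogram = pos_histogram(demo_tokens, pos_of=lambda token: token[1])
```

With real tokens, the callable would presumably be `lambda token: get_pos(token, tagset)`, using the `tagset` object obtained above.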
\ No newline at end of file