Commit fe947b88 authored by Arkadiusz Janz

Update README.md

parent 9a6169dc
@@ -3,7 +3,8 @@
A convenient API based on Corpus2 library for reading, writing, and processing
textual corpora represented as CCL (XML) documents.
#### IO
IO
======
###### Read CCL file
@@ -47,3 +48,48 @@ specify the tagset:
```python
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
```
###### Get tagset object
```python
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
```
Document structure
==================
The CCL format defines a basic segmentation structure: paragraphs (```<chunk>```),
sentences (```<sentence>```), and tokens (```<tok>```). To iterate over the document, use
the following API methods:
```python
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
```
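To make the nesting concrete, the sketch below walks the same three levels directly in the XML, using only the standard library rather than cclutils. The sample document is hand-written for illustration, and it assumes tokens are serialized as `<tok>` elements with an `<orth>` child, as in CCL files written by Corpus2 tools; the helper name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written CCL fragment (not taken from a real corpus).
CCL_SAMPLE = """<chunkList>
  <chunk id="ch1" type="p">
    <sentence id="s1">
      <tok><orth>Ala</orth><lex disamb="1"><base>Ala</base><ctag>subst:sg:nom:f</ctag></lex></tok>
      <tok><orth>ma</orth><lex disamb="1"><base>mieć</base><ctag>fin:sg:ter:imperf</ctag></lex></tok>
      <tok><orth>kota</orth><lex disamb="1"><base>kot</base><ctag>subst:sg:acc:m2</ctag></lex></tok>
    </sentence>
  </chunk>
</chunkList>"""


def outline(xml_text):
    """Return (chunk id, sentence id, orth) triples in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for chunk in root.iter("chunk"):                # paragraph level
        for sentence in chunk.iter("sentence"):     # sentence level
            for tok in sentence.iter("tok"):        # token level
                rows.append((chunk.get("id"), sentence.get("id"), tok.findtext("orth")))
    return rows


for row in outline(CCL_SAMPLE):
    print(row)
```

This is only a view of the file layout; for real processing, the `paragraphs()`, `sentences()`, and `tokens()` methods shown above are the intended entry points.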
We can also create a generator for iterating only the tokens in a more Pythonic way:
```python
document = cclutils.read('./example.xml')
# tokens is a generator:
tokens = (token
          for paragraph in document.paragraphs()
          for sentence in paragraph.sentences()
          for token in sentence.tokens())
```
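A generator expression like this is lazy and can be consumed only once. If the token stream is needed in several places, one option is to wrap the same traversal in a generator function, which returns a fresh iterator on every call. The sketch below shows the pattern on small stand-in classes: `Document`, `Paragraph`, `Sentence`, and `iter_tokens` are hypothetical and only mimic the three method names used in this README, so the example runs without cclutils installed.

```python
from itertools import islice


class Sentence:
    """Stand-in for a sentence object; holds plain strings as tokens."""
    def __init__(self, tokens):
        self._tokens = tokens

    def tokens(self):
        return self._tokens


class Paragraph:
    """Stand-in for a paragraph object."""
    def __init__(self, sentences):
        self._sentences = sentences

    def sentences(self):
        return self._sentences


class Document:
    """Stand-in for a document object."""
    def __init__(self, paragraphs):
        self._paragraphs = paragraphs

    def paragraphs(self):
        return self._paragraphs


def iter_tokens(document):
    """Yield every token of the document, paragraph by paragraph."""
    for paragraph in document.paragraphs():
        for sentence in paragraph.sentences():
            yield from sentence.tokens()


document = Document([
    Paragraph([Sentence(["Ala", "ma", "kota"]), Sentence(["Kot", "śpi"])]),
    Paragraph([Sentence(["Koniec"])]),
])

# Laziness: only the first four tokens are ever produced here.
first_four = list(islice(iter_tokens(document), 4))

# Each call starts a new traversal, so the full list is still available.
all_tokens = list(iter_tokens(document))
```

With a real cclutils document, the same `iter_tokens` function should work unchanged, provided the objects expose the three methods shown in this section.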
Token manipulation
==================
1. Get Part-of-Speech (simple)
```python
tagset = cclutils.get_tagset('nkjp')
...
pos = get_pos(token, tagset)
```
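For intuition about what a part-of-speech value looks like: NKJP-style tags are commonly written as colon-separated strings such as `subst:sg:nom:f`, where the first field is the grammatical class. The sketch below extracts that field from a plain string. It is not how `get_pos` is implemented, since `get_pos` works on token objects and a tagset; the helper name `pos_from_ctag` is hypothetical.

```python
def pos_from_ctag(ctag):
    """Return the grammatical class: the part of the tag before the first colon."""
    return ctag.split(":", 1)[0]


# Tags with attributes, and one tag that consists of the class alone.
examples = ["subst:sg:nom:f", "fin:sg:ter:imperf", "interp"]
for ctag in examples:
    print(ctag, "->", pos_from_ctag(ctag))
```

String splitting is enough for display or quick filtering; using the tagset object remains the safer route when tags have to be validated or compared.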
\ No newline at end of file