Update README.md

Commit fe947b88 authored May 22, 2020 by Arkadiusz Janz
--- a/README.md
+++ b/README.md
@@ -3,7 +3,8 @@
 A convenient API based on Corpus2 library for reading, writing, and processing
 textual corpora represented as CCL (XML) documents.

-#### IO
+IO
+======

 ###### Read CCL file

@@ -47,3 +48,48 @@ specify the tagset:
 ```python
 cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
 ```
+
+###### Get tagset object
+```python
+tagset = cclutils.get_tagset('nkjp')
+tagset = cclutils.get_tagset('spacy')
+...
+```
+
+Document structure
+==================
+The CCL format specifies a basic segmentation structure, mainly paragraphs (```<chunk>```),
+sentences (```<sentence>```), and tokens (```<token>```). To iterate over the document, we
+can use the dedicated API methods:
+
+```python
+document = cclutils.read('./example.xml')
+for paragraph in document.paragraphs():
+    ...
+    for sentence in paragraph.sentences():
+        ...
+        for token in sentence.tokens():
+            ...
+```
+
+We can also build a generator that iterates over the tokens only, which is more Pythonic:
+
+```python
+document = cclutils.read('./example.xml')
+
+# tokens is a generator:
+tokens = (token for paragraph in document.paragraphs()
+    for sentence in paragraph.sentences()
+    for token in sentence.tokens())
+```
+
+Token manipulation
+==================
+
+1. Get Part-of-Speech (simple)
+
+```python
+tagset = cclutils.get_tagset('nkjp')
+...
+pos = get_pos(token, tagset)
+```
\ No newline at end of file
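
The token generator added in this commit can be tried without a Corpus2 installation. The sketch below is only an illustration: `Document`, `Paragraph`, `Sentence`, and `iter_tokens` are hypothetical stand-ins that mimic the `paragraphs()`, `sentences()`, and `tokens()` methods shown in the README, and they are not part of cclutils.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


# Hypothetical stand-ins for the objects returned by cclutils.read().
# They only mimic the paragraphs()/sentences()/tokens() methods from the README.
@dataclass
class Sentence:
    words: List[str] = field(default_factory=list)

    def tokens(self) -> List[str]:
        return self.words


@dataclass
class Paragraph:
    items: List[Sentence] = field(default_factory=list)

    def sentences(self) -> List[Sentence]:
        return self.items


@dataclass
class Document:
    items: List[Paragraph] = field(default_factory=list)

    def paragraphs(self) -> List[Paragraph]:
        return self.items


def iter_tokens(document) -> Iterator[str]:
    """Lazily yield every token of a document in reading order."""
    for paragraph in document.paragraphs():
        for sentence in paragraph.sentences():
            yield from sentence.tokens()


# Two paragraphs: the first has two sentences, the second has one.
document = Document([
    Paragraph([Sentence(["Anna", "has", "a", "cat"]), Sentence(["It", "sleeps"])]),
    Paragraph([Sentence(["The", "end"])]),
])

# Materialize the generator only to inspect the result.
flat = list(iter_tokens(document))
print(flat)
```

Wrapping the nested loops in a generator function behaves like the generator expression from the README, but gives the traversal a reusable name; any object exposing the same three methods should work with it.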