diff --git a/README.md b/README.md
index 6d8b46cb4a81b17044abd1e05982e65a29f16544..d10e8a0339e63b1ecd27df785bed17e8d736d226 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,8 @@
 A convenient API based on Corpus2 library for reading, writing, and processing
 textual corpora represented as CCL (XML) documents.
 
-#### IO
+IO
+======
 
 ###### Read CCL file
 
@@ -46,4 +47,49 @@ cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
 specify the tagset:
 ```python
 cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
+```
+
+###### Get tagset object
+```python
+tagset = cclutils.get_tagset('nkjp')
+tagset = cclutils.get_tagset('spacy')
+...
+```
+
+Document structure
+==================
+CCL format specifies basic segmentation structure, mainly paragraphs (```<chunk>```),
+sentences (```<sentence>```), and tokens (```<token>```). To iterate the document we
+can use special API functions:
+
+```python
+document = cclutils.read('./example.xml')
+for paragraph in document.paragraphs():
+    ...
+    for sentence in paragraph.sentences():
+        ...
+        for token in sentence.tokens():
+            ...
+```
+
+We can also create a generator for iterating only the tokens in a more Pythonic way:
+
+```python
+document = cclutils.read('./example.xml')
+
+# tokens is a generator:
+tokens = (token for paragraph in document.paragraphs()
+    for sentence in paragraph.sentences()
+    for token in sentence.tokens())
+```
+
+Token manipulation
+==================
+
+1. Get Part-of-Speech (simple)
+
+```python
+tagset = cclutils.get_tagset('nkjp')
+...
+pos = get_pos(token, tagset)
 ```
\ No newline at end of file