cclutils
A convenient API based on the Corpus2 library for reading, writing, and processing textual corpora represented as CCL (XML) documents.
Install
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
IO
Read CCL file
import cclutils
filepath = './example.xml'
document = cclutils.read(filepath)
Read CCL with relations (REL file):
cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
Specify the tagset:
document = cclutils.read(cclpath, relpath, 'nkjp')
Write CCL
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
or with relations:
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
Specify the tagset:
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
Get tagset object
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
Document structure
The CCL format specifies a basic segmentation structure, mainly paragraphs (<chunk>), sentences (<sentence>), and tokens (<token>). To iterate over the document we can use dedicated API functions:
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
We can also create a generator to iterate over the tokens only, in a more Pythonic way:
document = cclutils.read('./example.xml')
# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
                for sentence in paragraph.sentences()
                for token in sentence.tokens())
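The triple-`for` flattening above is plain Python and can be tried without cclutils. A minimal sketch using nested lists as stand-ins for paragraphs, sentences, and tokens (the sample data is illustrative only):

```python
# Nested lists standing in for a document: paragraphs -> sentences -> tokens.
paragraphs = [
    [["Ala", "ma", "kota"], ["Kot", "spi"]],  # paragraph 1: two sentences
    [["To", "jest", "zdanie"]],               # paragraph 2: one sentence
]

# Same shape as the cclutils generator expression above.
tokens = (token
          for sentences in paragraphs
          for sentence in sentences
          for token in sentence)

all_tokens = list(tokens)  # tokens come out in document order
```

Note that the generator is lazy: no token is touched until the result is consumed, which is what makes this idiom cheap even for large documents.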
To avoid loading large CCL documents into RAM (as DOM parsers do), we can read them iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):
ccl_path = './example.xml'

it = cclutils.read_chunks_it(ccl_path)
for paragraph in it:
    pass

it = cclutils.read_sentences_it(ccl_path)
for sentence in it:
    pass
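The streaming idea behind these iterators can be sketched with the standard library's `iterparse`. This is not the cclutils implementation, only an illustration of how a SAX-style pass yields sentences one at a time with flat memory use; the tiny inline document and tag names are simplified stand-ins for real CCL:

```python
import io
import xml.etree.ElementTree as ET

# A tiny CCL-like document (structure simplified for illustration).
ccl = b"""<chunkList>
  <chunk id="1">
    <sentence id="s1"><tok><orth>Ala</orth></tok></sentence>
    <sentence id="s2"><tok><orth>ma</orth></tok></sentence>
  </chunk>
</chunkList>"""

def sentences_it(stream):
    # Yield each <sentence> as soon as it is fully parsed, then free it,
    # so memory stays flat regardless of document size.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "sentence":
            yield elem.get("id"), [o.text for o in elem.iter("orth")]
            elem.clear()

sents = list(sentences_it(io.BytesIO(ccl)))
```

The `elem.clear()` call after each yield is what distinguishes this from a DOM parse: processed subtrees are discarded instead of accumulating in memory.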
Token manipulation
- Get Part-of-Speech (simple)
tagset = cclutils.get_tagset('nkjp')
...
pos = cclutils.get_pos(token, tagset)
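For intuition: in positional tagsets such as NKJP, a token's tag is a colon-separated string whose first field is the grammatical class, which is what the POS lookup returns. A plain-Python sketch of that idea (an illustrative helper, not the cclutils implementation):

```python
def pos_of_tag(tag: str) -> str:
    # In NKJP-style positional tags (e.g. "subst:sg:nom:m1") the grammatical
    # class is the first colon-separated field.
    return tag.split(":", 1)[0]

pos = pos_of_tag("subst:sg:nom:m1")  # -> "subst"
```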