cclutils

A convenient API based on the Corpus2 library for reading, writing, and processing textual corpora represented as CCL (XML) documents.

Requirements

Python 3.6, Corpus2

To install Corpus2, add a new APT source (run as root):

wget -q -O - http://apt.clarin-pl.eu/KEY.gpg | apt-key add -
echo 'deb https://apt.clarin-pl.eu/ /' > /etc/apt/sources.list.d/clarin.list

apt-get update && apt-get install corpus2-python3.6

Install

pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/

IO

Read CCL file
import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)
Read CCL with relations (REL file):

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
Specify tagset
document = cclutils.read(cclpath, relpath, 'nkjp')
Write CCL
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')

or with relations:

cclutils.write(document, './out.xml', rel_path='./out.rel.xml')

specify the tagset:

cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
Get tagset object
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...

Document structure

The CCL format specifies a basic segmentation structure, mainly paragraphs (<chunk>), sentences (<sentence>), and tokens (<token>). To iterate over a document we can use dedicated API functions:

document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...

We can also create a generator that iterates over just the tokens, in a more Pythonic way:

document = cclutils.read('./example.xml')

# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
    for sentence in paragraph.sentences()
    for token in sentence.tokens())
    
for token in tokens:
    ...
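
For example, the same pattern can collect every token's lemma. A minimal sketch, assuming get_lexeme_lemma (described in the Token manipulation section below) can be imported from the cclutils namespace:

import cclutils
from cclutils import get_lexeme_lemma  # assumed top-level export

document = cclutils.read('./example.xml')

# collect the lemma of each token in the document
lemmas = [get_lexeme_lemma(token)
          for paragraph in document.paragraphs()
          for sentence in paragraph.sentences()
          for token in sentence.tokens()]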

To avoid loading large CCL documents into RAM (as DOM parsers do), we can read them iteratively, chunk by chunk or sentence by sentence (a SAX-like approach):

from cclutils import read_chunks_it, read_sentences_it

ccl_path = './example.xml'

it = read_chunks_it(ccl_path)
for paragraph in it:
    pass

it = read_sentences_it(ccl_path)
for sentence in it:
    pass
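
This keeps only one sentence in memory at a time. A minimal sketch of counting tokens in a large corpus this way (same import assumption as above):

from cclutils import read_sentences_it

# count tokens without materializing the whole document in RAM
n_tokens = 0
for sentence in read_sentences_it('./example.xml'):
    n_tokens += sum(1 for _ in sentence.tokens())
print(n_tokens)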

Token manipulation

  1. Get Part-of-Speech (simple, returns the complete tag)
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'
  2. Get Part-of-Speech (main_only, returns only the main part of the tag)
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
  3. Get coarse-grained PoS (NKJP only for now)
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset)
'noun'
  4. Convert to coarse-grained PoS (NKJP only for now)
>>> convert_to_coarse_pos('subst')
'noun'
  5. Get token lemma
>>> get_lexeme_lemma(token)
'samolot'
  6. Check if a token is preceded by whitespace; add or remove the whitespace
>>> token.after_space()
True
>>> token.set_wa(False)
>>> token.after_space()
False
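
Putting the helpers together: a minimal sketch that collects the lemmas of all nouns in a document, assuming the functions above can be imported from the cclutils namespace:

import cclutils
from cclutils import get_coarse_pos, get_lexeme_lemma  # assumed top-level exports

document = cclutils.read('./example.xml')
tagset = cclutils.get_tagset('nkjp')

# keep the lemma of every token whose coarse-grained PoS is 'noun'
noun_lemmas = [get_lexeme_lemma(token)
               for paragraph in document.paragraphs()
               for sentence in paragraph.sentences()
               for token in sentence.tokens()
               if get_coarse_pos(token, tagset) == 'noun']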