    cclutils

A convenient API based on the Corpus2 library for reading, writing, and processing textual corpora represented as CCL (XML) documents.

    Install

    pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/

    IO

Read a CCL file:

import cclutils

filepath = './example.xml'
document = cclutils.read(filepath)

Read a CCL file with relations (REL file):

cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)

Specify the tagset:

document = cclutils.read(cclpath, relpath, 'nkjp')

Write a CCL file:

document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')

    or with relations:

    cclutils.write(document, './out.xml', rel_path='./out.rel.xml')

    specify the tagset:

    cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
    Get a tagset object:
    tagset = cclutils.get_tagset('nkjp')
    tagset = cclutils.get_tagset('spacy')
    ...

    Document structure

    The CCL format defines a basic segmentation structure, mainly paragraphs (<chunk>), sentences (<sentence>), and tokens (<token>). To iterate over a document we can use dedicated API functions:

    document = cclutils.read('./example.xml')
    for paragraph in document.paragraphs():
        ...
        for sentence in paragraph.sentences():
            ...
            for token in sentence.tokens():
                ...

    We can also create a generator for iterating only the tokens in a more Pythonic way:

    document = cclutils.read('./example.xml')
    
    # tokens is a generator:
    tokens = (token for paragraph in document.paragraphs()
        for sentence in paragraph.sentences()
        for token in sentence.tokens())
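
The flattening pattern above can be tried without a real CCL file. The sketch below uses toy stand-in classes (not the actual Corpus2/cclutils objects) that expose the same `paragraphs()`/`sentences()`/`tokens()` interface, just to show that the generator expression walks the hierarchy lazily:

```python
# Toy stand-ins for the document hierarchy (illustration only,
# NOT the real Corpus2/cclutils classes).
class Sentence:
    def __init__(self, toks):
        self._toks = toks
    def tokens(self):
        return iter(self._toks)

class Paragraph:
    def __init__(self, sents):
        self._sents = sents
    def sentences(self):
        return iter(self._sents)

class Document:
    def __init__(self, paras):
        self._paras = paras
    def paragraphs(self):
        return iter(self._paras)

doc = Document([
    Paragraph([Sentence(["Ala", "ma"]), Sentence(["kota"])]),
    Paragraph([Sentence(["."])]),
])

# The same flattening pattern as with a real CCL document:
tokens = (token for paragraph in doc.paragraphs()
          for sentence in paragraph.sentences()
          for token in sentence.tokens())

print(list(tokens))  # ['Ala', 'ma', 'kota', '.']
```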

    To avoid loading large CCL documents into RAM (as DOM parsers do), we can read them iteratively, paragraph by paragraph or sentence by sentence (a SAX-based approach):

    it = read_chunks_it(ccl_path)
    for paragraph in it:
        pass
        
    it = read_sentences_it(ccl_path)
    for sentence in it:
        pass
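
The idea behind such iterators can be sketched with the standard library alone. The example below streams a minimal CCL-like document (tag names simplified for illustration; a real CCL file carries more markup, e.g. `<lex>` elements) with `xml.etree.ElementTree.iterparse`, yielding one sentence at a time instead of building a full DOM:

```python
import io
import xml.etree.ElementTree as ET

# A minimal CCL-like document (structure simplified for illustration).
ccl = b"""<chunkList>
  <chunk id="ch1">
    <sentence id="s1"><tok><orth>Ala</orth></tok></sentence>
    <sentence id="s2"><tok><orth>ma</orth></tok></sentence>
  </chunk>
  <chunk id="ch2">
    <sentence id="s3"><tok><orth>kota</orth></tok></sentence>
  </chunk>
</chunkList>"""

def sentences_it(stream):
    """Yield <sentence> elements one at a time, SAX-style,
    without keeping the whole document tree in memory."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "sentence":
            yield elem
            elem.clear()  # free the subtree we have already consumed

ids = [s.get("id") for s in sentences_it(io.BytesIO(ccl))]
print(ids)  # ['s1', 's2', 's3']
```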

    Token manipulation

    1. Get the part of speech (simple; returns the complete tag):
    >>> tagset = cclutils.get_tagset('nkjp')
    >>> pos = get_pos(token, tagset)
    >>> pos
    'subst:pl:inst:f'
    
    2. Get the part of speech (main_only; returns only the main part of the tag):
    >>> tagset = cclutils.get_tagset('nkjp')
    >>> pos = get_pos(token, tagset, main_only=True)
    >>> pos
    'subst'
    
    3. Get the coarse-grained part of speech:
    >>> tagset = cclutils.get_tagset('nkjp')
    >>> pos = get_coarse_pos(token, tagset, main_only=True)
    >>> pos
    'noun'
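
The relationship between the three results above can be illustrated on the raw NKJP tag string itself. The helpers below are hypothetical (they are not part of cclutils, and the class-to-coarse mapping is a deliberately tiny assumption, not the real NKJP mapping), but they show how the main grammatical class and a coarse category relate to the full tag:

```python
# Hypothetical helpers operating on a raw NKJP tag string such as
# 'subst:pl:inst:f' (NOT the cclutils API, which works on token objects).
NOUN_CLASSES = {"subst", "ger", "depr"}  # assumption: toy subset of noun-like classes

def main_pos(tag):
    """Return only the grammatical class, e.g. 'subst' from 'subst:pl:inst:f'."""
    return tag.split(":", 1)[0]

def coarse_pos(tag):
    """Map the fine-grained class to a coarse category (sketch only)."""
    return "noun" if main_pos(tag) in NOUN_CLASSES else main_pos(tag)

tag = "subst:pl:inst:f"
print(main_pos(tag))    # subst
print(coarse_pos(tag))  # noun
```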