cclutils
A convenient API, based on the Corpus2 library, for reading, writing, and processing textual corpora represented as CCL (XML) documents.
Requirements
python3.6, corpus2
To install Corpus2, first add the CLARIN-PL source for APT:
wget -q -O - http://apt.clarin-pl.eu/KEY.gpg | apt-key add -
echo 'deb https://apt.clarin-pl.eu/ /' > /etc/apt/sources.list.d/clarin.list
apt-get update && apt-get install corpus2-python3.6
Alternatively, use a Docker image:
FROM clarinpl/python:3.6
RUN apt-get update && apt-get install -y \
corpus2-python3.6
RUN pip install --upgrade pip && pip install cclutils
Install
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
IO
Read CCL file
import cclutils
filepath = './example.xml'
document = cclutils.read(filepath)
Read CCL with relations (REL file):
cclpath = './example.xml'
relpath = './example.rel.xml'
document = cclutils.read(cclpath, relpath)
Specify tagset
document = cclutils.read(cclpath, relpath, 'nkjp')
Write CCL
document = cclutils.read(filepath)
...
cclutils.write(document, './out.xml')
or with relations:
cclutils.write(document, './out.xml', rel_path='./out.rel.xml')
specify the tagset:
cclutils.write(document, './out.xml', rel_path='./out.rel.xml', tagset='spacy')
Get tagset object
tagset = cclutils.get_tagset('nkjp')
tagset = cclutils.get_tagset('spacy')
...
Document structure
The CCL format specifies a basic segmentation structure: paragraphs (<chunk>), sentences (<sentence>), and tokens (<tok>). To iterate over the document we can use dedicated API functions:
document = cclutils.read('./example.xml')
for paragraph in document.paragraphs():
    ...
    for sentence in paragraph.sentences():
        ...
        for token in sentence.tokens():
            ...
We can also create a generator for iterating only the tokens in a more Pythonic way:
document = cclutils.read('./example.xml')
# tokens is a generator:
tokens = (token for paragraph in document.paragraphs()
          for sentence in paragraph.sentences()
          for token in sentence.tokens())
for token in tokens:
    ...
To avoid loading large CCL documents into RAM (as DOM parsers do), we can read them iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):
it = read_chunks_it(ccl_path)
for paragraph in it:
    pass
it = read_sentences_it(ccl_path)
for sentence in it:
    pass
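To illustrate the general idea of streaming a document sentence by sentence, here is a sketch built on the standard library's incremental XML parser. It is not how cclutils or Corpus2 implement their readers; stream_sentences and SAMPLE_CCL are hypothetical, and the sample assumes a simplified CCL-like layout with <tok> and <orth> elements.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, List

# Simplified, hand-written sample (assumption: real CCL files carry more markup).
SAMPLE_CCL = b"""<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok><orth>Ala</orth></tok>
   <tok><orth>ma</orth></tok>
   <tok><orth>kota</orth></tok>
  </sentence>
 </chunk>
 <chunk id="ch2" type="p">
  <sentence id="s2">
   <tok><orth>Kot</orth></tok>
   <tok><orth>spi</orth></tok>
  </sentence>
 </chunk>
</chunkList>"""


def stream_sentences(source: BinaryIO) -> Iterator[List[str]]:
    """Yield the orths of one sentence at a time from a file-like object."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "sentence":
            yield [orth.text for orth in element.iter("orth")]
            # Drop the children of the sentence that was just consumed.
            element.clear()


streamed = list(stream_sentences(io.BytesIO(SAMPLE_CCL)))
print(streamed)
```

Only one sentence subtree needs to be fully populated at any time, which is what keeps memory use low on large files.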
Token manipulation
- Get Part-of-Speech (simple, returns the complete tag)
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'
- Get Part-of-Speech (main_only, returns only the main part of the tag)
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
- Get coarse-grained PoS (NKJP only for now)
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset)
'noun'
- Convert to coarse-grained PoS (NKJP only for now)
>>> convert_to_coarse_pos('subst')
'noun'
- Get token lemma
>>> get_lexeme_lemma(token)
'samolot'
- Check if a token is preceded by whitespace. Add or remove a whitespace.
>>> token.after_space()
True
>>> token.set_wa(False)
>>> token.after_space()
False
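The relation between a complete tag, its main part, and a coarse-grained class can be sketched in plain Python. This is an illustration only, not the cclutils implementation: main_pos, coarse_pos, and COARSE_POS are hypothetical names, and the mapping table is deliberately partial, listing just the subst/noun pair shown above.

```python
from typing import Optional

# Deliberately partial: only the pair that appears in this README.
COARSE_POS = {"subst": "noun"}


def main_pos(tag: str) -> str:
    """Return the part of a colon-separated tag before the first colon."""
    return tag.split(":", 1)[0]


def coarse_pos(tag: str) -> Optional[str]:
    """Map a complete or main-only tag to a coarse class, or None if unknown."""
    return COARSE_POS.get(main_pos(tag))


print(main_pos("subst:pl:inst:f"))
print(coarse_pos("subst:pl:inst:f"))
print(coarse_pos("subst"))
```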
Sentence manipulation
- Print the sentences of a given document
document = cclutils.read('./example.xml')
sentences = (sentence for paragraph in document.paragraphs()
             for sentence in paragraph.sentences())
for sentence in sentences:
    print(cclutils.sentence2str(sentence))
Reading annotations
Annotations can be extracted from a CCL document with the cclutils.extras.annotations module, which is built on top of the core cclutils functionality.
The main function of this module is get_document_annotations, which reads annotations from a CCL document (from a file or a corpus2.DocumentPtr object).
from cclutils.extras.annotations import get_document_annotations
The annotations are organized using two classes:
- AnnotatedExpression: represents a single annotation (annotated expression) located in a specific paragraph and sentence. The module supports annotations covering single words as well as multiword expressions (more than one token).
- DocumentAnnotations: keeps the annotations of the entire document and provides methods for gathering and accessing them.
Read annotations of a given document
- Read all annotations
>>> anns = get_document_annotations(cclutils.read('tests/data/ccl02.xml'))
>>> anns
<DocumentAnnotations for 10 annotated expressions: [
 <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
 <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
 <AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>,
 <AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
 <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
 <AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>,
 <AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>,
 <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>,
 <AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>,
 <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]>
- Read only specified annotations
>>> anns = get_document_annotations(cclutils.read('tests/data/ccl02.xml'), annotations={'designation'})
>>> anns
<DocumentAnnotations for 2 annotated expressions: [
 <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
 <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>]>
Get annotations in one of preferred forms
- Get annotations index containing full information about annotations
  - the key is a tuple containing the following values: (annotation channel name, sentence id, paragraph id, channel numeric value)
>>> anns.expressions_index
defaultdict(list,
 {('designation', 's1', 'ch1', 1): <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
  ('room_type', 's1', 'ch1', 1): <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
  ('region', 's1', 'ch1', 1): <AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>,
  ('attraction', 's2', 'ch2', 1): <AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
  ('hotel_name', 's2', 'ch2', 1): <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
  ('food', 's2', 'ch2', 1): <AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>,
  ('room_type', 's2', 'ch2', 1): <AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>,
  ('designation', 's2', 'ch2', 1): <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>,
  ('attraction', 's2', 'ch2', 2): <AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>,
  ('food', 's2', 'ch2', 2): <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>})
- Get annotations grouped by annotation channel name, in one of the formats:
  - annotation object
  - orths
  - preferred lexemes
  - annotation base lemma
>>> anns.group_by_chan_name()
defaultdict(list,
 {'designation': [
    <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
    <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>],
  'room_type': [
    <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
    <AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>],
  'region': [
    <AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>],
  'attraction': [
    <AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
    <AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>],
  'hotel_name': [
    <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>],
  'food': [
    <AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>,
    <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]})
>>> anns.group_by_chan_name(as_orths=True)
defaultdict(list,
 {'designation': [('dla', 'dwóch', 'osób'), ('dla', 'dzieci')],
  'room_type': [('dla', 'dwóch', 'osób'), ('łazienką',)],
  'region': [('Gdańsk',)],
  'attraction': [('Hotel',), ('spa',)],
  'hotel_name': [('Hotel',)],
  'food': [('śniadaniem',), ('pełnym', 'wyżywieniem')]})
>>> anns.group_by_chan_name(as_lexemes=True)
defaultdict(list,
 {'designation': [('dla', 'dwa', 'osoba'), ('dla', 'dziecko')],
  'room_type': [('dla', 'dwa', 'osoba'), ('łazienka',)],
  'region': [('Gdańsk',)],
  'attraction': [('hotel',), ('spa',)],
  'hotel_name': [('hotel',)],
  'food': [('śniadanie',), ('pełny', 'wyżywienie')]})
>>> anns.group_by_chan_name(as_ann_base=True)
defaultdict(list,
 {'designation': ['dla dwóch osób', 'dla dziecka'],
  'room_type': ['dla dwóch osób', 'łazienka'],
  'region': [''],
  'attraction': ['hotel', 'spa'],
  'hotel_name': ['Hotel'],
  'food': ['śniadanie', 'pełne wyżywienie']})
- Get annotations grouped by token (token position), in one of the formats (usage is the same as for the group_by_chan_name method):
  - annotation object
  - orths
  - preferred lexemes
  - annotation base lemma
>>> anns.group_by_token()
{(1, 's1', 'ch1'): [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
   <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>],
 (2, 's1', 'ch1'): [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
   <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>],
 (3, 's1', 'ch1'): [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
   <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>],
 (4, 's1', 'ch1'): [
   <AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>],
 (0, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
   <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>],
 (3, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>],
 (7, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>],
 (10, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>],
 (11, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>],
 (13, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>],
 (17, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>],
 (18, 's2', 'ch2'): [
   <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]}
- Get annotations grouped by token, in the original document order (token order):
>>> anns.group_by_token(retain_order=True)
OrderedDict([
 ((1, 's1', 'ch1'), [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
   <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>]),
 ((2, 's1', 'ch1'), [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
   <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>]),
 ((3, 's1', 'ch1'), [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>,
   <AnnotatedExpression for annotation 'room_type': 'room_type:('dla', 'dwóch', 'osób')'; ('dla', 'dwa', 'osoba') at position: ch1>s1>t1,t2,t3>]),
 ((4, 's1', 'ch1'), [
   <AnnotatedExpression for annotation 'region': 'region:('Gdańsk',)'; ('Gdańsk',) at position: ch1>s1>t4>]),
 ((0, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'attraction': 'attraction:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>,
   <AnnotatedExpression for annotation 'hotel_name': 'hotel_name:('Hotel',)'; ('hotel',) at position: ch2>s2>t0>]),
 ((3, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'food': 'food:('śniadaniem',)'; ('śniadanie',) at position: ch2>s2>t3>]),
 ((7, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'room_type': 'room_type:('łazienką',)'; ('łazienka',) at position: ch2>s2>t7>]),
 ((10, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>]),
 ((11, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'designation': 'designation:('dla', 'dzieci')'; ('dla', 'dziecko') at position: ch2>s2>t10,t11>]),
 ((13, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'attraction': 'attraction:('spa',)'; ('spa',) at position: ch2>s2>t13>]),
 ((17, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>]),
 ((18, 's2', 'ch2'), [
   <AnnotatedExpression for annotation 'food': 'food:('pełnym', 'wyżywieniem')'; ('pełny', 'wyżywienie') at position: ch2>s2>t17,t18>])])
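The shape of the token-keyed grouping can be reproduced with plain Python data structures. The sketch below is an illustration of the (token index, sentence id, paragraph id) key layout seen in the outputs above, not the module's actual implementation; the Expression dataclass and the group_expressions_by_token function are hypothetical, and the sample records are copied from the first paragraph of the example document.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

TokenKey = Tuple[int, str, str]  # (token index, sentence id, paragraph id)


@dataclass(frozen=True)
class Expression:
    """Hypothetical, simplified stand-in for an annotated expression."""
    chan_name: str
    paragraph_id: str
    sentence_id: str
    token_indices: Tuple[int, ...]
    orths: Tuple[str, ...]


def group_expressions_by_token(
    expressions: Iterable[Expression],
) -> Dict[TokenKey, List[Expression]]:
    """Index expressions under every token position they cover."""
    index: Dict[TokenKey, List[Expression]] = defaultdict(list)
    for expression in expressions:
        for token_index in expression.token_indices:
            key = (token_index, expression.sentence_id, expression.paragraph_id)
            index[key].append(expression)
    return dict(index)


sample = [
    Expression("designation", "ch1", "s1", (1, 2, 3), ("dla", "dwóch", "osób")),
    Expression("room_type", "ch1", "s1", (1, 2, 3), ("dla", "dwóch", "osób")),
    Expression("region", "ch1", "s1", (4,), ("Gdańsk",)),
]
by_token = group_expressions_by_token(sample)
for key in sorted(by_token):
    print(key, [expression.chan_name for expression in by_token[key]])
```

A multiword expression appears once under each of its tokens, which is why the same annotation repeats for positions 1, 2, and 3.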
Get token by token position
- When using the above methods, you may want to get the corpus2.Token object referenced by a position:
>>> anns.token_by_position_index[(17, 's2', 'ch2')]
<corpus2.Token; proxy of <Swig Object of type 'Corpus2::Token *' at 0x7f71edfced80> >