Commit fa986670 authored by Arkadiusz Janz's avatar Arkadiusz Janz

Merge branch 'master' of gitlab.clarin-pl.eu:ajanz/cclutils

parents 505e07a6 f18a0312
@@ -3,6 +3,25 @@
A convenient API based on Corpus2 library for reading, writing, and processing
textual corpora represented as CCL (XML) documents.
###### Requirements
python3.6, corpus2
To install Corpus2 you have to add a new APT source:
```bash
wget -q -O - http://apt.clarin-pl.eu/KEY.gpg | apt-key add -
echo 'deb https://apt.clarin-pl.eu/ /' > /etc/apt/sources.list.d/clarin.list
apt-get update && apt-get install corpus2-python3.6
```
Install
=======
```bash
pip install cclutils --extra-index-url https://pypi.clarin-pl.eu/
```
IO
======
@@ -81,15 +100,72 @@ document = cclutils.read('./example.xml')
tokens = (token for paragraph in document.paragraphs()
          for sentence in paragraph.sentences()
          for token in sentence.tokens())

for token in tokens:
    ...
```
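The nested generator expression above flattens the paragraph → sentence → token hierarchy lazily, without building an intermediate list. A minimal stdlib-only sketch of the same pattern, with tiny stand-in classes instead of real corpus2 objects (the class and variable names here are illustrative assumptions):

```python
# Stand-ins mimicking the paragraphs()/sentences()/tokens() API shape.
class Sentence:
    def __init__(self, toks):
        self._toks = toks

    def tokens(self):
        return iter(self._toks)


class Paragraph:
    def __init__(self, sents):
        self._sents = sents

    def sentences(self):
        return iter(self._sents)


paragraphs = [Paragraph([Sentence(['Ala', 'ma']), Sentence(['kota'])])]

# Same triple-nested generator expression as in the README example:
# tokens are produced one at a time, only when the consumer asks for them.
tokens = (token for paragraph in paragraphs
          for sentence in paragraph.sentences()
          for token in sentence.tokens())

flat = list(tokens)
print(flat)  # ['Ala', 'ma', 'kota']
```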
To avoid loading large CCL documents into RAM (as DOM parsers do), we can read them
iteratively, chunk by chunk or sentence by sentence (a SAX-based approach):
```python
it = read_chunks_it(ccl_path)
for paragraph in it:
    pass

it = read_sentences_it(ccl_path)
for sentence in it:
    pass
```
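The streaming idea behind these iterators can be sketched with the standard library alone: `xml.etree.ElementTree.iterparse` fires an event each time an element closes, so sentences can be handled one at a time and freed immediately. The snippet below is only an illustration of the approach with a made-up CCL-like fragment, not cclutils' actual implementation:

```python
import io
import xml.etree.ElementTree as ET

# A tiny CCL-shaped fragment used purely for demonstration.
CCL_SNIPPET = b"""<chunkList>
  <chunk>
    <sentence id="s1"><tok><orth>Ala</orth></tok></sentence>
    <sentence id="s2"><tok><orth>kot</orth></tok></sentence>
  </chunk>
</chunkList>"""


def iter_sentences(stream):
    """Yield <sentence> elements one by one, keeping memory use flat."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "sentence":
            yield elem
            elem.clear()  # drop processed children so the tree stays small


sentence_ids = [s.get("id") for s in iter_sentences(io.BytesIO(CCL_SNIPPET))]
print(sentence_ids)  # ['s1', 's2']
```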
Token manipulation
==================
1. Get Part-of-Speech (simple, returns complete <ctag>)
```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset)
'subst:pl:inst:f'
```
2. Get Part-of-Speech (main_only, returns only the main part of <ctag>)
```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_pos(token, tagset, main_only=True)
'subst'
```
3. Get coarse-grained PoS (NKJP only for now)
```python
>>> tagset = cclutils.get_tagset('nkjp')
>>> get_coarse_pos(token, tagset)
'noun'
```
4. Convert to coarse-grained PoS (NKJP only for now)
```python
>>> convert_to_coarse_pos('subst')
'noun'
```
5. Get token lemma
```python
>>> get_lexeme_lemma(token)
'samolot'
```
6. Check whether a token is preceded by whitespace; add or remove the whitespace.
```python
>>> token.after_space()
True
>>> token.set_wa(False)
>>> token.after_space()
False
```
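The coarse-grained conversion above can be pictured as a simple lookup from NKJP main classes to coarse labels. A stdlib-only sketch follows; only the `subst` → `noun` pair is confirmed by the examples in this README, while the other entries and the function name `convert_to_coarse_pos_sketch` are assumptions for illustration:

```python
# Partial mapping from NKJP main PoS classes to coarse-grained labels.
COARSE_POS = {
    'subst': 'noun',     # confirmed by the examples above
    'fin': 'verb',       # assumed entry, for illustration only
    'adj': 'adjective',  # assumed entry, for illustration only
}


def convert_to_coarse_pos_sketch(main_pos):
    """Map an NKJP main class to a coarse-grained PoS label.

    Unknown classes fall through unchanged.
    """
    return COARSE_POS.get(main_pos, main_pos)


print(convert_to_coarse_pos_sketch('subst'))  # noun
```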
@@ -159,3 +159,52 @@ def get_tagset(tagset):
    if isinstance(tagset, str):
        tagset = corpus2.get_named_tagset(tagset)
    return tagset
def read_chunks_it(filepath, tagset='nkjp'):
    """Return an iterable chunk generator.

    Args:
        filepath: a path to a CCL file.
        tagset: the name of the tagset used in the document, or a tagset
            object itself.

    Returns:
        An iterable chunk generator.
    """
    tagset = get_tagset(tagset)
    reader = corpus2.TokenReader_create_path_reader('ccl', tagset, filepath)
    while True:
        chunk = reader.get_next_chunk()
        if not chunk:
            break
        yield chunk
    del reader
def read_sentences_it(filepath, tagset='nkjp'):
    """Return an iterable sentence generator.

    Args:
        filepath: a path to a CCL file.
        tagset: the name of the tagset used in the document, or a tagset
            object itself.

    Returns:
        An iterable sentence generator.
    """
    tagset = get_tagset(tagset)
    reader = corpus2.TokenReader_create_path_reader('ccl', tagset, filepath)
    while True:
        sentence = reader.get_next_sentence()
        if not sentence:
            break
        yield sentence
    del reader