From 80bdf63c9466107122d4e29d98a7078d8eb71822 Mon Sep 17 00:00:00 2001
From: Arkadiusz Janz <arkadiusz.janz@pwr.edu.pl>
Date: Fri, 22 May 2020 15:01:47 +0000
Subject: [PATCH] Update README.md

---
 README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/README.md b/README.md
index d10e8a0..9cbfe30 100644
--- a/README.md
+++ b/README.md
@@ -83,6 +83,19 @@ tokens = (token for paragraph in document.paragraphs()
     for token in sentence.tokens())
 ```
 
+To avoid loading large CCL documents to RAM (DOM parsers) we can read them
+iteratively, chunk by chunk, or sentence by sentence (SAX-based approach):
+
+```python
+it = read_chunks_it(ccl_path)
+for paragraph in it:
+    pass
+
+it = read_sentences_it(ccl_path)
+for sentence in it:
+    pass
+```
+
 Token manipulation
 ==================
 
-- 
GitLab
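
The streaming idea described in the added README text can be illustrated with the standard library alone. The sketch below is not the implementation behind `read_chunks_it` or `read_sentences_it`; it is a minimal example that assumes a CCL-like XML layout (`chunkList` > `chunk` > `sentence` > `tok` > `orth`). The helper name `iter_elements` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each completed <tag> element from `source`, one at a time.

    Hypothetical helper: it only demonstrates event-driven parsing and is
    not part of any existing library API.
    """
    # iterparse emits an "end" event as soon as an element's closing tag is read,
    # so the caller sees each element without the whole file being parsed first.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # After the caller has consumed the element, drop its children,
            # text, and attributes so they can be garbage-collected.
            elem.clear()


# Assumed, simplified CCL-like document used only for this illustration.
SAMPLE = b"""<chunkList>
  <chunk id="ch1">
    <sentence id="s1"><tok><orth>Ala</orth></tok><tok><orth>ma</orth></tok></sentence>
    <sentence id="s2"><tok><orth>kota</orth></tok></sentence>
  </chunk>
  <chunk id="ch2">
    <sentence id="s3"><tok><orth>Koniec</orth></tok></sentence>
  </chunk>
</chunkList>"""

# Sentence by sentence: read the id of each sentence as it is completed.
sentence_ids = [s.get("id") for s in iter_elements(io.BytesIO(SAMPLE), "sentence")]

# Chunk by chunk: count the sentences inside each chunk as it is completed.
chunk_sizes = [len(c.findall("sentence")) for c in iter_elements(io.BytesIO(SAMPLE), "chunk")]
```

Each element is handed to the caller as soon as its closing tag has been read and is emptied afterwards, so peak memory depends on the largest chunk or sentence rather than on the whole document. The emptied elements stay attached to their parent, so a production reader would likely also prune them from the root.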