Commit e4006d1a authored by Michał Marcińczuk

For a token that is split into a large number of subtokens, try tokenizing its lowercased form.

parent 83dcdfdf
1 merge request: !41 Dev v07
Pipeline #6121 failed with stage
in 1 minute and 46 seconds
@@ -112,6 +112,8 @@ class FeatureGenerator:
labels = ["O"] * len(tokens)
for word, label_1 in zip(tokens, labels):
    subtokens = self.encode_method(word.strip())
    if len(subtokens) > 6:
        subtokens = self.encode_method(word.strip().lower())
    if len(subtokens) > 6:
        logging.warning(f"Token {word} was truncated to 6 subtokens: {subtokens}")
        subtokens = subtokens[:6]
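
The fallback in this hunk can be tried in isolation with a minimal sketch like the one below. It is not the project's code: `encode_with_lowercase_fallback`, `MAX_SUBTOKENS`, and `toy_encode` are hypothetical names introduced for illustration, and `toy_encode` is a made-up stand-in for `self.encode_method` that keeps a few known lowercase words whole and splits everything else into characters.

```python
import logging
from typing import Callable, List

MAX_SUBTOKENS = 6  # same limit as the literal 6 in the diff


def encode_with_lowercase_fallback(
    word: str,
    encode: Callable[[str], List[str]],
    max_subtokens: int = MAX_SUBTOKENS,
) -> List[str]:
    """Encode a word; if it yields too many subtokens, retry with its lowercased form."""
    stripped = word.strip()
    subtokens = encode(stripped)
    if len(subtokens) > max_subtokens:
        # Retry on the lowercased form, which may match longer vocabulary entries.
        subtokens = encode(stripped.lower())
    if len(subtokens) > max_subtokens:
        # Still too long: warn and keep only the first max_subtokens pieces.
        logging.warning(
            "Token %s was truncated to %d subtokens: %s", word, max_subtokens, subtokens
        )
        subtokens = subtokens[:max_subtokens]
    return subtokens


def toy_encode(text: str) -> List[str]:
    """Hypothetical tokenizer: known lowercase words stay whole, others split per character."""
    vocab = {"warszawa", "nlp"}
    return [text] if text in vocab else list(text)


print(encode_with_lowercase_fallback("WARSZAWA", toy_encode))    # ['warszawa']
print(encode_with_lowercase_fallback("Qwertyuiop", toy_encode))  # ['q', 'w', 'e', 'r', 't', 'y']
```

One behavior worth noting, which follows from the order of the two checks in the diff: when the lowercased form is still too long, the truncated subtokens come from the lowercased encoding, not from the original casing.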