Syntactic Tools / combo — commit 25ccba60

Commit 25ccba60, authored 1 year ago by Martyna Wiącek
Parent: 7a95b22c

    Fixed multiword prediction + bug that made the code write empty predictions
Part of 1 merge request: !47 — Fixed multiword prediction + bug that made the code write empty predictions
Changes: 1 changed file — combo/data/tokenizers/lambo_tokenizer.py (+26 additions, −4 deletions)
```diff
@@ -28,7 +28,7 @@ def _sentence_tokens(token: Token,
                      split_subwords: Optional[bool] = None) -> List[Token]:
     if split_subwords and len(token.subwords) > 0:
         subword_idxs = [next(_token_idx()) for _ in range(len(token.subwords))]
-        multiword = (token.text, (subword_idxs[0], subword_idxs[1]))
+        multiword = (token.text, (subword_idxs[0], subword_idxs[-1]))
         tokens = [Token(idx=s_idx, text=subword, multiword=multiword)
                   for (s_idx, subword) in zip(subword_idxs, token.subwords)]
         return tokens
```
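The one-line change in this hunk matters whenever a multiword token splits into more than two subwords: indexing with `[1]` ends the multiword range at the second subword instead of the last one. A minimal standalone illustration (not the project's code; function names here are hypothetical):

```python
def multiword_range(subword_idxs):
    # Fixed behaviour: the range spans first..LAST subword index.
    return (subword_idxs[0], subword_idxs[-1])

def buggy_multiword_range(subword_idxs):
    # Pre-fix behaviour: the range always ended at the SECOND index.
    return (subword_idxs[0], subword_idxs[1])

idxs = [4, 5, 6]  # a token split into three subwords
print(multiword_range(idxs))        # (4, 6) -- covers all subwords
print(buggy_multiword_range(idxs))  # (4, 5) -- drops the last subword
```

For two-subword tokens the two versions coincide, which is why the bug only surfaced on longer multiword tokens.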
```diff
@@ -74,12 +74,14 @@ class LamboTokenizer(Tokenizer):
             for turn in document.turns:
                 sentence_tokens = []
                 for sentence in turn.sentences:
+                    _reset_idx()
                     for token in sentence.tokens:
                         sentence_tokens.extend(_sentence_tokens(token, split_subwords))
                 tokens.append(sentence_tokens)
         elif split_level.upper() == "SENTENCE":
             for turn in document.turns:
                 for sentence in turn.sentences:
+                    _reset_idx()
                     sentence_tokens = []
                     for token in sentence.tokens:
                         sentence_tokens.extend(_sentence_tokens(token, split_subwords))
```
```diff
@@ -87,6 +89,7 @@ class LamboTokenizer(Tokenizer):
         else:
             for turn in document.turns:
                 for sentence in turn.sentences:
+                    _reset_idx()
                     for token in sentence.tokens:
                         tokens.extend(_sentence_tokens(token, split_subwords))
             tokens = [tokens]
```
```diff
@@ -116,13 +119,32 @@ class LamboTokenizer(Tokenizer):
             if turns:
                 sentence_tokens = []
             for sentence in turn.sentences:
+                _reset_idx()
                 if not turns:
                     sentence_tokens = []
                 for token in sentence.tokens:
-                    if len(token.subwords) > 0 and split_subwords:
-                        sentence_tokens.extend([s for s in token.subwords])
-                    else:
-                        sentence_tokens.append(token.text)
+                    # @TODO This is a dirty fix for a LAMBO model shortcoming:
+                    # for longer words with multiwords it tends to drop the last
+                    # letter of the last multiword. As a quick workaround, check
+                    # that token.subwords are consistent with token.text and
+                    # repair them if not.
+                    if "".join(token.subwords) != token.text:
+                        fixed_subwords = []
+                        text_it = 0
+                        for i, subword in enumerate(token.subwords):
+                            if token.text[text_it:text_it + len(subword)] == subword:
+                                if i == len(token.subwords) - 1 and (text_it + len(subword) < len(token.text)):
+                                    subword = token.text[text_it:]
+                                fixed_subwords.append(subword)
+                                text_it += len(subword)
+                            else:
+                                fixed_subwords.append(token.text[text_it:text_it + len(subword)])
+                                text_it += len(subword)
+                        token.subwords = fixed_subwords
+                    sentence_tokens.extend(_sentence_tokens(token, split_subwords))
             if not turns:
                 sentences.append(sentence_tokens)
             if turns:
```
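The repair loop in the hunk above can be pulled out into a standalone helper for illustration. This is a sketch, not the project's API (`fix_subwords` is a hypothetical name); it follows the same logic: trust subwords that match the corresponding slice of the surface text, extend the final subword with any leftover characters, and replace mismatched subwords with the same-length slice of the text.

```python
def fix_subwords(text: str, subwords: list[str]) -> list[str]:
    """Repair subwords so that they jointly reconstruct `text`
    (LAMBO sometimes drops the last letter of the final subword)."""
    if "".join(subwords) == text:
        return list(subwords)  # already consistent, nothing to fix
    fixed, text_it = [], 0
    for i, subword in enumerate(subwords):
        if text[text_it:text_it + len(subword)] == subword:
            # Last subword: absorb any remaining characters of `text`.
            if i == len(subwords) - 1 and text_it + len(subword) < len(text):
                subword = text[text_it:]
        else:
            # Mismatch: take the same-length slice of `text` instead.
            subword = text[text_it:text_it + len(subword)]
        fixed.append(subword)
        text_it += len(subword)
    return fixed

# A truncated final subword gets the missing tail restored:
print(fix_subwords("zrobiłam", ["zrobił", "a"]))  # ['zrobił', 'am']
```

Note the helper guarantees `"".join(result) == text` only when the subword lengths cover the text; the commit comments themselves flag this as a quick workaround rather than a principled fix.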