Syntactic Tools / combo · Commits · ac6cab41

Commit ac6cab41, authored 2 years ago by piotrmp

    LAMBO segmentation prototype.

Parent: 045232b5
Pipeline #6081 failed with stage in 5 minutes and 19 seconds
No related merge requests found

Showing 2 changed files with 28 additions and 5 deletions:
- combo/predict.py (+5, -1)
- combo/utils/lambo.py (+23, -4)
combo/predict.py (+5, -1)

@@ -59,7 +59,11 @@ class COMBO(predictor.Predictor):
     def predict(self, sentence: Union[str, List[str], List[List[str]], List[data.Sentence]]):
         if isinstance(sentence, str):
-            return self.predict_json({"sentence": sentence})
+            if isinstance(self._tokenizer, lambo.LamboTokenizer):
+                segmented = self._tokenizer.segment(sentence)
+                return self.predict(segmented)
+            else:
+                return self.predict_json({"sentence": sentence})
         elif isinstance(sentence, list):
             if len(sentence) == 0:
                 return []
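The change makes predict() dispatch on its input: a raw string is first segmented into per-sentence token lists by the LAMBO tokenizer and then recursively fed back through predict() as a list. A minimal self-contained sketch of that dispatch, with hypothetical stand-in classes (FakeSegmenter, Predictor and their methods are mocks for illustration, not the real combo or lambo API):

```python
from typing import List, Union


class FakeSegmenter:
    """Stand-in for LamboTokenizer: splits text into per-sentence token lists."""

    def segment(self, text: str) -> List[List[str]]:
        return [s.split() for s in text.split(". ") if s]


class Predictor:
    """Stand-in for COMBO: shows the string-vs-list dispatch from the diff."""

    def __init__(self, tokenizer=None):
        self._tokenizer = tokenizer

    def predict(self, sentence: Union[str, list]):
        if isinstance(sentence, str):
            if isinstance(self._tokenizer, FakeSegmenter):
                # Segment first, then recurse on the resulting list of sentences.
                return self.predict(self._tokenizer.segment(sentence))
            return self.predict_json({"sentence": sentence})
        elif isinstance(sentence, list):
            if len(sentence) == 0:
                return []
            # Placeholder for the real per-sentence prediction path.
            return [{"tokens": s} for s in sentence]

    def predict_json(self, inputs):
        # Placeholder for the original JSON prediction path.
        return {"raw": inputs["sentence"]}


p = Predictor(tokenizer=FakeSegmenter())
print(p.predict("Ala ma kota. Kot ma Ale"))
# Without a segmenting tokenizer, strings fall through to predict_json:
print(Predictor().predict("Ala ma kota"))
```

Note that the recursion terminates because the segmented input is a list, which takes the `elif` branch on the second call.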
combo/utils/lambo.py (+23, -4)

@@ -2,12 +2,31 @@ from typing import List
 from allennlp.data.tokenizers.tokenizer import Tokenizer
 from allennlp.data.tokenizers.token_class import Token
+from lambo.segmenter.lambo import Lambo


 class LamboTokenizer(Tokenizer):
-    def __init__(self, language: str = "??",) -> None:
-        self.language = language
+    def __init__(self, model: str = "en",) -> None:
+        self.lambo = Lambo.get(model)

+    # Simple tokenisation: ignoring sentence split
     def tokenize(self, text: str) -> List[Token]:
-        #TODO
-        return None
\ No newline at end of file
+        result = []
+        document = self.lambo.segment(text)
+        for turn in document.turns:
+            for sentence in turn.sentences:
+                for token in sentence.tokens:
+                    result.append(Token(token.text))
+        return result
+
+    # Full segmentation: divide into sentences and tokens
+    def segment(self, text: str) -> List[List[Token]]:
+        result = []
+        document = self.lambo.segment(text)
+        for turn in document.turns:
+            for sentence in turn.sentences:
+                resultS = []
+                for token in sentence.tokens:
+                    resultS.append(Token(token.text))
+                result.append(resultS)
+        return result
\ No newline at end of file
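The two new methods traverse the same turn → sentence → token hierarchy that LAMBO's segmenter produces: tokenize() flattens it into a single token list, while segment() keeps one sub-list per sentence. The traversal can be sketched with plain stand-in objects (Tok, Sent, Turn and Doc below are hypothetical mocks of the segmenter's output, not LAMBO's real classes):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    text: str


@dataclass
class Sent:
    tokens: List[Tok]


@dataclass
class Turn:
    sentences: List[Sent]


@dataclass
class Doc:
    turns: List[Turn]


def tokenize(document: Doc) -> List[str]:
    # Flatten every turn and sentence into one token stream,
    # mirroring LamboTokenizer.tokenize().
    return [tok.text
            for turn in document.turns
            for sent in turn.sentences
            for tok in sent.tokens]


def segment(document: Doc) -> List[List[str]]:
    # Preserve sentence boundaries: one sub-list per sentence,
    # mirroring LamboTokenizer.segment().
    return [[tok.text for tok in sent.tokens]
            for turn in document.turns
            for sent in turn.sentences]


doc = Doc(turns=[Turn(sentences=[
    Sent(tokens=[Tok("Hello"), Tok("world")]),
    Sent(tokens=[Tok("Bye")]),
])])
print(tokenize(doc))  # flat token list
print(segment(doc))   # nested, one list per sentence
```

The nested-list shape returned by segment() is what predict() in combo/predict.py expects when it recurses on the segmented input.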