Commit fe5841af authored by Grzegorz Kostkowski

Merge branch 'develop' into 'master'

Refactor and improve code, use corpus2 1.9.0

See merge request !5
parents ab9f69f7 0d7b724d
# Unreleased
### 0.9.0
## Added
- Added the possibility to store the keys of entities added by elinker as a list
in a token property. This is optional and can be configured using the
`entities_list_prop_name` config option. The config file has been changed to use this option.
- Other config keys:
- `extended_search`
- `ann_only_first_occ`
- Using a new config manager that handles various config sources, with support
for LPMN task config (extracted),
- Better logging module (extracted),
- Reimplemented and extended the functionality of the document filter to apply
filters selectively: the current implementation allows specifying the type of
token (annotation) to which a given filter should be applied.
## Changed
- Using the new (1.9.0) version of corpus2 with hashable tokens, reverting the
recent changes that used native Python dictionaries and sets of tokens, as it
is the fastest option,
- Refactor of many modules (elinker, document_context, config, ...),
- Reimplemented elinker_config to cooperate with the new config manager; elinker_plugin
is now much more flexible,
- Changed the format style of the `url_key_format` key in `config.ini`,
- Replaced dependencies: using `cclutils` instead of `corpus-ccl`,
- Reimplemented `_get_chans` to improve performance,
- Configurable annotating of multiple occurrences of a token; created the writer config section,
- Config keys related to writing annotations are now placed in the `writer` section,
- Better optional filtering by source; reformatted and updated the config,
- Crosswiki linking no longer uses wsd tokens.
## Deprecated
## Removed
- Dependency on `corpus-ccl`,
## Fixed
- SPARQL query for syn_id,
- Broken configuration,
## Security
# 0.8.1
## Changed
@@ -3,7 +3,7 @@ FROM clarinpl/python:3.6
WORKDIR /home/elinker
RUN apt-get update && apt-get install -y \
    corpus2-python3.6>=1.9.0
ADD ./ /home/elinker/
@@ -11,4 +11,3 @@ RUN pip install -r requirements.txt
RUN python setup.py install
CMD ["elinker", "-d", "./example/article.wosedon.xml", "-o", "./example/article.elinker.xml", "-c", "./entity_linker/config/config.ini"]
# Description
This tool annotates a document with URLs of corresponding entities.
Linker parses annotated tokens/groups of tokens to spot expressions which
@@ -18,19 +18,150 @@ of whole expression.
As a result, *.ccl document is returned with included URL annotations.
# Knowledge bases
Currently, two types of knowledge bases are supported:
1. SPARQL database (designed for AllegroGraph, but any other triple store should also work); recommended,
1. Neo4j (legacy).
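
For the SPARQL backend, any endpoint that speaks the SPARQL protocol can be
queried. A minimal sketch of querying such an endpoint from Python (using the
`SPARQLWrapper` library; the query shape is illustrative, not the exact query
elinker issues):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative query only: find entities carrying a given Polish label.
# elinker's own SPARQL queries (e.g. for syn_id) live inside the tool.
sparql = SPARQLWrapper('https://query.wikidata.org/bigdata/namespace/wdq/sparql')
sparql.setQuery('''
    SELECT ?entity WHERE {
        ?entity rdfs:label "Warszawa"@pl .
    } LIMIT 5
''')
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()['results']['bindings']:
    print(row['entity']['value'])
```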
# Dependencies
All dependencies included in [requirements.txt](requirements.txt).
# Usage
## CLI
### CLI interface
```bash
elinker -d path/to/ccl/doc.xml -o path/for/results/output_doc.xml
```
```
usage: elinker [-h] [-d DOC] [-l LANGS] [-o OUTPUT] [-t TAGSET]
               [-c CONFIG_FILE] [--db-type DB_TYPE]
               [--db-endpoint DB_ENDPOINT] [--db-user DB_USER]
               [--db-password DB_PASSWORD] [--crosswiki_disambiguation]

Tool for annotating document with set of corresponding URIs from knowledge
base.

optional arguments:
  -h, --help            show this help message and exit
  -d DOC, --doc DOC     Path to *.ccl document to process.
  -l LANGS, --langs LANGS
                        Language(s) of given document. If not given then
                        default languages will be used (specified in config
                        file).
  -o OUTPUT, --output OUTPUT
                        Path to output file with annotated entities.
  -t TAGSET, --tagset TAGSET
                        Tagset name.
  -c CONFIG_FILE, --config CONFIG_FILE
                        Path to config *.ini file. If omitted then default
                        configuration will be used.
  --db-type DB_TYPE     Type of database (neo4j | allegrograph)
  --db-endpoint DB_ENDPOINT
                        database endpoint
  --db-user DB_USER     database user
  --db-password DB_PASSWORD
                        database password
  --crosswiki_disambiguation
                        Determines if crosswiki disambiguation should be
                        performed
```
For help and detailed description type: `elinker --help`.
### Examples
1. Link entities in a document using the default config
```bash
elinker \
-d example/example2.wosedon.mwe.ne.xml \
-o example2.elinker.xml
```
1. Link entities in an English document (_example4en.wosedon.xml_)
```bash
elinker \
-d example/example4en.wosedon.xml \
-t spacy \
-l en \
-o example4en.elinker.xml \
-c /path/to/config/config.ini
```
1. Link entities, with db access specified in CLI command
```bash
elinker \
-d example/article.wosedon.xml \
-o article-ag.elinker.xml \
--db-type allegrograph \
--db-endpoint 'http://some-sparql-endpoint' \
--db-user 'user' \
--db-password 'password'
```
1. Link entities, using publicly available LOD endpoints:
```bash
elinker \
-d example/article.wosedon.xml \
-o article-ag.elinker.xml \
--db-type allegrograph \
--db-endpoint 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'
```
## In Python code
Use [elinker_plugin](./entity_linker/elinker_plugin.py).
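
A hypothetical call sketch (the entry point name and signature below are
assumptions; check `elinker_plugin` for the actual API):

```python
# Hypothetical usage; the real function in entity_linker/elinker_plugin.py
# may have a different name and signature.
from entity_linker import elinker_plugin

elinker_plugin.process(
    doc='example/article.wosedon.xml',      # input CCL document
    output='article.elinker.xml',           # where to write the result
    config_file='entity_linker/config/config.ini',
)
```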
## As an LPMN service
See [elinker service repo](https://gitlab.clarin-pl.eu/nlpworkers/elinker).
# Example
[example](./example) directory contains sample files to use for linking.
[example/out](./example/out) contains generated CCL files with linked entities.
For more info check [this page](./example/README.md).
# Configuration
## Config sources
The tool reads configuration from various sources. If a config key is defined
in more than one source, the value from the source with the highest priority
is used.
Precedence of config sources (most important at the bottom):
1. default values from config module (`ConfigDefEntry` instances),
1. values specified in config _INI_ file,
1. values specified in passed kwargs (in case of calling main function from
other external Python module),
1. values passed with CLI command.
Additionally, a precedence among config files themselves is distinguished:
1. default config file (config.ini in package),
1. config file passed in kwargs,
1. config file passed with CLI command.
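
Conceptually, the resolution behaves like layered dictionary merging, where
later layers override earlier ones (a sketch, not the tool's actual
implementation):

```python
# Sketch of the precedence rule: later (higher-priority) layers win.
defaults = {'tagset': 'nkjp', 'langs': 'pl'}   # ConfigDefEntry defaults
ini_values = {'langs': 'pl,en'}                # from the *.ini file
kwargs_values = {}                             # from a calling Python module
cli_values = {'tagset': 'spacy'}               # from the CLI command

resolved = {}
for layer in (defaults, ini_values, kwargs_values, cli_values):
    resolved.update(layer)

print(resolved)  # {'tagset': 'spacy', 'langs': 'pl,en'}
```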
## Available config options
| option | type | default value | available in LPMN task? | description |
| - | - | - | - | - |
| ann_only_first_occ | bool | false | &check; | if true, then when a certain token occurs multiple times, only the first occurrence will contain generated annotations |
| crosswiki_disambiguation_types | list | [_ne_, _mwe_, _cw_] | &check; | types of annotations for which crosswiki disambiguation will be applied |
| crosswiki_disambiguation | bool | true | &check; | if true, then crosswiki index will be used to choose correct entity from KB |
| crosswiki_file | text | | &cross; | name of TSV file with crosswiki index |
| db_endpoint | text | undefined | &cross; | database endpoint |
| db_password | text | undefined | &cross; | database password |
| db_type | text | _allegrograph_ | &cross; | type of database to use; supported types: `neo4j`, `allegrograph` |
| db_user | text | undefined | &cross; | name of database user |
| enable_filters | bool | true | &check; | if true, then token and lemma filters will be applied |
| entities_list_prop_name | text | _entities_ | &check; | if specified, then a property with this name will be added to every token; the property will contain the list of keys of all entities added by elinker |
| entities_list_prop_sep | text | | &check; | separator for `entities_list_prop_name` list |
| exclude_ignored | text | | &cross; | |
| extended_search | bool | true | &check; | if true, then extra equivalent relations will be matched in the database |
| ignored_pos | list | | &check; | list of PoS of tokens to ignore |
| ignore_shorter_than | text | | &check; | minimal length of token lemma; shorter will be ignored |
| kw_ignored_ann | text | | &cross; | |
| langs | list | [_pl_] | &check; | languages corresponding to text literals in KB to use during linking |
| log_file | text | | &cross; | if specified, then tool execution will be logged to specified file |
| logging_level | text | | &cross; | logging level |
| mark_without_ann | bool | false | &check; | if false, then only tokens belonging to a (wsd / mwe / ne) annotation will be linked |
| mwe_base_prop_key | text | | &check; | name of property key storing base form of multiword expression |
| mwe_chan_name | text | | &check; | name of annotation channel storing multiword expressions |
| named_entity_chan_names | text | | &check; | list of NER annotations to recognize in document |
| permitted_sources | text | all | &check; | list of names of known LOD sources in KB to use; if not specified, then all sources will be accepted. |
| sort_entities | bool | true | &check; | if true, then list of uris in token will be sorted alphabetically |
| stop_list | text | | &cross; | name of txt file with list of stop words; stop word will be compared with token base lemma |
| synset_prop_key | text | | &check; | name of property key storing synset id (disambiguation info) |
| tagset | text | _nkjp_ | &check; | name of a tagset |
| url_key_format | text | | &check; | format of token property key storing URI of linked entity |
| use_wsd_synsets | bool | true | &check; | if true, then wsd synset's id will be used to match entity in KB |
| use_wsd_tokens | bool | true | &check; | if true then include disambiguated (wsd) tokens in entity linking process |
| without_ann_only_mono | bool | false | &cross; | deprecated; if true and `mark_without_ann` is true, then only monosemic labels in the knowledge base will be used |
Note: for list values, use a comma or a newline as the separator. For
convenience, it is also possible to specify a single value anywhere a list is
expected (instead of a one-item list).
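
A sketch of how such list values can be parsed (illustrative; the tool's own
parsing may differ in details):

```python
def parse_list_option(raw_value):
    """Accept comma- or newline-separated lists; a bare single value
    becomes a one-item list."""
    items = raw_value.replace('\n', ',').split(',')
    return [item.strip() for item in items if item.strip()]

assert parse_list_option('pl,en') == ['pl', 'en']
assert parse_list_option('nam\nnam_adj') == ['nam', 'nam_adj']
assert parse_list_option('pl') == ['pl']
```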
"""
Module defines CLI interface for elinker tool.
"""
TOOL_DESCRIPTION = (
"Tool for annotating document with set of corresponding URIs from knowledge base."
)
def define_elinker_cli_args(parser, cfg_keys) -> None:
    """
    Sets arguments for passed parser.
    """
    parser.description = TOOL_DESCRIPTION
    parser.add_argument(
        '-d',
        '--doc',
        dest=cfg_keys.O_DOC,
        action='store',
        help='Path to *.ccl document to process.')
    parser.add_argument(
        '-l',
        '--langs',
        dest=cfg_keys.O_DEFAULT_LANGS,
        action='store',
        help=(
            "Language(s) of given document. If not given then default languages "
            "will be used (specified in config file)."
        )
    )
    parser.add_argument(
        '-o',
        '--output',
        action='store',
        help='Path to output file with annotated entities.')
    parser.add_argument(
        '-t',
        '--tagset',
        dest=cfg_keys.O_TAGSET, action='store',
        help="Tagset name."
    )
    parser.add_argument(
        '-c',
        '--config',
        dest=cfg_keys.O_CONFIG_FILE,
        action='store',
        help='Path to config *.ini file. If omitted then default '
             'configuration will be used.')
    # database info
    parser.add_argument(
        '--db-type', dest=cfg_keys.O_DB_TYPE, action='store',
        help="Type of database (neo4j | allegrograph)"
    )
    parser.add_argument(
        '--db-endpoint', dest=cfg_keys.O_DB_ENDPOINT, action='store',
        help='database endpoint'
    )
    parser.add_argument(
        '--db-user', dest=cfg_keys.O_DB_USER, action='store', help='database user')
    parser.add_argument(
        '--db-password', dest=cfg_keys.O_DB_PASSWORD, action='store',
        help='database password'
    )
    parser.add_argument(
        '--crosswiki_disambiguation',
        dest=cfg_keys.O_CROSSWIKI_DISAMBIGUATION,
        action='store_true',
        help='Determines if crosswiki disambiguation should be performed')
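
# --- Usage sketch (illustrative, not part of the module) ---
# A minimal stand-in exposing the O_* names used above; the real cfg_keys
# object is provided by the config manager and may differ.
class _DemoKeys:
    O_DOC = 'doc'
    O_DEFAULT_LANGS = 'default_langs'
    O_TAGSET = 'tagset'
    O_CONFIG_FILE = 'config_file'
    O_DB_TYPE = 'db_type'
    O_DB_ENDPOINT = 'db_endpoint'
    O_DB_USER = 'db_user'
    O_DB_PASSWORD = 'db_password'
    O_CROSSWIKI_DISAMBIGUATION = 'crosswiki_disambiguation'

if __name__ == '__main__':
    import argparse
    demo_parser = argparse.ArgumentParser(prog='elinker')
    define_elinker_cli_args(demo_parser, cfg_keys=_DemoKeys())
    print(demo_parser.parse_args(['-d', 'doc.xml', '-o', 'out.xml']))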
[general]
tagset = nkjp
[parser]
; one or more names of channels for named entities
; (separated by comma or newline)
; liner nam + liner2 top9 + spacy
named_entity_chan_names = nam
    nam_adj
    nam_eve
    nam_fac
    nam_liv
    nam_loc
    nam_num
    nam_org
    nam_oth
    nam_pr
    EVENT
    FAC
    FACILITY
    GPE
    LANGUAGE
    LAW
    LOC
    NORP
    ORG
    PERSON
    PRODUCT
    WORK_OF_ART
; liner2 nam channel
; named_entity_chan_names = nam
; liner2 top9 channels
; named_entity_chan_names = nam_adj
; nam_eve
; nam_fac
; nam_liv
; nam_loc
; nam_num
; nam_org
; nam_oth
; nam_pr
; spacy nam categories
; named_entity_chan_names = EVENT
; FAC
; FACILITY
; GPE
; LANGUAGE
; LAW
; LOC
; NORP
; ORG
; PERSON
; PRODUCT
; WORK_OF_ART
mwe_chan_name = mwe
mwe_base_prop_key = mwe_base
synset_prop_key = sense:ukb:syns_id
; name of file in data directory or absolute path to file
; stop_list = stop_list.txt
; stop_list = stop_list.txt
; stop_list = nkjp_stop_list.txt
; stop_list = nkjp_idf_dict_stop_list.txt
stop_list = nkjp_idf_dict_plwn_nouns_stop_list.txt
; comma-delimited list of parts of speech - tokens with these PoS will
; be ignored; PoS values must be valid for the tagset in use
ignored_pos = adj
    conj
    fin
    inf
    interp
    pact
    pcon
    ppas
    praet
    prep
    qub
    xxx
    ADJ
    ADV
    PUNCT
    VERB
[linker]
; Ignore lemmas shorter than n characters. Leave empty to disable
; this feature.
ignore_shorter_than = 5
; WSD annotations are most common token annotation. If disabled
; then only tokens/expressions with mwe/ne/wsd annotation
; will be marked (URL will be matched). Otherwise, for tokens
@@ -38,20 +92,25 @@ ignore_shorter_than=5
; Note: this option works with 'without_ann_only_mono' which
; determines if lemma matching should be restricted only to
; labels which are marked (in WSS) as monosemic.
mark_without_ann = false
; If enabled then tokens without annotation will be considered
; when matching URLs: token's lemma will be assigned to WSS labels
; but only to these which are monosemic
without_ann_only_mono = false
; flag to include/exclude disambiguated (wsd) tokens in entity linking
; process
use_wsd_tokens = true
; flag to enable searching by synsets for disambiguated (wsd) tokens.
; If disabled then searching is still performed, but with the lemmas of
; the tokens instead of their synsets
use_wsd_synsets = true
; if enabled, then extra equivalent relations will be matched in the database
extended_search = true
[filter]
; this flag determines whether token and lemma filters will be applied
enable_filters = true
; if enabled then words/phrases not appropriate for keyword
; (e.g. Warszawa 2012) won't be linked with URLs
exclude_ignored = true
@@ -61,36 +120,100 @@ kw_ignored_ann = kw_ignored
; list of permitted sources for the output list; it is used to filter the final
; list of generated concepts. Names of sources must be the same as in the used
; graph source. If empty, then all found concepts will be used.
permitted_sources =
; all sources, including sources with non-commercial licences (CC-BY-NC)
; permitted_sources = AGROVOC
; BNCF
; BNF
; DBPEDIA
; DDC
; DNB
; EUROVOC
; GEMET
; GEONAMES
; GEOWORDNET
; HRMO
; IATE
; LINKEDDATA
; LOC
; LCSH
; MESH
; NALT
; PlWN
; SCHEMA
; SUMO
; UAT
; UMBEL
; WIKIPEDIA
; WOLTER
; sources with open (CC-BY / PD / other) licences
; permitted_sources = AGROVOC
; BNCF
; BNF
; DNB
; EUROVOC
; GEMET
; GEONAMES
; GEOWORDNET
; HRMO
; IATE
; LINKEDDATA
; LOC
; LCSH
; MESH
; NALT
; PlWN
; SCHEMA
; SUMO
; UMBEL
[writer]
; Format of the key of property created in token for every linked entity.
; Following attributes are available for url_key_format:
; {src} - source name
; {idx} - index (ordinal number)
; url_key_format = {src}:url_{idx}
url_key_format = e{idx}
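; example (illustrative): with url_key_format = {src}:url_{idx} a token
; property key looks like "PlWN:url_0"; with the default above it is "e0"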
; if specified, then property with such name will be added to every token.
; The property will store list of keys of entities added by elinker.
entities_list_prop_name = entities
; separator used in list of entities keys (entities_list_prop_name);
; if not given, then defaults to ' '
entities_list_prop_sep = %(space)s
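; note (assumption): %(space)s relies on configparser interpolation, so an
; option named "space" holding the literal separator is expected to be
; defined, e.g. in the [DEFAULT] section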
; if enabled, then in case of many occurrences of a certain token, only the first one
; will contain generated annotations.
ann_only_first_occ = false
; if enabled, then list of uris in token will be sorted alphabetically; defaults to false
sort_entities = true
[graph-dump]
; has priority over database source
; graph_file = graph.graphml
[database]
; used only when graph-dump is not specified
db_endpoint =
db_user =
db_password =
; supported db types: neo4j, allegrograph
db_type = allegrograph
; if no language is specified for a given document, then these langs will
; be used for matching labels from the database.
; Note: labels in the given languages must be present in the database
; Use ',' as delimiter.
langs = pl,en
[crosswiki]
crosswiki_disambiguation = true
; name of crosswiki file located in data package
crosswiki_file = crosswiki_u_s-en-uenc.tsv
; set which tokens types to disambiguate with crosswiki
; ne - named entities
; mwe - multi word expressions
; cw - common words
crosswiki_disambiguation_types = ne,mwe,cw
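; example (illustrative): to run crosswiki disambiguation only for named
; entities, set: crosswiki_disambiguation_types = ne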
[logging]
; logging level; defaults to "WARNING"
logging_level = INFO
; path to file where logs will be saved; optional
; log_file = elinker.log
import os
import logging
import configparser
from configparser import NoOptionError
import pkg_resources
from corpus_ccl import cclutils as ccl
from entity_linker.context.document_context import DocumentContext
logging.basicConfig()
log = logging.getLogger(__name__)
log.setLevel(logging.DEBUG)
DEFAULT_CONFIG_FILE = 'config.ini'
# config keys
S_GENERAL = 'general'
O_TAGSET = 'tagset'
S_LINKER = 'linker'
O_KEY_FORMAT = 'url_key_format'
O_IGNORE_SHORTER_THAN = 'ignore_shorter_than'
O_USE_WSD = 'use_wsd_toks'
O_USE_WSD_SYN = 'use_wsd_synsets'
S_PARSER = 'parser'
O_NE_CHANS = 'named_entity_chan_names'
O_MWE_CHAN = 'mwe_chan_name'
O_MWE_BASE = 'mwe_base_prop_key'
O_SYNS_ID_KEY = 'synset_prop_key'
O_IGNORED_POS = 'ignored_pos'
O_STOP_LIST = 'stop_list_file'
O_MARK_WITHOUT_ANN = 'mark_without_ann'
O_WITHOUT_ANN_ONLY_MONO = 'without_ann_only_mono'
S_FILTER = 'filter'
O_EXCLUDE_IGNORED = 'exclude_ignored'
O_KW_IGNORED_ANN = 'kw_ignored_ann'
O_PERMITTED_SOURCES = 'permitted_sources'
S_DATABASE = 'database'
O_ENDPOINT = 'endpoint'
O_USER = 'user'
O_PASSWORD = 'password'
O_DEFAULT_LANGS = 'default_langs'
O_DB_TYPE = 'db_type'
S_GRAPH_DUMP = 'graph-dump'
O_GRAPH_FILE = 'graph_file'
S_CROSSWIKI = 'crosswiki'
O_CWIKI_FILE = 'dict_path'
O_CROSSWIKI_DISAMBIGUATION = 'crosswiki_disambiguation'
O_CROSSWIKI_DISAMBIGUATION_TYPES = 'crosswiki_disambiguation_types'
CONFIG_MODULE = 'entity_linker.config'
RESOURCE_MODULE = 'entity_linker.data'
DAO_DB_CONF_ENDPOINT = 'endpoint'
DAO_DB_CONF_USER = 'user'
DAO_DB_CONF_PASSWORD = 'password'
DAO_CONFIG_DB_PROPS = [
DAO_DB_CONF_ENDPOINT, DAO_DB_CONF_USER, DAO_DB_CONF_PASSWORD
]
DAO_FILE_CONF_GRAPH = 'graph_file'
DAO_CONFIG_FILE_PROPS = [DAO_FILE_CONF_GRAPH]
# constants for factory method
DAO_FILE = 'file'
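
# A loading sketch (illustrative; the module is truncated above and the real
# loader may differ): read the packaged default config.ini, then overlay a
# user-supplied file, relying on configparser's read-order precedence.
def load_config(user_config_path=None):
    default_path = pkg_resources.resource_filename(
        CONFIG_MODULE, DEFAULT_CONFIG_FILE)
    cfg = configparser.ConfigParser()
    cfg.read(default_path)
    if user_config_path and os.path.isfile(user_config_path):
        cfg.read(user_config_path)  # later reads override earlier values
    return cfg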