Commit ba3803e3 authored by Grzegorz Kostkowski's avatar Grzegorz Kostkowski

Merge branch 'current' into 'master'

Current

See merge request !6
parents 38093e32 836c5629
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don’t work, or not
# install all needed dependencies.
#Pipfile.lock
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# Unreleased
## Added
# 0.9
## Added
- possibility to run kwazon in batch mode (on a directory or a list of files) -
  added a method in KwazonPlugin
## Changed
- reduced RAM usage: the calculation of similarity between the document's
  vector and the node vectors is split into 8 batches
- changed the tool to handle mappings with many sources (thesauri): concepts
  from different sources can point to the same category, and such an occurrence
  (if it comes from the same token) should be counted only once; refactored
  doc_context.py
- the concept - category mapping can now be read from a zipped binary or a
  zipped plain text file
- all WSS sources are now included when reading URLs from a ccl document
- preferred static configuration: weighted graph (edge weights) and
  concept - category mapping for all WSS sources
# 0.8.2
## Changed
- improving performance (reducing execution time in gt_pprmc.py and loader.py)
# 0.8.1
## Changed
- fixed some problems
## Added
# 0.8
## Added
- a list of URLs mapped to BN (Biblioteka Narodowa) descriptors
- handling of special categories (BN descriptors) - e.g. including at least n
  best descriptors in the resulting ranking and marking them with the '$'
  character in the ranking
- tests for np_rank_utils
- IDF weighting for lemmas in documents
- handling of graphs with weights set for edges
## Changed
- including the number of distinct concepts per category when calculating the
  initial personalisation
## Removed
- weighting with tf-idf for special categories
# 0.7
## Added
- tests for ResultContainer
- handling of binary vectors included in a zip archive
## Fixed/changed
- the value of a single occurrence of a category now depends on the number of
  different categories related to the analysed concept - a fairer approach
- fixed version of the vectors for categories
- fixed version of the concept - categories mapping - with reduced relationships
  leading to common categories
## Removed
- Remove plwn mappings to dbpedia categories
# 0.6
## Added
- parameters can now be changed between tasks - added the possibility to reload
  the config from the config file
- added the possibility to apply weights to the resulting ranking. Possible only
  when filtering with MTC (main topic classification) categories is enabled.
## Fixed
- faster mechanism of generating results (sorting etc.)
## Removed
# KeyWord-Assignment-Tool (kwazon)
# INTRODUCTION
The KeyWord-Assignment-Tool (kwat) assigns descriptive keywords to a document
based on a graph of linked resources.
In the current version the graph contains only DBpedia categories.
# DEPENDENCIES
The current dependencies are listed in [requirements.txt](requirements.txt).
# INSTALLATION
Note!
Large resources:
* the graph
* the concept-to-category mapping
* concept vectors
* category vectors
will not be installed as part of the package. Instead, they should be copied
outside of the package installation process, and the paths to them in config.ini
should be updated accordingly. To download the resources after installation, use
the script ```tools/install_resources.sh```.
Detailed information on installation and running is available in the
[wiki](https://gitlab.clarin-pl.eu/team-semantics/kwazon/wikis/Instalacja-i-uruchomienie)
# USAGE
The tool handles documents in the *.ccl format.
Note! In the current version the tool does not match URL references to concepts
in the text, so the input document should already be annotated with such URL
references to concepts from the web of linked resources (WSS) and with
'keyword first instance' annotations.
At minimum, the program requires the path to the input file and the path where
the resulting keyword file will be written:

    kwazon path/to/doc/doc.xml out/path/out.txt

For the full list of parameters and detailed usage instructions, run:
```kwazon --help```
When processing many documents, the tool can be run from an ipython session,
after first loading the required components into memory (handling of many
documents ("batch" mode) from the command line will be added in the future).
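For example, a possible batch loop from an ipython session might look like the
sketch below. It reuses the KwazonPlugin calls shown in the examples that
follow; the glob pattern, the '*.xml' extension and the 'extra_info' tag are
assumptions rather than a documented interface, and the elided preparation
steps are the same as in those examples.
```python
>>> import glob
>>> from keyword_assignment_tool import kwazon_plugin
>>> kw_plugin = kwazon_plugin.KwazonPlugin()
>>> ...  # preparation steps, elided as in the examples below
>>> for doc_path in glob.glob('path/to/docs/*.xml'):
...     kw_plugin.run_kwazon_prepared_debug(
...         *kwazon_plugin._make_out_paths(doc_path, 'extra_info'))
```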
Detailed information on installation and running is available in the
[wiki](https://gitlab.clarin-pl.eu/team-semantics/kwazon/wikis/Instalacja-i-uruchomienie)
## Usage in an ipython session (old method):
```python
>>> from loaders import loader
>>> import keyword_assignment as ka
>>> graph_path = 'path_to_graph.graphml'
>>> output = 'output_file.txt'
>>> concepts_categories_path = 'path_to_concepts_categories_file.txt'
>>> ccl_doc = 'path_to_ccl_doc.xml'  # input *.ccl document
>>> graph, concept_nodes_map = loader\
...     .load_all(graph_path, concepts_categories_path)
>>> ka.run(graph, loader.load_doc(ccl_doc), output,\
...     concept_nodes_map=concept_nodes_map)
```
## Usage in an ipython session (new method - using ipython caching):
```python
# store (persist) the loaded components in the ipython cache
>>> from keyword_assignment_tool import kwazon_plugin
>>> kw_plugin = kwazon_plugin.KwazonPlugin()
>>> ...
>>> doc_path = ...
>>> kw_plugin.run_kwazon_prepared_debug(*kwazon_plugin._make_out_paths(doc_path, 'extra_info'))
>>> ...
>>> kwazon_plugin.store_in_cache(kw_plugin)
# restore from the cache
>>> from keyword_assignment_tool import kwazon_plugin
>>> kwazon_plugin.restore_from_cache()
>>> cached_vars = {'graph': graph, 'concept_cat_map': concept_cat_map, 'v_id_lod_url_map': v_id_lod_url_map, 'lod_url_v_id_map': lod_url_v_id_map, 'cats_keyed_vectors': cats_keyed_vectors, 'v_id_vector_map': v_id_vector_map}
>>> kw_plugin = kwazon_plugin.initialize_plugin_from_cache(cached_vars)
```
STORE:
```python
graph = kw_plugin._graph
concept_cat_map = kw_plugin._concept_cat_map
concept_nodes_map = kw_plugin._concept_nodes_map
v_id_lod_url_map = kw_plugin._v_id_lod_url_map
lod_url_v_id_map = kw_plugin._lod_url_v_id_map
cats_keyed_vectors = kw_plugin._cats_keyed_vectors
v_id_vector_map = kw_plugin._v_id_vector_map
%store graph
%store concept_cat_map
%store concept_nodes_map
%store v_id_lod_url_map
%store lod_url_v_id_map
%store cats_keyed_vectors
%store v_id_vector_map
```
RESTORE:
```python
%store -r
cached_vars = {'graph': graph, 'concept_cat_map': concept_cat_map, 'concept_nodes_map': concept_nodes_map, 'v_id_lod_url_map': v_id_lod_url_map, 'lod_url_v_id_map': lod_url_v_id_map, 'cats_keyed_vectors': cats_keyed_vectors, 'v_id_vector_map': v_id_vector_map}
```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division
import sys
import time
import operator
import numpy as np
from collections import defaultdict
from graph_tool.centrality import pprmc
from wosedon.ranking.wsd_ranking import WSDRanking
from wosedon.algorithms.wsdalgorithminterface import WSDAlgorithmInterface
import logging
from keyword_assignment_tool.utils import np_rank_utils as npu
_log = logging.getLogger(__name__)
class GTPPRMC(WSDAlgorithmInterface):
DEFAULT_INIT_PERS = 1.0
def __init__(self, str_name='PPR-MonteCarlo', details_writer=None):
        '''
        @param details_writer - optional; file writer for additional debug info.'''
super(GTPPRMC, self).__init__(str_name)
self._context_node_dict = None
self._details_writer = details_writer
def prepare_v(self, context, graph, init_pers=None):
        '''
        Prepares the personalisation (pers_prop) property map.
        @param context - list of nodes, or dict where a node is a key and the
            value is a float representing an individual personalisation value.
            If individual values are given (dict), they are used directly and
            init_pers is ignored.
        @param graph - BaseGraph instance
        @param init_pers - float value, initial personalisation value used when
            individual personalisation is not given. If the parameter is not
            given then 'DEFAULT_INIT_PERS' will be used.
        @return graph_tool PropertyMap with personalisation values.
        '''
# _log.debug("Initialize nodes:")
p = graph.use_graph_tool().new_vertex_property('double', val=0.0)
if isinstance(context, list): # individual pers not given
set_init_pers = init_pers if init_pers else self.DEFAULT_INIT_PERS
context = {n: set_init_pers for n in context}
context_list = None
if self._details_writer:
self._details_writer.write("INIT PERSONALISATION VALUES (for {} cats):\n"\
.format(len(context)))
context_list = sorted(context.items(), key=operator.itemgetter(1))
else:
context_list = context.iteritems()
for node, pers in context_list:
p[node.use_graph_tool()] = pers
if self._details_writer:
self._details_writer.write("{}\t{}\n".format(
graph.use_graph_tool().vp.lod_url[node.use_graph_tool()],
pers))
return p
def prepare_weights_map(self,
vertex_id_weight_map,
graph,
min_weight=0.5,
v_id_v_map=None):
w = graph.new_vertex_property('double', val=0.0)
# w = graph.use_graph_tool().new_vertex_property('double', val=min_weight)
v_id_weight_arr = npu.dict_as_np_arr(vertex_id_weight_map)
sorted_v_weight_arr = npu.sort_dict_like_arr(
v_id_weight_arr, column_idx=0, descending=False)
# check if weight arr contains same nodes as in graph
if np.array_equal(sorted_v_weight_arr[:, 0], graph.get_vertices()):
# assign entire array as property map
_log.info("setting weights array as graph's property map...")
w.a = sorted_v_weight_arr[:, 1]
else:
# manually assign weights
_log.info("setting weights node-by-node...")
for node, weight in vertex_id_weight_map.iteritems():
if isinstance(node, int):
node = graph.vertex(node)
w[node] = weight
return w
def run(self, context, graph, options, resources):
'''!
Run pprmc algorithm.
@param context - dict with string as key and list as value
@param graph - object of BaseGraph class
@param options - object of AlgorithmOptions class
@param resources - object of Resources class
@return dict with string as key and float as value
'''
# logger.info("start")
# concept_node_map = context
# v_id_lod_url_map = None
g = graph.use_graph_tool()
v_id_weight_map = None
edge_weights_prop = None
use_only_cats_list = None
if resources:
if 'edge_weights_prop' in resources:
edge_weights_prop = resources['edge_weights_prop']
else:
edge_weights_prop = g.new_edge_property('float', 1.0)
if 'vertex_id_weight_map' in resources:
v_id_weight_map = resources['vertex_id_weight_map']
# if 'use_only_cats' in resources:
# use_only_cats_list = resources['use_only_cats']
# debug print
# print("CATEGORIES for read CONCEPTS:")
# for c, nodes in concept_node_map.iteritems():
# for n in nodes:
# _log.debug("{} : {}".format(c,
# g.vp.lod_url[n.use_graph_tool()]))
# nodes_context = [n for nodes_list in concept_node_map.values() \
# for n in nodes_list]
# nodes_context = context.nodes
if not context.nodes or len(context.nodes) < 1:
return {} # or raise Exception?
# nodes_context = {n for nodes_set in \
# [n for c, n in concept_node_map.iteritems()] for n in nodes_set}
# nodes_freq_context_map = self._as_frequency_init_pers(nodes_context)
# nodes_freq_context_map = self._as_frequency_init_pers2(
# context, min_occurences=3)
nodes_freq_context_map = self._as_frequency_init_pers2(context, g)
pers_prop = self.prepare_v(nodes_freq_context_map, graph,
options.init_pers())
weights_prop = self.prepare_weights_map(\
v_id_weight_map, g) if v_id_weight_map else None
# pers_prop = self.prepare_v(nodes_context, graph, options.init_pers())
# logger.debug("RANKING:\n{}".format(url_rank_map.__repr__()))
_log.info("start pprmc algorithm...")
ranking = pprmc(
g,
pers=pers_prop,
epsilon=options.damping_factor(),
rw_count=options.max_iter(),
weight=weights_prop,
edge_weight=edge_weights_prop)
_log.info("pprmc algorithm finished.")
return ranking
def _as_frequency_init_pers(self, node_context):
        '''Counts the occurrences of every node and uses them to calculate the
        initial personalisation for every node.
        @param node_context - list of nodes used as initial nodes.
            Should contain all occurrences of a given node (not a unique list).
        @return dict where a node is a key and the value is a float representing
            an individual personalisation value. '''
node_freq_map = defaultdict(int)
if node_context and len(node_context) > 0:
for node in node_context:
node_freq_map[node] += 1
all_nodes_no = len(node_freq_map)
node_freq_map_keys = node_freq_map.keys()
for node in node_freq_map_keys:
node_freq_map[node] = 1.0 * node_freq_map[node] / all_nodes_no
return node_freq_map
def _as_frequency_init_pers2(self, context, g, min_occurences=None):
        '''Counts the occurrences of every node and uses them to calculate the
        initial personalisation for every node.
        @param context - document context providing 'nodes' and
            'node_cum_counter_map' (cumulative occurrence counts per vertex id).
        @param g - graph_tool Graph instance.
        @param min_occurences - optional; nodes with fewer occurrences are
            skipped.
        @return dict where a node is a key and the value is a float representing
            an individual personalisation value. '''
node_freq_map = defaultdict(float)
if context and len(context.node_cum_counter_map) > 0:
# scores_sum = len(context.node_cum_counter_map)
# print('len context.node_cum_counter_map',
# len(context.node_cum_counter_map))
scores_sum = 0.0
for node in context.nodes:
# occurences = context.node_cum_counter_map[node]
occurences = context.node_cum_counter_map[g.vertex_index[node]]
if min_occurences and occurences < min_occurences:
continue
score = 1.0 * occurences
scores_sum += score
node_freq_map[node] = score
# print('scores_sum', scores_sum)
node_freq_map = {
k: v / scores_sum
for k, v in node_freq_map.iteritems()
}
# print(node_freq_map)
return node_freq_map
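# --- Illustrative sketch (not part of the original module) ---
# The normalisation performed by _as_frequency_init_pers2, reduced to plain
# dicts: per-node occurrence counts are divided by their sum so the resulting
# personalisation values sum to 1. This standalone helper is a simplified
# example for clarity, not the class's actual implementation.
def _frequency_personalisation_example(node_occurrences):
    '''node_occurrences: iterable of node ids, one entry per occurrence.
    Returns a dict mapping node id -> normalised personalisation value.'''
    counts = defaultdict(float)
    for node in node_occurrences:
        counts[node] += 1.0
    total = sum(counts.values())
    return {node: c / total for node, c in counts.items()}
# e.g. _frequency_personalisation_example([1, 1, 2, 3])
# -> {1: 0.5, 2: 0.25, 3: 0.25}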
[general]
; Note: all paths in this file should either be names of files in the 'data'
; directory or absolute paths.
; Keep in mind that only files included in the installation setup.py file will
; be available in the 'data' directory.
;graph_file_path = graph-2018-11-13-categories-broaders-narrowers-e-weight.graphml
graph_file_path = graph-2018-11-13-categories-broader-narrower-weighted-001-filtered-meaningless.graphml
; path to file with concept -> category mapping: It can be zipped,
; pickled dict (*.bin.zip), zipped plain text (*.txt.zip) or plain text (*.txt)
;concepts_categories_mapping = concept_category_index_only_dbpedia_concepts_reduced_v5_at_least_one.txt.zip
concepts_categories_mapping = concept_category_index_all_sources_extended_reduced_at_least_one_v7.bin.zip
; path to txt file with wikipedia2vec vectors
;categories_vectors = categories_vectors_v2.txt.zip
categories_vectors = cats_keyed_vectors_v2.bin.zip
;concepts_vectors = concepts_vectors.txt.zip
concepts_vectors = concepts_keyed_vectors.bin.zip
; path to file with list of categories which will be used to filter list
; of results
; use_only_categories = MTC_depth_2-BN-descr-extended.txt
; categories_weights = cat_weights_MTC2.txt
edge_weights_prop = 'rel_w'
; IDF-based weights (normalised) for lemmas (orths)
orth_idf_weights = idf-dict2_norm_sorted.tsv
[results]
as_json = true
; path to a file containing the mapping: category from graph -> label.
; These labels will be used in the results. If a label for a certain category is
; not given, the category will be represented using a label extracted from its URL
;cats_labels = categories_pl_labels_v3.txt
cats_labels = categories_pl_labels_lowercase_v4.txt
n_best_keywords = 10
; minimal value of the pprmc algorithm score for any keyword.
; Only keywords with a score above this value will be returned in the keyword list
; Note: only used when 'n_best_keywords' is not given.
score_min_threshold = 0.0003
; special categories (URLs) which should be included in / excluded from the
; resulting ranking. The way they are used is defined in 'spec_cats_strategy'.
; In the current version these spec-cats (URLs) represent BN descriptors
;special_cats_path = MTC_depth_3_filtered_with_BN_descr.txt
; defines how to use the special_cats_path categories when constructing the
; resulting ranking:
; - use the original ranking ('all'),
; - take only them ('only'),
; - exclude them from the ranking ('exclude'),
; - take the original ranking and only the first n of the given categories
;   (int: natural number) - REGARDLESS of the score value.
;spec_cats_strategy = 1
[algorithm]
iterations = 100
damping_factor = 0.1
init_personalisation = 1.0
[database]
; used only when concepts_categories_mapping is not specified
endpoint = http://10.108.41.103:8080/db/data/cypher
user = neo4j
password = neodb
import os
import ConfigParser
import pkg_resources
DEFAULT_CONFIG_FILE = 'config.ini'
# config keys
S_GENERAL = 'general'
O_RESOURCES_DIR = 'data_dir'
O_GRAPH_FILE = 'graph_file_path'
O_CONCEPTS_CAT_MAPPING = 'concepts_categories_mapping'
O_CATS_VECTORS = 'categories_vectors'
O_CONCEPTS_VECTORS = 'concepts_vectors'
O_USE_ONLY_CATS = 'use_only_categories'
O_CATS_WEIGHTS = 'categories_weights'
O_EDGE_WEIGHTS = 'edge_weights_prop'
O_ORTH_IDF_WEIGHTS = 'orth_idf_weights'
S_ALGORITHM = 'algorithm'
O_ITERATIONS = 'iterations'
O_DAMPING_FACTOR = 'damping_factor'
O_INIT_PERS = 'init_personalisation'
S_DATABASE = 'database'
O_ENDPOINT = 'endpoint'
O_USER = 'user'
O_PASSWORD = 'password'
S_RESULTS = 'results'
O_JSON_FORMAT = 'as_json'
O_N_BEST_KW = 'n_best_keywords'
O_MIN_SCORE = 'score_min_threshold'
O_SPEC_CATS_PATH = 'special_cats_path'
O_SPEC_CATS_STRATEGY = 'spec_cats_strategy'
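# --- Illustrative sketch (not part of the original module) ---
# A minimal example of how the section/option constants above might be read
# with ConfigParser; the helper name and its behaviour are assumptions, not
# this module's actual API.
def _read_option_example(section, option, config_path=DEFAULT_CONFIG_FILE):
    '''Return a single option value from the INI file, or None if missing.'''
    parser = ConfigParser.ConfigParser()
    parser.read(config_path)
    if parser.has_section(section) and parser.has_option(section, option):
        return parser.get(section, option)
    return None
# e.g. _read_option_example(S_GENERAL, O_GRAPH_FILE) would return the
# configured graph file name from the [general] section of config.ini.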