keyword_assignment_tool/config/config-set-10a/derived_resources.ini · 85965e2072ad19803ac293231b224ebab1ff2cbc · Team-Semantics / kwazon

Grzegorz Kostkowski authored May 24, 2020

The following improvements were added:
* building graph from random walks
* clustering:
    - spectral clustering
    - dbscan
    - similarity measures:
        - katz
        - betweenness
    - methods of estimating number of clusters:
        - based on layout
        - amos (it has quite demanding requirements for the graph)
* re-rank:
    - re-ranking either the original graph or the graph built from random walks
* minimal cat score ratio - to minimize the number of initial nodes
  for the random walk algorithm
* add optional penalty for all categories - based on position in the list of
  cumulative cat scores and similarity to the document (vector similarity) -
  together with the improvement above, this tries to remove misleading
  categories from the list of candidates for initial nodes in the random walk
* introduce excluded sources (for now, GEONAMES and GEOWORDNET) to make use of
  information about geographic places - it is not sufficient to ignore a URL
  from such a source: if any URL for a given token comes from an excluded
  source, then ALL URLs for this token should be ignored
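
The excluded-sources rule in the last item can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the code from this
commit: the function names, the token-to-URLs dict layout and the host-based
matching of sources are all hypothetical.

```python
from urllib.parse import urlparse


def is_from_excluded_source(url, excluded_hosts):
    """Return True if the URL's host equals, or is a subdomain of, an excluded host.

    Matching sources by host name is an assumption made for this sketch.
    """
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in excluded_hosts)


def filter_token_urls(token_urls, excluded_hosts):
    """Drop ALL URLs of a token if at least one of them comes from an excluded source.

    token_urls: dict mapping a token to its list of URLs (hypothetical layout).
    Tokens without any excluded URL keep their URLs unchanged.
    """
    filtered = {}
    for token, urls in token_urls.items():
        if any(is_from_excluded_source(u, excluded_hosts) for u in urls):
            # One excluded URL marks the whole token as e.g. a geographic place.
            filtered[token] = []
        else:
            filtered[token] = list(urls)
    return filtered
```

For example, a token that has both a DBpedia URL and a URL under
`geonames.org` would end up with an empty URL list, while a token with only
DBpedia URLs would be left untouched.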

* FIXME: temporary, only for dbpedia - it should be parametrized
* TODO: weights for clusters

15f75ec1