    Many improvements · 15f75ec1
    Grzegorz Kostkowski authored
    The following improvements were added:
    * building graph from random walks
    * clustering:
        - spectral clustering
        - dbscan
        - similarity measures:
            - katz
            - betweenness
        - methods of estimating number of clusters:
            - based on layout
            - amos (it has quite demanding requirements for the graph)
    * re-rank:
        - re-ranking of the original graph or of the graph built from random walks
    * minimal category (cat) score ratio - to minimize the number of initial
      nodes for the random walk algorithm
    * add optional penalty for all categories - based on position in the list of
      cumulative category (cat cum) scores and on similarity to the document
      (vector similarity) - together with the above improvement, this tries to
      remove misleading categories from the list of candidates for initial nodes
      in the random walk
    * introduce excluded sources (for now, GEONAMES and GEOWORDNET) to utilize
      information about geographic places - it is not sufficient to ignore only
      the URL from such a source: if any of the URLs for a given token comes
      from an excluded source, then ALL URLs for that token should be ignored
    
    * FIXME: temporary, only for dbpedia - it should be parametrized
    * TODO: weights for clusters
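
    The excluded-sources rule above can be illustrated with a minimal sketch.
    It is not the code from this commit: the function names, the
    token-to-URLs dictionary and the host-substring check are all assumptions
    made for illustration. Only the source names GEONAMES and GEOWORDNET come
    from the message itself.

```python
from urllib.parse import urlparse

# Hypothetical markers: the commit names the sources GEONAMES and GEOWORDNET,
# but how a URL is matched to a source is an assumption of this sketch.
EXCLUDED_SOURCE_MARKERS = ("geonames", "geowordnet")


def is_from_excluded_source(url, markers=EXCLUDED_SOURCE_MARKERS):
    """Return True if the URL's host contains one of the excluded markers."""
    host = urlparse(url).netloc.lower()
    return any(marker in host for marker in markers)


def filter_token_urls(token_urls, markers=EXCLUDED_SOURCE_MARKERS):
    """Drop ALL URLs of a token if any of them comes from an excluded source.

    token_urls is assumed to map each token to a list of candidate URLs.
    """
    filtered = {}
    for token, urls in token_urls.items():
        if any(is_from_excluded_source(url, markers) for url in urls):
            # One excluded URL disqualifies every URL of this token.
            filtered[token] = []
        else:
            filtered[token] = list(urls)
    return filtered
```

    Under these assumptions, a token that has both a DBpedia URL and a
    GeoNames URL ends up with an empty URL list, while a token with only
    DBpedia URLs is left unchanged.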