Grzegorz Kostkowski authored
The following improvements were added:

* building the graph from random walks
* clustering:
  - spectral clustering
  - dbscan
  - similarity measures: katz, betweenness
  - methods of estimating the number of clusters: based on layout, amos (it has quite high requirements for the graph)
* re-rank:
  - reranking the original graph or the graph built from random walks
* minimal cat score ratio
  - to minimize the number of initial nodes for the random walk algorithm
* add optional penalty for all categories
  - based on position on the list of cat cum scores and similarity to the document (vector similarity)
  - together with the above improvement, try to remove misleading categories from the list of candidates for initial nodes in the random walk
* introduce excluded sources (for now, GEONAMES and GEOWORDNET) to utilize information about geographic places
  - it's not sufficient to ignore a URL from such a source: if, among the URLs for a certain token, there is a URL from an excluded source, then ALL URLs in this token should be ignored
* FIXME: temporary, only for dbpedia - it should be parametrized
* TODO: weights for clusters
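The excluded-sources rule above can be sketched roughly as follows. This is only an illustration, not the code from this commit: the names `EXCLUDED_SOURCE_MARKERS`, `is_from_excluded_source` and `filter_token_urls`, the token-to-URLs dictionary, and the substring matching of source names are all assumptions made for the example.

```python
# Hypothetical markers for the excluded sources; how the project actually
# identifies GEONAMES / GEOWORDNET URLs is not stated in the commit message.
EXCLUDED_SOURCE_MARKERS = ("geonames", "geowordnet")


def is_from_excluded_source(url, markers=EXCLUDED_SOURCE_MARKERS):
    """Return True if the URL appears to come from an excluded source."""
    lowered = url.lower()
    return any(marker in lowered for marker in markers)


def filter_token_urls(token_urls, markers=EXCLUDED_SOURCE_MARKERS):
    """Drop ALL URLs of a token if any one of them is from an excluded source.

    token_urls: dict mapping a token to the list of URLs found for it
    (an assumed data shape). Returns a new dict; the input is not modified.
    """
    filtered = {}
    for token, urls in token_urls.items():
        if any(is_from_excluded_source(url, markers) for url in urls):
            # One excluded URL marks the whole token as geographic,
            # so every URL of this token is ignored.
            filtered[token] = []
        else:
            filtered[token] = list(urls)
    return filtered
```

With this shape, a token that has both a dbpedia URL and a geonames URL ends up with no URLs at all, while tokens without any excluded URL are left untouched.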
15f75ec1