# **basegraph**: A simple module for representing wordnet graphs!

The module is integrated with the GraphTool library (https://graph-tool.skewed.de).
It provides a simple interface to graphs representing wordnets, with
nodes reflecting synsets (or lexical units) and edges describing
the lexico-semantic structure.

The **basegraph** module offers three simple classes: *BaseGraph*, *BaseNode*, and *BaseEdge*.

A *BaseGraph* object is a wrapper around a GraphTool `Graph` object and provides
a convenient API for working with the graph. It also holds a reference to the raw GT graph
object (accessible via the `use_graph_tool` function). A *BaseGraph* consists
of *BaseNodes* and *BaseEdges*, which represent graph vertices and links. A *BaseNode*
object holds a reference to a raw GraphTool `Vertex` object and also provides
a convenient API to access its properties. We can wrap any raw vertex in
the graph with the *BaseNode* class and access its properties
just like attributes of plain Python objects, instead of using the less convenient
API provided by GT. The same holds for the *BaseEdge* class.

#### Installation

```
pip install --extra-index-url https://pypi.clarin-pl.eu basegraph
```

or, from a source checkout:

```
python3.6 setup.py install
```

#### Dependencies
- Python 3.6
- GraphTool only

#### Basic Usage

1. Load the graph:

```python
from basegraph import BaseGraph

bg = BaseGraph()
bg.unpickle('data/graph_syn.xml.gz')
```

2. Iterate over all nodes or edges:

```python
for node in bg.all_nodes():
    pass

for edge in bg.all_edges():
    pass
```

3. a: General node properties

```python

# returns all edges associated with the node
node.all_edges()

# returns all nodes associated with the node
node.all_neighbours()

# the degree of incoming links
node.in_degree()

# the degree of outgoing links
node.out_degree()

# access underlying GraphTool object
node.use_graph_tool()
```

3. b: Synset properties

- Safe only for synset graphs; be careful when using these properties with mixed graphs.

```python
synset = node.synset

synset.synset_id
synset.lu_set
```

3. c: Lexical Unit properties
```python
lu.lu_id
lu.lemma
lu.pos
lu.variant
```

4. a: General edge properties
```python
# source node
edge.source()

# target node
edge.target()
```

4. b: WN-specific edge properties
```python
# WordNet-based name of semantic link
edge.rel

# WordNet-based ID of a given semantic link
edge.rel_id
```

5. Custom node and edge attributes

- The idea is to make the GraphTool interface transparent and override properties
- We use `create_node_attribute` and `create_edge_attribute`
- Custom properties can be accessed as if they were plain Python attributes

```python
bg.create_node_attribute('depth', 'double')  # can also be int, string, or vector

# use it like an attribute:
node.depth = 3
node.depth = node.depth + 1

bg.create_edge_attribute('weight', 'double')
edge.weight = 0.8
edge.weight = edge.weight / 2
```
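
To illustrate the idea behind attribute-style properties, here is a minimal pure-Python sketch. It is not the basegraph implementation: `ToyGraph`, `ToyNode`, and the dict-based property maps are hypothetical stand-ins for a GT graph and its property maps. The sketch only shows how attribute reads and writes on a wrapper can be forwarded to storage owned by the graph.

```python
class ToyGraph:
    """Hypothetical stand-in for a graph that keeps node properties in maps."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.node_props = {}  # property name -> {node index -> value}

    def create_node_attribute(self, name, default=0.0):
        # One map per attribute, pre-filled with a default for every node.
        self.node_props[name] = {i: default for i in range(self.num_nodes)}

    def node(self, index):
        return ToyNode(self, index)


class ToyNode:
    """Thin wrapper: attribute access is forwarded to the graph's maps."""

    def __init__(self, graph, index):
        # Bypass our own __setattr__ for the two real instance fields.
        object.__setattr__(self, "_graph", graph)
        object.__setattr__(self, "_index", index)

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for custom attributes.
        props = self._graph.node_props
        if name not in props:
            raise AttributeError(name)
        return props[name][self._index]

    def __setattr__(self, name, value):
        props = self._graph.node_props
        if name not in props:
            raise AttributeError(f"create the attribute '{name}' first")
        props[name][self._index] = value


graph = ToyGraph(num_nodes=3)
graph.create_node_attribute("depth")

node = graph.node(1)
node.depth = 3
node.depth = node.depth + 1

# A second wrapper around the same node sees the same value,
# because the data lives in the graph, not in the wrapper.
same_node = graph.node(1)
```

Because the wrapper holds no data of its own, wrapping the same vertex twice is cheap and both wrappers stay consistent.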


#### Advanced Usage

0. Accessing Raw GT Properties

The **basegraph** API provides a simple interface to retrieve properties of nodes
and edges in the graph. With raw GT objects it's a bit more complicated:

```python
# get the underlying GT object
g = bg.use_graph_tool()

# first we have to get a specific property from the graph (e.g. the synset property)
prop = g.vp['synset']

# now we can get the property value for a specific node; the property is
# indexed by the raw GT vertex, so we take the BaseNode representing the
# synset of ID 1319 and unwrap it
n = bg.get_node_for_synset_id(1319).use_graph_tool()

synset = prop[n]
```


1. Find node by synset ID

```python
bg.get_node_for_synset_id(synset_id)
```

2. Find all nodes by given lemma

```python
# first we have to initialize the dictionary
bg._generate_lemma_to_nodes_dict()
# then just take the nodes by lemma
nodes = bg._lemma_to_nodes_dict[lemma]
```
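
The lookup above relies on an index from lemmas to nodes that is built once and then reused. How basegraph builds it is not shown here; the following pure-Python sketch only illustrates the general idea, with hypothetical `(node_id, lemmas)` pairs standing in for real nodes and their lexical units.

```python
from collections import defaultdict


def build_lemma_index(nodes):
    """Map each lemma to the set of node IDs whose units carry that lemma."""
    index = defaultdict(set)
    for node_id, lemmas in nodes:
        for lemma in lemmas:
            index[lemma].add(node_id)
    return index


# Hypothetical toy data: (node ID, lemmas of the node's lexical units).
toy_nodes = [
    (1, ["dog", "domestic dog"]),
    (2, ["dog", "frank"]),
    (3, ["cat"]),
]

lemma_index = build_lemma_index(toy_nodes)
dog_nodes = lemma_index["dog"]    # IDs of all nodes that contain "dog"
missing = lemma_index.get("cow")  # .get() does not create an empty entry
```

Building the index costs one pass over all nodes, after which each lookup is a single dictionary access. This is presumably why the dictionary has to be generated explicitly before the first query.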

3. Graph filters

With filters we can easily reduce the graph based on a given predicate. The source
basegraph can be filtered in a `hard` way, which removes the nodes that do not
meet our condition. The `soft` way only hides the filtered nodes,
so we can easily restore them later (using `reset_nodes_filter`). Analogous functions
exist for graph edges (e.g. `edges_filter_conditional`, `reset_edges_filter`).
Examples:

```python
from basegraph import BaseGraph

bg = BaseGraph()
bg.unpickle('data/graph_syn.xml.gz')

In [1]: sum(1 for n in bg.all_nodes())
Out[1]: 349189

In [2]: sum(1 for e in bg.all_edges())
Out[2]: 1552096

# Our condition:
condition = lambda node: node.in_degree() < 3

# Apply in a 'soft' way:
bg.nodes_filter_conditional(condition, soft=True)

In [3]: sum(1 for n in bg.all_nodes())
Out[3]: 171179

In [4]: sum(1 for e in bg.all_edges())
Out[4]: 26964

In [5]: bg.reset_nodes_filter()

In [6]: sum(1 for n in bg.all_nodes())
Out[6]: 349189

In [7]: sum(1 for e in bg.all_edges())
Out[7]: 1552096

# Apply in a 'hard' way (modifies the graph in place, `reset_nodes_filter`
# doesn't work here)
bg.nodes_filter_conditional(condition, soft=False)

In [8]: sum(1 for n in bg.all_nodes())
Out[8]: 171179

In [9]: sum(1 for e in bg.all_edges())
Out[9]: 26964
```

Let's do the same thing, but now for the edges:

```python
In [10]: condition = lambda edge: edge.rel_id == 11
In [11]: bg.edges_filter_conditional(condition, soft=True)

In [12]: sum(1 for n in bg.all_edges())
Out[12]: 208571

In [13]: bg.reset_edges_filter()

In [14]: sum(1 for n in bg.all_edges())
Out[14]: 1552096
```
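
The difference between the soft and hard variants can be illustrated with a small pure-Python toy. `MaskedGraph` is a hypothetical stand-in, not basegraph code; only the method names mirror the calls used above. It also shows why a node filter reduces the edge count: an edge stays visible only while both of its endpoints are visible.

```python
class MaskedGraph:
    """Hypothetical toy graph contrasting soft and hard node filtering."""

    def __init__(self, nodes, edges):
        self.nodes = set(nodes)
        self.edges = set(edges)  # (source, target) pairs
        self.hidden = set()      # nodes masked by a soft filter

    def all_nodes(self):
        return [n for n in sorted(self.nodes) if n not in self.hidden]

    def all_edges(self):
        # An edge is visible only if both of its endpoints are visible.
        return [(s, t) for s, t in sorted(self.edges)
                if s not in self.hidden and t not in self.hidden]

    def nodes_filter_conditional(self, condition, soft=True):
        rejected = {n for n in self.nodes if not condition(n)}
        if soft:
            self.hidden = rejected  # mask only, the data is kept
        else:
            self.nodes -= rejected  # remove for good
            self.edges = {(s, t) for s, t in self.edges
                          if s not in rejected and t not in rejected}

    def reset_nodes_filter(self):
        self.hidden = set()


g = MaskedGraph(nodes=[1, 2, 3, 4], edges=[(1, 2), (2, 3), (3, 4)])
keep_small = lambda n: n < 3

g.nodes_filter_conditional(keep_small, soft=True)
soft_nodes, soft_edges = g.all_nodes(), g.all_edges()

g.reset_nodes_filter()
restored_nodes = g.all_nodes()  # all four nodes are back

g.nodes_filter_conditional(keep_small, soft=False)
g.reset_nodes_filter()          # nothing to restore after a hard filter
hard_nodes, hard_edges = g.all_nodes(), g.all_edges()
```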


#### GraphTool Algorithms 

To apply predefined GraphTool algorithms we have to operate on the underlying GT
objects. Let's compute the shortest distance between two specific nodes:

```python
from graph_tool.topology import shortest_distance

# don't forget to use only underlying GT objects when using raw GraphTool functions!
# (s1 and s2 are the synset IDs of the two nodes)
n1 = bg.get_node_for_synset_id(s1).use_graph_tool()
n2 = bg.get_node_for_synset_id(s2).use_graph_tool()
g = bg.use_graph_tool()

distance = shortest_distance(g, n1, n2)
```
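
Without edge weights, the shortest distance between two nodes is the number of edges on the shortest route between them. The pure-Python sketch below illustrates that notion with a breadth-first search over a hypothetical adjacency dictionary. It does not describe how GraphTool computes the value, and the function name and toy data are made up for illustration.

```python
from collections import deque


def bfs_distance(adjacency, source, target):
    """Return the number of edges on a shortest directed path, or None."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return distances[current]
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return None  # target is unreachable from source


# Hypothetical toy graph: node -> list of successors.
toy_adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}

forward = bfs_distance(toy_adjacency, "a", "e")   # follows a -> b -> d -> e
backward = bfs_distance(toy_adjacency, "e", "a")  # edges are directed
```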

Now we can try to get the shortest path:

```python
from graph_tool.topology import shortest_path
# BaseNode and BaseEdge are assumed to be importable from the package
# top level, like BaseGraph
from basegraph import BaseNode, BaseEdge

n1 = bg.get_node_for_synset_id(s1).use_graph_tool()
n2 = bg.get_node_for_synset_id(s2).use_graph_tool()
g = bg.use_graph_tool()

# this returns raw GT objects, but still we can easily wrap them and use
# basegraph API
vertices, links = shortest_path(g, n1, n2)

nodes = [BaseNode(g, v) for v in vertices]
edges = [BaseEdge(g, e) for e in links]

for node in nodes:
    print(node.synset.synset_id)  # it's easier with BaseNode
    print(node.weight)  # assumes a custom 'weight' node attribute exists

# we can also try to use raw objects, but it's not so convenient
synset_prop = g.vp['synset']
weight_prop = g.vp['weight']
for v in vertices:
    print(synset_prop[v].synset_id)
    print(weight_prop[v])
```

#### WN-based Examples

Let's take all of the hypernyms of a node. To do this we need to know the ID of
the 'hypernymy' relation in our WN (11 in this example):

```python
def get_hypernyms(node):
    return {edge.target() for edge in node.all_edges()
            if edge.rel_id == 11 and edge.target() != node}

hypernyms = get_hypernyms(node)
```
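
`get_hypernyms` returns only the direct hypernyms. To collect every ancestor, the same step can be repeated until no new nodes appear. The sketch below shows this on hypothetical toy data: `(source, target, rel_id)` tuples stand in for basegraph edges, and 11 is reused as the hypernymy ID from the example. It assumes that hypernymy edges point from a synset to its hypernym.

```python
HYPERNYMY = 11  # relation ID taken from the example; may differ in your WN

# Hypothetical toy edges: (source, target, rel_id).
toy_edges = [
    ("poodle", "dog", 11),
    ("dog", "canine", 11),
    ("canine", "mammal", 11),
    ("dog", "tail", 20),  # some other relation, ignored below
]


def direct_hypernyms(node, edges):
    """Targets of hypernymy edges that start at the given node."""
    return {target for source, target, rel_id in edges
            if source == node and rel_id == HYPERNYMY}


def all_hypernyms(node, edges):
    """Collect direct and indirect hypernyms by walking up the hierarchy."""
    seen = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for parent in direct_hypernyms(current, edges):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                frontier.append(parent)
    return seen


parents = direct_hypernyms("poodle", toy_edges)
ancestors = all_hypernyms("poodle", toy_edges)
```

With real basegraph nodes, the inner step would be the `get_hypernyms` function shown above instead of `direct_hypernyms`.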