Graphony is a Python library for doing high-performance graph analysis using the GraphBLAS over sparse and hypersparse data sets.
Graphony uses pygraphblas to store graph data in sparse GraphBLAS Matrices and node and edge properties in PostgreSQL.
Graphony's primary role is to easily construnct graph matrices and manage symbolic names and properties for graphs, nodes, properties and edges, and can be used to easily construct, save and manage graph data in a simple project directory format.
Graphs can be:
-
Simple: an edge connects one source to one destination.
-
Hypergraph: a graph with at lest one hyperedge connecting multiple source nodes to multiple destinations.
-
Multigraph: multiple edges can exist between a source and destination.
-
Property Graph: Nodes and and Edges can have arbitrary JSON properties.
This documentation is also a runnable Python test called a
doctest. In order to run and verify this documentation, we must
first create some helper objects like a function p()
that will
iterate results into a list and "pretty print" them. We also have to
setup a test PostgreSQL database and initialize it with the base
Graphony tables.
The core object of Graphony is a Graph()
. A new Graph can be created
with a connection string to an existing initialized database:
import os
import pprint
import postgresql
from pathlib import Path
from pygraphblas import FP64, INT64, gviz
from graphony import Graph, Node
p = lambda r: pprint.pprint(sorted(list(r)))
pgdata, conn = postgresql.setup()
postgresql.psql(f'-d "{conn}" -f dbinit/01_init_graphony.sql -f dbinit/02_karate_demo.sql')
G = Graph(conn)
The Graph
object G
is used throughout the following documenation
to demonstrate the features of Graphony. A Graphony Graph
consists
of four concepts:
-
Graph
: Top level object that contains all graph data in sub-graphs called properties. -
Property
: A named, typed sub-graph that holds edges. A property consists of two GraphBLAS Incidence Matrices that can be multiplied to project an adjacency with themselves, or any other combination of properties. -
Edge
: Property edges can be simple point to point edges or hyperedges that represent properties between multiple incoming and outgoing nodes. -
Node
: A node in the graph.
Graphs consist of multiple typed subgraphs called properties. All properties share node ids across a Graph object, but each property can store edge weights of different data types like int, float, or bool. Internally, "simple" graphs (non-hyper) are stored as Adjacency Matrices.
Before you can add an edge, a property to hold it must be declared
first. The default edge type is bool
if you don't specify one:
>>> G.add_property('friend')
Edges can be added directly into the Graph with the +=
method. In
their simplest form, an edge is a Python tuple with 2 elements a
source and a destination:
>>> G.friend += ('bob', 'alice')
>>> G.friend.draw(weights=False, filename='docs/imgs/G_friend_1')
<graphviz.dot.Digraph object at ...>
Using strings like 'bob'
and 'alice'
as edge endpoints creates new
graph nodes automatically. You can also create nodes explicity and
provide attributes for them:
>>> jane = Node(G, 'jane', favorite_color='blue')
>>> jane.favorite_color
'blue'
>>> G.friend += ('alice', jane)
>>> G.friend.draw(weights=False, filename='docs/imgs/G_friend_2')
<graphviz.dot.Digraph object at ...>
Now there are two edges in the friend
property, one from bob to
alice and the other from alice to jane.
>>> p(G.friend)
[friend(bob, alice), friend(alice, jane)]
An iterator of property tuples can also be provided to the +=
operator which will consume them and add them to the property:
>>> G.friend += [('bob', 'sal'), ('alice', 'rick')]
>>> G.friend.draw(weights=False, filename='docs/imgs/G_friend_3')
<graphviz.dot.Digraph object at ...>
As shown above, tuples are stored as boolean edges whose weights are
always True
and therefore can be ommited.
A Hypergraph is a generalization of a graph in which an edge can join any number of vertices in constrast to a simple graph, shown above, where an edge has exactly two endpoints and can only connect only one vertex to one other vertex.
In Graphony a hypergraph can created in any incidence property by
passing the incidence=True
flag. This causes the property to be
stored internally as two Incidence
Matrices which can
represent non-simple graphs like the hypergraph shown here:
>>> G.add_property('manages', incidence=True)
New hyperedges can be defined by passing a nested tuple of nodes as either the source or destinations, or both, for a hyperedge.
>>> G.manages += [('bob', ('rick', 'alice')), (('alice', 'bob'), 'jane')]
>>> G.manages.draw(weights=True, filename='docs/imgs/G_manages_1')
<graphviz.dot.Digraph object at ...>
Here a hyperedge with one source and two destinations is created from bob to jane and alice, and another with two sources and one destination is created from alice and bob to jane.
Graphs can have any number of properties, each with a particular
GraphBLAS type. In general this is referred to as a Property Graph.
As shown above the default property type is bool
which created
unweighted edges, but graph edge types can be specified on a
per-property basis:
>>> G.add_property('distance', int)
>>> G.distance += [('bob', 'alice', 422), ('alice', 'jane', 42)]
>>> G.distance.draw(weights=True, filename='docs/imgs/G_distance_2')
<graphviz.dot.Digraph object at ...>
Supported python types include bool
, int
, float
and complex
which are converted into the GraphBLAS types GrB_BOOL
, GrB_INT64
,
GrB_FP64
and GxB_FC64
for storage. You can also pass a specific
GraphBLAS type if you want different precision or a custom type.
Currently our graph looks like this, it contains 3 properties,
friend
, manages
and distance
:
>>> G.draw(weights=True, filename='docs/imgs/G_all_1')
<graphviz.dot.Digraph object at ...>
Graphs have a call interface like G(...)
that can be used to query
individual edges. A query consists of three optional arguments for
source
, property
and destination
. The default value for all
three is None, which acts as a wildcard to matches all values.
>>> p(G())
[friend(bob, alice),
friend(bob, sal),
friend(alice, jane),
friend(alice, rick),
manages((bob), (alice, rick), (True, True)),
manages((bob, alice), (jane), (True)),
distance(bob, alice, 422),
distance(alice, jane, 42)]
Only print edges where bob
is the src:
>>> p(G(source='bob'))
[friend(bob, alice),
friend(bob, sal),
manages((bob), (alice, rick), (True, True)),
manages((bob, alice), (jane), (True)),
distance(bob, alice, 422)]
Only print edges where manages
is the property:
>>> p(G(property='manages'))
[manages((bob), (alice, rick), (True, True)),
manages((bob, alice), (jane), (True))]
Only print edges where jane
is the destination:
>>> p(G(destination='jane'))
[friend(alice, jane),
manages((bob, alice), (jane), (True)),
distance(alice, jane, 42)]
Only print edges that match that bob
is a manages
of jane
.
Note in this case it returns two hyperedges, as in both cases bob is a
source and jane is a destination:
>>> p(G(source='bob', property='manages', destination='jane'))
[manages((bob, alice), (jane), (True))]
Any tuple producing iterator can be used to construct Graphs. Graphony offers a shorthand helper for this. Any query that produces 2 or 3 columns can be used to produce edges into the graph.
>>> G.add_property('karate')
>>> G.karate += G.sql("select 'k_' || s_id, 'k_' || d_id from graphony.karate")
>>> G.karate.draw(weights=False, filename='docs/imgs/G_karate_3',
... directed=False, graph_attr=dict(layout='sfdp'))
<graphviz.dot.Graph object at ...>
All the edges are in the karate property, as defined in the sql query above:
>>> len(G.karate)
78
In a Multigraph multiple edges can exist between two nodes. A good example is a De Bruijn graph, a directed graph that represents overlapping sequences of symbols.
These graphs are used in bioinformatics to analyze and assemble long sequences of genetic data. Construction involves iterating a sequence of genetic information and constructing multiple edges between pairs of nodes.
>>> from more_itertools import windowed
>>> G.add_property('debruijn', incidence=True)
>>> def kmer(t, k=3):
... return (tuple(map("".join, windowed(i, k-1))) for i in map("".join, windowed(t, k)))
>>> G.debruijn += kmer('ATCGATCGGATGACAGACACAATTC')
>>> G.debruijn.draw(graph_attr=dict(layout='circo'), weights=False, concentrate=True, filename='docs/imgs/G_debruijn_1')
<graphviz...>
Once the graph is built up it can be "collapsed" into a weighted graph, where the multi-edges between nodes are summed up into a single edge. In the GraphBLAS this can be accomplished by calling the properties with a semiring:
>>> M = G.debruijn(INT64.plus_pair)
>>> gviz.draw_graph(M, weights=True, label_vector=G.debruijn.label_vector(M),
... graph_attr=dict(layout='circo'), filename='docs/imgs/G_debruijn_2')
<graphviz...>
Here's an example or using Biopython to create an weighted De Bruijn graph of a Circovirus:
>>> from Bio import SeqIO, Entrez
>>> Entrez.email = "[email protected]"
>>> handle = Entrez.efetch(db="nucleotide", id="MZ299081", rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> from more_itertools import windowed
>>> G.add_property('circovirus', incidence=True)
>>> def kmer(t, k=3):
... return (tuple(map("".join, windowed(i, k-1))) for i in map("".join, windowed(t, k)))
>>> seq = str(record.seq)
>>> G.circovirus += kmer(seq, 3)
>>> M = G.circovirus(INT64.plus_pair)
>>> gviz.draw_graph(M, weights=True, labels=True, label_vector=G.circovirus.label_vector(M),
... graph_attr=dict(layout='sfdp'), filename='docs/imgs/G_circovirus_1')
<graphviz...>
postgresql.teardown(pgdata)