GSQL Graph Algorithm Library
Updated June 12, 2020
Graph algorithms are functions for measuring characteristics of graphs, vertices, or relationships. Graph algorithms can provide insights into the role or relevance of individual entities in a graph. For example: How centrally located is this vertex? How much influence does this vertex exert over the others?
Some graph algorithms measure or identify global characteristics: What are the natural community groupings in the graph? What is the density of connections?

Using the GSQL Graph Algorithm Library

The GSQL Graph Algorithm Library is a collection of expertly written GSQL queries, each of which implements a standard graph algorithm. Each algorithm is ready to be installed and used, either as a stand-alone query or as a building block of a larger analytics application.
GSQL running on the TigerGraph platform is particularly well suited to graph algorithms, for several reasons:
  • Turing-complete with full support for imperative and procedural programming, ideal for algorithmic computation.
  • Parallel and Distributed Processing, enabling computations on larger graphs.
  • User-Extensible. Because the algorithms are written in standard GSQL and compiled by the user, they are easy to modify and customize.
  • Open-Source. Users can study the GSQL implementations to learn by example, and they can develop and submit additions to the library.

Library Structure

You can download the library from GitHub: https://github.com/tigergraph/gsql-graph-algorithms
The library contains two main sections: algorithms and tests. The algorithms folder contains template algorithms and scripts to help you customize and install them. The tests folder contains small sample graphs that you can use to experiment with the algorithms. In this document, we use the test graphs to show you the expected result for each algorithm. The graphs are small enough that you can manually calculate and sometimes intuitively see what the answers should be.

Installing an Algorithm

Remember that GSQL graph algorithms are simply GSQL queries. However, since we do not know what vertices or edges you want to analyze, or how you want to receive output, the algorithms are in a template format. You need to run a script to personalize your algorithm and then to install it.
Make sure that the install.sh is owned by the tigergraph user.
  1. Within the Algorithms folder is a script install.sh. When you run the script, it will first ask you which graph schema you wish to work on. (The TigerGraph platform supports multiple concurrent graphs.)
  2. It then asks you to choose from a menu of available algorithms.
  3. Once it knows your graph schema and your algorithm, the installer asks some questions specific to that algorithm:
    1. The installer will guide you in selecting appropriate vertex types and edge types. These do not have to be all the vertex or edge types in your graph. For example, you may have a social graph with three categories of persons and five types of relationships. You might decide to compute PageRank using Member and Guest vertices and Recommended edges.
    2. Some algorithms use edge weights as input information (such as Shortest Path, where each edge has a weight meaning the "length" of that edge). The installer will ask for the name of that edge attribute.
  4. Single Node Mode or Distributed Mode? Queries that analyze the entire graph (such as PageRank and Community Detection) will run better in Distributed Mode, if you have a cluster of machines.
  5. It will then ask what type of output you would like. It will proceed to create up to three versions of your algorithm, based on the three ways of receiving the algorithm's output:
    1. Stream the output in JSON format, the default behavior for most GSQL queries.
    2. Save the output value(s) in CSV format to a file. For some algorithms, this option adds an input parameter to the query, to let the user specify how many total values to output.
    3. Store the results as vertex or edge attribute values. The attributes must already exist in the graph schema, and the installer will ask which attributes to use.
  6. After creating queries for one algorithm, the installer loops back to let you choose another algorithm (returning to step 2 above).
  7. If you choose to exit, the installer makes a last request: Do you want to install your queries? Installation is when the code is compiled and bound into the query engine. It takes a few minutes, so it is best to create all your personalized queries at once and then install them as a group.
Example:
$ bash install.sh
*** GSQL Graph Algorithm Installer ***
Available graphs:
- Graph social(Person:v, Friend:e, Also_Friend:e, Coworker:e)
Graph name? social

Please enter the number of the algorithm to install:
1) EXIT
2) Closeness Centrality
3) Connected Components
4) Strongly Connected Components
5) Label Propagation
6) Louvain Method with Parallelism and Refinement
7) PageRank
8) Weighted PageRank
9) Personalized PageRank
10) Shortest Path, Single-Source, No Weight
11) Shortest Path, Single-Source, Positive Weight
12) Shortest Path, Single-Source, Any Weight
13) Minimal Spanning Tree (MST)
14) Cycle Detection
15) Triangle Counting(minimal memory)
16) Triangle Counting(fast, more memory)
17) Cosine Neighbor Similarity (single vertex)
18) Cosine Neighbor Similarity (all vertices)
19) Jaccard Neighbor Similarity (single vertex)
20) Jaccard Neighbor Similarity (all vertices)
21) k-Nearest Neighbors (Cosine Neighbor Similarity, single vertex)
22) k-Nearest Neighbors (Cosine Neighbor Similarity, batch)
23) k-Nearest Neighbors Cross Validation (Cosine Neighbor Similarity)
#? 7
pageRank() works on directed edges

Available vertex and edge types:
- VERTEX Person(PRIMARY_ID id STRING, name STRING, score FLOAT, tag STRING) WITH STATS="OUTDEGREE_BY_EDGETYPE"
- DIRECTED EDGE Friend(FROM Person, TO Person, weight FLOAT, tag STRING) WITH REVERSE_EDGE="Also_Friend"
- DIRECTED EDGE Also_Friend(FROM Person, TO Person, weight FLOAT, tag STRING) WITH REVERSE_EDGE="Friend"
- UNDIRECTED EDGE Coworker(FROM Person, TO Person, weight FLOAT, tag STRING)

Please enter the vertex type(s) and edge type(s) for running PageRank.
Use commas to separate multiple types [ex: type1, type2]
Leaving this blank will select all available types
Similarity algorithms only take single vertex type

Vertex types: Person
Edge types: Friend
The query pageRank is dropped.
The query pageRank_file is dropped.
The query pageRank_attr is dropped.

Please choose query mode:
1) Single Node Mode
2) Distributed Mode
#? 1

Please choose a way to show result:
1) Show JSON result    3) Save to Attribute/Insert Edge
2) Write to File       4) All of the above
#? 4

gsql -g social ./templates/pageRank.gsql
The query pageRank has been added!

gsql -g social ./templates/pageRank_file.gsql
The query pageRank_file has been added!

If your graph schema has appropriate vertex or edge attributes,
you can update the graph with your results.
Do you want to update the graph [yn]? y
Vertex attribute to store FLOAT result (e.g. pageRank): score
gsql -g social ./templates/pageRank_attr.gsql
The query pageRank_attr has been added!
Created the following algorithms:
- pageRank(float maxChange, int maxIter, float damping, bool display, int outputLimit)
- pageRank_attr(float maxChange, int maxIter, float damping, bool display)
- pageRank_file(float maxChange, int maxIter, float damping, bool display, file f)

Please enter the number of the algorithm to install:
1) EXIT
2) Closeness Centrality
3) Connected Components
4) Label Propagation
5) Community detection: Louvain
6) PageRank
7) Shortest Path, Single-Source, Any Weight
8) Triangle Counting(minimal memory)
9) Triangle Counting(fast, more memory)
#? 1
Exiting
Algorithm files have been created. Do want to install them now [yn]? y
Start installing queries, about 1 minute ...
pageRank query: curl -X GET 'http://127.0.0.1:9000/query/social/pageRank?maxChange=VALUE&maxIter=VALUE&damping=VALUE&display=VALUE&outputLimit=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
pageRank_file query: curl -X GET 'http://127.0.0.1:9000/query/social/pageRank_file?maxChange=VALUE&maxIter=VALUE&damping=VALUE&display=VALUE&f=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
pageRank_attr query: curl -X GET 'http://127.0.0.1:9000/query/social/pageRank_attr?maxChange=VALUE&maxIter=VALUE&damping=VALUE&display=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.

[======================================================================================================] 100% (3/3)
$
After the algorithms are installed, you will see them listed among the rest of your GSQL queries.
GSQL > ls
...
Queries:
- cc_subquery(vertex v, int numVert, int maxHops) (installed v2)
- closeness_cent(bool display, int outputLimit) (installed v2)
- closeness_cent_attr(bool display) (installed v2)
- closeness_cent_file(bool display, file f) (installed v2)
- conn_comp() (installed v2)
- conn_comp_attr() (installed v2)
- conn_comp_file(file f) (installed v2)
- label_prop(int maxIter) (installed v2)
- label_prop_attr(int maxIter) (installed v2)
- label_prop_file(int maxIter, file f) (installed v2)
- louvain() (installed v2)
- louvain_attr() (installed v2)
- louvain_file(file f) (installed v2)
- pageRank(float maxChange, int maxIter, float damping, bool display, int outputLimit) (installed v2)
- pageRank_attr(float maxChange, int maxIter, float damping, bool display) (installed v2)
- pageRank_file(float maxChange, int maxIter, float damping, bool display, file f) (installed v2)
- tri_count() (installed v2)
- tri_count_fast() (installed v2)
We will soon update the library so that most schema choices can be made when running an algorithm, rather than when installing it.

Running an Algorithm

Running an algorithm is the same as running a GSQL query. For example, if you selected the JSON option for pageRank, you could run it from GSQL as below:
GSQL > RUN QUERY pageRank(0.001, 25, 0.85, true, 10)
Installing a query also creates a REST endpoint. The same query could be run thus:
curl -X GET 'http://127.0.0.1:9000/query/alg_graph/pageRank?maxChange=0.001&maxIter=25&damping=0.85&display=true&outputLimit=10'
GSQL lets you run queries from within other queries. This means you can use a library algorithm as a building block for more complex analytics.

Library Overview

The following algorithms are currently available. The algorithms are grouped into five classes:
  • Path
  • Centrality
  • Community
  • Similarity
  • Classification (NEW)
Moreover, each algorithm may or may not currently be available for a graph with Undirected Edges, Directed Edges, and Weighted Edges.
  • Coming soon means that TigerGraph plans to release this variant of the algorithm soon.
  • n/a means that this variant of the algorithm is typically not used
Algorithm | Class | Undirected Edges | Directed Edges | Weighted Edges
Single-Source Shortest Path | Path | Yes | Yes | Yes
All Pairs Shortest Path | Path | Yes | Yes | Yes
Minimum Spanning Tree | Path | Yes | n/a | Yes
Cycle Detection | Path | No | Yes | n/a
PageRank | Centrality | n/a | Yes | Yes (NEW)
Personalized PageRank | Centrality | n/a | Yes | Coming soon
Closeness Centrality | Centrality | Yes | n/a | Coming soon
Betweenness Centrality | Centrality | Coming soon | n/a | Coming soon
Connected Components | Community | Yes | n/a | n/a
Strongly Connected Components | Community | n/a | Yes, with reverse edges (NEW) | n/a
Label Propagation | Community | Yes | n/a | n/a
Louvain Modularity | Community | Yes | n/a | n/a
Triangle Counting | Community | Yes | n/a | n/a
Cosine Similarity of Neighborhoods (single-source and all-pairs) | Similarity | Yes | Yes | Yes
Jaccard Similarity of Neighborhoods (single-source and all-pairs) | Similarity | Yes | Yes | No
K-Nearest Neighbors (with cosine similarity for "nearness") | Classification | Yes | Yes | Yes

Computational Complexity

Computational Complexity is a formal mathematical term, referring to how an algorithm's time requirements scale according to the size of the data or other key parameters. For graphs, there are two key data parameters:
  • V (or sometimes n), the number of vertices
  • E (or sometimes m), the number of edges
The notation O(V^2) (read "big O V squared") means that when V is large, the computational time is proportional to V^2.

Path Algorithms

These algorithms help find the shortest path or evaluate the availability and quality of routes.

Single-Source Shortest Path, Unweighted

The algorithm we are discussing here finds an unweighted shortest path from one source vertex to each possible destination vertex in the graph. That is, it finds n paths.
If you just want to know the shortest path between two particular vertices, S and T in a graph with unweighted edges, we have described that query in detail in our tutorial document GSQL Demo Examples.
If your graph has weighted edges, see the next algorithm.

Description and Uses

If a graph has unweighted edges, then finding the shortest path from one vertex to another is the same as finding the path with the fewest hops. Think of Six Degrees of Separation and Friend of a Friend. Unweighted Shortest Path answers the question "How are you two related?" The two entities do not have to be persons. Shortest Path is useful in a host of applications, from estimating influences or knowledge transfer, to criminal investigation.
When the graph is unweighted, we can use a "greedy" approach to find the shortest path. In computer science, a greedy algorithm makes intermediate choices based on the data being considered at the moment, and then does not revisit those choices later on. In this case, once the algorithm finds any path to a vertex T, it is certain that that is a shortest path.
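The greedy, frontier-expanding approach described above can be sketched in a few lines of ordinary Python. This is an illustrative sketch, not the library's GSQL implementation; the graph is given as a hypothetical adjacency dict.

```python
from collections import deque

def shortest_ss_no_wt(edges, source):
    """Unweighted single-source shortest paths via breadth-first search.

    edges: dict mapping vertex -> list of neighbor vertices.
    Returns dict vertex -> (distance, path). The first time a vertex is
    reached, that path is final: this is the greedy property described above.
    """
    result = {source: (0, [source])}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        dist, path = result[v]
        for u in edges.get(v, []):
            if u not in result:          # first visit = a shortest path
                result[u] = (dist + 1, path + [u])
                frontier.append(u)
    return result
```

With edges A→B, A→D, D→C, D→E (matching the example graph below), the result records path A-B to B and A-D-C to C.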

Specifications

shortest_ss_no_wt(VERTEX v, BOOL display)
shortest_ss_no_wt_file(VERTEX v, BOOL display, STRING filepath)
shortest_ss_no_wt_attr(VERTEX v, BOOL display)
Result: Computes a shortest distance (INT) and shortest path (STRING) from vertex v to each other vertex T. The result is available in 3 forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as two vertex attribute values.
Input Parameters:
  • v: id of the source vertex
  • display: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.
  • filepath (for file output only): the path to the output file
Result Size: V = number of vertices
Computational Complexity: O(E), E = number of edges
Graph Types: Directed or Undirected edges, Unweighted edges

Example

In the graph below, we do not consider the weights on the edges. Using vertex A as the source vertex, the algorithm discovers that the shortest path from A to B is A-B, the shortest path from A to C is A-D-C, and so on.
generic graph with shortest_pos5 choice, not considering weight
[
  {
    "ResultSet": [
      {
        "v_id": "B",
        "v_type": "Node",
        "attributes": {
          ...
            "A",
            "B"
          ]
        }
      },
      {
        "v_id": "A",
        "v_type": "Node",
        "attributes": {
          ...
            "A"
          ]
        }
      },
      {
        "v_id": "C",
        "v_type": "Node",
        "attributes": {
          ...
            "A",
            "D",
            "C"
          ]
        }
      },
      {
        "v_id": "E",
        "v_type": "Node",
        "attributes": {
          ...
            "A",
            "D",
            "E"
          ]
        }
      },
      {
        "v_id": "D",
        "v_type": "Node",
        "attributes": {
          ...
            "A",
            "D"
          ]
        }
      }
    ]
  }
]

Single-Source Shortest Path, Weighted

Description and Uses

Finding shortest paths in a graph with weighted edges is algorithmically harder than in an unweighted graph: when you find a path to a vertex T, you cannot be certain that it is a shortest path. If edge weights are always positive, then you must keep trying until you have considered every in-edge to T. If edge weights can be negative, it is even harder: you must consider all possible paths.
A classic application for weighted shortest path is finding the shortest travel route to get from A to B. (Think of route planning "GPS" apps.) In general, any application where you are looking for the cheapest route is a possible fit.

Specifications

The shortest path algorithm can be optimized if we know all the weights are nonnegative. If there can be negative weights, then sometimes a longer path will have a lower cumulative weight. Therefore, we have two versions of this algorithm:
shortest_path_pos_wt(VERTEX v, BOOL display)
shortest_path_pos_wt_file(VERTEX v, BOOL display, STRING filepath)
shortest_path_pos_wt_attr(VERTEX v, BOOL display)

shortest_path_neg_wt(VERTEX v, BOOL display)
shortest_path_neg_wt_file(VERTEX v, BOOL display, STRING filepath)
shortest_path_neg_wt_attr(VERTEX v, BOOL display)
Result: Computes a shortest distance (INT) and shortest path (STRING) from vertex v to each other vertex T. The result is available in 3 forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as two vertex attribute values.
Input Parameters:
  • v: id of the source vertex
  • display: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.
  • filepath (for file output only): the path to the output file
Result Size: V = number of vertices
Computational Complexity: O(V*E), V = number of vertices, E = number of edges
Graph Types: Directed or Undirected edges, Weighted edges
The shortest_path_neg_wt library query is an implementation of the Bellman-Ford algorithm. If there is more than one path with the same total weight, the algorithm returns one of them.
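For readers who want to see the relaxation idea behind the any-weight version, here is a minimal sequential Bellman-Ford sketch in Python. This is illustrative only: the library's GSQL implementation is parallel, and the graph and weights below are hypothetical.

```python
def bellman_ford(vertices, edges, source):
    """Shortest distances from source on a directed, weighted graph.

    edges: list of (u, v, weight) tuples.
    Raises ValueError if a negative-weight cycle is reachable from source.
    """
    INF = float("inf")
    dist = {v: INF for v in vertices}
    dist[source] = 0.0
    # Relax every edge up to |V| - 1 times; each pass can extend
    # shortest paths by one more hop.
    for _ in range(len(vertices) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    # One extra pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative-weight cycle reachable from source")
    return dist
```

For example, with hypothetical weights A→D = 5, D→B = 3, B→C = 1, and A→B = 10, the algorithm finds distance 8 to B via A-D-B rather than 10 via the direct edge.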

Example

The graph below has only positive edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to B is A-D-B, with distance 8. The shortest weighted path from A to C is A-D-B-C with distance 9.
generic graph with shortest_pos5 choice
[
  {
    "ResultSet": [
      {
        "v_id": "B",
        "v_type": "Node",
        "attributes": {
          ...
            "D",
            "B"
          ]
        }
      },
      {
        "v_id": "A",
        "v_type": "Node",
        "attributes": {
          ...
        }
      },
      {
        "v_id": "C",
        "v_type": "Node",
        "attributes": {
          ...
            "D",
            "B",
            "C"
          ]
        }
      },
      {
        "v_id": "E",
        "v_type": "Node",
        "attributes": {
          ...
            "D",
            "E"
          ]
        }
      },
      {
        "v_id": "D",
        "v_type": "Node",
        "attributes": {
          ...
            "D"
          ]
        }
      }
    ]
  }
]
The graph below has both positive and negative edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to E is A-D-C-B-E, with a cumulative score of 7 - 3 - 2 - 4 = -2.
shortest_path_wt(A, -1, true, "json") on shortest_neg5 graph

Single-Pair Shortest Path

The Single-Pair Shortest Path task seeks the shortest path between a source vertex S and a target vertex T. If the edges are unweighted, then use the query in our tutorial document GSQL Demo Examples.
If the edges are weighted, then use the Single-Source Shortest Path algorithm. In the worst case, it takes the same computational effort to find the shortest path for one pair as to find the shortest paths for all pairs from the same source S. The reason is that you cannot know whether you have found the shortest (least weight) path until you have explored the full graph. If the weights are always positive, however, then a more efficient algorithm is possible. You can stop searching when you have found paths that use each of the in-edges to T.

All-Pairs Shortest Path

The All-Pairs Shortest Path algorithm is costly for large graphs, because the computation time is O(V^3) and the output size is O(V^2). Be cautious about running this on very large graphs.
The All-Pairs Shortest Path (APSP) task seeks to find shortest paths between every pair of vertices in the entire graph. In principle, this task can be handled by running the Single-Source Shortest Path (SSSP) algorithm for each input vertex, e.g.,
CREATE QUERY all_pairs_shortest(INT maxDepth, BOOL display, STRING fileBase) {
  Start = {Node.*};
  Result = SELECT s FROM Start:s
           POST-ACCUM
             shortest_ss_any_wt_file(s, maxDepth, display, fileBase+s);
}
This example highlights one of the strengths of GSQL: treating queries as stored procedures which can be called from within other queries.
For large graphs (with millions of vertices or more), however, this is an enormous task. While the massively parallel processing of the TigerGraph platform can speed up the computation by 10x or 100x, consider what it takes just to store or report the results. If there are 1 million vertices, then there are nearly 1 trillion output values.
There are more efficient methods than calling the single-source shortest path algorithm n times, such as the Floyd-Warshall algorithm, which computes APSP in O(V^3) time.
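To make the O(V^3) structure of Floyd-Warshall concrete, here is a minimal Python sketch. It is illustrative only and is not part of the GSQL library; the triple loop over intermediate vertices is what gives the cubic cost discussed above.

```python
def floyd_warshall(vertices, edges):
    """All-pairs shortest distances on a directed, weighted graph.

    edges: list of (u, v, weight) tuples.
    Runs in O(V^3) time and needs O(V^2) space for the result,
    which is why APSP output alone is huge on large graphs.
    """
    INF = float("inf")
    dist = {u: {v: (0 if u == v else INF) for v in vertices} for u in vertices}
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    # Progressively allow paths through intermediate vertex k.
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```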
Our recommendation:
  • If you have a smaller graph (perhaps thousands or tens of thousands of vertices), the APSP task may be tractable.
  • If you have a large graph, avoid using APSP.

Minimum Spanning Tree (MST)

Description and Uses

Given an undirected and connected graph, a minimum spanning tree is a set of edges that connects all the vertices in the graph with the minimal total edge weight. A parallel version of Prim's algorithm is implemented in the library:
  1. Start with a set A = { an arbitrary vertex r }.
  2. Among all edges e(x,y) such that x is in A and y is not, find one with the smallest weight. Add y to A, and add the edge (x,y) to the MST.
  3. Repeat step 2 until A contains all the vertices in the graph.
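The three steps above can be sketched sequentially in Python with a priority queue. This is an illustrative sketch, not the parallel GSQL implementation; the example graph in the test mirrors the social10 Coworker component, with the non-MST edge weight chosen hypothetically.

```python
import heapq

def prim_mst(edges, start):
    """Prim's algorithm on an undirected graph.

    edges: dict vertex -> list of (neighbor, weight) pairs (symmetric).
    Returns a list of (u, v, weight) tree edges for the component
    containing start.
    """
    in_tree = {start}                       # the set A from the steps above
    candidates = [(w, start, v) for v, w in edges.get(start, [])]
    heapq.heapify(candidates)
    mst = []
    while candidates:
        w, x, y = heapq.heappop(candidates)  # lightest edge leaving the tree
        if y in in_tree:
            continue
        in_tree.add(y)
        mst.append((x, y, w))
        for z, wz in edges.get(y, []):
            if z not in in_tree:
                heapq.heappush(candidates, (wz, y, z))
    return mst
```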

Specifications

mst (VERTEX source)
mst_file (VERTEX source, FILE f)
mst_attr (VERTEX source)
Result: Computes a MST. The result is available in 3 forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as an edge attribute value.
Input Parameters:
  • source: id of the source vertex
  • filepath (for file output only): the path to the output file
Result Size: V - 1 = number of vertices - 1
Graph Types: Undirected edges and connected
Example

In the social10 graph, we consider only the undirected Coworker edges.
social10 graph with Coworker edges
This graph has 3 components. Minimum Spanning Tree finds a tree for one component, so which component it works on depends on the vertex we give as the starting point. If we select Fiona, George, Howard, or Ivy as the start vertex, it will work on the 4-vertex component on the left. You can start from any vertex in the component and get the same or an equivalent MST result.
The figure below shows the result of mst(("Ivy", "Person")). Note that the value for the one vertex parameter is ("Ivy","Person"). In GSQL, this 2-tuple format, which explicitly gives the vertex type, is used when the query is written to accept a vertex of any type.
mst(("Ivy","Person")) on social10 graph, with Coworker edges
File output:
From,To,Weight
Ivy,Fiona,6
Ivy,Howard,4
Ivy,George,4
The attribute version requires a boolean attribute on the edge; it sets the attribute to true for each edge selected in the MST:
mst_attr(("Ivy","Person")) on social10 graph, with Coworker edges & edge attribute "flag"

Cycle Detection

Description and Uses

The Cycle Detection problem seeks to find all the cycles (loops) in a graph. We apply the usual restriction that the cycles must be "simple cycles", that is, they are paths that start and end at the same vertex but otherwise never visit any vertex twice.
There are two versions of the task: for directed graphs and undirected graphs. The GSQL algorithm library currently supports only directed cycle detection. The Rocha–Thatte algorithm is an efficient distributed algorithm, which detects all the cycles in a directed graph. The algorithm will self-terminate, but it is also possible to stop at k iterations, which finds all the cycles having lengths up to k edges.
The basic idea of the algorithm is to (potentially) traverse every edge in parallel, again and again, forming all possible paths. At each step, if a path forms a cycle, the algorithm records it and stops extending it. More specifically:
Initialization: For each vertex, record one path consisting of its own id. Mark the vertex as Active.
Iteration steps: For each Active vertex v:
  1. Send its list of paths to each of its out-neighbors.
  2. Inspect each path P in the list of the paths received:
    • If the first id in P is also id(v), a cycle has been found:
      • Remove P from its list.
      • If id(v) is the least id of any id in P, then add P to the Cycle List. (The purpose is to count each cycle only once.)
    • Else, if id(v) is somewhere else in the path, then remove P from the path list (because this cycle must have been counted already).
    • Else, append id(v) to the end of each of the remaining paths in its list.
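The message-passing rounds above can be sketched sequentially in Python. This is an illustrative sketch of the Rocha-Thatte idea, not the library's parallel GSQL implementation; the example graph in the test is hypothetical.

```python
def rocha_thatte_cycles(edges, max_iter):
    """Enumerate simple cycles in a directed graph, each reported once.

    edges: dict vertex -> list of out-neighbors.
    Returns a list of cycles, each as a list of vertex ids starting
    at the cycle's least id.
    """
    # Initialization: every vertex holds one path containing its own id.
    paths = {v: [[v]] for v in edges}
    cycles = []
    for _ in range(max_iter):
        inbox = {v: [] for v in edges}
        # Each active vertex sends its paths to its out-neighbors.
        for v, plist in paths.items():
            for u in edges.get(v, []):
                inbox.setdefault(u, []).extend(plist)
        paths = {}
        for v, received in inbox.items():
            keep = []
            for p in received:
                if p[0] == v:            # path closed on itself: a cycle
                    if v == min(p):      # report each cycle exactly once
                        cycles.append(p)
                elif v not in p:         # extend simple paths only
                    keep.append(p + [v])
            if keep:
                paths[v] = keep
        if not paths:                    # self-termination
            break
    return cycles
```

For a graph with edges A→B, B→C, B→A, C→A, the function reports the two cycles A-B and A-B-C.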

Specifications

cycle_detection (INT depth)
cycle_detection_file (INT depth, FILE f)
The algorithm traverses all edges. A user could modify the GSQL algorithm so that it traverses only edges of a certain type.
Result: Computes a list of vertex id lists, each of which is a cycle. The result is available in 2 forms:
  • streamed out in JSON format
  • written to a file in tabular format
Input Parameters:
  • depth: the maximum cycle length to search for = maximum number of iterations
  • filepath (for file output only): the path to the output file
Result Size: Number of cycles * average cycle length. Neither of these measures is known in advance.
Computational Complexity: O(E*k), E = number of edges, k = min(maximum cycle length, depth parameter)
Graph Types: Directed edges
Example

In the social10 graph, there are 5 cycles, all within the Fiona-George-Howard-Ivy cluster.
cycle_detection(10) on social10 graph
[
  {
    "@@cycles": [
      [
        "Fiona",
        "Ivy"
      ],
      [
        "George",
        "Ivy"
      ],
      [
        "Fiona",
        "George",
        "Ivy"
      ],
      [
        "George",
        "Howard",
        "Ivy"
      ],
      [
        "Fiona",
        "George",
        "Howard",
        "Ivy"
      ]
    ]
  }
]

Centrality Algorithms

Centrality algorithms determine the importance of each vertex within a network. Typical applications:
PageRank is designed for directed edges. The classic interpretation is to find the most "important" web pages, based on hyperlink referrals, but it can be used for any network where entities make positive referrals of one another.
Closeness Centrality and Betweenness Centrality both deal with the idea of "centrally located."

PageRank

Description and Uses

The PageRank algorithm measures the influence of each vertex on every other vertex. PageRank influence is defined recursively: a vertex's influence is based on the influence of the vertices which refer to it. A vertex's influence tends to increase if (1) it has more referring vertices or if (2) its referring vertices have higher influence. The analogy to social influence is clear.
A common way of interpreting PageRank value is through the Random Network Surfer model. A vertex's pageRank score is proportional to the probability that a random network surfer will be at that vertex at any given time. A vertex with a high pageRank score is a vertex that is frequently visited, assuming that vertices are visited according to the following Random Surfer scheme:
  • Assume a person travels or surfs across a network's structure, moving from vertex to vertex in a long series of rounds.
  • The surfer can start anywhere. This start-anywhere property is part of the magic of PageRank, meaning the score is a truly fundamental property of the graph structure itself.
  • Each round, the surfer randomly picks one of the outward connections from the surfer's current location. The surfer repeats this random walk for a long time.
  • But wait. The surfer doesn't always follow the network's connection structure. There is a probability (1-damping, to be precise), that the surfer will ignore the structure and will magically teleport to a random vertex.
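The Random Surfer scheme above corresponds to a simple iterative computation. The following Python sketch uses the same parameter names as the library query (maxChange, maxIter, damping) but is a simplified illustration, not the library's implementation.

```python
def pagerank(edges, damping=0.85, max_change=0.001, max_iter=25):
    """Basic PageRank iteration on a directed graph.

    edges: dict vertex -> list of out-neighbors.
    Every vertex gets a baseline of (1 - damping); the remaining score
    flows along out-edges. A vertex with no out-edges simply does not
    redistribute its score in this simplified version.
    """
    vertices = set(edges) | {u for outs in edges.values() for u in outs}
    score = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        for v, outs in edges.items():
            for u in outs:                     # split v's score over out-edges
                incoming[u] += score[v] / len(outs)
        new = {v: (1 - damping) + damping * incoming[v] for v in vertices}
        done = max(abs(new[v] - score[v]) for v in vertices) <= max_change
        score = new
        if done:                               # scores have stabilized
            break
    return score
```

As in the social10 example below, a vertex with no in-edges settles at exactly (1 - damping) = 0.15.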

Specifications

pageRank(FLOAT maxChange, INT maxIter, FLOAT damping, BOOL display, INT outputLimit)
pageRank_file(FLOAT maxChange, INT maxIter, FLOAT damping, FILE f)
pageRank_attr(FLOAT maxChange, INT maxIter, FLOAT damping)
Result: Computes a PageRank value (FLOAT type) for each vertex. The result is available in 3 forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as a vertex attribute value.
Input Parameters:
  • maxChange: PageRank will stop iterating when the largest difference between any vertex's current score and its previous score is ≤ maxChange. That is, the scores have become very stable, changing by less than maxChange from one iteration to the next. Suggested value: 0.001 or less.
  • maxIter: maximum number of iterations. Suggested value: between 10 and 100.
  • damping: fraction of a vertex's score that is due to the scores of its neighbors. The balance (1 - damping) is a minimum baseline score that every vertex receives. Suggested value: 0.85.
  • f (for file output only): the path to the output file
  • display (for JSON output only): If true, include the graph's edges in the JSON output, so that the full graph can be displayed.
  • outputLimit (for JSON output only): maximum number of vertex values to output. Values will be sorted with highest value first.
Result Size: V = number of vertices
Computational Complexity: O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation.
Graph Types: Directed edges

Example

We ran pageRank on our test10 graph (using Friend edges) with the following parameter values: damping=0.85, maxChange=0.001, and maxIter=25. We see that Ivy (center bottom) has the highest pageRank score (1.12). This makes sense, since 3 neighboring persons point to Ivy, more than for any other person. Eddie and Justin have scores of exactly 1 because they do not have any out-edges; this is an artifact of our particular version of pageRank. Likewise, Alex has a score of 0.15, which is (1 - damping), because Alex has no in-edges.
pageRank_attr(0.001, 25, 0.85,"json",10) on social10 graph, with Friend edges

Personalized PageRank

Description and Uses

In the original PageRank, the damping factor is the probability that the surfer continues browsing at each step. The surfer may also stop browsing and start again from a random vertex. In personalized PageRank, the surfer can only start browsing from a given set of source vertices, both at the beginning and after stopping.
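The difference from standard PageRank can be sketched by restricting the restart mass to the source set: only source vertices receive the (1 - damping) baseline. This is an illustrative Python sketch, not the library's implementation.

```python
def personalized_pagerank(edges, sources, damping=0.85,
                          max_change=0.001, max_iter=25):
    """Personalized PageRank sketch on a directed graph.

    edges: dict vertex -> list of out-neighbors; sources: set of vertices.
    The (1 - damping) restart mass goes only to the sources, so vertices
    unreachable from the sources keep a score of 0.
    """
    vertices = set(edges) | {u for outs in edges.values() for u in outs}
    base = (1 - damping) / len(sources)
    score = {v: (base if v in sources else 0.0) for v in vertices}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        for v, outs in edges.items():
            for u in outs:
                incoming[u] += score[v] / len(outs)
        new = {v: (base if v in sources else 0.0) + damping * incoming[v]
               for v in vertices}
        done = max(abs(new[v] - score[v]) for v in vertices) <= max_change
        score = new
        if done:
            break
    return score
```

As in the example below, vertices that cannot be reached from the source set end up with a score of 0.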

Specifications

pageRank_pers(Set<Vertex> source, FLOAT maxChange, INT maxIter, FLOAT damping, INT outputLimit)
pageRank_pers_file(Set<Vertex> source, FLOAT maxChange, INT maxIter, FLOAT damping, FILE f)
pageRank_pers_attr(Set<Vertex> source, FLOAT maxChange, INT maxIter, FLOAT damping)
Result: Computes a personalized PageRank value (FLOAT type) for each vertex. The result is available in 3 forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as a vertex attribute value.
Input Parameters:
  • source: a set of source vertices
  • maxChange: personalized PageRank will stop iterating when the largest difference between any vertex's current score and its previous score is ≤ maxChange. That is, the scores have become very stable, changing by less than maxChange from one iteration to the next. Suggested value: 0.001 or less.
  • maxIter: maximum number of iterations. Suggested value: between 10 and 100.
  • damping: fraction of a vertex's score that is due to the scores of its neighbors. The balance (1 - damping) is a minimum baseline score that every vertex receives. Suggested value: 0.85.
  • f (for file output only): the path to the output file
  • outputLimit (for JSON output only): maximum number of vertex values to output. Values will be sorted with highest value first.
Result Size: V = number of vertices
Computational Complexity: O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation.
Graph Types: Directed edges

Example

We ran Personalized PageRank on our test10 graph using Friend edges with the following parameter values: damping=0.85, maxChange=0.001, maxIter=25, and source="Fiona". In this case, the random walker can only start or restart walking from Fiona. In the figure below, we see that Fiona has the highest pageRank score in the result. Ivy and George have the next highest scores, because they are direct out-neighbors of Fiona and there are looping paths that lead back to them again. Half of the vertices have a score of 0, since they cannot be reached from Fiona.
pageRank_pers_attr([("Fiona","Person")],0.001,25,0.85) on social10 graph, with Friend edges

Closeness Centrality

We all have an intuitive understanding of what we mean when we say a home, an office, or a store is "centrally located." Closeness Centrality provides a precise measure of how "centrally located" a vertex is. The steps below show the computation for one vertex v.
Description of Steps
Mathematical Formulation
1. Compute the average distance from vertex v to every other vertex:
d_{avg}(v) = \sum_{u \ne v} dist(v,u) / (n-1)
2. Invert the average distance, so we have average closeness of v:
CC(v) = 1 / d_{avg}(v)
These steps are repeated for every vertex in the graph.
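The two steps above can be sketched in Python with one breadth-first search per vertex. This is an illustrative sketch, not the library's GSQL code; it assumes a connected, unweighted graph stored as an adjacency dict, and the function name is our own.

```python
# Illustrative sketch (not the library's GSQL code): closeness centrality
# via BFS, following the two steps above for every vertex.
from collections import deque

def closeness_centrality(adj):
    """adj: dict vertex -> list of neighbors (unweighted edges);
    assumes every vertex can reach every other vertex."""
    n = len(adj)
    cc = {}
    for v in adj:
        # Step 0: BFS distances from v to every reachable vertex.
        dist = {v: 0}
        queue = deque([v])
        while queue:
            cur = queue.popleft()
            for u in adj[cur]:
                if u not in dist:
                    dist[u] = dist[cur] + 1
                    queue.append(u)
        # Steps 1-2: average distance to the other n-1 vertices, inverted.
        total = sum(dist.values())
        cc[v] = (n - 1) / total if total > 0 else 0.0
    return cc
```

On a 3-vertex path A-B-C, the middle vertex B is at distance 1 from both others, so its average distance is 1 and its closeness is 1.0, higher than the endpoints.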

Specifications

closeness_cent(BOOL display, INT maxOutput)
closeness_cent_file(BOOL display, STRING filepath)
closeness_cent_attr(BOOL display)
Parameters
Characteristic
Value
Result
Computes a Closeness Centrality value (FLOAT type) for each vertex. The result is available in 3 forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as a vertex attribute value.
Required Input Parameters
  • display: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.
  • filepath (for file output only): the path to the output file
  • maxOutput (for JSON output only): maximum number of vertex values to output. Values will be sorted with highest value first.
Result Size
V = number of vertices
Computational Complexity
O(E*k), E = number of edges, k = number of iterations.
The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation.
Graph Types
Directed or Undirected edges, Unweighted edges

Example

Closeness centrality can be measured for either directed edges (from v to others) or for undirected edges. Directed graphs may seem less intuitive, however, because if the distance from Alex to Bob is 1, it does not mean the distance from Bob to Alex is also 1.
For our example, we wanted to use the topology of the Friend graph, but to have undirected edges. We emulated an undirected graph by using both Friend and Also_Friend (reverse direction) edges.
closeness_cent("json",10) on social10 graph, with Friend and Also_Friend edges

Betweenness Centrality

The Betweenness Centrality of a vertex is defined as the number of shortest paths which pass through this vertex, divided by the total number of shortest paths. That is
BC(v) = \sum_{s \ne v \ne t} PD_{st}(v) = \sum_{s \ne v \ne t} SP_{st}(v) / SP_{st},
where PD_{st}(v) is called the pair dependency, SP_{st} is the total number of shortest paths from node s to node t, and SP_{st}(v) is the number of those paths that pass through v.
The TigerGraph implementation is based on A Faster Algorithm for Betweenness Centrality by Ulrik Brandes, Journal of Mathematical Sociology 25(2):163-177 (2001). For every vertex s in the graph, the pair dependency from vertex s to all other vertices t via all other vertices v is computed first:
PD_{s*}(v) = \sum_{t \in V} PD_{st}(v).
Then betweenness centrality is computed as
BC(v) = \sum_{s \in V} PD_{s*}(v) / 2.
According to Brandes, the accumulated pair dependency can be calculated as
PD_{s*}(v) = \sum_{w: v \in P_s(w)} SP_{sv} / SP_{sw} \cdot (1 + PD_{s*}(w)),
where P_s(w), the set of predecessors of vertex w on shortest paths from s, is defined as
P_s(w) = \{u \in V : \{u, w\} \in E, dist(s, w) = dist(s, u) + dist(u, w)\}.
For each single vertex s, the algorithm works in two phases. The first phase calculates the number of shortest paths from s passing through each vertex. Then, starting from the vertices farthest from s with pair dependency initialized to 0, it traverses back toward s in non-increasing order of distance, accumulating the pair dependencies along the way.
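The two phases can be sketched in Python as follows. This is an illustrative sketch of Brandes' method for undirected, unweighted graphs, not the library's GSQL code; the names and the adjacency-dict representation are our own.

```python
# Illustrative sketch (not the library's GSQL code) of Brandes' algorithm:
# phase 1 counts shortest paths by BFS; phase 2 accumulates pair
# dependencies backward, from the farthest vertices toward the source.
from collections import deque

def betweenness_centrality(adj):
    """adj: dict vertex -> list of neighbors (undirected, unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths and predecessors.
        sp = {v: 0 for v in adj}       # SP_sv: number of shortest s-v paths
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}   # P_s(w): predecessors on shortest paths
        sp[s], dist[s] = 1, 0
        order = []                     # visit order, non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sp[w] += sp[v]
                    preds[w].append(v)
        # Phase 2: accumulate pair dependencies from the outside in.
        pd = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                pd[v] += sp[v] / sp[w] * (1 + pd[w])
            if w != s:
                bc[w] += pd[w]
    # Each undirected path is counted from both endpoints, so halve.
    return {v: bc[v] / 2 for v in bc}
```

On a 3-vertex path A-B-C, the single shortest path between A and C passes through B, so B's score is 1 and the endpoints score 0.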

Specifications

betweenness_cent(INT maxHops)
betweenness_cent_file(STRING filepath, INT maxHops)
betweenness_cent_attr(INT maxHops)
Parameters
Characteristic
Value
Result
Computes a Betweenness Centrality value (FLOAT type) for each vertex. The result is available in 3 forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as a vertex attribute value.
Required Input Parameters
  • filepath (for file output only): the path to the output file
  • maxHops: maximum number of iterations
Result Size
V = number of vertices
Computational Complexity
O(E*V), E = number of edges, V = number of vertices.
Considering the high time cost of running this algorithm on a big graph, the user can set a maximum number of iterations. Parallel processing reduces the time needed for computation.
Graph Types
Undirected edges, Unweighted edges

Example

In the example below, Claire is at the very center of the social graph and has the highest betweenness centrality. Six shortest paths pass through Sam (the paths from Victor to the 6 people other than Sam and Victor), so Sam's score is 6. David also has a score of 6, since Brian has 6 shortest paths to other people that pass through David.
betweenness_cent_attr(10) on a social graph with undirected edges Friend
[
  {
    "@@BC": {
      "Alice": 0,
      "Frank": 0,
      "Claire": 17,
      "Sam": 6,
      "Brian": 0,
      "David": 6,
      "Richard": 0,
      "Victor": 0
    }
  }
]
In the following example, both Charles and David have 9 shortest paths passing through them. Ellen is in a similar position as Charles, but her centrality is weakened due to the path between Frank and Jack.
betweenness_cent_attr(10) on a social graph with undirected edges Friend
[
  {
    "@@BC": {
      "Alice": 0,
      "Frank": 0,
      "Charles": 9,
      "Ellen": 8,
      "Brian": 0,
      "David": 9,
      "Jack": 0
    }
  }
]

Community Algorithms

These algorithms evaluate how a group is clustered or partitioned, as well as its tendency to strengthen or break apart.

Connected Components

Description and Uses

A component is a maximal set of vertices, plus their connecting edges, which are interconnected: you can reach each vertex from every other vertex in the set. In the example figure below, there are three components.
This particular algorithm deals with undirected edges. If the same definition (each vertex can reach each other vertex) is applied to directed edges, then the components are called Strongly Connected Components. If you have directed edges but ignore the direction (permitting traversal in either direction), then the algorithm finds Weakly Connected Components.
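The idea can be sketched in Python. This is an illustrative sketch, not the library's GSQL code; the library's version labels components with the engine's internal vertex ids, while this sketch simply floods each component by BFS and uses the first vertex key found (in sorted order) as the component id.

```python
# Illustrative sketch (not the library's GSQL code): every vertex in a
# connected component gets the same component id, found by BFS flooding.
from collections import deque

def connected_components(adj):
    """adj: dict vertex -> list of neighbors (undirected edges)."""
    comp = {}
    for start in sorted(adj):
        if start in comp:
            continue
        # Everything reachable from `start` joins its component.
        comp[start] = start
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in comp:
                    comp[u] = start
                    queue.append(u)
    return comp
```

Two vertices end up with the same id exactly when some path connects them, mirroring the id-per-component result the library query returns.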

Specifications

conn_comp()
conn_comp_file(STRING filepath)
conn_comp_attr()
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. The result is available in three forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as a vertex attribute value.
Input Parameters
  • filepath (for file output only): the path to the output file
Result Size
V = number of vertices
Computational Complexity
O(E*d), E = number of edges, d = max(diameter of components)
Graph Types
Undirected edges

Example

It is easy to see in this small graph that the algorithm correctly groups the vertices:
  • Alex, Bob and Justin all have Community ID = 2097152
  • Chase, Damon, and Eddie all have Community ID = 5242880
  • Fiona, George, Howard, and Ivy all have Community ID = 0
Our algorithm uses the TigerGraph engine's internal vertex ID numbers; they cannot be predicted.
conn_comp(true, "json") on social10 graph with Coworker edges

Strongly Connected Components

Description and Uses

A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. An SCC algorithm finds the maximal SCCs within a graph. Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm [1]. In each iteration, pick a pivot vertex v at random and find its descendant and predecessor sets, where the descendant set D_v is the set of vertices reachable from v, and the predecessor set P_v is the set of vertices that can reach v (stated another way, reachable from v through reverse edges). The intersection of these two sets is a strongly connected component, SCC_v. The graph can then be partitioned into 4 sets: SCC_v, the descendants D_v excluding SCC_v, the predecessors P_v excluding SCC_v, and the remainder R_v. It is proved that any SCC is a subset of one of these 4 sets [1]. Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.
The problem with this algorithm is unbalanced load and slow convergence when there are many small SCCs, which is often the case in real-world graphs [3]. We added two trimming stages to improve the performance: size-1 SCC trimming [2] and weakly connected components [3].
The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.
[1] Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.
[2] Mclendon Iii, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.
[3] Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
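The divide-and-conquer step described above can be sketched in Python. This is an illustrative sketch of the basic DCSC idea from [1], not the library's GSQL code; it uses recursion where the GSQL version iterates, and it omits the two trimming optimizations.

```python
# Illustrative sketch (not the library's GSQL code) of DCSC [1]: the
# pivot's forward- and backward-reachable sets intersect in its SCC.
def reach(edges, start, active):
    """All vertices in `active` reachable from `start` via `edges`."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in edges.get(v, []):
            if u in active and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def dcsc(fwd, rev, active=None):
    """fwd: vertex -> out-neighbors; rev: the same edges reversed."""
    if active is None:
        active = set(fwd) | {u for ns in fwd.values() for u in ns}
    if not active:
        return []
    pivot = next(iter(active))
    desc = reach(fwd, pivot, active)   # D_v: reachable from the pivot
    pred = reach(rev, pivot, active)   # P_v: can reach the pivot
    scc = desc & pred                  # SCC_v containing the pivot
    sccs = [scc]
    # Every other SCC lies entirely inside one of the three remainders,
    # so each can be processed independently.
    for part in (desc - scc, pred - scc, active - desc - pred):
        sccs.extend(dcsc(fwd, rev, part))
    return sccs
```

Note that, as in the library's implementation, the reverse edge set must be available for the backward reachability search.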

Specifications

scc(INT iter = 500, INT iter_wcc = 5, INT top_k_dist)
scc_file(INT iter = 500, INT iter_wcc = 5, INT top_k_dist, FILE f)
scc_attr(INT iter = 500, INT iter_wcc = 5, INT top_k_dist)
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. The result is available in three forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as a vertex attribute value.
Input Parameters
  • iter: maximum number of iterations of the algorithm
  • iter_wcc: find weakly connected components for the vertices still active in this iteration; since the largest SCCs are already found after the first several iterations, this is usually a small number (3 to 10)
  • top_k_dist: number of top entries to report in the SCC size distribution
  • f (for file output only): the path to the output file
Result Size
V = number of vertices
Computational Complexity
O(iter*d), d = max(diameter of components)
Graph Types
Directed edges with reverse direction edges as well

Example

We ran scc on the social26 graph. A portion of the JSON result is shown below.
[
  {
    "i": 1
  },
  {
    "trim_set.size()": 8
  },
  {
    "trim_set.size()": 5
  },
  {
    "trim_set.size()": 2
  },
  {
    "trim_set.size()": 2
  },
  {
    "trim_set.size()": 0
  },
  {
    "@@cluster_dist_heap": [
      {
        "csize": 9,
        "num": 1
      },
      {
        "csize": 1,
        "num": 17
      }
    ]
  },
The first element, "i": 1, means the whole graph was processed in just one iteration. The five "trim_set.size()" elements mean there were 5 rounds of size-1 SCC trimming. The final "@@cluster_dist_heap" object reports the size distribution of SCCs: there is one SCC with 9 vertices, and 17 SCCs with only 1 vertex.

Label Propagation

Description and Uses

Label Propagation is a heuristic method for determining communities. The idea is simple: if a plurality of your neighbors bear the label X, then you should label yourself as also a member of X. The algorithm begins with each vertex having its own unique label. Then we iteratively update labels based on the neighbor influence described above. It is important that the order for updating the vertices be random. The algorithm is favored for its efficiency and simplicity, but it is not guaranteed to produce the same results every time.
In a variant version, some vertices could initially be known to belong to the same community. If they are well-connected to one another, they are likely to preserve their common membership and influence their neighbors.
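The basic (non-variant) procedure can be sketched in Python, with vertices updated one at a time in a random order. This is an illustrative sketch, not the library's GSQL code; the pass structure and the tie-breaking rule (first label encountered wins a tie) are our own simplifications.

```python
# Illustrative sketch (not the library's GSQL code): label propagation.
# Each vertex adopts the most common label among its neighbors,
# starting from one unique label per vertex.
import random
from collections import Counter

def label_propagation(adj, max_iter=20, seed=0):
    """adj: dict vertex -> list of neighbors (undirected edges)."""
    rng = random.Random(seed)        # fixed seed for repeatable runs
    labels = {v: v for v in adj}     # every vertex starts with its own label
    order = list(adj)
    for _ in range(max_iter):
        rng.shuffle(order)           # the random update order matters
        changed = False
        for v in order:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best, _ = counts.most_common(1)[0]
            if labels[v] != best:
                labels[v] = best
                changed = True
        if not changed:              # stable labeling reached
            break
    return labels
```

Because labels only spread along edges, separate components always end with distinct labels, and densely connected groups tend to converge to a single label.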

Specifications

label_prop(INT maxIter)
label_prop_file(INT maxIter, FILE filepath)
label_prop_attr(INT maxIter)
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. The result is available in three forms:
  • streamed out in JSON format
  • written to a file in tabular format, or
  • stored as a vertex attribute value.
Input Parameters
  • maxIter: the maximum number of update iterations.
  • filepath (for file output only): the path to the output file
Result Size
V = number of vertices
Computational Complexity
O(E*k), E = number of edges, k = number of iterations.
Graph Types
Undirected edges

Example

This is the same graph that was used in the Connected Component example, but the results are different. The quartet of Fiona, George, Howard, and Ivy has been split into 2 groups. We can see the symmetry:
  • (George & Ivy) each connect to (Fiona & Howard) and to one another.
  • (Fiona & Howard) each connect to (George & Ivy) but not to one another.
Label Propagation tries t