This is ericpony's blog

Wednesday, May 4, 2016

Active Set in GraphX Pregel

The original Pregel computation abstraction takes a vertex-centric view to express graph programs. In this abstraction, a vertex is a two-state machine which is either inactive or active. Each active vertex concurrently runs a vertex program, which can send messages to other vertices, update the value of the vertex, or deactivates the vertex (aka. votes to halt). A deactivated vertex remains inactive in subsequent execution steps unless it receives a message. A vertex activated by a message must vote to halt again to get deactivated. No computation is associated with edges.
The GraphX computation framework realizes Pregel based on the Gather-Apply-Scatter (GAS) pattern. GAS breaks down the computation of vertex program into purely edge-parallel and vertex-parallel stages with three user-defined functions, called gather, apply and scatter: the gather function is an associative and commutative operator over the inbound messages of a vertex; the apply function operates on a vertex and updates its value based on the inbound message; the scatter function computes and sends messages along each edge. Note that, unlike more standard implementations of Pregel, GraphX Pregel does not allow modifying edge attributes and graph topology during execution. Besides, it allows message exchanges only between the neighboring vertices. These restrictions are imposed so as to increase parallelism and enable more optimizations within GraphX.
GraphX Pregel does not provide operations to deactivate a vertex (i.e. to let a vertex vote to halt). Instead, it automatically avoids executing the scatter function on edges where neither endpoint has received a message. These edges will remain skipped until their endpoints receive a message again. Hence, an edge in GraphX Pregel is also a two-state machine, just like a vertex in the original Pregel: an edge will be deactivated if its endpoints do not receive any message and will be activated after its endpoints receive a message. In practice, GraphX decides whether an edge is activated or not by bookkeeping the set of vertices that received messages from the last execution step, called the active set. Hence, an edge is activated iff it is incident to some vertex in the active set. In addition, GraphX also uses the size of the active set as an indicator of workload and adjust message delivery strategy accordingly at runtime.
Adopting the bookkeeping mechanism allows GraphX Pregel to avoid unnecessary computation and adjust its message delivery strategy at runtime. The mechanism, however, can increase the complexity to reason about the behaviors of a Pregel program. Fortunately, it turns out that enabling bookkeeping does not change the observable behaviors of a Pregel program under certain conditions. More precisely, we have the following two facts:
Fact 1. The outcome of a GraphX Pregel program is not affected by the choice of message delivery strategy if the gather function is associative and commutative.
Fact 2. The outcome of a GraphX Pregel program is not affected by the bookkeeping mechanism if the scatter function only depends on the edge triplet, namely, the edge attribute, the source vertex, and the destination vertex.
The first fact is clear. To see why the second fact is true, consider two versions of GraphX frameworks, one of which executes the scatter function only on activated edges and the other executes it on all edges. Given the condition in Fact 2 and a deterministic apply function, we will show that a Pregel program produces the same output graph in both frameworks from the same initial message and input graph. The proof is done by induction on the number of execution steps taken by the program:
Step 0. Since the apply function is deterministic and the initial message is sent to all vertices without executing the scatter function, the program produces the same graph in both frameworks after the first step.
Step n+1. Suppose the program produces the same graph after the nth step. Consider an edge e in the (n+1)-th step. If e is active, then the scatter function behaves the same on it in both frameworks. If e is inactive, then two facts about its endpoints hold in the last step: i) they did not receive any message from e, and ii) they did not change values. Since the scatter function only checks the edge triplet of e, which is unchanged since the last step, the function will not send any message along e even if it is executed on e in this step. Hence, the scatter function sends the same messages to the same vertex in both frameworks. Moreover, since the gather function is associative and commutative, each vertex receives the same reduced inbound message in both frameworks. Finally, since the apply function is deterministic, the program will produce the same graph in both frameworks after the (n+1)-th step. This concludes our proof.
Note that the condition in Fact 2 holds when the scatter function is pure. Hence, we can safely assume that all edges are activated when we are simulating or reasoning about a pure Pregel program in GraphX. In the original Pregel model, however, active set is essential to decide whether vertex program should be executed on a vertex. It is the GAS decomposition and the restriction on edge modification that enable us to eliminate the use of active set in GraphX.

References

1. Pregel API in GraphX Programming Guide and its source code on GitHub.
2.Xin, Reynold S., et al. "Graphx: A resilient distributed graph system on spark." First International Workshop on Graph Data Management Experiences and Systems. 2013.
3. Xin, Reynold S., et al. "GraphX: Unifying data-parallel and graph-parallel analytics." arXiv preprint. 2014.
4. Gonzalez, Joseph E., et al. "Graphx: Graph processing in a distributed dataflow framework." OSDI. 2014.
Related Posts Plugin for WordPress, Blogger...