Ericpony's Blog: June 2015

Monday, June 15, 2015

Writing decentralized algorithms with Pregel: A case study

In this post, we demonstrate how to implement a purely functional decentralized algorithm using the Pregel abstraction offered by the Spark GraphX library.

A vertex coloring of a graph is a way to color the vertices such that no two adjacent vertices share the same color. Formally, given an undirected graph $G=(V,E)$, a k-coloring of $G$ is a map $f: V \rightarrow \{1,...,k\}$ such that $f(v) \neq f(u)$ for any $(v,u) \in E$. A graph is called k-colorable if it has a $k$-coloring.

It takes linear time to find a $(\Delta+1)$-coloring for any graph with maximum degree $\Delta$. On the other hand, finding a 4-coloring for a 3-colorable graph is already NP-hard. There are many randomized algorithms in the literature that color a graph in exponential time and linear space. Among them, the Communication-Free Learning (CFL) algorithm [1] is a fully decentralized algorithm inspired by the problem of allocating non-interfering channels in IEEE 802.11 wireless networks.

The CFL algorithm works as follows. For each vertex $v \in V$, let $C_n(v) \in \{1,...,k\}$ denote the color of $v$ in the $n$-th iteration. Also, let $P_n(v,c)$ denote the probability that vertex $v$ has color $c$ in the $n$-th iteration. That is, $\Pr\{C_n(v) = c\} = P_n(v,c)$. At $n = 0$, the color of each vertex $v$ is chosen uniformly at random, i.e., $P_0(v, c) = 1/k$ for all $v \in V$ and $c = 1,...,k$. For $n \ge 1$, the color of $v$ is chosen randomly according to the following rule: If no adjacent vertex to $v$ shares the same color with $v$, then we let $P_n(v, c) = 1$ for $c = C_{n-1}(v)$ and let $P_n(v, c) = 0$ otherwise. Note that this implies $C_n(v) = C_{n-1}(v)$. If there is a collision of color between $v$ and its neighbors, then we choose $C_n(v)$ according to probability distribution $P_n(v, \cdot)$, such that for $c = 1,...,k$,
$$ P_{n}(v,c)=\begin{cases} (1-\beta)\cdot P_{n-1}(v,c), & c=C_{n-1}(v)\\ (1-\beta)\cdot P_{n-1}(v,c)+\beta\cdot(k-1)^{-1}, & \mbox{otherwise,} \end{cases} $$ where $\beta \in (0,1)$ is a pre-determined parameter. Observe that the vertex colors stabilize if and only if they form a $k$-coloring.
The CFL algorithm can be executed via the following decentralized procedure:

Set up $C_0(v)$ and $P_0(v,\cdot)$ for each vertex $v$ and let all vertices be active.
A) An active vertex $v$ is deactivated if no vertices in $N(v)$ share the same color with $v$. An inactive vertex $v$ is activated if there is a vertex in $N(v)$ sharing the same color with $v$.
B) Each vertex $v$ computes $C_n(v)$ and $P_n(v,\cdot)$ upon the $n$-th change of the vertex state (from active to inactive or vice versa).
Repeat Step 2 until all vertices are simultaneously inactive.

Thus we implement the algorithm by carrying out the above procedure with Pregel. A vertex $v$ is represented as a 4-tuple

type Palette = (Color, List[Double], Boolean, Random)

composed of the vertex color $C_n(v)$, the color distribution $P_n(v,\cdot)$, the vertex state (active/inactive), and a random number generator in turn. Given a graph RDD graph, we first construct its base graph and initialize the attributes of the vertices:

val seed = Random.nextInt
val baseGraph = graph.mapVertices((id, _) => {
  val rng = new Random(seed + id)
  (rnd.nextInt(k + 1), initColorDist, true, rng)
})

Note that we choose to maintain an RNG at each vertex. One may feel tempted to declare a global RNG and share it among the vertices. However, since Spark serializes the closure as it dispatches jobs, doing so will lead each vertex to use a copy of the global RNG instead of sharing it. Hence, all vertices will obtain the same sequence of random numbers upon their queries, which is definitely an undesired situation.
We use functions sendMsg and mergeMsg to carry out Part A of the second step in the procedure:

def sendMsg (edge: EdgeTriplet[Palette, _]): Iterator[(VertexId, Boolean)] = {
  if (edge.srcAttr._1 == edge.dstAttr._1)
    return Iterator((edge.srcId, true))
  if (edge.srcAttr._3)
    return Iterator((edge.srcId, false))
  Iterator.empty
}
def mergeMsg (msg1: Boolean, msg2: Boolean) = msg1 || msg2

For each edge $(v, u)$, the sendMsg function can activate or deactivate the source vertex $v$ by sending it a Boolean message. The messages are aggregated by mergeMsg via disjunction, such that a vertex is activated if any of its neighbors wants to activate it, and is deactivated if all of its neighbors want to deactivate it.
Part B of the second step is carried out by the vprog function:

def vprog (id: VertexId, attr: Palette, active: Boolean): Palette = {
  val color = attr._1
  val dist  = attr._2
  val rng   = attr._4
  val newDist = dist.foldLeft((1, List[Double]())) {
    case ((i, list), weight) => (i + 1,
      if (active)
        list :+ (weight * (1 - beta) + (if (color == i) 0.0 else beta / (maxNumColors - 1)))
      else
        list :+ (if (color == i) 1.0 else 0.0))
  }._2
  val newColor = if (active) sampleColor(newDist, rng.nextDouble) else color
  (newColor, newDist, active, rng)
}

The vprog function uses a helper function colorSampler to color a vertex randomly according to its current color distribution:

def sampleColor (dist: List[Double], rnd: Double): Int = {
  dist.foldLeft((1, 0.0)) {
    case ((color, mass), weight) => {
      val m = mass + weight
      (if (m < rnd) color + 1 else color, m)
    }}._1
}

Finally, we invoke Pregel with the above functions to compute a $k$-coloring for the base graph baseG:

Pregel(baseGraph, true)(vprog, sendMsg, mergeMsg).mapVertices((_, attr) => attr._1)

It is shown in [2] that the CFL algorithm colors a graph in exponential time with high probability. More precisely, given any $\epsilon \in (0,1)$, the algorithm can find a $k$-coloring for a $k$-colorable graph with probability $1-\epsilon$ in $O(N\exp^{cN\lg 1/\epsilon})$ iterations, where $N$ is the number of vertices and $c$ is a constant depending on $\beta$.

References

1. D. J. Leith, P. Clifford, "Convergence of distributed learning algorithms for optimal wireless channel allocation." IEEE CDC, 2006.
2. Duffy, Ken R., et al. "Complexity analysis of a decentralised graph colouring algorithm." Information Processing Letters, 107.2, 2008.