This is ericpony's blog

Saturday, June 21, 2014

Verifying the Commutativity of Reduce Functions

MapReduce is a programming model that assumes as input a large, unordered stream of data. Computation essentially proceeds in three steps:
Map: A map function is applied to each input record, outputting zero or more intermediate key-value pairs of an arbitrary type.
Shuffle: All intermediate key-value pairs are grouped and sorted by key, so that pairs with the same key can be reduced together.
Reduce: A reduce function combines the values sharing a key and produces a result associated with that key in the final output.
Note that there is in fact an optional optimization step between map and shuffle that combines values by key on a map node. This is useful because it reduces the number of key-value pairs to be shuffled. We ignore this step here, as it is irrelevant to the topics discussed below.
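To make the three steps concrete, here is a toy, purely in-memory simulation of word counting in Scala. This is illustrative only: the names are mine and no actual MapReduce framework is involved.
val input = List("a rose is a rose", "is a rose")
// Map: emit one (word, 1) pair per word.
val pairs  = input.flatMap(line => line.split(" ").map(word => (word, 1)))
// Shuffle: group the intermediate pairs by key.
val groups = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
// Reduce: combine the values of each key, here by summation.
val counts = groups.map { case (k, vs) => (k, vs.sum) }
// counts == Map("a" -> 3, "rose" -> 3, "is" -> 2)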

While records are sorted by key before they reach the reduce function, for any particular key the order in which the values appear is not stable from one run to the next, since they come from different map tasks, which may finish at different times. As a result, most MapReduce programs are written so as not to depend on the order in which the values appear to the reduce function. In other words, an implementation of a reduce function is expected to be commutative with respect to its input. Formally, a reduce function, which we may alternatively call a reducer hereafter, is commutative if for each input list $L$ it returns the same result for every permutation $\sigma$ of that list: $$\forall L, \sigma: \mathit{reduce}(k, L) = \mathit{reduce}(k, \sigma(L))$$ Note that we treat reduce functions in a very high-level manner: a reducer in our mind is just a function that takes a key and a list as input and returns a value as output. Moreover, we may assume without loss of generality that keys and values are all integers, say by considering their hash values instead of their actual values.
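For a concrete illustration, consider the following two reducers, written as plain Scala functions over a key and a list of values as in our model (not a real MapReduce API). Summation is commutative, while taking the difference of the first two values is not:
def sumReduce(key: Int, values: List[Int]): Int = values.sum

def diffReduce(key: Int, values: List[Int]): Int = values match {
  case a :: b :: _ => a - b   // depends on the order of the first two values
  case _           => 0
}

assert(sumReduce(0, List(1, 2, 3)) == sumReduce(0, List(3, 2, 1)))   // commutative
assert(diffReduce(0, List(1, 2)) != diffReduce(0, List(2, 1)))       // not commutative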
The main purpose of this post is to initiate an investigation into model checking the commutativity of reduce functions. Our modeling language supports Boolean and integer data types, Boolean operations and integer arithmetic, if-then-else and the control structures that can be constructed from it, return statements, and a while loop over the iterator of the input list. Moreover, we introduce non-determinism to the language so that we can model uninterpreted function calls, etc. While this model may look like an over-simplification of the reducers used in practical MapReduce programs, it hopefully captures enough ingredients for us to derive non-trivial results that may shed light on more realistic use cases. Our first theorem about this model is as follows.
Theorem. The commutativity of reducers is undecidable.
Proof. We shall reduce the Diophantine problem, i.e., determining the solvability of a system of Diophantine equations, which is known to be undecidable in general, to the problem of determining the commutativity of reducers. Given a Diophantine equation system $P: P_1(x_1,...,x_k)=0$, ..., $P_n(x_1,...,x_k)=0$, we define a reducer as follows:
public int reduce(int key, Iterator<int> values) {
    // Read k values from the input as a candidate solution of P
    // (schematic: one line per variable).
    int x1, ..., xk;
    if(values.hasNext()) x1 = values.next(); else return 0;
    ...
    if(values.hasNext()) xk = values.next(); else return 0;

    // Return 0 unless (x1,...,xk) solves every equation of P.
    if(P1(x1,...,xk)!=0) return 0;
    ...
    if(Pn(x1,...,xk)!=0) return 0;

    // A solution was read: the output now depends on the order
    // in which the next two values are iterated.
    int y1, y2;
    if(values.hasNext()) y1 = values.next(); else return 0;
    if(values.hasNext()) y2 = values.next(); else return 0;
    return y1 - y2;
}
It is clear that if the equation system $P$ has no solution, the reduce function always returns zero regardless of its input, i.e., it is commutative. On the other hand, if there is a solution, then the return value depends on the input values as well as on the order in which they are iterated, i.e., the function is not commutative. Note that returning $y_1-y_2$ makes the reducer non-commutative even when $P$ has a unique solution with $x_1=\dots=x_k$. In this way, we have reduced the problem of checking the solvability of $P$ to that of checking the commutativity of a reducer. This concludes our proof. $\square$
As verifying commutativity is undecidable, the best we can hope for is a semi-algorithm that is effective on interesting case studies. Our first attempt is to use abstract interpretation, along with counter abstraction for lists, and reduce the verification problem to a reachability problem. Let $t,n\in\mathbb N$ be refinement parameters. A state is a 4-tuple $(L,pc,itr,V)$ where $L$ is an abstract list value, $pc$ is the program counter, $itr$ is the position of the iterator, and $V$ is the valuation of the variables in the abstract domain $[-n, n]$. An abstract list is a prefix of size $t$ followed by a subset of $\{-n,...,n\}$ that represents the "tail". For example, the abstract list $1\rightarrow 2\rightarrow \{1,3\}$ concretizes to the set of lists $1\,2\,\{1,3\}^*$ (as a regular expression). We say an abstract list is "likely" a permutation of another abstract list if some permutation of a list in its concretization is contained in the concretization of the other abstract list. Finally, a state is called accepting if its $pc$ equals the line number of some return statement.
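As a minimal sketch of this abstract domain (the type and function names below are mine), an abstract list can be represented by a concrete prefix together with a set-abstracted tail, and membership in its concretization can be checked directly:
// An abstract list: a concrete prefix of length t followed by a set-abstracted tail.
case class AbsList(prefix: List[Int], tail: Set[Int])

// Does the concrete list `xs` belong to the concretization prefix · tail* of `al`?
def inConcretization(al: AbsList, xs: List[Int]): Boolean =
  xs.length >= al.prefix.length &&
  xs.take(al.prefix.length) == al.prefix &&
  xs.drop(al.prefix.length).forall(al.tail.contains)

// Example: 1 -> 2 -> {1,3} concretizes to 1 2 {1,3}*.
val al = AbsList(List(1, 2), Set(1, 3))
assert(inConcretization(al, List(1, 2, 3, 1, 3)))
assert(!inConcretization(al, List(2, 1)))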
Now, given two initial states $(L,pc,itr,V)$ and $(L',pc',itr',V')$ such that $L$ is likely a permutation of $L'$, we carry out two coordinated symbolic executions from the two states and see whether they can reach accepting states after the same number of steps (note that we have to allow nondeterministic NOPs so that the coordinated executions can represent real paths of different lengths). We say that the executions reach a bad state if the two paths reach some return statements, say "return x" and "return y" respectively, with $V(x)\neq V'(y)$. If no bad state is reachable, then we have proved that the reducer is commutative. Otherwise, we inspect the bad state and check whether it is in effect concrete, i.e., the abstract lists have no tails and all valuations lie in $(-n, n)$. If so, then we have proved that the reducer is not commutative. If the bad state is not concrete, then we have found a plausible bug, but it may be a false alarm, since abstract states are over-approximations of concrete states. In that case, we need to refine the abstraction, e.g., by increasing the parameters $t$ and $n$, and re-run the execution from the beginning. Continuing in this fashion, we obtain a definite answer whenever the process terminates. Of course, it is possible that the process never stops and we keep refining forever. This possibility cannot be eliminated, however, since the question we want to answer is undecidable per se.
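Schematically, the semi-algorithm is an abstraction-refinement loop. In the hedged sketch below, the hypothetical function exploreAt stands in for the coordinated symbolic execution at refinement parameters $(t, n)$ (closing over the reducer and the pair of initial abstract states); how to refine the parameters is likewise a design choice of mine, not a prescription:
sealed trait Verdict
case object Proved  extends Verdict  // no bad state reachable: commutative
case object Refuted extends Verdict  // a concrete bad state reached: not commutative
case object Unknown extends Verdict  // an abstract bad state: possibly a false alarm

def semiDecideCommutativity(exploreAt: (Int, Int) => Verdict): Boolean = {
  var t = 1; var n = 1
  while (true) {
    exploreAt(t, n) match {
      case Proved  => return true
      case Refuted => return false
      case Unknown => t += 1; n *= 2   // refine the abstraction and retry
    }
  }
  sys.error("unreachable")  // the loop may in fact diverge: the problem is undecidable
}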

References and further reading

1. Yu-Fang Chen, Chih-Duo Hong, Nishant Sinha, and Bow-Yaw Wang. "Commutativity of Reducers", TACAS, 2015.
2. A note on Hilbert's 10th Problem by Yuri Matiyasevich, or see this overview paper for a glimpse of the literature.
3. Christoph Csallner, Leonidas Fegaras, and Chengkai Li. "New ideas track: testing MapReduce-style programs", ESEC/FSE (ACM SIGSOFT), 2011.

Monday, June 16, 2014

Functors and Monads in Scala ─ Part I

A type constructor is a type-level operator to which you can apply type arguments to construct a new type: given a type A and a type constructor M, you get a new type M[A]. A functor is a type constructor, say M, together with a function
    lift: (A -> B) -> M[A] -> M[B]
that lifts a function over the original types into a function over the new types. Moreover, we require that the lifting function should preserve composition and identities. That is, if function h is formed by composing functions g and f, then lift(h) should be equivalent to the composition of lift(g) and lift(f). Formally, this means
    h(a) = g(f(a))  implies  lift h a = lift g (lift f a).
Also, if h is the identity function on values of type A, then lift(h) should be the identity function on values of type M[A]:
    h(a) = a  implies  lift h m = m
for every m of type M[A].
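In Scala, the lift operation of a functor is the familiar map method. As a quick sanity check, here are both functor laws verified on concrete values for the Option functor (the helper lift is mine, merely wrapping map):
// lift for the Option functor: apply f inside the container.
def lift[A, B](f: A => B): Option[A] => Option[B] = _.map(f)

val f = (x: Int) => x + 1
val g = (y: Int) => y * 2
val h = (a: Int) => g(f(a))                            // h = g composed with f
assert(lift(h)(Some(3)) == lift(g)(lift(f)(Some(3))))  // composition is preserved
assert(lift(identity[Int])(Some(3)) == Some(3))        // identities are preserved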
A monad is a functor associated with two special operations, unitizing and binding:
    unit: A -> M[A]
    bind: (A -> M[B]) -> M[A] -> M[B]
A unitizing operation takes an object of the original type and maps it to an object of the new type. A binding operation is useful for composing two monadic functions, i.e., functions with signature $*\rightarrow M[*]$. Using bind, the composition of two monadic functions f: A -> M[B] and g: B -> M[C] can be defined as
    a => bind g f(a),
which has type A -> M[C]. Composition of monadic functions is supposed to be associative and to have unit as a left and right identity. For this purpose, the definitions of bind and unit must satisfy three "monad laws". That is, the following three equations
1.    bind (a => bind g f(a)) la = bind g (bind (a => f(a)) la)
2.    bind f (unit a) = f(a)
3.    bind unit la = la
should hold for any values a with type A, la with type M[A], functions f with type A -> M[B] and g with type B -> M[C].
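These laws are easy to check on concrete values. The following sketch instantiates them for Scala's built-in Option monad, with unit = Some and bind defined from flatMap (the argument order is flipped to match the equations above):
def unit[A](a: A): Option[A] = Some(a)
def bind[A, B](f: A => Option[B])(m: Option[A]): Option[B] = m.flatMap(f)

val f = (x: Int) => if (x != 0) Some(100 / x) else None
val g = (y: Int) => Some(y + 1)
val la: Option[Int] = Some(4)

assert(bind((a: Int) => bind(g)(f(a)))(la) == bind(g)(bind(f)(la)))  // 1. associativity
assert(bind(f)(unit(5)) == f(5))                                     // 2. left identity
assert(bind(unit[Int])(la) == la)                                    // 3. right identity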

A monad can be thought of as representing computations just like a type represents values. These computations are operations waiting to be carried out. Presumably, the programmer has some interest in not carrying them out directly. For example, the computations to be done may not be fully known at compile time. One can find similar ideas of representing computations as objects in the Command design pattern of GoF. Regarding monads as computations, the purpose of unit is to coerce a value into a computation; the purpose of bind is to construct a computation that evaluates another computation and yields a value. Informally, unit gets us into a monad and bind gets us around the monad. To get out of the monad, one usually uses an operation with type M[A] -> A, though the precise type depends on the purpose of the monad at hand.

In Scala, a monad is conventionally treated like a collection associated with list operations: the unit operation corresponds to a constructor of singleton lists; lift and bind are named map and flatMap, respectively, in the sense that lift maps a list to another list, while bind maps each element of a list to a new list and then flattens the resulting list of lists into a one-layer list. The map and flatMap operations subsume each other given the flatten operation, as
    flatMap g la = flatten (map g la)
    map (a => f(a)) la = flatMap (a => unit f(a)) la
Hence, the minimal set of operations a monad must provide can be {unit, flatMap} or {unit, map, flatten}. Scala doesn't define a base trait for Monad and leaves the choice of interface to the programmer. On the other hand, if a monad implements the methods map, flatMap and filter, then it can be composed with other monads via for-comprehensions to streamline monadic computations.
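For reference, a for-comprehension over a monad is mere syntactic sugar for nested flatMap and map calls; both sides below evaluate to Some(5):
val viaFor  = for { x <- Some(2); y <- Some(3) } yield x + y
val viaBind = Some(2).flatMap(x => Some(3).map(y => x + y))
assert(viaFor == viaBind)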

Examples

Optional values. Option (a.k.a. Maybe) monads are used extensively in many functional programming languages: see here for a more general treatment of option monads, and here for a detailed explanation of their usage in Scala. Below is a minimalistic implementation of the Option type and its subtypes Some and None:
sealed trait Option[T] {
    def map[U](f: T => U): Option[U]
    def flatMap[U](f: T => Option[U]): Option[U]
}
case class Some[T](t: T) extends Option[T] {
    override def map[U](f: T => U): Option[U] = Some(f(t))
    override def flatMap[U](f: T => Option[U]): Option[U] = f(t)
}
case class None[T]() extends Option[T] {
    override def map[U](f: T => U): Option[U] = None[U]()
    override def flatMap[U](f: T => Option[U]): Option[U] = None[U]()
}
The canonical advantage of the option monad is that it allows composition of partial functions, i.e., functions that would otherwise return a null pointer when they fail to give a correct result: a partial function returns None (representing an "empty" Option) instead of null, and this result can be composed with other functions. In other words, the option monad turns partial functions into total ones. For example, suppose g1, g2 and g3 are functions that return either a null pointer or an addable value, and we want to print g3(g2(x) + g1(x)) whenever the result is meaningful. Without Option, one has to use nested null checks or a try-catch block to handle possible null inputs/outputs, for example,
try {
  val y = g3(g2(x) + g1(x))
  if (y != null) println(y)
} catch {
  case e: NullPointerException => ()  // swallow the error and print nothing
}
If we make g1, g2 and g3 return an option monad via a wrapper, we can achieve the same purpose using for-comprehension, so that nothing would be printed if any of g1, g2 or g3 returns null:
def monadify[T](f: T => T) =
  (x: T) => { 
    val y = if(x!=null) f(x) else x
    if(y != null) Some(y) else None
  }
for {
  y <- monadify(g1)(x)
  z <- monadify(g2)(x)
  r <- monadify(g3)(y + z)
} yield println(r)
Whether the functional ("idiomatic") way of handling null pointers is preferable to the traditional one is still under debate, especially among sophisticated users of imperative programming languages. Since the two approaches differ very little in performance and Scala supports both of them, the choice is really a matter of taste rather than truth, IMHO.

An important thing to notice about this example is that the monad is used here as a control structure. We have a sequence of operations chained together in a monad via for-comprehension. When we evaluate the monadic sequence, the implementation of the binding operator that combines the monads actually does control-flow management. This trick of using the chaining operator to insert control flow into monadic sequences is an incredibly useful technique, and it is used in Scala by a variety of monads besides Option, such as Try, Either, Future and Promise.
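For instance, the same short-circuiting behavior is available with scala.util.Try, where a thrown exception plays the role that null played above:
import scala.util.Try
val ok  = for { a <- Try("42".toInt); b <- Try(84 / a) } yield b    // Success(2)
val bad = for { a <- Try("oops".toInt); b <- Try(84 / a) } yield b  // Failure(NumberFormatException)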

State transformers. A stateful computation can be modeled as a function that takes inputs with an initial state and produces an output paired with the new state. Due to the presence of state, a function may produce different results even when it is given the same inputs. Let Env be a type representing the possible states of the environment. An Effect is a mapping from Env to Env that describes a change of state, i.e., the side-effect caused by a computation, and the monad type M adjoins an extra side-effect to any type:
    Effect = Env -> Env
    M[A] = (A, Effect)
The unit function pairs up a value with the identity side-effect; the map function transforms the value and propagates the side-effect unchanged:
    unit a = (a, e => e)
    (a, e) map f = (f(a), e)
If you want to adjoin two side-effects e and e', you can use bind to compose them and get a single effect e' ○ e, where ○ denotes function composition:
    (a, e) bind (a => (f(a), e')) = (f(a), e' ○ e)    
In Scala, the state monad described above may be written as
case class M[+A](a: A, e: Env => Env) {
    // Sequence this computation's effect with that of f(a): first e, then fa.e.
    private def bind[B](f: A => M[B]): M[B] = {
      val fa = f(a); M(fa.a, env => fa.e(e(env)))
    }
    override def toString = showM(this)
    def map[B](f: A => B): M[B] = bind(x => unitM(f(x)))
    def flatMap[B](f: A => M[B]): M[B] = bind(f)
}
def unitM[A](a: A) = M(a, env => env)  // pair a value with the identity effect
The following example shows a state monad that increments a counter each time an arithmetic operation is invoked. (Note that one could instead declare the counter as a field member, avoiding the use of monads altogether. This may explain the viewpoint that state monads are arguably less important in impure FP languages like Scala than in pure ones like Haskell.) The add and div operations demonstrate how to handle a monad wrapped in another monad using nested for-comprehensions.
type Env = Int
// Ad hoc operations for this particular monad
def showM[A](m: M[A]) = "Value: " + m.a + "; Counter: " + m.e(0)
val tickS = M(0, e => e + 1)

// Following code demonstrates the use of monads
type Value = Double
def add(a: M[Option[Value]], b: M[Option[Value]]): M[Option[Value]] = 
  for {
    _  <- tickS
    aa <- a 
    bb <- b
    c = for {
      a <- aa if aa.nonEmpty
      b <- bb if bb.nonEmpty
    } yield a + b     // yield None if aa is None or bb is None
  } yield c

def div(a: M[Option[Value]], b: M[Option[Value]]): M[Option[Value]] = 
  for {
    _  <- tickS
    aa <- a
    bb <- b
    c = for {
      a <- aa if aa.nonEmpty
      b <- bb if bb.nonEmpty
      b <- Some(b) if b != 0
    } yield a / b     // yield None if b==0 or aa is None or bb is None
  } yield c

val const0 = unitM(Some(0.0))
val const1 = unitM(Some(1.0))
val const2 = unitM(Some(2.0))
val const3 = unitM(Some(3.0))
val expr = div(add(const3, add(const1, const2)), const2)  
println(expr)                            // Value: Some(3.0); Counter: 3
println(add(div(expr, const0), const1))  // Value: None; Counter: 5
The syntactic sugar of for-comprehensions abstracts away these details and helps the user manipulate monads without getting involved in the underlying machinery of binding and lifting. The fact that the abstraction works as expected is a consequence of the monad laws, and it is up to the designer of the monad to ensure that the implementation indeed satisfies the laws.

We stop here, as this post is getting long. More examples, techniques and discussions will be introduced in Part II of this tutorial.

References and resources

1. Monads are Elephants, Part 1 ~ Part 4, is a gentle and pragmatic tutorial that explains exactly what a Scala programmer needs to know to exploit monads. Scala Monads: Declutter Your Code With Monadic Design is one of the videos on YouTube that teach you how to write modular and extensible programs with monadically structured code.
2. Many excellent papers and slides about monads can be found on the homepage of Philip Wadler, one of the most enthusiastic advocates of monadic design in the FP community. Among others, The Essence of Functional Programming and Monads for Functional Programming are two must-reads for anyone who is struggling to turn monadic reasoning into intuition.
3. For a general and rigorous mathematical treatment of monads, see e.g. Arrows, Structures, and Functors: The Categorical Imperative, where Section 10.2 describes monads as generalized monoids in a way that basically makes everything we can write down or model using abstract syntax a monad. Mathematically mature computer science readers will find everything they need to know about the subject in the celebrated book Categories for the Working Mathematician by Mac Lane, the co-founder of category theory.
4. There are also books available online that introduce programmers with a moderate math background to the subject, including Category Theory for Computing Science by M. Barr and C. Wells, and Computational Category Theory by D.E. Rydeheard and R.M. Burstall. Also check this thread and the references therein about the relations between category theory, type theory, and functional programming.