Map. A map function is applied to each input and outputs zero or more intermediate key-value pairs of an arbitrary type.
Shuffle. All intermediate key-value pairs are grouped and sorted by key, so that pairs with the same key can be reduced together.
Reduce. A reduce function combines the values that share a key and produces the results associated with that key in the final output.
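The three phases above can be sketched in memory with a word count. This is only an illustration under our own assumptions: the class `MapReduceSketch` and its method names are hypothetical, everything runs in one process, and no real MapReduce framework API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map, shuffle, and reduce.
class MapReduceSketch {
    // Map: emit one (word, 1) pair per word of the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values by key; the TreeMap keeps the keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce: sum the values that share a key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static TreeMap<String, Integer> run(List<String> inputs) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : inputs) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : shuffle(intermediate).entrySet()) {
            output.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return output;
    }
}
```

For example, `MapReduceSketch.run(List.of("a b a", "b c"))` maps `a` to 2, `b` to 2, and `c` to 1.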
Note that there is in fact an optional optimization step between map and shuffle that combines values by key on a map node. This is useful because it reduces the number of key-value pairs locally before they are shuffled. We ignore this step because it is irrelevant to the topics discussed later.
While records are sorted by key before they reach the reduce function, for any particular key the order in which the values appear is not stable from one run to the next: the values come from different map tasks, which may finish at different times from run to run. As a result, most MapReduce programs are written so as not to depend on the order in which the values are presented to the reduce function. In other words, an implementation of a reduce function is expected to be commutative with respect to its input. Formally, a reduce function, which we may alternatively call a reducer hereafter, is commutative if for each key $k$ and each input list $L$, it returns the same result for each permutation $\sigma$ of that list:
$$\forall L, \sigma: reduce(k, L) = reduce(k, \sigma(L))$$
Note that we treat reduce functions in a very high-level manner: a reducer in our mind is just a function that takes a key and a list as input and returns a value as output. Moreover, we may assume without loss of generality that keys and values are all integers, say by considering their hash values instead of their actual values.
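To make the definition concrete, here is a minimal sketch that evaluates a reducer on every permutation of one fixed list and compares the results. The interface `Reducer`, the class `CommutativityOnOneList`, and the two example reducers are our own hypothetical names, not part of any framework. Passing this check on one list proves nothing about other lists, whereas failing it yields a genuine counterexample.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: checks the defining equation on one fixed list only.
class CommutativityOnOneList {
    interface Reducer {
        int reduce(int key, List<Integer> values);
    }

    // Commutative: a sum does not depend on the order of the values.
    static int sum(int key, List<Integer> values) {
        int s = 0;
        for (int v : values) {
            s += v;
        }
        return s;
    }

    // Not commutative: the first value is treated differently from the rest.
    static int firstMinusRest(int key, List<Integer> values) {
        if (values.isEmpty()) {
            return 0;
        }
        int r = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            r -= values.get(i);
        }
        return r;
    }

    // True if the reducer returns the same value on every permutation of list.
    static boolean sameOnAllPermutations(Reducer reducer, int key, List<Integer> list) {
        int expected = reducer.reduce(key, list);
        return check(reducer, key, new ArrayList<>(list), 0, expected);
    }

    // Enumerates permutations by swapping each remaining element into position i.
    private static boolean check(Reducer reducer, int key, List<Integer> work, int i, int expected) {
        if (i == work.size()) {
            return reducer.reduce(key, work) == expected;
        }
        for (int j = i; j < work.size(); j++) {
            Collections.swap(work, i, j);
            boolean ok = check(reducer, key, work, i + 1, expected);
            Collections.swap(work, i, j); // restore the list before trying the next choice
            if (!ok) {
                return false;
            }
        }
        return true;
    }
}
```

On the list `[1, 2, 3]` the check succeeds for `sum` and fails for `firstMinusRest`. On `[5, 5]` it succeeds for both, which shows why a single list is not enough to establish commutativity.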
The main purpose of this post is to initiate an investigation into model checking the commutativity of reduce functions. Our modeling language supports Boolean and integer data types, Boolean operations and integer arithmetic, if-then-else and the control structures that can be constructed from it, return statements, and a while loop over the iterator of the input list. Moreover, we introduce non-determinism to the language so that we can model uninterpreted function calls, etc. While this model looks like an over-simplification of the reducers used in practical MapReduce programs, hopefully it captures enough ingredients for us to derive non-trivial results that may shed light on more realistic use cases. Our first theorem about this model is as follows.
Theorem. The commutativity of reducers is undecidable.
Proof. We shall reduce the Diophantine problem, i.e., determining the solvability of Diophantine equation systems, which is known to be undecidable in general, to the problem of determining the commutativity of reducers. Given a Diophantine equation system $P: P_1(x_1,\ldots,x_k)=0, \ldots, P_n(x_1,\ldots,x_k)=0$, we define a reducer as follows:

    public int reduce(int key, Iterator<Integer> values) {
        int x1, ..., xk;
        if (values.hasNext()) x1 = values.next(); else return 0;
        ...
        if (values.hasNext()) xk = values.next(); else return 0;
        if (P1(x1,...,xk) != 0) return 0;
        ...
        if (Pn(x1,...,xk) != 0) return 0;
        int y1, y2;
        if (values.hasNext()) y1 = values.next(); else return 0;
        if (values.hasNext()) y2 = values.next(); else return 0;
        return y1 - y2;
    }

It is clear that if the equation system $P$ has no solution, the reduce function always returns zero regardless of its input, i.e., it is commutative. On the other hand, if there is a solution $(s_1,\ldots,s_k)$, then the return value depends on the input values as well as on the order in which they are iterated: on the input list $s_1,\ldots,s_k,1,2$ the reducer returns $-1$, whereas on the permutation $s_1,\ldots,s_k,2,1$ it returns $1$. Hence the function is not commutative. Note that the return value $y_1-y_2$ makes the reducer non-commutative even when $P$ has a unique solution with $x_1=\cdots=x_k$. In this way, we have reduced the problem of checking the solvability of $P$ to that of checking the commutativity of a reducer. This concludes our proof. $\square$
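As a sanity check, the construction in the proof can be instantiated for one concrete equation. The sketch below assumes the single equation $x_1^2 + x_2^2 - 25 = 0$ (so $k=2$ and $n=1$), which we chose purely for illustration, and the class name `DiophantineReducerInstance` is hypothetical. The polynomial is evaluated in `long` arithmetic to limit the effect of machine-integer overflow, a concern the idealized model does not have.

```java
import java.util.Iterator;

// Hypothetical instance of the reduction for x1^2 + x2^2 - 25 = 0.
class DiophantineReducerInstance {
    static int reduce(int key, Iterator<Integer> values) {
        long x1, x2;
        // Read the candidate solution from the front of the list.
        if (values.hasNext()) x1 = values.next(); else return 0;
        if (values.hasNext()) x2 = values.next(); else return 0;
        // Return 0 unless the candidate actually solves the equation.
        if (x1 * x1 + x2 * x2 - 25 != 0) return 0;
        int y1, y2;
        // Read two more values whose order will matter.
        if (values.hasNext()) y1 = values.next(); else return 0;
        if (values.hasNext()) y2 = values.next(); else return 0;
        return y1 - y2;
    }
}
```

Since $(3, 4)$ solves the equation, `reduce(0, List.of(3, 4, 1, 2).iterator())` returns -1 while `reduce(0, List.of(3, 4, 2, 1).iterator())` returns 1, so this particular reducer is not commutative. On a list such as `[1, 1, 5, 9]`, whose first two values do not solve the equation, it returns 0.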
As verifying commutativity is undecidable, the best we can hope for is a semi-algorithm that is effective for some interesting case studies. Our first attempt is to use abstract interpretation, along with counter abstraction for lists, and reduce the verification problem to a reachability problem. Let $t,n\in\mathbb N$ be refinement parameters. A state is a 4-tuple $(L,pc,itr,V)$ where $L$ is an abstract list value, $pc$ is the program counter, $itr$ is the position of the iterator, and $V$ is the valuation in the abstract domain $[-n, n]$. An abstract list is a prefix of size $t$ followed by a subset of $\{-n,\ldots,n\}$ that represents the "tail". For example, the abstract list $1\rightarrow 2\rightarrow \{1,3\}$ concretizes to the set of lists $12\{1,3\}^*$ (as a regular expression). We say an abstract list is "likely" a permutation of another abstract list if at least one permutation of a list in its concretization is contained in the concretization of the other abstract list. Finally, a state is called accepting if its $pc$ equals the line number of some return statement.
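The concretization of an abstract list can be illustrated with a small membership test. This is a sketch under our own assumptions: the class `AbstractListValue` and its method `concretizes` are hypothetical names, and the class models only the prefix and the tail set, not the rest of the state.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of an abstract list: an exact prefix plus a set of tail values.
class AbstractListValue {
    final List<Integer> prefix; // the first t elements, known exactly
    final Set<Integer> tail;    // values that may appear, in any number, after the prefix

    AbstractListValue(List<Integer> prefix, Set<Integer> tail) {
        this.prefix = prefix;
        this.tail = tail;
    }

    // True if the concrete list belongs to the concretization prefix . tail^*.
    boolean concretizes(List<Integer> concrete) {
        if (concrete.size() < prefix.size()) {
            return false; // the whole prefix must be present
        }
        for (int i = 0; i < prefix.size(); i++) {
            if (!prefix.get(i).equals(concrete.get(i))) {
                return false;
            }
        }
        for (int i = prefix.size(); i < concrete.size(); i++) {
            if (!tail.contains(concrete.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With `new AbstractListValue(List.of(1, 2), Set.of(1, 3))`, which models $1\rightarrow 2\rightarrow \{1,3\}$, the lists `[1, 2]` and `[1, 2, 3, 1, 1]` are accepted, while `[2, 1]` and `[1, 2, 2]` are rejected.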
Now, given two initial states $(L,pc,itr,V)$ and $(L',pc',itr',V')$ such that $L$ is likely a permutation of $L'$, we carry out two coordinated symbolic executions, one from each state, and see if they can reach accepting states after the same number of steps (note that we have to allow nondeterministic NOPs so that the executions can represent real paths of different lengths even though they are coordinated). We say that the executions reach a bad state if the two paths reach some return statements, say "return x" and "return y", respectively, and $V(x)\neq V'(y)$. If no bad state is reached, then we have proved that the reducer is commutative. Otherwise, we check the bad state we reach and see if it is in effect concrete, i.e., the abstract lists do not have tails and all valuations are in $(-n, n)$. If so, then we have proved that the reducer is not commutative. If the bad state is not concrete, then we have found a plausible bug, but it may be a false alarm since abstract states are over-approximations of concrete states. In that case, we need to refine the abstraction, e.g., by increasing the parameters $t,n$, and re-run the execution from the beginning. Continuing in this fashion, we obtain a definite answer if the process terminates. Of course, it is possible that the process never stops and we just keep refining again and again. This possibility cannot be eliminated, however, since the question we want to answer is undecidable per se.
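The shape of the refinement loop can be illustrated with a much cruder stand-in. The sketch below does not implement the abstract interpretation described above: it replaces the abstract search with a brute-force enumeration of concrete lists of length at most $t$ over the values $[-n, n]$, and it increases both parameters each round. All names (`BoundedRefinementSketch`, `findWitness`, `refine`, `maxRounds`) are hypothetical. It only compares a list with its adjacent swaps, which suffices within the bounds because every permutation is a product of adjacent transpositions. Unlike the abstract procedure, this stand-in can only refute commutativity: a `null` result means "inconclusive within the budget", not "commutative".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical brute-force stand-in for the refinement loop (refutation only).
class BoundedRefinementSketch {
    interface Reducer {
        int reduce(int key, List<Integer> values);
    }

    // Searches lists of length <= t over [-n, n] for one whose result changes
    // when two adjacent elements are swapped. Returns a witness list or null.
    static List<Integer> findWitness(Reducer r, int t, int n) {
        return extend(r, new ArrayList<>(), t, n);
    }

    private static List<Integer> extend(Reducer r, List<Integer> cur, int t, int n) {
        int base = r.reduce(0, cur);
        for (int i = 0; i + 1 < cur.size(); i++) {
            Collections.swap(cur, i, i + 1);
            boolean differs = r.reduce(0, cur) != base;
            Collections.swap(cur, i, i + 1); // undo the swap
            if (differs) {
                return new ArrayList<>(cur);
            }
        }
        if (cur.size() == t) {
            return null; // length bound reached
        }
        for (int v = -n; v <= n; v++) {
            cur.add(v);
            List<Integer> witness = extend(r, cur, t, n);
            cur.remove(cur.size() - 1);
            if (witness != null) {
                return witness;
            }
        }
        return null;
    }

    // Refinement loop: grow t and n together until a witness appears
    // or the round budget is exhausted (null means inconclusive).
    static List<Integer> refine(Reducer r, int maxRounds) {
        for (int round = 1; round <= maxRounds; round++) {
            List<Integer> witness = findWitness(r, round, round);
            if (witness != null) {
                return witness;
            }
        }
        return null;
    }
}
```

For a reducer that returns the first value of the list, `refine` finds a witness in the second round. For a summing reducer it exhausts any budget without a witness, which mirrors the non-termination that undecidability makes unavoidable in general.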
References and further readings
1. Yu-Fang Chen, Chih-Duo Hong, Nishant Sinha, and Bow-Yaw Wang. "Commutativity of Reducers", TACAS, 2015.
2. A note on Hilbert's 10th Problem by Yuri Matiyasevich, or see this overview paper for a glimpse of the literature.
3. Computational Aspects of Diophantine Equation Systems with Mathematica.
4. Csallner, Christoph, Leonidas Fegaras, and Chengkai Li. "New ideas track: testing MapReduce-style programs." ACM SIGSOFT, 2011.
5. Xiao, Tian, et al. "Nondeterminism in MapReduce considered harmful? an empirical study on non-commutative aggregators in MapReduce programs." Companion Proceedings of the 36th ICSE. ACM, 2014.