The aggregate function in Scala has the following method signature:
def aggregate[A](z: => A)(seqOp: (A, B) => A, combOp: (A, A) => A): A
Here B is the element type of the collection and A is the result type. The semantics of
aggregate
depend on the collection upon which it is invoked. When aggregate is invoked on a sequential traversable collection such as a List, it falls back to foldLeft and ignores the provided combOp. In this case, aggregate behaves just like the ordinary fold in typical functional programming languages such as Lisp and ML. When aggregate
is invoked on a parallel traversable collection, however, it performs a two-stage aggregation over the data elements in parallel. In Scala, a parallel collection can be split into partitions and distributed over several cores or nodes. Invoking aggregate on a parallel collection reduces the collection to a single value by first folding each partition separately using seqOp, and then merging the results one by one using combOp.
Note that the Scala API documentation specifies neither the way in which the collection is partitioned nor the order in which the intermediate results are merged. Hence, a programmer who wants to obtain the same output whenever aggregate is invoked on the same parallel collection must ensure that seqOp is associative and combOp is both associative and commutative.
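To see why an unspecified merge order calls for a commutative combOp, the two stages can be modelled by hand. The following is only a sketch: twoStage and its mergeOrder parameter are hypothetical names introduced for illustration, and the code does not use Scala's parallel collections.

```scala
object MergeOrderDemo {
  // Hypothetical model of a two-stage aggregation: fold every partition with
  // seqOp, then merge the partial results with combOp, visiting them in the
  // order given by mergeOrder (a permutation of the partition indices).
  def twoStage[A, B](partitions: Vector[Vector[B]], mergeOrder: Seq[Int])(z: A)(
      seqOp: (A, B) => A, combOp: (A, A) => A): A = {
    val partials = partitions.map(_.foldLeft(z)(seqOp))
    mergeOrder.map(i => partials(i)).foldLeft(z)(combOp)
  }

  val chars: Vector[Vector[Char]] = Vector(Vector('a', 'b'), Vector('c', 'd'))

  // String concatenation is associative but not commutative,
  // so the result depends on the merge order.
  val concatInOrder: String =
    twoStage(chars, Seq(0, 1))("")((acc, c) => acc + c, _ + _)
  val concatReversed: String =
    twoStage(chars, Seq(1, 0))("")((acc, c) => acc + c, _ + _)

  val ints: Vector[Vector[Int]] = Vector(Vector(1, 2), Vector(3, 4))

  // Integer addition is associative and commutative,
  // so the merge order does not matter.
  val sumInOrder: Int = twoStage(ints, Seq(0, 1))(0)(_ + _, _ + _)
  val sumReversed: Int = twoStage(ints, Seq(1, 0))(0)(_ + _, _ + _)
}
```

In this model, concatInOrder and concatReversed differ, while the two sums agree.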
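Aggregate vs. fold
The fold function in Scala has the following method signature:
def fold[A1 >: A](z: A1)(combOp: (A1, A1) => A1): A1
Note that, in this signature, A denotes the element type of the collection and A1 the result type. One can observe two differences between the signatures of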
fold and aggregate: 1) fold only needs one operator, while aggregate needs two; 2) fold requires the result type to be a supertype of the element type, while aggregate doesn't. Note that both methods fall back to foldLeft when they are invoked on a sequential collection. One may notice that Scala's aggregate is in effect the ordinary FP fold
fold : (A → B → A) → A → ([B] → A)
and Scala's fold is just a special form of its aggregate. The idea behind this design is to allow for a parallel FP fold. In Scala, fold and aggregate are both done in two stages, namely folding and combining. For fold, the same operator is used for folding and combining, which requires the supertype relationship between the element type and the result type. Without the supertype relationship, it takes two operators to fold and combine, and this is why Scala provides aggregate, which takes two operators as arguments.
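The relationship between the two functions can be illustrated with a small hand-written model of the two stages. This is a sketch under assumptions: twoStage is a hypothetical helper, not a library function, and the chunks are fixed by hand rather than chosen by a parallel collection.

```scala
object FoldVsAggregate {
  // Hypothetical helper: fold each chunk with seqOp, then combine the
  // partial results from left to right with combOp.
  def twoStage[A, B](chunks: List[List[B]])(z: A)(
      seqOp: (A, B) => A, combOp: (A, A) => A): A =
    chunks.map(_.foldLeft(z)(seqOp)).foldLeft(z)(combOp)

  val chunks: List[List[String]] = List(List("ab", "c"), List("def"))

  // fold-like use: the result type equals the element type (String),
  // so one operator can serve as both seqOp and combOp.
  val concatenated: String = twoStage(chunks)("")(_ + _, _ + _)

  // aggregate-like use: the result type (Int) is unrelated to the element
  // type (String), so folding and combining need different operators.
  val totalLength: Int = twoStage(chunks)(0)((acc, s) => acc + s.length, _ + _)

  // Standard-library counterparts on the flattened sequential list.
  val viaFold: String = chunks.flatten.fold("")(_ + _)
  val viaFoldLeft: Int = chunks.flatten.foldLeft(0)((acc, s) => acc + s.length)
}
```

Both pairs of values agree here because the operators are associative and the zero elements are identities of the combining operators.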
Aggregate in Spark
Spark implements its own aggregate operations for RDD. One can observe from the source code of Spark that the intermediate results of aggregate are always iterated and merged in a fixed order (see the Java, Python and Scala implementations for details).
Hence, we don't have to use a commutative combOp to obtain deterministic results from aggregate, given that at least one of the following conditions holds:
(i) there doesn't exist an empty partition in any partitioning of the collection, or
(ii) z is an identity of combOp, i.e., for all a of type A, combOp(z,a) = combOp(a,z) = a.
Note that this observation is not assured by the official Spark API documentation and is thus not guaranteed to hold in future versions of Spark. However, it is still useful if we want to model the behaviours of a Spark program formally.
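As a small illustration of the observation, consider string concatenation, which is associative but not commutative and has the empty string as its identity, so condition (ii) holds. The sketch below assumes a fixed left-to-right merge order, as described above; aggregateInOrder is a hypothetical stand-in for that behaviour and does not call Spark.

```scala
object FixedOrderMerge {
  // Hypothetical model of a fixed-order merge: each partition is folded with
  // seqOp, and the partial results are combined from left to right.
  def aggregateInOrder[A, B](partitions: List[List[B]])(z: A)(
      seqOp: (A, B) => A, combOp: (A, A) => A): A =
    partitions.map(_.foldLeft(z)(seqOp)).foldLeft(z)(combOp)

  // Concatenate all strings of all partitions; "" is an identity of +.
  def concat(partitions: List[List[String]]): String =
    aggregateInOrder(partitions)("")(_ + _, _ + _)

  // Three partitionings of the same list, one of them with an empty partition.
  val onePartition: String = concat(List(List("a", "b", "c")))
  val twoPartitions: String = concat(List(List("a"), List("b", "c")))
  val withEmptyPartition: String = concat(List(List("a"), Nil, List("b", "c")))
}
```

All three values coincide in this model, even though the combining operator is not commutative.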
Below, we shall try to formalize Spark's implementation of aggregate.
While Scala is an impure functional language, we only consider pure functions here for simplicity. Let $R(L,m)$ denote an RDD object obtained from partitioning a list $L \in \mathbb{D}^*$ into $m \ge 1$ sub-lists.
Suppose that we have fixed an element $zero \in \mathbb{E}$, as well as operators $seq:{\mathbb{E} \times \mathbb{D} \to \mathbb{E}}$ and $comb:{\mathbb{E} \times \mathbb{E} \to \mathbb{E}}$.
An invocation of aggregate is said to be deterministic with respect to $zero$, $seq$ and $comb$ if the output of
$$R(L,\cdot).\textrm{aggregate}(zero)(seq,\ comb)$$ only depends on $L$.
Define a function
$\textrm{foldl}:({\mathbb{E} \times \mathbb{D} \to \mathbb{E}}) \times \mathbb{E} \times \mathbb{D}^* \to \mathbb{E}$ by
\begin{eqnarray*}
\textrm{foldl}(f,\ a,\ Nil) & = & a \\
\textrm{foldl}(f,\ a,\ b \!::\! bs) & = & \textrm{foldl}(f,\ f(a,b),\ bs),
\end{eqnarray*}
where $f\in {\mathbb{E} \times \mathbb{D} \to \mathbb{E}}$, $a \in \mathbb{E}$, $b \in \mathbb{D}$ and $bs \in \mathbb{D}^*$.
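The two defining equations of foldl translate almost literally into Scala. The following is a minimal sketch; the object name Foldl and the sample values are chosen only for illustration.

```scala
import scala.annotation.tailrec

object Foldl {
  // Direct transcription of the two equations:
  //   foldl(f, a, Nil)     = a
  //   foldl(f, a, b :: bs) = foldl(f, f(a, b), bs)
  @tailrec
  def foldl[E, D](f: (E, D) => E, a: E, l: List[D]): E = l match {
    case Nil     => a
    case b :: bs => foldl(f, f(a, b), bs)
  }

  // E = D = Int: summing a list.
  val sum: Int = foldl[Int, Int](_ + _, 0, List(1, 2, 3))

  // E = String, D = Int: the accumulator type differs from the element type.
  val digits: String = foldl[String, Int]((acc, d) => acc + d, "", List(1, 2, 3))

  // The standard library's foldLeft computes the same value.
  val sumViaFoldLeft: Int = List(1, 2, 3).foldLeft(0)(_ + _)
}
```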
Suppose that $1\le m \le |L|$ and $R(L,m)$ partitions $L$ into $\{L_1, \dots, L_m\}$
such that $L=L_1 \cdots L_m$. According to the source code of Spark, we can define the aggregate operation in Spark as
\begin{eqnarray*}
R(L,m).\textrm{aggregate}(zero)(seq,\ comb) = \textrm{foldl}(comb,\ zero,\ L'),
\end{eqnarray*}
where $L' \in \mathbb{E}^*$ is a list of length $m$ and $L'[i] = \textrm{foldl}(seq,\ zero,\ L_i)$ for $i=1,\dots,m$.
Note that the partitioning of $L$ is unspecified and may be non-deterministic. Hence, we have to use an associative $comb$ operator if we want to ensure that the result of aggregation is deterministic. Suppose that $comb$ is associative and at least one of conditions (i) and (ii) is satisfied. It turns out that the output of aggregate is deterministic with respect to $zero$, $seq$ and $comb$ if and only if for all lists $L_1$ and $L_2$,
$$comb(zero,\ \textrm{foldl}(seq,\ zero,\ L_1L_2)) = comb(\textrm{foldl}(seq,\ zero,\ L_1),\ \textrm{foldl}(seq,\ zero,\ L_2)).$$
While the above condition is undecidable in general, it can be effectively checked for some classes of programs such as FSMs over finite alphabets.
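Although the condition cannot be decided in general, it can be spot-checked on concrete inputs. The sketch below models the definition of aggregate given above and tests the condition for every split of one sample list into $L_1 L_2$. The names aggregate and conditionHoldsOn are hypothetical, and passing the check on a sample is evidence, not a proof.

```scala
object DeterminismCheck {
  def foldl[E, D](f: (E, D) => E, a: E, l: List[D]): E = l.foldLeft(a)(f)

  // Model of the definition above: L'[i] = foldl(seq, zero, L_i),
  // result = foldl(comb, zero, L').
  def aggregate[E, D](parts: List[List[D]])(zero: E)(
      seq: (E, D) => E, comb: (E, E) => E): E =
    foldl[E, E](comb, zero, parts.map(p => foldl[E, D](seq, zero, p)))

  // Checks comb(zero, foldl(seq, zero, L1 L2)) ==
  //        comb(foldl(seq, zero, L1), foldl(seq, zero, L2))
  // for every way of splitting the sample list l into L1 followed by L2.
  def conditionHoldsOn[E, D](l: List[D])(zero: E)(
      seq: (E, D) => E, comb: (E, E) => E): Boolean =
    (0 to l.length).forall { k =>
      val (l1, l2) = l.splitAt(k)
      comb(zero, foldl(seq, zero, l)) ==
        comb(foldl(seq, zero, l1), foldl(seq, zero, l2))
    }

  val sample: List[Int] = List(1, 2, 3)

  // Summation: the condition holds on the sample.
  val sumOk: Boolean = conditionHoldsOn(sample)(0)(_ + _, _ + _)

  // seq(acc, x) = 2 * acc + x with comb = +: comb is associative and 0 is
  // its identity, yet the condition fails on the sample.
  val hornerOk: Boolean =
    conditionHoldsOn(sample)(0)((acc, x) => acc * 2 + x, _ + _)

  // Accordingly, two partitionings of the same list give different outputs.
  val hornerOnePart: Int =
    aggregate(List(sample))(0)((acc, x) => acc * 2 + x, _ + _)
  val hornerTwoParts: Int =
    aggregate(List(List(1), List(2, 3)))(0)((acc, x) => acc * 2 + x, _ + _)
}
```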
Concluding remarks
Aggregate in Spark is the parallelized version of the conventional fold, which has already been shown to be extremely useful in functional programming. It is therefore of practical interest to study the properties and behaviours of the aggregate function. In particular, it is interesting to establish and verify the conditions under which aggregate is deterministic with respect to the provided $zero$, $seq$ and $comb$. We may further investigate this problem in the coming posts.