This is ericpony's blog
Showing posts with label Algorithm. Show all posts

Saturday, March 4, 2017

A note on parallelizable functions

Yesterday I was unexpectedly asked what kinds of problems benefit the most from distributed computing frameworks such as MapReduce. I responded immediately that any problem that can be tackled by a divide-and-conquer strategy should fit the framework. On second thought, however, I found the answer missing the point, because it remains unclear what kinds of problems are solvable by divide and conquer. For example, sorting can be done this way, because sort(a++b) can be obtained from sort(a) and sort(b). How about the Boolean function is_sorted? The value of is_sorted(a++b) cannot be determined merely from the return values of is_sorted(a) and is_sorted(b). So there must be some characteristic that tells sort and is_sorted apart.
Let's call a function parallelizable if it can be computed by divide and conquer. It is possible to "lift" a non-parallelizable function to a parallelizable one, such that the output of the original function can be obtained from that of the lifted function. In the above example, is_sorted can be made parallelizable by augmenting its return value with the additional information $\max(s)$ and $\min(s)$ for input list $s$, since $a$$+\!\!+$$b$ is sorted in ascending order if and only if $a$ and $b$ are sorted in ascending order and $\max(a)\le \min(b)$. This parallelization is also effective, because it reduces the time complexity from linear to logarithmic in appropriate computation models. We can now ask several interesting questions: 1) Which functions are parallelizable? 2) Is it always possible to lift a non-parallelizable function to a parallelizable one? 3) Do parallelizable functions always have efficient parallel implementations?
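The lift described above can be sketched in a few lines of Python (a sketch of mine, not code from the post): each chunk of a non-empty list is summarised by a tuple (sorted?, first, last), and the tuples are combined with an associative operator. Note that when a chunk is sorted, its first and last elements are exactly its min and max, which is all the combine step needs; when a chunk is unsorted, the Boolean flag is already false and propagates.

```python
from functools import reduce

def summarise(chunk):
    """Conquer: summarise a non-empty chunk as (is_sorted, first, last)."""
    ok = all(x <= y for x, y in zip(chunk, chunk[1:]))
    return (ok, chunk[0], chunk[-1])

def combine(left, right):
    """The ⊙ operator: associative combination of two chunk summaries."""
    ok_l, mn_l, mx_l = left
    ok_r, mn_r, mx_r = right
    return (ok_l and ok_r and mx_l <= mn_r, mn_l, mx_r)

def is_sorted(xs, chunk_size=4):
    """Divide and conquer over chunks of a non-empty list; the summarise
    calls are independent and could run in parallel."""
    chunks = [xs[i:i + chunk_size] for i in range(0, len(xs), chunk_size)]
    return reduce(combine, map(summarise, chunks))[0]
```

Since `combine` is associative, the reduction can be arranged as a balanced tree, which is what makes the logarithmic depth mentioned above possible.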

Homomorphism and Parallelism

After a brief survey, I found that the rough ideas in my mind fall into a long-standing and popular line of research. In particular, there is a well-known result connecting homomorphisms with functions that admit divide-and-conquer implementations.
Definition. A function $f : [A]\rightarrow B$ is $\odot$-homomorphic for operator $\odot:B\times B\rightarrow B$ if $f(s$$+\!\!+$$t) = f(s)\odot f(t)$ for any lists $s$ and $t$ on $[A]$, the set of finite lists of type $A.$
Note that the definition implies that $\odot$ is associative on $B$, since $+\!\!+$ is associative on $[A].$ Also, $f([\ ])$ is a unit of $\odot$ if $f$ is defined on $[\ ]$, which is a unit of $+\!\!+$ on $[A]$.
The First Homomorphism Theorem. ([1]) A function $f : [A]\rightarrow B$ can be computed by divide and conquer if and only if it is $\odot$-homomorphic for some $\odot$ on $B$.
It is clear that the $\odot$ operator captures the semantics of "combine" in a divide-and-conquer algorithm. In MapReduce, the conquer phase is done by a mapper and the combine phase by a reducer. However, as MapReduce does not specify the order of reduction, the $\odot$ operator (i.e., the reducer) has to be commutative for the outcome to be deterministic [2].
The above theorem tells us that a function is parallelizable iff it is homomorphic. Interestingly, it turns out that any non-homomorphic function can be trivially lifted to a homomorphism using a trick called "tupling". For example, this trick lifts $is\_sorted: [Num]\rightarrow Bool$ to a function $is\_sorted': [Num]\rightarrow [Num]\times Bool$ that is $\odot$-homomorphic with respect to $(s, a)\odot (t, b) \equiv (s$$+\!\!+$$t,\ is\_sorted(s$$+\!\!+$$t))$. A parallel implementation based on this lift is clearly no better than its sequential counterpart. Similarly, $sort:[Num]\rightarrow[Num]$ is $\odot$-homomorphic when $\odot$ is defined by $s \odot t \equiv sort'(s$$+\!\!+$$t)$, where $sort'$ can be any sorting algorithm such as insertion sort, though the resulting parallel algorithm is inefficient and useless. Therefore, finding an efficient parallel implementation for a function, parallelizable or not, amounts to finding a $\odot$ operator with low computation cost. (To be continued)

References

1. Bird, Richard S. "An introduction to the theory of lists." 1987.
2. Chen, Y. F., et al. "Commutativity of Reducers." 2015.

Tuesday, January 31, 2017

DFA as proof for unreachability

Preliminaries. Given a finite set $\Sigma$ and a binary relation $f:\Sigma^* \to \mathcal P(\Sigma^*)$, we use $F$ to denote the set function $\mathcal P(\Sigma^*) \to \mathcal P(\Sigma^*)$ defined by $F(S):= \bigcup_{v\in S}f(v).$ Also,$$F^0(S):= S;\quad F^{n+1}(S):= F(F^n(S))\cup F^n(S);\quad F^*(S):= \bigcup\nolimits_{n\ge 0} F^n(S).$$Intuitively, $F^*(S)$ is the set of words reachable from $S$ by applying $f$ finitely many times. A binary relation is called regular if it can be recognised by a finite automaton over the alphabet $(\Sigma\cup\{\epsilon\})\times(\Sigma\cup\{\epsilon\}).$
Unreachability analysis. Suppose $I$ and $B$ are two regular sets and $f$ is a regular relation. The unreachability problem $(I,F,B)$ asks whether $B$ is unreachable from $I$ through $F$, namely, whether $F^*(I)\cap B=\emptyset$. In software verification, many safety properties of software can be modelled as (un)reachability problems [1] over an infinite-state system. In particular, the semantics of a program is modelled by a binary relation over the configurations $\Sigma^*.$ If we let $I$ and $B$ denote the sets of initial and bad configurations, respectively, then $F^*(I)\cap B=\emptyset$ means that an execution of the program can never go wrong.
Tractability. Given $(I,F,B)$, $F^*(I)$ is not necessarily regular even though $F^n(S)$ is regular for all $n\ge0.$ Moreover, most interesting properties of $F^*(S)$ are undecidable. This fact is not surprising, as the transition relation of a Turing machine is regular. To make the verification problem more tractable, we assume that $f$ is length-preserving, i.e., $w\in f(u)$ only if $|w|=|u|$. This weakens the expressive power of the transition system $(I,f)$ from a Turing machine to a linear-bounded automaton (LBA), which may only use the space occupied by the input string. Membership of $F^*(I)$ becomes decidable with the length-preserving restriction, but unreachability remains undecidable, as intersecting an LBA language with a regular language yields an LBA language and emptiness of LBAs is undecidable. This follows from the fact that one can encode a given TM $T$ as an LBA $A$ such that $L(A)$ consists of precisely the halting runs of $T$.
Remark. While a regular transition system is as powerful as a TM (which can compute functions like $f(w)=w\#w$), not all transition systems are regular. For example, transition systems induced by higher-order pushdown automata allow transitions like $w \to w\#w$ and are thus not regular.
Regular proof synthesis. While the problem is undecidable, we can give a positive answer whenever we can find a proof. Such a proof can be a set $S$, called an inductive invariant, satisfying $B\cap S=\emptyset$ and $F^*(I)\subseteq S.$ Recall that in an earlier post we introduced L*, an algorithm that learns a DFA by querying the target language. Below, we construct a teacher for L* to learn a regular proof for the unreachability problem $(I,F,B).$ Upon a membership query for a word $w$, the teacher checks whether $I$ is reachable from $w$ through $f^{-1}$. This check is decidable thanks to the assumption that $f$ is length-preserving, which guarantees that a word is reachable from only finitely many words. Upon an equivalence query for a DFA $A$, the teacher performs the following four steps in turn:
Step 1. If $\exists w\in I\setminus L(A)$ then return $w$.  // check that $I\subseteq L(A)$
Step 2. If $\exists w\in L(A)\cap B$ then halt and report "reachable" if $Mem(w)=yes$; return $w$ otherwise.  // check that $L(A)$ doesn't overlap $B$
Step 3. If there exist $u\in L(A)$ and $w\in f(u)\setminus L(A)$, then return $w$ if $Mem(u)=yes$ and return $u$ otherwise.  // check that $L(A)$ is inductive
Step 4. Halt and report "unreachable".
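The teacher's membership query $Mem(w)$ can be sketched as a backward search (a sketch of mine with hypothetical interfaces, not code from the post): since $f$ is length-preserving, only finitely many words of length $|w|$ exist, so the search terminates.

```python
def mem(w, pre_images, is_initial):
    """Mem(w): is w reachable from the initial set through f?
    pre_images(u) returns {v : u ∈ f(v)}; the BFS stays within words of
    length |w|, a finite set, so it always terminates."""
    seen, frontier = {w}, [w]
    while frontier:
        u = frontier.pop()
        if is_initial(u):
            return True
        for v in pre_images(u):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return False
```

For instance, for a relation that rewrites one occurrence of "ba" into "ab", `pre_images(u)` would replace one occurrence of "ab" by "ba".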
If a candidate DFA $A$ passes Steps 1-3, then $L(A)$ is clearly an inductive invariant. Observe that each counterexample falls in the symmetric difference of $L(A)$ and $F^*(I).$ When $F^*(I)$ is regular, it follows from the complexity results of L* that an inductive invariant can be learnt with at most $n$ equivalence queries, where $n$ is the number of states in the minimal DFA of $F^*(I)$, and at most $m\!\cdot\!(n+1)$ membership queries, where $m$ is an upper bound on the length of the counterexamples returned by the teacher. Several notes are in order:
On the length-preserving restriction. We have assumed that the transition relation preserves lengths. In this way, a bad configuration is reachable from some initial configuration if and only if it can be reached within a bounded memory. This restriction is not essential for practical unreachability analysis as long as we allow "padding" in the initial set. That is, we pad every word $w$ in the initial configurations with an arbitrary number of placeholders, say $\bot$, so that $w\in I$ implies $w\bot^*\subseteq I$. In this way, the memory available for computing $w$ is finite but effectively unbounded. On the other hand, though length-preserving relations are powerful enough to model many useful transition systems, they cannot model systems that require $\epsilon$ in their alphabet, such as lossy channel machines.
Minimality of proof. While L* always learns a minimal DFA by the Myhill-Nerode theorem, here it does not necessarily learn the minimal inductive invariant in terms of set inclusion. Instead, depending on the counterexamples returned by the teacher, L* tries to learn some inductive invariant and, once it finds one, it learns the minimal DFA for that particular invariant.
Completeness of the method. When $F^*(I)$ is regular, L* is guaranteed to find a proof. When $F^*(I)$ is not regular, however, our method may not find a regular proof even when one exists. On the other hand, checking unreachability is in fact decidable when a regular proof exists, since we can launch two procedures in parallel, one searching for a counterexample and one enumerating candidate proofs. In this respect, L* learning is weaker in completeness than brute-force enumeration. (Note that finding a DFA for $F^*(I)$ may still be undecidable even when $F^*(I)$ is regular.)
A non-terminating example. Consider an unreachability problem $(I,F,B)$ where $I:=\{(ab)^n:n\ge 1\}$, $f:= \{(xbay, xaby): x,y\in \{a,b\}^*\}$, and $B$ is some trivial set such as $\emptyset.$ Note that $F^*(I)$ is not regular, since $F^*(I)\cap a^*b^*=\{a^nb^n:n\ge1\}$ is not regular. Now consider a teacher that implements Steps 1-4 for equivalence queries and returns a shortest counterexample for each candidate proof. Below we list the queries and the automaton $A$ learnt by L* in each iteration.
Iteration 1: $A=\emptyset$; containment check of $I$ fails with counterexample $ab$; add $ab.$
Iteration 3: $A=a(ba)^*b$; inductiveness check fails at $abab \to aabb$; add $aabb.$
digraph finite_state_machine {
rankdir=LR;
size="8,5"
node [shape = doublecircle]; 3 ;
node [shape = circle];
1 -> 2 [ label = "a" ];
2 -> 3 [ label = "b" ];
3 -> 2 [ label = "a" ];
"init" [shape = point];
"init" -> 1;
}
Iteration 4: $A=a(ba|ab)^*b$; inductiveness check fails at $aababb \to aaabbb$; add $aaabbb.$
digraph finite_state_machine {
rankdir=LR;
size="8,5"
node [shape = doublecircle]; 4 ;
node [shape = circle];
2 -> 3 [ label = "b" ];
1 -> 3 [ label = "a" ];
3 -> 2 [ label = "a" ];
3 -> 4 [ label = "b" ];
4 -> 3 [ label = "a" ];
"init" [shape = point];
"init" -> 1;
}
Iteration 5: $A=a(ba|a(ab)^*b)^*b$; inductiveness check fails at $aaababbb \to aaaabbbb$; add $aaaabbbb.$
digraph finite_state_machine {
rankdir=LR;
size="8,5"
node [shape = doublecircle]; 5 ;
node [shape = circle];
2 -> 3 [ label = "b" ];
3 -> 2 [ label = "a" ];
3 -> 4 [ label = "b" ];
1 -> 4 [ label = "a" ];
4 -> 3 [ label = "a" ];
4 -> 5 [ label = "b" ];
5 -> 4 [ label = "a" ];
"init" [shape = point];
"init" -> 1;
}
One can observe that L* gets trapped from the third iteration onwards: it keeps splitting state 2, creating a lineage that leads to a divergent sequence of counterexamples. On the other hand, L* does try its best after the third iteration: each time it receives a counterexample of length $2n$, it succeeds in finding an inductive proof for the reachable configurations of length $\le 2n$, namely $A_n:=\{w\in \{a,b\}^*:|w|\le 2n$ and $0\le \#a-\#b\le n$ in all prefixes of $w\}.$ Note that $|A_{n+1}|=|A_{n}|+1.$ Since L* progresses incrementally in terms of automaton size, and since $a^{n+1}b^{n+1}$ is the shortest counterexample for $A_n$, L* will never reach an automaton covering all $A_n$.
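The behaviour of the example relation can be checked concretely (a sketch of mine, not from the post): because $f$ is length-preserving, $F^*(S)$ is finite for finite $S$, so we can enumerate it exhaustively and watch the words $a^nb^n$ appear.

```python
def step(w):
    """f(w): all results of rewriting a single occurrence of "ba" into "ab"."""
    return {w[:i] + "ab" + w[i + 2:] for i in range(len(w) - 1)
            if w[i:i + 2] == "ba"}

def closure(words):
    """F^*(words), computed by exhaustive search over same-length words."""
    seen, frontier = set(words), list(words)
    while frontier:
        u = frontier.pop()
        for v in step(u):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

# Closure of I restricted to n ≤ 3, intersected with a*b*.
reachable = closure({"ab" * n for n in range(1, 4)})
fully_sorted = {w for w in reachable if "ba" not in w}   # F^*(I) ∩ a*b*
```

Restricted to initial words $(ab)^n$ with $n\le 3$, the intersection with $a^*b^*$ is exactly $\{ab, aabb, aaabbb\}$, matching the non-regular pattern $a^nb^n$.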

Learning with witnesses. We have discussed how to use the L* learning algorithm to prove $F^*(I)\cap B=\emptyset$. In order to answer membership queries, we assumed that the transition relation is length-preserving. However, it is possible to use L* without this assumption. For example, Vardhan et al. [3] proposed to learn a set of configuration-witness pairs as a proof, where a witness refers to the number of steps needed to reach a configuration. More precisely, the proof they proposed to learn is a regular subset of $((\Sigma\cup\{\bot\})\times\{0,1\})^*$, where $\bot$ serves as a padding symbol. Each word in this set encodes a pair $(c,n)$, meaning that $c$ is a configuration contained in $F^n(I)$. Let $[c,n]$ denote the encoding of $(c,n).$ Given $I$, $f$, and $B$, we define $I'$, $f'$, and $B'$ as
$\begin{eqnarray*}
I' & := & \{[c,0]: c\in I\};\\
f'(S) &:=& \{[f(c),n+1]: [c,n]\in S\};\\
B' &:= & \{[c,n]:c\in B, n\ge 0\}.
\end{eqnarray*}$
Note that if $I$, $f$, and $B$ are regular then so are $I'$, $f'$, and $B'.$
It is straightforward to learn a proof for the unreachability problem $(I',F',B')$ using L*: membership queries are decidable thanks to the witness component, and equivalence queries can be answered as before, aiming to find an inductive invariant as an over-approximation. After L* learns a proof for the new problem, we obtain a proof for the original problem by projection. Note, however, that the transformation from configurations to configuration-witness pairs doesn't preserve regularity of the reachability set. Hence, the transformed problem $(I',F',B')$ doesn't necessarily have a regular proof even when the original problem $(I,F,B)$ does. In particular, unlike the previous method, the learning process here is not guaranteed to terminate when $F^*(I)$ is regular. It would be interesting to characterise the classes for which this technique is effective, namely, the classes where regularity of $F^*(I)$ implies that of $F'^*(I')$.
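One possible two-track encoding of a pair $(c,n)$ (an assumption of mine for illustration, not necessarily the scheme of [3]): the first track spells $c$ padded with $\bot$, and the second track spells $n$ in little-endian binary padded with zeros.

```python
PAD = "⊥"

def encode(c, n):
    """Encode (c, n) as a word over (Σ ∪ {⊥}) × {0, 1}.  Assumes the
    configuration c itself contains no ⊥ symbols."""
    bits = []
    while n:
        bits.append(n & 1)
        n >>= 1
    length = max(len(c), len(bits), 1)
    track1 = c + PAD * (length - len(c))
    track2 = bits + [0] * (length - len(bits))
    return list(zip(track1, track2))

def decode(word):
    """Recover (c, n) from the two tracks."""
    c = "".join(a for a, _ in word).rstrip(PAD)
    n = sum(b << i for i, (_, b) in enumerate(word))
    return c, n
```

With this encoding, padding on the first track and zero bits on the second keep the two tracks the same length, so the set of encodings of a regular set of pairs remains a word language over a finite alphabet.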

References

1. A unified approach for studying the properties of transition systems, TCS 1982.
2. Learning to prove safety over parameterised concurrent systems, FMCAD 2017.
3. Learning to verify safety properties, ICFME 2004.

Thursday, January 28, 2016

Intuition behind a mathematical puzzle

Sometimes, it is helpful to reason about a mathematical puzzle in a concrete context. A solution may reveal itself after you reword the problem statement in intuitive terms. The following example shows how the solution to a seemingly nontrivial problem becomes obvious after the statement of the problem is reworded appropriately.
Example. Consider the following function:
function f (n: int) {
  if n == 1 return 0
  choose integers a, b ≥ 1 such that a + b == n
  return a * b + f(a) + f(b)
}
Our goal is to show that the output of f(n) does not depend on how a and b are chosen in the function body. This fact is far from obvious from inspecting the code. A standard approach to showing this kind of fact is mathematical induction. However, to establish the induction hypothesis, one must first express the value of f(n) in terms of $n$, which is not easy given that the values of a and b are unspecified. Besides, an inductive proof typically proceeds by manipulating formulas, in which case the proof itself offers little insight into the fact it proves.
On the other hand, the fact becomes evident once you see what the function intends to compute. The key observation is that f(n) effectively counts the number of handshakes in a group of $n$ people. A group of size $n$ is divided into two groups of sizes $a$ and $b.$ People from different groups then shake hands with each other. This process continues recursively until no smaller group can be formed. By then, each person has shaken hands with everybody else exactly once. Therefore, the value of f(n) is $n\cdot(n-1)/2$ regardless of how the groups are divided during the process.
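A quick experiment (a sketch of mine, with the base case f(1) = 0 so that f counts handshakes): however the split a + b == n is chosen, f(n) always comes out to n·(n−1)/2.

```python
import random

def f(n):
    """The recursion from the puzzle, with the split chosen at random."""
    if n == 1:
        return 0                    # one person: no handshakes
    a = random.randint(1, n - 1)    # an arbitrary split with a + b == n
    b = n - a
    return a * b + f(a) + f(b)
```

Running `f(10)` repeatedly always yields 45, independent of the random splits.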

A technique called double counting is closely related to the idea expressed above.
Example. We know that there are ${n \choose k}$ ways to select a set of size $k$ from a set $S$ of $n$ distinct elements. If $S$ is divided into two partitions of size $a$ and $b$, then selecting a subset from $S$ amounts to selecting a subset from each of the partitions respectively. Therefore, no matter how the partitions are divided, i.e., $a + b = n$, we have $$\sum_{i+j=k}{a \choose i}{b \choose j} = {n \choose k}$$(recall that ${n \choose 0}=1$ and ${n \choose k}=0$ for $k<0$ or $k>n$).
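The identity above (Vandermonde's identity) can be verified numerically; this is a check of mine, not part of the post.

```python
from math import comb

def split_count(a, b, k):
    """Sum over i + j = k of C(a, i) * C(b, j)."""
    return sum(comb(a, i) * comb(b, k - i) for i in range(k + 1))

# For any partition a + b = n, the split count equals C(n, k).
# Note that math.comb(n, k) already returns 0 for k > n, matching the
# convention recalled in the text.
```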

Tuesday, January 12, 2016

A note on Angluin’s Learning Algorithm

The L* Learning Algorithm, also known as Angluin’s Learning Algorithm for DFAs, was first developed by Angluin [1] and later improved by Rivest and Schapire [2]. L* learns an unknown regular language, say $U$ over an alphabet $\Sigma$, and produces a DFA $C$ such that $L(C) = U$. L* works by incrementally producing a sequence of candidate DFAs $C_1 , C_2, ...$ converging to $C$. In order to learn $U$, L* needs a Teacher to answer two types of questions. The first type is a membership query, consisting of a string $\sigma \in \Sigma^*$; the answer is true if $\sigma \in U$, and false otherwise. The second type is a conjecture, a candidate DFA $C$ whose language the algorithm believes to be identical to $U$. The answer is true if $L(C) = U$; otherwise the Teacher returns a counterexample, a string $\sigma'$ in the symmetric difference of $L(C)$ and $U$.
At a high level, L* maintains a table in which it incrementally records whether strings in $\Sigma^*$ belong to $U$, filling the cells with membership queries to the Teacher. At various stages, L* decides to make a conjecture: it constructs a candidate automaton $C$ based on the information in the table and asks the Teacher whether the conjecture is correct. If it is, the algorithm terminates. Otherwise, L* uses the counterexample returned by the Teacher to expand the table with strings that witness differences between $L(C)$ and $U$.
The Learner maintains a Boolean table with index set $X\times Y \subset \Sigma^*\times\Sigma^*.$ The rows of the table are indexed by $X$ and the columns by $Y$. Each cell $(x,y)$ indicates whether or not $xy \in U$. Two strings $a,b\in X$ are called $Y$-equivalent, written $a\equiv_Y b$, if $ay\in U \iff by\in U$ for all $y\in Y.$ Note that $a\equiv_Y b$ iff the rows indexed by $a$ and $b$ are identical binary strings. The table is called consistent iff no two distinct strings $x,x'\in X$ are $Y$-equivalent. The table is called closed iff for all $x\in X$ and $a\in\Sigma$, there exists $x'\in X$ such that $xa\equiv_Y x'$.
The table determines a reduced DFA when it is consistent and closed. The states of the DFA are $\{[x]_Y:x\in X\}$ (here $[x]_Y$ denotes the equivalence class of $x$ under $\equiv_Y$); the accepting states are $\{[x]_Y: x\in X\cap U\}$; the transition function $\delta:[X]_Y\times\Sigma\to [X]_Y$ is given by $\delta([x]_Y,a)=[xa]_Y.$ This DFA is reduced, as any two of its states can be distinguished by some string in $Y$ by the definition of consistency. The L* algorithm can now be sketched as follows:
Step 1. Start with a one-cell table with index set $\{\epsilon\}\times \{\epsilon\}$.
Step 2. Fill the table cells with membership queries, expanding the index set when necessary, until the table is consistent and closed. This procedure is guaranteed to find the minimal expansion of the table that is consistent and closed. Construct a conjecture DFA $C$ from the table.
Step 3. Check the conjecture using an equivalence query. If it is correct then output it and halt. Otherwise, expand the table with the provided counterexample and go to Step 2.
We give more details about Step 3. Suppose the Teacher returns a counterexample $w=w_1\dots w_n \in \Sigma^n$ for conjecture $C$. Let the trace of this string in $C$ be$$x_0\overset{w_1}{\longrightarrow}x_1\overset{w_2}{\longrightarrow}\cdots\overset{w_n}{\longrightarrow}x_n,$$for $x_0,...,x_n\in X.$ Then there is $1\le i\le n$ such that $x_i w_{i+1}...w_n \in U \iff w \not\in U.$ The table is then expanded by including $x_{i-1}w_i$ and $w_{i+1}...w_n$ in $X$ and $Y$, respectively.
Recall that the number of states of a conjecture DFA equals the number of distinct rows in the table, and that the number of distinct rows increases by at least one each time an equivalence query fails. The L* algorithm therefore terminates after making at most $|C|-1$ incorrect equivalence queries and learns the reduced automaton $C$ for the target language.
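The two table conditions can be sketched compactly (my own helper names, not from the references): rows are indexed by X, columns by Y, and each cell is answered by a membership query.

```python
def row(x, Y, mem):
    """The row of x: the bit-vector of membership answers for each column."""
    return tuple(mem(x + y) for y in Y)

def is_consistent(X, Y, mem):
    """Consistent: no two distinct strings in X are Y-equivalent,
    i.e. all rows are pairwise distinct."""
    rows = [row(x, Y, mem) for x in X]
    return len(set(rows)) == len(rows)

def is_closed(X, Y, alphabet, mem):
    """Closed: every one-letter extension of a row matches some existing row."""
    rows = {row(x, Y, mem) for x in X}
    return all(row(x + a, Y, mem) in rows for x in X for a in alphabet)
```

For example, with $U$ = words over {a, b} containing an even number of a's, the table with $X=\{\epsilon, a\}$ and $Y=\{\epsilon\}$ is consistent and closed, and induces the expected two-state DFA.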

Difficulty of learning DFAs

Difficulty of offline learning. Let $\mathcal{A}, \mathcal{B} \subset \Sigma^*$ be two finite sets of strings. Given $k$, it is NP-complete to decide whether there exists a DFA $D$ with $k$ states such that $\mathcal{A} \subseteq L(D)$ and $\mathcal{B} \subseteq \Sigma^* \setminus L(D)$. It is NP-hard to find such a DFA of minimum size; in fact, even finding a DFA whose number of states is polynomial in that of the minimum one is NP-hard. If all strings of length $n$ or less are given, the problem can be solved in time polynomial in the input size. Note, however, that the input size is itself exponential in the number of states of the resulting DFA. Angluin has shown that the problem remains NP-complete if an arbitrarily small fixed fraction of the input set is missing.
Difficulty of online learning. Learning a minimal DFA becomes easier if the algorithm is allowed to make queries to the unknown automaton. The L* algorithm proposed by Angluin [1] solves the problem in polynomial time by allowing the algorithm to ask membership and equivalence queries. Rivest and Schapire [2] later improved the algorithm so that it does not need to reset the unknown automaton to the initial state after each query.

References

1. D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 1987.
2. R. L. Rivest and R. E. Schapire. Inference of finite automata using homing sequences. Information and Computation, 1993.
3. Michael Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
4. Ron Rivest, Machine Learning Lecture Notes at MIT, Lecture 23 & 24.
5. Rajesh Parekh, Vasant Honavar. Learning DFA from Simple Examples. Machine Learning, 2001.

Monday, November 2, 2015

A note on the Bellman–Ford algorithm

The Bellman–Ford algorithm is an elegant solution to the single-source shortest-path problem, stated as follows: given a weighted digraph $G=(V,E)$, let $w(e)\in \mathbb R$ denote the weight of edge $e$. Fix a source vertex $s \in V$, and let $dist(v)$ be the distance from $s$ to $v$. That is,
$dist(v) = \min\{\ \sum_{e\in P} w(e)$ : $P \subseteq E$ is a path from $s$ to $v\ \}.$
The goal of the problem then is to compute $dist(v)$ for each $v \in V$.
The Bellman–Ford algorithm maintains a map $d: V \rightarrow \mathbb R$ to store tentative distance values. We use $d_k(v)$ to denote the value of $d(v)$ computed in the $k$th iteration. Let $d_0(v)=\infty$ for $v \neq s$ and $d_0(s)=0$. In the $n$th iteration, $d_n(v)$ is computed from $d_{n-1}$ as follows:
$d_n(v)=\min\{\ d_{n-1}(v),\ \min\{\ d_{n-1}(u) + w(u,v) : (u, v)\in E\ \}\ \}$.
Here are some elementary facts about Bellman-Ford algorithm:
  • The original version of the algorithm runs for a fixed number of $|V|$ iterations. In each iteration, the distance values are updated by looping through the edges. More precisely, for each $(u, v) \in E$, we relax it by setting $d_n(v)=\min\{d_{n-1}(v), d_{n-1}(u) + w(u, v)\}$. The time complexity of the algorithm is thus $O(|V||E|)$.
  • Loop invariant: The following constraint holds for $n\ge 0$ and each $v\in V$:
    $dist(v) \le d_{n+1}(v) \le d_n(v).$
  • It turns out that $d_n(v)$ is the length of the shortest path from $s$ to $v$ with at most $n$ edges. Since a cycle-free path contains at most $|V|-1$ edges, $d_{|V|-1}(v) = dist(v)$ if a negative cycle cannot be reached from the source vertex $s$.
  • In fact, $G$ contains no reachable negative cycle if and only if $d_n$ converges before $n=|V|$. This fact yields an $O(|V||E|)$ algorithm for negative-cycle detection.
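The facts above translate directly into code (a sketch of mine, not from the post): run $|V|-1$ rounds of relaxation, then one extra round to detect a reachable negative cycle.

```python
INF = float("inf")

def bellman_ford(vertices, edges, source):
    """edges: list of (u, v, weight) triples.
    Returns the distance map, or None if a negative cycle is reachable
    from the source."""
    d = {v: INF for v in vertices}
    d[source] = 0
    for _ in range(len(vertices) - 1):       # |V|-1 rounds suffice
        for u, v, w in edges:
            if d[u] + w < d[v]:              # relax edge (u, v)
                d[v] = d[u] + w
    # d has converged after |V|-1 rounds iff no reachable negative cycle:
    # one more round of checks detects any remaining improvable edge.
    for u, v, w in edges:
        if d[u] + w < d[v]:
            return None
    return d
```

This version relaxes edges in place, so as noted below it may in fact converge in fewer rounds than the worst-case bound.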

Optimizations

1. One may immediately observe that if $d_n(v)=d_{n-1}(v)$, then we don't have to relax the outgoing edges $(v, u)$ in the $n$th iteration, leading to a constant-factor saving in time. Also, it is possible that $d_n$ stabilizes at some $n$ less than $|V|-1$, allowing us to terminate the main loop early. However, there are graphs where $|V|-1$ iterations are needed in the worst case.
2. In the definition above, we update $d_n$ from the values of $d_{n-1}$, so the values of $d_n$ don't depend on the order in which the edges are relaxed. In practice, $d_n$ and $d_{n-1}$ can refer to the same map, in which case the order of relaxation affects the efficiency of the algorithm. For example, when no negative cycle is reachable from the source, the number of iterations can be reduced to $\lceil |V|/2 \rceil$ by assigning the vertices an arbitrary linear order and relaxing the edges according to a topological sort induced by that order (Yen 1970). Furthermore, if the arbitrary linear order is replaced by a random permutation, the expected number of iterations in the main loop is at most $|V|/3$ (Bannister & Eppstein 2012).
Unfortunately, optimizations based on the order of relaxation cannot be applied as easily in the distributed settings.

Distributed adaptation

The Bellman–Ford algorithm is particularly suitable (compared to Dijkstra’s algorithm) for fully distributed platforms. First, it allows the distance value of each vertex to be updated asynchronously, so we don't need shared memory. Also, its correctness doesn't rely on the order in which the edges are relaxed, so we don't need a master node to coordinate the others. The following pseudocode describes a simple adaptation of the Bellman–Ford algorithm to a message-passing system. Note that each $d(v)$ is stored in the local memory at vertex $v$ and cannot be accessed by other vertices.
Initialization:
    Set $d(v)=\infty$ at each vertex $v\neq s$ and set $d(s)=0$ at the source vertex $s$.
    Set $cnt(v)=0$ at each vertex $v$.
    Send $\mathtt{update}(w(s, v))$ to vertex $v$ for each edge $(s, v)$.

When vertex $v$ receives message $\mathtt{update}(t)$:
    If $t<d(v)$ then
        Set $cnt(v) = cnt(v) + 1$
        If $cnt(v) \ge |V|$ then
            Report that $v$ is on a negative cycle.
        else
            Set $d(v)=t$ and send $\mathtt{update}(t+w(v, u))$ to $u$ for each edge $(v, u)$.

Note that this algorithm is fully distributed: it doesn't need any synchronization mechanism, and initialization can be done by broadcasting a special reset message. On the other hand, since the number of messages a vertex will receive is indefinite, some vertices may never know whether they have already computed the shortest distance.

Applications

Optimal tokenization. Given a sequence of characters and a map from words to scores, insert delimiters into the sequence so that the resulting set of words has the maximum score. Hint: model the positions of the sequence as vertices. A substring forms a weighted edge if it is a word with a nonzero score. Consider a shortest path from the first vertex to the last one.
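Following the hint (with a hypothetical score map of my own), positions $0..n$ are vertices and each scored word $s[i:j]$ is an edge from $i$ to $j$; since all edges go forward, a maximum-score path can be found by a simple DP over positions (equivalently, a shortest path after negating the scores).

```python
def best_tokenization(s, score):
    """Maximum total score of a segmentation of s into words from `score`,
    or None if s cannot be fully segmented."""
    n = len(s)
    best = [None] * (n + 1)   # best[j]: max score of a tokenization of s[:j]
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            w = s[i:j]        # candidate edge (i, j)
            if best[i] is not None and w in score:
                cand = best[i] + score[w]
                if best[j] is None or cand > best[j]:
                    best[j] = cand
    return best[n]
```

For example, with scores {"a": 1, "ab": 3, "b": 1}, the string "aba" is best split as "ab" + "a" for a total of 4.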
(More to come)

References

1. Bellman–Ford algorithm on Wikipedia.
2. Sedgewick's web exercises for Algorithms, 4th ed.

Thursday, July 16, 2015

Correctness specification for Pregel programs

In this post, we show how to prove correctness of programs written in the Pregel abstraction by specifying loop invariants and variants for the supersteps. This post is only a sketch and will be elaborated incrementally when I have time.

In the following examples, we use $G=(V,E)$ to denote the base graph. Without loss of generality, we maintain a global counter recording the number of supersteps taken so far. Let $C_v(n)$ denote the attribute of vertex $v$ in the $n$-th Pregel iteration. A state of a Pregel program is a tuple $(k, G, \{C_v(k): v \in V\})$, where $k$ denotes the value of the global counter. We use $c(s)$ to denote the counter value in state $s$. An invariant for a Pregel program is a sequence of assertions $\{f_n\}_{n\ge0}$ over program states, such that each $f_k$ holds after the $k$-th superstep (i.e., $c(s)=k$ implies that $f_k(s)$ is true). Moreover, $f_{k+1}$ should hold given that $f_{k}$ holds and that there are active vertices after the $k$-th superstep. A variant $V$ for a Pregel program is a mapping from program states to a poset $(P,\succ)$, such that (i) $P$ doesn't contain an infinite descending chain, and (ii) $s_1\leadsto s_2 \wedge c(s_2)>c(s_1)$ implies $V(s_1)\succ V(s_2)$. Here we simply take this poset to be $(\mathbb N\cup\{0\}, >)$.

Our first example is the connected components algorithm from the Spark library. After it terminates, this algorithm assigns each vertex the smallest vertex id among the vertices in the same connected component. The vertex attribute $C_v(n)$ records the lowest vertex id that vertex $v$ has found up to the $n$-th superstep. Let $C_v(0)=id(v)$ for all $v\in V$. It can be shown that
Invariant: $f_n$ $\equiv$ "$C_v(n) = \min(\{C_v(n-1)\}\cup\{C_u(n-1) : u\in N(v)\})$"
Variant: $V \equiv \sum_{v\in V} C_v(n)$.
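The supersteps can be simulated sequentially (a sketch of mine, independent of the Spark implementation): each round takes the minimum over a vertex's own value and its neighbours' values, and the algorithm stops when no value changes, i.e., when no vertex remains active.

```python
def connected_components(vertices, edges):
    """Iterate C_v(n) = min over {v} ∪ N(v) of C(n-1) until a fixpoint."""
    nbrs = {v: set() for v in vertices}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    c = {v: v for v in vertices}            # C_v(0) = id(v)
    while True:
        new = {v: min([c[v]] + [c[u] for u in nbrs[v]]) for v in vertices}
        if new == c:                        # no active vertices left
            return c
        c = new
```

At the fixpoint, every vertex holds the smallest id in its connected component, and the sum of the $C_v$ values (the variant) decreases strictly in every superstep before that.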

The second example is the shortest landmark distances algorithm from the Spark library. Given a set of landmarks $L \subset V$, the algorithm computes for each vertex $v$ a mapping $C_v(n): L \rightarrow Int$, such that $C_v(n)(u)$ records the shortest distances from $v$ to $u$ that vertex $v$ has found up to the $n$-th superstep. For all $v,u\in V$, let $C_v(0)(u) = 0$ if $v=u \in L$, and $C_v(0)(u) = \infty$ otherwise. It turns out that
Invariant: $f_n$ $\equiv$ "For each $u \in L$, $C_v(n)(u)$ is the distance of the shortest paths from $v$ to $u$ with at most $n$ edges."
Variant: $V \equiv \sum_{v\in V}\sum_{u\in L} C_v(n)(u)$.

Sometimes, loop invariants and variants cannot be found so intuitively, even for simple Pregel programs. Consider the following vertex coloring algorithm, which colors the vertices using at most $\Delta+1$ colors, where $\Delta$ denotes the maximum degree of the base graph. Let the set of available colors be cyclically ordered, with cardinality $\Delta+1$.
  Initialize all vertices to the same color.
  while there exists $(v, u) \in E$ such that $v$ and $u$ share the same color do
      Let $w$ denote the vertex from $\{v, u\}$ with the larger vertex id.
      Set the color of $w$ to the minimal color larger than its current color.
  done
Obviously, a $(\Delta+1)$-coloring of the graph exists. Besides, the algorithm terminates only if it has found a proper coloring. The question is now: can this algorithm fail to terminate, say, under some order in which the vertices are recolored? A variant cannot be found as easily as in the previous examples, because the relation $\succ$ must be irreflexive, yet the number of colliding vertices doesn't strictly decrease as the algorithm iterates. After some reflection, however, we can still specify a variant as follows:

Variant: $V \equiv $ the maximal number of vertices whose colors jointly extend to a proper $(\Delta+1)$-coloring of the graph.

More precisely, after the $n$-th superstep finishes, there exist at least $n$ vertices in the graph such that the coloring of these $n$ vertices is a subset of a proper $(\Delta+1)$-coloring of the graph. Observe that the existence of $V$ here happens to ensure both the termination and the correctness of this vertex coloring algorithm.
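To see the algorithm in action, here is a direct Python transcription of the while-loop above (a sequential sketch of mine; vertex ids are 0-based and the cyclically ordered colors are $\{0,\dots,\Delta\}$):

```python
def greedy_recolor(edges, n_vertices):
    """Direct transcription of the recoloring loop; colors are 0..delta, cyclic."""
    nbrs = {v: set() for v in range(n_vertices)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    k = max(len(nbrs[v]) for v in range(n_vertices)) + 1   # delta + 1 colors
    color = [0] * n_vertices                # all vertices start with the same color
    while True:
        clash = next(((v, u) for v in range(n_vertices) for u in nbrs[v]
                      if v < u and color[v] == color[u]), None)
        if clash is None:
            return color                    # proper (delta+1)-coloring found
        w = max(clash)                      # the endpoint with the larger id
        color[w] = (color[w] + 1) % k       # next color in the cyclic order
```

On a triangle this terminates after three recolorings, consistent with the variant: each superstep fixes at least one more vertex into a proper coloring.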

Monday, June 15, 2015

Writing decentralized algorithms with Pregel: A case study

In this post, we demonstrate how to implement a purely functional decentralized algorithm using the Pregel abstraction offered by the Spark GraphX library.

A vertex coloring of a graph is a way to color the vertices such that no two adjacent vertices share the same color. Formally, given an undirected graph $G=(V,E)$, a k-coloring of $G$ is a map $f: V \rightarrow \{1,...,k\}$ such that $f(v) \neq f(u)$ for any $(v,u) \in E$. A graph is called k-colorable if it has a $k$-coloring.

It takes linear time to find a $(\Delta+1)$-coloring for any graph with maximum degree $\Delta$. On the other hand, finding a 4-coloring for a 3-colorable graph is already NP-hard. There are many randomized algorithms in the literature that color a graph in exponential time and linear space. Among them, the Communication-Free Learning (CFL) algorithm [1] is a fully decentralized algorithm inspired by the problem of allocating non-interfering channels in IEEE 802.11 wireless networks.

The CFL algorithm works as follows. For each vertex $v \in V$, let $C_n(v) \in \{1,...,k\}$ denote the color of $v$ in the $n$-th iteration. Also, let $P_n(v,c)$ denote the probability that vertex $v$ has color $c$ in the $n$-th iteration. That is, $\Pr\{C_n(v) = c\} = P_n(v,c)$. At $n = 0$, the color of each vertex $v$ is chosen uniformly at random, i.e., $P_0(v, c) = 1/k$ for all $v \in V$ and $c = 1,...,k$. For $n \ge 1$, the color of $v$ is chosen randomly according to the following rule: if no vertex adjacent to $v$ shares the same color with $v$, then we let $P_n(v, c) = 1$ for $c = C_{n-1}(v)$ and let $P_n(v, c) = 0$ otherwise. Note that this implies $C_n(v) = C_{n-1}(v)$. If the color of $v$ collides with that of some neighbor, then we choose $C_n(v)$ according to the probability distribution $P_n(v, \cdot)$, where for $c = 1,...,k$,
$$ P_{n}(v,c)=\begin{cases} (1-\beta)\cdot P_{n-1}(v,c), & c=C_{n-1}(v)\\ (1-\beta)\cdot P_{n-1}(v,c)+\beta\cdot(k-1)^{-1}, & \mbox{otherwise,} \end{cases} $$ where $\beta \in (0,1)$ is a pre-determined parameter. Observe that the vertex colors stabilize if and only if they form a $k$-coloring.
The CFL algorithm can be executed via the following decentralized procedure:
  1. Set up $C_0(v)$ and $P_0(v,\cdot)$ for each vertex $v$ and let all vertices be active.
  2. A) An active vertex $v$ is deactivated if no vertices in $N(v)$ share the same color with $v$. An inactive vertex $v$ is activated if there is a vertex in $N(v)$ sharing the same color with $v$.
    B) Each vertex $v$ computes $C_n(v)$ and $P_n(v,\cdot)$ upon the $n$-th change of the vertex state (from active to inactive or vice versa).
  3. Repeat Step 2 until all vertices are simultaneously inactive.
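The update rule can be checked with a small centralized simulation before moving to Pregel. Below is a Python sketch (the function name, the `beta` default, and 0-based colors are my choices, not part of the Spark code that follows):

```python
import random

def cfl_coloring(edges, n_vertices, k, beta=0.5, max_iters=10000, seed=0):
    """Centralized simulation of the CFL update rule; colors are 0..k-1."""
    rng = random.Random(seed)
    nbrs = {v: set() for v in range(n_vertices)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    P = [[1.0 / k] * k for _ in range(n_vertices)]    # P_0(v, .) is uniform
    color = [rng.randrange(k) for _ in range(n_vertices)]
    for _ in range(max_iters):
        collide = [any(color[u] == color[v] for u in nbrs[v])
                   for v in range(n_vertices)]
        if not any(collide):
            return color                              # proper k-coloring found
        for v in range(n_vertices):
            if collide[v]:
                # keep (1-beta) of the old mass; spread beta over the other k-1 colors
                P[v] = [(1 - beta) * p + (0.0 if c == color[v] else beta / (k - 1))
                        for c, p in enumerate(P[v])]
                color[v] = rng.choices(range(k), weights=P[v])[0]
            else:
                P[v] = [1.0 if c == color[v] else 0.0 for c in range(k)]
    return None                                       # did not stabilize in time
```

With a fixed seed the run is reproducible, and the colors stabilize exactly when they form a proper $k$-coloring, mirroring the observation above.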
Thus we implement the algorithm by carrying out the above procedure with Pregel. A vertex $v$ is represented as a 4-tuple
type Palette = (Color, List[Double], Boolean, Random)
composed of the vertex color $C_n(v)$, the color distribution $P_n(v,\cdot)$, the vertex state (active/inactive), and a random number generator, in that order. Given a graph RDD graph, we first construct its base graph and initialize the attributes of the vertices:
val seed = Random.nextInt
val baseGraph = graph.mapVertices((id, _) => {
  val rng = new Random(seed + id)
  (rng.nextInt(k) + 1, initColorDist, true, rng)
})
Note that we choose to maintain an RNG at each vertex. One may feel tempted to declare a global RNG and share it among the vertices. However, since Spark serializes the closure as it dispatches jobs, doing so will lead each vertex to use a copy of the global RNG instead of sharing it. Hence, all vertices will obtain the same sequence of random numbers upon their queries, which is definitely an undesired situation.
We use functions sendMsg and mergeMsg to carry out Part A of the second step in the procedure:
def sendMsg (edge: EdgeTriplet[Palette, _]): Iterator[(VertexId, Boolean)] = {
  if (edge.srcAttr._1 == edge.dstAttr._1)
    return Iterator((edge.srcId, true))
  if (edge.srcAttr._3)
    return Iterator((edge.srcId, false))
  Iterator.empty
}
def mergeMsg (msg1: Boolean, msg2: Boolean) = msg1 || msg2
For each edge $(v, u)$, the sendMsg function can activate or deactivate the source vertex $v$ by sending it a Boolean message. The messages are aggregated by mergeMsg via disjunction, such that a vertex is activated if any of its neighbors wants to activate it, and is deactivated if all of its neighbors want to deactivate it.
Part B of the second step is carried out by the vprog function:
def vprog (id: VertexId, attr: Palette, active: Boolean): Palette = {
  val color = attr._1
  val dist  = attr._2
  val rng   = attr._4
  val newDist = dist.foldLeft((1, List[Double]())) {
    case ((i, list), weight) => (i + 1,
      if (active)
        list :+ (weight * (1 - beta) + (if (color == i) 0.0 else beta / (maxNumColors - 1)))
      else
        list :+ (if (color == i) 1.0 else 0.0))
  }._2
  val newColor = if (active) sampleColor(newDist, rng.nextDouble) else color
  (newColor, newDist, active, rng)
}
The vprog function uses a helper function sampleColor to color a vertex randomly according to its current color distribution:
def sampleColor (dist: List[Double], rnd: Double): Int = {
  dist.foldLeft((1, 0.0)) {
    case ((color, mass), weight) => {
      val m = mass + weight
      (if (m < rnd) color + 1 else color, m)
    }}._1
}
Finally, we invoke Pregel with the above functions to compute a $k$-coloring for the base graph baseGraph:
Pregel(baseGraph, true)(vprog, sendMsg, mergeMsg).mapVertices((_, attr) => attr._1)
It is shown in [2] that the CFL algorithm colors a graph in exponential time with high probability. More precisely, given any $\epsilon \in (0,1)$, the algorithm can find a $k$-coloring for a $k$-colorable graph with probability $1-\epsilon$ in $O(N\exp(cN\lg(1/\epsilon)))$ iterations, where $N$ is the number of vertices and $c$ is a constant depending on $\beta$.

References

1. D. J. Leith, P. Clifford, "Convergence of distributed learning algorithms for optimal wireless channel allocation." IEEE CDC, 2006.
2. K. R. Duffy et al., "Complexity analysis of a decentralised graph colouring algorithm." Information Processing Letters, 107(2), 2008.

Sunday, March 22, 2015

Designing algorithms with Spark aggregate – Part II

In the previous post, I proposed a framework for Spark aggregate and listed several examples derived from the framework. One may observe that the $\mathrm{comb}$ functions are associative and commutative in all but the last example. One may also observe that the results of the aggregate calls are deterministic in all but the last example. Now a question arises: is it necessary to use an associative and commutative $\mathrm{comb}$ function in an aggregate call for the result to be deterministic? The answer is (somewhat surprisingly) no. In this post, I shall give a simple guideline for finding deterministic aggregate examples with non-commutative $\mathrm{comb}$ functions.

Given domain $A$ and range $B$, the idea is to find $B' \subset B$, $\mathrm{seq}: B \rightarrow A \rightarrow B'$ and $\mathrm{comb}: B \rightarrow B \rightarrow B'$, such that $\mathrm{aggregate(zero,\ seq,\ comb,}\ as)$ is deterministic w.r.t. the provided $\mathrm{zero} \in B$ and all $as \in A^+$. Moreover, $\mathrm{comb}$ is non-commutative over $B$, but the restriction of $\mathrm{comb}$ to $B'$ is associative and commutative.

The above idea can be instantiated using an abelian subgroup $(B', ⊗)$ of a non-abelian group $(B, ⊗)$. For example, we can let $B'$ be a cyclic subgroup of a symmetric group $B$. Here ⊗ is function composition and $A$ is the set of integers:

$\quad \mathrm{zero}$: an arbitrary element from $B$.
$\quad \mathrm{seq}\ (b, n) = b^n$.
$\quad \mathrm{comb}\ (b_1, b_2) = b_1 ⊗ b_2$.

Observe that the range of $\mathrm{comb}$ is confined to $B'$ due to the closure property of the group operator. For another example, we can let $B$ be the group comprised of all non-singular 2-by-2 matrices, and $B'$ be the subgroup formed by the rotation matrices in $B$. Here we set $A = B$ and let ⊗ be matrix multiplication:

$\quad \mathrm{zero}$: the identity matrix.
$\quad \mathrm{seq}\ (b, a) = b ⊗ [a]$, where $[x] = x \cdot |\det(x)|^{-1}$, the normalized matrix of $x$.
$\quad \mathrm{comb}\ (b_1, b_2) = b_1 ⊗ b_2$.

We can also construct examples using an abelian base group. We just have to "twist" the $\mathrm{comb}$ function so that it is commutative only over $B'$. For example, let $B$ be the set of n-by-n square matrices, and $B'$ be the subset of $B$ formed by all matrices whose last row is zero. Moreover, set $A = B$ and use matrix addition $+$ as the commutative group operator. Then we have the following example:

$\quad \mathrm{zero}$: the zero matrix.
$\quad \mathrm{seq}\ (b, a) = b + \{a\}$, where $\{x\}$ is obtained from setting the last row of ${x}$ to zero.
$\quad \mathrm{comb}\ (b_1, b_2) = b_1 + \langle b_2\rangle$, where $\langle x\rangle$ is obtained from doubling the last row of $x$.

Observe that $\mathrm{seq}$ and $\mathrm{comb}$ behave just like ordinary matrix additions over $B'$. They however become non-commutative operators when they are applied to elements outside $B'$. It turns out that we can use a textbook of elementary abstract algebra as a free source of interesting aggregate examples, at least in theory.
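To see the last construction concretely, here is a Python sketch of the twisted example with $\{x\}$ and $\langle x\rangle$ as helper functions (2-by-2 matrices as nested lists; all names are mine). It checks that comb fails to commute on $B$ while the partial results, which always stay in $B'$, commute:

```python
def zero_last_row(m):            # {x}: set the last row to zero
    return [m[0][:], [0] * len(m[1])]

def double_last_row(m):          # <x>: double the last row
    return [m[0][:], [2 * e for e in m[1]]]

def add(m1, m2):                 # ordinary matrix addition
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def seq(b, a):                   # seq(b, a) = b + {a}
    return add(b, zero_last_row(a))

def comb(b1, b2):                # comb(b1, b2) = b1 + <b2>
    return add(b1, double_last_row(b2))

zero = [[0, 0], [0, 0]]          # the zero matrix
m, n = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
```

Since every partial result produced by seq has a zero last row, doubling that row is a no-op, so comb degenerates to plain matrix addition on the values it actually sees.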

Saturday, February 21, 2015

An example of online algorithm

Online algorithms. An online algorithm is designed to receive and serve a finite but unbounded sequence of requests $\sigma=\sigma(1),\sigma(2),...,\sigma(m)$ using bounded memory. The requests are served in order of occurrence. When serving request $\sigma(t)$, the algorithm has no knowledge of the requests $\sigma(t')$ for $t'>t$. The total number of requests is unknown until the last request is served. Serving requests incurs costs. The goal of the algorithm designer is to minimize the total cost paid to serve the entire request sequence.
Competitive analysis. The performance of an online algorithm is usually evaluated by comparing it with an optimal offline algorithm in the worst case. Let $ALG$ be an online algorithm, and $OPT$ be an algorithm that knows all the requests in advance, so that it can serve them at minimal cost. Given a request sequence $\sigma$, let $ALG(\sigma)$ and $OPT(\sigma)$ denote the costs incurred by $ALG$ and $OPT$, respectively. Algorithm $ALG$ is called $c$-competitive if there exists a constant $b$ such that for all sequences $\sigma$, $$ALG(\sigma) \le c · OPT(\sigma) + b.$$The competitive ratio of algorithm $ALG$ is defined as $c(ALG)=\sup_{\sigma}ALG(\sigma)/OPT(\sigma).$ The best competitive ratio an algorithm can achieve is thus $c^*=\inf_{ALG}c(ALG)$.

Example. The ski rental problem is one of the classic online optimization problems. The scenario is as follows. You are on vacation at a ski resort without skis. You can either rent or buy skis from the local shops. You are determined to ski on every sunny day. Suppose renting skis costs one dollar per day, and buying skis costs $s>1$ dollars. Denote the number of sunny days in your vacation by $d$. Clearly, if $d$ is known beforehand (e.g. from a magical local weather forecast service), then the optimal (offline) strategy is to rent skis all the way when $d<s$, and to buy skis on the first day when $d\ge s$. Now, let's assume you are in the opposite situation, where $d$ is unknown and can't be estimated. You need an online algorithm to help you decide whether to rent or buy skis.
Suppose a deterministic online algorithm tells you to rent skis up to day $k$ and buy skis on day $k+1$ when $k<d$. Then the competitive ratio of this algorithm is computed by\begin{align*}\label{eq:eg1} \max\left\{\min\limits_{k\ge d}\max\limits_{k\ge d, s\ge d}\left\{\frac{d}{d}\right\}, \min\limits_{k\ge d}\max\limits_{k\ge d, s<d}\left\{\frac{d}{s}\right\}, \min\limits_{k<d}\max\limits_{k< d, s\ge d}\left\{\frac{k+s}{d}\right\}, \min\limits_{k<d}\max\limits_{k < d, s<d}\left\{\frac{k+s}{s}\right\}\right\}. \end{align*}The four terms in the braces correspond to the online/optimal cost ratios in the four cases. In each case, the ratio is first maximized by the adversary (i.e. the weather), which chooses $d$ knowing the values of $k$ and $s$. The online algorithm then minimizes the maximal ratio (i.e. the worst case) by selecting $k$ based on the choice of $d$. In this way, we obtain a tight upper bound on the cost ratio for each case. Since competitive analysis is a worst-case analysis, the largest of the four upper bounds is chosen as the competitive ratio:$$\max\left\{1, \frac{s+1}{s}, \min_{k<s}\left\{\frac{k+s}{k+1}\right\}\right\} = \max\left\{1, \frac{s+1}{s}, \frac{2s-1}{s}\right\} = 2 - \frac{1}{s}.$$This ratio is attained by renting for $s-1$ days and buying on day $s$ when $s\le d.$ No deterministic algorithm can achieve a competitive ratio smaller than $2-\frac{1}{s}$.
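The case analysis above is easy to verify numerically; a Python sketch (the `horizon` cap is merely an artifact of brute-forcing over $d$):

```python
def online_cost(k, s, d):
    """Cost of renting through day k and buying on day k+1 (if the vacation lasts)."""
    return d if d <= k else k + s

def opt_cost(s, d):
    """Offline optimum: rent all the way if d < s, otherwise buy on day one."""
    return min(d, s)

def competitive_ratio(k, s, horizon=1000):
    """Worst-case cost ratio of strategy k against an adversary choosing d."""
    return max(online_cost(k, s, d) / opt_cost(s, d) for d in range(1, horizon))
```

For $s=10$, the strategy $k=s-1=9$ attains the ratio $2-1/s=1.9$, and no other choice of $k$ does better.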

Randomized online algorithms. In many problems, online algorithms can achieve better performance if they are allowed to make random choices. A randomized online algorithm $ALG$ is called $c$-competitive if there is a constant $b$ such that $E[ALG(\sigma)] \le c ·OPT(\sigma)+b$ for all request sequences $\sigma$, where the expectation is taken over the random choices made by $ALG$. The best competitive ratio for randomized online algorithms is thus $c^*=\inf_{ALG}\sup_{\sigma}E[ALG(\sigma)]/OPT(\sigma).$ For simplicity, here we only consider requests generated by an oblivious adversary, that is, an adversary that generates the entire request sequence before any requests are served and any random choices are made. One can also consider competitive ratios against more powerful adversaries; see [3] for more details.

Example. In the ski rental problem, a randomized algorithm chooses some day $k$ at random according to some distribution, rents for up to $k$ days and buys skis at day $k+1$. Assuming the distribution of $k$, the adversary would choose $d$ to maximize the expected online/optimal cost ratio in each case. The actual distribution of $k$ is then determined to minimize the maximal expected ratio over all cases. It turns out that the optimal competitive ratio of randomized online algorithms is roughly $\frac{e}{e-1}\approx 1.58$ [1,2]. (TODO: give the derivation of $\frac{e}{e-1}$.)

Lowerbound analysis. A randomized algorithm may be interpreted as a random choice among deterministic algorithms. In this sense, the problem of finding the best competitive ratio may be viewed as a 2-person zero-sum game, where i) the players are the Algorithm Designer and the Adversary, ii) Adversary’s strategies are all request sequences, iii) Designer's strategies are all deterministic online algorithms, and iv) the payoff for Designer is $ALG(\sigma)/OPT(\sigma)$. In addition, the players are allowed to use randomized strategies. Suppose that Designer chooses algorithms $ALG$ according to probability distribution $P$, and Adversary chooses sequences $\sigma$ according to probability distribution $Q$. Then it follows from Yao's principle that the best competitive ratio $c^*$ for a randomized online algorithm satisfies $$c^*\ge \inf_{ALG}E_Q[ALG(\sigma)/OPT(\sigma)]$$That is, lowerbounds for $c^*$ can be obtained by computing the minimal expected online/optimal cost ratio of deterministic online algorithms against random request sequences. (In fact, using a more general result called von Neumann's Minimax Theorem, one can show that $c^*=\sup_{Q}\inf_{ALG}E_{Q}[ALG(\sigma)/OPT(\sigma)].$)

References

1. https://en.wikipedia.org/wiki/Ski_rental_problem
2. http://cs.brown.edu/~claire/Talks/skirental.pdf
3. http://www14.in.tum.de/personen/albers/papers/inter.pdf

Thursday, February 12, 2015

Designing algorithms with Spark aggregate – Part I

In this post, I shall give a simple framework to design algorithms in terms of Spark's aggregate function. This framework is just a proof of concept and should not be applied blindly in practice. However, it does induce efficient algorithms in some special cases, as will be shown in the examples.

In general, our goal is to compute a property of any given instance from a domain. Let $G$ denote the domain of all instances of interest, and $C$ the domain of the desired property. For example, if $G$ denotes graphs and the property to compute is the MST, then $C$ will be the set of trees. Furthermore, we assume that there exists an associative and commutative "composition" operator $\oplus : G \rightarrow G \rightarrow G$, such that an instance of interest can be decomposed into (and composed back from) sub-instances in $G$. We shall denote the desired property of instance $g$ as $f(g)$. For example, if $g$ is a graph, then $f(g)$ can be some graph property such as its diameter, an MST, or a vertex cover. Note that the mapping $f : G \rightarrow C$ can be non-deterministic if the property is not unique w.r.t. the given instance.

To use the aggregate function, we specify zero, seq and comb as follows:
$\mathrm{zero} \in (G, C)$ is a pair comprised of the zero values from $G$ and $C$. If no such zero values exist, we can set $\mathrm{zero} = (g, f(g))$ for an arbitrary sub-instance $g$ of the input.
$\mathrm{seq} : (G, C) \rightarrow G \rightarrow (G, C)$ takes a partially aggregated result $(g_1, f(g_1))$ and a sub-instance $g_2$. It outputs $(g_3, f(g_1 \oplus g_2))$ such that $f(g_3 \oplus g) = f(g_1 \oplus g_2 \oplus g)$ for all $g \in G$.
$\mathrm{comb} : (G, C) \rightarrow (G, C) \rightarrow (G, C)$ takes two partially aggregated results $(g_1, f(g_1))$ and $(g_2, f(g_2))$. It outputs $(g_3, f(g_1 \oplus g_2))$ such that $f(g_3 \oplus g) = f(g_1 \oplus g_2 \oplus g)$ for all $g \in G$.
The rest of the steps are standard: we first decompose the input instance into a sequence of sub-instances. The sequence is then converted to an RDD through parallelize. Finally, we invoke RDD.aggregate(zero)(seq,comb) to compute the desired property of the instance.
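The control flow of aggregate is easy to emulate locally; a minimal Python sketch (partitions as plain lists, ignoring Spark's scheduling and the nondeterministic order in which partial results are combined):

```python
from functools import reduce

def aggregate(zero, seq, comb, partitions):
    """Fold each partition with seq, then merge the partial results with comb."""
    return reduce(comb, (reduce(seq, part, zero) for part in partitions), zero)
```

Computing a (count, sum) pair over two partitions, for example, with seq counting and summing one element at a time and comb adding componentwise, yields (5, 15) for the partitions [1, 2] and [3, 4, 5].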

Remarks

1. While functions seq and comb look very similar in spec, they can use different methods to compute $f(g_3)$. Given the partial results at hand, seq is suited to incremental methods, while comb can use (the merge part of) divide-and-conquer methods to compute the desired property. See the convex hull example below.
2. The assumption that operator $\oplus$ is associative and commutative over $G$ can be relaxed to be so only with respect to the given property $f$. For instance, suppose that $G$ is a set of finite strings and $\oplus$ is the string concatenation operator. While $\oplus$ is not commutative per se, it is commutative w.r.t. the string length property. That is, $\mathrm{strlen}(g_1 \oplus g_2) = \mathrm{strlen}(g_2 \oplus g_1)$ for all $g_1, g_2 \in G$. The framework is thus applicable to compute the length of string. See the majority algorithm below for a more non-trivial example.
3. The framework is practical only if $|g_3|$ can be significantly less than $|g_1 \oplus g_2|$, for otherwise we will have to pass almost the whole instance from node to node. In the case that $f(g_1 \oplus g_2)$ can be computed from $f(g_1)$ and $f(g_2)$, there is no need to output $g_3$, i.e., $|g_3|=0$. This special case allows us to operate on simpler domains (e.g., the type of seq becomes $C \rightarrow G \rightarrow C$) and to obtain more efficient aggregate algorithms due to less communication overheads.

Examples

The above naive framework can yield efficient algorithms if $f(g_1 \oplus g_2)$ can be computed solely based on $f(g_1)$ and $f(g_2)$. Below we give three examples that satisfy this condition. The first is a variant of the reverse-delete algorithm that finds an MST of a graph. The second is an outline of a convex hull algorithm. The last is Boyer and Moore's algorithm that finds the majority element in a list.

Finding MST. Let $G$ be the set of all weighted undirected graphs, and $C$ be the set of trees. Given a graph $g$ from $G$, we assign a unique index to each of its edges. Let $w(e)$ and $idx(e)$ denote the weight and index of edge $e$, respectively. Given two edges $e_1$ and $e_2$ of $g$, we say $e_1 < e_2$ iff either $w(e_1) < w(e_2)$, or $w(e_1) = w(e_2)$ and $idx(e_1)$ < $idx(e_2)$. Now we define an operator msf over $G$, such that msf computes a minimum spanning forest of $g_1 \cup g_2$ from $g_1$ and $g_2$ as follows:
def msf (g1, g2) = {
  var g = g1 ∪ g2
  while (g contains a cycle c)
    remove the largest edge from c
  return g
}
Let RDD(g) denote the RDD obtained from decomposing and parallelizing graph $g$. Then RDD(g).aggregate(Nil)(msf,msf) computes a minimum spanning forest of $g$, which is an MST if $g$ is connected. Note that the result is deterministic provided that the assignment of indices is deterministic.
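A Kruskal-style union-find implementation of msf yields the same forest as the cycle-deletion pseudocode above. Here is a Python sketch (my own rendering; edges as (u, v, w) tuples, with endpoint order standing in for the unique edge index as tiebreak):

```python
from functools import reduce

def msf(g1, g2):
    """Minimum spanning forest of g1 ∪ g2 via Kruskal with union-find."""
    edges = sorted(set(g1) | set(g2), key=lambda e: (e[2], e[0], e[1]))
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    forest = []
    for u, v, w in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                          # edge joins two components
            parent[ru] = rv
            forest.append((u, v, w))
    return forest
```

Locally, `reduce(msf, chunks, [])` over the decomposed edge chunks plays the role of the aggregate call.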

Finding convex hull. Here $G$ is the collection of finite point sets in the Euclidean plane and $C$ is the set of convex polygons. The convex hull of a set of points is the smallest convex polygon containing all the points. If the points are given as a sequence, we can use aggregate to find their convex hull by computing smaller convex hulls incrementally in seq and then merging them in comb to obtain the final hull. Both seq and comb can be implemented to run in linear time (see the links for pseudocode).

Finding majority. Let $C$ be a set of comparable elements. Fix some element $\bot \in C$ and let $G = (C - \{\bot\})^*$ be a set of finite sequences. The majority problem is to determine in a given sequence from $G$ whether there is an element with more occurrences than all the others, and if so, to determine this element. Click here to see a linear-time algorithm that solves the problem using aggregate. Note that the aggregate invocation only finds a candidate for majority; we need a further check to assert that the candidate is indeed the majority element.
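For concreteness, here is a Python sketch of a seq/comb pair for Boyer-Moore majority (the aggregated value is a (candidate, net count) pair; the final verification pass is included, and all names are mine):

```python
from functools import reduce

def seq(acc, x):
    """Boyer-Moore step: keep a candidate and its net count after cancellations."""
    cand, cnt = acc
    if cnt == 0:
        return (x, 1)
    return (cand, cnt + 1) if x == cand else (cand, cnt - 1)

def comb(a, b):
    """Merge two partial (candidate, count) results by pairwise cancellation."""
    (c1, n1), (c2, n2) = a, b
    if c1 == c2:
        return (c1, n1 + n2)
    return (c1, n1 - n2) if n1 >= n2 else (c2, n2 - n1)

def majority(chunks):
    cand, _ = reduce(comb, (reduce(seq, chunk, (None, 0)) for chunk in chunks))
    total = sum(len(c) for c in chunks)
    count = sum(c.count(cand) for c in chunks)   # the verification pass
    return cand if 2 * count > total else None
```

As noted above, the aggregate phase only produces a candidate; the counting pass at the end confirms (or refutes) that it is a true majority.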

Sunday, March 9, 2014

Algorithm: Linked List Problems on Leetcode

Copy List with Random Pointer

Problem A linked list is given such that each node contains an additional random pointer which could point to any node in the list or null. Return a deep copy of the list.
Solution It is straightforward to use a hash table to record the new node for each existing node.  The new list can then be established from the table in linear time (code). The drawback of a hash table is the space overhead. Here we offer another solution that takes $O(n)$ time and $O(1)$ auxiliary space (code). The idea is as follows:
Step 1: Create a new node for each existing node and join them together, e.g., A->B->C becomes A->A'->B->B'->C->C', so that n' is a copy of n except that the random pointer of n' is not yet set.
Step 2: Copy the random links to the new nodes: for each node n with a non-null random pointer,
n'.random = n.random.next.
Step 3: Detach the new nodes from the list: for each node n,
n.next = n.next.next; n'.next = n'.next.next.
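The three steps translate directly into code. Here is a Python sketch (a minimal Node class of my own in place of the original linked-list type):

```python
class Node:
    def __init__(self, val):
        self.val = val
        self.next = None
        self.random = None

def copy_random_list(head):
    if head is None:
        return None
    # Step 1: interleave copies, A->B->C becomes A->A'->B->B'->C->C'.
    n = head
    while n:
        copy = Node(n.val)
        copy.next = n.next
        n.next = copy
        n = copy.next
    # Step 2: n'.random = n.random.next for nodes with non-null random.
    n = head
    while n:
        if n.random:
            n.next.random = n.random.next
        n = n.next.next
    # Step 3: detach the copies, restoring the original list.
    n, new_head = head, head.next
    while n:
        copy = n.next
        n.next = copy.next
        copy.next = copy.next.next if copy.next else None
        n = n.next
    return new_head
```

The interleaving in Step 1 is what makes Step 2 possible in $O(1)$ space: the copy of any node x is always reachable as x.next.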

Merge k Sorted Lists

Problem Merge $k$ sorted linked lists and return them as one sorted list.
Solution Denote the size of the $i$th list by $n_i$. Merging two lists with sizes $n_i$ and $n_j$ takes $\Theta(n_i+n_j)$ time. A naive approach to merge $k$ sorted lists would be merging them one by one. The time complexity would be $$\Theta(n_1+n_2) + \Theta(n_1+n_2+n_3) + ... + \Theta(n_1+...+n_k)=\Theta(kn_1+(k-1)n_2+...+n_k).$$
(To be continued)
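In the meantime, a standard alternative that avoids the quadratic blow-up keeps a size-$k$ min-heap of list heads, merging all $n$ elements in $O(n\log k)$ time. A Python sketch on plain lists rather than linked nodes (a simplification of mine):

```python
import heapq

def merge_k_sorted(lists):
    """Merge k sorted lists with a size-k min-heap of (value, list, index) entries."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)      # smallest remaining head
        out.append(val)
        if j + 1 < len(lists[i]):            # refill from the same list
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out
```

Each element is pushed and popped once, and each heap operation costs $O(\log k)$.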

Wednesday, February 12, 2014

Algorithm: Detecting cycles in a linked list

Detecting cycle in a linked list I

Problem Detect if there is a cycle in a singly-linked list. Do it in $O(n)$ time and $O(1)$ space.

Solution Assume without loss of generality that a list node never points to itself. To detect a cycle, we iterate the list with two pointers such that one pointer moves faster than the other. We argue that the two pointers eventually coincide (i.e. point to the same node) if the list contains a cycle.
Suppose that in each iteration, the faster pointer advances $m$ steps and the slower pointer advances one step. We check whether the two pointers coincide at the end of an iteration. Let $d(x,y)\in[0,\infty]$ denote the minimal number of moves a pointer has to make starting from node $x$ to reach node $y$. Let $A$ be the first node of the cycle visited by the slower pointer, and $B$ be the node the pointers point to when they coincide. Denote the number of nodes in the cycle by $k$.
When the two pointers coincide, the number of moves made by the faster pointer is $m$ times the number of moves made by the slower pointer. Hence, the two pointers eventually coincide iff there exist integers $p,q,v,h\ge0$ such that $h=d(head,A) \ge 0$, $v=d(A,B)\le k-1$, and $h+pk+v=m(h+qk+v)$ with $m,k\ge 2$, which holds iff there exist $p,q,v\ge 0$ satisfying \begin{equation}k(m-1) > v(m-1) = k(p-mq)-h(m-1) \ge 0,\label{coin-eg1}\end{equation}which, by setting $X=p-mq$, can be written as $$k (m-1) > kX - h (m-1) \ge 0.$$ If $k\ge h$, then all $p,q\ge 0$ such that $X=m-1$, namely $p=mq+m-1$, satisfy \eqref{coin-eg1}. If $k < h$, then there exists $n\ge 1$ such that $kn\ge h > k (n-1)$. In this case, observe that when $X=n(m-1)$, namely $p=mq+nm-n$, we have
 $kX-h(m-1) = kn(m-1) - h(m-1) \ge 0$, and
 $k(m-1) = kn(m-1) - k(n-1)(m-1) > kn(m-1) - h(m-1) = kX - h(m-1)$
It follows that all $p,q\ge 0$ such that $p=mq+mn-n$ satisfy \eqref{coin-eg1}. This concludes the proof.
Discussion. The above analysis gives two sufficient conditions for coincidence in which the only constraint on $q$ is $q\ge 0$. By setting $q=0$, we see that the cycle can be detected within $h+v<n$ iterations, which is already the minimum possible. Also, $v=kp/(m-1)-h$, where $p$ is the minimal integer that makes $v$ a nonnegative integer. We can obtain the size of the cycle with an additional $k$ iterations after finding node $B$.

Detecting cycle in a linked list II

Problem Detect if there is a cycle in a singly-linked list, and if there is one, find the node where the cycle begins. Do it in $O(n)$ time and $O(1)$ space.

Solution We assume that the list contains a cycle and use the notations defined before. Iterate the list as we did in the previous solution and set $m=2$. By the above argument, there exist $p,q\ge0$ such that $h+pk+v=2(h+qk+v)$, which implies $h+v=k(p-2q)$. Hence, if a pointer advances $h$ steps from node $B$, it will reach node $A$ after looping the cycle $p-2q$ times. To find node $A$, we can launch two fresh pointers from the head and node $B$ respectively, both moving one node forward at a time. These two pointers will both point to node $A$ after $h$ moves. See here for an implementation of the algorithm.
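Putting both parts together, here is a Python sketch of the $m=2$ algorithm, the tortoise-and-hare with the cycle-start phase (a minimal Node class of my own):

```python
class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

def find_cycle_start(head):
    """Floyd's algorithm with m = 2: detect the cycle, then locate node A."""
    slow = fast = head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:                  # coincidence at node B
            p = head                      # both pointers now advance h steps
            while p is not slow:
                p = p.next
                slow = slow.next
            return p                      # node A, where the cycle begins
    return None                           # acyclic list
```

The second phase is exactly the argument above: starting from $B$, advancing $h$ steps lands on $A$ after looping the cycle $p-2q$ times, so a pointer from the head and a pointer from $B$ meet at $A$.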

Discussion. In this solution, we stipulate the ratio of speeds $m=2$ for the two pointers. I don't know if it is possible to find the node within the required complexity given an arbitrary ratio of speeds. Also, for the ease of proof, we check coincidence of the pointers only at the end of each iteration. A conscientious reader may want to prove the correctness of an algorithm that checks coincidence after each step, and see if it is indeed more efficient than the simplified version presented here.

Further readings

1. http://en.wikipedia.org/wiki/Cycle_detection

Tuesday, October 29, 2013

Algorithm: Order Statistics

Finding dominating members

Problem Given $k\ge2$ and array $A$ with $|A|=n$, find the elements with frequency greater than $n/k$ in $O(nk)$ time. (When $k=2$, this problem reduces to finding the majority.)

Solution Given a multiset $A$ and $k>0$, we call a number $(A,k)$-dominating if its frequency in $A$ is greater than $|A|/k$. Observe that if $S\subseteq A$ consists of $k$ distinct elements, then being $(A,k)$-dominating implies being $(A\setminus S,k)$-dominating.
Now we can solve the problem by repeatedly removing sets of $k$ distinct elements from $A$ until no such set exists. To do this efficiently, we enumerate $A$ while maintaining $k-1$ counters that record the frequencies of candidate elements. For each element $x$: if $x$ already has an associated counter, we increment that counter; otherwise, if fewer than $k-1$ counters are in use, we assign a counter to $x$ and set it to 1; otherwise, all $k-1$ counters are nonzero, and we decrement each of them by 1, which amounts to removing $k$ distinct elements ($x$ together with one occurrence of each candidate) from $A$. At the end of the enumeration, every element that dominates $A$ must remain among the candidates with a nonzero counter. The converse is not true, since dominating elements may not exist in the first place. Hence, we collect the candidates with nonzero counters and return those that indeed dominate. [Example Code]
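The counting procedure is a variant of the Misra-Gries summary. A Python sketch with a dict in place of the $k-1$ counters (the final pass verifies the candidates; function name is mine):

```python
def dominating(a, k):
    """Return the elements of a with frequency greater than len(a)/k, sorted."""
    counters = {}                          # at most k-1 live candidates
    for x in a:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1                # a free counter: adopt x as a candidate
        else:                              # all k-1 counters nonzero:
            for key in list(counters):     # remove k distinct elements (x included)
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    n = len(a)
    return sorted(c for c in counters if a.count(c) * k > n)
```

The verification pass makes the result exact: surviving candidates are only potential dominators until their true frequencies are checked.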