This is ericpony's blog

Sunday, December 20, 2015

Contextual equivalence (in progress)

\( \def\[#1]{[\![\,#1\,]\!]} \def\V[#1]{\mathcal{V}\,[\![#1]\!]} \def\E[#1]{\mathcal{E}\,[\![#1]\!]} \def\G[#1]{\mathcal{G}\,[\![#1]\!]} \)
A type environment $\Delta$ is defined by $\Delta ::= \cdot\ |\ \Delta, \alpha$, where $\alpha$ denotes a type variable. A judgement of the form $\Delta\vdash \tau$ states that type $\tau$ is well-formed in $\Delta$, i.e., all free type variables in $\tau$ appear in $\Delta$. An example of a judgement of this kind is $\beta \vdash \forall \alpha.\, \alpha\rightarrow \beta$. We lift the judgement to define $\Delta \vdash \Gamma$, meaning $\forall x\in Dom(\Gamma).\,\Delta\vdash\Gamma(x)$. That is, all the types in the range of $\Gamma$ are well-formed in $\Delta$. When we write $\Delta; \Gamma \vdash e:\tau$, we mean $\Gamma \vdash e:\tau$ plus that $\tau$ is well-formed in $\Delta$. A type parameter $\alpha$ is introduced to term $e$ by type abstraction $\Lambda \alpha.\, e$, and eliminated by type application $e[\tau]$ (for $e$ of the form $\Lambda \alpha.\, e'$). The corresponding typing rules are$$\frac{\Delta, \alpha; \Gamma \vdash e: \tau}{\Delta; \Gamma \vdash \Lambda \alpha.\,e: \forall\alpha.\tau}\quad\mbox{and}\quad\frac{\Delta; \Gamma \vdash e: \forall\alpha.\tau\quad\Delta \vdash \tau'}{\Delta; \Gamma \vdash e[\tau']: \tau[\tau'/\alpha]}.$$In this lecture, we shall consider an extension of STLC with type abstraction and type application. This extension is in fact a terminating fragment of System F.
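For instance, the polymorphic identity function is typed by these rules as follows, and applying it at any well-formed type $\tau$ yields the instance type $(\alpha\rightarrow\alpha)[\tau/\alpha] = \tau\rightarrow\tau$:$$\frac{\dfrac{\alpha;\, x:\alpha \vdash x: \alpha}{\alpha;\, \cdot \vdash \lambda x:\alpha.\,x: \alpha\rightarrow\alpha}}{\cdot;\, \cdot \vdash \Lambda\alpha.\,\lambda x:\alpha.\,x: \forall\alpha.\,\alpha\rightarrow\alpha}\quad\mbox{and}\quad\frac{\cdot;\, \cdot \vdash \Lambda\alpha.\,\lambda x:\alpha.\,x: \forall\alpha.\,\alpha\rightarrow\alpha\quad\cdot\vdash\tau}{\cdot;\, \cdot \vdash (\Lambda\alpha.\,\lambda x:\alpha.\,x)[\tau]: \tau\rightarrow\tau}.$$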
A context can be understood as a complete program with a single hole that accepts an expression of a certain type. Context typing is in general expressed as $C: (\Gamma\vdash \tau) \leadsto (\Gamma'\vdash \tau')$. That is, given a context $C$, we can embed in it any expression $e$ such that $\Gamma\vdash e:\tau$ implies $\Gamma'\vdash C[e]:\tau'$.
Two expressions $e_1$ and $e_2$ are contextually equivalent in $\Gamma$ (written as $\Gamma \vdash e_1 \approx_{ctx} e_2: \tau$) iff $\Gamma \vdash e_1:\tau$, $\Gamma \vdash e_2:\tau$, and for all $C:(\Gamma \vdash \tau) \leadsto (\cdot \vdash bool)$, $\exists v.$ $C[e_1] \Downarrow v \wedge C[e_2] \Downarrow v$.
Note that here we only consider contextual equivalence for terminating programs. We shall extend the definition to general programs a bit later in this lecture. Also, we assume that the context yields a closed Boolean program after $e$ is embedded. This assumption is made without loss of generality: for any two programs whose type is inhabited by at least two values, there exists a context that differentiates their return values through a Boolean result if and only if the two programs are not contextually equivalent. (Any two programs of a type with only one inhabitant are contextually equivalent by definition.)
The main difficulty of reasoning about contextual equivalence lies in the fact that equivalence must be shown for all contexts $C$. Proving equivalence by inducting on the structure of $C$ is problematic for $C = \lambda x:\tau.\, C'$ and $C = \Lambda \alpha.\,C'$. Hence, instead of proving contextual equivalence directly, people usually use proof methods such as logical relations or bisimulations. We shall prove the following two interesting theorems in this lecture to demonstrate the technique of logical relations:
Theorem 1. If $\vdash e: \forall \alpha.\, \alpha\rightarrow\alpha$ and $\vdash \tau$ and $\vdash v:\tau$, then $e[\tau]\ v \hookrightarrow^* v$.
Theorem 2. If $\vdash e: \forall \alpha.\, \alpha\rightarrow bool$, $\vdash \tau_1 $, $\vdash \tau_2$, $\vdash v_1:\tau_1$, and $\vdash v_2:\tau_2$, then $e[\tau_1]\ v_1$ and $e[\tau_2]\ v_2$ evaluate to the same Boolean value.
The idea is to define logical relations such that contextually equivalent terms can be related. Let's first define value interpretations for some closed types:
$\V[bool]=\{\ (true,true),\ (false,false)\ \}$
The slogan of LRs for function types is "You give me related inputs; I give you related outputs." Do not confuse this with Theorem 2, where the two inputs are not required to be related.
$\V[\tau\rightarrow\tau']=\{\ (\lambda x:\tau.\, e_1,\lambda x:\tau.\, e_2)\ |\ \forall (v_1, v_2)\in \V[\tau].\, (e_1[v_1/x], e_2[v_2/x]) \in \E[\tau']\ \}$
The interpretation for polymorphic types is new.
$\V[\forall \alpha.\, \tau]=\{\ (\Lambda \alpha.\,e_1, \Lambda \alpha.\,e_2)\ |\ \forall \tau_1, \tau_2.\, (e_1[\tau_1/\alpha], e_2[\tau_2/\alpha]) \in \E[\tau](\rho[\alpha\mapsto(\tau_1, \tau_2)])\ \}$
Note that $e_1[\tau_1/\alpha]$ and $e_2[\tau_2/\alpha]$ in general have different types. Hence we relate them with a LR for the parametric type $\tau$, and use $\rho$ (called a relational substitution) to remember the types $\tau_1,\ \tau_2$ picked for type parameter $\alpha$. In general, a relational substitution is of the form $\{\alpha_1\mapsto(\tau_{11},\tau_{12}), \alpha_2\mapsto(\tau_{21},\tau_{22}), ... \}$. We shall write $\V[\tau](\rho)$ to express the fact that $\V[\tau]$ is parametrized by substitution $\rho$, assuming that $dom(\rho)\supseteq FTV(\tau)$.
$\V[\alpha](\rho)=\{\ (v_1, v_2)\ |\ \rho(\alpha)=(\tau_1, \tau_2)\ \wedge\ \vdash v_1:\tau_1\ \wedge\ \vdash v_2:\tau_2\ \}$
(to be continued)

Thursday, December 10, 2015

Logical relations as a proof technique — Part II

\( \def\[#1]{[\![\,#1\,]\!]} \def\V[#1]{\mathcal{V}\,[\![#1]\!]} \def\E[#1]{\mathcal{E}\,[\![#1]\!]} \def\G[#1]{\mathcal{G}\,[\![#1]\!]} \)
This is Part II of my lecture notes based on the lecture series "Logical Relations" given by Amal Ahmed at the Oregon Programming Languages Summer School, 2015. See here for Part I of the notes.

Type safety. The slogan for type safety is "if a term is well-typed, it won't get stuck." Formally, this means that if $\vdash e:\tau$ and $e\hookrightarrow^* e'$, then either $e'$ is a value, or there exists $e''$ such that $e'\hookrightarrow e''$. Note that type safety doesn't require termination, and is thus weaker than the strong normalization property we proved earlier. On the other hand, type safety can be proved for Turing-complete languages such as STLC with recursive types (whereas strong normalization cannot).
A well-known approach to prove type safety is via proving progress and preservation (note that these are proof methods for type safety, not the definition of it):
Progress lemma. If $\vdash e:\tau$, then either $e$ is a value, or $\exists e'.\, e\hookrightarrow e'.$
Preservation lemma. If $\vdash e:\tau$ and $e\hookrightarrow e'$, then $\vdash e':\tau$.
Once these lemmas are established, type safety can be proved easily by induction. In this lecture, however, we shall use logical relations instead of the above lemmas to prove type safety.

We need a bunch of definitions before we proceed. Define that $val(e)$ holds iff $e$ is a value; $red(e)$ holds iff $\exists e'.\, e\hookrightarrow e'$; $safe(e)$ holds iff $\forall e'.\, e \hookrightarrow^* e' \implies red(e') \vee val(e')$. Now the definition of type safety can be stated as $\vdash e:\tau \implies safe(e)$. Recall that to prove strong normalization, we introduced a logical predicate $SN_\tau$ and used it to break statement $\vdash e:\tau \implies e\Downarrow$ into $\vdash e:\tau \implies SN_\tau(e)$ and $SN_\tau(e) \implies e\Downarrow$. Here we shall adopt a similar strategy. To begin with, we define the value interpretations of type-safe values and closed expressions as
$\V[bool] = \{\ false,\, true\ \}$
$\V[\tau\rightarrow\tau']=\{\ \lambda x:\tau.\, e\ |\ \forall v \in \V[\tau].\, e[v/x] \in \E[\tau']\ \}$
$\E[\tau]=\{\ e\ |\ \forall e'.\, e \hookrightarrow^* e' \implies red(e') \vee e' \in \V[\tau]\ \}$
Now, to prove $\vdash e:\tau \implies safe(e)$, it suffices to prove that $\vdash e:\tau \implies e \in \E[\tau]$ and $e \in \E[\tau] \implies safe(e)$. As before, the main difficulty lies in proving the first part, since the second follows immediately from the definition of $\E[\tau]$. Also, while we are proving type safety for closed terms, we need to use a hypothesis about open terms so as to apply induction to lambda expressions. For this purpose, we define a value interpretation for environments as follows:
$\G[\emptyset] = \{\emptyset\}$
$\G[\,\Gamma, x:\tau\,] = \{\ \gamma[x\mapsto v]\ |\ \gamma \in \G[\Gamma] \wedge v \in \V[\tau]\ \}$
Hence, a substitution $\gamma$ is in $\G[\Gamma]$ iff it maps variables to values that are type-safe and consistent with $\Gamma$. An expression is called semantically type-safe in $\Gamma$, written as $\Gamma \models e:\tau$, if and only if $\gamma(e) \in \E[\tau]$ for all substitutions $\gamma \in \G[\Gamma]$. That is, $\Gamma \models e:\tau$ iff $e$ always reduces to a type-safe closed term after its free variables are replaced by type-safe values consistent with $\Gamma$.
As mentioned above, to prove type-safety it suffices to show A) $\vdash e:\tau \implies \models e:\tau$, and B) $\models e:\tau \implies safe(e)$. Part A says that the syntactic notion of type safety (i.e., well-typedness) implies a semantic notion of type safety which says that a term will not get stuck in execution. It is in fact a special case of the so-called "Fundamental Property" of logical relations: $\ \Gamma \vdash e:\tau \implies \Gamma \models e:\tau$. Part B can be shown directly as follows. Recall that $safe(e)$ holds iff $\forall e'.\, e \hookrightarrow^* e' \implies red(e') \vee val(e')$. Suppose that $\models e:\tau$. Then $e\in \E[\tau]$ since $\emptyset(e)=e$. Now, for any $e'$ with $e \hookrightarrow^* e'$, the definition of $\E[\tau]$ gives $red(e') \vee e' \in \V[\tau]$, that is, $e'$ is reducible or a value. This concludes the proof.
You can see here and in the example of strong normalization that the property of interest follows almost immediately from the definition of the logical relations we use. This is often the case if you construct suitable logical relations to assist your proof.
Now we are going to prove the Fundamental Property, i.e., $\Gamma \vdash e:\tau \implies \Gamma \models e:\tau$. As we did in the proof of strong normalization, we perform induction on the type derivations case by case.


(to be continued)

Saturday, December 5, 2015

Logical relations as a proof technique — Part I

This is Part I of my lecture notes based on the lecture series "Logical Relations" given by Amal Ahmed at the Oregon Programming Languages Summer School, 2015. See here for Part II.

Overview. These lectures will introduce the proof method of logical relations. Logical relations have been used to prove a wide range of properties, including normalization, type safety, equivalence of programs, non-interference (in security typed languages), and compiler correctness. We will start by looking at how to prove normalization and type safety for the simply-typed lambda calculus (STLC). We will then add recursive types to STLC and develop a step-indexed logical relation, using step-indexing to ensure that the logical relation remains well-founded. We will cover polymorphic and existential types and develop a logical relation to reason about the associated notions of parametricity and representation independence. The lectures will also discuss how to establish that a logical relation corresponds to contextual equivalence.

Free theorems. In parametric polymorphism, types cannot be checked or inspected by programs. Hence, a program with a polymorphic type cannot change its behavior according to its type parameters. It follows that all functions of type $\forall a.\, a \rightarrow a$ are identity functions, since nothing but the argument can be returned by the function. Similarly, all functions of type $\forall a.\, a \rightarrow Int$ are constant functions. In contrast, there is no function of type $\forall a.\, Int \rightarrow a$, since the return value can be obtained from nowhere. Results of this kind are called free theorems, and many of them can be proved using logical relations.
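A rough illustration of these free theorems in Scala (ignoring non-termination, exceptions and runtime casts, which Scala permits but parametric polymorphism forbids):
def id[A](x: A): A = x        // the only total function of this type: identity
def f[A](x: A): Int = 42      // any f: ∀a. a → Int must ignore its argument
// def g[A](x: Int): A = ???  // ∀a. Int → a: no total implementation exists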

Logical predicates. Unary logical relations are also called logical predicates. They are used to assert properties of a single program, such as termination. In contrast, binary logical relations are used to assert relations between two programs, such as equivalence. Note that we only put closed programs in a logical relation. By and large, to prove that a well-typed and closed term $e$ has some special property, we shall set up a predicate $P_\tau$ such that $P_\tau(e)$ holds if and only if all of the following conditions hold: i) $\vdash e: \tau$, ii) $e$ has the property of interest, and iii) the property of interest is preserved by elimination forms (or introduction forms) of type $\tau$.

Strong normalization. Historically, logical relations first appeared as a technique to prove strong normalization, i.e., that every reduction sequence of a well-typed term terminates. Since evaluation in STLC is deterministic, proving strong normalization for STLC amounts to showing that every well-typed program terminates, that is, $\vdash e:\tau \implies e \Downarrow$. Observe that this statement cannot be proved by applying structural induction on typing derivations directly. The problem occurs at the application rule: we cannot conclude $e_1\ e_2\Downarrow$ from $e_1 \Downarrow$ and $e_2 \Downarrow$. Nevertheless, this problem can be resolved by using a stronger induction hypothesis, which leads to the following definition of logical predicates.
We shall set up a predicate $SN_\tau(\cdot)$ such that $SN_\tau(e)$ holds only if term $e$ has type $\tau$ and is strongly normalizing. We consider two kinds of types for STLC: the boolean type and function types. For boolean, it suffices to define that $SN_{bool}(e)$ holds iff $\vdash e:bool \wedge e \Downarrow$, since boolean doesn't have elimination forms. For function types, we define that $SN_{\tau\rightarrow\tau'}(e')$ holds iff
$\vdash e':\tau\rightarrow\tau'\quad\wedge\quad e' \Downarrow\quad \wedge\quad (\forall e.\, SN_{\tau}(e) \implies SN_{\tau'}(e'\ e))$.
(1)
We shall see that this definition provides the strengthened hypothesis we need to deal with the application rule. The proof of $\vdash e:\tau \implies e \Downarrow$ now breaks into two parts: A) $\vdash e:\tau \implies SN_{\tau}(e)$, and B) $SN_{\tau}(e) \implies e \Downarrow$. Part B follows immediately from the definition of $SN_\tau(e)$. It remains to prove part A, and we rely on structural induction to do this. But again, we cannot use the statement $\vdash e:\tau \implies SN_{\tau}(e)$ we are going to prove as an induction hypothesis: it only speaks of closed terms, but to show that it holds for terms of function type, say, for $(\lambda x:\tau.\, e'): \tau \rightarrow \tau'$, we need the hypothesis to hold for the open term $e'$ so as to perform induction.
Instead, we shall use the following stronger claim as an induction hypothesis. Given a mapping $\gamma$ from variables to values, we write $\gamma \models \Gamma$ when $dom(\gamma)=dom(\Gamma)$ and $\forall x\in dom(\Gamma).\, SN_{\Gamma(x)}(\gamma(x))$ holds. That is, $\gamma$ maps variables in $dom(\Gamma)$ to values that strongly normalize with types compatible with $\Gamma$. We claim that for any term $e$ and substitution $\gamma$,
$\Gamma \vdash e:\tau \wedge \gamma \models \Gamma  \implies SN_{\tau}(\gamma(e))$.
(2)
It is clear that part A follows from this claim by setting $\Gamma$ to be empty. Now we shall prove (2) by inducting on the derivation rules case by case.
Case $\quad\rhd\ \ \vdash true: bool\quad$ and $\quad\rhd\ \ \vdash false: bool$.
 Note that $\gamma(true)=true$ and $\gamma(false)=false$, which are values and of course strongly normalizing.
Case $\quad\Gamma(x)=\tau\ \ \rhd\ \ \Gamma \vdash x:\tau$.
 By premises $\Gamma(x)=\tau$ and $\gamma\models\Gamma$, $\gamma(x)$ has type $\tau$. Thus $SN_{\tau}(\gamma(x))$ holds by $\gamma\models\Gamma$.
Case $\quad\Gamma\vdash e': \tau\rightarrow\tau'$, $\Gamma\vdash e: \tau\ \ \rhd\ \ \Gamma\vdash e'\ e: \tau'$.
 Note that $\gamma(e'\ e)=\gamma(e')\ \gamma(e)$. By premise $\gamma\models\Gamma$ and induction, both $SN_{\tau\rightarrow\tau'}(\gamma(e'))$ and $SN_{\tau}(\gamma(e))$ hold. Thus $SN_{\tau'}(\gamma(e')\ \gamma(e))$, which equals $SN_{\tau'}(\gamma(e'\ e))$, holds by (1).
Case $\quad\Gamma, x:\tau \vdash e':\tau'\ \ \rhd\ \ \Gamma\vdash\lambda x:\tau.\, e': \tau\rightarrow\tau'$.
  This is the case we got stuck with before introducing logical predicates, so it is not surprising that the proof for this case is far more complicated than others. Before we proceed, we need the following two lemmas:
Lemma 1. (substitution lemma) If $\Gamma \vdash e:\tau$ and $\gamma \models \Gamma$, then $\vdash \gamma(e):\tau.$
Lemma 2. (SN is preserved by forward/backward reductions) Suppose that $\vdash e:\tau$ and $e \hookrightarrow e'$. Then $SN_{\tau}(e)$ holds if and only if $SN_{\tau}(e')$ holds.
To prove (2), i.e., that $\Gamma \vdash \lambda x:\tau.\, e' : \tau\rightarrow\tau'$ and $\gamma\models\Gamma$ imply $SN_{\tau\rightarrow\tau'}(\gamma(\lambda x:\tau.\, e'))$, it suffices to prove the following three statements:
1) $\gamma(\lambda x:\tau.\, e')$ has type $\tau\rightarrow\tau'$. This follows by premise $\gamma\models \Gamma$ and Lemma 1;
2) $\gamma(\lambda x:\tau.\, e')\Downarrow$. This is obvious, since $\gamma(\lambda x:\tau.\, e')=\lambda x:\tau.\,\gamma(e')$ is a lambda value;
3) $\forall e.\,SN_\tau(e)\implies SN_{\tau'}(\gamma(\lambda x:\tau.\, e')\ e)$. By induction, $\Gamma, x:\tau \vdash e':\tau'$ and $\gamma'\models \Gamma, x:\tau$ imply $SN_{\tau'}(\gamma'(e'))$. Suppose that $SN_\tau(e)$ holds for some $e$. Then $e\Downarrow v$ for some value $v$, and $SN_{\tau}(v)$ holds by Lemma 2. Given premises $\Gamma \vdash \lambda x:\tau.\, e' : \tau\rightarrow\tau'$ and $\gamma\models\Gamma$, define $\gamma'=\gamma[x \mapsto v]$. Note that $\gamma'\models\Gamma,x:\tau$. Thus $SN_{\tau'}(\gamma'(e'))$ holds by induction. Now observe that
$\gamma(\lambda x:\tau.\,e')\ e \hookrightarrow^* \gamma(\lambda x:\tau.\,e')\ v = (\lambda x:\tau.\,\gamma(e'))\ v \hookrightarrow \gamma(e')[v/x] = \gamma'(e')$.
It then follows from $SN_{\tau'}(\gamma'(e'))$ and Lemma 2 (applied backwards along this reduction sequence) that $SN_{\tau'}(\gamma(\lambda x:\tau.\,e')\ e)$ holds. This concludes our proof of strong normalization for STLC.

(End of lecture note Part I. See here for Part II.)

Monday, November 2, 2015

A note on the Bellman–Ford algorithm

The Bellman–Ford shortest path algorithm is an elegant solution to the single-source shortest path problem, stated as follows: given a weighted digraph $G=(V,E)$, let $w(e)\in \mathbb R$ denote the weight of edge $e$. Fix $s \in V$ to be the source vertex, and let $dist(v)$ be the distance from $s$ to $v$. That is,
$dist(v) = \min\{\ \sum_{e\in P} w(e)$ : $P \subseteq E$ is a path from $s$ to $v\ \}.$
The goal of the problem then is to compute $dist(v)$ for each $v \in V$.
The Bellman–Ford algorithm maintains a map $d: V \rightarrow \mathbb R$ to store the tentative distance values. We use $d_k(v)$ to denote the value of $d(v)$ computed in the $k$th iteration. Let $d_0(v)=\infty$ for $v \neq s$ and $d_0(s)=0$. In the $n$th iteration, $d_n(v)$ is computed based on $d_{n-1}$ as follows:
$d_n(v)=\min\{\ d_{n-1}(v),\ \min\{\ d_{n-1}(u) + w(u,v) : (u, v)\in E\ \}\ \}$.
Here are some elementary facts about the Bellman–Ford algorithm:
  • The original version of the algorithm runs for a fixed number of $|V|$ iterations. In each iteration, the distance values are updated by looping through the edges. More precisely, for each $(u, v) \in E$, we relax it by setting $d_n(v)=\min\{d_{n-1}(v), d_{n-1}(u) + w(u, v)\}$. The time complexity of the algorithm is thus $O(|V||E|)$.
  • Loop invariant: The following constraint holds for $n\ge 0$ and each $v\in V$:
    $dist(v) \le d_{n+1}(v) \le d_n(v).$
  • It turns out that $d_n(v)$ is the length of the shortest path from $s$ to $v$ with at most $n$ edges. Since a cycle-free path contains at most $|V|-1$ edges, $d_{|V|-1}(v) = dist(v)$ if a negative cycle cannot be reached from the source vertex $s$.
  • In fact, $G$ doesn't contain a reachable negative cycle if and only if $d_n$ converges before $n=|V|$. This fact leads to an $O(|V||E|)$ algorithm for negative cycle detection.
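For concreteness, here is a minimal Scala sketch of the algorithm described above, using the in-place variant where $d_n$ and $d_{n-1}$ share one array (the edge-list representation is an assumption made for illustration):
def bellmanFord(n: Int, edges: Seq[(Int, Int, Double)], s: Int): Option[Array[Double]] = {
  val d = Array.fill(n)(Double.PositiveInfinity)  // d(v) = infinity for v != s
  d(s) = 0.0
  for (_ <- 1 until n)                            // |V| - 1 iterations
    for ((u, v, w) <- edges)
      if (d(u) + w < d(v)) d(v) = d(u) + w        // relax edge (u, v)
  // One more pass: any further improvement witnesses a reachable negative cycle.
  if (edges.exists { case (u, v, w) => d(u) + w < d(v) }) None else Some(d)
}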

Optimizations

1. One may immediately observe that if $d_n(v)=d_{n-1}(v)$, then we don't have to relax the edges $(v, u)$ in the $(n+1)$st iteration, leading to a constant-factor saving in time. Also, it is possible that $d_n$ stabilizes at some $n$ less than $|V|-1$, allowing us to terminate the main loop early. However, there are graphs where $|V|-1$ iterations are needed in the worst case.
2. In the above definition of the algorithm, we update $d_n$ according to the values of $d_{n-1}$. Hence, the values of $d_n$ don't depend on the order in which edges are relaxed. In practice, $d_n$ and $d_{n-1}$ can refer to the same map, in which case the order of relaxation impacts the efficiency of the algorithm. For example, when there is no negative cycle reachable from the source, the number of iterations can be reduced to $\lceil |V|/2 \rceil$ by assigning the vertices an arbitrary linear order and relaxing the edges in topological order within the two acyclic subgraphs induced by that order (Yen 1970). Furthermore, if the arbitrary linear order of the vertices is replaced by a random permutation, then the expected number of iterations needed in the main loop is at most $|V|/3$ (Bannister & Eppstein 2012).
Unfortunately, optimizations based on the order of relaxation cannot be applied as easily in distributed settings.

Distributed adaptation

The Bellman–Ford algorithm is particularly suitable (compared to Dijkstra’s algorithm) for fully distributed platforms. First, the algorithm allows the distance value of each vertex to be updated asynchronously, so we don't need shared memory. Also, the correctness of the algorithm doesn't rely on the order in which the edges are relaxed, so we don't need a master node to coordinate other nodes. The following pseudo-code describes a simple adaptation of the Bellman–Ford algorithm to a message-passing system. Note that each $d(v)$ is stored in the local memory at vertex $v$ and cannot be accessed by other vertices.
Initialization:
    Set $d(v)=\infty$ at each vertex $v\neq s$ and set $d(s)=0$ at the source vertex $s$.
    Set $cnt(v)=0$ at each vertex $v$.
    Send $\mathtt{update}(w(s, v))$ to vertex $v$ for each edge $(s, v)$.

When vertex $v$ receives message $\mathtt{update}(t)$:
    If $t<d(v)$ then
        Set $cnt(v) = cnt(v) + 1$
        If $cnt(v) \ge |V|$ then
            Report that $v$ is on a negative cycle.
        else
            Set $d(v)=t$ and send $\mathtt{update}(t+w(v, u))$ to $u$ for each edge $(v, u)$.

Note that this algorithm is fully distributed---it doesn't need any synchronization mechanism, and initialization can be done by broadcasting a special reset message. On the other hand, since the number of messages a vertex will receive is indefinite, a vertex may never know whether it has already computed its exact shortest distance.

Applications

Optimal tokenization. Given a sequence of characters and a map from words to scores, insert delimiters into the sequence so that the resulting set of words has the maximum score. Hint: Model the positions of the sequence as vertices. A substring forms a weighted edge if it is a word with a nonzero score. Consider a shortest path from the first vertex to the last one. (A sketch follows.)
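Here is a minimal Scala sketch of this reduction (with hypothetical names; scores are assumed nonnegative, and substrings that are not words contribute nothing). Since the position graph is acyclic, relaxing the positions from left to right, i.e., one Bellman–Ford pass in topological order, already yields the optimum; maximizing the score is the same as finding a shortest path after negating the edge weights.
def bestTokenization(s: String, score: Map[String, Double]): Double = {
  val n = s.length
  val best = Array.fill(n + 1)(Double.NegativeInfinity)
  best(0) = 0.0                                   // the empty prefix has score 0
  for (j <- 1 to n; i <- 0 until j) {             // relax "edge" (i, j)
    val w = score.getOrElse(s.substring(i, j), 0.0)
    if (best(i) + w > best(j)) best(j) = best(i) + w
  }
  best(n)                                         // maximum total score
}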
(More to come)

References

1. Bellman–Ford algorithm on Wikipedia.
2. Sedgewick's web exercises for Algorithms, 4th ed.

Thursday, August 20, 2015

Proving theorems with Leon

Here are some high-level guidelines for proving theorems with Leon:
  1. To prove $A\implies B$, you usually need to find some intermediate statements $A_1,...,A_n$ and prove $A\implies A_1$, $A_1\implies A_2$, ..., $A_n\implies B$. These intermediate results are called lemmas, and $A\implies A_1 \implies ... \implies A_n \implies B$ is called an implication chain.
  2. You will often find yourself stuck at some step in the implication chain, say at $A_k\implies A_{k+1}$. To get through, you can try to find another route either from $A$ to $B$ or from $A_k$ to $A_{k+1}$. For example, if $A_k$ looks too weak for Leon to infer $A_{k+1}$, you can try to strengthen the hypothesis. At a high level, it may be possible to find a sequence of statements $H_m, ..., H_{k-1}$, such that beginning from some $m$th step you can prove $A_m\implies A_{m+1}\wedge H_m$ and then $A_i\wedge H_{i-1} \implies A_{i+1}\wedge H_i$ for $i=m+1,...,k-1$ in turn, eventually reaching the desired result $A_k\wedge H_{k-1} \implies A_{k+1}$.
    (In Leon, a lemma is just a boolean function; see the sketch after this list. Hence, to strengthen $A_i \implies A_{i+1}$ to $A_i\wedge H_{i-1} \implies A_{i+1}\wedge H_i$, you add $H_{i-1}$ to the pre-condition and $H_i$ to the post-condition of the boolean function that asserts $A_i \implies A_{i+1}$.)
  3. In general, it is a good idea to make the assumptions of the proposition to verify as strong as possible, and the requirements of auxiliary lemmas as weak as possible. If the proposition assumes properties of a function (e.g., that it always returns nonzero values), putting those assumptions in the post-condition of that function can improve the odds of Leon successfully verifying your proposition.
  4. I observe that Leon performs far better at proving a property as a post-condition than as a standalone proposition. In many cases, a property that takes complicated proof hints and dozens of auxiliary lemmas to verify as a proposition can be verified directly as a post-condition. Unfortunately, I am currently not sure what makes the difference.
  5. Isolating a statement in a proof as a standalone lemma enables Leon to verify that statement via a separate induction (by adding @induct to the new lemma), possibly with a different base case than the one, if any, used in the main proof.
  6. Use @library and @ignore annotations to help Leon concentrate on the part of interest in your code. Also, don't forget to use the (currently undocumented) --watch command-line option to turn on continuous verification.
  7. Avoid using higher-order functions. For example, try to define list operations from scratch without helper functions such as map, filter, and foreach. The resulting code, while more tedious, is far easier for Leon to verify as a whole.
  8. If you have provided logically sufficient lemmas for a proposition, and Leon can prove all the lemmas but not the proposition, then check whether the @induct annotation is used. If it is, check whether you set a pre-condition for the proposition (using require). If you did, it is possible that the pre-condition is not satisfied in the base case, in which case Leon cannot finish the induction.
  9. There seems to be some kind of non-determinism in Leon. In most cases, a property is either verified in milliseconds or cannot be verified within an hour. Hence, for efficiency, one usually sets the timeout to a small value, say 3 seconds. However, I notice that the time Leon takes to verify a lemma may vary drastically under conditions yet unclear to me. For one of my lemmas, the verification time ranged from 2 to 20 seconds over a sequence of 20 executions. Be careful not to waste time revising a proof that (suddenly) needs more time to be verified.
  10. Be careful about cyclic arguments. Leon cannot detect cyclic arguments and is likely to regard them as valid proofs. The simplest example of this fallacy: provide two lemmas A and B, claim that lemma A holds because of lemma B, and that lemma B holds because of lemma A. Leon will then report that both A and B hold.
  11. More to come.
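To make points 1–3 and 5 concrete, here is a sketch of how a lemma typically looks in Leon (a hypothetical example: the claim goes in the function body, hypotheses would go into require, and @induct asks Leon to prove the claim by induction on the first argument):
import leon.lang._
import leon.annotation._
import leon.collection._

object AppendLemmas {
  // Lemma: list concatenation is associative.
  @induct
  def appendAssoc(l1: List[BigInt], l2: List[BigInt], l3: List[BigInt]): Boolean = {
    ((l1 ++ l2) ++ l3) == (l1 ++ (l2 ++ l3))
  }.holds
}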

Thursday, July 16, 2015

Correctness specification for Pregel programs

In this post, we show how to prove correctness of programs written in the Pregel abstraction by specifying loop invariants and variants for the supersteps. This post is only a sketch and will be elaborated incrementally when I have time.

In the following examples, we use $G=(V,E)$ to denote the base graph. Without loss of generality, we maintain a global counter to record the number of supersteps taken so far. Let $C_v(n)$ denote the attribute of vertex $v$ in the $n$-th Pregel iteration. A state of a Pregel program is a tuple $(k, G, \{C_v(k): v \in V\})$, where $k$ denotes the value of the global counter. We shall use $c(s)$ to denote the counter value in state $s$. An invariant for a Pregel program is a sequence of assertions $\{f_n\}_{n\ge0}$ over program states, such that each $f_k$ holds after the $k$-th superstep (i.e., $c(s)=k$ implies that $f_k(s)$ is true). Moreover, $f_{k+1}$ should hold given that $f_{k}$ holds and that there are active vertices after the $k$-th superstep. A variant $V$ for a Pregel program is a mapping from program states to a poset $(P,\succ)$, such that i) $P$ doesn't contain an infinite descending chain, and ii) $s_1\leadsto s_2 \wedge c(s_2)>c(s_1)$ implies $V(s_1)\succ V(s_2)$. Here we simply set this poset to $(\mathbb N\cup\{0\}, >)$.

Our first example is the connected components algorithm from the Spark library. This algorithm, after it terminates, assigns to each vertex the smallest vertex id among the vertices in the same connected component. The vertex attribute $C_v(n)$ records the lowest vertex id that vertex $v$ has found up to the $n$-th superstep. Let $C_v(0)=id(v)$ for all $v\in V$. It can be shown that
Invariant: $f_n$ $\equiv$ "$C_v(n) = \min\{C_u(n-1) : u\in N(v)\}$"
Variant: $V \equiv \sum_{v\in V} C_v(n)$.

The second example is the shortest landmark distances algorithm from the Spark library. Given a set of landmarks $L \subset V$, the algorithm computes for each vertex $v$ a mapping $C_v(n): L \rightarrow Int$, such that $C_v(n)(u)$ records the shortest distance from $v$ to $u$ that vertex $v$ has found up to the $n$-th superstep. For all $v\in V$ and $u\in L$, let $C_v(0)(u) = 0$ if $v=u$, and $C_v(0)(u) = \infty$ otherwise. It turns out that
Invariant: $f_n$ $\equiv$ "For each $u \in L$, $C_v(n)(u)$ is the distance of the shortest paths from $v$ to $u$ with at most $n$ edges."
Variant: $V \equiv \sum_{v\in V}\sum_{u\in L} C_v(n)(u)$.

Sometimes, loop invariants and variants cannot be found so intuitively, even for simple Pregel programs. Consider the following vertex coloring algorithm that colors the vertices using at most $\Delta+1$ colors, where $\Delta$ denotes the maximum degree of the base graph. Also, let the set of available colors be cyclically ordered with cardinality $\Delta+1$.
  Initialize all vertices to the same color.
  while there exists $(v, u) \in E$ such that $v$ and $u$ share the same color do
      Let $w$ denote the vertex from $\{v, u\}$ with the larger vertex id.
      Set the color of $w$ to the minimal color larger than its current color.
  done
Obviously, there exists a $(\Delta+1)$-coloring of the graph, and the algorithm terminates only if it has found a proper coloring. Now the question is: is it possible that this algorithm doesn't terminate, say, under some order in which the vertices are recolored? A variant cannot be found as easily as in the previous examples: the relation $\succ$ is irreflexive, but the number of colliding vertices doesn't decrease strictly as the algorithm iterates. However, after some reflection, we can still specify a variant as follows:

Variant: $V \equiv |V|$ minus the maximal number of vertices whose coloring extends to a proper $(\Delta+1)$-coloring of the graph.

More precisely, after the $n$-th superstep finishes, there exist at least $n$ vertices in the graph such that the coloring of these $n$ vertices is a subset of a proper $(\Delta+1)$-coloring of the graph. Observe that the existence of $V$ here happens to assure both the termination and the correctness of this vertex coloring algorithm.

Monday, June 15, 2015

Writing decentralized algorithms with Pregel: A case study

In this post, we demonstrate how to implement a purely functional decentralized algorithm using the Pregel abstraction offered by the Spark GraphX library.

A vertex coloring of a graph is a way to color the vertices such that no two adjacent vertices share the same color. Formally, given an undirected graph $G=(V,E)$, a k-coloring of $G$ is a map $f: V \rightarrow \{1,...,k\}$ such that $f(v) \neq f(u)$ for any $(v,u) \in E$. A graph is called k-colorable if it has a $k$-coloring.

It takes linear time to find a $(\Delta+1)$-coloring for any graph with maximum degree $\Delta$. On the other hand, finding a 4-coloring for a 3-colorable graph is already NP-hard. There are many randomized algorithms in the literature that color a graph in exponential time and linear space. Among them, the Communication-Free Learning (CFL) algorithm [1] is a fully decentralized algorithm inspired by the problem of allocating non-interfering channels in IEEE 802.11 wireless networks.

The CFL algorithm works as follows. For each vertex $v \in V$, let $C_n(v) \in \{1,...,k\}$ denote the color of $v$ in the $n$-th iteration. Also, let $P_n(v,c)$ denote the probability that vertex $v$ has color $c$ in the $n$-th iteration. That is, $\Pr\{C_n(v) = c\} = P_n(v,c)$. At $n = 0$, the color of each vertex $v$ is chosen uniformly at random, i.e., $P_0(v, c) = 1/k$ for all $v \in V$ and $c = 1,...,k$. For $n \ge 1$, the color of $v$ is chosen randomly according to the following rule: If no vertex adjacent to $v$ shares the same color with $v$, then we let $P_n(v, c) = 1$ for $c = C_{n-1}(v)$ and $P_n(v, c) = 0$ otherwise. Note that this implies $C_n(v) = C_{n-1}(v)$. If there is a color collision between $v$ and its neighbors, then we choose $C_n(v)$ according to the probability distribution $P_n(v, \cdot)$, defined for $c = 1,...,k$ by
$$ P_{n}(v,c)=\begin{cases} (1-\beta)\cdot P_{n-1}(v,c), & c=C_{n-1}(v)\\ (1-\beta)\cdot P_{n-1}(v,c)+\beta\cdot(k-1)^{-1}, & \mbox{otherwise,} \end{cases} $$ where $\beta \in (0,1)$ is a pre-determined parameter. Observe that the vertex colors stabilize if and only if they form a $k$-coloring.
The CFL algorithm can be executed via the following decentralized procedure:
  1. Set up $C_0(v)$ and $P_0(v,\cdot)$ for each vertex $v$ and let all vertices be active.
  2. A) An active vertex $v$ is deactivated if no vertices in $N(v)$ share the same color with $v$. An inactive vertex $v$ is activated if there is a vertex in $N(v)$ sharing the same color with $v$.
    B) Each vertex $v$ computes $C_n(v)$ and $P_n(v,\cdot)$ upon the $n$-th change of the vertex state (from active to inactive or vice versa).
  3. Repeat Step 2 until all vertices are simultaneously inactive.
Thus we implement the algorithm by carrying out the above procedure with Pregel. A vertex $v$ is represented as a 4-tuple
type Palette = (Color, List[Double], Boolean, Random)
composed of the vertex color $C_n(v)$, the color distribution $P_n(v,\cdot)$, the vertex state (active/inactive), and a random number generator in turn. Given a graph RDD graph, we first construct its base graph and initialize the attributes of the vertices:
val seed = Random.nextInt
val baseGraph = graph.mapVertices((id, _) => {
  val rng = new Random(seed + id)
  (rng.nextInt(k) + 1, initColorDist, true, rng) // a random color in {1,...,k}
})
Note that we choose to maintain an RNG at each vertex. One may be tempted to declare a global RNG and share it among the vertices. However, since Spark serializes the closure as it dispatches jobs, doing so will lead each vertex to use a copy of the global RNG instead of sharing it. Hence, all vertices would obtain the same sequence of random numbers upon their queries, which is definitely an undesired situation.
We use functions sendMsg and mergeMsg to carry out Part A of the second step in the procedure:
def sendMsg (edge: EdgeTriplet[Palette, _]): Iterator[(VertexId, Boolean)] = {
  if (edge.srcAttr._1 == edge.dstAttr._1)
    return Iterator((edge.srcId, true))
  if (edge.srcAttr._3)
    return Iterator((edge.srcId, false))
  Iterator.empty
}
def mergeMsg (msg1: Boolean, msg2: Boolean) = msg1 || msg2
For each edge $(v, u)$, the sendMsg function can activate or deactivate the source vertex $v$ by sending it a Boolean message. The messages are aggregated by mergeMsg via disjunction, such that a vertex is activated if any of its neighbors wants to activate it, and is deactivated if all of its neighbors want to deactivate it.
Part B of the second step is carried out by the vprog function:
def vprog (id: VertexId, attr: Palette, active: Boolean): Palette = {
  val color = attr._1
  val dist  = attr._2
  val rng   = attr._4
  val newDist = dist.foldLeft((1, List[Double]())) {
    case ((i, list), weight) => (i + 1,
      if (active)
        list :+ (weight * (1 - beta) + (if (color == i) 0.0 else beta / (maxNumColors - 1)))
      else
        list :+ (if (color == i) 1.0 else 0.0))
  }._2
  val newColor = if (active) sampleColor(newDist, rng.nextDouble) else color
  (newColor, newDist, active, rng)
}
The vprog function uses a helper function sampleColor to color a vertex randomly according to its current color distribution:
def sampleColor (dist: List[Double], rnd: Double): Int = {
  dist.foldLeft((1, 0.0)) {
    case ((color, mass), weight) => {
      val m = mass + weight
      (if (m < rnd) color + 1 else color, m)
    }}._1
}
Finally, we invoke Pregel with the above functions to compute a $k$-coloring for the base graph baseGraph:
Pregel(baseGraph, true)(vprog, sendMsg, mergeMsg).mapVertices((_, attr) => attr._1)
It is shown in [2] that the CFL algorithm colors a graph in exponential time with high probability. More precisely, given any $\epsilon \in (0,1)$, the algorithm can find a $k$-coloring for a $k$-colorable graph with probability $1-\epsilon$ in $O(N\,e^{cN\lg(1/\epsilon)})$ iterations, where $N$ is the number of vertices and $c$ is a constant depending on $\beta$.

References

1. D. J. Leith, P. Clifford, "Convergence of distributed learning algorithms for optimal wireless channel allocation." IEEE CDC, 2006.
2. Duffy, Ken R., et al. "Complexity analysis of a decentralised graph colouring algorithm." Information Processing Letters, 107.2, 2008.

Thursday, May 28, 2015

A note on formal verification of programs

Formal verification of software involves proving properties of programs under various assumptions. In order to automate the verification process, we have to formalize the program to be verified. This usually amounts to transforming the program from source code to a model while preserving the properties of interest. In this post, we shall present an example formalization of programs. We shall also demonstrate how program properties can be verified using the presented formalization.

Let's first consider a simple imperative programming language that can encode assignments, arithmetic operations, conditional branches, skips, sequences and non-recursive function calls. Clearly, any program written in this language has a finite number of possible execution paths, as well as a finite number of steps in each execution path. This fact allows us to transform a program to a logical formula, such that each model of the formula corresponds to a valid execution path of the program and vice versa.
More precisely, suppose $F(x_1,...,x_n)$ is a program that takes $n$ inputs and gives one output. Then $F$ can be represented as a formula (called a path formula) of the form $$ \phi_F(x_1,...,x_n,y) = \bigvee\nolimits_{i \in I}\left( p_i \wedge y = e_i \right),$$where $I$ is a finite index set, $p_i$ is a predicate and $e_i$ is an expression. A path formula $\phi_F$ is called valid if the following conditions are satisfied: i) $\bigvee_{i \in I}\ p_i$ is a tautology, which means all possible paths are covered by the formula; ii) $\neg\bigvee_{i \neq j}(p_i \wedge p_j)$ is a tautology, which means no two paths can be taken simultaneously; and iii) an assignment to $\{x_1,..., x_n, y\}$ satisfies $\phi_F$ if and only if $F(x_1,..., x_n)=y$, which means the solutions of the formula precisely characterize the input-output relation of program $F$.
One can employ symbolic execution techniques to convert programs in various imperative languages into path formulae; see, e.g., [1,2]. Such conversions usually allow one to reduce the problem of verifying a program property to that of checking satisfiability of first-order constraints. For example, to verify that function $F(x_1,x_2)$ is commutative, i.e., $\forall x_1,x_2.\ F(x_1,x_2) = F(x_2,x_1)$, it suffices to check that the constraint $\phi_F(x_1,x_2,y) \wedge \phi_F(x_2,x_1,y') \wedge y \neq y'$ is unsatisfiable.
Modern SMT solvers effectively decide formulas over combinations of theories such as uninterpreted functions, boolean algebra, integer arithmetic, bit-vectors, arrays, sets and maps [3]. Program verification thus can benefit from advances in SMT solvers by modelling programs as first-order constraints augmented with theories, see e.g., [4-7].

Handling recursive functions

Uninterpreted functions are often leveraged to facilitate automatic reasoning about recursive functions. The idea is to abstract each recursive call to an uninterpreted function call. If a constraint is unsatisfiable assuming uninterpreted functions, then it is also unsatisfiable under the correct interpretation. On the other hand, if a constraint is satisfiable assuming uninterpreted functions, then no reliable conclusion can be drawn for the correct interpretation, as the assumed values may be wrong. In [4], the authors propose a technique to refine the assumptions iteratively when the constraint is satisfiable. This technique is the cornerstone of the Leon verification system [5].
As an illustration of the technique, consider function
function size (list) {
  if (list == null) { return 0 } else { return 1 + size(list.next) }
}
with path formula $ \phi_{size} \equiv (b_1 \wedge y = 0) \vee (\neg b_1 \wedge y = 1 + \mathrm{size(list.next)}),$ and define $$ \phi = (\mathrm{size(list)} = y) \wedge (b_1 \Leftrightarrow \mathrm{list} = \mathrm{null}) \wedge \phi_{size}.$$Note that a boolean variable $b_1$, called a control literal, is introduced to represent the truth value of the branching condition in $\mathrm{size}$. The purpose of introducing control literals to a path formula is to guard invocations of uninterpreted functions. It turns out that we can switch between under- and over-approximation of a path formula by changing assignments of its control literals [4], as explained below.
Suppose we want to verify property $\psi$, say $\psi \equiv \mathrm{size(list)} \ge 0$. Define $\psi_1 \equiv \neg\psi \wedge \phi$. Under the correct interpretation of $\mathrm{size}$, $\psi_1$ is satisfiable iff the $\mathrm{size}$ function violates property $\psi$. In particular, if $\psi_1$ is unsat assuming uninterpreted functions, then it is unsat under the correct interpretation, and $\psi$ holds. However, if a counterexample is reported assuming uninterpreted functions, it may be spurious because $\mathrm{size(list.next)}$ was assigned an incorrect value. To exclude this case, we check the satisfiability of $\psi_1 \wedge b_1$: asserting $b_1$ forces the branch that involves no uninterpreted call. If the latter formula is sat, then so is $\psi_1$ and we are done. Otherwise, we have to explore the branch guarded by $b_1$. For this purpose, we unfold the definition of $\mathrm{size(list.next)}$ one more time in $\phi$, and then rewrite formula $\psi_1$ with appropriate substitutions, e.g., replacing $\mathrm{list}$ with $\mathrm{list.next}$, using fresh variables for $b_1$ and $y$, etc. The new formula is denoted as $\psi_2$, and we proceed to check the satisfiability of $\psi_2$ and $\psi_2 \wedge b_2$, where $b_2$ is the control literal introduced by the unfolding.
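For comparison, this is roughly how the $\mathrm{size}$ function and property $\psi$ might be stated for Leon [5] (a hedged sketch; the names and the list datatype are illustrative):
import leon.lang._

object SizeExample {
  sealed abstract class IntList
  case class Cons(head: BigInt, tail: IntList) extends IntList
  case class Nil() extends IntList

  def size(list: IntList): BigInt = (list match {
    case Nil()         => BigInt(0)
    case Cons(_, tail) => 1 + size(tail)
  }) ensuring (res => res >= 0)   // the property ψ, stated as a post-condition
}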
We repeat these steps and produce a sequence of alternating approximations. Thus the verification procedure for property $\psi$ would look like
    $n := 1$
    while true do
        if $\psi_n$ is unsat then output UNSAT and break
        if $\psi_n\wedge b_n$ is sat then output SAT and break
        $n := n + 1$
    done
Observe that the procedure iteratively computes a succession of under- and over-approximations of the recursive calls in the provided constraint, and it finds a counterexample whenever the constraint is satisfiable. Intuitively, this is because a counterexample corresponds to an execution that violates the property, and the procedure enumerates all possible executions in increasing length. On the other hand, since our demo language becomes Turing-complete once it is equipped with recursion, we cannot expect the procedure to always terminate when the provided constraint is unsatisfiable. For an important class of recursive functions, however, the approach outlined here acts as a decision procedure and terminates in all cases [8]. For instance, the $\mathrm{size}$ function listed above falls into this class.

(To be completed)

References

1. Liu, Chang, et al. "Automating distributed partial aggregation." SoCC, 2014.
2. Srivastava, et al. "Path-based inductive synthesis for program inversion." PLDI, 2010.
3. De Moura, Leonardo, et al. "Satisfiability modulo theories: introduction and applications." Comm. of ACM, 2011.
4. P. Suter, A. S. Koksal, et al. "Satisfiability modulo recursive programs." SAS, 2011.
5. Blanc, Regis, et al. "An overview of the Leon verification system." SCALA, 2013.
6. Bjørner, et al. "Program Verification as Satisfiability Modulo Theories." SMT@ IJCAR, 2012.
7. Bjørner, et al. "Higher-order Program Verification as Satisfiability Modulo Theories with Algebraic Data-types." Higher-Order Program Analysis, 2013.
8. Suter, P., et al. "Decision procedures for algebraic data types with abstractions." POPL, 2010.

Monday, April 6, 2015

Tips for writing Spark programs

Writing serializable closures

When you use a closure, Spark will serialize it and ship it around the cluster. This means that everything the closure captures must be serializable. But sometimes it isn't. Consider this example:
class Foo {
  val factor = 3.14159
  val log = new Log(...) // not serializable
  def multiply(rdd: RDD[Int]) = {
    rdd.map(x => x * factor).reduce(...)
  }
}
The closure passed to map captures the enclosing Foo instance in order to read factor. Since that instance also contains the non-serializable field log, a NotSerializableException will be thrown at runtime. A workaround here is to assign factor to a local variable first:
class Foo {
  val factor = 3.14159
  val log = new Log(...) // not serializable
  def multiply(rdd: RDD[Int]) = {
    // Only factor2 will be serialized.
    val factor2 = factor 
    rdd.map(x => x * factor2).reduce(...)
  }
}
Serialization is a general issue for distributed programs written on the JVM. A future version of Scala may introduce a "serialization-safe" mechanism for defining closures for this purpose.

Preventing the driver program from OOM

1. Operations on RDDs include transformations and actions. When the dataset is distributed over the workers, the outputs of transformations remain distributed, whereas actions transfer their outputs to the driver program, which can lead to OOM for large outputs. A rule of thumb here is to avoid actions that may return unbounded outputs, such as collect, countByKey, collectAsMap, etc. If you need to probe a large RDD, use actions such as sample, count and take to obtain bounded results. You can also write outputs directly from the workers to HDFS using saveAsTextFile and load them back later (see the sketch after this list).
2. Sometimes, you get an OOM error not because your RDDs don't fit in memory, but because the working set of one of your tasks is too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc.) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the number of tasks, so that each task’s input set is smaller. Spark reuses one executor across many tasks with low launching costs, so you can safely increase the number of tasks beyond the number of cores in your cluster.
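Hedged sketches of both tips, where bigRdd and pairRdd are illustrative names:
val firstTen = bigRdd.take(10)                  // bounded probe instead of collect
val estimate = bigRdd.sample(false, 0.001).count()
bigRdd.saveAsTextFile("hdfs:///tmp/output")     // write results from the workers
// Shrink each task's working set by raising the shuffle parallelism:
val counts = pairRdd.reduceByKey(_ + _, 1024)   // 1024 tasks, an arbitrary choice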

Avoiding imbalanced join operations

1. Suppose that you want to inner join two RDDs in Spark, one very small with $n$ partitions and the other very large with $m\gg n$ partitions. This operation will shuffle the data in the large RDD into $n$ partitions, carried out by at most $n$ executors regardless of the number of worker nodes. A better way to do the inner join in this case is to first broadcast the small RDD to the nodes on which the large RDD is distributed, and then perform the join operation at each executor (see the sketch after this list). In this way, you not only avoid the shuffle, but also maintain the parallelism of the larger RDD (with $m$ output partitions instead of $n$). These benefits come at the cost of replicating and broadcasting the smaller RDD around the cluster.
2. If you want to left join a very small RDD with a very large RDD, it may be helpful to first filter the large RDD down to the entries that share keys with the small RDD. This filtering step reduces the data that needs to be shuffled over the network and thus may speed up the overall execution.
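A hedged sketch of the broadcast join from point 1, assuming the small RDD fits in memory (small: RDD[(K, V)] and large: RDD[(K, W)] are illustrative names):
val smallMap = sc.broadcast(small.collectAsMap())
val joined = large.flatMap { case (k, w) =>
  smallMap.value.get(k).map(v => (k, (w, v)))   // inner join without a shuffle
}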

Note that these kinds of balancing tricks are not needed for reduce-like operations such as groupByKey and reduceByKey, where the largest parent RDD’s number of partitions will be used.

References and resources

1. The official guidelines for tuning Spark
2. How-to: Tune Your Apache Spark Jobs at Cloudera
3. CME 323: Distributed Algorithms and Optimization at Stanford Univ.

Sunday, March 22, 2015

Designing algorithms with Spark aggregate – Part II

In the previous post, I proposed a framework for Spark aggregate and listed several examples derived from the framework. One may observe that the $\mathrm{comb}$ functions are associative and commutative in all but the last example. One may also observe that the results of the aggregate calls are deterministic in all but the last example. Now a question arises: is it necessary to use an associative and commutative $\mathrm{comb}$ function in an aggregate call for the result to be deterministic? The answer is (somewhat surprisingly) no. In this post, I shall give a simple guideline for finding deterministic aggregate examples with non-commutative $\mathrm{comb}$ functions.

Given domain $A$ and range $B$, the idea is to find $B' \subset B$, $\mathrm{seq}: B \rightarrow A \rightarrow B'$ and $\mathrm{comb}: B \rightarrow B \rightarrow B'$, such that $\mathrm{aggregate(zero,\ seq,\ comb,}\ as)$ is deterministic w.r.t. the provided $\mathrm{zero} \in B$ and all $as \in A^+$. Moreover, $\mathrm{comb}$ is non-commutative over $B$, but the restriction of $\mathrm{comb}$ to $B'$ is associative and commutative.

The above idea can be instantiated using an abelian subgroup $(B', ⊗)$ of a non-abelian group $(B, ⊗)$. For example, we can let $B'$ be a cyclic subgroup of a symmetric group $B$. Here ⊗ is function composition and $A$ is the set of integers:

$\quad \mathrm{zero}$: an arbitrary element from $B$.
$\quad \mathrm{seq}\ (b, n) = b^n$.
$\quad \mathrm{comb}\ (b_1, b_2) = b_1 ⊗ b_2$.

Observe that the range of $\mathrm{comb}$ is confined to $B'$ due to the closure property of group operators. For another example, we can let $B$ be the group comprised of all non-singular 2-by-2 square matrices, and $B'$ be the subgroup formed by the rotation matrices in $B$. Here we set $A = B$ and let ⊗ be matrix multiplication:

$\quad \mathrm{zero}$: the identity matrix.
$\quad \mathrm{seq}\ (b, a) = b ⊗ [a]$, where $[x] = x \cdot |\det(x)|^{-1}$, the normalized matrix of $x$.
$\quad \mathrm{comb}\ (b_1, b_2) = b_1 ⊗ b_2$.

We can also construct examples using an abelian base group; we just have to "twist" the $\mathrm{comb}$ function so that it is commutative only over $B'$. For example, let $B$ be the set of n-by-n square matrices, and $B'$ be the subset of $B$ formed by all matrices whose last row is zero. Moreover, set $A = B$ and use matrix addition $+$ as the commutative group operator. Then we have the following example:

$\quad \mathrm{zero}$: the zero matrix.
$\quad \mathrm{seq}\ (b, a) = b + \{a\}$, where $\{x\}$ is obtained from setting the last row of ${x}$ to zero.
$\quad \mathrm{comb}\ (b_1, b_2) = b_1 + \langle b_2\rangle$, where $\langle x\rangle$ is obtained from doubling the last row of $x$.

Observe that $\mathrm{seq}$ and $\mathrm{comb}$ behave just like ordinary matrix addition over $B'$. They however become non-commutative operators when applied to elements outside $B'$. It turns out that we can use a textbook of elementary abstract algebra as a free source of interesting aggregate examples, at least in theory.
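As a hedged, concrete sketch of the last example in Spark, take 2-by-2 matrices flattened into 4-tuples, the "last row" being the last two components (sc is assumed to be an existing SparkContext):
type M = (Double, Double, Double, Double)
def plus(x: M, y: M): M = (x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4)
def zeroLast(a: M): M   = (a._1, a._2, 0.0, 0.0)            // {a}
def doubleLast(b: M): M = (b._1, b._2, 2 * b._3, 2 * b._4)  // <b>

val zero: M = (0.0, 0.0, 0.0, 0.0)
val seq  = (b: M, a: M)   => plus(b, zeroLast(a))      // seq(b, a) = b + {a}
val comb = (b1: M, b2: M) => plus(b1, doubleLast(b2))  // comb(b1, b2) = b1 + <b2>

// Every partial result keeps a zero last row (i.e., it lies in B'), where
// doubling the last row is a no-op; hence the result is deterministic even
// though comb is non-commutative on B.
val result = sc.parallelize(Seq[M]((1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0)))
               .aggregate(zero)(seq, comb)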

Saturday, February 21, 2015

An example of online algorithm

Online algorithms. An online algorithm is designed to receive and serve an unbounded (but finite) sequence of requests $\sigma=\sigma(1),\sigma(2),...,\sigma(m)$ with bounded memory. The requests are served in the order of occurrence. When serving request $\sigma(t)$, the algorithm doesn't have knowledge of requests $\sigma(t')$ for $t'>t$, and the total number of requests is unknown until the last request is served. Serving requests incurs cost, and the goal of the algorithm designer is to minimize the total cost paid on serving the entire request sequence.
Competitive analysis. The performance of an online algorithm is usually evaluated by comparing it with an optimal offline algorithm in the worst case. Let $ALG$ be an online algorithm, and $OPT$ be an algorithm that knows all the requests in advance, so that it can serve them at minimal cost. Given a request sequence $\sigma$, let $ALG(\sigma)$ and $OPT(\sigma)$ denote the costs incurred by $ALG$ and $OPT$, respectively. Algorithm $ALG$ is called $c$-competitive if there exists a constant $b$ such that for all sequences $\sigma$, $$ALG(\sigma) \le c \cdot OPT(\sigma) + b.$$The competitive ratio of algorithm $ALG$ is defined as $c(ALG)=\sup_{\sigma}ALG(\sigma)/OPT(\sigma).$ The best competitive ratio an algorithm can achieve is thus $c^*=\inf_{ALG}c(ALG)$.

Example. The ski rental problem is one of the classic online optimization problems. The scenario is as follows. You are having a vacation at a ski resort without skis. You can either rent or buy skis from the local shops. You are determined to ski on every sunny day. Suppose the cost of renting skis is one dollar per day, and the cost of buying skis is $s>1$ dollars. Denote the number of sunny days in your vacation by $d$. Clearly, if $d$ is known beforehand (e.g., from a magical local weather forecast service), then the optimal (offline) strategy is to rent skis all the way when $d<s$, and to buy skis on the first day when $d\ge s$. Now assume you are in the opposite situation, where $d$ is unknown and can't be estimated. You need an online algorithm to help you decide whether to rent or buy skis.
Suppose a deterministic online algorithm tells you to rent skis up to day $k$ ($k\le d$) and buy skis on day $k+1$ when $k<d$. Then the competitive ratio of this algorithm is computed by\begin{align*}\label{eq:eg1} \max\left\{\min\limits_{k}\max\limits_{d\le k,\, d\le s}\left\{\frac{d}{d}\right\}, \min\limits_{k}\max\limits_{d\le k,\, d>s}\left\{\frac{d}{s}\right\}, \min\limits_{k}\max\limits_{d>k,\, d\le s}\left\{\frac{k+s}{d}\right\}, \min\limits_{k}\max\limits_{d>k,\, d>s}\left\{\frac{k+s}{s}\right\}\right\}. \end{align*}The four terms in the braces correspond to the online/optimal cost ratios in the four cases. In each case, the ratio is first maximized by the adversary (i.e., the weather), who chooses $d$ knowing the values of $k$ and $s$; the online algorithm then minimizes the maximal ratio (i.e., the worst case) by selecting $k$. In this way, we obtain a tight upperbound on the cost ratio for each case. Since competitive analysis is a worst-case analysis, the largest of the four upperbounds is the competitive ratio:$$\max\left\{1, \frac{s+1}{s}, \min_{k<s}\left\{\frac{k+s}{k+1}\right\}\right\} = \max\left\{1, \frac{s+1}{s}, \frac{2s-1}{s}\right\} = 2 - \frac{1}{s}.$$This ratio is met by renting for $s-1$ days and buying on the $s$th day when $s\le d.$ No deterministic algorithm can achieve a competitive ratio smaller than $2-\frac{1}{s}$.

Randomized online algorithms. In many problems, online algorithms can achieve a better performance if they are allowed to make random choices. A randomized online algorithm $ALG$ is called $c$-competitive if there is a constant $b$ such that $E[ALG(\sigma)] \le c \cdot OPT(\sigma)+b$ for all request sequences $\sigma$, where the expectation is taken over the random choices made by $ALG$. The best competitive ratio for randomized online algorithms is thus $c^*=\inf_{ALG}\sup_{\sigma}E[ALG(\sigma)]/OPT(\sigma).$ For simplicity, here we only consider requests generated by an oblivious adversary, that is, an adversary that generates the entire request sequence before any requests are served and any random choices are made. One can also consider competitive ratios against more powerful adversaries; see [3] for more details.

Example. In the ski rental problem, a randomized algorithm chooses some day $k$ at random according to some distribution, rents for up to $k$ days, and buys skis on day $k+1$. Given the distribution of $k$, the adversary chooses $d$ to maximize the expected online/optimal cost ratio in each case. The distribution of $k$ is then chosen to minimize the maximal expected ratio over all cases. It turns out that the optimal competitive ratio of randomized online algorithms is roughly $\frac{e}{e-1}\approx 1.58$ [1,2]; a sketch of the derivation follows.
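Here is a sketch of the standard continuous-time derivation (see [1,2] for rigorous treatments; the argument below is our reconstruction). Normalize the rental cost to $1$ per unit time and suppose the algorithm buys at a random time $t$ drawn from a density $p$ on $[0,s]$; the adversary stops the season at some time $d\le s$ (stopping later cannot help, since the algorithm has bought by time $s$). Then $OPT(d)=d$ and $$E[ALG(d)] = \int_0^d (t+s)\,p(t)\,dt + d\int_d^s p(t)\,dt.$$Requiring $E[ALG(d)] = c\cdot d$ for all $d$ (so that the adversary is indifferent among its choices) and differentiating twice yields $s\,p'(t) = p(t)$, whence $p(t) = \frac{e^{t/s}}{s(e-1)}$ after normalization. Substituting back gives $c = s\,p(0) + 1 = \frac{e}{e-1}$.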

Lower-bound analysis. A randomized algorithm may be interpreted as a random choice among deterministic algorithms. In this sense, the problem of finding the best competitive ratio may be viewed as a 2-person zero-sum game, where i) the players are the Algorithm Designer and the Adversary, ii) Adversary’s strategies are all request sequences, iii) Designer's strategies are all deterministic online algorithms, and iv) the payoff for Designer is $ALG(\sigma)/OPT(\sigma)$. In addition, the players are allowed to use randomized strategies. Suppose that Designer chooses algorithms $ALG$ according to probability distribution $P$, and Adversary chooses sequences $\sigma$ according to probability distribution $Q$. Then it follows from Yao's principle that the best competitive ratio $c^*$ for a randomized online algorithm satisfies $$c^*\ge \inf_{ALG}E_Q[ALG(\sigma)/OPT(\sigma)].$$That is, lower bounds for $c^*$ can be obtained by computing the minimal expected online/optimal cost ratio of deterministic online algorithms against random request sequences. (In fact, using a more general result called von Neumann's Minimax Theorem, one can show that $c^*=\sup_{Q}\inf_{ALG}E_{Q}[ALG(\sigma)/OPT(\sigma)].$)

References

1. https://en.wikipedia.org/wiki/Ski_rental_problem
2. http://cs.brown.edu/~claire/Talks/skirental.pdf
3. http://www14.in.tum.de/personen/albers/papers/inter.pdf

Thursday, February 12, 2015

Designing algorithms with Spark aggregate – Part I

In this post, I shall give a simple framework to design algorithms in terms of Spark's aggregate function. This framework is just a proof of concept and should not be applied blindly in practice. However, it does induce efficient algorithms in some special cases, as will be shown in the examples.

In general, our goal is to compute a property of any given instance from a domain. Let $G$ denote the domain of all instances of interest, and $C$ the domain of the desired property. For example, if $G$ denotes graphs and the property to compute is the MST, then $C$ will be the set of trees. Furthermore, we assume that there exists an associative and commutative "composition" operator $\oplus : G \rightarrow G \rightarrow G$, such that an instance of interest can be decomposed into (and composed back from) sub-instances in $G$. We shall denote the desired property of instance $g$ by $f(g)$. For example, if $g$ is a graph, then $f(g)$ can be some graph property such as its diameter, an MST, or a vertex cover. Note that the mapping $f : G \rightarrow C$ can be non-deterministic if the property is not unique w.r.t. the given instance.

To use the aggregate function, we specify zero, seq and comb as follows:
$\mathrm{zero} \in (G, C)$ is a pair comprised of the zero values from $G$ and $C$. If there are no such zero values, we can set $\mathrm{zero} = (g, f(g))$ for an arbitrary sub-instance $g$ of the input.
$\mathrm{seq} : (G, C) \rightarrow G \rightarrow (G, C)$ takes a partially aggregated result $(g_1, f(g_1))$ and a sub-instance $g_2$. It outputs $(g_3, f(g_1 \oplus g_2))$ such that $f(g_3 \oplus g) = f(g_1 \oplus g_2 \oplus g)$ for all $g \in G$.
$\mathrm{comb} : (G, C) \rightarrow (G, C) \rightarrow (G, C)$ takes two partially aggregated results $(g_1, f(g_1))$ and $(g_2, f(g_2))$. It outputs $(g_3, f(g_1 \oplus g_2))$ such that $f(g_3 \oplus g) = f(g_1 \oplus g_2 \oplus g)$ for all $g \in G$.
The rest of the steps are standard: we first decompose the input instance into a sequence of sub-instances. The sequence is then converted to an RDD through parallelize. Finally, we invoke RDD.aggregate(zero)(seq,comb) to compute the desired property of the instance. A generic sketch of this recipe is given below.
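The following Scala sketch renders the recipe generically (this helper and its names are ours, not a Spark API). Here compose plays the role of $\oplus$, f computes the property, and shrink replaces $g_1 \oplus g_2$ by a hopefully smaller $g_3$ with the same behaviour under $f$:
import org.apache.spark.rdd.RDD

// Hypothetical helper: compose is ⊕, f computes the property, and
// shrink(g) returns some g' with f(g' ⊕ h) == f(g ⊕ h) for all h.
def aggregateProperty[G, C](rdd: RDD[G], zero: (G, C),
                            compose: (G, G) => G, f: G => C,
                            shrink: G => G): C = {
  // seq folds one sub-instance into a partial result; a real instantiation
  // would compute f incrementally from acc._2 instead of from scratch.
  def seq(acc: (G, C), g: G): (G, C) = {
    val merged = compose(acc._1, g)
    (shrink(merged), f(merged))
  }
  // comb merges two partial results, e.g. via the merge step of a
  // divide-and-conquer method.
  def comb(a: (G, C), b: (G, C)): (G, C) = {
    val merged = compose(a._1, b._1)
    (shrink(merged), f(merged))
  }
  rdd.aggregate(zero)(seq, comb)._2
}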

Remarks

1. While functions seq and comb look very similar in spec, they can use different methods to compute $f(g_3)$. Given the partial results they receive, seq is suited to incremental methods, while comb can use (the merge part of) divide-and-conquer methods to compute the desired property. See the convex hull example below.
2. The assumption that operator $\oplus$ is associative and commutative over $G$ can be relaxed to hold only with respect to the given property $f$. For instance, suppose that $G$ is a set of finite strings and $\oplus$ is the string concatenation operator. While $\oplus$ is not commutative per se, it is commutative w.r.t. the string length property. That is, $\mathrm{strlen}(g_1 \oplus g_2) = \mathrm{strlen}(g_2 \oplus g_1)$ for all $g_1, g_2 \in G$. The framework is thus applicable to computing the length of a string. See the majority algorithm below for a less trivial example.
3. The framework is practical only if $|g_3|$ can be significantly smaller than $|g_1 \oplus g_2|$, for otherwise we would have to pass almost the whole instance from node to node. In the case that $f(g_1 \oplus g_2)$ can be computed from $f(g_1)$ and $f(g_2)$ alone, there is no need to output $g_3$ at all, i.e., $|g_3|=0$. This special case allows us to operate on simpler domains (e.g., the type of seq becomes $C \rightarrow G \rightarrow C$) and to obtain more efficient aggregate algorithms thanks to lower communication overhead.

Examples

The above naive framework can yield efficient algorithms if $f(g_1 \oplus g_2)$ can be computed solely based on $f(g_1)$ and $f(g_2)$. Below we give three examples that satisfy this condition. The first is a variant of the reverse-delete algorithm that finds an MST of a graph. The second is an outline of a convex hull algorithm. The last is Boyer and Moore's algorithm that finds the majority element in a list.

Finding MST. Let $G$ be the set of all weighted undirected graphs, and $C$ be the set of trees. Given a graph $g$ from $G$, we assign a unique index to each of its edges. Let $w(e)$ and $idx(e)$ denote the weight and index of edge $e$, respectively. Given two edges $e_1$ and $e_2$ of $g$, we say $e_1 < e_2$ iff either $w(e_1) < w(e_2)$, or $w(e_1) = w(e_2)$ and $idx(e_1) < idx(e_2)$. Now we define an operator msf over $G$, such that msf computes a minimum spanning forest of $g_1 \cup g_2$ from $g_1$ and $g_2$ as follows:
def msf (g1, g2) = {
  var g = g1 ∪ g2
  while (g contains a cycle c)
    remove the largest edge of c from g  // largest w.r.t. the total order < above
  return g
}
Let RDD(g) denote the RDD obtained from decomposing and parallelizing graph $g$. Then RDD(g).aggregate(Nil)(msf,msf) computes a minimum spanning forest of $g$, which is an MST if $g$ is connected. Note that the result is deterministic provided that the assignment of indices is deterministic.

Finding convex hull. Here $G$ is the collection of finite point sets in the Euclidean plane and $C$ is the set of convex polygons. The convex hull of a set of points is the smallest convex polygon containing all the points. If the points are given as a sequence, we can use aggregate to find their convex hull by computing smaller convex hulls incrementally in seq and then merging them in comb to obtain the final convex hull. Both seq and comb can be implemented to run in linear time (see the links for pseudocode); a simplified sketch is given below.
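For concreteness, here is a simplified Scala sketch based on Andrew's monotone chain algorithm. It is not the linear-time implementation alluded to above: for brevity, both operators simply recompute the hull of their combined inputs, and the helper names are ours.
type Pt = (Double, Double)

// Positive iff the points o -> a -> b make a counter-clockwise turn.
def cross(o: Pt, a: Pt, b: Pt): Double =
  (a._1 - o._1) * (b._2 - o._2) - (a._2 - o._2) * (b._1 - o._1)

// Convex hull via Andrew's monotone chain, in O(n log n) time.
def hull(ps: Seq[Pt]): Seq[Pt] = {
  val sorted = ps.distinct.sorted
  if (sorted.size <= 2) sorted
  else {
    def chain(pts: Seq[Pt]): List[Pt] =
      pts.foldLeft(List.empty[Pt]) { (stack, p) =>
        var st = stack
        // Pop while the last two points and p fail to make a left turn.
        while (st.size >= 2 && cross(st(1), st.head, p) <= 0) st = st.tail
        p :: st
      }.reverse
    chain(sorted).init ++ chain(sorted.reverse).init  // lower ++ upper hull
  }
}

val seqOp  = (h: Seq[Pt], p: Pt) => hull(h :+ p)          // incremental step
val combOp = (h1: Seq[Pt], h2: Seq[Pt]) => hull(h1 ++ h2) // merge step
// points.aggregate(Seq.empty[Pt])(seqOp, combOp)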

Finding majority. Let $C$ be a set of comparable elements. Fix some element $\bot \in C$ and let $G = (C - \{\bot\})^*$ be a set of finite sequences. The majority problem is to determine whether a given sequence from $G$ contains an element that occurs more often than all the other elements combined, and if so, to find this element. Click here to see a linear-time algorithm that solves the problem using aggregate; a sketch is also given below. Note that the aggregate invocation only finds a candidate for majority; we need a further pass to check that the candidate is indeed the majority element.
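The following Scala sketch shows how Boyer–Moore voting fits the aggregate interface (the code behind the link may differ; this is our reconstruction). The accumulator is a (candidate, count) pair, with None playing the role of $\bot$:
// seqOp: the classic single-pass voting step.
def seqOp(acc: (Option[Int], Long), x: Int): (Option[Int], Long) = acc match {
  case (_, 0L)                => (Some(x), 1L)     // adopt a new candidate
  case (Some(c), n) if c == x => (Some(c), n + 1)  // reinforce the candidate
  case (c, n)                 => (c, n - 1)        // cancel one vote
}

// combOp: merge two (candidate, count) summaries by vote cancellation;
// a strict majority, if one exists, always survives the merge.
def combOp(a: (Option[Int], Long), b: (Option[Int], Long)): (Option[Int], Long) =
  if (a._1 == b._1) (a._1, a._2 + b._2)
  else if (a._2 >= b._2) (a._1, a._2 - b._2)
  else (b._1, b._2 - a._2)

// val candidate = rdd.aggregate((Option.empty[Int], 0L))(seqOp, combOp)._1
// A second pass must verify that the candidate occurs more than n/2 times.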

Friday, January 23, 2015

Deterministic aggregate semantics in Scala and Spark

The aggregate function in Scala has method signature as follows:
def aggregate[A](z: => A)(seqOp: (A, B) => A, combOp: (A, A) => A): A
The semantics of aggregate depend on the collection upon which it is invoked. When aggregate is invoked on a sequential traversable collection such as a List, it falls back to foldLeft and ignores the provided combOp. In this case, aggregate behaves just like the ordinary fold in typical functional programming languages such as Lisp and ML. When aggregate is invoked on a parallel traversable collection, however, it performs a two-stage aggregation over the data elements in parallel. In Scala, a parallel collection can be split into partitions and distributed over several cores or nodes. Invoking aggregate on a parallel collection computes a value for the collection by first folding each partition separately using seqOp, and then merging the partial results one by one using combOp.
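As a toy illustration (our own example), the following computes a sum and an element count in a single pass over a parallel collection; the average follows:
val xs = (1 to 4).toList.par
val (sum, count) = xs.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: fold one element into a partition's result
  (a, b)   => (a._1 + b._1, a._2 + b._2)) // combOp: merge two partition results
val avg = sum.toDouble / count            // 2.5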

Note that the Scala API documentation specifies neither the way in which the collection is partitioned nor the order in which the intermediate results are merged. Hence, a programmer who hopes to obtain the same output whenever aggregate is invoked on the same parallel collection must ensure that seqOp is associative and combOp is both associative and commutative.

Aggregate vs. fold

The fold function in Scala has method signature as follows:
def fold[A1 >: A](z: A1)(combOp: (A1, A1) => A1): A1
One can observe two differences between the signatures of fold and aggregate: 1) fold needs only one operator, while aggregate needs two; 2) fold requires the result type to be a supertype of the element type, while aggregate doesn't. Note that both methods fall back to foldLeft when they are invoked on a sequential collection. One may notice that Scala's aggregate is in effect the ordinary FP fold
fold :  (A → B → A) → A → ([B] → A)
and Scala's fold is just a special form of its aggregate. The idea behind this design is to allow for a parallel FP fold. In Scala, fold and aggregate both proceed in two stages, namely folding and combining. For fold, the same operator is used for folding and combining, which requires the supertype relationship between the element type and the result type. Without the supertype relationship, it takes two operators to fold and combine, and this is why Scala provides aggregate, which takes two operators as arguments.
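A toy contrast (our own example): fold can only produce a value whose type is a supertype of Int, whereas aggregate can fold Ints directly into a String:
val xs = List(1, 2, 3)
val sum: Int = xs.fold(0)(_ + _)                               // result type must be >: Int
val str: String = xs.aggregate("")((acc, x) => acc + x, _ + _) // "123"; result type is free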

Aggregate in Spark

Spark implements its own aggregate operations for RDDs. One can observe from the source code of Spark that the intermediate results of aggregate are always iterated and merged in a fixed order (see the Java, Python and Scala implementations for details). Hence, we don't have to use a commutative combOp to obtain deterministic results from aggregate, provided that at least one of the following conditions holds:
(i) there doesn't exist an empty partition in any partitioning of the collection, or
(ii) z is an identity of combOp, i.e., for all a of type A, combOp(z,a) = combOp(a,z) = a.
Note that this observation is not assured by the official Spark API documentation and is thus not guaranteed to hold in future versions of Spark. However, it is still useful if we want to model the behaviours of a Spark program formally.
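To see why condition (ii) matters, consider this toy run (our own example, assuming a SparkContext sc): with a zero that is not an identity of combOp, the result depends on the number of partitions, since zero is folded in once per partition and once more in the final merge:
val data = 1 to 4                                            // sum = 10
val r2 = sc.parallelize(data, 2).aggregate(1)(_ + _, _ + _)  // 1 + (1+1+2) + (1+3+4) = 13
val r4 = sc.parallelize(data, 4).aggregate(1)(_ + _, _ + _)  // 1 + 2 + 3 + 4 + 5 = 15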

Below, we shall try to formalize Spark's implementation of aggregate. While Scala is an impure functional language, we only consider pure functions here for simplicity. Let $R(L,m)$ denote an RDD object obtained from partitioning a list $L \in \mathbb{D}^*$ into $m \ge 1$ sub-lists. Suppose that we have fixed an element $zero \in \mathbb{E}$, as well as operators $seq:{\mathbb{E} \times \mathbb{D} \to \mathbb{E}}$ and $comb:{\mathbb{E} \times \mathbb{E} \to \mathbb{E}}$. An invocation of aggregate is said to be deterministic with respect to $zero$, $seq$ and $comb$ if the output of $$R(L,\cdot).\textrm{aggregate}(zero)(seq,\ comb)$$ only depends on $L$. Define a function $\textrm{foldl}:({\mathbb{E} \times \mathbb{D} \to \mathbb{E}}) \times \mathbb{E} \times \mathbb{D}^* \to \mathbb{E}$ by \begin{eqnarray*} \textrm{foldl}(f,\ a,\ Nil) & = & a \\ \textrm{foldl}(f,\ a,\ b \!::\! bs) & = & \textrm{foldl}(f,\ f(a,b),\ bs), \end{eqnarray*} where $f\in {\mathbb{E} \times \mathbb{D} \to \mathbb{E}}$, $a \in \mathbb{E}$, $b \in \mathbb{D}$ and $bs \in \mathbb{D}^*$. Suppose that $1\le m \le |L|$ and $R(L,m)$ partitions $L$ into $\{L_1, \dots, L_m\}$ such that $L=L_1 \cdots L_m$. According to the source code of Spark, we can define the aggregate operation in Spark as \begin{eqnarray*} R(L,m).\textrm{aggregate}(zero)(seq,\ comb) = \textrm{foldl}(comb,\ zero,\ L'), \end{eqnarray*} where $L' \in \mathbb{E}^*$ is a list of length $m$ and $L'[i] = \textrm{foldl}(seq,\ zero,\ L_i)$ for $i=1,\dots,m$.

Note that the partitioning of $L$ is unspecified and may be non-deterministic. Hence, we have to use an associative $comb$ operator if we want to ensure that the result of aggregation is deterministic. Suppose that $comb$ is associative and at least one of conditions (i) and (ii) is satisfied. It turns out that the output of aggregate is deterministic with respect to $zero$, $seq$ and $comb$ if and only if for all lists $L_1$ and $L_2$, $$comb(zero,\ \textrm{foldl}(seq,\ zero,\ L_1L_2)) = comb(\textrm{foldl}(seq,\ zero,\ L_1),\ \textrm{foldl}(seq,\ zero,\ L_2)).$$ While the above condition is undecidable in general, it can be effectively checked for some classes of programs such as FSMs over finite alphabets.
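As a toy instance of this condition (our own example), taking $zero=0$ and letting both $seq$ and $comb$ be addition satisfies it, since both sides reduce to the sum of $L_1L_2$:
// foldl as defined above, specialized to Scala's foldLeft.
def foldl[A, B](f: (A, B) => A, z: A, l: List[B]): A = l.foldLeft(z)(f)

val zero = 0
val seq  = (a: Int, b: Int) => a + b
val comb = (a: Int, b: Int) => a + b
val (l1, l2) = (List(1, 2), List(3, 4, 5))
val lhs = comb(zero, foldl(seq, zero, l1 ++ l2))            // 15
val rhs = comb(foldl(seq, zero, l1), foldl(seq, zero, l2))  // 15
assert(lhs == rhs)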

Concluding remarks

Aggregate in Spark is the parallelized version of the conventional fold, which has long been shown to be extremely useful in functional programming. It is therefore of practical interest to study the properties and behaviours of the aggregate function. In particular, it is interesting to establish and verify the conditions under which aggregate is deterministic with respect to the provided $zero$, $seq$ and $comb$. We may further investigate this problem in coming posts.

Thursday, January 15, 2015

Emulating Hadoop in Spark

The map and reduce functions in MapReduce have the following general form:
Map: (K1, V1) → list(K2, V2)
Reduce: (K2, list(V2)) → list(K3, V3)
When a MapReduce program is executed, the Map function is applied in parallel to every pair in the input dataset. Each invocation of map produces a list of pairs. The execution framework then collects all pairs with the same key from all lists and groups them together, creating one group for each key. After that, the Reduce function is applied in parallel to each group, which in turn produces a collection of key-value pairs. In Hadoop's implementation of MapReduce, the intermediate pairs are sorted by key before they are passed to a reducer.

Despite the suggestive naming, map and reduce operations in Spark do not directly correspond to the functions of the same name in MapReduce. For example, Map and Reduce functions in MapReduce can return multiple key-value pairs. This feature can be implemented by the flatMap operation in Spark (and typical FP languages in general):
List(1, 2, 3).map(a => List(a))     // List(List(1), List(2), List(3))
List(1, 2, 3).flatMap(a => List(a)) // List(1, 2, 3)
One naive attempt to emulate Hadoop MapReduce (without combiners) in Spark is to use two flatMap operations for map and reduce, separated by a groupByKey and a sortByKey to perform shuffle and sort:
val input: RDD[(K1, V1)] = ...
val mapOutput: RDD[(K2, V2)] = input.flatMap(mapFn)
val shuffled: RDD[(K2, Iterable[V2])] = mapOutput.groupByKey().sortByKey()
val output: RDD[(K3, V3)] = shuffled.flatMap(reduceFn)
Here the key type K2 needs an implicit Ordering in scope to satisfy the signature of sortByKey. The above code is useful for demonstrating the relationship between Hadoop and Spark, but it should not be applied blindly in practice. For one thing, its semantics differ slightly from Hadoop's MapReduce, since sortByKey performs a total sort over all key-value pairs while Hadoop only sorts the keys within each reducer partition. This issue can be avoided by using repartitionAndSortWithinPartitions, introduced in Spark 1.2, to perform a partial sort:
val shuffled: RDD[(K2, Iterable[V2])] = mapOutput
    .groupByKey()
    .repartitionAndSortWithinPartitions(Partitioner.defaultPartitioner(mapOutput))
The improvement is unfortunately still not as efficient, since Spark uses two shuffles (one for the groupByKey and one for the sort) while MapReduce uses only one. It turns out that one cannot precisely and efficiently capture the semantics of Hadoop's MapReduce in the current version of Spark.

Rather than trying to reproduce MapReduce in Spark, it is better to use only the operations that you actually need. For example, if you don’t need keys to be sorted, you can omit the sortByKey call to avoid sorting (which is impossible in regular Hadoop MapReduce). Similarly, groupByKey is too general in most cases. If shuffling is needed only for aggregating values, you should use reduceByKey, foldByKey, or aggregateByKey, which are more efficient than groupByKey since they exploit combiners in the map task. Finally, flatMap may not always be needed either: you can use map if there is always exactly one return value, or filter if there is zero or one.
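For instance, the canonical word count (assuming some lines: RDD[String]) needs neither groupByKey nor a reduce-side flatMap; reduceByKey combines values on the map side before the shuffle, so only one partial count per key and partition crosses the network:
val counts = lines
  .flatMap(_.split("\\s+"))   // one word per element
  .map(word => (word, 1))     // key-value pairs
  .reduceByKey(_ + _)         // map-side combining, then a single shuffle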

Consult these two articles on Cloudera for details of translating Hadoop programs to efficient Spark programs: How-to: Translate from MapReduce to Apache Spark, part 1 and part 2.

Further readings

1. Jimmy Lin, Chris Dyer. Data-Intensive Text Processing with MapReduce.
2. Tom White. Hadoop: The Definitive Guide, 4th Edition.
3. Srinivasa, K.G., Muppalla, Anil K. Guide to High Performance Distributed Computing.