Ericpony's Blog: March 2014

Thursday, March 13, 2014

Scala: Views of Collections

In Scala, you can create Views for many types of collections. Views are non-strict versions of collections meant to be treated much like a database view. This means that the elements are calculated at access and not eagerly as in normal collections. As an example consider the following code:

val ys = (1 to 10).view.map { x => println(x); }

This will not print anything but every access to the list will perform the calculation and print the value, i.e. every call to ys.head will result in a number being printed. If you want to get a strict version of the collection again you can call ys.force, which will print all numbers out.

One use of views is to traverse a collection of values that are expensive to compute and only one value is needed at a time. Also, views let you build lazy sequences by calling toStream on them, which will also cache the evaluated elements. Another use of views is to avoid intermediate copies of collection. For example, the following code

(1 to 10).map(_-1).map(_*2).map(-_)

creates a new list for each call of map operation. To avoid the intermediate results, one can turn the list first into a view, applying all transformations to the view, and finally forcing the view to a sequence:

(1 to 10).view.map(_-1).map(_*2).map(-_).force

The three stored functions (_-1), (_*2), (-_) get applied as part of the execution of the force operation and a new sequence is constructed. That way, no intermediate data structure is needed. Another example of the use of view would be (assuming that we have very little memory at hand):

scala> (1 to 1000000000).take(5).toList
java.lang.OutOfMemoryError: GC overhead limit exceeded

Here Scala tries to create a collection with 100000000 elements to access the first five. Using view, only the first five elements will be generated and accessed:

scala> (1 to 1000000000).view.take(5).force.toList
res2: List[Int] = List(1, 2, 3, 4, 5)

The third use case applies to views over mutable sequences. Many transformer functions on such views provide a window into the original sequence that can then be used to update selectively some elements of that sequence. The view does not copy these elements, it just provides a reference to them. For example,

def neg(a: collection.mutable.Seq[Int]) = a.indices.foreach(i => {a(i) = -a(i)})
var A = Array(1,2,3,4,5,6)
neg(A.slice(1,4))      // A doesn't change
neg(A.view.slice(1,4)) // A becomes (1,-2,-3,-4,-5,6)

Remarks. For smaller collection sizes the added overhead of forming and applying closures in views is often greater than the gain from avoiding the intermediary data structures. A possibly more important reason is that evaluation in views can be very confusing if the delayed operations have side effects. Because of the penalties incurred by views, one should usually force it after applying the transformations, or keep it as a view if only few elements are expected to ever be fetched, compared to the total size of the view.

Tuesday, March 11, 2014

Loop Invariants & Loop Variants

Loop invariant

In Hoare logic, the partial correctness of a while loop is governed by the following rule of inference:$$\frac{\{C\land I\}\;S\;\{I\}} {\{I\}\;\mathbf{while}\; C \; \mathbf{do}\;S\;\{\lnot C\land I\}}$$ The rule above is a deductive step where the premise states that the loop's body does not change $I$ from true to false given the condition $C$, and the conclusion that if the loop terminates then it leads from a state satisfying $I$ to a state satisfying $\lnot C\land I$. The formula $I$ in this rule is known as the loop invariant. In words, a loop invariant

holds before the loop begins
establishes the desired result (i.e. the postcondition) given that $\lnot C$ holds
holds before and after the execution of each iteration
describes what has been done so far and what remains to be done at each iteration

Finding a useful invariant is often a tricky task requiring sophisticated heuristics. As an illustration, consider the following program that computes the factorial: $$\{x=n \land y=1\}\quad\mathbf{while}\;x\neq 0\;\mathbf{do}\;y:=y\cdot x;\ x:=x-1\quad\{y=n!\}$$ In each iteration, $y$ holds the result so far, $x!$ is what remains to be computed, and $n!$ is the desired results. Also note that $y$ is increasing while $x$ is decreasing. Hence, we can put the decrement of $x$ and the increment of $y$ together to obtain an invariant $x!\cdot y=n!$, as desired.
Now consider a similar program that also computes $n!$: $$\{x=1 \land y=1\}\quad\mathbf{while}\; x<n\; \mathbf{do}\;x:=x+1;\ y:=y\cdot x\quad\{y=n!\}$$This time, both $x$ and $y$ increase as the loop iterates. One can immediately find an invariant $y=x!$ according to rules of thumb. This invariant, however, turns out to be too weak to prove the desired result: we need $y=n!$ after the loop terminates, while the iteration rules only gives $x\ge n \land y=x!$. To fix this, we can strengthen the invariant by conjuncting another invariant $x\le n$ with it. The new invariant $x\le n \land y=x!$ yields $x\ge n\land x\le n \land y=x!$ on termination, which is precisely what we need. (A similar problem would occur in the first example if the guard was $x\ge 1$ instead of $x\neq 0$. Again, this can be solved by strengthening the invariant with $x\ge 0$.)
We refer the reader to this note for more interesting examples of loop invariants.

Loop variant

In Hoare logic, the termination of a while loop can be ensured by the condition $[V=z]\;S\;[V<z]$, where $<$ is a well-founded relation, $V$ is called the loop variant, and $z$ is a unbound symbol that is universally quantified. Combining it with the aforementioned rule of partial correctness, we obtain a rule that expresses the total correctness of the program:$$\frac{[I \land C \land V=z ]\;S\;[I \land V < z]}{[I]\;\mathbf{while}\;C\; \mathbf{do}\;S\;[I\land\lnot C]}.$$ The existence of a variant implies that a while loop terminates. It may seem surprising that the converse is true, as long as we assume the Axiom of Choice: every while loop that terminates (given its invariant) has a variant. To prove this, we need the following theorem:
Theorem. Assuming the Axiom of Choice, a partially ordered set does not possess any infinite descending chains if and only if the corresponding strict order is well-founded.
Assume that the while loop terminates with invariant $I$ and the total correctness assertion $[C\land I]\;S\;[I]$ holds. Consider a "successor" relation $>$ on the state space $\Sigma$ induced by the execution of statement $S$, such that $s>s'$ iff (i) $s$ satisfies $C\land I$, and (ii) $s'$ is the state reached from state $s$ after executing statement $S$. Note that $s\neq s'$, for otherwise the loop would not terminate. Next, we define the "descendant" relation, denoted by $\ge$, as the reflexive and transitive closure of the successor relation. That is, $s\ge s'$ if either $s=s'$, or there exist states $s_1, ..., s_n$ such that $s>s_1>...>s_n>s'$. Note that $\ge$ is antisymmetric, since $s\ge s'$ and $s'\ge s$ implies $s=s'$ under the termination assumption. Therefore, $(\Sigma,\ge)$ is a partially ordered set. Observe that each state has only finitely many distinct descendants, and thus there is no infinite descending chain. Assuming the Axiom of Choice, it follows from above theorem that the successor relation we defined for the loop is well-founded on the state space, since it is strict (irreflexive) and contained in the $\ge$ relation. Thus the identity function on this state space is a variant for the while loop, as we have shown that the state must strictly decrease (to its successor) each time the body $S$ is executed given the invariant $I$ and the condition $C$.

A fixed-point characterization

In predicate transformers semantics, invariant and variant are built by mimicking the Kleene fixed-point theorem. Below, this construction is sketched in set theory. Let $\Sigma$ denote the state space as before. We first define a family $\{A_k\}_{k\ge1}$ of subsets of $\Sigma$ by induction over natural number $k$. Informally, $A_k$ represents the set of initial states that makes the postcondition $P$ satisfied after less than $k$ iterations of the loop: \begin{array}{rcl} A_0 & = & \emptyset \\ A_{k+1} & = & \left\{\ y \in \Sigma\ | \ ((C \Rightarrow wp(S, x \in A_k)) \wedge (\neg C \Rightarrow P))[x \leftarrow y]\ \right\} \\ \end{array}We can then define invariant $I$ as the predicate $\exists k. x \in A_k$ and variant $V$ as the identity function on the state space with a well-founded partial order $<$ such that for $y,z\in\Sigma$, $y<z$ iff $\exists i.\forall j.(y \in A_i \wedge (z \in A_j \Rightarrow i < j))$.

While the theory looks elegant, such an abstract construction cannot be handled efficiently by theorem provers in practice. Hence, loop invariants and variants are often provided by human users, or are inferred by some abstract interpretation procedure.

Sunday, March 9, 2014

Algorithm: Linked List Problems on Leetcode

Copy List with Random Pointer

Problem A linked list is given such that each node contains an additional random pointer which could point to any node in the list or null. Return a deep copy of the list.

Solution It is straightforward to use a hash table to record the new node for each existing node. The new list can then be established from the table in linear time (code). The drawback of a hash table is the space overhead. Here we offer another solution that takes $O(n)$ time and $O(1)$ auxiliary space (code). The idea is as follows:
Step 1: Create a new node for each existing node and join them together, eg: A->B->C will become A->A'->B->B'->C->C', so that n' is a copy of n except that the random pointer in n' is not set.
Step 2: Copy the random links to the new nodes: for each node n,
n'.random = n.random.next. Step 3: Detach the new nodes from the list: for each node n,
n.next = n.next.next; n'.next = n'.next.next.

Merge k Sorted Lists

Problem Merge $k$ sorted linked lists and return it as one sorted list.

Solution Denote the size of the $i$th list by $n_i$. Merging two lists with sizes $n_i$ and $n_j$ takes $\Theta(n_i+n_j)$ time. A naive approach to merge $k$ sorted lists would be merging them one by one. The time complexity would be $$\Theta(n_1+n_2) + \Theta(n_1+n_2+n_3) + ... + \Theta(n_1+...+n_k)=\Theta(kn_1+(k-1)n_2+...+n_k).$$

(To be continued)

Tuesday, March 4, 2014

Node.js: Reading a Large Text File Line by Line

We all know the famous slogan of Perl: "there is more than one way to do it", which encourages a Perl programmer to do things in a creative way using Perl's exuberant syntax. On the other hand, the advocates of Python take a minimalist approach and argue that "there should be one—and preferably only one—obvious way to do it", making a point of programming discipline and code readability. As a beginner to adopt Node.js in my daily projects, I found Node.js blends these two opposite philosophies in a funny manner: there are many ways to do it, but only one or two of them are preferable. However, telling the good one from the others is far from obvious even for veteran programmers.
For example, if you want to read and process a large text file one line at a time, in Perl you can do it like this:

while (<>) { chomp; process_line($_); }

This line of Perl code does almost everything you expect it to do, including

has no limit on file size or number of lines
has no limit on length of lines
can handle full Unicode in UTF-8
can handle *nix, Mac and Windows line endings
does not use any external libraries not included in the core language distribution

Surprisingly, it turns out that there is no simple solution in Node.js fulfilling all of the above requirements. Node.js questions concerning routine jobs of this kind recur in forums like StackOverflow, and the suggested solutions are often either pointed out to be inefficient by other reposters, or far less intuitive than one would've expected for a scripting type language.
As to reading a large text file line by line, there are many third-party modules off the shelf to do this job. However, many of the modules are either out-of-date and abandoned by their developers, or reported to have horrifying bugs such as missing the last line, leaking massive memory, etc. Even when de-facto standard modules are present, they are usually under rapid development without proved reliability.