Ericpony's Blog: Scala: Mapping a set removes duplicates

Friday, May 23, 2014

Scala: Mapping a set removes duplicates

Consider the following code:

def sumSizes(collection: Iterable[List[_]]): Int = collection.map(_.size).sum

The function shall count the accumulated size of all the contained Lists. The problem is that if collection is a Set, then after mapping the objects to their lengths, the result is still a Set and all Lists of the same size are reduced to a single representative. For example, sumSizes(List(List(1,2), List(3,4))) outputs 4 but sumSizes(Set(List(1,2), List(3,4))) mistakenly outputs 2. To avoid such pitfalls one may write the method as

/* version 1 */
def sumSizes(collection: Iterable[List[_]]): Int = collection.toSeq.map(_.size).sum
/* version 2 */
def sumSizes(collection: Iterable[List[_]]): Int = collection.foldLeft(0)(_+_.size)

The first version requires an unnatural discipline of "always using toSeq before mapping" in programming, and worse, it introduces an overhead of allocating memory and copying elements which is seldomly necessary. The second version doesn't causes overheads, but it is asymmetric and thus less elegant. It is even potentially less efficient in a context where the underlying collection is parallel, since sum, which is implemented as fold(0)(_+_) in the Scala library, can be optimized better than foldLeft or foldRight in such a context.

Accidental removal of duplicates is often harder to spot in real code, especially when you write utility code which simply takes a Traversable and/or use the output of methods which are declared to return a Traversable, which may bring surprising results when the implementation happens to be a Set. Imagine the following situation:

val edgeNum = graph.nodes.map(_.edges).map(_.size).sum / 2

You fetch a collection of Node objects from a graph, use them to get their adjacent edges and sum over them. This works if graph.nodes returns a Seq, and it ruins if someone decides to make Graph return its nodes as a Set. A way to defend against surprises is to explicitly label types. For example, annotating methods with return types, even if it's not required. The annotation serves as a caution for the programmer if he wants to change the type of graph.nodes from Seq to Set.

All in all, the best practice to prevent these types of errors is good unit test coverage with representative data, together with sensible API's coupled with knowledge of how the scala compiler maintains source types through map/for generators etc. If you are returning a Set of something you should make that obvious, as returning a Collection/Traversable is hiding a relevant implementation detail and that sometimes causes unexpected problems.

Ericpony's Blog

Friday, May 23, 2014

Scala: Mapping a set removes duplicates

No comments:

Post a Comment