02 May 2008

Algorithmic panic

Last week I read a blog post about an interview for Google and I thought: "Oh my god, I don't know how to do that, I'm lost, I have no idea, I will never find the solution".


Two weeks ago, I've ordered and started reading the excellent book, "The Algorithm Design Manual". That book is very good. Teaching algorithms and data structures can be something very dry when you consider it from a very scholar perspective. But it becomes really fun and challenging when you realize that:
  • it solves real worthy problems
  • no amount of processing power can be overcome a clever algorithm when there is one
  • there's often no "right" solution but a combination of different approaches with different trade-offs
So the book is filled with "war stories" showing the author investigating some problems, trying to ask relevant questions until there's a "ah-ah" moment where the problem is sufficiently characterized. Then the solution is refined to get a completely satisfying result.

Decode this!

For example, the author and his team had to provide a new way to decrypt DNA fragments, knowing that there was a new technique using small probes returning small fragments of the whole string. Say your DNA is ATGCCTCGATTG, the probes will find AT, TG, GA, TT, TG, all of which are substrings of the initial string. And from all those pieces, you have to find the original DNA string. For this kind of problem, the sheer number of elements kills instantly any brutal-force approach. Originally they had to sequence DNA fragments of 50.000 characters, knowing that more that 1.5 million pieces should be combined to get the answer. Don't try this at home!

Binary tree > HashTable > Suffix Tree

For their algorithm to work they had to set up a dictionary allowing the search of substrings of length k inside strings of length 2k. The question is: how to do that very efficiently? A student proposed a HashTable which would do the job in O(log(k)). But this was still too slow! Hey, what can be faster than a HashTable??!! A Suffix Tree.

In this specific case they were searching elements which looked very close to one another, just being different by one character. And a suffix tree happens to organize the suffixes of a set of strings so that it's easy to search for a string which is just one character away.

Understanding the structure of the problem yielded much better results than a HashTable: for a 8000 characters long DNA, the HashTable time was 2 days against 650 seconds for the SuffixTree (the "compressed" version, because the original one was blowing up the memory!). That was my "ah-ah" moment.

The dreadful interview question

So I thought I was ready for any algorithmic question. Until I read the blog: "find the median of 2 sorted lists in O(log(n))". I don't know why, reading that blog, I imagined myself over the phone, trying to solve this problem and my mind got totally blank. Panic.

The same night, going to bed, I told myself: ok, relax, have a fresh look at it. And I slept happily a few minutes later because the principle of the solution wasn't that difficult.

First of all, the median of a sorted list is the middle element. That's an O(1) search! Then, given the medians for each list, the median of both lists is something like the median of the 2 sublists of all elements between median1 and median2. That's where we get our O(log(n)) from, because with a recursive search, we're pretty sure to cut the problem size by 2 at each step.

I programmed the rest the day after. Again, I think that this is a tribute to Scala. The result is really readable and I used ScalaCheck to verify it by generating lists of same and different sizes. On the other hand, I had hard time figuring out the proper bound conditions (ScalaCheck was very helpful here):

trait Median {
def median(l1: List[Int], l2:List[Int]): Int = {
if (l1.isEmpty || l2.isEmpty || l1.last <= l2.head) (l1 ::: l2).median
else if (l2.last < l1.head) (l2 ::: l1).median
else {
val (m1, m2) = (l1.median, l2.median)
if (m2 < m1) median(l1.takeBetween(m2, m1), l2.takeBetween(m2, m1))
else if (m2 > m1) median(l1.takeBetween(m1, m2), l2.takeBetween(m1, m2))
else m1
implicit def toExtendedList(list: List[Int]) = ExtendedList(list)
case class ExtendedList(list: List[Int]) {
def takeBetween(a: Int, b: Int) = list.dropWhile(_ <= a).takeWhile(_ <= b)
def median = list(list.size / 2)

Data structures everywhere

Funnily, yesterday, one of my colleagues asked me if I knew an easy, fast and clever way to access some values classified with 2 keys:
  • there's only one value for (key1, key2)
  • she wants to get the value for (key1, key2) (fast,...)
  • she wants to get all values for key1 (fast again,...)
  • she wants to get all values for key2 (fast also,...)
This looks like a very classical problem for Java programmers but surprisingly I found no existing, open-source, solution for this (if you know that, please tell me. I looked at things like TreeMaps or MultiMaps for example but they don't fit the requirement of a fast 2 key search with either of the keys).

After some research and some thinking we concluded that 3 hashmaps sharing the values and dedicated to answer each different query fast would be the best thing to do. But now that I know that HashTables are not the ultimate solution to everything,...