
02 May 2008

Algorithmic panic

Last week I read a blog post about a Google interview and thought: "Oh my god, I don't know how to do that. I'm lost, I have no idea, I will never find the solution."

Flashback

Two weeks ago, I ordered and started reading the excellent book "The Algorithm Design Manual". Algorithms and data structures can be a very dry subject when you approach it from a purely academic perspective. But it becomes really fun and challenging when you realize that:
  • it solves real, worthwhile problems
  • no amount of raw processing power can compete with a clever algorithm, when one exists
  • there's often no "right" solution but a combination of different approaches with different trade-offs
So the book is filled with "war stories" showing the author investigating problems, asking the relevant questions until there's an "aha" moment where the problem is sufficiently well characterized. The solution is then refined until the result is completely satisfying.

Decode this!

For example, the author and his team had to provide a new way to sequence DNA fragments, using a new technique where small probes return short fragments of the whole string. Say your DNA is ATGCCTCGATTG: the probes will find AT, TG, GA, TT, TG, and so on, all of which are substrings of the initial string. From all those pieces, you have to reconstruct the original DNA string. For this kind of problem, the sheer number of elements instantly kills any brute-force approach: originally they had to sequence DNA fragments of 50,000 characters, knowing that more than 1.5 million pieces had to be combined to get the answer. Don't try this at home!
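
To make the setting concrete, here is a toy Scala snippet (mine, not the book's) that produces the length-2 fragments the probes would report for the small example above; the real problem is of course the reverse one, reconstructing the string from such fragments:

// Toy illustration only: the length-k substrings a probe pass would report
def probes(dna: String, k: Int): List[String] = dna.sliding(k).toList

probes("ATGCCTCGATTG", 2)
// List(AT, TG, GC, CC, CT, TC, CG, GA, AT, TT, TG)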

Binary tree > HashTable > Suffix Tree

For their algorithm to work they had to set up a dictionary allowing the search of substrings of length k inside strings of length 2k. The question is: how to do that very efficiently? A student proposed a HashTable, which would do the job in O(log(k)). But this was still too slow! Hey, what can be faster than a HashTable?! A Suffix Tree.

In this specific case they were searching for elements which looked very close to one another, differing by just one character. And a suffix tree happens to organize the suffixes of a set of strings so that it's easy to search for a string which is just one character away.

Understanding the structure of the problem yielded much better results than a HashTable: for an 8,000-character DNA string, the HashTable took 2 days against 650 seconds for the SuffixTree (the "compressed" version, because the original one was blowing up the memory!). That was my "aha" moment.

The dreadful interview question

So I thought I was ready for any algorithmic question. Until I read the blog post: "find the median of 2 sorted lists in O(log(n))". I don't know why, but reading it I imagined myself on the phone, trying to solve this problem, and my mind went totally blank. Panic.

That same night, going to bed, I told myself: OK, relax, take a fresh look at it. And a few minutes later I fell asleep happily, because the principle of the solution wasn't that difficult.

First of all, the median of a sorted list is the middle element. That's an O(1) search! Then, given the medians of each list, the median of both lists is something like the median of the 2 sublists of all elements lying between median1 and median2. That's where we get our O(log(n)) from: with a recursive search, we're pretty sure to cut the problem size in half at each step.

I programmed the rest the day after. Again, I think this is a tribute to Scala: the result is really readable, and I used ScalaCheck to verify it by generating lists of equal and different sizes. On the other hand, I had a hard time figuring out the proper boundary conditions (ScalaCheck was very helpful here):

trait Median {
  def median(l1: List[Int], l2: List[Int]): Int = {
    // if one list is empty, or all of l1 comes before l2, take the median of the concatenation
    if (l1.isEmpty || l2.isEmpty || l1.last <= l2.head) (l1 ::: l2).median
    else if (l2.last < l1.head) (l2 ::: l1).median
    else {
      // otherwise recurse on the elements lying between the two medians
      val (m1, m2) = (l1.median, l2.median)
      if (m2 < m1) median(l1.takeBetween(m2, m1), l2.takeBetween(m2, m1))
      else if (m2 > m1) median(l1.takeBetween(m1, m2), l2.takeBetween(m1, m2))
      else m1
    }
  }

  implicit def toExtendedList(list: List[Int]) = ExtendedList(list)

  case class ExtendedList(list: List[Int]) {
    def takeBetween(a: Int, b: Int) = list.dropWhile(_ <= a).takeWhile(_ <= b)
    def median = list(list.size / 2)
  }
}
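
The kind of property one can write for this with ScalaCheck looks roughly like the following sketch (not my exact test): it simply compares the result with the brute-force "sort everything and take the middle element" answer, using the same convention as ExtendedList.median.

import org.scalacheck.Prop.forAll

object MedianCheck extends Median {
  // Brute-force reference: sort the concatenation and take the middle element
  val medianProp = forAll { (a: List[Int], b: List[Int]) =>
    val (l1, l2) = (a.sorted, b.sorted)
    val merged = (l1 ::: l2).sorted
    merged.isEmpty || median(l1, l2) == merged(merged.size / 2)
  }
}

// MedianCheck.medianProp.check()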


Data structures everywhere

Funnily enough, yesterday one of my colleagues asked me if I knew an easy, fast and clever way to access values classified by 2 keys:
  • there's only one value for (key1, key2)
  • she wants to get the value for (key1, key2) (fast,...)
  • she wants to get all values for key1 (fast again,...)
  • she wants to get all values for key2 (fast also,...)
This looks like a very classic problem for Java programmers, but surprisingly I found no existing open-source solution for it (if you know of one, please tell me. I looked at things like TreeMaps or MultiMaps for example, but they don't fit the requirement of a fast search with either of the two keys).

After some research and some thinking, we concluded that the best thing to do would be 3 hashmaps sharing the same values, each dedicated to answering one kind of query fast. But now that I know that HashTables are not the ultimate solution to everything...
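
To fix the idea, here is a minimal Scala sketch of that "3 hashmaps" approach (the names are mine, and it ignores updates and removals):

import scala.collection.mutable

class TwoKeyMap[K1, K2, V] {
  private val byPair = mutable.Map[(K1, K2), V]()
  private val byKey1 = mutable.Map[K1, List[V]]().withDefaultValue(Nil)
  private val byKey2 = mutable.Map[K2, List[V]]().withDefaultValue(Nil)

  // each insertion updates the 3 maps so that every query below is a single hash lookup
  def put(k1: K1, k2: K2, value: V): Unit = {
    byPair((k1, k2)) = value
    byKey1(k1) = value :: byKey1(k1)
    byKey2(k2) = value :: byKey2(k2)
  }

  def get(k1: K1, k2: K2): Option[V] = byPair.get((k1, k2))
  def allForKey1(k1: K1): List[V] = byKey1(k1)
  def allForKey2(k2: K2): List[V] = byKey2(k2)
}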

13 comments:

Waquo said...

I don't see how this algorithm could find medians in O(log n).
takeWhile(_ <= b) alone takes n/2 steps and is therefore O(n).

Cutting your problem size in half in each step does not give you O(log n), unless each step is O(1).
With each step being O(n), you should get O(n*log n) (like mergesort), although I would have to analyze your algorithm more carefully and consult the Master theorem to be sure.

Johan said...

Close enough, but not quite. The key idea here is that it's safe to discard half of the elements of each list, thus cutting the total number of elements in half each step.

To get all the elements between (p1, p2), drop the n/2 elements smaller than p1 from the list containing the smallest median. Then drop the higher m/2 elements from the other list. Then repeat the process.

Unknown said...
This comment has been removed by the author.
Peter said...

Your problem is a graph problem. She should make a graph data structure and store the value attached to the edges. Then, (u,v) is an edge of the graph with some value, and edges(u) is all the things attached to u, and the same for v.

Graphs are easily implemented as two-level hash tables.

Jorge Ortiz said...

I don't think operations on Lists have the run-time properties you think they do.

For example, ExtendedList.median will run in O(N) time, since List.apply and List.size each run in O(N) time.

What you probably want is a data structure that implements RandomAccessSeq (like Array), which guarantees O(1) element access and O(1) length computation.

Johnny said...

I have to agree with Sam.

For example

map = {
  key1 => {
    key2 => 'value1',
    key3 => 'value2',
  },
  key2 => {
    key1 => 'value1',
  },
  key3 => { key1 => 'value2' },
}


Then, value of (key1, key2) is
map{key1}{key2}

The value of key1 is
values map{key1}

etc.

Sorry for the perl-ish syntax.

Eric said...

Wow, I didn't think I would get that many comments so soon! My next step is to review and instrument the code to check that its complexity is really what I expected.

I will do so and post a follow-up. Thanks all!

Unknown said...

Hello,

Strangely enough, I was also puzzled by this problem lately and had to come up with something:

http://batiste.dosimple.ch/blog/2008-04-25-1/

Tom Ritchford said...

Regarding the "two keys" problem, you haven't given us enough information as to be able to identify the best solution.

The main missing piece of information is the access pattern: when are items inserted into your structure and how are they accessed?

Seems like there are three possible patterns:

1. the structure is created early and then accessed a very large number of times.

2. at runtime, instances of the structure are created and then accessed.

3. reads and writes are interleaved.

There are other important questions like, "How expensive is memory?" and "How large are the underlying objects being stored?" and "How many objects are there?" and "How important is execution time?"

If the objects stored in the table are small and there are a lot of them, you're going to have significant overhead with the obvious "three hash map" solution, at least 32 and probably 64 bytes overhead per object, which is harsh if the objects are for example integers.

If you're willing to drop from O(1) to O(log n) performance (I think of log n as "almost constant" myself...), store the objects in a large trie, sorted by key1 then key2, and add to each object array offsets sorting the items by key2 in a binary tree.

You can almost certainly make the array for the key2 tree to be 4 bytes, so that's an overhead of only 8 bytes per object: plus you get the advantage of allocating one massive block and filling it linearly, rather than creating a ton of tiny memory requests as you go; in fact, you could easily put such a table in ROM if you were writing for an embedded device (making it essentially "free").

Tom Ritchford said...

Minor editing error: there is no trie involved, you simply sort the items in a list by key1, then key2 (and of course have the left and right pointers to keep the elements sorted by key2).

You can find all items with key1 or key1 and key2 with a binary search. You can find all items with key2 using the inner tree. In both cases it's log n.

Tom Ritchford said...

Finally, if the "object" in the two keys problem is a boolean, you could use two Bloom filters (one for each key) and do it in constant time and space(!) (if you don't mind a very small number of false positives - and it's quite possible in some cases to prove there can never be a false positive on a Bloom filter of large enough size).

Unknown said...

Others have already pointed out complexity issues with your median algo, and I think they're right. Two more notes: 1) the takeBetween implementation as given is O(n), whether the list is a linked one or supports O(1) random access. Using bisection in takeBetween would make that part O(log n). Someone who knows better can tell what the overall time complexity would be then. O((log n)^2) or something better?

2) a naive implementation using arrays would still suck. To really get to O(log n) one must not allocate any new arrays when discarding elements. Instead it's better to keep the input arrays intact and just play with indices.

However, I don't think the complexity matters that much as the answer is wrong anyway :)
Take e.g. list1 = [1,2,3,4,5,6,7] and list2 = [0,8,9]. The correct answer would be 4.5, or 4 if one cheats a bit. Your code gives result 7, if I read it correctly and Scala starts list indexing at zero.

The problem is that "the median of both lists is something like the median of the 2 sublists of all elements between median1 and median2" is true only if the two sublists have equal lengths. The second commenter's solution seems to be closer to right, and the one linked by "faden" is even more right: when discarding (about) half of the list items on each step, one must be careful to have exactly half of the discarded values below the median, half above. Otherwise the median of the remaining values is not the same as the median of all values.

Eric said...

Hi all,

I've posted a follow-up to that post which should be both more correct and faster.

Thanks again for your comments.

Eric.