The Blog Writing Rule
Writing at least one blog entry per month. That should be feasible, no? Unfortunately, last Friday, I was already more than 10 days late, so I had to do something!
And when I pressed the "Post" button, I had this awkward feeling that something was not quite right with my "solution". Not only on the "speed" side, because I knew my list accesses were sloppy, but also on the "correctness" side. But I was past the date, damn it!
Fortunately, someone's watching
Ok, enough justifications. I was in fact very pleased to get so many insightful comments, both on my blog and on reddit. It's a pleasure to see that so many people care about The Truth on the Internet ( ;-) )
So I did my best to review everything and to draw the maximum number of lessons from these errors. I'll start with those, then I'll show the revised version of the algorithm.
Lessons learned
- Check your definitions. The median definition is very straightforward... if you have an odd number of elements. For an even number, you can come up with several definitions, such as that one. For the code below, I chose to define the median as a pair of values, holding both "middle" elements when the list has an even size. These pairs are then carried through all the computations, so that it's always possible to take the mean of the two values at the end if you feel like it
- Don't trust test generation. I promise my tests were green when I posted! However, the generation parameters were very bad: the number of tests was too small, and so were the number and range of elements in the lists. The solution here was tested on larger samples (up to 8000 tests), with larger lists (of equal or different sizes); see the sketch after this list
- Check your invariants. In this kind of "Divide and Conquer" approach, I should make sure that, after dividing, I am indeed left with the same problem on a smaller scale. The code below shows that it is possible to remove the same number of elements from each list so that the median of the resulting lists is unchanged
- Know your data structures. I knew that the most basic list was not necessarily efficient, but I must admit that I didn't check anything. And, in fact, when Jorge pointed out that List has O(n) access for its size, I was quite surprised. The other interesting thing is that optimizing an operation can be totally feasible yet quite tricky to implement (see the medianOfSortedListPlusOneElement function below). And of course, takeWhile(_) was pretty bad!
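To make the second lesson concrete, here is a minimal sketch of how the test generation could be beefed up, assuming a recent ScalaCheck (the generator bounds are arbitrary, and the naive "sort everything" oracle is my own choice; median and ExtendedSeq are defined below):

import org.scalacheck.{Gen, Prop, Test}

// reasonably large sorted lists of ints, drawn from a wide range
val sortedList: Gen[List[Int]] =
  Gen.nonEmptyListOf(Gen.chooseNum(-1000000, 1000000)).map(_.sorted)

// the clever median must agree with the obvious one
val medianProp = Prop.forAll(sortedList, sortedList) { (l1, l2) =>
  median(l1, l2) == ExtendedSeq((l1 ++ l2).sorted).median
}

// run far more tests than the small defaults
Test.check(Test.Parameters.default.withMinSuccessfulTests(8000), medianProp)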
The algorithm
First of all, let's start with the definition of the median as a pair of values, which are the same if the list has an odd number of elements:
case class ExtendedSeq(list: Seq[Int]) {
  // the median as a pair: both values are equal when the list size is odd
  def median: (Int, Int) = {
    if (isOdd)
      (list(list.size / 2), list(list.size / 2))
    else
      (list(list.size / 2 - 1), list(list.size / 2))
  }
  def isOdd = list.size % 2 != 0
}
If element access and size are O(1), as in the RandomAccessSeq collection, that should be ok in terms of complexity (thanks Jorge).
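Note that the snippets in this post rely on implicit conversions to add these methods to Seq and to pairs; something along these lines needs to be in scope (ExtendedPair is defined further down):

implicit def toExtendedSeq(list: Seq[Int]): ExtendedSeq = ExtendedSeq(list)
implicit def toExtendedPair(pair: (Int, Int)): ExtendedPair = ExtendedPair(pair)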
Here is the heart of the algorithm:
def median(l1: Seq[Int], l2: Seq[Int]): (Int, Int) = {
  if (l1.isEmpty) l2.median
  else if (l2.isEmpty) l1.median
  else if (l1.size == 1) medianOfSortedListPlusOneElement(l2, l1(0))
  else if (l2.size == 1) medianOfSortedListPlusOneElement(l1, l2(0))
  else {
    val (m1, m2) = (l1.median, l2.median)
    m1.intersection(m2) match {
      case Some(m) => m
      case None =>
        // no overlap: one median pair is entirely below the other
        if (m1._2 < m2._1) { val (t1, t2) = trimLists(l1, l2); median(t1, t2) }
        else               { val (t1, t2) = trimLists(l2, l1); median(t1, t2) }
    }
  }
}
The code above starts by checking some special cases: an empty list, then a list plus one element; otherwise it works with two "regular" median values.
The first observation is that, if the median values (which are pairs of values) have a non-empty intersection, this intersection will be the median of both sorted lists.
Here is the not-so-readable code checking whether there is an overlap between 2 pairs:
case class ExtendedPair(pair1: (Int, Int)) {
  def intersection(pair2: (Int, Int)): Option[(Int, Int)] = {
    if (pair2._1 <= pair1._1 && pair1._2 <= pair2._2) Some(pair1)      // pair1 inside pair2
    else if (pair1._1 <= pair2._1 && pair2._2 <= pair1._2) Some(pair2) // pair2 inside pair1
    else if (pair1._1 <= pair2._1 && pair2._1 <= pair1._2 && pair1._2 <= pair2._2)
      Some((pair2._1, pair1._2))                                       // pair1 overlaps pair2 on the left
    else if (pair2._1 <= pair1._1 && pair1._1 <= pair2._2 && pair2._2 <= pair1._2)
      Some((pair1._1, pair2._2))                                       // pair2 overlaps pair1 on the left
    else None                                                          // disjoint
  }
}
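A quick sanity check of the overlap observation, on a small example of my own:

// l1's median pair is (3, 5), l2's is (4, 6)
val l1 = List(1, 3, 5, 7)
val l2 = List(2, 4, 6, 8)
// their intersection is (4, 5), and the merged list
// (1, 2, 3, 4, 5, 6, 7, 8) indeed has (4, 5) as its median pair
median(l1, l2) // (4, 5)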
Finally, if there is no overlap, we need to remove elements from both lists so that the median of the two remaining lists is the same as the median of the original lists:
def trimLists(l1: Seq[Int], l2: Seq[Int]) = {
  // l1 holds the lower median, l2 the higher one: drop the same number of
  // elements from the bottom of l1 and from the top of l2
  val removableElements = if (l1.size < l2.size) l1.size / 2 else l2.size / 2
  (l1.drop(removableElements), l2.take(l2.size - removableElements))
}
This time, I removed the same number of elements below the lower median and above the higher one (but still, a written mathematical proof would be more satisfying than "it's obvious that...").
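For instance (a toy example of my own), with List(1, 2, 3) and List(10, 11, 12, 13), the median pairs (2, 2) and (11, 12) don't overlap, and one element can be trimmed from each list:

// removableElements = min(3, 4) / 2 = 1
trimLists(List(1, 2, 3), List(10, 11, 12, 13))
// -> (List(2, 3), List(10, 11, 12))
// the merged list goes from (1, 2, 3, 10, 11, 12, 13), median 10,
// to (2, 3, 10, 11, 12), median 10: unchanged, as required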
On the complexity side, I still have to check that I can find a data structure (a Seq) providing drop and take operations in O(1). From what I read of the implementation of RandomAccessSeq, that should be the case: RandomAccessSeq uses indexes to provide O(1) "slicing" operations, as suggested by Mikko in the comments on the previous post.
Brute force
And finally, I can't resist showing you the horror of finding the median of a sorted list plus one extra element in O(1). There may be a subtle trick to do it (please tell me if you find one!), but I used brute force:
def medianOfSortedListPlusOneElement(list: Seq[Int], element: Int) = {
  if (list.size == 1) {
    // 2 elements in total: the median pair is both of them, in order
    if (element >= list(0)) (list(0), element)
    else (element, list(0))
  }
  else if (list.isOdd) {
    // the combined list has an even size: the median is a pair of 2 values
    if (element >= list(list.size / 2)) {
      if (element <= list(list.size / 2 + 1))
        (list(list.size / 2), element)
      else
        (list(list.size / 2), list(list.size / 2 + 1))
    } else {
      if (element >= list(list.size / 2 - 1))
        (element, list(list.size / 2))
      else
        (list(list.size / 2 - 1), list(list.size / 2))
    }
  } else {
    // the combined list has an odd size: a single median value, duplicated
    if (element >= list(list.size / 2 - 1)) {
      if (element <= list(list.size / 2))
        (element, element)
      else
        (list(list.size / 2), list(list.size / 2))
    } else {
      (list(list.size / 2 - 1), list(list.size / 2 - 1))
    }
  }
}
Not pretty, huh? Like the final proof of the four-color theorem.
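A couple of spot checks (values picked by me) to show what it computes:

medianOfSortedListPlusOneElement(List(1, 3, 5), 4)
// -> (3, 4): the merged list is (1, 3, 4, 5)
medianOfSortedListPlusOneElement(List(1, 2, 6, 7), 0)
// -> (2, 2): the merged list is (0, 1, 2, 6, 7)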
Measuring the complexity
While preparing this post, I wanted to implement something that would store the test timings and say whether the curve looks like O(log(n)), O(n*log(n)),... In fact, it doesn't seem so easy to do without human validation and without tweaking the test data size until the complexity really shows. I'm thinking that a utility to graph the performance and compare it with known complexity curves would be a good first step.
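As a very first step in that direction, here is the kind of quick-and-dirty sketch I have in mind: time the algorithm on doubling input sizes and eyeball how the numbers grow (Vector stands in here for an indexed Seq; the sizes and the crude timing are arbitrary choices):

// time a block in nanoseconds (no JVM warm-up, so take the numbers with a grain of salt)
def time[A](block: => A): Long = {
  val start = System.nanoTime()
  block
  System.nanoTime() - start
}

for (n <- Seq(1000, 2000, 4000, 8000, 16000)) {
  val l1 = Vector.range(0, n)     // already sorted
  val l2 = Vector.range(n, 2 * n)
  // O(log(n)) times should grow very slowly; O(n) times should roughly double
  println(n + " -> " + time(median(l1, l2)) + " ns")
}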
But maybe driving the test generation so as to get good confidence in the probable complexity of the algorithm would be even better. I'll try to have a look at that... Maybe for my next post?
Thanks for all the comments, and I hope I didn't pile even more silliness on top of my previous errors.
PS: I very much liked Tom's comments about the 2-keys indexing problem. Indeed, the problem was underspecified in terms of access frequency, memory, etc. And a log(n) solution would certainly be ok in my specific case.