Comments on A++ [Eric Torreborre's Blog]: Algorithmic panic

Hi all, I've posted a follow-up to that post which...

2008-05-09T11:02:00.000+09:00

Hi all,

I've posted a follow-up to that post which should be both more correct and faster.

Thanks again for your comments.

Eric.

Others have already pointed out complexity issues ...

2008-05-07T05:59:00.000+09:00

Others have already pointed out complexity issues with your median algo, and I think they're right. Two more notes: 1) the takeBetween implementation as given is O(n), whether the list is a linked one or supports O(1) random access. Using bisection in takeBetween would make that part O(log n). Someone who knows better can tell what the overall time complexity would be then. O((log n)^2) or something better?

2) A naive implementation using arrays would still suck. To really get to O(log n), one must not allocate any new arrays when discarding elements. Instead it's better to keep the input arrays intact and just play with indices.

However, I don't think the complexity matters that much as the answer is wrong anyway :)
Take e.g. list1 = [1,2,3,4,5,6,7] and list2 = [0,8,9]. The correct answer would be 4.5, or 4 if one cheats a bit. Your code gives result 7, if I read it correctly and Scala starts list indexing at zero.

The problem is that "the median of both lists is something like the median of the 2 sublists of all elements between median1 and median2" is true only if the two sublists have equal lengths. The second commenter's solution seems to be closer to right, and the one linked by "faden" is even more right: when discarding (about) half of the list items on each step, one must be careful to have exactly half of the discarded values below the median, half above. Otherwise the median of the remaining values is not the same as the median of all values.
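The index-only idea can be sketched as follows. This is a minimal Python sketch (not the Scala code from the post), and it swaps in the usual partition-based binary search rather than the exact discarding scheme described above; the name `median_sorted` is made up for illustration. It never copies or slices the inputs, and it assumes both lists are already sorted.

```python
def median_sorted(a, b):
    """Median of two sorted lists, using only index arithmetic (no copies)."""
    if len(a) > len(b):
        a, b = b, a              # binary-search over the shorter list
    m, n = len(a), len(b)
    if n == 0:
        raise ValueError("both lists are empty")
    half = (m + n + 1) // 2      # size of the combined lower half
    lo, hi = 0, m
    while lo <= hi:
        i = (lo + hi) // 2       # how many elements of a go to the lower half
        j = half - i             # how many elements of b go to the lower half
        a_left = a[i - 1] if i > 0 else float("-inf")
        a_right = a[i] if i < m else float("inf")
        b_left = b[j - 1] if j > 0 else float("-inf")
        b_right = b[j] if j < n else float("inf")
        if a_left > b_right:
            hi = i - 1           # took too many from a
        elif b_left > a_right:
            lo = i + 1           # took too few from a
        else:
            left_max = max(a_left, b_left)
            if (m + n) % 2 == 1:
                return left_max
            return (left_max + min(a_right, b_right)) / 2
    raise ValueError("inputs must be sorted")


print(median_sorted([1, 2, 3, 4, 5, 6, 7], [0, 8, 9]))  # 4.5
```

The split always keeps exactly half of the combined elements on each side, which is the property the unequal-length example above breaks.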

Finally, if the "object" in the two keys problem i...

2008-05-04T00:35:00.000+09:00

Finally, if the "object" in the two keys problem is a boolean, you could use two Bloom filters (one for each key) and do it in constant time and space(!), if you don't mind a very small number of false positives. In some cases it's quite possible to prove there can never be a false positive on a Bloom filter of large enough size.
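For readers who haven't met the structure, here is a minimal Bloom filter sketch in Python. The class name, the bit count and the number of hash functions are arbitrary choices for illustration, and deriving the hash positions from seeded SHA-256 is an assumption of this sketch, not something the comment specifies.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit set; membership tests may give false positives,
    but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


key1_filter = BloomFilter()
key1_filter.add("some-key1")
print("some-key1" in key1_filter)  # True: added items are always found
```

For the two keys problem one such filter would be kept per key. Lookups and inserts cost a fixed number of hash computations, and the memory is fixed up front regardless of how many items are added.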

Minor editing error: there is no trie involved, y...

2008-05-04T00:27:00.000+09:00

Minor editing error: there is no trie involved, you simply sort the items in a list by key1, then key2 (and of course have the left and right pointers to keep the elements sorted by key2).

You can find all items with key1 or key1 and key2 with a binary search. You can find all items with key2 using the inner tree. In both cases it's log n.
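A rough Python sketch of that layout, under a few assumptions: the class and method names are invented for illustration, and the key2 ordering is kept as a sorted array of row positions instead of left/right tree pointers, which gives the same O(log n) lookups on a table that is built once and then only read.

```python
from bisect import bisect_left, bisect_right


class TwoKeyTable:
    def __init__(self, entries):
        # entries: iterable of (key1, key2, value); stored sorted by key1, then key2
        self.rows = sorted(entries, key=lambda e: (e[0], e[1]))
        self.keys12 = [(k1, k2) for k1, k2, _ in self.rows]
        self.keys1 = [k1 for k1, _, _ in self.rows]
        # Secondary index: row positions ordered by key2.
        self.by_key2 = sorted(range(len(self.rows)), key=lambda i: self.rows[i][1])
        self.keys2 = [self.rows[i][1] for i in self.by_key2]

    def get(self, k1, k2):
        """Value stored under (k1, k2), or None; one binary search."""
        i = bisect_left(self.keys12, (k1, k2))
        if i < len(self.rows) and self.keys12[i] == (k1, k2):
            return self.rows[i][2]
        return None

    def with_key1(self, k1):
        """All values whose first key is k1 (a contiguous run of rows)."""
        lo, hi = bisect_left(self.keys1, k1), bisect_right(self.keys1, k1)
        return [row[2] for row in self.rows[lo:hi]]

    def with_key2(self, k2):
        """All values whose second key is k2, found via the secondary index."""
        lo, hi = bisect_left(self.keys2, k2), bisect_right(self.keys2, k2)
        return [self.rows[i][2] for i in self.by_key2[lo:hi]]


table = TwoKeyTable([("a", 2, "v2"), ("b", 1, "v3"), ("a", 1, "v1")])
print(table.get("a", 2), table.with_key1("a"), table.with_key2(1))
```

Each lookup is a binary search plus the cost of returning the matches. Inserts after construction would need a different structure, which is why the access pattern matters.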

Regarding the "two keys" problem, you haven't give...

2008-05-04T00:18:00.000+09:00

Regarding the "two keys" problem, you haven't given us enough information to identify the best solution.

The main missing piece is the access pattern: when are items inserted into your structure and how are they accessed?

Seems like there are three possible patterns:

1. the structure is created early and then accessed a very large number of times.

2. at runtime, instances of the structure are created and then accessed.

3. reads and writes are interleaved.

There are other important questions like, "How expensive is memory?" and "How large are the underlying objects being stored?" and "How many objects are there?" and "How important is execution time?"

If the objects stored in the table are small and there are a lot of them, you're going to have significant overhead with the obvious "three hash map" solution, at least 32 and probably 64 bytes overhead per object, which is harsh if the objects are for example integers.

If you're willing to drop from O(1) to O(log n) performance (I think of log n as "almost constant" myself...), store the objects in a large trie, sorted by key1 then key2, and add to each object array offsets sorting the items by key2 in a binary tree.

You can almost certainly make each array offset for the key2 tree 4 bytes, so that's an overhead of only 8 bytes per object (one left and one right offset). Plus you get the advantage of allocating one massive block and filling it linearly, rather than creating a ton of tiny memory requests as you go; in fact, you could easily put such a table in ROM if you were writing for an embedded device (making it essentially "free").

Hello, Strangely I was also puzzled about this prob...

2008-05-03T18:25:00.000+09:00

Hello,

Strangely, I was also puzzled by this problem lately and had to come up with something:

http://batiste.dosimple.ch/blog/2008-04-25-1/

Wow, I didn't think I would get that many comments...

2008-05-03T10:03:00.000+09:00

Wow, I didn't think I would get that many comments so soon! My next step on this was to review and instrument the code to check that its complexity is really what I expected.

I will do that and post a follow-up. Thanks all!

I have to agree with Sam. For example map = { ...

2008-05-03T09:00:00.000+09:00

I have to agree with Sam.

For example

map = {
  key1 => {
    key2 => 'value1',
    key3 => 'value2',
  },
  key2 => {
    key1 => 'value1',
  },
  key3 => { key1 => 'value2' },
}

Then, value of (key1, key2) is
map{key1}{key2}

The value of key1 is
values map{key1}

etc.

Sorry for the perl-ish syntax.
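The same two-level map as a small runnable Python sketch. The class name `PairMap` and its methods are made up for illustration; it assumes, as in the example above, that each value is stored under both orderings of its key pair.

```python
class PairMap:
    """Two-level dict: the outer key is either key of a pair, the inner key is the other."""

    def __init__(self):
        self.table = {}

    def put(self, k1, k2, value):
        # Store the value in both directions so either key can be looked up first.
        self.table.setdefault(k1, {})[k2] = value
        self.table.setdefault(k2, {})[k1] = value

    def get(self, k1, k2):
        """Value for the pair (k1, k2), or None if absent."""
        return self.table.get(k1, {}).get(k2)

    def values_for(self, key):
        """All values attached to a single key."""
        return list(self.table.get(key, {}).values())


m = PairMap()
m.put("key1", "key2", "value1")
m.put("key1", "key3", "value2")
print(m.get("key1", "key2"))   # value1
print(m.values_for("key1"))    # ['value1', 'value2']
```

Every value is referenced twice, which is the memory trade-off for getting expected O(1) lookups by pair or by single key.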

I don't think operations on Lists have the run-tim...

2008-05-03T08:42:00.000+09:00

I don't think operations on Lists have the run-time properties you think they do.

For example, ExtendedList.median will run in O(N) time, since List.apply and List.size each run in O(N) time.

What you probably want is a data structure that implements RandomAccessSeq (like Array), which guarantees O(1) element access and O(1) length computation.
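To see where the O(N) comes from, here is a toy singly linked list in Python. It is only an analogy for Scala's List, and all names here are illustrative: both `size` and `apply` have to walk the nodes, so a median computed through them is linear even before any real work happens.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def from_iterable(values):
    """Build a linked list with the same order as values."""
    head = None
    for v in reversed(list(values)):
        head = Node(v, head)
    return head


def size(head):
    """Counts by visiting every node: O(n)."""
    n = 0
    while head is not None:
        n += 1
        head = head.next
    return n


def apply(head, index):
    """Follows index links from the head: O(index)."""
    for _ in range(index):
        head = head.next
    return head.value


def naive_median(head):
    """Middle element of a sorted list: one full walk plus half a walk."""
    return apply(head, size(head) // 2)


print(naive_median(from_iterable([1, 2, 3, 4, 5])))  # 3
```

With an array-backed sequence both operations become constant time, which is what the O(log n) analysis needs.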

Your problem is a graph problem. She should make ...

2008-05-03T07:04:00.000+09:00

Your problem is a graph problem. She should make a graph data structure and store the key as a value attached to edges. Then, (u,v) is an edge of the graph with some value, and edges(u) is all the things attached to u and the same for v.

Graphs are easily implemented as two-level hash tables.

Close enough, but not quite. The key idea here is ...

2008-05-03T06:15:00.000+09:00

Close enough, but not quite. The key idea here is that it's safe to discard half of the elements of each list, thus cutting the total number of elements in half each step.

To get all the elements between (p1, p2), drop the n/2 elements smaller than p1 from the list with the smaller median. Then drop the higher m/2 elements from the other list. Then repeat the process.

I don't see how this algorithm could find medians ...

2008-05-03T03:23:00.000+09:00

I don't see how this algorithm could find medians in O(log n).
takeWhile(_ <= b) alone takes n/2 steps and is therefore O(n).

Cutting your problem size in half in each step does not give you O(log n), unless each step is O(1).
With each step being O(n), you should get O(n*log n) (like mergesort), although I would have to analyze your algorithm more carefully and consult the Master theorem to be sure.
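The two candidate recurrences can be compared with a small counting sketch in Python. It assumes each step makes one linear pass over whatever is left and then discards half of it; the function names are made up, and the real cost of the posted algorithm depends on details not modeled here. Under that assumption the total is n + n/2 + n/4 + ..., which stays below 2n, while the n log n figure belongs to the mergesort recurrence, where both halves are processed.

```python
def halving_cost(n):
    """Total work for T(n) = T(n/2) + n: one linear pass, then keep one half."""
    total = 0
    while n >= 1:
        total += n      # linear pass over the remaining elements, e.g. takeWhile
        n //= 2         # discard half before the next step
    return total


def mergesort_cost(n):
    """Total work for T(n) = 2*T(n/2) + n: both halves are recursed into."""
    if n <= 1:
        return 0
    return 2 * mergesort_cost(n // 2) + n


print(halving_cost(1024))    # 2047, below 2 * 1024
print(mergesort_cost(1024))  # 10240, which is 1024 * log2(1024)
```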