
15 October 2008

Commando programming

When you program for a trader you don't have time to write tests
I've been wondering for some time whether the quote above, or a variant like "you don't have time to refactor your code", is really true or not. I generally bug my colleagues when they don't write tests along with their code, and I usually prophesy that they will actually save time by writing tests instead of losing it.

Unfortunately I don't have strong evidence whether this is true or not. The only thing I can say is that almost every time I've tried to take shortcuts in my development work, I've been bitten by very silly bugs and regretted my recklessness!

Anyway, what would you do if you had to code at the speed of light for a drug-addicted trader (and they really need drugs these days,...)? What kind of practices would you adopt to go faster? How would you prepare yourself for those situations?

With those questions in mind, I was happy to be recently involved in a one-week project where the biggest part of my job was to do some "Commando programming" to create the necessary tools to support the project.

In this post, I'll do my own retrospective of that project and try to highlight the points which I think are decisive in the context of "Commando programming". Actually, most of the "Ding!" points below (does that ring a bell?) are applicable to any programming situation. The only difference is that there's no time for preparation in a "Commando" situation. You have to be seriously fit for the job.

This post is too long (that's the Steve Yegge syndrome,...). So here's a summary of the take-away points, so you can get a first impression of whether it's worth reading or not:
Ding! Know your destination
Ding! Be a fast code-reader
Ding! Know at least one scripting language really well
Ding! Know your platform ecosystem
Ding! Don't rewrite anything
Ding! Take careful risks and prototype
Ding! Have at least one high-level functional test
Ding! Test the difficult stuff
Ding! Isolate the file system
Ding! Use the console
Ding! Have lots of sample data
Ding! Time is money, concepts are time
Ding! PPP: Practice, Practice, Practice!!!
Ding! Cool down and document/review your code
Ding! Cool down and take a break


The project: go see your doctor, now!

In my company we've set up a methodology and a set of tools to assess the performance of our customers' deployments, and one of those customers asked us to come over for one week to deploy this so-called "HealthCheck" process. After the initial interviews, we determined that one very essential objective of the project was to reduce the time to save a trade from 1.2 seconds to less than 500 ms. A very clear and quantifiable objective; that's a good start!

Ding! Know your destination

First point here, this may look totally obvious but still it needs to be said: the overall user objective must be very clear for everyone.

This point is important for at least 4 reasons:
  1. It's hard to say that you're going fast if you don't know where you're going!
  2. The fastest development in the world is the one you don't do because you don't need it
  3. It helps a lot, as seen later, to have a clear overall objective when negotiating with yourself or with your customer the next micro-task to develop
  4. Developing fast is not the alpha and omega of productivity. Seeing the project from its overall objective can make you realize that saving other people's time may actually be more important than saving your own
Reading the thermometer

Assessing the performance of our systems involves parsing and analyzing lots of different log files: client requests, server requests, database requests, SQL execution times, workflow times, garbage collection times,... The scripts we started working with were shell scripts doing the following:
  • Reading and managing log files, including archiving old versions
  • Parsing the files and extracting the time spent for a given "quantity" (client request, SQL call,...)
  • Computing small statistics: minimum / average / maximum time
  • Creating graphs in order to be able to spot "spikes" that we will be able to label as "Red flags"
The scripts we had were doing the job, but were pretty slow considering the amount of data we had to process. More than one hour could be spent just running the scripts on a subset of the log files.

Analyzing the scripts, I thought that I could implement something more effective,(...)

Ding! Be a fast code-reader

Commando programming may involve reading a lot of code that's not yours before being able to do anything else. This has been written before: we usually spend far more time reading code than writing it. Reading code faster will necessarily speed you up.

(...) more effective, so I started prototyping something using Ruby. Why Ruby? Because I knew I would be quick at implementing my ideas with it. Or was I?

Ding! Know at least one scripting language really well

No, I wasn't so fast,... I hadn't programmed in Ruby for quite some time and I had forgotten many of the idioms and methods of the core API. My head is so full of Scala nowadays that I would certainly have been faster using Scala for that task. The moral here is that knowing a "scripting" language is very helpful, but you really need to have everything in your mental RAM in order to be effective.

This goes for API knowledge as well as "environmental" knowledge: you should spend no time figuring out how to develop/build/test/deploy your code. Being stuck for 2 hours because you don't know how to set a classpath, for example, is a huge waste of value.

Anyway, I was able to show the effectiveness of using a language like Ruby to speed up the analysis time so I decided to officially port the scripts from shell to another language. But now which language should I select for my commando developments?

Choosing a language is a vast question and the criteria governing that choice may not be entirely related to the intrinsic language features. This choice may also be governed by more "contextual" considerations, like the ability of other developers to learn the language and associated tools.

In my case, Groovy was the winner for 3 reasons:
  1. It has all sorts of features that help write code faster, like closures, literals,...
  2. It can access Java libraries, and I was planning to integrate some database analysis tools we had written in Java
  3. It's one of the closest languages to Java on the JVM, so other developers will be able to pick up the new scripts faster. And it is usually better known than JRuby, Scala or Jython, for example
Ding! Know your platform ecosystem

I also knew, when choosing Groovy, that I would be able to use all sorts of niceties. One of them was the Ant integration which allowed me to easily reuse the exec task (see below). Knowing my way around Ant was a good thing (I know that's not exceptional for a Java developer, it's just an example ;-) ).

Rewrite everything?

That was the most "dangerous" part of this project.

I would honestly have preferred avoiding it. But since I had to replace one small part of the scripts with some new Groovy logic, and since the scripts would need to be extended anyway, I decided to rewrite them all in Groovy.

Ding! Don't rewrite anything

Haha, I just said I did it! Well, not that you can never rewrite code or some library part, but you have to be damn sure about what you're doing. So,...

Ding! Take careful risks and prototype

Doing "Commando programming" will undoubtedly require taking risks and trying out some solutions before going on. More than ever, it needs to be done with lots of care because the last thing you want, is to find yourself in the middle of the road at the end of the week, because you didn't have time to finish implementing your new "Grand Vision", whatever it is.

In my case I mitigated the risks a lot by "wrapping" the existing shell commands in my Groovy scripts. So I was essentially running the same thing, but executed from Groovy! That was really worth it because, in the process, I could also greatly reduce the existing amount of code thanks to the usual Groovy tricks.
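
As a sketch of that wrapping idea (transposed to Scala, the language of most examples on this blog; the pipeline and the log format are invented for the illustration):

import scala.sys.process._

// run the existing shell pipeline unchanged and capture its output:
// same behavior as before, but the surrounding logic now lives in code
// and can be replaced one piece at a time
def runLegacyScript(logFile: String): Seq[String] = {
  val pipeline = "grep 'TRADE_SAVED' " + logFile + " | awk '{ print $NF }'"
  val output = Seq("sh", "-c", pipeline).!!
  output.split('\n').toSeq.filter(_.nonEmpty)
}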

To test or not to test

Now we come to the meat of the discussion! TDD, BDD, unit tests, code-and-see, what should be done? I honestly don't have a definitive answer but here's what I did.

Ding! Have at least one high-level functional test

You need to have at least one high-level functional test guiding you in your developments. Something which you can come back to and say: "this case still doesn't pass, I'm not finished yet". You may actually have implemented more code than just what was necessary to pass that test case but at least this primary use case should be ok.
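
As an illustration, using the specs style that appears later on this blog, such a guiding test could look like the sketch below (the analyze function, the log file and the expected numbers are all invented):

"the analysis scripts" should {
  "compute the min/average/max request times of the reference log file" in {
    // the one end-to-end check acting as the "am I finished yet?" test
    val stats = analyze("data/reference-day.log")
    stats.min must_== 120
    stats.average must_== 450
    stats.max must_== 1200
  }
}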

Ding! Test the difficult stuff

Complex logic is definitely something you want to isolate and test early. If you don't unit test it thoroughly, it will usually haunt you in the most painful way: hidden inside all the rest. Besides, this complex logic usually means "value" for your customer, more than anything else, so it's worth cherishing it.

Ding! Isolate the file system

Actually, you should soon realize that any interaction with the file system is slowing you down.
Being able to mock the file system, or to write to temporary files with automatic clean-up, is a huge time-saver for unit testing. In a "Commando" situation you would come prepared, with all sorts of ready-to-use functions and designs to help with this. In my case, I had to write a lot of it,...
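
For example, here is the kind of "loan pattern" helper worth having ready before the mission starts (a sketch; the names and the log line are mine, not the project's):

import java.io.{File, PrintWriter}

// lend a temporary file, pre-filled with some content, to a test,
// and make sure it is deleted whatever the test does
def withTempFile[T](content: String)(test: File => T): T = {
  val file = File.createTempFile("commando", ".log")
  try {
    val writer = new PrintWriter(file)
    try writer.write(content) finally writer.close()
    test(file)
  } finally {
    file.delete()
  }
}

// usage: withTempFile("15:00:01 TRADE_SAVED 350 ms") { file => ... }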

Ding! Use the console

The Read-Eval-Print Loop. In my developments I found it very useful to test my regular expressions interactively inside the Groovy console. Whether or not I turned those expressions into actual unit tests,... I didn't do it all the time, I must say. One reason is that, once a regular expression was encapsulated in a higher-level concept, like "extractClassName", I had confidence it would be right forever.
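
For instance, here is what such an interactive check could look like in the Scala interpreter (the regular expression and the log line are made up; the Groovy console session I actually used was similar in spirit):

scala> val requestTime = """(\w+) request .* took (\d+) ms""".r
requestTime: scala.util.matching.Regex = (\w+) request .* took (\d+) ms

scala> "save request #42 took 350 ms" match {
     |   case requestTime(kind, ms) => (kind, ms.toInt)
     | }
res0: (String, Int) = (save,350)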

Ding! Have lots of sample data

If possible, try to gather as much sample data as you can and run it through your program. Does it blow up right away? Well, that's actually good: it may be the sign that you forgot a requirement. Instead of spending time refining your requirements until they're exhaustive, or trying to think about all possible cases, run the system with the maximum data available. There are 2 drawbacks to this: one is the time spent re-executing large datasets and analyzing failures, the other is not knowing what to look at! Your program may not break with enough smoke to prove you wrong.

On my project I had lots of log files I could run my scripts against. And fortunately my scripts broke in very unambiguous ways, showing where I was wrong.

To design or not to design

So you're programming like hell, and you don't have time to draw all those nifty UML diagrams or devise the best design patterns for the job. "We'll design when we have time."

But is it really buying you time? I don't think so. I'm not saying that you should draw diagrams or add tons of interfaces and clever indirections.

But you need design in the sense that the problem and solution domains should be very, very clear to you.

Ding! Time is money, concepts are time

So it's worth taking a bit of time to think: "What is this system really like?". Clarifying the concepts is very, very important. For us, it meant first of all giving proper names to our analysis concepts:
  • An entity can have measurable properties ("A trade server has requests")
  • For each property, you can define several quantities ("each request can take some average time")
  • For a given "Test environment", we can run several "Test campaigns"
This helped a lot in our discussions, in structuring the scripts and planning our actions.
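
As a sketch, this vocabulary could be encoded directly (the types below are illustrative, not the project's actual code):

// an Entity has measurable Properties; each Property has Quantities
// computed from the raw measurements found in the logs
case class Quantity(name: String, compute: Seq[Long] => Double)
case class Property(name: String, quantities: Seq[Quantity])
case class Entity(name: String, properties: Seq[Property])

val tradeServer =
  Entity("trade server",
    Seq(Property("requests",
      Seq(Quantity("average time", times => times.sum.toDouble / times.size)))))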

The next step was to realize that the type of queries we wanted to execute to analyze the logging data looked pretty much like SQL queries: "between 15:00 and 15:03, return all update queries with a time > 300 ms and group them by workflow rule". This gave us a good hint that the next step for our scripts would be to put everything in a database!
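
That kind of query translates almost word for word into code over an in-memory model; a small sketch (the Request class and its fields are assumptions made for the example):

// "between 15:00 and 15:03, return all update queries with a time > 300 ms
// and group them by workflow rule" -- a SELECT/WHERE/GROUP BY in disguise
case class Request(time: String, kind: String, durationMs: Int, workflowRule: String)

def slowUpdates(log: Seq[Request]): Map[String, Seq[Request]] =
  log.filter(r => r.time >= "15:00" && r.time <= "15:03")
     .filter(r => r.kind == "update" && r.durationMs > 300)
     .groupBy(_.workflowRule)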

Coding

Consider this:

"I know where I'm going, I have a clear mind about my system and it's concepts, I know how to write minimal and effective tests, is there anything more I could do now to speed up my coding?"

Let's reformulate the sentence above:

"I know I want a gold medal at the Olympics, I know I'll get it by being a table tennis champion and I know the rules of the game, I have a very good racket, is there anything more I could do now to be the best at table tennis?"

Ding! PPP: Practice, Practice, Practice!!!

That's so obvious I'm ashamed to have to write it,... but that makes a real difference for "Commando programming":
  • technique: practice your regular-expression-fu for example, or know how to write a parser for a simple language (I planned to do that for the verbose gc logs, but we had neither the time nor the interest to rewrite this part)
  • libraries: no time should be spent looking at the API docs
  • tools: if you need remote debugging, it should be fast to setup, because you did it thousands of times before
  • computer science: you should have at least a good sense of the complexity of your algorithms
  • integration: how to call Excel from Groovy to create graphics was one of our coding issues
  • system / network tools: that wasn't too necessary on our project, but I can imagine this making a real difference in other settings
Cooling down for a minute

I also noticed two things of equal importance during that week. First of all, when I tried to go fast I couldn't help noticing entropy. Yes, my code was doing the job, but at the same time it wasn't always consistent, documented, properly named or refactored.

Ding! Cool down and document/review your code

This can really look like a pure loss of time. But I observed that it wasn't, for 2 reasons:
  • Actually it doesn't take that long to review/document/refactor the code (try it at home)! Except maybe for the naming part, it's a bit like playing sudoku: you just try to make things fall into place
  • I observed that I was much less dragged down when implementing new features on clear-cut code with obvious intentions; I had much less mental noise. You know, like: "This computeTotal function is actually also sorting the results, so I know I don't have to do it there". If computing and sorting are better separated or named, that can reduce the "mental tax" while reasoning about something else.
The second thing is the equivalent of "cooling down the code",... applied to your body!

Ding! Cool down and take a break

We're not machines. I was so enthusiastic about that project that I almost coded day and night (and I was away from my family,...). But I also observed one very interesting phenomenon: time was slowing down! Sometimes I was just staring at my screen, not realizing that 5 minutes had passed without doing anything substantial. That should be a clear signal for taking a break. This gives me an interesting idea: "stopwatch" programming. Prepare the list of tasks you want to program and program features in 5, 7 or 10 minute cycles. And watch out when you have almost empty cycles.

The doctor says: fit for service

Conclusion for the project: it was very successful (this doesn't happen all the time, so I'm glad to report it). We were able to analyze and spot 3 major bottlenecks and give good recommendations, so that the time to save a trade eventually came down to around 400 ms. The tools served us, but actually our brains served us more.

But what's the biggest room in the world? The room for improvement ;-) ! I think we'll be able to transfer more of that wisdom to smarter analysis scripts in the future.

Back to the original question
When you program for a trader you don't have time to write tests
True? False? In the light of my experience on this project, the first thing I'm really convinced of is the first part of the sentence: "You don't have time"!

And if you don't have time, you can essentially do 2 things:
  • Observe and experiment: whatever saves you time is good. In my case, writing tests is a proven way to go faster, for example
  • Practice, learn, invest: new tools, new languages, new libraries, contests, coding dojos,...
Ok. Thanks for listening, I leave you there, I have to go,... and practice my "Commando-fu"!

27 July 2008

My first ICFP contest, Part 2: parser combinators are so cool

This is the sequel to my last post on participating in the ICFP contest 2008.

The Setup

As Curt wrote, we entered the contest still doing things that should have been done well before:
  • setting up the wifi access so that I could connect with my laptop (it never worked, so Bryan had to go out and buy new network cables)
  • exchanging keys to access the Subversion repository
  • learning how to use the build and test system
I guess we all need to make some mistakes (at least) once ourselves before getting it!

However, the biggest lesson for me in this contest was the use of Parser Combinators.

Do you speak Martian?

The second programming task in the contest, after setting up the network connection to be able to receive messages and send back commands, was to parse the input messages giving the position of the Martian rover, and its immediate environment.

As the task specifies, there are 5 kinds of messages:
  • initialization messages -> what's the map like? what are your technical constants?
  • telemetry stream -> where are you, what's your speed, what's around?
  • adverse event -> crashed in a crater!
  • success -> yes, home!
  • end of run -> time and status
The format of those messages is not particularly complex, but the telemetry message can contain an unlimited number of obstacle descriptions, for example.

For many developers, the most obvious way to transform meaningless strings into meaningful objects like Telemetry or Martian would be to start traversing the string, check for characters like 'T' for Telemetry or 'I' for Initialization, then pass the rest of the string to another function, and continue slicing the string until all the data is analyzed.

Parser combinators

But Haskell developers know better,... They can just describe the grammar of the messages they're parsing by combining parsers, then apply the resulting parser to the input string.

Something like this:

telemetry :: Parser Message
telemetry = do ts        <- int
               state     <- accelStateParser
               turning   <- turningStateParser
               spaces
               x         <- double
               y         <- double
               dir       <- double
               speed     <- double
               obstacles <- obstaclesParser
               messageEnd
               return $ Telemetry ts state turning x y dir speed obstacles

accelStateParser :: Parser AccelState
accelStateParser = (accelerating <|> braking <|> rolling)

accelerating = stateParser 'a' Accelerating
braking      = stateParser 'b' Braking
rolling      = stateParser '-' Rolling

stateParser :: Char -> a -> Parser a
stateParser c cons = do _ <- char c
                        return cons

turningStateParser :: Parser TurningState
turningStateParser = hardleft  <|> softleft
                 <|> hardright <|> softright
                 <|> straight

hardleft  = stateParser 'L' HardLeft
softleft  = stateParser 'l' SoftLeft
hardright = stateParser 'R' HardRight
softright = stateParser 'r' SoftRight
straight  = stateParser '-' Straight

obstaclesParser :: Parser [Object]
obstaclesParser = do l <- many obstacleParser
                     return l

obstacleParser :: Parser Object
obstacleParser = do code <- oneOf "bchm"
                    spaces
                    case code of
                      'b' -> object Boulder
                      'c' -> object Crater
                      'h' -> object Home
                      'm' -> martian
                      _   -> error "Can't get here in parser!"

object :: (Circle -> a) -> Parser a
object cons = do x <- double
                 y <- double
                 r <- double
                 spaces
                 return $ cons (Circle (x:.y) r)

martian :: Parser Object
martian = do x     <- double
             y     <- double
             dir   <- double
             speed <- double
             spaces
             return $ Martian (Circle (x:.y) 0.4) dir speed

In the above code, you can read that the parser for telemetry messages is a sequence of different parsers: a timestamp parser, an acceleration state parser, a turning state parser, ..., a parser for lists of obstacles. And a parser for a list of obstacles is just "many" parsers for one obstacle. This is indeed as easy as describing a grammar for that mini-language.

"k" should better be 1

But I didn't say that describing a grammar was easy! For instance, at first, we made a mistake in the way we managed spaces. We had almost every parser trying to consume spaces before doing its job. This generated unnecessary ambiguity in the parsing.

For example, let's say that you want to be able to parse this kind of message:

" O T O O T"

(honestly, I don't know why you would like to do something like that but you're the boss here)

You'd better combine these 3 parsers:
  • spaces -> will consume any number of spaces

  • oneParser = char 'O' >> spaces -> will consume 'O' followed by any number of spaces

  • twoParser = char 'T' >> spaces -> will consume 'T' followed by any number of spaces

If you define the parsers differently, having "oneParser" and "twoParser" consume spaces before they parse their character, which parser should be used on the first space of the message?

Actually, parser combinators in Haskell are also capable of backtracking and "unparsing" results if the chosen parser was not appropriate (trying things out with the "try" parser combinator). But why add more complexity when things can be described more directly?

Of course, there is a formal theory behind the example above: LL(k) grammars, where LL(1) is a type of grammar only needing to look at the next character to know which parser to use. But there's nothing like a simple "real-life" example to be able to grasp it.

Compare and contrast

Now, honestly, most other contest solutions were not using parser combinators, and they aren't soooo ugly (in Dylan for example). I think they're just less readable and more error-prone because of index manipulation.

For fun and comparison, I also developed a similar parser in Scala:

def telemetryMessage = for { timestamp    <- intNumber
                             accState     <- accelState
                             turnState    <- turningState
                             vehicleX     <- doubleNumber
                             vehicleY     <- doubleNumber
                             vehicleDir   <- doubleNumber
                             vehicleSpeed <- doubleNumber
                             obstacles    <- objects }
                       yield Telemetry(timestamp, accState(), turnState(),
                                       vehicleX, vehicleY, vehicleDir,
                                       vehicleSpeed, obstacles)

def accelState = (
    "a" ^^^ Accelerating
  | "b" ^^^ Braking
  | "-" ^^^ Rolling
)

def turningState = (
    "l" ^^^ SoftLeft
  | "L" ^^^ HardLeft
  | "-" ^^^ Straight
  | "r" ^^^ SoftRight
  | "R" ^^^ HardRight
)

def objects = rep(objectParser)

def objectParser: Parser[Object] = (
    "b" ~ spaces ~> circleParser ^^ (new Boulder(_))
  | "c" ~ spaces ~> circleParser ^^ (new Crater(_))
  | "h" ~ spaces ~> circleParser ^^ (new Home(_)) // 'h' marks the home base, not another crater
  | "m" ~ spaces ~> martianParser
)

def circleParser = for { x <- doubleNumber
                         y <- doubleNumber
                         r <- doubleNumber }
                   yield new Circle(x, y, r)

def martianParser = for { circle <- circleParser
                          dir    <- doubleNumber
                          speed  <- doubleNumber }
                    yield new Martian(circle, dir, speed)


The interesting things in the Scala implementation, although it may seem a bit ugly at first sight, are:

  • the use of a sequence operator "~>" getting rid of what has already been parsed. Indeed, when we have parsed "b " we know we need to create a Boulder object. The 'b' character and the space are then not useful anymore

  • the use of functions to transform the parsed string into a meaningful object, like "l" ^^^ SoftLeft, creating a new SoftLeft object (it's a case class) when parsing "l"

  • implicit definitions. In the example above I don't need to declare my parser for 'l' as letter("l"). Combining "l" with another parser automatically creates a parser for the character 'l'

Why XML?

One of the reasons for the ubiquitous adoption (and hatred!) of XML is, I think, the general availability of parsers and binding libraries. That's why it often becomes the easiest option for configuration files (is this changing with YAML and JSON?).

It doesn't have to be that way. I hope that parser combinators can now bring another legitimate option when considering the implementation of an external mini-language for configuration or protocol. They're simple to use and readily available on your Java platform through Scala. Try them at home!
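
As a taste of "trying them at home", here is a minimal sketch of a "key = value" configuration mini-language using Scala's parser combinators (nowadays shipped in the separate scala-parser-combinators module):

import scala.util.parsing.combinator.JavaTokenParsers

// a toy configuration language: a sequence of "key = value" entries
object ConfigParser extends JavaTokenParsers {
  def config: Parser[Map[String, String]] = rep(entry) ^^ (_.toMap)
  def entry: Parser[(String, String)] =
    ident ~ "=" ~ (ident | wholeNumber) ^^ { case key ~ _ ~ value => (key, value) }
}

// ConfigParser.parseAll(ConfigParser.config, "host = localhost port = 8080")
// --> Map(host -> localhost, port -> 8080)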

08 July 2008

My first ICFP contest, Part 1: ten days to learn Haskell

I had wanted to do that ever since I read Tom Moertel's post on his participation in the ICFP contest 2001. I told my wife: "One day, I'll participate in the ICFP contest" (head slightly tilted up, with a brilliant and determined look in the eyes).

One day, I'll participate in the ICFP contest

Fortunately, that day came quickly, as I met, last month, other geeks in Tokyo ready to get into the race. I should say über-geeks, when I realize the gaps in my knowledge of some fields of computing compared to them, system and network programming being only the tip of the iceberg.

I met them for the first time during the monthly meeting of the Tokyo Society for the Application of Currying (try to explain that to your grandma). That night we decided that we would form a team for the next ICFP contest. Then we spent the next week discussing the team name; that was the difficult decision. The easy one was the language we would use. What is THE language of choice for functional hackers? Haskell of course! [flaming starts here]

But how do you learn Haskell in 10 days and hope to be able to contribute at least a little?

Learn Haskell in 10 days!

Actually, I had been reading about Haskell for a while, starting with the very enjoyable A History of Haskell: Being Lazy with Class. From there, and from working with Scala for the past year, I had understood the following:
  • the "functional" approach vs the "sequential" approach
  • immutable data structures
  • functions as first-class objects
  • lazy evaluation
  • thinking in types
  • thinking in higher-ranked types
  • Options / Maybe
  • List operations: map, folds,...
  • pattern matching
  • type classes / traits
This makes an impressive list of language features, but I must say that Scala is really a "transitional language" for Java programmers moving towards Haskell. Riiiight. I "understand" it, but how well can I program with it within 10 days?

The laundry list

Using a programming language always involves a wealth of down-to-earth considerations worth getting right as fast as possible. Here is my list and how I organized my environment:
  1. the compiler: I used GHC and the Hugs interpreter for Windows
  2. the editor: I used the Functional plugin for Eclipse, providing editing, compilation and creation of an executable
  3. the libraries: I downloaded, compiled and installed QuickCheck2, and otherwise used the libraries bundled with GHC
  4. code coverage: hpc (not really indispensable for a beginner, but nice to have for a non-strict language. I was happy to eventually get it to work)
This is clearly not optimal yet (and I wonder if it can be, on Windows?) and I'll certainly explore:
  1. a better editor: considering what can be done for Java, it's pretty amazing that there is no high-class editor for Haskell. I'll try Yi of course, but it's still in its infancy. "Rename function" or "Rename data type" should be pretty accessible (provided enough open-source or company support, that is,...)
  2. a build/test/continuous integration system. The guys from Starling Software have their own system for that matter (QAM), and I haven't yet seen a standard tool like that for Haskell.
  3. more libraries, and library management: I still haven't done it on Windows, but I plan to install Cabal and download the whole of Hackage.
  4. switching to Linux forever. I'll certainly cover this in part 2, but Windows seems less and less a decent alternative for hackers (I didn't say developers ;-) ).
Yes, all set up! I can start my first tutorial!

No, I'd rather do a code dojo first

Why bother using a tutorial when I have a nice little programming exercise at hand? I did it in Java, I did it in Ruby, I did it in Scala, it was time to use Haskell.

Basically the question is: write a program to transform Roman number back and forth to Integers: unromanize("MCMLXXII") == 1972.

Here is my solution with Haskell:

import Data.Map (findWithDefault, fromList)

-- SOME VALUE DECLARATIONS
-- definition of the roman letters with their numerical value
romansList = [("I", 1), ("IV", 4), ("V", 5), ("IX", 9), ("X", 10), ("XL", 40),
              ("L", 50), ("XC", 90), ("C", 100), ("CD", 400), ("D", 500),
              ("CM", 900), ("M", 1000)]
-- numerical values only
values = map snd romansList
-- letters only
letters = map fst romansList

-- return the value corresponding to a letter
value :: String -> Int
value c = findWithDefault 1 c (fromList romansList)

-- return the letter corresponding to a value
letter :: Int -> String
letter i = findWithDefault "I" i reversed
  where reversed = fromList (map switch romansList)
        switch = \(x, y) -> (y, x)

-- to get the value of a RomanString, start
-- from the end and accumulate the value of each letter;
-- subtract the value of the current letter if its value is
-- inferior to the previous one
-- ex: "XIV" -> 5 (for "V") - 1 (for "I") + 10 (for "X")
unromanize :: String -> Int
unromanize s = fst (foldr add (0, 0) s)

add :: Char -> (Int, Int) -> (Int, Int)
add c (total, lastElem) =
    if lastElem <= val then (total + val, val)
    else (total - val, val)
  where val = value [c] -- [c] is the Char 'c' as a String

-- return the RomanString corresponding to an Int value:
-- start with an empty string and the value to convert,
-- and use the romanizeString function to compute the RomanString
romanize :: Int -> String
romanize i = fst $ romanizeString ("", i)

-- if the remainder is 0, there is no more conversion to do;
-- otherwise, try to subtract the biggest possible RomanChar from the remainder.
-- This RomanChar is then appended to the current result
romanizeString :: (String, Int) -> (String, Int)
romanizeString result@(res, remain) =
    if remain == 0 then result
    else romanizeString (res ++ letter maxi, remain - maxi)
  where maxi = maxValue remain
        maxValue r = last (takeWhile (<= r) values)


And here is a simple QuickCheck property that I wrote to check the conversion:
prop_unromanize_romanize :: Int -> Property
prop_unromanize_romanize x = x >= 0 ==> unromanize(romanize x) == x

First impressions

Working on that simple example (but also on string generators to check the reverse property, romanize(unromanize(s)) == s, which is much more tricky) led me to the following observations:
  • programming in Haskell really "feels" pure and elegant. There is an economy of symbols which is quite remarkable for such expressivity
  • not having variables feels like programming in a straitjacket when coming from an imperative background! But the feeling disappears quickly
  • the "functional/data" orientation makes for a whole different mental model when you come from an object-oriented perspective. It looks like your objects, encapsulating their data and exposing only their business logic, have been turned inside out, with methods carefully separated from data.
  • browsing the libraries, it feels like there is a huge drive towards abstraction in Haskell. Nowhere else have I seen concepts decomposed so that Monoids would emerge and be reused
  • new language, new libraries. It really takes time, but the Haskell libraries are full of jewels which are really worth knowing inside out (the same can certainly be said of many languages)
  • although the Prelude contains a lot of useful functions, some facilities you may find in scripting languages are missing. We recently discussed the lack of a "split" operation for Strings on the Haskell mailing-list. I know that different people have different semantics for a "split" operation, but not including any kind in the standard library feels like missing something
  • the compiler is your friend! It goes to great lengths explaining why and where you failed to properly type your program
Go, go, go

After those 2 or 3 sessions of installation and programming, I was still very far from being able to code on my own for the contest, yet I would be able to pair up and provide a vigilant pair of eyes on someone else's code. Stay tuned for part 2: the contest!

03 July 2008

Everyday monads

Abstraction

Abstraction is the bread and butter of our trade, right? The problem is that when you abstract too much, you don't always know where you came from and why.

You see? The sentence above is already very abstract ;-). Let's try to be more concrete. I want to talk about Monads and the kind of problems they solve:


  • monads --> powerful abstract beast
  • problems they solve --> very concrete

A classical code sample

I was trying to fix an issue with our software 2 days ago when I realized that part of my data wasn't valid: "What!? My exposure is not valid, how come!?". A quick look at the code revealed that the validity of an exposure is determined by the following (actually simplified,...):


public boolean isValid() {
  return traded && !matured &&
         !excluded && !internal && leId > 1;
}

Of course I started blaming the programmer: thank you, smart boy, I know that my exposure is not valid, but would you have the decency to explain to me WHY? Is it untraded and/or matured and/or excluded,...?


I quickly realized why more precise error reporting hadn't been written: time to market. Writing big if/then/else blocks and accumulating error messages is tedious, error-prone and boring to the extreme.


Monads to the rescue!

Somehow, that's the kind of problem that Monads can solve. One way to see Monads is to consider them as a convenient way to chain computations:


  1. you put something in the Monad
  2. you apply one or more functions to it, the results stay in the Monad ("What goes in the Matrix, stays in the Matrix" as I've read somewhere)
  3. then you can extract the final result


Here is the use of a Monad (coded in Scala but you could achieve the same kind of effect in Java) to solve the problem:


def isValid(e: Exposure) = {
  e.traded ?~! ("is not traded") ?~! (!e.matured, "is matured") ?~!
  (!e.excluded, "is excluded") ?~! (!e.internal, "is internal") ?~!
  (e.leId > 1, "doesn't have a defined legal entity")
}

val exposure = new Exposure
exposure.traded = false
exposure.matured = true

isValid(exposure) match {
  case f: Failure =>
    println(f.messages.mkString("exposure is not valid because: it ", ", it ", ""))
}

It outputs:
exposure is not valid because: it is not traded, it is matured, it doesn't have a defined legal entity

So, at the (small) expense of changing the operator from && to ?~! and adding a specific message at each step we get a nice failure message indicating precisely what went wrong.


How does this work?

I implemented this cryptic ?~! operator as a small variation of the existing ?~! method on the Can class from the lift web framework.

A Can is a kind of box which can either be Empty, Full (of something) or a Failure (reason). So when you perform computations on a Can, either:

  1. there is nothing to do (Empty) --> the result will stay Empty or become a Failure
  2. there is a value (Full(value)) --> the result will either stay Full or become a Failure
  3. it is a Failure --> it will stay a Failure, possibly with an additional error message

What are the 2 operations I need on the Can objects? First of all I need an operation to "put a boolean in the monad". Here I have used an implicit definition:
implicit def returnCan(b: Boolean): Can[Boolean] = if (b) Full(true) else Empty

then I need one method which will either return the same Can if some boolean is true or a new Failure accumulating one more error message:


def ?~!(b: Boolean, msg: String): Can[A] = {
  if (b) this else Failure(msg :: this.messages)
}

What happens if the first boolean expression is false?

  1. an Empty Can is created (representing a failure with no cause yet)
  2. the ?~!(msg: String) method is called and returns a Failure(msg)
  3. when the ?~!(b: Boolean, msg: String) method is called, it either returns the same Failure, or creates a new one with one more error message

And what happens if the first boolean expression is true?
  1. a Full Can is created (representing a success)
  2. the ?~!(msg: String) method is called and returns the Full can
  3. when the ?~!(b: Boolean, msg: String) method is called, it either returns the same Full can if b is true, or creates a new Failure Can with one error message
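
To make this self-contained, here is a minimal sketch of the same mechanics, independent from lift's actual Can class (Passed and Failed are names I made up for the illustration):

// Passed plays the role of Full, Failed the role of Empty/Failure
sealed trait Check {
  def messages: List[String] = Nil
  // one-argument form, used on the first boolean of the chain
  def ?~!(msg: String): Check = this match {
    case Passed     => Passed
    case Failed(ms) => Failed(msg :: ms)
  }
  // two-argument form: keep going if b is true, else accumulate one more message
  def ?~!(b: Boolean, msg: String): Check =
    if (b) this else Failed(msg :: messages)
}
case object Passed extends Check
case class Failed(override val messages: List[String]) extends Check

// "put the boolean in the monad"
implicit def booleanToCheck(b: Boolean): Check = if (b) Passed else Failed(Nil)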

What I will not do in this post

Although that would be interesting, I will not try to show why the behavior described above actually corresponds to a Monad, with the mandatory Monad laws. The first reason is that it is too late to do so ;-). The second reason is that I'm not sure it matters much. What is certainly more important is that Monads, Arrows, Applicative Functors and their kin provide powerful models of computation which can also be applied to everyday Enterprisey jobs.

And, by the way, Scala provides syntactic sugar for operator definitions, but there is really nothing preventing us from doing the same thing in Java:

// warning!! pseudo-java code!!
class Result {
  private final List<String> messages;
  private Result(List<String> msgs) {
    this.messages = msgs;
  }
  public Result andCheck(boolean b, String msg) {
    if (b) return this;
    List<String> msgs = new ArrayList<String>(messages);
    msgs.add(msg);
    return new Result(msgs);
  }
  public static Result check(boolean b, String msg) {
    return new Result(new ArrayList<String>()).andCheck(b, msg);
  }
}

public Result isValid(Exposure e) {
  return check(e.traded, "is not traded").
         andCheck(!e.matured, "is matured").
         andCheck(!e.excluded, "is excluded").
         andCheck(!e.internal, "is internal").
         andCheck(e.leId > 1, "doesn't have a defined legal entity");
}


Eureka

I hope I haven't brought more confusion to the topic of Monads with this not-very-formal post. This is a world I'm just starting to explore and I just love it when I can have some "Eureka" moments and write better code!

09 June 2008

Edit distance in Scala

How handy.

I was just looking for a way to highlight the differences between 2 strings for my Behavior-Driven Development library, specs, while reading the excellent Algorithm Design Manual. Intuitively, I was thinking that there was a better way to show string differences than the one used in jUnit:

Expected kit<t>en but was kit<ch>en
Expected <skate> but was <kite>

The first message shows that <t> has been replaced by <ch>, while the second message says that everything is different! There must be a way to find the minimal set of operations necessary to transform one string into another, right?

The Levenshtein distance

There is indeed. The "Edit" distance, also called the "Levenshtein" distance, computes exactly this: the minimal number of insertions, deletions and substitutions required to transform "World" into "Peace" (5, they're very far apart,...).

The computation is usually done by:
  1. examining the first 2 or last 2 letters of each string
  2. assuming the eventual transformation which may have occurred for those 2 letters:
     • they're equal: "h..." and "h..." --> distance = 0
     • the first one has been removed: "ah..." and "h..." --> distance = 1
     • the second one has been inserted: "h..." and "ah..." --> distance = 1
     • they've been substituted: "ha..." and "ga..." --> distance = 1
  3. saying that the final distance is the minimum, over those 4 possibilities, of the distance for that choice plus the distance resulting from the consequences of that choice

You can notice that many variations are possible: the distance could be different depending on the letters being added or removed, there could be more allowed operations such as the swapping of 2 letters (distance 1 instead of 2 substitutions of distance 2),...
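
For reference, this description corresponds to the textbook recurrence, with a unit cost for the three operations (the same recurrence the EditDistance trait below implements):

d(i, 0) = i, \qquad d(0, j) = j

d(i, j) = \min \begin{cases} d(i-1, j) + 1 & \text{(suppression)} \\ d(i, j-1) + 1 & \text{(insertion)} \\ d(i-1, j-1) + [\, s_1(i) \neq s_2(j) \,] & \text{(substitution)} \end{cases}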

Then the final result can either be constructed incrementally or recursively. Incrementally, for "skate" and "kite" you will compute costs from any small prefix to another prefix:
  • "s" to "k" --> distance 1
  • "s" to "ki" --> distance 2
  • "sk" to "ki" --> distance 2
  • "ska" to "kit" --> distance 3
Then you increase the prefix size and you reuse the distance numbers found for smaller prefixes. The different results can be stored in the following matrix:

    s k a t e
  k 1 1 2 3 4
  i 2 2 2 3 4
  t 3 3 3 2 3
  e 4 4 4 3 2

Recursively, you do the inverse: you establish that the distance between 2 strings can be computed from the distance between smaller prefixes, and you travel the matrix to its upper left corner.

This is a very good example of Dynamic Programming, where the optimal quantity (the distance at [i, j] in the matrix) you're looking for only depends on:
  • the optimal quantity for a subset of the data (the distance at [i-1, j-1], [i-1, j], [i, j-1])
  • the resulting state (the substrings defined by a position [i-x, j-x] in the matrix)
  • not how you got to that resulting state

The EditDistance trait

For the record, here is the code I'm using for the implementation, which constructs the solution incrementally, as opposed to the recursive solution presented on Tony's blog, here.

The few interesting things to notice about this code are:
  • the algorithm itself ("initializing the matrix"), which is very straightforward
  • the minimum function, which is a bit strange because it is not comparing all alternatives. The cost of an insertion is only computed if the cost of a suppression is not better than the cost of a substitution. I would like to work out a formal proof of why this is correct, but my intuition tells me that a substitution is generally a "good" operation. It has the same cost as an insertion or a suppression but processes 2 letters at a time. So if we find a better cost using a suppression, it is going to be the best
  • the matrix traversal to retrieve one of the possible paths showing the letters which need to be transformed in each string: not so fun code, with a lot of corner cases
  • the "separators" feature, which lets you select the separators of your choice to display the string differences

trait EditDistance {
  /**
   * Class encapsulating the functions related to the edit distance of 2 strings
   */
  case class EditMatrix(s1: String, s2: String) {
    /* matrix containing the edit distance for any prefix of s1 and s2:
       matrix(i)(j) = edit distance(s1[0..i], s2[0..j]) */
    val matrix = new Array[Array[Int]](s1.length + 1, s2.length + 1)

    /* initializing the matrix */
    for (i <- 0 to s1.length;
         j <- 0 to s2.length) {
      if (i == 0) matrix(i)(j) = j      // j insertions
      else if (j == 0) matrix(i)(j) = i // i suppressions
      else matrix(i)(j) = min(matrix(i - 1)(j) + 1, // suppression
                              matrix(i - 1)(j - 1) + (if (s1(i - 1) == s2(j - 1)) 0
                                                      else 1), // substitution
                              matrix(i)(j - 1) + 1) // insertion
    }

    /** @return the edit distance between 2 strings */
    def distance = matrix(s1.length)(s2.length)

    /** prints the edit matrix of 2 strings */
    def print = {
      for (i <- 0 to s1.length) {
        def row = for (j <- 0 to s2.length) yield matrix(i)(j)
        println(row.mkString("|"))
      }
      this
    }

    /** @return a (String, String) displaying the differences between each
      *         input string. The used separators are parentheses: '(' and ')' */
    def showDistance: (String, String) = showDistance("()")

    /**
     * @param sep separators used to highlight differences. If sep is empty,
     *            then no separator is used. If sep contains one character, it is
     *            taken as the unique separator. If sep contains 2 or more characters,
     *            the first 2 characters are taken as opening separator and closing
     *            separator.
     *
     * @return a (String, String) displaying the differences between each
     *         input string. The used separators are specified by the caller.
     */
    def showDistance(sep: String) = {
      val (firstSeparator, secondSeparator) = separators(sep)
      def modify(s: String, c: Char): String = modifyString(s, c.toString)
      def modifyString(s: String, mod: String): String = {
        (firstSeparator + mod + secondSeparator + s).
          removeAll(secondSeparator + firstSeparator)
      }
      def findOperations(dist: Int, i: Int, j: Int, s1mod: String, s2mod: String): (String, String) = {
        if (i == 0 && j == 0) {
          ("", "")
        }
        else if (i == 1 && j == 1) {
          if (dist == 0) (s1(0) + s1mod, s2(0) + s2mod)
          else (modify(s1mod, s1(0)), modify(s2mod, s2(0)))
        }
        else if (j < 1) (modifyString(s1mod, s1.slice(0, i)), s2mod)
        else if (i < 1) (s1mod, modifyString(s2mod, s2.slice(0, j)))
        else {
          val (suppr, subst, ins) = (matrix(i - 1)(j), matrix(i - 1)(j - 1),
                                     matrix(i)(j - 1))
          if (suppr < subst)
            findOperations(suppr, i - 1, j, modify(s1mod, s1(i - 1)), s2mod)
          else if (ins < subst)
            findOperations(ins, i, j - 1, s1mod, modify(s2mod, s2(j - 1)))
          else if (subst < dist)
            findOperations(subst, i - 1, j - 1, modify(s1mod, s1(i - 1)),
                           modify(s2mod, s2(j - 1)))
          else
            findOperations(subst, i - 1, j - 1, s1(i - 1) + s1mod, s2(j - 1) + s2mod)
        }
      }
      findOperations(distance, s1.length, s2.length, "", "")
    }

    def min(suppr: Int, subst: Int, ins: => Int) = {
      if (suppr < subst) suppr
      else if (ins < subst) ins
      else subst
    }
  }

  def editDistance(s1: String, s2: String): Int = EditMatrix(s1, s2).distance
  def showMatrix(s1: String, s2: String) = EditMatrix(s1, s2).print
  def showDistance(s1: String, s2: String) = EditMatrix(s1, s2).showDistance
  def showDistance(s1: String, s2: String, sep: String) = EditMatrix(s1, s2).showDistance(sep)

  private def separators(s: String) = (firstSeparator(s), secondSeparator(s))
  private def firstSeparator(s: String) = if (s.isEmpty) "" else s(0).toString
  private def secondSeparator(s: String) = {
    if (s.size < 2) firstSeparator(s) else s(1).toString
  }
}


Conclusion

What's the conclusion? Computer science is useful of course, but you only recognize it once you know it!

08 May 2008

Algorithmic panic - take 2

Shame, shame, shame.

The Blog Writing Rule

Writing at least one blog entry per month. That should be feasible, no? Unfortunately, last Friday, I was already more than 10 days late, so I had to do something!

And, when pressing the "Post" button, I had this awkward feeling that something was not really right with my "solution". Not really on the "speed" side, because I knew that my list accesses were sloppy, but even on the "correctness" side. But I was past the date, damn it!

Fortunately, someone's watching

Ok, enough justifications. I was in fact very pleased to get so many insightful comments, be it on my blog or on reddit. It's a pleasure to see that so many people care for The Truth on the Internet ( ;-) )

So I tried my best to review everything and to derive the maximum number of lessons from these errors. I'll start with that, then I'll copy the revised version of the algorithm.

Lessons learned

1. Check your definitions. The median definition is very straightforward,... if you have an odd number of elements. For an even number, you can elaborate several definitions, such as that one. For the code below, I chose to define the median as a pair of values, holding both "middle" elements if the list has an even number of elements. Pairs of values are then conveyed across all computations, so that it's always possible to eventually take the mean value if you feel like it

2. Don't trust test generation. I promise my tests were green when I posted! However, the generation parameters were very bad: the number of tests was too small, and so were the number and range of elements in the lists. The solution here was tested on larger samples (up to 8000 tests), with larger lists (of same or different sizes)

3. Check your invariants. In this kind of "Divide and Conquer" approach, I should make sure that I have indeed the same problem, on a smaller scale, after dividing it. The code below shows that it is possible to take the same number of elements from each list, so that the median of the resulting lists is unchanged

4. Know your data structures. I knew that the most basic list was not necessarily efficient, but I must say that I didn't check anything. And, in fact, when Jorge pointed out that List had an O(n) access for size, I was quite surprised. The other interesting thing is that optimizing an operation can be totally feasible but quite tricky to implement (see the medianOfSortedListPlusOneElement function below). And of course, takeWhile(_) was pretty bad!

The algorithm

First of all, let's start with the definition of the median as a pair of values, which may be the same value twice if the list has an odd number of elements:

case class ExtendedSeq(list: Seq[Int]) {
  def median: (Int, Int) = {
    if (list.isOdd)
      (list(list.size / 2), list(list.size / 2))
    else
      (list(list.size / 2 - 1), list(list.size / 2))
  }
  def isOdd = list.size % 2 != 0
}

If element access and size are O(1), as in the RandomAccessSeq collection, that should be ok in terms of complexity (thanks Jorge).

Here is the heart of the algorithm:

def median(l1: Seq[Int], l2: Seq[Int]): (Int, Int) = {
  if (l1.isEmpty) l2.median
  else if (l2.isEmpty) l1.median
  else if (l1.size == 1) medianOfSortedListPlusOneElement(l2, l1(0))
  else if (l2.size == 1) medianOfSortedListPlusOneElement(l1, l2(0))
  else {
    val (m1, m2) = (l1.median, l2.median)
    m1.intersection(m2) match {
      case Some(m) => m
      case None => if (m1 < m2) median(trimLists(l1, l2))
                   else median(trimLists(l2, l1))
    }
  }
}

The code above starts by checking some special cases: one empty list, a list + one element, then 2 "regular" median values.

The first observation is that, if the median values (which are pairs of values) have a non-empty intersection, this intersection will be the median of both sorted lists.

This is the not-so readable code for checking if there is an overlap between 2 pairs:

case class ExtendedPair(pair1: (Int, Int)) {
  def intersection(pair2: (Int, Int)) = {
    if (pair2._1 <= pair1._1 && pair1._2 <= pair2._2) Some(pair1)
    else if (pair1._1 <= pair2._1 && pair2._2 <= pair1._2) Some(pair2)
    else if (pair1._1 <= pair2._1 && pair2._1 <= pair1._2 && pair1._2 <= pair2._2)
      Some((pair2._1, pair1._2))
    else if (pair2._1 <= pair1._1 && pair1._1 <= pair2._2 && pair2._2 <= pair1._2)
      Some((pair1._1, pair2._2))
    else None
  }
}

Finally, if there is no overlap, we need to remove elements from both lists so that the median of the remaining 2 lists will be the same as the median of the original lists:

def trimLists(l1: Seq[Int], l2: Seq[Int]) = {
  val removableElements = if (l1.size < l2.size) l1.size / 2 else l2.size / 2
  (l1.drop(removableElements), l2.take(l2.size - removableElements))
}

This time, I removed the same number of elements below the lower median and above the higher one (but still, a written mathematical proof would be more satisfying than "it's obvious that,...").

On the complexity side, I still have to check that I can find a data structure (a Seq) providing drop and take operations in O(1). From what I read of the implementation of RandomAccessSeq, that should be the case. RandomAccessSeq uses indexes to provide O(1) "slicing" operations, as suggested by Mikko in the comments of the previous post.

Brute force

And finally, I can't resist showing you the horror of finding the median of a sorted list and a supplementary element in O(1). There may be a subtle trick to avoid that (please say so if you find it!), but I used brute force:

def medianOfSortedListPlusOneElement(list: Seq[Int], element: Int) = {
  if (list.size == 1) {
    if (element >= list(0))
      (list(0), element)
    else
      (element, list(0))
  }
  else if (list.isOdd) {
    if (element >= list(list.size / 2)) {
      if (element <= list(list.size / 2 + 1))
        (list(list.size / 2), element)
      else
        (list(list.size / 2), list(list.size / 2 + 1))
    } else {
      if (element >= list(list.size / 2 - 1))
        (element, list(list.size / 2))
      else
        (list(list.size / 2 - 1), list(list.size / 2))
    }
  } else {
    if (element >= list(list.size / 2 - 1)) {
      if (element <= list(list.size / 2))
        (element, element)
      else
        (list(list.size / 2), list(list.size / 2))
    } else {
      (list(list.size / 2 - 1), list(list.size / 2 - 1))
    }
  }
}

Not pretty, huh? Like the final approach for the four-color theorem.

Measuring the complexity

On my way to this post, I wanted to implement something which would store the test data and say if the curve looks like an O(log(n)), an O(nlog(n)),... In fact it doesn't seem so easy to do without any human validation and tweaking of the test data size until the complexity really shows. I'm thinking that using a utility to graph the performances and compare them with known complexities would be a good first step.

But maybe driving the test generation so as to get a good confidence in the possible complexity of the algorithm would be even better. I'll try to have a look at that,... Maybe for my next post?

Thanks for all the comments, I hope I didn't add more and more silliness to my previous errors.

PS: I liked Tom's comments about the 2-keys indexing problem a lot. Indeed, the problem was underspecified, in terms of access frequency, memory, etc,... And a log(n) solution would certainly be ok in my specific case.

02 May 2008

Algorithmic panic

Last week I read a blog post about an interview for Google and I thought: "Oh my god, I don't know how to do that, I'm lost, I have no idea, I will never find the solution".

Flashback

Two weeks ago, I ordered and started reading the excellent book, "The Algorithm Design Manual". That book is very good. Teaching algorithms and data structures can be something very dry when you consider it from a purely scholarly perspective. But it becomes really fun and challenging when you realize that:
  • it solves real, worthy problems
  • no amount of processing power can overcome a clever algorithm when there is one
  • there's often no "right" solution but a combination of different approaches with different trade-offs
So the book is filled with "war stories" showing the author investigating some problems, trying to ask relevant questions until there's an "ah-ah" moment where the problem is sufficiently characterized. Then the solution is refined to get a completely satisfying result.

Decode this!

For example, the author and his team had to provide a new way to decrypt DNA fragments, knowing that there was a new technique using small probes returning small fragments of the whole string. Say your DNA is ATGCCTCGATTG; the probes will find AT, TG, GA, TT, TG, all of which are substrings of the initial string. And from all those pieces, you have to find the original DNA string. For this kind of problem, the sheer number of elements instantly kills any brute-force approach. Originally they had to sequence DNA fragments of 50,000 characters, knowing that more than 1.5 million pieces had to be combined to get the answer. Don't try this at home!

Binary tree > HashTable > Suffix Tree

For their algorithm to work, they had to set up a dictionary allowing the search of substrings of length k inside strings of length 2k. The question is: how to do that very efficiently? A student proposed a HashTable which would do the job in O(log(k)). But this was still too slow! Hey, what can be faster than a HashTable??!! A Suffix Tree.

In this specific case they were searching for elements which looked very close to one another, differing by just one character. And a suffix tree happens to organize the suffixes of a set of strings so that it's easy to search for a string which is just one character away.

Understanding the structure of the problem yielded much better results than a HashTable: for an 8,000-character DNA fragment, the HashTable time was 2 days against 650 seconds for the SuffixTree (the "compressed" version, because the original one was blowing up the memory!). That was my "ah-ah" moment.

    The dreadful interview question

    So I thought I was ready for any algorithmic question. Until I read the blog: "find the median of 2 sorted lists in O(log(n))". I don't know why, reading that blog, I imagined myself over the phone, trying to solve this problem and my mind got totally blank. Panic.

    The same night, going to bed, I told myself: ok, relax, have a fresh look at it. And I slept happily a few minutes later because the principle of the solution wasn't that difficult.

    First of all, the median of a sorted list is the middle element. That's an O(1) search! Then, given the medians for each list, the median of both lists is something like the median of the 2 sublists of all elements between median1 and median2. That's where we get our O(log(n)) from, because with a recursive search, we're pretty sure to cut the problem size by 2 at each step.

    I programmed the rest the next day. Again, I think the result is a tribute to Scala: it is really readable, and I used ScalaCheck to verify it by generating lists of same and different sizes. On the other hand, I had a hard time figuring out the proper bound conditions (ScalaCheck was very helpful here):

    trait Median {
      def median(l1: List[Int], l2: List[Int]): Int = {
        if (l1.isEmpty || l2.isEmpty || l1.last <= l2.head) (l1 ::: l2).median
        else if (l2.last < l1.head) (l2 ::: l1).median
        else {
          val (m1, m2) = (l1.median, l2.median)
          if (m2 < m1) median(l1.takeBetween(m2, m1), l2.takeBetween(m2, m1))
          else if (m2 > m1) median(l1.takeBetween(m1, m2), l2.takeBetween(m1, m2))
          else m1
        }
      }
      implicit def toExtendedList(list: List[Int]) = ExtendedList(list)
      case class ExtendedList(list: List[Int]) {
        def takeBetween(a: Int, b: Int) = list.dropWhile(_ <= a).takeWhile(_ <= b)
        def median = list(list.size / 2)
      }
    }
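
    For the record, the check I ran was built around a brute-force oracle, roughly like this sketch (medianSpecification is a made-up name, and the generated lists are sorted before being passed to median, since the algorithm expects sorted inputs):

    object medianSpecification extends Specification with Median {
      "the median function" should {
        "agree with the middle element of the sorted combination of both lists" in {
          val medianInvariant = (xs: List[Int], ys: List[Int]) => {
            val (l1, l2) = (xs.sort(_ < _), ys.sort(_ < _))
            (l1.isEmpty && l2.isEmpty) ||
              median(l1, l2) == (l1 ::: l2).sort(_ < _).apply((l1.size + l2.size) / 2)
          }
          property(medianInvariant) must pass
        }
      }
    }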


    Data structures everywhere

    Funnily enough, yesterday one of my colleagues asked me if I knew an easy, fast and clever way to access values classified by 2 keys:
    • there's only one value for (key1, key2)
    • she wants to get the value for (key1, key2) (fast,...)
    • she wants to get all values for key1 (fast again,...)
    • she wants to get all values for key2 (fast also,...)
    This looks like a very classical problem for Java programmers, but surprisingly I found no existing open-source solution for it (if you know of one, please tell me. I looked at things like TreeMaps or MultiMaps, for example, but they don't fit the requirement of a fast search with either of the 2 keys).

    After some research and some thinking, we concluded that 3 hashmaps sharing the values, each dedicated to answering one kind of query fast, would be the best thing to do. But now I know that HashTables are not the ultimate solution to everything,...
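
    Something like this minimal sketch (BiKeyMap and all its names are made up, and it assumes each (key1, key2) pair is inserted only once):

    import scala.collection.mutable.HashMap

    class BiKeyMap[K1, K2, V] {
      private val byBothKeys = new HashMap[(K1, K2), V]
      private val byKey1 = new HashMap[K1, List[V]]
      private val byKey2 = new HashMap[K2, List[V]]

      def put(k1: K1, k2: K2, v: V) = {
        byBothKeys.update((k1, k2), v)
        byKey1.update(k1, v :: byKey1.getOrElse(k1, Nil))
        byKey2.update(k2, v :: byKey2.getOrElse(k2, Nil))
      }
      def get(k1: K1, k2: K2): Option[V] = byBothKeys.get((k1, k2)) // fast
      def getByKey1(k1: K1): List[V] = byKey1.getOrElse(k1, Nil)    // fast again
      def getByKey2(k2: K2): List[V] = byKey2.getOrElse(k2, Nil)    // fast also
    }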

    13 March 2008

    PDD: Properties Driven Development

    Quick quiz: 1 + 2 = ? and 1 + 2 + 3 = ? and 1 + 2 + 3 + 4 = ?

    In other words, what is the result of the sumN function, which sums all integers from 1 to n?

    If we had to use a "direct" (some would say "naive") TDD approach, we could come up with the following code:
    [Warning!! The rest of the post assumes that you have some basic knowledge of Scala, specs and ScalaCheck,...]

    "the sumN function should" {
    def sumN(n: Int) = (1 to n) reduceLeft((a:Int, b:Int) => a + b)
    "return 1 when summing from 1 to 1" in {
    sumN(1) must_== 1
    }
    "return 2 when summing from 1 to 2" in {
    sumN(2) must_== 3
    }
    "return 6 when summing from 1 to 3" in {
    sumN(3) must_== 6
    }
    }

    But if we browse our mental book of "mathematical recipes" we remember that

    sumN(n) = n * (n + 1) / 2

    This is actually a much more interesting property to test, and ScalaCheck helps us check it:

    "the sumN function should" {
    def sumN(n: Int) = (1 to n) reduceLeft((a:Int, b:Int) => a + b)
    "return n(n+1)/2 when summing from 1 to n" in {
    val sumNInvariant = (n: Int) => sumN(n) == n * (n + 1) / 2
    property(sumNInvariant) must pass
    }
    }

    Even better, ScalaCheck has no reason to assume that n is strictly positive! So it quickly fails on n == -1, and a better implementation is:

    "the sumN function should" {
    def sumN(n: Int) = {
    assume(n >= 0) // will throw an IllegalArgumentException if the constraint is violated
    (1 to n) reduceLeft((a:Int, b:Int) => a + b)
    }
    "return n(n+1)/2 when summing from 1 to n" in {
    val sumNInvariant = (n: Int) => n <= 0 || sumN(n) == n * (n + 1) / 2
    property(sumNInvariant) must pass
    }
    }


    This will be checked for a large number of values of n. Using properties is indeed quite powerful. Recreation time! You can now have a look at this movie, where Simon Peyton-Jones shows that QuickCheck (the Haskell ancestor of ScalaCheck) detects interesting defects in a bit-packing algorithm.

    Fine, fine, but honestly, all those examples look very academic: graph algorithms, mathematical formulas, bit packing,... Can we apply this kind of approach to our mundane, day-to-day development? Tax accounting, DVD rentals, social websites?

    I am going to take 2 small examples from my daily work and see how PDD could be used [yes, YAA, Yet Another Acronym,... PDD stands for Properties Driven Development (and not that PDD)].
    1. From the lift framework (courtesy of Jamie Webb): a 'camelCase' function which transforms underscored names to CamelCase names
    2. From my company software: some pricer extension code for Swaps where fees values have to be subtracted from the NPV (Net Present Value) under certain conditions
    I'll develop the examples first and try to draw conclusions later on.

    Example 1: camelCase

    So what are the properties we can establish for that first example? Can we describe it informally first?

    The camelCase function should CamelCase a name which is under_scored, removing each underscore and capitalizing the next letter.

    Try this at home, this may not be so easy! Here is my proposal:

    def previousCharIsUnderscore(name: String, i: Int) = i > 1 && name.charAt(i - 1) == '_'
    def underscoresNumber(name: String, i: Int) = {
      if (i == 0) 0
      else name.substring(0, i).toList.count(_ == '_')
    }
    def indexInCamelCased(name: String, i: Int) = i - underscoresNumber(name, i)
    def charInCamelCased(n: String, i: Int) = camelCase(n).charAt(indexInCamelCased(n, i))

    val doesntContainUnderscores = property((name: String) => !camelCase(name).contains('_'))

    val isCamelCased = property((name: String) => {
      name.forall(_ == '_') && camelCase(name).isEmpty ||
      name.toList.zipWithIndex.forall { case (c, i) =>
        c == '_' ||
        indexInCamelCased(name, i) == 0 && charInCamelCased(name, i) == c.toUpperCase ||
        !previousCharIsUnderscore(name, i) && charInCamelCased(name, i) == c ||
        previousCharIsUnderscore(name, i) && charInCamelCased(name, i) == c.toUpperCase
      }
    })
    doesntContainUnderscores && isCamelCased must pass

    This property says that:
    • the CamelCased name must not contain underscores anymore
    • if the name contains only underscores, then the CamelCased name must be empty
    • for each letter in the original name, either:
      • it is an underscore
      • it is the first letter after some underscores, then it becomes the first letter of the CamelCased word and should be uppercased
      • the previous character isn't an underscore, so it should be unchanged
      • the previous character is an underscore, so the letter should be uppercased
    Before running ScalaCheck, we also need to create a string generator with some underscores:

    implicit def underscoredString: Arbitrary[String] = new Arbitrary[String] {
      def arbitrary = for {
        length <- choose(0, 5)
        string <- vectorOf(length, frequency((4, alphaNumChar), (1, elements('_'))))
      } yield List.toString(string)
    }


    This works, and in the process of writing the properties I observed that:
    • the full specification for CamelCasing a name is not so easy!

    • it is not trivial to relate the resulting name to the original one. I had to play with indices and the number of underscores to be able to relate characters before and after. However, once that code is in place, the testing code is almost only 1 line per property to check

    • the properties above specify the function unambiguously. I could also have specified weaker properties with less code: for instance, not requiring that some letters be unchanged, or checking that the CamelCased name contains an uppercased letter without checking its position.
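
    For reference, here is one way to implement camelCase so that it satisfies those properties (my own sketch, not necessarily Jamie Webb's lift version):

    def camelCase(name: String): String = {
      def loop(chars: List[Char], upperNext: Boolean): List[Char] = chars match {
        case Nil => Nil
        case '_' :: rest => loop(rest, true) // drop the underscore, capitalize what follows
        case c :: rest if upperNext => c.toUpperCase :: loop(rest, false)
        case c :: rest => c :: loop(rest, false)
      }
      // starting with upperNext = true uppercases the very first letter,
      // as the isCamelCased property requires
      loop(name.toList, true).mkString
    }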

    Example 2: Pricer extension

    The logic for this extension is:
    1. to have the NPV (NetPresentValue) being calculated by the Parent pricer
    2. to collect all fees labeled "UNDERLYING PREMIUM" for that trade
    3. to subtract the fee value from the NPV if the valuation date for the trade is >= the fee settlement date
    4. to apply step 3 only if a pricing parameter named "INCLUDE_FEES" is set to true, while another pricing parameter, "NPV_INCLUDE_CASH", is set to false
    This certainly looks like a lot of jargon for most of you, but I guess it is pretty close to a lot of "Business" requirements. What would a property for those requirements look like (in pseudo-Scala code)?

    (originalNPV, fees, valuationDate, pricingParameters) =>

      if (!INCLUDE_FEES || NPV_INCLUDE_CASH)
        newNPV == originalNPV
      else
        newNPV == originalNPV - fees.foldLeft(0.0) { (result, fee) =>
          result + (if (fee.isUnderlyingPremium && fee.settlementDate <= valuationDate)
                      fee.getValue
                    else
                      0.0)
        }

    The most remarkable thing about this property is that it looks very close to the actual implementation. On the other hand, ScalaCheck will be able to generate a lot of test cases:
    • an empty fee list
    • a list with no underlying premium fee
    • a list with a fee whose settlement date is after the valuation date
    • a list with a fee whose settlement date is before the valuation date
    • the 4 possible combinations for the values of the pricing parameters
    You can also notice that the first parameter is the originalNPV, which doesn't directly come from a generator but would be the result of the original pricer given the other generated parameters (fees, valuation date, pricing parameters).
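
    To actually run this, the fee generator could look like the following sketch (Fee, isUnderlyingPremium and getValue are my own simplified stand-ins for the real domain classes, with dates reduced to plain ints):

    case class Fee(feeType: String, settlementDate: Int, value: Double) {
      def isUnderlyingPremium = feeType == "UNDERLYING PREMIUM"
      def getValue = value
    }
    implicit def arbitraryFee: Arbitrary[Fee] = new Arbitrary[Fee] {
      def arbitrary = for {
        feeType <- frequency((3, value("UNDERLYING PREMIUM")), (1, value("OTHER")))
        settlementDate <- choose(0, 100) // valuation and settlement dates as simple ints
        feeValue <- choose(0, 1000)
      } yield Fee(feeType, settlementDate, feeValue.toDouble)
    }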


    Conclusion


    As a conclusion, and in the light of the 2 previous examples, I would like to list the results of my recent experiments with Properties-Driven-Development:
    • First of all, PDD is TDD on steroids. In PDD, we also have data and assertions, but the data are generated and the assertions are more general.
    • I don't believe that this replaces traditional TDD in all situations. There are situations where generating even 4 or 5 cases manually is easier and faster, especially when we consider that writing an exact oracle (the learned word for the verification of a test expectation) is sometimes tedious, as in the camelCase function. In that situation, developing the cases manually using the == method would have been much faster
    • PDD, on the other hand, allows you to specify very clearly what the rule is. This is something that you would have to infer by reading several examples when using TDD
    • On the other hand, having several examples also facilitates the understanding of what's going on. "foo_bar" becomes "FooBar" is easier to catch than "if a letter is preceded by,..."
    • PDD is very good at generating data you wouldn't think of: an empty list, negative numbers, a string with underscores only,...
    • A tip: sometimes it is useful to include, in the generated parameters, the result returned by the function you want to test. In the second example, my parameters could be: (originalNPV, newNPV, fees, valuation date, pricing parameters). That way, when ScalaCheck reports an error, it also shows the actual value you got in the counter-example
    • Sometimes the properties you want to check will almost mimic the implementation (as in example 2). I think this may very often be the case with business code if it is written properly, or it may show that your code is missing a key abstraction
    • It really takes some time to wrap your head around finding properties. Soon you'll start thinking things like: "I know that properties A, B and C characterize my function, but are they sufficient?" and you realize that you're coming close to Programming == Theorem Proving
    As a summary, PDD is not the long-awaited Silver Bullet (sorry ;-) ,...) but it is indeed a wonderful tool to have in your toolbox. It will help you test your programs much more thoroughly while seeing them in yet another way.

    25 February 2008

    Better mocks with jMock (and specs)

    Now, I don't know how I managed to do without it. Now, I wonder how everyone can do without it. Now, I think that nothing big can be done without it.

    I'm not advertising a brand new toy, but I'm talking about mock objects. Whether they are real mocks or just stubs, it is almost impossible to unit test Java components without them. Not that you can't test them but if you really want to isolate a piece of code, mocks show up one way or the other.

    One of the best libraries for mock objects in Java is jMock. Yet Java's verbosity sometimes makes it difficult to understand the intention of the mock expectations. Enter jMock with Scala!

    Some statistics about my blog

    Let's say I want to write another front-end to publish my posts to Blogger. I have encapsulated all Blogger functionalities in a Scala trait:

    trait Blogger {
      def allPosts: List[Post]
      def todayPosts: List[Post]
      def post(p: Post, tags: List[Tag]): Unit
      ...
    }


    Now I want to test a "Statistics" component which will compute some stats about my posts:

    object statsSpecification extends Specification with JMocker {
      "a statistics component" should {
        "return the number of posts for today" in {
          val blogger = mock(classOf[Blogger])
          val stats = new Statistics(blogger)
          expect {
            one(blogger).todayPosts will returnValue(List(Post("...")))
          }
          stats.numberOfPostsForToday
        }
      }
    }

    class Statistics(blogger: Blogger) {
      def numberOfPostsForToday: Int = blogger.todayPosts.size
    }

    In that short specification we:
    1. create a mock: blogger = mock(classOf[Blogger]). I would have preferred to write blogger = mock[Blogger] but there is no way in Scala to create an object from its type only

    2. Add an expectation in the expect block. Again, the loan pattern makes things a lot clearer here than the corresponding "double-brace block" in Java (even if that is a clever Java trick!).

    3. Specify what the return value should be in the same expression, by defining "will" as a Scala infix operator. In the Java equivalent, we would have to make a separate method call (which our favorite IDE may insist on putting on the next line!)
      one(blogger).todayPosts; will(returnValue(List(Post("..."))))
    Pushing further with nested expectations

    There is also a situation where using Scala and jMock could be a real win.
    [What follows is extracted from the specs Wiki, talk about reusability!]

    You need to mock an object, like a Connection, which is supposed to give you access to a service that you also want to mock, and so on. For example, testing code that accesses the Eclipse platform can be very difficult for that reason.

    Using specs you can use blocks to specify nested expectations:

    // A workspace gives access to a project and a project to a module
    case class Module(name: String)
    case class Project(module: Module, name: String)
    case class Workspace(project: Project)
    val workspace = mock(classOf[Workspace])

    expect {
      one(workspace).project.willReturn(classOf[Project]) { p: Project =>
        // nested expectations on project
        one(p).name willReturn "hi"
        one(p).module.willReturn(classOf[Module]) { m: Module =>
          // nested expectation on module
          one(m).name willReturn "module"
        }
      }
    }

    or

    // a workspace is a list of projects
    case class Project(name: String)
    case class Workspace(projects: List[Project])
    val workspace = mock(classOf[Workspace])

    expect {
      // the workspace will return project mocks with different expectations
      one(workspace).projects willReturnIterable(classOf[Project],
        { p: Project => one(p).name willReturn "p1" },
        { p: Project => one(p).name willReturn "p2" })
    }

    I haven't tested this capability on a real project yet, but I clearly remember having had that kind of requirement.

    I hope that this short post can make you feel that using mocks can be easy and elegant, especially if you use them with Scala! (and specs,....)

    PS: Thanks again to Lalit Pant for showing the way with Scala and jMock

    15 January 2008

    Better unit tests with ScalaCheck (and specs)

    Writing unit tests can seem tedious sometimes.

    Some people tell you: "Hey, don't write unit tests only! Do Test-Driven Development". You write the test first, then the code for it. This way:
    • you end up writing only the code that's necessary to deliver some concrete value for your customer/user
    • you drive the design of your system
    • you add frequent refactorings to the mixture to ensure your code stays clean
    Or even better, "Do Behaviour-Driven Development!". With BDD, you get nice executable specifications for your system, which can almost read like English.

    While I fully adhere to the above principles, I also think that there is a continuum between specifications and tests. And at the end of this continuum, it's all about testing that your software works. Even given silly inputs. Your unit tests should provide that kind of coverage.

    And it's not that easy. A single line of code can go wrong in so many different ways. Try copying a file. Here comes ScalaCheck to the rescue!

    Introducing ScalaCheck

    Using ScalaCheck, you define:
    • properties which should always be true
    • random data to exercise the property
    and ScalaCheck generates the test cases for you. Isn't it great?

    Let's take a concrete example to illustrate this, because I feel I almost lost my only reader here (thanks bro, you're a real brother).

    If you want, you Can

    Last week I started specifying and testing the famous Can class from the lift framework. The Can class is the Option class from the Scala library, on steroids. [To Scala newcomers: there are many good posts on Option, Maybe (in Haskell), Either and all this monad folklore, but I will send you to a concrete example here].

    Basically, a Can is either Empty (it contains nothing) or Full (it contains a value). This is a fairly common situation in software or elsewhere: the user with name "Smith" exists in the database (Full) or not (Empty), I got the power (Full) or I haven't (Empty).

    When a Can is empty, it can be enhanced with an error message explaining why it is empty. In that case, it will be a Failure object.
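
    Concretely, Can values look like this (a sketch using the same Failure shape as in the specification below; the real lift API offers much more):

    // Failure takes a message, an optional exception (itself a Can[Throwable])
    // and a chain of previous failures
    val user: Can[String] = Full("Smith")
    val nobody: Can[String] = Empty
    val failed: Can[String] = Failure("user not found", Empty, Nil)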

    Now, if you want to test an "equals" method working for all the different cases, you have to specify a lot of test cases:
    1. 2 Full objects which are equal
    2. 2 Full objects which are not equal
    3. 2 Empty objects which are equal
    4. 2 Empty objects which are not equal
    5. 2 Failure objects which are equal
    6. 2 Failure objects which are not equal
    7. A Full object and an Empty object (not equal)
    8. A Full object and a Failure object (not equal)
    9. A Failure object and an Empty object (not equal)
    When I said it could be tedious,... And I'm even simplifying the situation since Failures can be chained, optionally contain an Exception, etc,...

    Properties

    Here is the solution, implemented using specs and ScalaCheck, with the support of Rickard Nilsson, author of the ScalaCheck project:

    object CanUnit extends Specification with CanGen {
      "A Can equals method" should {
        "return true when comparing two identical Can messages" in {
          val equality = (c1: Can[Int], c2: Can[Int]) => (c1, c2) match {
            case (Empty, Empty) => c1 == c2
            case (Full(x), Full(y)) => (c1 == c2) == (x == y)
            case (Failure(m1, e1, l1),
                  Failure(m2, e2, l2)) => (c1 == c2) == ((m1, e1, l1) == (m2, e2, l2))
            case _ => c1 != c2
          }
          property(equality) must pass
        }
      }
    }

    How does it read?

    "equality" is a function taking 2 Cans. Then, depending on the Can type, it says that the result from calling the equals method on the Can class should be equivalent to calling equals on the content of the Can if it is a Full Can for instance.

    Create a "property" with this function and declare that the property must pass. That's all.

    Well, you may want to have a look at what's generated. Add the display parameter:

    import org.specs.matcher.ScalacheckParameters._
    ...
    property(equality) must pass(display)

    Then you should see in the console:

    ....
    Tested: List(Arg(,Failure(cn,Full(net.liftweb.util.CanGen$$anon$0$UserException),List()),0),... Tested: ...
    Tested: ...
    ....
    + OK, passed 100 tests.

    And if one test fails:

    A Can equals method should
    x return true when comparing two identical Can messages
    A counter-example is 'Full(0)' (after 1 try) (CanUnit.scala line 21)

    But at this point you may have the following nagging question: "Where does all this test data come from?". Let's have a look below.

    Generating data

    Data generators are defined "implicitly". You define a function which is able to generate random data and you mark it as "implicit". When ScalaCheck tries to generate a given type of object, it looks for any implicit definition providing it. Like:

    implicit def genCan[T](dummy: Arb[Can[T]])
                          (implicit a: Arb[T] => Arbitrary[T]) = new Arbitrary[Can[T]] {
      def getArbitrary = frequency(
        (3, value(Empty)),
        (3, arbitrary[T].map(Full[T])),
        (1, genFailureCan)
      )
    }

    This code says that generating a Can, optionally full of an element of type T, which has its own implicit Arbitrary generator, is like choosing between:
    • an Empty object, 3 times out of 7
    • an arbitrary object of type T, put in a Full object, 3 times out of 7
    • a Failure object (which has its own way of being generated via another function), 1 time out of 7
    [The "dummy" parameter is here to help Scala type inferencer, AFAIK. The world is not perfect, I know]

    Here is the Failure generator, which makes heavy use of ScalaCheck's predefined generation functions:

    def genFailureCan: Gen[Failure] = for {
      msgLen <- choose(0, 4)
      msg <- vectorOf(msgLen, alphaChar)
      exception <- arbitrary[Can[Throwable]]
      chainLen <- choose(1, 5)
      chain <- frequency((1, vectorOf(chainLen, genFailureCan)), (3, value(Nil)))
    } yield Failure(msg.mkString, exception, chain.toList)


    In the above method,
    • choose returns a random int number inside a range
    • vectorOf returns a collection of arbitrary objects, with a specified length
    • alphaChar returns an arbitrary alphabetic character
    • arbitrary[Can[Throwable]] returns an arbitrary Can, making all this highly recursive!
    Random thoughts

    I hope this sparked some interest in trying to use ScalaCheck and specs to define real thorough unit tests on your system.

    The added value is similar to BDD: you will see "properties" emerge, and this will have a better chance of producing rock-solid software.

    From now on, you too can be a ScalaCheck man! (see lesson 4)