27 July 2008

My first ICFP contest, Part 2: parser combinators are so cool

This is the sequel to my last post on participating to the ICFP contest 2008.

The Setup

As Curt wrote it, we entered the contest doing things that should have been done well before:
  • setting up the wifi access so that I can connect with my laptop (it never worked so Bryan had to get out and buy new network cables)
  • exchanging keys to access the Subversion repository
  • learning how to use the build and test system
I guess that we need to make some mistakes (at least) once oneself before getting it!

However, the biggest lesson for me in this contest, was the use of Parser Combinators.

Do you speak Martian?

The second programming task in the contest, after setting up the network connection to be able to receive messages and send back commands, was to parse the input messages giving the position of the Martian rover, and its immediate environment.

As the task specifies, there are 5 kinds of messages:
  • initialization messages -> what's the map like? what are you technical constants?
  • telemetry stream -> where are you, what's your speed, what's around?
  • adverse event -> crashed in a crater!
  • success -> yes, home!
  • end of run -> time and status
The format of those messages is not particularly complex but the telemetry message can contain an unlimited number of obstacles descriptions for example.

For many developers, the most obvious way to transforms meaningless strings to meaningful objects like Telemetry or Martian, would be to start traversing the string, check for characters like 'T' for Telemetry, or 'I' for Initialization then pass the rest of the string to another function and continue slicing the string until all data is analyzed.

Parser combinators

But Haskell developers know better,... They can just describe the grammar of the messages they're parsing by combining parsers, then applying the resulting parser to the string input.

Something like that:

telemetry :: Parser Message
telemetry = do ts <- int
state <- accelStateParser
turning <- turningStateParser
x <- double
y <- double
dir <- double
speed <- double
obstacles <- obstaclesParser
return $ Telemetry ts state turning x y dir speed obstacles

accelStateParser :: Parser AccelState
accelStateParser = (accelerating <|> braking <|> rolling)

accelerating = stateParser 'a' Accelerating
braking = stateParser 'b' Braking
rolling = stateParser '-' Rolling

stateParser :: Char -> a -> Parser a
stateParser c cons = do _ <- char c
return cons

turningStateParser :: Parser TurningState
turningStateParser = hardleft <|> softleft
<|> hardright <|> softright
<|> straight

hardleft = stateParser 'L' HardLeft
softleft = stateParser 'l' SoftLeft
hardright = stateParser 'R' HardRight
softright = stateParser 'r' SoftRight
straight = stateParser '-' Straight
obstaclesParser :: Parser [Object]
obstaclesParser = do l <- many obstacleParser
return l

obstacleParser :: Parser Object
obstacleParser = do code <- oneOf "bchm"
case code of
'b' -> object Boulder
'c' -> object Crater
'h' -> object Home
'm' -> martian
_ -> error "Can't get here in parser!"

object :: (Circle -> a) -> Parser a
object cons = do x <- double
y <- double
r <- double
return $ cons (Circle (x:.y) r)

martian :: Parser Object
martian = do x <- double
y <- double
dir <- double
speed <- double
return $ Martian (Circle (x:.y) 0.4) dir speed

In the above code, you can read that the parser for telemetry messages is a sequence of different parsers: a timestamp parser, a acceleration state parser, a turning state parser, ..., a parser for lists of obstacles. And a parser for list of obstacles is just "many" parsers for one obstacle. This is indeed as easy as being able to describe a grammar for that mini-language.

"k" should better be 1

But I didn't say that describing a grammar was easy! For instance, at first, we made a mistake in the way we managed spaces. We had almost each parser trying to consume spaces before doing its job. This generated unnecessary ambiguity in the parsing.

For example, let's say that you want to be able to parse this kind of message:

" O T O O T"

(honestly, I don't know why you would like to do something like that but you're the boss here)

You'd better combine those 3 parsers:
  • spaces -> will consume any number of spaces

  • oneParser = char 'O' >>= spaces -> will consume 'O' followed by any number of spaces

  • twoParser = char 'T' >>= spaces -> will consume 'T' followed by any number of spaces

If you define parsers differently, having "oneParser" and "twoParser" consuming spaces before they parse their character, which parser should be used on the first space of the message?

Actually, parser combinators in Haskell are also capable of backtracking and "unparsing" results if the chosen parser was not appropriate (trying things with the the "try" parser combinator). But why add more complexity when things can be described more directly?

Of course, there is a formal theory behind the example above: LL(k) grammars, where LL(1) is a type of grammar only needing to look at the next character to know which parser to use. But there's nothing like a simple "real-life" example to be able to grasp it.

Compare and contrast

Now, honestly, most other contest solutions were not using Parser combinators and the aren't soooo ugly (in Dylan for example). I think they're just less readable and more error-prone because of index manipulation.

For fun and comparison, I also developed a similar parser in Scala:

def telemetryMessage = for { timestamp <- intNumber
accState <- accelState
turnState <- turningState
vehicleX <- doubleNumber
vehicleY <- doubleNumber
vehicleDir <- doubleNumber
vehicleSpeed <- doubleNumber
obstacles <- objects }
yield Telemetry(timestamp, accState(), turnState(),
vehicleX, vehicleY, vehicleDir,
vehicleSpeed, obstacles)

def accelState = (
"a" ^^^ Accelerating
| "b" ^^^ Braking
| "-" ^^^ Rolling

def turningState = (
"l" ^^^ SoftLeft
| "L" ^^^ HardLeft
| "-" ^^^ Straight
| "r" ^^^ SoftRight
| "R" ^^^ HardRight

def objects = rep(objectParser)
def objectParser: Parser[Object] = (
"b" ~ spaces ~> circleParser ^^ (new Boulder(_))
| "c" ~ spaces ~> circleParser ^^ (new Crater(_))
| "h" ~ spaces ~> circleParser ^^ (new Crater(_))
| "m" ~ spaces ~> martianParser

def circleParser = for { x <- doubleNumber
y <- doubleNumber
r <- doubleNumber }
yield new Circle(x, y, r)

def martianParser = for { circle <- circleParser
dir <- doubleNumber
speed <- doubleNumber }
yield new Martian(circle, dir, speed)

The interesting things in the Scala implementation, although it could seem a bit ugly at first sight, are:

  • the use of a sequence operator "~>" getting rid of what has already been parsed. Indeed, when we have parsed "b " we know we need to create a Boulder object. The 'b' character and the space are then not usefull anymore

  • the use of functions to transform the parsed string into a meaningful object, like "l" ^^^ SoftLeft, creating a new SoftLeft object (it's a case class) when parsing "l"

  • implicit definitions. In the example above I don't need to declare my parser for 'l' as letter("l"). Combining "l" with another parser automatically creates a parser for the character 'l'

Why XML?

One of the reasons for the ubiquitous adoption (and hate!) of XML is, I think, the general availability of parsers and binding libraries. This is why this often becomes the easiest option for configuration files (is it changing with YAML and JSON?).

It doesn't have to be that way. I hope that parser combinators can now bring another legitimate option when considering the implementation of an external mini-language for configuration or protocol. They're simple to use and readily available on your Java platform through Scala. Try them at home!