27 July 2008

My first ICFP contest, Part 2: parser combinators are so cool

This is the sequel to my last post on participating to the ICFP contest 2008.

The Setup

As Curt wrote it, we entered the contest doing things that should have been done well before:
  • setting up the wifi access so that I can connect with my laptop (it never worked so Bryan had to get out and buy new network cables)
  • exchanging keys to access the Subversion repository
  • learning how to use the build and test system
I guess that we need to make some mistakes (at least) once oneself before getting it!

However, the biggest lesson for me in this contest, was the use of Parser Combinators.

Do you speak Martian?

The second programming task in the contest, after setting up the network connection to be able to receive messages and send back commands, was to parse the input messages giving the position of the Martian rover, and its immediate environment.

As the task specifies, there are 5 kinds of messages:
  • initialization messages -> what's the map like? what are you technical constants?
  • telemetry stream -> where are you, what's your speed, what's around?
  • adverse event -> crashed in a crater!
  • success -> yes, home!
  • end of run -> time and status
The format of those messages is not particularly complex but the telemetry message can contain an unlimited number of obstacles descriptions for example.

For many developers, the most obvious way to transforms meaningless strings to meaningful objects like Telemetry or Martian, would be to start traversing the string, check for characters like 'T' for Telemetry, or 'I' for Initialization then pass the rest of the string to another function and continue slicing the string until all data is analyzed.

Parser combinators

But Haskell developers know better,... They can just describe the grammar of the messages they're parsing by combining parsers, then applying the resulting parser to the string input.

Something like that:

telemetry :: Parser Message
telemetry = do ts <- int
state <- accelStateParser
turning <- turningStateParser
x <- double
y <- double
dir <- double
speed <- double
obstacles <- obstaclesParser
return $ Telemetry ts state turning x y dir speed obstacles

accelStateParser :: Parser AccelState
accelStateParser = (accelerating <|> braking <|> rolling)

accelerating = stateParser 'a' Accelerating
braking = stateParser 'b' Braking
rolling = stateParser '-' Rolling

stateParser :: Char -> a -> Parser a
stateParser c cons = do _ <- char c
return cons

turningStateParser :: Parser TurningState
turningStateParser = hardleft <|> softleft
<|> hardright <|> softright
<|> straight

hardleft = stateParser 'L' HardLeft
softleft = stateParser 'l' SoftLeft
hardright = stateParser 'R' HardRight
softright = stateParser 'r' SoftRight
straight = stateParser '-' Straight
obstaclesParser :: Parser [Object]
obstaclesParser = do l <- many obstacleParser
return l

obstacleParser :: Parser Object
obstacleParser = do code <- oneOf "bchm"
case code of
'b' -> object Boulder
'c' -> object Crater
'h' -> object Home
'm' -> martian
_ -> error "Can't get here in parser!"

object :: (Circle -> a) -> Parser a
object cons = do x <- double
y <- double
r <- double
return $ cons (Circle (x:.y) r)

martian :: Parser Object
martian = do x <- double
y <- double
dir <- double
speed <- double
return $ Martian (Circle (x:.y) 0.4) dir speed

In the above code, you can read that the parser for telemetry messages is a sequence of different parsers: a timestamp parser, a acceleration state parser, a turning state parser, ..., a parser for lists of obstacles. And a parser for list of obstacles is just "many" parsers for one obstacle. This is indeed as easy as being able to describe a grammar for that mini-language.

"k" should better be 1

But I didn't say that describing a grammar was easy! For instance, at first, we made a mistake in the way we managed spaces. We had almost each parser trying to consume spaces before doing its job. This generated unnecessary ambiguity in the parsing.

For example, let's say that you want to be able to parse this kind of message:

" O T O O T"

(honestly, I don't know why you would like to do something like that but you're the boss here)

You'd better combine those 3 parsers:
  • spaces -> will consume any number of spaces

  • oneParser = char 'O' >>= spaces -> will consume 'O' followed by any number of spaces

  • twoParser = char 'T' >>= spaces -> will consume 'T' followed by any number of spaces

If you define parsers differently, having "oneParser" and "twoParser" consuming spaces before they parse their character, which parser should be used on the first space of the message?

Actually, parser combinators in Haskell are also capable of backtracking and "unparsing" results if the chosen parser was not appropriate (trying things with the the "try" parser combinator). But why add more complexity when things can be described more directly?

Of course, there is a formal theory behind the example above: LL(k) grammars, where LL(1) is a type of grammar only needing to look at the next character to know which parser to use. But there's nothing like a simple "real-life" example to be able to grasp it.

Compare and contrast

Now, honestly, most other contest solutions were not using Parser combinators and the aren't soooo ugly (in Dylan for example). I think they're just less readable and more error-prone because of index manipulation.

For fun and comparison, I also developed a similar parser in Scala:

def telemetryMessage = for { timestamp <- intNumber
accState <- accelState
turnState <- turningState
vehicleX <- doubleNumber
vehicleY <- doubleNumber
vehicleDir <- doubleNumber
vehicleSpeed <- doubleNumber
obstacles <- objects }
yield Telemetry(timestamp, accState(), turnState(),
vehicleX, vehicleY, vehicleDir,
vehicleSpeed, obstacles)

def accelState = (
"a" ^^^ Accelerating
| "b" ^^^ Braking
| "-" ^^^ Rolling

def turningState = (
"l" ^^^ SoftLeft
| "L" ^^^ HardLeft
| "-" ^^^ Straight
| "r" ^^^ SoftRight
| "R" ^^^ HardRight

def objects = rep(objectParser)
def objectParser: Parser[Object] = (
"b" ~ spaces ~> circleParser ^^ (new Boulder(_))
| "c" ~ spaces ~> circleParser ^^ (new Crater(_))
| "h" ~ spaces ~> circleParser ^^ (new Crater(_))
| "m" ~ spaces ~> martianParser

def circleParser = for { x <- doubleNumber
y <- doubleNumber
r <- doubleNumber }
yield new Circle(x, y, r)

def martianParser = for { circle <- circleParser
dir <- doubleNumber
speed <- doubleNumber }
yield new Martian(circle, dir, speed)

The interesting things in the Scala implementation, although it could seem a bit ugly at first sight, are:

  • the use of a sequence operator "~>" getting rid of what has already been parsed. Indeed, when we have parsed "b " we know we need to create a Boulder object. The 'b' character and the space are then not usefull anymore

  • the use of functions to transform the parsed string into a meaningful object, like "l" ^^^ SoftLeft, creating a new SoftLeft object (it's a case class) when parsing "l"

  • implicit definitions. In the example above I don't need to declare my parser for 'l' as letter("l"). Combining "l" with another parser automatically creates a parser for the character 'l'

Why XML?

One of the reasons for the ubiquitous adoption (and hate!) of XML is, I think, the general availability of parsers and binding libraries. This is why this often becomes the easiest option for configuration files (is it changing with YAML and JSON?).

It doesn't have to be that way. I hope that parser combinators can now bring another legitimate option when considering the implementation of an external mini-language for configuration or protocol. They're simple to use and readily available on your Java platform through Scala. Try them at home!


Sanny Sanoff said...

just curious about "double" and "int" parsers you use; could you provide reference or implementation plz.

p.s. Scala implementation is shorter, what a piece of art!

Eric said...

You're right Sanny, I left them out but here they are:

def intNumber: Parser[Int] = wholeNumber <~ spaces ^^ (java.lang.Integer.parseInt(_.mkString))

def doubleNumber: Parser[Double] = floatingPointNumber <~ spaces ^^ (java.lang.Double.parseDouble(_))

And the "spaces" parser too:

def spaces = rep(whiteSpace)

Daniel Spiewak said...

Personally, I think that parser combinators are probably the smartest innovation in the entire world of functional programming. It is interesting that the Haskell syntax is quite a bit cleaner (no need for tildes), but the Scala implementation is shorter. It could just be language familiarity, but I like to think that it's more a factor of Scala's hybrid power. :-)

Sanny Sanoff said...

Thanks, Eric. But I was asking about haskell's "double" and "int" parsers ;-).

Apart from "integer haskell" (which is too hi-level), I could not find any standard ones.


Eric said...

This is how we did it in Haskell for the int and double parsers:

int :: Parser Int
int = do digits <- many digit
return $ read digits

double :: Parser Double
double = do digits <- many doubleChars
return $ read digits
where doubleChars = oneOf "-.0123456789"

You can notice that the double parser is far from perfect since it will try to parse strings like "1-23.45-6." before failing on "read".

Eric said...


I also love Haskell's syntax. It is truly minimal *and* powerful.

The funny thing is that I don't like the infix notation with backquotes, which you don't need in Scala. On the other hand, in Haskell, you can write "function parameter" which you can't in Scala.

That would be very useful to allow this in some embedded Scala DSLs. But I guess this is not possible to have both notations available in a language with functions as first class objects.

Regarding the Scala implementation size, it seems that the Haskell implementation could also be shortened using Control.Applicative.*> . This is similar to ^^^ in Scala as far as I understand.

Similarly there should be a way to use type classes to get the same effect as implicits for letter parsers, but I haven't experimented with that yet.

Now that the contest is over, I need to find some other pet project to continue training in Haskell. Err, no. Actually I mostly need time!