06 May 2010

Mini-Parsers to the rescue

An example of curryfication

I'm implementing a reasonably complex algorithm at the moment with different transformation phases. One transformation is a "curryfication" of some terms to be able to transform some expressions like:

f(a, b, c)


.(.(.(f, a), b), c)

Being test-compulsive, I want to be able to test that my transformation works. Unfortunately the data structures I'm dealing with are pretty verbose in my tests.

One of my specs examples was this:

"A composed expression, when curried" should {
"be 2 successive applications of one parameter" +
"when there are 2 parameters to the method expression" in {
ComposedExp(MethodEx(method), const :: arb :: Nil).curryfy must_==
Apply(Apply(Curry(method), Curry(const)), Curry(arb))

But Apply(Apply(Curry(method), Curry(const)), Curry(arb))) is really less readable than .(.(method, const), arb). And this can only get worse on a more complex example.

With the help of a mini-parser

So I thought that I could just write a parser to recreate a "Curried" expression from a String.

A first look at the Scala 2.8.0 Scaladoc(filter with "parser") was a bit scary. Especially because my last parser exercise was really a long time ago now.

But no, parser combinators and especially small, specific parsers like the one I wrote, are really straightforward:

object CurriedParser extends JavaTokenParsers {
val parser: Parser[Curried] = application | constant
val constant = ident ^^ { s => Curry(s) }
val application = (".(" ~> parser) ~ (", " ~> constant <~ ")") ^^ { case a ~ b =>
Apply(a, b)
def fromString(s: String): Curried = parser.apply(new CharSequenceReader(s)).get

In the snippet above I declare that:

  • my parser returns a Curried object

  • my parser is either an application or a constant (with the | operator)

  • a constant is a Java identifier (ident is a parser inherited from the JavaTokenParsers trait)

  • a constant should be transformed to a Curry object

  • an application is composed of two parts (separated by the central ~ operator). The first part is: some syntax ("(") and the result of something to parse recursively with the parser (i.e. an application or a constant). The second part is a constant, surrounded by some syntax (", " and ")"). Note that the syntactic elements are being discarded by using ~> and <~ instead of ~.

  • once an application is parsed, part 1 and part 2 are accessible as matchable object of the form a ~ b and this object can be used to build an Apply object

  • the fromString method simply passes a String to the parser and gets the result

I have to say that this parser is really rudimentary. It doesn't handle errors in the input text, there absolutely needs to be a space after the comma, and so on.

Yet it really fits my testing purpose for a minimum amount of development time.

Open question

I hope that this post can serve as an example to anyone new to Scala wanting to play with parsers and I leave an open question for senior Scala developers:

Is there a way to extends the case classes representing algebraic datatypes in Scala so that:
  • each case class has a proper toString representation (that's already the case and that's one benefit of case classes), but that representation can be overriden (for example to replace Apply(x, y) with .(x, y))

  • there is an implicit parser that is able to reconstruct the hierarchy of objects from its string representation


dramage said...

A related approach that might or might not answer your questions.

David Hall and I built up a (non-Parser based) serialization architecture that uses implicit conversion to ReadWritable[T] objects to marshal and unmarshall (rich) data types to a variety of formats, including strings. It's in our ScalaNLP project.

Check out the scalanlp.serialization package. I think you should be able to set up a custom TextSerialization.ReadWritable[Apply] function (define it as an implicit val in Apply's companion object). Then you'd just call TextSerialization.fromString[Apply](yourStringRep). The current TextSerialization package doesn't use scala's parser package, so the guts of the parsing certainly won't be as pretty as what you've described in your post, but our approach might spark some ideas.

Eric said...

Great, I'll have a look at it. BTW, I looked at the ScalaNLP source and I noticed that you switched from specs to ScalaTest. Is there anything that I could improve on specs that you were missing?

David Hall said...

re: specs, I think I made the switch a while back, and I'm a little hazy on the exact thing that made me actually make the switch. (Was ScalaTest the first to switch to 2.8? Or maybe JUnit integration? I dunno.)

However, I personally prefer the chain of asserts (or chains of boolean expressions) to the pseudo-English DSL in Specs. Just a matter of taste in API, was all.

Daniel Spiewak said...

I think I would probably shorten the parser up a bit, mostly for reasons of clarity:

lazy val parser = (
| "." ~> "(" ~> parser ~ ("," ~> constant <~ ")") ^^ flatten2(Apply)
val constant = ident ^^ Curry

Take advantage of the fact that case class companion objects extend their corresponding function types. :-)

More importantly though, I've changed your ".(" to ("." ~ "("). This is significant because it means that you can now put whitespace between "." and "(", whereas before this was not possible. I've also removed the unnecessary space from the ", " literal, making it possible to define parameters *without* any whitespace padding the comma.

Eric said...

Thanks a bunch for that comment Daniel, I realized none of this:

- case classes extending function types

- the management of whitespace with ~>


Daniel Spiewak said...

Actually, ~> and <~ have nothing to do with whitespace; they're just a convenient way to drop specific parsers from the aggregated parameter list. :-)

Eric said...

Hmmm, then I don't get how the whitespace can be dropped. How are they parsed?

Daniel Spiewak said...

> Hmmm, then I don't get how the whitespace can be dropped. How are they parsed?

Sneakily. :-) JavaTokenParsers inherits from RegexParsers which defines the implicit `literal` and `regex` parser constructors. These methods synthesize parsers which consume substrings exactly matching a given string or regular-expression (respectively).

The trick is that before they do this matching, they *first* check the value of `skipWhitespace`, and if true, they run the input stream through the `handleWhitespace` function. Thus, any *prefix* whitespace is stripped off of tokens before they are parsed. Suffix whitespace is handled using a special case at the end of the parse (when using `parseAll`).

Eric said...

Thanks for the explanation, I see now. I was re-reading your post: and I see that this was also one of the questions in the comments.

I'll definitely go to bed less ignorant today!