11 October 2011

Counting words

In this 3 parts post I want to show:

  1. how standard, out-of-the-box, Scala helped me to code a small application
  2. how Functional Reactive Programming brings a real improvement on how the GUI is built
  3. how to replace Parser Combinators with the Parboiled library to enhance error reporting

You can access the application project via github.

I can do that in 2 hours!

That's more or less what I told my wife as she was explaining one small problem she had. My wife is studying psychology and she has lots of essays to write, week after week. One small burden she's facing is keeping track of the number of words she writes because each essay must fit in a specific number of words, say 4000 +/- 10%. The difficulty is that quotations and references must not be counted. So she cannot check the file properties in Word or Open Office and she has to keep track manually.

For example, she may write: "... as suggested by the Interpretation of Dreams (Freud, 1905, p.145) ...". The reference "(Freud, 1905, p.145)" must not be counted. Or, "Margaret Malher wrote: "if the infant has an optimal experience of the symbiotic union with the mother, then the infant can make a smooth psychological differentiation from the mother to a further psychological expansion beyond the symbiotic state." (Malher cited in St. Clair, 2004, p.92)" (good luck with that :-)). In that case the quotation is not counted either and we must only count 3 words.

Since this counting is a bit tedious and has to be adjusted each time she does a revision of her essay, I proposed to automate this check. I thought "a few lines of Parser Combinators should be able to do the trick, 2 hours max". It actually took me a bit more (tm) to:

  • write a parser robust enough to accommodate for all sorts of variations and irregularities. For example, pages can be written as "p.154" or "pp.154-155", years can also be written "[1905] 1962" where 1905 is the first edition, and so on

  • use scala-swing to display the results: number of words, references table, file selection

  • write readers to extract the text from .docx or .odt files

Let's see now how Scala helped me with those 3 tasks.

Parsing the text

The idea behind parser combinators is very powerful. Instead of building a monolithic parser with lots of sub-routines and error-prone tracking of character indices, you describe the grammar of the text to parse by combining smaller parsers in many different ways.

In Scala, to do this, you need to extend one of the Parsers traits. The one I've choosen is RegexParsers. This is a parser which is well suited for unstructured text. If you were to parse something more akin to a programming language you might prefer StdTokenParsers which already define keywords, numeric/string literals, identifiers,...

I'm now just going to comment on a few points regarding the TextParsing trait which is parsing the essay text. If you want to understand how parser combinators work in detail, please read the excellent blog post by Daniel Spiewak: The Magic behind Parser Combinators.

The main definition for this parser is:

  def referencedText: Parser[Results] =
    rep((references | noRefParenthesised | quotation | words | space) <~ opt(punctuation)) ^^ reduceResults

This means that I expect the text to be:

  • a repetition (the rep combinator) of a parser

  • the repeated parser is an alternation (written |) of references, parenthetised text, quotations, words or spaces. For each of these "blocks" I'm able to count the number of words. For example, a reference will be 0 and parenthetised text will be the number of words between the parentheses

  • there can be a following punctuation sign (optional, using the opt combinator), but we don't really care about it, so it can be discarded (hence the <~ combinator, instead of ~ which sequences 2 parsers)

Then I have a function called reduceResults taking the result of the parsing of each repeated parser, to create the final Result, which is a case class providing:

  • the number of counted words
  • the references in the text
  • the quotations in the text

Using the RegexParser trait is very convenient. For example, if I want to specify how to parse "Pages" in a reference: (Freud, 1905, p.154), I can sequence 2 parsers built from regular expressions:

  val page = "p\\.*\\s*".r ~ "\\d+".r
  • appending .r to a string returns a regular expression (of type Regex)
  • there is an implicit conversion method in the RegexParsers trait, called regex from a Regex to a Parser[String]
  • I can sequence 2 parsers using the ~ operator

The page parser above can recognize page numbers like p.134 or p.1 but it will also accept p134. You can argue that this is not very well formatted, and my wife will agree with you. However she certainly doesn't want to see the count of words being wrong or fail just because she forgot a dot! The plan here is to display what was parsed so that she can eventually fix some incorrect references, not written according to the academia standards. We'll see, in part 3 of this series how we can use another parsing library to manage those errors, without breaking the parsing.

One more important thing to mention about the use of the RegexParsers trait is the skipWhitespace method. If it returns true (the default), any regex parser will discard space before any string matching a regular expression. This is convenient most of the time but not here where I need to preserve spaces to be able to count words accurately.

To finish with the subject of Parsing you can have a look at the TextParsingSpec specification. This specification features a ParserMatchers trait to help with testing your parsers. It also uses the Auto-Examples feature of specs2 to use the text of the example directly as a description:

  "Pages"                                                                           ^
  { page must succeedOn("p.33") }                                                   ^
  { page must succeedOn("p33") }                                                    ^
  { pages must succeedOn("pp.215-324") }                                            ^
  { pages must succeedOn("pp.215/324") }                                            ^
  { pages("pp. 215/324") must haveSuccessResult(===(Pages("pp. 215/324"))) }        ^

Displaying the results

The next big chunk of this application is a Swing GUI. The Scala standard distribution provides a scala-swing library adding some syntactic sugar on top of regular Swing components. If you want to read more about Scala and Swing you can have a look at this presentation.

The main components of my application are:

  • a menu bar with 2 buttons to: select a file, do the count
  • a field to display the currently selected file
  • a results panel showing: the number of counted words and the document references

wordcount application
count example, note that the parsing is not perfect since the word counts do not add up!

If you have a look at the code you will see that this translates to:

  • an instance of SimpleSwingApplication defining a top method and including all the embedded components: a menu bar, a panel with the selected file and results
  • the subcomponents themselves: the menu items, the count action, the results panel
  • the "reactions" which is a PartialFunction listening to the events created by some components, SelectionChanged for example, and triggering the count or displaying the results

I was pretty happy to see that much of the verbosity of Swing programming is reduced with Scala:

  • you don't need to create xxxListeners for everything
  • there are components providing both a Panel and a LayoutManager with the appropriate syntax to display the components: FlowPanel, BoxPanel, BorderPanel
  • thanks to scala syntax you can write action = new Action(...) instead of setAction(new Action(...))

This is nice but I think that there is a some potential for pushing this way further and create more reusable out-of-the-box components. For example, I've created an OpenFileMenuItem which is a MenuItem with an Action to open a FileChooser. Also, something like a pervasive LabeledField with just a label and some text would very useful to have in a standard library.

I also added a bit of syntactic sugar to have actions executed on a worker thread, instead of the event dispatch thread (to avoid grey screens), using the SwingUtilities.invokeLater method. For example: myAction.inBackground will be executed on a separate thread.

Eventually, I was able to code up the GUI of the application pretty fast. The only thing which I didn't really like was the Publish/React pattern. It felt a bit messy. The next part of this series will show how Functional Reactive Programming with the reactive library helped me write cleaner code.

Reading the file

I anticipated this part to be a tad difficult. My first experiments of text parsing were using a simple text file and I knew that having the user (my wife, remember,...) copy and paste her text to another file just for counting would be a deal-breaker. So I tried to read .odt and .docx files directly. This was actually much easier than anything I expected!

Both formats are zipped xml files. Getting the content of those files is just a matter of:

  • reading the ZipFile entries and find the file containing the text

    val rootzip = new ZipFile(doc.path)
  • loading the xml as a NodeSeq

  • find the nodes containing the actual text of the document and transform them to text

    // for a Word document text is under <p><t> tags
    (xml \\ "p") map (p => (p \\ "t").map(_.text) mkString "") mkString "\n"

For further details you can read the code here.


That's it, parser combinators + scala-swing + xml = a quick app solving a real-world problem. In the next posts we'll try to make this even better!


Unknown said...

Great post! Looking forward to the next installment.

tporcham said...

Link to DocReader.scala isn't working. Should be

Eric said...

I fixed it, thanks.

Bart Schuller said...

For the ODT format, if you want to have the correct word count, you'll also need the "h" tags for the headings.

Eric said...

Good to know. So far my wife hasn't used headers because I think she's not supposed to.

But your point makes me realize that this bug would not be caught and is not easy to catch.

Unknown said...

Really great