
24 December 2014

specs2-three

This post presents the next major version of specs2, specs2 3.0:

  • what are the motivations for a major version?
  • what are the main benefits and changes?
  • when will it be available?

The motivations

I started working on this new version a bit more than one year ago now. I had lots of different reasons for giving specs2 a good face-lift.

The Open Source reason

specs2 has largely been the effort of a single person. This probably has some advantages, like the possibility of maintaining some kind of vision for the project, but also lots of drawbacks. Quality is one of them.

As a programmer I have all sorts of shortcomings. I have always been amazed by other people taking a look at my code and spotting obvious deficiencies, big or small (for example @jedws introduced named threads to me, much easier for debugging!). I want to maximize the chances that other people will jump in to fix and extend the library as necessary (and to be able to go on holidays for 3 weeks without a laptop :-)). Improving the code base can only help other people review my implementation.

The Design reason

In specs2 I’ve had this vision of a flow of “specification fragments” that would get created, executed, and then reported, possibly with different reporters for various output formats. This is not as easy as it seems:

  • I want fragments to be executed concurrently while being printed in sequence
  • the fragments should also be displayed as soon as executed
  • I want to be able to recreate a view of the sequence of fragments as a tree to be displayed in IDEs like Eclipse and Intellij

This is all done and working in specs2 < 3.0 but in a very clumsy way, subverting Scalaz Reducers to maintain state and try to compose reporters.

One of these reporters is an HTML reporter and I’ve always wanted to improve it. This was not something I was eager to change given the situation a year ago. Luckily scalaz-stream version 0.2 came out in December 2013 and allowed me to try out new ideas.

The Functional Programming reason

The major difference between specs and specs2 was the use of immutable data structures and the avoidance of exceptions for control flow. Yet there was still lots of side-effects!

I hadn’t fully grasped how to use the IO monad to structure my program. Fortunately I happen to work with the terrific @markhibberd, who showed me how to use a proper monad stack not only to track IO effects but also to thread in configuration data and track errors.

The main benefits and changes

First of all, a happy maintainer! That goes without saying but my ability to fix bugs and add features will be improved a lot if I can better reason about the code :-).

Now for users…

For casual users there should be no changes! If you just use org.specs2.Specification or org.specs2.mutable.Specification with no other traits, you should not see any change (except in the User Guide, see below). For “advanced” users there are new benefits and API changes (in no particular order).

Refactored user guide

The existing User Guide has been divided into a lot more pages (around 60) and follows a pedagogical progression:

  • a Quick Start presenting a simple specification (and the mandatory link to the installation page)

  • some links from the Quick Start to the most common concepts: what is the structure of a Specification? Which matchers are available? How to run a Specification?

  • then, on each other page there is a presentation focusing on one topic plus additional links: "Now learn how to..." (what is the next thing you will probably need?) and "If you want to know more" (what is some more advanced topic that is related to this one?)

In addition to this refactoring there are some “tools” to help users find what they are looking for faster:

  • a search box

  • reference pages to summarize in one place some topics (matchers and run arguments for example)

  • a Troubleshooting page with the most common issues

You can have a first look at it here.

Generalized Reader pattern

One consequence of the “functional” re-engineering is that the environment is now available at different levels. By “environment”, I mean the org.specs2.specification.core.Env class which gives you access to all the components necessary to execute a specification, among which:

  • the command line arguments
  • the lineLogger used to log results to the console (from Sbt)
  • the systemLogger used to log issues when instantiating the Specification for example
  • the execution environment, containing a reference to the thread pool used to execute the examples
  • the statsRepository to get and store execution statistics
  • the fileSystem which mediates all interactions with the file system (to read and write files)

I doubt that you will ever need all of this, but parts of the environment can be useful. For example, you can define the structure of your Specification based on command line arguments:

class MySpec extends Specification with CommandLineArguments { def is(args: CommandLine) = s2"""
  Do something here with a command line parameter ${args.valueOr("parameter1", "not found")}
"""
}

The CommandLineArguments trait uses your definition of the def is(args: CommandLine): Fragments method to build a more general function Env => Fragments, which is the internal representation of a Specification (fragments that depend on the environment). This means that you no longer have to skip examples based on a condition (isDatabaseAvailable for example), you can simply remove them!
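
For instance, here is a minimal sketch of removing examples based on a command line flag (the database flag and the readRow example are just illustrations):

class ConditionalSpec extends Specification with CommandLineArguments { def is(args: CommandLine) =
  // when the flag is false the database example is simply not part of the specification
  if (args.boolOr("database", false)) s2"""
    read a row from the database $readRow
    """
  else s2"""
    (no database available, so no database examples)
    """

  def readRow = ok
}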

You can also use the environment, or part of it, to define examples:

class MySpec extends Specification { def is = s2"""

 Here are some examples using the environment.
 You can access
   the full environment                            $e1
   the command line arguments                      $e2
   the execution context to create a Scala future  $e3
   the executor service to create a Scalaz future  $e4

"""

  def e1 = { env: Env =>
    env.statisticsRepository.getStatistics(getClass.getName).runOption.flatten.foreach { stats =>
      println("the previous results for this specification are "+stats)
    }
    ok
  }

  def e2 = { args: CommandLine =>
    if (args.boolOr("doit", false)) success
    else skipped
  }

  def e3 = { implicit executionContext: ExecutionContext =>
    scala.concurrent.Future(1); ok
  }

  def e4 = { implicit executorService: ExecutorService =>
    scalaz.concurrent.Future(1); ok
  }
}

Better reporting framework

This paragraph is mostly relevant to people who want to extend specs2 with additional outputs. The reporting framework has been refactored around 4 concepts:

A Runner (for example the SbtRunner)

  • instantiates the specification class to execute
  • creates the execution environment (arguments, thread pool)
  • instantiates a Reporter
  • instantiates Printers and starts the execution

A Reporter

  • reads the previous execution statistics if necessary
  • selects the fragments to execute
  • executes the specification fragments
  • calls the printers for printing out the results
  • saves the execution statistics

A Printer

  • prepares the environment for printing
  • uses a Fold to print or to gather execution data. For example the TextPrinter prints results to the console as soon as they are available, while the HtmlPrinter gathers execution data so that the HTML pages can be created at the end

A Fold

  • has a Sink[Task, (T, S)] (see scalaz-stream for the definition of a Sink) to perform side-effects (like writing to a file)
  • has a fold: (T, S) => S method to accumulate some state (to compute statistics for example, or create an index)
  • has an init: S element to initialize the state
  • has a last(s: S): Task[Unit] method to perform one last side-effect with the final state once all the fragments have been executed
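
As a rough sketch, here is a simplified rendition of that interface (this is not the actual specs2 definition, just the shape described above, assuming scalaz-stream is on the classpath):

import scalaz.concurrent.Task
import scalaz.stream.Sink

trait SimpleFold[T] {
  type S                        // the type of the accumulated state
  def sink: Sink[Task, (T, S)]  // performs side-effects as elements go through
  def fold: (T, S) => S         // accumulates the state
  def init: S                   // the initial state
  def last(s: S): Task[Unit]    // one last side-effect with the final state
}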

It is unlikely that you will create a new Runner (except if you build an Eclipse plugin for example) but you can create custom reporters and printers by passing the reporter <classname> and printer <classname> options as arguments. Note also that Folds are composable so if you need 2 outputs you can create a Printer that will compose 2 folds into 1.

The Html printer

Pandoc

The Html printer has been reworked to use Pandoc as a templating system and Markdown engine. I decided to move to Pandoc for several reasons:

  • Pandoc is one of the libraries that is officially endorsing the CommonMark format
  • I’ve had less corner cases with rendering mixed html/markdown with Pandoc than previously
  • Pandoc opens the possibility to render other markup languages than CommonMark, LaTeX for example

However this comes with a huge drawback: you need to have Pandoc installed as a command-line tool on your machine. If Pandoc is not installed, specs2 will use a default template renderer, but won’t render CommonMark.

Templates

I’ve extracted a specs2.html template (and a corresponding specs2.css stylesheet) and it is possible for you to substitute another template (with the html.template option) if you want your html files to be displayed differently. This template is using the Pandoc template system so it is pretty primitive but should still cover most cases.

Better API

The specs2 API has been split into a lot more traits to support various objectives:

  • support the new execution model with scalaz-stream

  • make it possible to separate the DSL methods from the core ones (see Lightweight spec)

  • offer a better Fragment API

Let’s start with the heart of specs2, the Fragment.

FragmentFactory methods

Advanced specs2 users need to tweak the creation of fragments. For example, when using a “template specification”:

abstract class DatabaseSpec extends Specification {
  override def map(fs: => Fragments): Fragments =
    step(startDb) ^ fs ^ step(closeDb)
}

In the DatabaseSpec you are using different methods to work with Fragments: the ^ method to append them and the step method to create a Step fragment. Those 2 methods are part of the Fragment API. Here is a list of the main changes compared to specs2 < 3.0:

  • first of all there is only one Fragment type (instead of Text, Step, Example,…). This type contains a Description and an Execution. By combining different types of Descriptions and Executions it is possible to recreate all the previous specs2 < 3.0 types

  • however you don’t need to create a Fragment by yourself, what you do is invoke the FragmentFactory methods: example, step, text,… This now unifies the notation between immutable and mutable specifications because in specs2 < 3.0 you would write step in a mutable specification and Step in an immutable one (Step is now deprecated)

  • there is no ExampleFactory trait anymore since it has been subsumed by methods on the FragmentFactory trait (so this will break code for people who were intercepting Example creation to inject additional behaviour)

Finally those “core” objects have been moved under the org.specs2.specification.core package, in order to restructure the org.specs2.specification package into

  • core: Fragment, Description, SpecificationStructure

  • dsl: all the syntactic sugar FragmentsDsl, ExampleDsl, ActionDsl

  • process: the “processing” classes Selector, Executor, StatisticsRepository

  • create: traits used to create the specification: FragmentFactory, AutoExamples, S2StringContext (for s2 string interpolation)…

FragmentsDsl methods

When you want to assemble Fragments together you will need the FragmentsDsl trait to do so (it is mixed into the Specification trait, so you don’t have to add it yourself).

The result of appending 2 Fragments is a Fragments object. The Fragments class has changed in specs2 3.0. It doesn’t hold a reference to the specification title and the specification arguments anymore; this is now the role of the SpecStructure. So in summary:

  • a Specification is a function Env => SpecStructure

  • a SpecStructure contains: a SpecHeader, some Arguments and Fragments

  • Fragments is a sequence of Fragment elements (actually a scalaz-stream Process[Task, Fragment])

The FragmentsDsl API allows you to combine almost everything into Fragments with the ^ operator, as sketched just after this list:

  • a String to a Seq[Fragment]
  • 2 Fragments
  • 1 Fragments and a String
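
For example, a small sketch (written inside a specification, with illustrative descriptions and results):

val fs: Fragments =
  "The 'Hello world' string should" ^    // a String prepended to fragments
    ("contain 11 characters" ! ok)  ^    // two example Fragments appended
    ("start with 'Hello'"    ! ok)  ^
  "and that's all for this group"        // a Fragments and a String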

One advantage of this fine-grained decomposition of the fragments API is that there is now a lightweight Spec trait.

Lightweight Spec trait

Compilation times can be a problem with Scala and specs2 makes it worse by providing lots of implicit methods in a standard Specification to provide various DSLs. In specs2 3.0 there is a Spec trait which contains a reduced number of implicits to:

  • create an s2 string for an “Acceptance Specification”
  • create should and in blocks in a “Unit Specification”
  • create expectations with must
  • add arguments to the specification (like sequential)
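
As a minimal sketch, a specification using just that trait could look like this (assuming must_== is among the retained implicits):

import org.specs2.Spec

class LightSpec extends Spec { def is = s2"""
  one plus one gives two $e1
"""
  def e1 = (1 + 1) must_== 2
}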

If you use that trait and you find yourself missing an implicit you will have to either:

  • use the Specification class instead

  • search specs2 for the trait or object providing the missing implicit. There is no magic recipe for this but the MustMatchers trait and the S2StringContext trait should bring most of the missing implicits in scope

It is possible that this trait will be adjusted to find the exact balance between expressivity and compile times but I hope it will remain pretty stable.

Durations

When specs2 started, the package scala.concurrent.duration didn’t exist. This is why there was a Duration type in specs2 < 3.0 and a TimeConversions trait. Of course this introduced annoying collisions with the implicits coming from scala.concurrent.duration when that one came around.

There is no reason to keep using specs2 Durations now, so you can use the standard Scala durations everywhere specs2 expects a Duration.
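
For example (a sketch inside a specification; the exact await parameters are assumptions, see the section on futures further down):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// a standard Scala FiniteDuration is passed directly to a specs2 matcher
def e1 = Future(1 + 1) must be_==(2).await(retries = 0, timeout = 2.seconds)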

Contexts

Context management has been slowly evolving in specs2. In specs2 3.0 we end up with the following traits:

  • BeforeAll do something before all the examples (you had to use a Step in specs2 < 3.0)
  • BeforeEach do something before each example (was BeforeExample in specs2 < 3.0)
  • AfterEach do something after each example (was AfterExample in specs2 < 3.0)
  • BeforeAfterEach do something before/after each example (was BeforeAfterExample in specs2 < 3.0)
  • ForEach[T] provide an element of type T (a “fixture”) to each example (was FixtureExample[T] in specs2 < 3.0)
  • AfterAll do something after all the examples (you had to use a Step in specs2 < 3.0)
  • BeforeAfterAll do something before/after all the examples (you had to use a Step in specs2 < 3.0)
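
For example, here is a sketch using two of those traits (the abstract method names beforeAll and after, as well as the database helpers, are assumptions):

import org.specs2.Specification
import org.specs2.specification.{AfterEach, BeforeAll}

class DbSpec extends Specification with BeforeAll with AfterEach { def is = s2"""
  insert a row $insert
  query a row  $query
"""
  def beforeAll = startDb()    // runs once before all the examples
  def after     = cleanTable() // runs after each example

  def insert = ok
  def query  = ok

  // hypothetical helpers
  def startDb(): Unit = ()
  def cleanTable(): Unit = ()
}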

There are some other cool things you can do. For example set a time-out for all examples based on a command line parameter:

trait ExamplesTimeout extends EachContext with MustMatchers with TerminationMatchers {

  def context: Env => Context = { env: Env =>
    val timeout = env.arguments.commandLine.intOr("timeout", 1000 * 60).millis
    upTo(timeout)(env.executorService)
  }

  def upTo(to: Duration)(implicit es: ExecutorService) = new Around {
    def around[T : AsResult](t: =>T) = {
      lazy val result = t

      val termination =
        result must terminate(retries = 10,
                              sleep = (to.toMillis / 10).millis).orSkip((ko: String) => "TIMEOUT: "+to)

      if (!termination.toResult.isSkipped) AsResult(result)
      else termination.toResult
    }
  }
}

The ExamplesTimeout trait extends EachContext which is a generalization of the xxxEach traits. With the EachContext trait you get access to the environment to define the behaviour used to “decorate” each example. So, in that case, we use a timeout command line parameter to create an Around context that will timeout each example if necessary. You can also note that this Around context uses the executorService passed by the environment so you don’t have to worry about resource management for your Specification.

Included specifications

As I was reworking the implementation of specs2 I also looked for ways to simplify its internal model. In specs2 < 3.0 you can nest a specification inside another one. This adds significant complexity because a nested specification has its own arguments and its own title. For example, during the execution of the inner specification we need to be careful to override the outer specification arguments with the inner ones.

I decided to let go of this functionality in favor of a view of specifications as “referencing” each other, with 2 types of references:

  • “link” reference
  • “see” reference

The idea is to model dependency relationships with “link” and weaker relationships with “see” (when you just want to mention that some information is present in another specification).

Then there are 2 modes of execution:

  • the default one
  • the “all” mode

By default when a specification is executed, the Runner will try to display the status of “linked” specifications but not “see” specifications. If you use the all argument then we collect all the “linked” specifications transitively and run them respecting dependencies (if s1 has a link to s2, then s2 is executed first).

This is particularly important for HTML reporting when the structure of “link” references is used to produce a table of contents and “see” references are merely used to display HTML links.

Online specifications

I find this exciting even if I don’t know if I will ever use this feature! (it has been requested in the past though).

In specs2 < 3.0 there is a clear distinction between the “creation” time and the “execution” time of a specification. Once you have defined your examples you cannot add new ones based on your execution results. But wait! This is more or less the defining property of a Monad: “produce an action based on the value returned by another action”. Since specs2 3.0 uses a scalaz-stream Process under the covers, which is a Monad, it is now possible to do the following:

class WikipediaBddSpec extends Specification with Online { def is = s2"""

  All the pages mentioning the term BDD must contain a reference to specs2 $e1

"""

    def e1 = {
      val pages = Wikipedia.getPages("BDD")

      // if the page is about specs2, add more examples to check the links
      (pages must contain((_:Page) must mention("specs2"))) continueWith
       pagesSpec(pages)
    }

    /** create one example per linked page */
    def pagesSpec(pages: Seq[Page]): Fragments = {
      val specs2Links = pages.flatMap(_.getLinks).filter(_.contains("specs2"))

      s2"""
       The specs2 links must be active
       ${Fragments.foreach(specs2Links)(active)}
      """
    }

    def active(link: HtmlLink) =
      s2"""
      The page at ${link.url} must be active ${ link must beActive }"""
  }

The specification above is “dynamic” in the sense that it creates more examples based on the tested data. All Wikipedia pages for BDD must mention “specs2” and for each linked page (which we can’t know in advance) we create a new example specifying that the link must be active.

ScalaCheck

The ScalaCheck trait has been reworked and extended to provide the following features:

  • you can specify Arbitrary[T], Gen[T], Shrink[T], T => Pretty instances at the property level (for any or all of the arguments)
  • you can easily collect argument values by appending .collectXXX to the property (XXX depends on the argument you want to collect. collect1 for the first, collectAll for all)
  • you can override default parameters from the command line. For example pass scalacheck.mintestsok 10000
  • you can set individual before, after actions to be executed before and after the property executes to do some setup/teardown
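
For example, a sketch of a property using a per-argument Arbitrary and value collection (the setArbitrary1 name is an assumption, collectAll comes from the list above):

import org.specs2.{ScalaCheck, Specification}
import org.scalacheck.{Arbitrary, Gen}

class AdditionSpec extends Specification with ScalaCheck { def is = s2"""
  addition is commutative $commutative
"""
  def commutative = prop { (i: Int, j: Int) => (i + j) must_== (j + i) }
    .setArbitrary1(Arbitrary(Gen.choose(0, 100)))  // Arbitrary for the first argument only
    .collectAll                                    // collect the generated values
}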

Also, specs2 was previously doing some message reformatting on top of ScalaCheck, but now the original ScalaCheck messages are preserved to keep consistency between the 2 libraries.

Note: the ScalaCheck trait stays in the org.specs2 package but all the traits it depends on now live in the org.specs2.scalacheck package.

Bits and pieces

This section is about various small things which have changed with specs2 3.0:

Implicit context

There is no more implicit context when you use the .await method to match futures. This means that you have to either import an implicit execution context (scala.concurrent.ExecutionContext.Implicits.global for example) or use a function ExecutionContext => Result to define your examples:

s2"""
An example using an ExecutionContext $e1
"""

  def e1 = { implicit ec: ExecutionContext =>
    // use the context here
    ok
  }
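
Or, as a sketch of the first alternative, you can bring an implicit execution context into scope:

  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.Future

  def e2 = Future(1) must be_==(1).await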

Foreach methods

It is now possible to create several examples or results with a foreach method which does not return Unit:

// create several examples
Fragment.foreach(1 to 10)(i => "example "+i ! ok)

// create several examples with breaks in between
Fragments.foreach(1 to 10)(i => ("example "+i ! ok) ^ br)

// create several results for a sequence of numbers
Result.foreach(1 to 10)(i => i must_== i)

Removed syntax

  • (action: Any).before to create a “before” context is removed (same thing for after)
  • function.forAll to create a Prop from a function

Dependencies

  • specs2 3.0 uses scalacheck 1.12.1
  • you need to use a recent version of sbt, like 0.13.7
  • you need to upgrade to scalaz-specs2 0.4.0-SNAPSHOT for compatibility

Can I use it?

specs2 3.0 is now available as specs2-core-3.0-M2 on Sonatype. I am making it available for early testing and feedback. Please use the mailing-list or the github issues to ask questions and tell me if there is anything going wrong with this new version. I will incorporate your comments into this blog post, which will serve as a migration guide.
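
A minimal sbt sketch for trying it out (the resolver line is an assumption, only needed if the milestone has not reached Maven Central yet; -Yrangepos is only required if you use auto-examples):

// build.sbt
resolvers += Resolver.sonatypeRepo("releases")

libraryDependencies += "org.specs2" %% "specs2-core" % "3.0-M2" % "test"

scalacOptions in Test += "-Yrangepos"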

Special thanks

  • to Clinton Freeman who started the re-design of the specs2 home page more than one year ago and sparked this whole refactoring
  • to Pavel Chlupacek and Frank Thomas for patiently answering many of my questions about scalaz-stream
  • to Paul Chiusano for starting scalaz-stream in the first place!
  • to Mark Hibberd for his guidance with functional programming

20 June 2013

A Zipper and Comonad example

There are some software concepts which you hear about and after some time you roughly understand what they are. But you still wonder: "where can I use this?". "Zippers" and "Comonads" are like that. This post will show an example of:

  • using a Zipper for a list
  • using the Comonad cojoin operation for the Zipper
  • using the new specs2 contain matchers to specify collection properties

The context for this example is simply to specify the behaviour of the following function:

def partition[A](seq: Seq[A])(relation: (A, A) => Boolean): Seq[NonEmptyList[A]]

Intuitively we want to partition the sequence seq into groups so that all the elements in a group have "something in common" with at least one other element. Here is a concrete example:

val near = (n1: Int, n2: Int) => math.abs(n1 - n2) <= 1
partition(Seq(1, 2, 3, 7, 8, 9))(near)

> List(NonEmptyList(1, 2, 3), NonEmptyList(7, 8, 9))

Properties

If we want to encode this behaviour with ScalaCheck properties we need to check at least 3 things:

  1. for each element in a group, there exists at least one other related element in the group
  2. for each element in a group, there doesn't exist a related element in any other group
  3. 2 elements which are not related must end up in different groups

How do we translate this to some nice Scala code?

Contain matchers

"for each element in a group, there exists another related element in the same group"

prop { (list: List[Int], relation: (Int, Int) => Boolean) =>
  val groups = partition(list)(relation)
  groups must contain(relatedElements(relation)).forall
}

The property above uses a random list, a random relation, and does the partitioning into groups. We want to check that all groups satisfy the property relatedElements(relation). This is done by:

  • using the contain matcher
  • passing it the relatedElements(relation) function to check a given group
  • doing this check on all groups with forall

The relatedElements(relation) function we pass has type NEL[Int] => MatchResult[NEL[Int]] (type NEL[A] = NonEmptyList[A]) and is testing each group. What does it do? It checks that each element of a group has at least one other element that is related to it.

def relatedElements(relation: (Int, Int) => Boolean) = (group: NonEmptyList[Int]) => {
  group.toZipper.cojoin.toStream must contain { zipper: Zipper[Int] =>
    (zipper.lefts ++ zipper.rights) must contain(relation.curried(zipper.focus)).forall
  }
}

This function is probably a bit mysterious so we need to dissect it.

Zipper

In the relatedElements function we need to check each element of a group in relation to the other elements. This means that we need to traverse the sequence, while keeping the context of where we are in the traversal. This is exactly what a Zipper is good at!

A List Zipper is a structure which keeps the focus on one element of the list and can return the elements on the left or the elements on the right. So in the code above we transform the group into a Zipper with the toZipper method. Note that this works because the group is a NonEmptyList. This wouldn't work with a regular List because a Zipper cannot be empty, it needs something to focus on:

// a zipper for [1, 2, 3, 4, 5, 6, 7, 8, 9]
//     lefts      focus    rights
// [  [1, 2, 3]     4      [5, 6, 7, 8, 9]  ]
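
In code, here is a sketch of building such a zipper from a NonEmptyList (the values are illustrative):

import scalaz._, Scalaz._

val group  = NonEmptyList(1, 2, 3, 4, 5)
val zipper = group.toZipper      // focus on the first element
// zipper.focus  == 1
// zipper.lefts  == Stream()            (nothing on the left yet)
// zipper.rights == Stream(2, 3, 4, 5)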

We now have a Zipper focusing on one element of the group. But we don't want to test only one element, we want to test all of them, so we need to get all the possible zippers over the original group!

Cojoin

It turns out that there is a method doing exactly this for Zippers: it is called cojoin. I won't go into a full explanation of what a Comonad is here, but the important points are:

  • Zipper has a Comonad instance
  • Comonad has a cojoin method with this signature cojoin[A](zipper: Zipper[A]): Zipper[Zipper[A]]

Thanks to cojoin we can create a Zipper of all the Zippers, turn it into a Stream[Zipper[Int]] and do the checks that really matter to us:

def relatedElements(relation: (Int, Int) => Boolean) = (group: NonEmptyList[Int]) => {
  group.toZipper.cojoin.toStream must contain { zipper: Zipper[Int] =>
    val otherElements = zipper.lefts ++ zipper.rights
    otherElements must contain(relation.curried(zipper.focus))
  }
}

We get the focus of the Zipper, an element, and we check it is related to at least one other element in that group. This is easy because the Zipper gives us all the other elements on the left and on the right.

Cobind

If you know a little bit about Monads and Comonads you know that there is a duality between join in Monads and cojoin in Comonads. But there is also one between bind and cobind. Is it possible to use cobind to implement the relatedElements function then? Yes it is, and the result is slightly different (arguably less understandable):

def relatedElements(relation: (Int, Int) => Boolean) = (group: NonEmptyList[Int]) => {
  group.toZipper.cobind { zipper: Zipper[Int] =>
    val otherElements = zipper.lefts ++ zipper.rights
    otherElements must contain(relation.curried(zipper.focus))
  }.toStream must contain((_:MatchResult[_]).isSuccess).forall
}

In this case we cobind each zipper with a function that will check if there are related elements in the group. This gives us back a Zipper of results and we need to make sure that it is full of success values.

Second property

"for each element in a group, there doesn't exist a related element in another group"

prop { (list: List[Int], relation: (Int, Int) => Boolean) =>
  val groups = partition(list)(relation)
  groups match {
    case Nil          => list must beEmpty
    case head :: tail => nel(head, tail).toZipper.cojoin.toStream must not contain(relatedElementsAcrossGroups(relation))
  }
}

This property applies the same technique but now across groups of elements by creating a Zipper[NonEmptyList[Int]] instead of a Zipper[Int] as before:

def relatedElementsAcrossGroups(relation: (Int, Int) => Boolean) = (groups: Zipper[NonEmptyList[Int]]) =>
  groups.focus.list must contain { e1: Int =>
    val otherGroups = (groups.lefts ++ groups.rights).map(_.list).flatten
    otherGroups must contain(relation.curried(e1))
  }

Note that the ability to "nest" the new specs2 contain matchers is very useful in this situation.

Last property

Finally the last property is much easier because it doesn't require any context to be tested. For this property we just make sure that no element is ever related to another one and check that they end up partitioned into distinct groups.

"2 elements which are not related must end up in different groups"

prop { (list: List[Int]) =>
  val neverRelated = (n1: Int, n2: Int) => false
  val groups = partition(list)(neverRelated)
  groups must have size(list.size)
}

Conclusion

Building an intuition for those crazy concepts is really what counts. For me it was "traversal with a context". Then I was finally able to spot it in my own code.

08 June 2013

Specs2 2.0 - Interpolated - RC2

This is a quick update to present the main differences with specs2 2.0-RC1. I have been fixing a few bugs but more importantly I have:

  • made the Tags trait part of the standard specification
  • removed some arguments for reporting and made the formatting of specifications more granular

This all started with an issue on Github...

Formatting

Creating reports for specifications is a bit tricky. On one hand you have different possible "styles" for the specifications: "old" acceptance style (with the ^ operator), "new" acceptance style (with interpolated strings), "unit" style... Then, on the other hand, you want to report the results in the console, where information is logged on a line-by-line basis, and in HTML files, where newlines, whitespace and indentation all need great care.

I don't think I got it quite right yet, especially for HTML, but working on issue #162 forced me to make the specs2 implementation and API a bit more flexible. In particular, in specs2 < 2.0, you could set some arguments to control the display of the specification in the console and/or HTML. For example noindent is a Specification argument saying that you don't want the automatic indentation of text and examples. And markdown = false means that you don't want text to be parsed as Markdown before being rendered to HTML.

However issue 162 shows that setting formatting properties at the level of the whole specification doesn't play well with other features like specification inclusion. I decided to fix this issue by using an existing specs2 feature: tags.

Tags and Specification

Tags in specs2 are different from tags you can find in other testing libraries. Not only can you tag single examples, but you can also mark a full section of a specification with some tags. We can use this capability to select specific parts of a specification for execution, but we can also use it to direct the formatting of the specification text. For example you can now write:

class MySpec extends Specification { def is = s2""" ${formatSection(verbatim = false)}
 This text uses Markdown when printed to html, however if some text is indented with 4 spaces
     it should *not* be rendered as a code block because `verbatim` is false.

  """
}

Given the versatile use of tags now, I decided to include the Tags trait, by default, in the Specification class. I resisted doing that in the past because I didn't want to encumber the Specification namespace too much with something that was rarely used. This leads me to the following tip on how to use the Specification class:

  • when starting a new project or prototyping some code, use the Specification class directly with all inherited features

  • when making your project more robust and production-like, create your own Spec trait, generally inheriting from the BaseSpecification class for basic features, and mix in only the traits you think you will generally use

This should give you more flexibility and choice over which specs2 features you want to use, with a minimal cost in terms of namespace footprint and compile times (because each new implicit you bring in might have an impact on performance).

API changes

The consequence of this evolution is yet another API break:

  • the Text and Example classes now use a FormattedString class containing the necessary parameters to display that string as HTML or in the console
  • for implementation reasons I have actually changed the constructor parameters of all Fragment classes to avoid storing state as private variables
  • the noindent, markdown arguments are now gone (you need to replace them with ${formatSection(flow=true)} and ${formatSection(markdown=true)}, see below)
  • the Tags trait is mixed in the Specification class so if you had methods like def tag you might get conflicts

And there are now 2 methods formatSection(flow: Boolean, markdown: Boolean, verbatim: Boolean) and formatTag(flow: Boolean, markdown: Boolean, verbatim: Boolean) to tag specification fragments with the following parameters:

  • flow: the fragment (Text or Example) shouldn't be reported with automatic indenting (default = false, set automatically to true when using s2 interpolated strings)
  • markdown: the fragment is using Markdown (default = true)
  • verbatim: indented text with more than 4 spaces must be rendered as a code block (default = true but can be set to false to solve #162)

HTML reports

I'm currently thinking that I should try out a brand new way of translating an executed specification with interpolated text into HTML. My first attempts were not completely successful and I find it hard to preserve the original layout of the specification text, especially with the Markdown translation in the middle. Yet, I must say a word on the Markdown library I'm using, Pegdown. I found this library extremely easy to adapt for my current needs (to implement the verbatim = false option) and I send my kudos to Mathias for such a great job.


This is it. Download RC2, use it and provide feedback as usual, thanks!

21 May 2013

Specs2 2.0 - Interpolated

The latest release of specs2 (2.0) deserves a little bit more than just release notes. It needs explanations, apologies and a bit of celebration!

Explanations

  • why is there another (actually several!) new style(s) of writing acceptance specifications
  • what are Scripts and ScriptTemplates
  • what has been done for compilation times
  • what you can do with Snippets
  • what is an ExampleFactory

Apologies

  • the >> / in problem
  • API breaks
  • Traversable matchers
  • AroundOutside and Fixture
  • the never-ending quest for Given/When/Then specifications

Celebration

  • compiler-checked documentation!
  • "operator-less" specifications!
  • more consistent Traversable matchers!

Explanations

Scala 2.10 is a game changer for specs2, thanks to 2 features: String interpolation and Macros.

String interpolation

Specs2 has been designed from the start with the idea that it should be immutable by default. This has led to the definition of Acceptance specifications with lots of operators, or, as someone put it elegantly, "code on the left, brainfuck on the right":

class HelloWorldSpec extends Specification { def is =

  "This is a specification to check the 'Hello world' string"            ^
                                                                         p^
    "The 'Hello world' string should"                                    ^
    "contain 11 characters"                                              ! e1^
    "start with 'Hello'"                                                 ! e2^
    "end with 'world'"                                                   ! e3^
                                                                         end
  def e1 = "Hello world" must have size(11)
  def e2 = "Hello world" must startWith("Hello")
  def e3 = "Hello world" must endWith("world")
}

Fortunately Scala 2.10 now offers a great alternative with String interpolation. In itself, String interpolation is not revolutionary. A string starting with s can have interpolated variables:

val name = "Eric"
s"Hello $name!"

Hello Eric!

But the great powers behind Scala realized that they could both provide standard String interpolation and give you the ability to make your own. Exactly what I needed to make these pesky operators disappear!

class HelloWorldSpec extends Specification { def is =         s2"""

 This is a specification to check the 'Hello world' string

 The 'Hello world' string should
   contain 11 characters                                      $e1
   start with 'Hello'                                         $e2
   end with 'world'                                           $e3
                                                              """

   def e1 = "Hello world" must have size(11)
   def e2 = "Hello world" must startWith("Hello")
   def e3 = "Hello world" must endWith("world")
}

What has changed in the specification above is that text Fragments are now regular strings in the multiline s2 string and the examples are now inserted as interpolated variables. Let's explore some aspects of this new feature in more detail:

  • layout
  • examples descriptions
  • other fragments
  • implicit conversions
  • auto-examples

Layout

If you run the HelloWorldSpec you will see that the indentation of each example is respected in the output:

This is a specification to check the 'Hello world' string

The 'Hello world' string should
  + contain 11 characters
  + start with 'Hello'
  + end with 'world'

This means that you no longer have to worry about the layout of text or use the p, t, bt, end, endp formatting fragments as before.

Examples descriptions

On the other hand, the string which is taken as the example description is not as well delimited anymore, so it is now chosen by convention to be everything that is on the same line. For example this is what you get with the new interpolated string:

s2"""
My software should
  do something that is pretty long to explain,
  so long that it needs 2 lines" ${ 1 must_== 1 }
"""
My software should
  do something that is pretty long to explain,
  + so long that it needs 2 lines"

If you want the 2 lines to be included in the example description you will need to use the "old" form of creating an example:

s2"""
My software should
${ """do something that it pretty long to explain,
    so long that it needs 2 lines""" ! { 1 must_== 1 } }
"""
My software should
+ do something that is pretty long to explain,
    so long that it needs 2 lines

But I suspect that there will be very few times when you will want to do that.

Other fragments and variables

Inside the s2 string you can interpolate all the usual specs2 fragments: Steps, Actions, included specifications, Forms... However you will quickly realize that you cannot interpolate arbitrary objects. Indeed, apart from specs2 objects, the only 2 other types which you can use as variables are Snippets (see below) and Strings.

The restriction is there to remind you that, in general, interpolated expressions are "unsafe". If the expression you're interpolating throws an Exception, as is commonly the case with tested code, there is no way to catch that exception. If that exception is uncaught, the whole specification will fail to be built. Why is that?

Implicit conversions

When I first started to experiment with interpolated strings I thought that they could even be used to write Unit Specifications:

s2"""
This is an example of conversion using integers ${
  val (a, b) = ("1".toInt, "2".toInt)
  (a + b) must_== 3
}
"""

Unfortunately such specifications will break horribly if there is an error in one of the examples. For instance if the example was:

This is an example of conversion using integers ${
  // oops, this is going to throw a NumberFormatException!
  val (a, b) = ("!".toInt, "2".toInt) 
  (a + b) must_== 3
}

Then the whole string and the whole specification will fail to be instantiated!

The reason is that everything you interpolate is converted, through an implicit conversion, to a "SpecPart" which will be interpreted differently depending on its type. If it is a Result then we will interpret this as the body of an Example and use the preceding text as the description. If it is just a simple string then it is just inserted in the specification as a piece of text. But an implicit conversion of a block of code, as above, does not convert the whole block. It merely converts the last value! So if anything before the last value throws an Exception you will have absolutely no way to catch it and it will bubble up to the top.

That means that you need to be very prudent when interpolating arbitrary blocks. One work-around is to do something like this:

import execute.{AsResult => >>}
s2"""

This is an example of conversion using integers ${>>{
  val (a, b) = ("!".toInt, "2".toInt)
  (a + b) must_== 3
}}
  """

But you have to admit that the whole ${>>{...}} is not exactly gorgeous.

Auto-examples

One clear win of Scala 2.10 for specs2 is the use of macros to capture code expressions. This is particularly interesting with so-called "auto-examples". This feature is really useful when your examples are so self-descriptive that a textual description feels redundant. For example if you want to specify the String.capitalize method:

s2"""
 The `capitalize` method verifies
 ${ "hello".capitalize       === "Hello" }
 ${ "Hello".capitalize       === "Hello" }
 ${ "hello world".capitalize === "Hello world" }
"""
 The `capitalize` method verifies
 + "hello".capitalize       === "Hello"
 + "Hello".capitalize       === "Hello"
 + "hello world".capitalize === "Hello world"
 

It turns out that the s2 interpolation method uses a macro to extract the text for each interpolated expression, so if there is no preceding text on a given line, we take the captured expression as the example description. It is important to note that this will only work properly if you enable the -Yrangepos scalac option (in sbt: scalacOptions in Test := Seq("-Yrangepos")).

However the drawback of using that option is the compilation speed cost which you can incur (around 10% in my own measurements). If you don't want (or you forget :-)) to use that option there is a default implementation which should do the trick in most cases but which might not capture all the text in some edge cases.

Scripts

The work on Given/When/Then specifications has led to a nice generalisation. Since the new GWT trait decouples the specification text from the steps and examples to create, we can push this idea a bit further and create "classical" specifications where the text is not annotated at all and examples are described somewhere else.

Let's see what we can do with the org.specs2.specification.script.Specification class:

import org.specs2._
import specification._

class StringSpecification extends script.Specification with Grouped { def is = s2"""

Addition
========

 It is possible to add strings with the + operator
  + one string and an empty string
  + 2 non-empty strings

Multiplication
==============

 It is also possible to duplicate a string with the * operator
  + using a positive integer duplicates the string
  + using a negative integer returns an empty string
                                                                                """

  "addition" - new group {
    eg := ("hello" + "") === "hello"
    eg := ("hello" + " world") === "hello world"
  }
  "multiplication" - new group {
    eg := ("hello" * 2) === "hellohello"
    eg := ("hello" * -1) must beEmpty
  }
}

With script.Specifications you just provide a piece of text where examples start with a + sign, and you specify example groups. Example groups were introduced in a previous version of specs2 with the idea of providing standard names for examples in Acceptance specifications.

When the specification is executed, the first 2 example lines are mapped to the examples of the first group, and the example lines from the next block (as delimited by a Markdown title) are used to build examples by taking expectations in the second group (those groups are automatically given names, g1 and g2, but you can specify them yourself: "addition" - new g1 {...).

This seems to be a lot of "convention over configuration" but this is actually all configurable! The script.Specification class is an example of a Script and it is associated with a ScriptTemplate which defines how to parse text to create fragments based on the information contained in the Script (we will see another example of this in action below with the GWT trait which proposes another type of Script named Scenario to define Given/When/Then steps).

There are lots of advantages in adopting this new script.Specification class:

  • it is "operator-free", there's no need to annotate your specification on the right with strange symbols

  • tags are automatically inserted for you so that it's easy to re-run a specific example or group of examples by name: test-only StringSpecification -- include g2.e1

  • examples are marked as pending if you haven't yet implemented them

  • it is configurable to accommodate other templates (you could even create Cucumber-like specifications if that's your thing!)

The obvious drawback is the decoupling between the text and the examples code. If you restructure the text you will have to restructure the examples accordingly and knowing which example is described by which piece of text is not obvious. This, or operators on the right-hand side, choose your poison :-)

Compilation times

Scala's typechecking and JVM interoperability come at a big price in terms of compilation times. Moderately-sized projects can take minutes to compile, which is very annoying for someone coming from Java or Haskell.

Bill Venners has tried to do a systematic study of which features in testing libraries seem to have the biggest impact. It turns out that implicits, traits and byname parameters have a significant impact on compilation times. Since specs2 uses those features more than any other test library, I tried to do something about it.

The easiest thing to do was to make Specification an abstract class, not a trait (and provide the SpecificationLike trait in its place). My unscientific estimation is that this single change removed 0.5 seconds per compiled file (from 313s to 237s for the specs2 build, and a memory reduction of 55Mb, from 225Mb to 170Mb).

Then, the next very significant improvement was to use interpolated specifications instead of the previous style of Acceptance specifications. The result is impressive: from 237 seconds to 150 seconds and a memory reduction of more than 120Mb, from 170Mb to 47Mb!

On the other hand, when I tried to remove some of the byname parameters (the left part of a must_== b) I didn't observe a real impact on compilation times (only 15% less memory).

The last thing I did was to remove some of the default matchers (and to add a few others). Those matchers are the "content" matchers: XmlMatchers, JsonMatchers, FileMatchers, ContentMatchers (and I added the TryMatchers instead). I did this to remove some implicits from the scope when compiling code but also to reduce the namespace footprint every time you extend the Specification class. However I couldn't see a major improvement to compile-time performance with this change.

Snippets

One frustration of software documentation writers is that it is very common to have stale or incorrect code because the API has moved on. What if it was possible to write some code, in the documentation, that will be checked by the compiler? And automatically refactored when you change a method name?

This is exactly what Snippets will do for you. When you want to capture and display a piece of code in a Specification you create a Snippet:

s2"""
This is an example of addition: ${snippet{

// who knew?
1 + 1 == 2
}}
"""

This renders as:

This is an example of addition

// who knew?
1 + 1 == 2

And yes, you guessed it right, the Snippet above was extracted by using another Snippet! I encourage you to read the documentation on Snippets to see what you can do with them, the main features are:

  • code evaluation: the last value can be displayed as a result

  • checks: the last value can be checked and reported as a failure in the Specification

  • code hiding: it is possible to hide parts of the code (initialisations, results) by enclosing them in "scissors" comments of the form // 8<--

Example factory

Every now and then I get a question from users who want to intercept the creation of examples and use the example description to do interesting things before or after the example execution. It is now possible to do so by providing another ExampleFactory rather than the default one:

import specification._

class PrintBeforeAfterSpec extends Specification { def is =
  "test" ! ok

  case class BeforeAfterExample(e: Example) extends BeforeAfter {
    def before = println("before "+e.desc)
    def after  = println("after "+e.desc)
  }

  override def exampleFactory = new ExampleFactory {
    def newExample(e: Example) = {
      val context = BeforeAfterExample(e)
      e.copy(body = () => context(e.body()))
    }
  }
}

The PrintBeforeAfterSpec will print the name of each example before and after executing it.

Apologies

the >> / in problem

This issue has come up at different times and one lesson is: Unit means "anything", so don't try to be too smart about it. So I owe an apology to the users for this poor API design choice and for the breaking API change that now ensues. Please read the thread in the Github issue to learn how to fix compile errors that would result from this change.

API breaks

While we're on the subject of API breaks, let's make a list:

  • Unit values in >> / in: now you need to explicitly declare if you mean "a list of examples created with foreach" or "a list of expectations created with foreach"

  • Specification is not a trait anymore so you should use the SpecificationLike trait instead if that's what you need (see the Compilation times section)

  • Some matchers traits have been removed from the default matchers (XML, JSON, File, Content) so you need to explicitly mix them in (see the Compilation times section)

  • The Given/When/Then functionality has been extracted as a deprecated trait specification.GivenWhenThen (see the Given/When/Then? section)

  • the negation of the Map matchers has changed (this can be considered as a fix but this might be a run-time break for some of you)

  • many of the Traversable matchers have been deprecated (see the next section)

Traversable matchers

I've had this nagging thought in my mind for some time now but it only reached my consciousness recently. I always felt that specs2 matchers for collections were a bit ad-hoc, with not-so-obvious ways to do simple things. After lots of fighting with implicit classes, overloading and subclassing, I think that I have something better to propose.

With the new API we generalize the type of checks you can perform on elements:

  • Seq(1, 2, 3) must contain(2) just checks for the presence of one element in the sequence

  • this is equivalent to writing Seq(1, 2, 3) must contain(equalTo(2)) which means that you can pass a matcher to the contain method. For example containAnyOf(1, 2, 3) is contain(anyOf(1, 2, 3)) where anyOf is just another matcher

  • and more generally, you can pass any function returning a result! Seq(1, 2, 3) must contain((i: Int) => i must beEqualTo(2)) or Seq(1, 2, 3) must contain((i: Int) => i == 2) (you can even return a ScalaCheck Prop if you want)

Then we can use combinators to specify how many times we want the check to be performed:

  • Seq(1, 2, 3) must contain(2) is equivalent to Seq(1, 2, 3) must contain(2).atLeastOnce

  • Seq(1, 2, 3) must contain(2).atMostOnce

  • Seq(1, 2, 3) must contain(be_>=(2)).atLeast(2.times)

  • Seq(1, 2, 3) must contain(be_>=(2)).between(1.times, 2.times)

This covers lots of cases where you would previously use must have oneElementLike(partialFunction) or must containMatch(...). It can also be used instead of the forall and atLeastOnce methods. For example forall(Seq(1, 2, 3)) { (i: Int) => i must be_>=(0) } is Seq(1, 2, 3) must contain((i: Int) => i must be_>=(0)).forall.

The other type of matching which you want to perform on collections involves several checks at the same time. For example:

  • Seq(1, 2, 3) must contain(allOf(2, 3))

This seems similar to the previous case but the combinators you might want to use with several checks are different. exactly is one of them:

  • Seq(1, 2, 3) must contain(exactly(3, 1, 2)) // we don't expect ordered elements by default

Or inOrder

  • Seq(1, 2, 3) must contain(exactly(be_>(0), be_>(1), be_>(2)).inOrder) // with matchers here

One important thing to note though is that, when you are not using inOrder, the comparison is done greedily: we don't try all the possible combinations of input elements and checks to see if there would be a possibility for the whole expression to match.

Please explore this new API and report any issue (bug, compilation error) you find. Most certainly the failure reporting can be improved. The description of failures is much more centralized with this new implementation but also a bit more generic. For now, the failure messages just list which elements did not pass the checks, but they do not output something nice like: "The sequence Seq(1, 2, 3) does not contain exactly the elements 4 and 3 in order: 4 is not found".

AroundOutside vs Fixture

My approach to context management in specs2 has been very progressive. First I provided the ability to insert code (and more precisely effects) before or after an Example, reproducing standard JUnit capabilities. Then I introduced Around to place things "in" a context, and Outside to pass data to an example. And finally AroundOutside as the ultimate combination of both capabilities.

I thought that with AroundOutside you could do whatever you needed to do, end of story. It turns out that it's not so simple. AroundOutside is not general enough because the generation of the Outside data cannot be controlled by the Around context. This proved to be very problematic for me on a specific use case where I needed to re-run the same example, based on different parameters, with slightly different input data each time. AroundOutside was just not doing it. The solution? A good old Fixture. A Fixture[T] is simply a trait like this:

trait Fixture[T] {
  def apply[R : AsResult](f: T => R): Result
}

You can define an implicit fixture for all the examples:

class s extends Specification { def is = s2"""
  first example using the magic number $e1
  second example using the magic number $e2
"""

  implicit def magicNumber = new specification.Fixture[Int] {
    def apply[R : AsResult](f: Int => R) = AsResult(f(10))
  }

  def e1 = (i: Int) => i must be_>(0)
  def e2 = (i: Int) => i must be_<(100)
}

I'm not particularly happy to add this to the API because it adds to the overall API footprint and learning curve, but in some scenarios this is just indispensable.

Given/When/Then?

With the new "interpolated" style I had to find another way to write Given/When/Then (GWT) steps. But this is tricky. The trouble with GWT steps is that they are intrisically dependent. You cannot have a Then step being defined before a When step for example.

The "classic" style of acceptance specification is enforcing this at compile time because, in that style, you explicitly chain calls and the types have to "align":

class GivenWhenThenSpec extends Specification with GivenWhenThen { def is =

  "A given-when-then example for a calculator"                 ^ br^
    "Given the following number: ${1}"                         ^ aNumber^
    "And a second number: ${2}"                                ^ aNumber^
    "And a third number: ${6}"                                 ^ aNumber^
    "When I use this operator: ${+}"                           ^ operator^
    "Then I should get: ${9}"                                  ^ result^
                                                               end

  val aNumber: Given[Int]                 = (_:String).toInt
  val operator: When[Seq[Int], Operation] = (numbers: Seq[Int]) => (s: String) => Operation(numbers, s)
  val result: Then[Operation]             = (operation: Operation) => (s: String) => { operation.calculate  must_== s.toInt }

  case class Operation(numbers: Seq[Int], operator: String) {
    def calculate: Int = if (operator == "+") numbers.sum else numbers.product
  }
}

We can probably do better than this. What is required?

  • to extract strings from text and transform them to well-typed values
  • to define functions using those values so that types are respected
  • to restrict the composition of functions so that a proper order of Given/When/Then is respected
  • to transform all of this into Steps and Examples

So, with apologies for coming up with yet-another-way of doing the same thing, let me introduce you to the GWT trait:

import org.specs2._                                                                                      
import specification.script.{GWT, StandardRegexStepParsers}                                                                                                         
                                                                                                         
class GWTSpec extends Specification with GWT with StandardRegexStepParsers { def is = s2"""              
                                                                                                         
 A given-when-then example for a calculator                       ${calculator.start}                   
   Given the following number: 1                                                                         
   And a second number: 2                                                                                
   And a third number: 6                                                                                 
   When I use this operator: +                                                                           
   Then I should get: 9                                                                                  
   And it should be >: 0                                          ${calculator.end}
                                                                  """

  val anOperator = readAs(".*: (.)$").and((s: String) => s)

  val calculator =
    Scenario("calculator").
      given(anInt).
      given(anInt).
      given(anInt).
      when(anOperator) { case op :: i :: j :: k :: _ => if (op == "+") (i+j+k) else (i*j*k) }.
      andThen(anInt)   { case expected :: sum :: _   => sum === expected }.
      andThen(anInt)   { case expected :: sum :: _   => sum must be_>(expected) }

}

In the specification above, calculator is a Scenario object which declares some steps through the given/when/andThen methods. The Scenario class provides a fluent interface in order to restrict the order of calls: if you try to call a given step after a when step, for example, you will get a compilation error. Furthermore, steps which use values extracted by previous steps must respect their types: what you pass to the when step has to be a partial function taking a Shapeless HList of the right type.

You will also notice that the calculator is using anInt and anOperator. Those are StepParsers: simple objects which extract values from a line of text and return Either[Exception, T], depending on whether the text converts correctly to a type T. By default you have access to 2 types of parsers. The first one is DelimitedStepParser, which expects the values to extract to be enclosed in {} delimiters (this is configurable). The other one is RegexStepParser, which uses a regular expression with groups in order to know what to extract. For example anOperator declares that the operator to extract is just after the colon at the end of the line.
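To illustrate the idea behind StepParsers, here is a toy version (for illustration only, these are not the actual specs2 classes):

trait ToyStepParser[T] {
  // extract a typed value from a line of text, or return the conversion error
  def parse(line: String): Either[Exception, T]
}

// a regex-with-one-group extractor, in the spirit of readAs(...).and(...)
def toyRegexParser[T](regex: String)(convert: String => T): ToyStepParser[T] = new ToyStepParser[T] {
  val r = regex.r
  def parse(line: String) =
    try Right(convert(r.findFirstMatchIn(line).get.group(1)))
    catch { case e: Exception => Left(e) }
}

val aToyOperator = toyRegexParser(".*: (.)$")((s: String) => s)
// aToyOperator.parse("When I use this operator: +") == Right("+")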

Finally, the calculator scenario is inserted into the s2 interpolated string to delimit the text it applies to. Scenario being a specific kind of Script, it has an associated ScriptTemplate which defines that the last lines of the text should be paired with the corresponding given/when/then method declarations. This is configurable and we can imagine other ways of pairing text to steps (see the org.specs2.specification.script.GWT.BulletTemplate class for example).

For reasons which are too long to expose here I've never been a big fan of Given/When/Then specifications and I guess that the multitude of ways to do that in specs2 shows it. I hope however that the GWT fans will find this approach satisfying and customisable to their taste.

Celebration!

I think there are some really exciting things in this upcoming specs2 release for "Executable Software Specifications" lovers.

Compiler-checked documentation

Having compiler-checked snippets is incredibly useful. I've fixed quite a few bugs in both the specs2 and Scoobi user guides and I hope I made them more resistant to future changes introduced by refactoring (when just renaming things, for example). I'm also very happy that, thanks to macros, the ability to capture code was extended to "auto-examples". In previous specs2 versions this was implemented by looking at stack traces and doing horrendous calculations to figure out where a piece of code was. That code gives me the shivers every time I have to look at it!

No operators

The second thing is Scripts and ScriptTemplates. There is a trade-off when writing specifications: on one hand we would like to read pure text, without the encumbrance of implementation code; on the other hand, when we read specification code, it's nice to have a short sentence explaining what it does. With this new release there is a continuum of solutions along this trade-off axis:

  1. you can have pure text, with no annotations but no navigation is possible to the code (with org.specs2.specification.script.Specification)
  2. you can have annotated text, with some annotations to access the code (with org.specs2.Specification)
  3. you can have text interspersed with the code (with org.specs2.mutable.Specification)

New matchers

I'm pretty happy to have new Traversable matchers covering a lot more use cases than before in a straightforward manner. I hope this will reduce the thinking time between "I need to check that" and "OK, this is how I do it".
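For example (syntax written from memory, so double check it against the new User Guide):

import org.specs2.mutable.Specification

class TraversableMatchersSpec extends Specification {
  "contain one element"             >> { Seq(1, 2, 3) must contain(2) }
  "contain exactly, in any order"   >> { Seq(1, 2, 3) must contain(exactly(3, 2, 1)) }
  "contain a subsequence, in order" >> { Seq(1, 2, 3) must contain(allOf(1, 3)).inOrder }
}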


Please try out the new Release Candidate, report bugs, propose enhancements and have fun!

11 October 2011

Counting words

In this 3-part post I want to show:

  1. how standard, out-of-the-box, Scala helped me to code a small application
  2. how Functional Reactive Programming brings a real improvement on how the GUI is built
  3. how to replace Parser Combinators with the Parboiled library to enhance error reporting

You can access the application project via github.

I can do that in 2 hours!

That's more or less what I told my wife as she was explaining one small problem she had. My wife is studying psychology and she has lots of essays to write, week after week. One small burden she's facing is keeping track of the number of words she writes because each essay must fit in a specific number of words, say 4000 +/- 10%. The difficulty is that quotations and references must not be counted. So she cannot check the file properties in Word or Open Office and she has to keep track manually.

For example, she may write: "... as suggested by the Interpretation of Dreams (Freud, 1905, p.145) ...". The reference "(Freud, 1905, p.145)" must not be counted. Or, "Margaret Mahler wrote: "if the infant has an optimal experience of the symbiotic union with the mother, then the infant can make a smooth psychological differentiation from the mother to a further psychological expansion beyond the symbiotic state." (Mahler cited in St. Clair, 2004, p.92)" (good luck with that :-)). In that case the quotation is not counted either and we must only count 3 words.

Since this counting is a bit tedious and has to be adjusted each time she does a revision of her essay, I proposed to automate this check. I thought "a few lines of Parser Combinators should be able to do the trick, 2 hours max". It actually took me a bit more (tm) to:

  • write a parser robust enough to accommodate all sorts of variations and irregularities. For example, pages can be written as "p.154" or "pp.154-155", years can also be written "[1905] 1962" where 1905 is the first edition, and so on

  • use scala-swing to display the results: number of words, references table, file selection

  • write readers to extract the text from .docx or .odt files

Let's see now how Scala helped me with those 3 tasks.

Parsing the text

The idea behind parser combinators is very powerful. Instead of building a monolithic parser with lots of sub-routines and error-prone tracking of character indices, you describe the grammar of the text to parse by combining smaller parsers in many different ways.

In Scala, to do this, you need to extend one of the Parsers traits. The one I've chosen is RegexParsers. This is a parser well suited for unstructured text. If you were to parse something more akin to a programming language you might prefer StdTokenParsers, which already defines keywords, numeric/string literals, identifiers,...

I'm now just going to comment on a few points regarding the TextParsing trait which is parsing the essay text. If you want to understand how parser combinators work in detail, please read the excellent blog post by Daniel Spiewak: The Magic behind Parser Combinators.

The main definition for this parser is:

  def referencedText: Parser[Results] =
    rep((references | noRefParenthesised | quotation | words | space) <~ opt(punctuation)) ^^ reduceResults

This means that I expect the text to be:

  • a repetition (the rep combinator) of a parser

  • the repeated parser is an alternation (written |) of references, parenthesised text, quotations, words or spaces. For each of these "blocks" I'm able to count the number of words. For example, a reference counts for 0 words and parenthesised text counts the number of words between the parentheses

  • there can be a following punctuation sign (optional, using the opt combinator), but we don't really care about it, so it can be discarded (hence the <~ combinator, instead of ~ which sequences 2 parsers)

Then I have a function called reduceResults which takes the result of parsing each repeated block and creates the final Results value, a case class providing:

  • the number of counted words
  • the references in the text
  • the quotations in the text
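As a rough sketch of what this reduction can look like (the names and types here are my assumptions; the real code is in the github project): if each repeated block is parsed into a Results value (0 counted words for a reference, n words for plain text, and so on), then reducing is just a fold:

  case class Results(count: Int, references: Seq[String], quotations: Seq[String]) {
    def add(other: Results): Results =
      Results(count + other.count, references ++ other.references, quotations ++ other.quotations)
  }

  // rep(...) produces a List of block results, ^^ maps it to a single Results value
  val reduceResults: List[Results] => Results =
    blocks => blocks.foldLeft(Results(0, Nil, Nil))(_ add _)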

Using the RegexParsers trait is very convenient. For example, if I want to specify how to parse "Pages" in a reference like (Freud, 1905, p.154), I can sequence 2 parsers built from regular expressions:

  val page = "p\\.*\\s*".r ~ "\\d+".r
  • appending .r to a string returns a regular expression (of type Regex)
  • there is an implicit conversion method in the RegexParsers trait, called regex from a Regex to a Parser[String]
  • I can sequence 2 parsers using the ~ operator
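A quick way to check this parser in isolation (a standalone sketch, not the project's test code):

  import scala.util.parsing.combinator.RegexParsers

  object PageParsing extends RegexParsers {
    val page = "p\\.*\\s*".r ~ "\\d+".r
  }

  // in the REPL:
  PageParsing.parseAll(PageParsing.page, "p.134")   // succeeds
  PageParsing.parseAll(PageParsing.page, "p134")    // succeeds too, see below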

The page parser above can recognize page numbers like p.134 or p.1, but it will also accept p134. You can argue that this is not very well formatted, and my wife would agree with you. However, she certainly doesn't want the word count to be wrong, or to fail, just because she forgot a dot! The plan here is to display what was parsed so that she can eventually fix incorrect references which are not written according to academic standards. We'll see, in part 3 of this series, how we can use another parsing library to manage those errors without breaking the parsing.

One more important thing to mention about the use of the RegexParsers trait is the skipWhitespace method. If it returns true (the default), any regex parser will discard whitespace before matching a string against a regular expression. This is convenient most of the time, but not here, where I need to preserve spaces to be able to count words accurately.
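Turning it off boils down to a one-line override in the parsing trait; a minimal sketch (the trait name here is just for illustration):

  import scala.util.parsing.combinator.RegexParsers

  trait EssayParsing extends RegexParsers {
    // keep whitespace as part of the input so that words can be counted accurately
    override def skipWhitespace = false
  }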

To finish with the subject of parsing, you can have a look at the TextParsingSpec specification. This specification uses the ParserMatchers trait to help with testing your parsers. It also uses the Auto-Examples feature of specs2 so that the text of the example is used directly as its description:

  "Pages"                                                                           ^
  { page must succeedOn("p.33") }                                                   ^
  { page must succeedOn("p33") }                                                    ^
  { pages must succeedOn("pp.215-324") }                                            ^
  { pages must succeedOn("pp.215/324") }                                            ^
  { pages("pp. 215/324") must haveSuccessResult(===(Pages("pp. 215/324"))) }        ^

Displaying the results

The next big chunk of this application is a Swing GUI. The Scala standard distribution provides a scala-swing library adding some syntactic sugar on top of regular Swing components. If you want to read more about Scala and Swing you can have a look at this presentation.

The main components of my application are:

  • a menu bar with 2 buttons to: select a file, do the count
  • a field to display the currently selected file
  • a results panel showing: the number of counted words and the document references

[Screenshot: the word count application showing a count example. Note that the parsing is not perfect since the word counts do not add up!]

If you have a look at the code you will see that this translates to:

  • an instance of SimpleSwingApplication defining a top method and including all the embedded components: a menu bar, a panel with the selected file and results
  • the subcomponents themselves: the menu items, the count action, the results panel
  • the "reactions" which is a PartialFunction listening to the events created by some components, SelectionChanged for example, and triggering the count or displaying the results

I was pretty happy to see that much of the verbosity of Swing programming is reduced with Scala:

  • you don't need to create xxxListeners for everything
  • there are components providing both a Panel and a LayoutManager with the appropriate syntax to display the components: FlowPanel, BoxPanel, BorderPanel
  • thanks to Scala syntax you can write action = new Action(...) instead of setAction(new Action(...))

This is nice but I think there is some potential for pushing this even further and creating more reusable out-of-the-box components. For example, I've created an OpenFileMenuItem which is a MenuItem with an Action to open a FileChooser. Also, something like a pervasive LabeledField, with just a label and some text, would be very useful to have in a standard library.

I also added a bit of syntactic sugar to have actions executed on a worker thread instead of the event dispatch thread (to avoid grey screens), with UI updates posted back to the event dispatch thread via the SwingUtilities.invokeLater method. For example, myAction.inBackground will be executed on a separate thread.
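Something along these lines does the trick (a sketch with assumed names, not the project's actual helper):

  import scala.swing.Action
  import javax.swing.SwingUtilities

  object BackgroundSyntax {
    class RichAction(a: Action) {
      // a copy of the action whose body runs on a worker thread
      def inBackground: Action = Action(a.title) {
        new Thread(new Runnable { def run(): Unit = a() }).start()
      }
    }
    implicit def richAction(a: Action): RichAction = new RichAction(a)

    // UI updates done from the worker thread are posted back to the event dispatch thread
    def onEDT(update: => Unit): Unit =
      SwingUtilities.invokeLater(new Runnable { def run(): Unit = update })
  }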

Eventually, I was able to code up the GUI of the application pretty fast. The only thing which I didn't really like was the Publish/React pattern. It felt a bit messy. The next part of this series will show how Functional Reactive Programming with the reactive library helped me write cleaner code.

Reading the file

I anticipated this part to be a tad difficult. My first experiments of text parsing were using a simple text file and I knew that having the user (my wife, remember,...) copy and paste her text to another file just for counting would be a deal-breaker. So I tried to read .odt and .docx files directly. This was actually much easier than anything I expected!

Both formats are zipped xml files. Getting the content of those files is just a matter of:

  • reading the ZipFile entries and finding the file containing the text

    val rootzip = new ZipFile(doc.path)
    rootzip.entries.find(_.getName.equals("word/document.xml"))
    
  • loading the xml as a NodeSeq

    XML.load(rootzip.getInputStream(f))
    
  • finding the nodes containing the actual text of the document and transforming them to text

    // for a Word document text is under <p><t> tags
    (xml \\ "p") map (p => (p \\ "t").map(_.text) mkString "") mkString "\n"
    

For further details you can read the code here.
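Put together, the whole extraction is something like this (a sketch assuming the standard java.util.zip and scala.xml APIs; the project code handles both .docx and .odt and is more defensive):

  import java.util.zip.ZipFile
  import scala.collection.JavaConverters._
  import scala.xml.XML

  // read the text of a .docx file: unzip it, load the xml, collect the <t> text nodes
  def readDocx(path: String): Option[String] = {
    val rootzip = new ZipFile(path)
    rootzip.entries.asScala.find(_.getName == "word/document.xml").map { entry =>
      val xml = XML.load(rootzip.getInputStream(entry))
      (xml \\ "p").map(p => (p \\ "t").map(_.text).mkString("")).mkString("\n")
    }
  }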

Recap

That's it, parser combinators + scala-swing + xml = a quick app solving a real-world problem. In the next posts we'll try to make this even better!