09 September 2019

Processing CSV files in Haskell

This blog post is the result of a little experiment. I wanted to check how hard it would be to use Haskell to write a small program to help me solve a “real-life” problem. I have always been pretty bad at doing the accounting for the family, relying on a mix of Excel spreadsheets filled with random amounts and dates.

In order to improve our budgeting I decided to give a go at an application called YNAB (“You Need A Budget”). Many applications of that nature require you to import your bank transactions in order to be really precise about your income and expenses. And now we have an IT problem, right at home. I have, for historical and practical reasons, a bunch of different bank accounts. Of course not one of them exports my transaction data in a format that’s compatible with what YNAB expects.

Is this going to stop a software engineer? No, a software engineer and devoted householder would find the right combination of awk and sed to do the job. But I am also a Haskeller and I wonder how difficult it is to solve that task using Haskell. More precisely I want to gauge what amount of Haskell knowledge is required to do this. Since I am not a beginner anymore (yay!) this is a bit biased but I think that it is very important to do our best to remember we were once beginners.

In the following sections I want to explain what I did and give some pointers to help beginners get started with Haskell and code a similar application:

  1. set-up the project
  2. create data types
  3. decode CSV lines
  4. write tests for the decoders
  5. parse and process a full file
  6. parse options from the command line
  7. tie it all together in an application

For each section I will recommend some things to start learning first and some others to learn later.

The full code can be found here.

Setting up a Haskell project

This is something I didn’t have to do from scratch since I already had the Haskell build tool stack installed on my machine. From now on I am going to assume that you have installed stack already. Creating your first Haskell project is not that obvious. You need to learn how to declare a few things according to the “Cabal” format:

  • where to put your sources, your tests?
  • are you going to produce a library, an executable?
  • which libraries do you need as dependencies?

Fear not, there is a great Haskell command-line tool to help you with all of this: summoner. Run stack install summoner, follow the prompts, and create your first project in no time, with the corresponding GitHub project and CI configuration. I think this is the best way to get started on some immediate coding. You will have plenty of time later to learn Cabal/Stack/Hpack/nix and become a pro at setting up projects.

Funnily enough, this step took me a bit of time. I frequently use the ghci REPL (with stack ghci; we will talk about it later) when programming in Haskell, and I have a global set-up for it in .ghci:

:set prompt "λ> "
import Prelude

:def hoogle \str -> return $ ":!hoogle --count=50 \"" ++ str ++ "\""
:def pointfree \str -> return $ ":!pointfree \"" ++ str ++ "\""
:def pointful \str -> return $ ":!pointful \"" ++ str ++ "\""

This configuration file gives me a cute ghci prompt but it also gives me access to some very useful Haskell tools, like hoogle for searching type signatures, right in my REPL. Unfortunately when I started my ghci session, stack informed me that it didn’t know about Prelude. The reason is that the project created by summoner uses a custom prelude which removes the standard Prelude from the search path. Custom preludes are definitely important to know about in Haskell but they are best left for a bit later, when you want to get serious about Haskell development and make sure you are using “safe” functions as much as possible (no head :: [a] -> a for example). In my case I decided to switch to another custom prelude, Protolude.
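
For reference, a module using a custom prelude like Protolude typically starts like this (a minimal sketch):

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}

module Data where

-- Protolude replaces the standard Prelude and favours total functions
import Protolude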

Learn now

  • the cabal format
  • the cabal or stack commands for building/testing a project

Learn later

  • the hpack format, as an alternate format
  • other custom preludes: protolude, classy-prelude

Create data types

This is a really cool part of Haskell: cheap (to create) and powerful data types. For this application we want to have a data type representing the input data and a data type for the output data. Wait, actually no. We just need a data type modelling what it means to be a transaction for YNAB and ways to:

  • create values of that type from a CSV line (next section)
  • output a CSV line from values of that type (next section)

Each transaction (or line in a ledger) must contain at least a date, an amount, a payee and possibly a category.

import           Data.Text (Text)
import           Data.Time (Day)

data LedgerLine = LedgerLine {
  date      :: Day
, amount    :: Amount
, reference :: Reference
, category  :: Maybe Category
} deriving (Eq, Show)

data Category = Category Text
  deriving (Eq, Show)

data Reference = Reference Text
  deriving (Eq, Show)

data Amount = Amount Double
  deriving (Eq, Show)

Here we are re-using some standard data types like Text and Day but wrapping them in custom data types. This is quite useful because we can no longer make the mistake of putting a Reference where a Category is expected, for example. It also better documents the LedgerLine fields. The deriving clauses give us ways to display and compare values out of the box (think toString and equals in Java).
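
With those types in place, building a transaction looks like this (a sketch with made-up values):

import Data.Time (fromGregorian)

-- a sample transaction; all the values are made up
sampleLine :: LedgerLine
sampleLine = LedgerLine {
  date      = fromGregorian 2019 9 9
, amount    = Amount (-42.5)
, reference = Reference "supermarket"
, category  = Just (Category "Groceries")
}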

My actual datatypes are a bit more complicated:

data LedgerLine = LedgerLine {
  _date      :: Maybe Day
, _amount    :: Amount
, _reference :: Reference
, _category  :: Maybe Category
} deriving (Eq, Show)

-- (these deriving clauses require the DerivingStrategies extension)
newtype Category = Category Text
  deriving (Eq, Show)
  deriving newtype (IsString, FromField, ToField)

newtype Reference = Reference Text
  deriving (Eq, Show)
  deriving newtype (IsString, FromField, ToField)

newtype Amount = Amount Double
  deriving (Eq, Show)
  deriving newtype (FromField, ToField, Num)

First of all some CSV lines might not have a date yet if the transactions have been created today. Then:

  • I use newtype instead of data for Category, Amount, Reference to avoid paying a runtime cost for wrapping a type
  • an IsString instance is derived for Category and Reference to be able to use string literals directly in tests (to write "restaurant" instead of Category "restaurant")
  • I am deriving a Num instance to be able to +, negate,… amounts later as if they were Doubles (see the sketch after this list)
  • the field names are prefixed with _ to avoid potential clashes with variables having similar names (amount, category, etc.)
  • there are some instances for FromField and ToField for… see the next section :-)
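
As promised above, here is what the derived Num instance buys us (a small sketch; total and reversed are hypothetical helpers):

-- summing amounts works because Amount has a Num instance
total :: [LedgerLine] -> Amount
total = sum . fmap _amount

-- and so does negating an amount
reversed :: Amount -> Amount
reversed = negate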

Learn now

  • how to create data types and the difference between data and newtype
  • typeclasses and instances: Show, Eq, Num,…

Learn later

  • deriving strategies (the deriving newtype clauses used above) and the GeneralizedNewtypeDeriving extension

Decode a CSV line

This is becoming more involved. We need to find a library knowing how to parse CSV lines. The standard library for CSV files in Haskell is cassava. Like many other libraries for encoding / decoding data structures it uses type classes:

  • FromField to specify how to parse a value in a CSV column and transform it to a data type value
  • FromNamedRecord to specify how to parse a full CSV row and how to assemble the parsed values

In our case we want to parse at least 3 formats from different banks (Commerzbank, N26, Revolut), so we need an auxiliary data type:

data InputLedgerLine =
    CommerzbankLine LedgerLine
  | N26Line LedgerLine
  | RevolutLine LedgerLine
  deriving (Eq, Show)

and we can start defining parsers for each format:

instance FromNamedRecord InputLedgerLine where
  parseNamedRecord r =
        parseCommerzBank
    <|> parseN26
    <|> parseRevolut

    where
      parseCommerzBank = fmap CommerzbankLine $
            LedgerLine
        <$> (fmap unCommerzbankDay <$> r .: "Transaction date")
        <*> r .: "Amount"
        <*> r .: "Booking text"
        <*> r .: "Category"

      parseN26     = panic "todo N26"
      parseRevolut = panic "todo Revolut"

newtype CommerzbankDay = CommerzbankDay { unCommerzbankDay :: Day } deriving (Eq, Show)

instance FromField CommerzbankDay where
  parseField f = CommerzbankDay <$>
    parseTimeM True defaultTimeLocale "%d.%m.%Y" (toS f)
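
If you want to check a FromField instance in isolation, cassava’s runParser function can run it directly (a sketch of a ghci session, assuming OverloadedStrings is on; the exact Show output may differ):

λ> import Data.Csv (runParser, parseField, Parser)
λ> runParser (parseField "26.08.2019" :: Parser CommerzbankDay)
Right (CommerzbankDay {unCommerzbankDay = 2019-08-26})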

All of this is quite a jump in complexity, but also quite some power! Think about it: in a few lines of code we have:

  • specified how to parse rows for the Commerzbank file format
  • specified how to parse each field and what the field names are in the CSV file
  • specified a date format for dates like 26.08.2019
  • specified that other parsers must be tried if the first parser fails (when we are parsing another format)

I am not going to unpack everything here but give you some pointers on what to learn.

Learn now

  • the Functor, Applicative and Alternative operators used above: <$>, <*>, <|>
  • the FromField and FromNamedRecord typeclasses in the cassava documentation

Learn later

  • date and time parsing with parseTimeM and the format specifiers of Data.Time.Format

Write tests for the decoders

Pretty cool. If you understand how the parsers in the above section work, you should be able to open a GHCi session and try them out (read the documentation of the decodeByName function in the cassava library).

λ> import Data.Csv
λ> let commerzbankHeader = "Transaction date,Value date,Transaction type,Booking text,Amount,Category"
λ> let line = "30.08.2019,30.08.2019,debit,\"mobilcom-debitel Kd\",-15.00,Home Phone and Internet"
λ>
λ> fmap snd $ decodeByName @InputLedgerLine $ commerzbankHeader <> "\n" <> line
Right [CommerzbankLine (LedgerLine {
  _date = Just 2019-08-30,
  _amount = Amount (-15.0),
  _reference = Reference "mobilcom-debitel Kd",
  _category = Just (Category "Home Phone and Internet")})]

It works!

We also want to make sure this code will still work if we make further modifications, so it is time to… write tests! There are many alternatives for writing tests in Haskell and I have my own preferences :-). I reached for my own library, registry-hedgehog, which is a layer on top of several libraries:

  • hedgehog for writing property-based tests
  • registry for assembling data generators without using typeclasses
  • tasty-hedgehog for executing hedgehog properties as Tasty tests
  • tasty-discover to automatically find tests in files and assemble them into a large suite

This is totally overblown for that little project since I haven’t written a single property so far. But I know the API well and like it since I made it to my taste :-). What do the tests look like?

test_parse_commerzbank_with_date = test "we can parse the commerzbank format with a date" $ do
  let line = "30.08.2019,30.08.2019,debit,\"mobilcom-debitel Kd\",-15.00,Home Phone and Internet"
  let result = fmap (toList . snd) $ decodeByName (toS $ unlines [header, line])

  result === Right [CommerzbankLine $ LedgerLine {
      _reference = "mobilcom-debitel Kd"
    , _date = Just (fromGregorian 2019 8 30)
    , _amount = Amount (-15.0)
    , _category = Just ("Home Phone and Internet")
    }]

A test is simply a piece of text describing the intention, some action (decodeByName) and an assertion (with ===). This is very similar to what I tried on the command line earlier.

Learn now

  • hspec: an easy library to start writing unit tests

Learn later

  • quickcheck/hedgehog: for writing property tests
  • tasty: a test framework dedicated to the structuring and the running of test suites
  • hspec-discover / tasty-discover: to avoid having to manually create test suites from the tests in test modules
  • registry-hedgehog: for an alternative to typeclasses when creating data generators

Parse and process a full file

Now we are entering serious territory. When we parse files we have to be conscious about:

  • memory usage: it is not advisable to read the full content of a file into memory before processing it
  • resource usage: files must be properly closed after use to avoid leaking resources

None of this really matters for my application since the files I am processing are quite small (< 1 MB) and the application exits right after processing. Anyway, I wanted to see if doing the “right thing” was as easy as going for a quick and dirty solution.

There is a beautiful library for streaming data in Haskell, streaming, which I have used before. I am in luck since someone created a streaming-cassava library to stream rows decoded by cassava. It provides a function decodeByName which is the equivalent of the Data.Csv.decodeByName I used in the tests, but operating on “streams” of data. A similar function, encodeByName, also exists to encode values as CSV rows. That’s fine, but we also need to read and write those rows. I am going to decompose the whole processing into 6 parts and explain the data types involved in each step:

  1. read an input file to get a ByteString m () which is a stream of bytes
  2. decode the rows with decodeByName to get a Stream (Of InputLedgerLine) m ()
  3. deal with decoding errors
  4. process the input ledger lines and transform them to LedgerLine
  5. encode the lines as CSV rows with encodeByName to get back a stream of bytes ByteString m ()
  6. write those bytes to an output file

Read a file as a Stream of bytes

Again we are lucky, the streaming-with library gives us a function, withBinaryFileContents to read the contents of a file as a stream:

withBinaryFileContents filePath $ \(contents :: ByteString m ()) ->
  doSomething contents

Not only are the contents streamed using the ByteString m () data structure, but withBinaryFileContents also makes sure the file is closed when the processing (doSomething) is finished, even if exceptions are thrown.

Decode the lines

The Streaming.Cassava.decodeByName function does the job for us. It takes a ByteString m () and returns a Stream (Of InputLedgerLine) m (), provided we have a FromNamedRecord typeclass instance for InputLedgerLine. Now is a good time to talk about those streaming data types: ByteString and Stream.

What is a stream of data?

Indeed I owe you a bit of explanation about the “streaming” types: ByteString m r and Stream (Of a) m r. Why so many type parameters to represent streams? I will just explain Stream here because ByteString m r is essentially a specialization of Stream to streaming bytes.

NOTE: The ByteString name in Haskell (found in Data.ByteString or Data.ByteString.Lazy) could make you believe that we are dealing with strings and their underlying bytes. It is better to think of it as just a collection of bytes. The same goes for the ByteString m () of Data.ByteString.Streaming, except that the bytes are streamed.

So, what is a Stream (Of a) m r? If you run :info Stream in GHCi, you will more or less read (I’m simplifying a bit here) that it is either:

  • Return r: returning a value r, nothing more to do. If you use the fmap operation you can “map” this value to something else (so Stream is a Functor)
  • Effect (m (Stream (Of a) m r)): creating a stream with the effect m. For example m = IO when we read from a file
  • Step (Of a (Stream (Of a) m r)): producing a value a and another stream of values: “what comes next”. Think of Of as a pair where the first element is strictly evaluated

I found it a bit confusing at first because of the various type variables (“do we really need a type for the return value? Yes we do”) but after a while I realized that this is the simplest way to stream values, and already super-powerful!
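
To get a feel for streams without touching those constructors directly, here is a tiny stream built with the public API (a sketch):

import Control.Monad.IO.Class (liftIO)
import Streaming (Of, Stream)
import qualified Streaming.Prelude as SP

-- a stream producing 1, then performing an IO effect, then producing 2
sample :: Stream (Of Int) IO ()
sample = do
  SP.yield 1
  liftIO (putStrLn "between the two values")
  SP.yield 2

-- running it with SP.toList_ prints the message and returns [1,2]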

Deal with decoding errors

I think this part is difficult for beginners. I wrote that Streaming.Cassava.decodeByName returns a Stream (Of InputLedgerLine) m (). No error in sight there. So how are parsing errors signaled? On the monad m. The full signature of decodeByName is:

decodeByName :: (MonadError CsvParseException m, FromNamedRecord a) =>
  ByteString m r -> Stream (Of a) m r

Meaning that the monad m must support errors of type CsvParseException. For example m can be ExceptT CsvParseException n where n is another monad. On one hand this is quite nice because we get back a data type Stream (Of a) m r where we don’t have to think too much about errors; it is mostly a stream of parsed values. It is easier to work with than, say, Stream (Of (Either CsvParseException a)) m r. On the other hand the constraint on m is going to propagate to the rest of the application, and things can become awkward if another part of the application requires MonadError OtherException m. Then the compilation errors can become confusing and it is not immediately obvious how the error types can be aligned. In this application we nip the problem in the bud by doing the following:

  • catch the error as soon as possible
  • rethrow it as an exception in IO

rethrow :: (Exception e, MonadIO m) => ExceptT e m a -> m a
rethrow ma = do
   r <- runExceptT ma
   case r of
     Left e  -> throwIO e
     Right a -> pure a

rethrow assumes that we are working with values a in a monad which is ExceptT e m. It catches the errors of type e and, assuming that m is capable of doing IO, rethrows them with throwIO. What we do here is essentially transform a constraint MonadError CsvParseException m into MonadIO m. We lose a bit in terms of abstraction, m is less general than it could be, but we gain in terms of inter-operability with other parts of the application.

Well, that is, if we can even apply rethrow to our stream! What we need is a function Stream (Of a) m r -> Stream (Of a) n r where m is ExceptT CsvParseException n. Such a function exists for much more general structures than Stream. It is called hoist: it works on data types of the form t m a (t = Stream here) and is defined in the mmorph library. This is probably the most complicated transformation of this whole project. However situations with nested “monads/containers” (t and m) appear quite frequently in Haskell, so after a while you will reach for hoist quite naturally.
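
Concretely, here is what the combination of hoist and rethrow does for our stream (a sketch; removeErrors is a hypothetical name, and the Protolude and project imports are assumed):

import Control.Monad.Morph (hoist)

-- hoist applies rethrow to every effect of the stream, replacing the
-- MonadError constraint with a MonadIO one
removeErrors :: MonadIO m
  => Stream (Of InputLedgerLine) (ExceptT CsvParseException m) ()
  -> Stream (Of InputLedgerLine) m ()
removeErrors = hoist rethrow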

What if I hadn’t done any of this?

The MonadError CsvParseException m constraint would have “bubbled up” to the top level, up to the main function, where Haskell would have asked me to do something like runExceptT to make sure I dealt with parsing errors.
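
For comparison, that top level would have looked more or less like this (a hypothetical sketch):

-- without rethrow, processAll keeps its MonadError constraint and the
-- top level must discharge it with runExceptT
main :: IO ()
main = do
  result <- runExceptT processAll
  case result of
    Left e   -> print (e :: CsvParseException)
    Right () -> pure ()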

Process values

The values we read are of type InputLedgerLine but we want a single format, LedgerLine. We are not far off since each parser already normalizes the input values to a LedgerLine. We only need to extract that line from each InputLedgerLine case:

toLedgerLine :: InputLedgerLine -> LedgerLine
toLedgerLine (CommerzbankLine l) = l
toLedgerLine (N26Line l)         = l
toLedgerLine (RevolutLine l)     = l

Now, how can we use toLedgerLine to convert the lines in a Stream (Of InputLedgerLine) m () to get Stream (Of LedgerLine) m ()? By using the map function in Streaming.Prelude:

import qualified Streaming.Prelude as SP

let decoded = decodeByName contents :: Stream (Of InputLedgerLine) m ()
let processed = SP.map toLedgerLine decoded :: Stream (Of LedgerLine) m ()

I really encourage you to read the documentation on Streaming.Prelude because you will find there most of the operations you generally use on lists but this time on streams.

Encode the lines as CSV

Again streaming-cassava helps us here. encodeByName encodes our values, turning a Stream (Of LedgerLine) m () into a ByteString m (), provided we have a ToNamedRecord instance:

instance ToNamedRecord LedgerLine where
  toNamedRecord (LedgerLine date amount reference category) =
    namedRecord [
       "Date"   .= date
     , "Amount" .= amount
     , "Payee"  .= reference
     , "Memo"   .= category
     ]

Since all our fields have ToField instances (derived automatically because they are newtypes of well-known types like Text and Double), we just have to specify the names of the fields in the output file so that cassava knows in which column to put each value.
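
The ynabHeader used below with encodeByName simply lists those output column names (a sketch of how it could be defined):

import Data.Csv (Header, header)

-- the column names of the output file, in order
ynabHeader :: Header
ynabHeader = header ["Date", "Amount", "Payee", "Memo"]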

Write a Stream to a file

streaming-with gives us writeBinaryFile, which takes a ByteString m () and writes it to an output file, again making sure that resources are properly cleaned up even if there is an exception in the meantime.

To sum-up all those transformations in a block of code:

processAll =
  withBinaryFileContents inputFilePath $ \contents -> do
    let decoded   = decodeByName contents
    let processed = SP.map toLedgerLine $ hoist rethrow decoded
    let encoded   = encodeByName ynabHeader processed
    writeBinaryFile outputFilePath encoded

Thanks to all those libraries we have a nice isolation of responsibilities, and guarantees about memory and file handle usage!

Learn now

  • the streaming library and the Streaming.Prelude module
  • the streaming-cassava and streaming-with libraries used above

Learn later

  • monad transformers like ExceptT and the MonadError typeclass
  • the mmorph library and its hoist function

Parse options from the command line

At the minimum we need to be able to read the name of the input file. This can be done with System.Environment.getArgs :: IO [String] and would be sufficient for this application. However you are going to need more elaborate parsing of command-line options for a non-trivial CLI application. I have used a very well-known library for this: optparse-applicative.

With this library, we define a data type for the data we want to read from the command line:

data CliOptions = CliOptions {
  inputFile  :: Text
, outputFile :: Maybe Text
} deriving (Eq, Show)

The output file is left optional, since we can either provide a hard-coded name for the output file, result.csv, or append a piece of text to the input file name. The parser for CliOptions looks like this:

cliOptionsParser :: Parser CliOptions
cliOptionsParser = CliOptions
   <$> strArgument
       ( metavar "INPUT FILE"
      <> help "Input CSV file" )
   <*> optional (strOption
       ( long "output-file"
      <> short 'o'
      <> help "Output CSV file" ))

This style of parser definition is very similar to the one we used for FromNamedRecord to parse CSV fields. It relies on the notion of an Applicative (hence the library name) and on a series of helper functions to specify the options:

  • strArgument parses a string given as an argument (so it is not optional)
  • strOption parses a string option (starting with -- on the command line); wrapping it in optional makes the parser return Nothing instead of failing when the option is absent

You can also see some additional information, like the option name (long and short) for the output file. This information is used both for parsing and for documenting the command-line options. Talking about documenting, how do we provide a --help option? optparse-applicative gives a way to “wrap” a Parser with more information:

defineCliOptions :: ParserInfo CliOptions
defineCliOptions =
   info (cliOptionsParser <**> helper) $
   header "ledgit - massage ledger files" <>
   progDesc "Transform a CSV ledger file into a suitable YNAB file"

In defineCliOptions we enrich the parser with a helper option (providing --help) and additional information via “modifiers”:

  • progDesc adds a text description of the program displayed under the “Usage” section showing a summary of the options
  • header adds an additional header when we display the help
  • those 2 modifiers are being “appended” into one with <> (yes they form a Monoid)

While the whole library is quite powerful, there is quite a lot to explain if you really want to understand how it works: parsers, Applicative, Monoid, Read,… Yet I more or less took the examples from the documentation, changed a few things and it worked immediately.
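
One nice trick: you can exercise the parser without a real command line thanks to execParserPure (a sketch of a ghci session; the output shape is approximate):

λ> import Options.Applicative
λ> getParseResult (execParserPure defaultPrefs defineCliOptions ["input.csv", "-o", "out.csv"])
Just (CliOptions {inputFile = "input.csv", outputFile = Just "out.csv"})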

Learn now

  • the optparse-applicative documentation and its examples

Learn later

  • the Applicative, Monoid and Read typeclasses which underpin the library

Tie it all together

In reality you could put all the code in one Haskell file (you could even create a stack script) and you would be done. For fun I decided to create small components to isolate the different pieces of the application, using “records-of-functions”:

  • Data.hs contains all the data types + the CSV encoders/decoders
  • Importer.hs contains the Importer component tasked with reading the file and decoding it
  • Exporter.hs contains the Exporter component which takes a stream of lines and outputs it to a file
  • App.hs just connects the 2
  • Ledgit.hs calls the options parser and creates the App

The Importer

Let’s have a closer look at those components. The Importer is defined as:

data Importer m = Importer {
  importCsv :: (Stream (Of LedgerLine) m () -> m ()) -> m ()
}

It is kind of weird. Instead of just exposing an interface like importCsv :: Stream (Of LedgerLine) m () returning the decoded lines, it takes a “consumer” of Stream (Of LedgerLine) m () and executes it. This is because of a limitation of the streaming library and the libraries we have been using so far.

The streaming library does not support any resource management. Resource management (properly closing file handles) is done with withBinaryFileContents, which takes a function consuming the file contents. If we want to use that library and define a component, we need to propagate the same pattern.

There is actually quite a profound principle at play here. In programming, some “things” can be defined either by how they are produced or by how they are consumed. For example you can define the Maybe datatype either as:

data Maybe a = Just a | Nothing

or

newtype Maybe a = Maybe (forall b . (a -> b, b) -> b)

In the second case a Maybe value is a function taking its consumers: a function to handle a Just a and a default value to handle Nothing.

If you squint a bit you will also recognize a “continuation-like” type in importCsv :: (a -> r) -> r. The computer science literature is full of such transformations, from “direct style” to “continuation-passing style”. This is a lot of hand-waving, just to justify the weird shape of the Importer interface :-).
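
To make this shape concrete, here is how a consumer plugs into an Importer (a sketch; printAll is a hypothetical helper):

import qualified Streaming.Prelude as SP

-- pass the consumer to the importer; here we simply print every decoded line
printAll :: Importer IO -> IO ()
printAll importer = importCsv importer SP.print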

Otherwise you will notice that the Importer does not mention its “configuration”: there is no inputFilePath to read from in its interface. This is because this data will be provided by the wiring we do in Ledgit.hs.

The Exporter

Nothing special here: we take a stream of lines and export each of them to a file. Underneath, the implementation uses the functions we have seen before: encodeByName and writeBinaryFile.

data Exporter m = Exporter {
  exportCsv :: Stream (Of LedgerLine) m () -> m ()
}

The App

The App just connects the 2 main components; its implementation is super simple:

data App m = App {
  runApp :: m ()
}

newApp :: Importer m -> Exporter m -> App m
newApp Importer {..} Exporter {..} = App {..} where
  runApp = importCsv exportCsv

The “wiring”

Now we need a way to make an App with its Importer, its Exporter and the CliOptions parsed from the command-line. For this we use the registry library and define a registry like so:

newRegistry :: CliOptions -> Registry _ _
newRegistry cliOptions =
     fun (newImporter @IO)
  <: fun (newExporter @IO)
  <: fun (newApp @IO)
  <: val cliOptions

We put all the values and component constructors into a Registry and later ask for a fully wired application:

runApplication :: IO ()
runApplication = do
  cliOptions <- execParser defineCliOptions
  let registry = newRegistry cliOptions
  let app = make @(App IO) registry
  runApp app

That’s it: registry automatically calls all the constructor functions and wires the App. You can also write this code by hand, as shown in the sketch below; there’s no real need to use registry for such a simple application.
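
For reference, the same wiring written by hand might look like this (a sketch, assuming newImporter and newExporter both take the CliOptions):

-- hand-written wiring, no registry involved
runApplication :: IO ()
runApplication = do
  cliOptions <- execParser defineCliOptions
  let importer = newImporter cliOptions
      exporter = newExporter cliOptions
  runApp (newApp importer exporter)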

Learn now

  • the “records-of-functions” style used here and the RecordWildCards extension

Learn later

There are many other ways to define and wire Haskell applications:

  • the ReaderT pattern
  • free monads
  • effects libraries like polysemy or fused-effects

Summary

This blog post presents a simple Haskell application which can be seen as a “template” for many CLI applications. We have:

  • command-line options parsing
  • “business” data types
  • files input / output
  • streaming
  • encoding / decoding

There is nonetheless a learning curve which we should not underestimate; we need to:

  • know how to set up a new project
  • know how to compile, run tests, install the application
  • know how to find relevant libraries in the Haskell ecosystem
  • learn about data and newtype
  • learn about type classes and instances
  • be comfortable with the Applicative typeclass and combinators
  • understand a minimum of monad transformers

I hope this blog post will contribute to making this learning curve less steep by giving pointers on things to start learning right away, and other things to read and practice later.

Concluding thoughts

It occurred to me that being computer literate will be an important part of the “citizen toolkit” in the future. There is no reason why we should not eventually be able to access all of our data through well-crafted APIs. When this happens, I hope someone will use Haskell and write a similar blog post about REST access (or whatever API standard), blockchain auditing, security libraries, etc.


1 comment:

Neil Mayhew said...

Hi Eric,

Thanks for this helpful and interesting post.

I have some suggestions and questions:

1. It's a really bad idea to use Double for currency amounts. Much better to use Centi from Data.Fixed instead.

2. You have a lot of instances of $$ where I think $ is meant. This is very confusing for less experienced Haskellers.

3. In parseCommerzBank I think it would be better to produce a parse failure for bad dates rather than letting them go to Nothing.

4. I'm wondering why you have the ADT in InputLedgerLine instead of having the parse alternatives all produce LedgerLine. Do you need to know after the fact which type of bank line produced the ledger line?

I'm a YNAB user, too, so I found this post particularly interesting. However, all my accounts can be connected to YNAB with automatic feeds so I rarely have to download CSV directly. I've needed to process CSV for other financial applications, though, especially when importing legacy data from spreadsheets, and maybe that's your use case here.

Thanks again,

—Neil