
11 October 2019

A better "add" operator for HLists

At Haskell eXchange 2019, Yves Parès presented his “porcupine” library, which helps scientists run data pipelines using the power of Haskell’s arrows. At some stage, he said, “you know if you’re using a ‘records’ library, like Vinyl, you have to build your HList by appending RNil at the end”. And I thought: No!

This is a very small thing that has been bugging me for some time. If I want to build an HList, why do I have to append HNil at the end? As soon as I’m appending 2 things together to form an HList, the whole type should be determined, shouldn’t it? Let’s work on a bit of code.

Here is the standard definition of a HList in Haskell:

data HList (l::[*]) where
  HNil  :: HList '[]
  HCons :: e -> HList l -> HList (e ': l)

-- example
myHList :: HList [Int, Text]
myHList = HCons 1 (HCons "Hello" HNil)

I can define a +: operator to make appending an element a bit nicer:

infixr 5 +:
(+:) :: a -> HList as -> HList (a : as)
(+:) = HCons

myHList1 :: HList [Int, Text]
myHList1 =
     1
  +: "Hello"
  +: HNil

I can also define an <+> operator to append 2 HLists together:

-- :++ is a type-level operator (not defined here)
-- for appending 2 lists of types together (see the Appendix)

infixr 4 <+>
(<+>) :: HList as -> HList bs -> HList (as :++ bs)
(<+>) HNil bs = bs
(<+>) (HCons a as) bs = HCons a (as <+> bs)

list1 :: HList [Int, Text]
list1 = 1 +: "Hello" +: HNil

list2 :: HList [Double, Bool]
list2 = 2.0 +: True +: HNil

lists :: HList [Int, Text, Double, Bool]
lists = list1 <+> list2

All good so far, that’s a reasonable API. However we still need to specify HNil every time we create a new HList. Can we avoid it?

A more polymorphic operator

In order to avoid using HNil we need an operator, let’s call it <:, which knows what to do when:

  • adding one element to another: a <: b
  • adding one element to a HList: a <: bs

But even better we should be able to:

  • append 2 HLists together: as <: bs
  • append an element at the end of a HList: as <: b

We can already see that this operator cannot be a straightforward Haskell function, because the types of its first and second arguments are not always the same. Annoying. Wait, there’s a tool in Haskell to cope with variations in types like that: typeclasses!

infixr 5 <:
class AddLike a b c | a b -> c where
  (<:) :: a -> b -> c

instance {-# OVERLAPPING #-} (asbs ~ (as :++ bs)) =>
  AddLike (HList as) (HList bs) (HList asbs) where
  (<:) = (<+>)

instance (abs ~ (a : bs)) => AddLike a (HList bs) (HList abs) where
  (<:) = (+:)

instance AddLike a b (HList [a, b]) where
  (<:) a b = a +: b +: HNil

instance (asb ~ (as :++ '[b])) => AddLike (HList as) b (HList asb) where
  as <: b = as <+> (b +: HNil)

This AddLike typeclass will deal with all the cases and now we can write:

a = 1 :: Int
b = "hello" :: Text
c = 2.0 :: Double
d = True :: Bool

ab = a <: b
bc = b <: c
cd = c <: d

abc = a <: bc
bca = bc <: a

abcd = ab <: cd

That’s it, one operator for all the reasonable cases.

Appendix

Here is the full code:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE FunctionalDependencies #-}
{-# LANGUAGE UndecidableInstances #-}
{-# LANGUAGE PolyKinds #-}
{-# OPTIONS_GHC -fno-warn-unticked-promoted-constructors #-}

module AddLikeApi where

import Protolude
import Data.Kind (Type) -- in case the custom prelude does not re-export Type

data HList (l :: [Type]) where
  HNil  :: HList '[]
  HCons :: e -> HList l -> HList (e ': l)

myHList :: HList [Int, Text]
myHList = HCons 1 (HCons "Hello" HNil)

infixr 5 +:
(+:) :: a -> HList as -> HList (a : as)
(+:) = HCons

myHList' :: HList [Int, Text]
myHList' =
     1
  +: "Hello"
  +: HNil

-- * Appendix

infixr 4 <+>
(<+>) :: HList as -> HList bs -> HList (as :++ bs)
(<+>) HNil bs = bs
(<+>) (HCons a as) bs = HCons a (as <+> bs)

infixr 5 <:
class AddLike a b c | a b -> c where
  (<:) :: a -> b -> c

instance {-# OVERLAPPING #-} (asbs ~ (as :++ bs)) => 
  AddLike (HList as) (HList bs) (HList asbs) where
  (<:) = (<+>)

instance (abs ~ (a : bs)) => AddLike a (HList bs) (HList abs) where
  (<:) = (+:)

instance AddLike a b (HList [a, b]) where
  (<:) a b = a +: b +: HNil

instance (asb ~ (as :++ '[b])) => AddLike (HList as) b (HList asb) where
  as <: b = as <+> (b +: HNil)

type family (:++) (x :: [k]) (y :: [k]) :: [k] where
  '[]      :++ xs = xs
  (x : xs) :++ ys = x : (xs :++ ys)

-- examples
list1 :: HList [Int, Text]
list1 = 1 +: "Hello" +: HNil

list2 :: HList [Double, Bool]
list2 = 2.0 +: True +: HNil

lists :: HList [Int, Text, Double, Bool]
lists = list1 <+> list2

a = 1 :: Int
b = "hello" :: Text
c = 2.0 :: Double
d = True :: Bool

ab :: HList [Int, Text]
ab = a <: b

bc :: HList [Text, Double]
bc = b <: c

cd :: HList [Double, Bool]
cd = c <: d

abc' :: HList [Int, Text, Double]
abc' = ab <: c

abc :: HList [Int, Text, Double]
abc = a <: bc

abcd :: HList [Int, Text, Double, Bool]
abcd = ab <: cd




09 September 2019

Processing CSV files in Haskell

This blog post is the result of a little experiment. I wanted to check how hard it would be to use Haskell to write a small program to help me solve a “real-life” problem. I have always been pretty bad at doing accounting for the family, with a mix of Excel spreadsheets filled with random amounts and dates.

In order to improve our budgeting I decided to give a go to an application called YNAB (“You Need A Budget”). Many applications of that nature require you to import your bank transactions in order to be really precise about your income and expenses. And now we have an IT problem, right at home. I have, for historical and practical reasons, a bunch of different bank accounts. Of course not one of them exports my transaction data in a format that’s compatible with what YNAB expects.

Is this going to stop a software engineer? No, a software engineer and devoted householder would find the right combination of awk and sed to do the job. But I am also a Haskeller and I wonder how difficult it is to solve that task using Haskell. More precisely I want to gauge what amount of Haskell knowledge is required to do this. Since I am not a beginner anymore (yay!) this is a bit biased but I think that it is very important to do our best to remember we were once beginners.

In the following sections I want to explain what I did and give some pointers to help beginners get started with Haskell and code a similar application:

  1. set-up the project
  2. create data types
  3. decode CSV lines
  4. write tests for the decoders
  5. parse and process a full file
  6. parse options from the command line
  7. tie it all together in an application

For each section I will recommend some things to start learning first and some others to learn later.

The full code can be found here.

Setting-up a Haskell project

This is something I didn’t have to do from scratch since I already had the Haskell build tool stack installed on my machine. From now on I am going to assume that you have installed stack already. Creating your first Haskell project is not that obvious. You need to learn how to declare a few things according to the “Cabal” format:

  • where to put your sources, your tests?
  • are you going to produce a library, an executable?
  • which libraries do you need as dependencies?

Fear not, there is a great Haskell command-line tool helping you with all of this: summoner. Go stack install summoner. Just follow the prompts and create your first project in no time, with the corresponding Github project and CI configuration. I think this is the best way to get started on some immediate coding. You will have plenty of time later to learn Cabal/Stack/Hpack/nix and become a pro at setting up projects.

Funnily enough, this step took me a bit of time. Indeed I am frequently using the ghci REPL (with stack ghci, we will talk about it later) when programming in Haskell and I have a global set-up for it in .ghci:

:set prompt "λ> "
import Prelude

:def hoogle \str -> return $ ":!hoogle --count=50 \"" ++ str ++ "\""
:def pointfree \str -> return $ ":!pointfree \"" ++ str ++ "\""
:def pointful \str -> return $ ":!pointful \"" ++ str ++ "\""

This configuration file gives me a cute ghci prompt but it also gives me access to some very useful Haskell tools like hoogle for searching type signatures, right in my REPL. Unfortunately when I started my ghci session, stack informed me that it didn’t know about Prelude. The reason is that the project generated by summoner uses a custom prelude which removes the standard Prelude from the search path. Custom preludes are definitely important to know in Haskell but they are also something which is best left for a bit later, when you want to get serious about Haskell development and make sure you are using “safe” functions as much as possible (no head :: [a] -> a for example). In my case I decided to switch to another custom prelude, Protolude.

Learn now

  • the cabal format
  • the cabal or stack commands for building/testing a project

Learn later

  • the hpack format, as an alternate format
  • other custom preludes: protolude, classy-prelude

Create data types

This is a really cool part of Haskell: cheap-to-create and powerful data types. For this application we want to have a datatype representing the input data and a data type for the output data. Wait, actually no. We just need a data type modelling what it means to be a transaction for YNAB and ways to:

  • create values of that type from a CSV line (next section)
  • output a CSV line from values of that type (next section)

Each transaction (or line in a ledger) must contain at least a date, an amount, a payee and possibly a category.

import           Data.Text (Text)
import           Data.Time (Day)

data LedgerLine = LedgerLine {
  date      :: Day
, amount    :: Amount
, reference :: Reference
, category  :: Maybe Category
} deriving (Eq, Show)

data Category = Category Text
  deriving (Eq, Show)

data Reference = Reference Text
  deriving (Eq, Show)

data Amount = Amount Double
  deriving (Eq, Show)

Here we are re-using some standard data types like Text and Day but wrapping them with custom data types. This is quite useful because we can’t make the mistake of putting a Reference into a Category for example. This also better documents the LedgerLine fields. The deriving clauses give us ways to display and compare values out of the box (think toString and equals in Java).

My actual datatypes are a bit more complicated:

data LedgerLine = LedgerLine {
  _date      :: Maybe Day
, _amount    :: Amount
, _reference :: Reference
, _category  :: Maybe Category
} deriving (Eq, Show)

newtype Category = Category Text
  deriving (Eq, Show, IsString)
  deriving newtype FromField
  deriving newtype ToField

newtype Reference = Reference Text
  deriving (Eq, Show, IsString)
  deriving newtype FromField
  deriving newtype ToField

newtype Amount = Amount Double
  deriving (Eq, Show)
  deriving newtype FromField
  deriving newtype ToField
  deriving newtype Num

First of all some CSV lines might not have a date yet if the transactions have been created today. Then:

  • I use newtype instead of data for Category, Amount, Reference to avoid paying the runtime cost of wrapping a value
  • an IsString instance is used for Category and Reference to be able to use string literals directly in tests (to write "restaurant" instead of Category "restaurant")
  • I am declaring a Num instance to be able to +, negate,… amounts later as if they were Doubles (see the small sketch after this list)
  • the field names are prefixed with _ to avoid potential clashes with variables having similar names: amount, category etc…
  • there are some instances for FromField and ToField for… see the next section :-)
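To make the effect of these derived instances concrete, here is a small standalone sketch (simplified versions of two of the newtypes, not the exact code of the project):

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString)
import Data.Text (Text)

newtype Amount = Amount Double
  deriving (Eq, Show, Num)

newtype Category = Category Text
  deriving (Eq, Show, IsString)

-- the Num instance lets us combine amounts as if they were Doubles
total :: Amount
total = Amount 10.5 + negate (Amount 3.0)

-- the IsString instance lets us write a Category as a plain string literal
restaurant :: Category
restaurant = "restaurant"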

Learn now

  • how to create data types and the difference between data and newtype
  • typeclasses and instances: Show, Eq, Num,…

Learn later

Decode a CSV line

This is becoming more involved. We need to find a library knowing how to parse CSV lines. The standard library for CSV files in Haskell is cassava. Like many other libraries for encoding / decoding data structures it uses type classes:

  • FromField to specify how to parse a value in a CSV column and transform it to a data type value
  • FromNamedRecord to specify how to parse a full CSV row and how to assemble the parsed values

In our case we want to parse at least 3 formats from different banks (Commerzbank, N26, Revolut), so we need an auxiliary data type:

data InputLedgerLine =
    CommerzbankLine LedgerLine
  | N26Line LedgerLine
  | RevolutLine LedgerLine
  deriving (Eq, Show)

and we can start defining parsers for each format:

instance FromNamedRecord InputLedgerLine where
  parseNamedRecord r =
        parseCommerzBank
    <|> parseN26
    <|> parseRevolut

    where
      parseCommerzBank = fmap CommerzbankLine $
            LedgerLine
        <$> (fmap unCommerzbankDay <$> r .: "Transaction date")
        <*> r .: "Amount"
        <*> r .: "Booking text"
        <*> r .: "Category"

      parseN26     = panic "todo N26"
      parseRevolut = panic "todo Revolut"

newtype CommerzbankDay = CommerzbankDay { unCommerzbankDay :: Day } deriving (Eq, Show)

instance FromField CommerzbankDay where
  parseField f = CommerzbankDay <$>
    parseTimeM True defaultTimeLocale "%d.%m.%Y" (toS f)

This is a whole jump in complexity all of a sudden, but it also brings quite some power! Think about it: in a few lines of code we have:

  • specified how to parse rows for the Commerzbank file format
  • specified how to parse each field and what the field names are in the CSV file
  • specified a date format for dates like 26.08.2019
  • specified that other parsers must be tried if the first parser fails (when we are parsing another format)

I am not going to unpack everything here but I will give you some pointers on what to learn.

Learn now

Learn later

Write tests for the decoders

Pretty cool, if you understand how the parsers in the above section work, you should be able to open a GHCi session and try them out (read the doc of the cassava library for the decodeByName function).

λ> import Data.Csv
λ> let commerzbankHeader = "Transaction date,Value date,Transaction type,Booking text,Amount,Category"
λ> let line = "30.08.2019,30.08.2019,debit,\"mobilcom-debitel Kd\",-15.00,Home Phone and Internet"
λ>
λ> fmap snd $ decodeByName @InputLedgerLine $ commerzbankHeader <> "\n" <> line
Right [CommerzbankLine (LedgerLine {
  _date = Just 2019-08-30,
  _amount = Amount (-15.0),
  _reference = Reference "mobilcom-debitel Kd",
  _category = Just (Category "Home Phone and Internet")})]

It works!

We also want to make sure this code will keep working if we make further modifications, so it is time to… write tests! There are many alternatives for writing tests in Haskell and I have my own preferences :-). I reached for my own library, registry-hedgehog, which is a layer on top of several libraries:

  • hedgehog for writing property-based tests
  • registry for assembling data generators without using typeclasses
  • tasty-hedgehog for executing hedgehog properties as Tasty tests
  • tasty-discover to automatically find tests in files and assemble them into a large suite

This is totally overblown for that little project since I haven’t written a single property so far. But I know the API well and like it since I made it to my taste :-). What do the tests look like?

test_parse_commerzbank_with_date = test "we can parse the commerzbank format with a date" $ do
  let line = "30.08.2019,30.08.2019,debit,\"mobilcom-debitel Kd\",-15.00,Home Phone and Internet"
  let result = fmap snd $ decodeByName (toS $ unlines [header, line])

  result === Just (CommerzbankLine $ LedgerLine {
      _reference = "mobilcom-debitel Kd"
    , _date = Just (fromGregorian 2019 8 30)
    , _amount = Amount (-15.0)
    , _category = Just ("Home Phone and Internet")
    })

A test is simply a piece of text describing the intention, some action (decodeByName) and an assertion (with ===). This is very similar to what I tried on the command line earlier.

Learn now

  • hspec: an easy library to start writing unit tests

Learn later

  • quickcheck/hedgehog: for writing property tests
  • tasty: a test framework dedicated to the structuring and the running of test suites
  • hspec-discover/tasty-discover: to avoid having to manually create test suites from tests in test modules
  • registry-hedgehog: for an alternative to typeclasses when creating data generators

Parse and process a full file

Now we are entering serious territory. When we parse files we have to be conscious of:

  • memory usage: it is not advised to read the full content of a file before processing it
  • resource usage: files must be properly closed after use to avoid leaking resources

None of this really matters for my application since the files I am processing are quite small (< 1 MB) and the application exits right after processing. Anyway I wanted to see if it was just as easy to do the “right thing” as to go for a quick and dirty solution.

There is a beautiful library for streaming data in Haskell, streaming, which I used before. I am in luck since someone created a streaming-cassava library to stream rows decoded by cassava. It provides a function decodeByName which is the equivalent of the Data.Csv.decodeByName I have used in the tests but it now operates on “streams” of data. A similar function, encodeByName, also exists to encode values to CSV rows. That’s fine but we also need to read and write those rows. I am going to decompose the whole processing into 6 parts and explain the data types involved in each step:

  1. read an input file to get a ByteString m () which is a stream of bytes
  2. decode the rows with decodeByName to get a Stream (Of InputLedgerLine) m ()
  3. deal with decoding errors
  4. process the input ledger lines and transform them to LedgerLine
  5. encode the lines as CSV rows with encodeByName to get back a stream of bytes ByteString m ()
  6. write those bytes to an output file

Read a file as a Stream of bytes

Again we are lucky, the streaming-with library gives us a function, withBinaryFileContents, to read the contents of a file as a stream:

withBinaryFileContents filePath $ \(contents :: ByteString m ()) ->
  doSomething contents

Not only are the contents streamed using the ByteString m () data structure, but withBinaryFileContents also makes sure the file is closed when the processing (doSomething) is finished, even if there are exceptions.

Decode the lines

The Streaming.Cassava.decodeByName function does the job for us. It takes a ByteString m () and returns a Stream (Of InputLedgerLine) m (), provided we have a FromNamedRecord typeclass instance for InputLedgerLine. Now is a good time to talk about those streaming data types: ByteString and Stream.

What is a stream of data?

Indeed I owe a bit of explanation on the “streaming” types: ByteString m r and Stream (Of a) m r. Why so many type parameters to represent streams? I will just explain Stream here because ByteString m r is just a specialization of Stream when we are streaming bytes.

NOTE: The ByteString name in Haskell (found in Data.ByteString or Data.ByteString.Lazy) could make you believe that we are dealing with strings and their underlying bytes. It is better to think about it as just a collection of bytes. The same goes for the ByteString m () type in Data.ByteString.Streaming, except that the bytes are streamed.

So, what is a Stream (Of a) m r? If you run :info Stream in GHCi, you will more or less read (I’m simplifying a bit here) that it is either:

  • Return r: returning a value r, nothing more to do. If you use the fmap operation you can “map” this value to something else (so Stream is a Functor)
  • Effect (m (Stream (Of a) m r)): creating a stream with the effect m. For example m = IO when we read from a file
  • Step (Of a (Stream (Of a) m r)): producing a value a and another stream of values: “what comes next”. Think of Of as a pair where the first element is strictly evaluated

I found it a bit confusing at first because of the various type variables (“do we really need a type for the return value? Yes we do”) but after a while I realized that it was the simplest thing to do to stream values and already super-powerful!
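For reference, here is a slightly simplified rendition of how these types are defined in the streaming package (the real definitions carry a few more strictness annotations and many instances):

data Of a b = !a :> b

data Stream f m r
  = Step   (f (Stream f m r))   -- produce one "layer", e.g. Of a (rest of the stream)
  | Effect (m (Stream f m r))   -- run an effect in m to get the rest of the stream
  | Return r                    -- done, with a final result r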

Deal with decoding errors

I think this part is difficult for beginners. I wrote that Streaming.Cassava.decodeByName was returning Stream (Of InputLedgerLine) m (). No error in sight there. How are the parsing errors signaled then? On the monad m. The full signature of decodeByName is:

decodeByName :: (MonadError CsvParseException m, FromNamedRecord a) =>
  ByteString m r -> Stream (Of a) m r

Meaning that the monad m must support errors which are CsvParseException. For example m can be ExceptT CsvParseException n where n is another monad. On one hand this is quite nice because we get back a data type Stream (Of a) m r where we don’t have to think too much about errors, it is mostly a stream of parsed values. It is easier to work with than Stream (Of (Either CsvParseException a)) m r for example. On the other hand the constraint on m is going to propagate to the rest of the application and things can become awkward, for example if another part of the application is requiring MonadError OtherException m. Then the compilation errors can become confusing and it is not immediately obvious how the error types can be aligned. In this application we nip the problem in the bud by doing the following:

  • catch the error as soon as possible
  • rethrow it as an exception in IO

rethrow :: (Exception e, MonadIO m) => ExceptT e m a -> m a
rethrow ma = do
   r <- runExceptT ma
   case r of
     Left e  -> throwIO e
     Right a -> pure a

rethrow assumes that we are working with values a in a monad which is ExceptT e m. It catches the errors of type e and, assuming that m is capable of doing IO it is going to throwIO the errors. What we do here is essentially transforming a constraint MonadError CsvParseException m into MonadIO m. We lose a bit in terms of abstraction, m is less general than it could be. But we gain in terms of inter-operability with other parts of the application.

Well, that is, if we can even apply rethrow to our stream! What we need is a function Stream (Of a) m r -> Stream (Of a) n r where m is ExceptT CsvParseException n. This function exists for much more general cases than Stream. It is called hoist. This function works on data types of the form t m a (t = Stream here) and is defined in the mmorph library. This is probably the most complicated transformation of this whole project. However situations with nested “monads/containers” (t and m) appear quite frequently in Haskell so after a while you will reach for hoist quite naturally.
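Concretely, the conversion we need could look like this (a sketch: rethrow and InputLedgerLine are the ones defined in this post, the module names are those of the libraries mentioned above):

import Control.Monad.Except (ExceptT)
import Control.Monad.Morph (hoist)
import Data.ByteString.Streaming (ByteString)
import Streaming (Of, Stream)
import Streaming.Cassava (CsvParseException, decodeByName)

-- decode the CSV rows and immediately get rid of the MonadError constraint
decodeInIO
  :: ByteString (ExceptT CsvParseException IO) ()
  -> Stream (Of InputLedgerLine) IO ()
decodeInIO contents = hoist rethrow (decodeByName contents)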

What if I hadn’t done any of this?

The MonadError CsvParseException m constraint would have “bubbled-up” to the top-level, up to the main function where Haskell would have asked me to do something like runExceptT to make sure I dealt with parsing errors.
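A sketch of what that would look like, with pipeline standing for the whole processing (a hypothetical name, it is not defined in this post):

main :: IO ()
main = do
  -- the constraint has bubbled up: pipeline :: ExceptT CsvParseException IO ()
  result <- runExceptT pipeline
  case result of
    Left err -> panic (show err)   -- panic and show come from Protolude
    Right () -> pure ()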

Process values

The values we read are of type InputLedgerLine but we want to get to a single format, LedgerLine. We are not that far since each parser is already normalizing the input values to a LedgerLine. We only need to extract that line from each InputLedgerLine case:

toLedgerLine :: InputLedgerLine -> LedgerLine
toLedgerLine (CommerzbankLine l) = l
toLedgerLine (N26Line l)         = l
toLedgerLine (RevolutLine l)     = l

Now, how can we use toLedgerLine to convert the lines in a Stream (Of InputLedgerLine) m () to get Stream (Of LedgerLine) m ()? By using the map function in Streaming.Prelude:

import qualified Streaming.Prelude as SP

let decoded = decodeByName contents :: Stream (Of InputLedgerLine) m ()
let processed = SP.map toLedgerLine decoded :: Stream (Of LedgerLine) m ()

I really encourage you to read the documentation on Streaming.Prelude because you will find there most of the operations you generally use on lists but this time on streams.

Encode the lines as CSV

Again streaming-cassava helps us here. encodeByName encodes our values, turning a Stream (Of LedgerLine) m () into a ByteString m (), provided we have a ToNamedRecord instance:

instance ToNamedRecord LedgerLine where
  toNamedRecord (LedgerLine date amount reference category) =
    namedRecord [
       "Date"   .= date
     , "Amount" .= amount
     , "Payee"  .= reference
     , "Memo"   .= category
     ]

Since all our fields have ToField instances which are derived automatically because they are newtypes of well-known types like Text and Double, we just have to specify the name of the fields in the output file, so that cassava knows in which column to put the values.
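The ynabHeader value passed to encodeByName in the final code block is not shown in the post; a plausible definition uses cassava's header function with the output columns in order:

{-# LANGUAGE OverloadedStrings #-}

import Data.Csv (Header, header)

ynabHeader :: Header
ynabHeader = header ["Date", "Amount", "Payee", "Memo"]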

Write a Stream to a file

streaming-with gives us writeBinaryFile which takes a ByteString m () and writes to an output file, again making sure that resources are properly cleaned-up even if there is an exception in the meantime.

To sum-up all those transformations in a block of code:

processAll =
  withBinaryFileContents inputFilePath $ \contents -> do
    let decoded   = decodeByName contents
    let processed = Streaming.map toLedgerLine $ hoist rethrow decoded
    let encoded   = encodeByName ynabHeader processed
    writeBinaryFile outputFilePath encoded

Thanks to all those libraries we have a nice isolation of responsibilities, and guarantees about memory and file handle usage!

Learn now

Learn later

Parse options from the command line

At the minimum we need to be able to read the name of the input file. This can be done with System.Environment.getArgs :: IO [String] and would be sufficient for this application. However you are going to need more elaborate parsing of command line options for a non-trivial CLI application. I have used a very well-known library for this: optparse-applicative.

With this library, we define a data type for the data we want to read from the command line:

data CliOptions = CliOptions {
  inputFile  :: Text
, outputFile :: Maybe Text
} deriving (Eq, Show)

The output file is left optional, since we can provide either a hard-coded name for the output file, result.csv, or append a piece of text to the input file name. The parser for CliOptions looks like this:

cliOptionsParser :: Parser CliOptions
cliOptionsParser = CliOptions
   <$> strArgument
       ( metavar "INPUT FILE"
      <> help "Input CSV file" )
   <*> option auto
       ( long "output-file"
      <> short 'o'
      <> value Nothing
      <> help "Output CSV file" )

This style of parser definition is very similar to the one we used for FromNamedRecord to parse CSV fields. It relies on the notion of an Applicative (hence the library name) and on a series of helper functions to specify the options:

  • strArgument parses a string given as an argument (so it is not optional)
  • option parses an option (starting with -- on the command line) and the exact type of parser is auto meaning that it will parse everything with a Read instance

You can also see some additional information, like the option name (long and short) for the output file. This information is used both for parsing and for documenting the command line options. Talking about documentation, how do we provide a --help option? optparse-applicative gives a way to “wrap” a Parser with more information:

defineCliOptions :: ParserInfo CliOptions
defineCliOptions =
   info (cliOptionsParser <**> helper) $
   header "ledgit - massage ledger files" <>
   progDesc "Transform a CSV ledger file into a suitable YNAB file"

In defineCliOptions we enrich the CliOptions with a helper option and provide additional information to our parser with “modifiers”:

  • progDesc adds a text description of the program displayed under the “Usage” section showing a summary of the options
  • header adds an additional header when we display the help
  • those 2 modifiers are being “appended” into one with <> (yes they form a Monoid)

While the whole library is quite powerful, there is quite a lot to explain if you really want to understand how it works: parsers, Applicative, Monoid, Read,… Yet I more or less took the examples from the documentation, changed a few things and it worked immediately.

Learn now

Learn later

Tie it all together

In reality you could put all the code in one Haskell file (you could even create a stack script) and you would be done. For fun I decided to create small components to isolate the different pieces of the application, using “records-of-functions”:

  • Data.hs contains all the data types + the CSV encoders/decoders
  • Importer.hs contains the Importer component tasked with reading the file and decoding it
  • Exporter.hs contains the Exporter component which takes a stream of lines and outputs it to a file
  • App.hs just connects the 2
  • Ledgit.hs calls the options parser and creates the App

The Importer

Let’s have a closer look at those components. The Importer is defined as:

data Importer m = Importer {
  importCsv :: (Stream (Of LedgerLine) m () -> m ()) -> m ()
}

It is kind of weird. Instead of just exposing an interface like importCsv :: Stream (Of LedgerLine) m () returning the decoded lines, it takes a “consumer” of Stream (Of LedgerLine) m () and executes it. This is because of a limitation of the Streaming library and the libraries we have been using so far.

The Streaming library does not support any resource management. The resource management (properly closing file handles) is done with withBinaryFileContents which takes a function consuming the file contents. If we want to use that library and define a component we need to propagate the same pattern.

There is actually quite a profound principle at play here. In programming, some “things” can be either defined by how they are produced or how they are consumed. For example you can define the Maybe datatype by either

data Maybe a = Just a | Nothing

or

newtype Maybe a = Maybe (forall b . (a -> b, b) -> b)

In the second case you specify how to “consume” values that are Just a or values that are Nothing.
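To make this concrete, here is a small self-contained sketch of this “consumer” view (renamed MaybeC to avoid clashing with the Prelude), together with a conversion showing that it carries the same information as the usual definition:

{-# LANGUAGE RankNTypes #-}

newtype MaybeC a = MaybeC (forall b . (a -> b, b) -> b)

justC :: a -> MaybeC a
justC a = MaybeC (\(onJust, _onNothing) -> onJust a)

nothingC :: MaybeC a
nothingC = MaybeC (\(_onJust, onNothing) -> onNothing)

-- going back to the usual Maybe: pass Just as the "just" consumer and Nothing as the default
toMaybe :: MaybeC a -> Maybe a
toMaybe (MaybeC consume) = consume (Just, Nothing)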

If you squint a bit you will also recognize a “continuation-like” type in importCsv :: (a -> r) -> r. The computer science literature is full of such transformations, from “direct style” to “continuation-passing style”. This is a lot of hand-waving, just to justify the weird shape of the Importer interface :-).

Otherwise you will notice that the Importer does not mention its “configuration”, there is no inputFilePath to read from in its interface. This is because this data will be provided by the wiring we do in Ledgit.hs.

The Exporter

Nothing special here, we take a stream of lines and export each of them to a file. Underneath, the implementation uses the functions we have seen before: writeBinaryFile and encodeByName.

data Exporter m = Exporter {
  exportCsv :: Stream (Of LedgerLine) m () -> m ()
}

The App

The App just connects the 2 main components; its implementation is super simple:

data App m = App {
  runApp :: m ()
}

newApp :: Importer m -> Exporter m -> App m
newApp Importer {..} Exporter {..} = App {..} where
  runApp = importCsv exportCsv

The “wiring”

Now we need a way to make an App with its Importer, its Exporter and the CliOptions parsed from the command-line. For this we use the registry library and define a registry like so:

newRegistry :: CliOptions -> Registry _ _
newRegistry cliOptions =
     fun (newImporter @IO)
  <: fun (newExporter @IO)
  <: fun (newApp @IO)
  <: val cliOptions

We put all the values and component constructors into a Registry and later ask for a fully-wired application:

runApplication :: IO ()
runApplication = do
  cliOptions <- execParser defineCliOptions
  let registry = newRegistry cliOptions
  let app = make @(App IO) registry
  runApp app

That’s it, registry automatically calls all the constructor functions and wires the App. You can also write this code by hand, there’s no real need to use registry for such a simple application.
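For comparison, the hand-written wiring could look like this (a sketch: it assumes that newImporter and newExporter both take the CliOptions as a parameter, which the post does not show):

runApplicationByHand :: IO ()
runApplicationByHand = do
  cliOptions <- execParser defineCliOptions
  let importer = newImporter cliOptions
      exporter = newExporter cliOptions
  runApp (newApp importer exporter)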

Learn now

Learn later

There are many other ways to define and wire Haskell applications.

Summary

This blog post presents a simple Haskell application which can be seen as the “template” for many CLI applications. We have:

  • command-line options parsing
  • “business” data types
  • files input / output
  • streaming
  • encoding / decoding

There is nonetheless a learning curve which we should not underestimate; we need to:

  • know how to set-up a new project
  • know how to compile, run tests, install the application
  • know how to find relevant libraries in the Haskell ecosystem
  • learn about data and newtype
  • learn about type classes and instances
  • be comfortable with the Applicative typeclass and combinators
  • understand a minimum of monad transformers

I hope this blog post will contribute to making this learning curve less steep by giving pointers on things to start learning right away, and other things to read / practice later.

Concluding thoughts

It occurred to me that being computer literate will be an important part of the “citizen-toolkit” in the future. There is no reason why we should not be able to access all of our data in the future through well-crafted APIs. When this happens, I hope someone will use Haskell and write a similar blogpost about REST access (or whatever API standard), blockchain auditing, security libraries etc…


24 August 2019

ICFP 2019

ICFP 2019 in Berlin is now over. It was a fantastic week packed with great content. I also particularly enjoyed the “Hallway track” this year and the numerous discussions I had around effects, Haskell, Functional Programming, modularity.

You will find below my notes for some of the talks I went to, they are most likely to be useful to me alone but don’t hesitate to ask me for more details or to head to the conference program to find links to specific papers that have sparked your interest.

Pre-conference

Idris 2

Idris 2 is the reimplementation of the Idris language using

  1. Idris 1 as its implementation language
  2. a new type theory: QTT (Quantified Type Theory)
  3. Chez Scheme to compile the QTT language

Outcomes

  1. implementing Idris in Idris shows if the language scales to large programs (answer: not entirely, hence Idris 2). Some other “productivity” features have been added, for example “generate” to automatically split cases and find an implementation. And thanks to QTT more implementations can now be found.

  2. QTT is nice because it allows the safe erasing of types which are just proofs at run-time. Such an “erasable” proof is annotated as 0. Some other types are dependent types which are needed at run-time for pattern matching (annotated as 1). This is also a way to introduce linear typing: lmap : (1 a -> b) -> (1 xs : List a) -> List b (a linear map function). That’s still limited for now, there is no polymorphism for linearity annotations and no “borrow checker” but Edwin Brady said he would love to work on that (and even safely support imperative programming with types!)

  3. The compilation time and generated code are now competitive with GHC (except for higher-order functions, more work to do there) thanks to the compilation to Chez Scheme (other Schemes are also supported with different trade-offs, like the ease of creating an executable). In the future OCaml could be a good backend as well (in the ML workshop there was a talk on how to FFI Idris with OCaml to access an HTTP library)

Shifted names

You remember all the presentations on the lambda calculus talking about renaming and substitution and the difficulty of dealing with free variables to make sure that names don’t clash?

The usual answer is “say we introduce a ‘fresh’ name here”, meaning we need to keep some state around. “Shifted names” are a simple idea: prior to doing anything with something named x, put an index on all variables named x in a term and increment the index (so x becomes x0 the first time, x1 next and so on). This means that we can substitute variables for values, rename variables, introduce names, and so on without requiring any “freshness condition”. All those operations have been formalized and proven correct in Coq, which then helps when proving programs, like proving the correctness of a CPS transformation (because a proof might just have to establish that a variable named x is not the same as a variable named k, which fortunately Coq doesn’t have trouble believing).

Deferring the details and deriving programs

How to program in Agda, do imperative programming (swap 2 variables, loop over a list to sum elements,…) but separate proofs from the program to make the code easier to read?

Liam introduces an Applicative to delay proofs. Those proofs can be provided by some preconditions on programs, for example by the loop invariant when summing a list: if i < length list then you know that it is safe to access an element of the list, list !! i, inside the loop.

The code is available at https://github.com/liamoc/dddp

Cubes, cats, effects

I didn’t get much of that talk but it was entertaining nonetheless!

Conor McBride explains that the equality used by type theorists is usually quite boring because it starts from x = x. Not only is this quite obvious, but the real proof that those 2 labels ‘x’ are the same depends on the implementation of your language. Then the equality proofs that you get in proofs for type systems are generally quite weak.

So the main idea here is to develop a notion of types where we know much better how a value of a given type was built. It must have a “path” from a type T0 to a type T1. Not only that but we track which part of the typing context we used to produce a given type judgment. Conor hopes that this will give us a much tighter way to write proofs because some of the functions we use to write typing judgments will now be accompanied with corresponding proofs showing which type equalities hold. He makes a link with category theory in the sense that it is like putting our hands on a value proving that “the diagram commutes”.

Generic Enumerators

How to enumerate constrained data for testing? We can use type indexed families to constrain data, but how can we enumerate them?

It is also easy to generically enumerate recursive types (think “sum of products”). We can go further by using dependent types and generate indexed families (types indexed by other types) by seeing an indexed family as a function from an index to the description of a data type. A generalized coproduct can be enumerated by providing a selector and a list of enumerators.

This is a dependent pair because the chosen selector will trigger a different data type. The paper shows an example of how to generate sized trees (with a statically-known size). You need to supply an enumerator to enumerate all the ways to split the size l into n + m and the library does the rest.

Augmenting type signatures for program synthesis

This is a way to add more information to type signatures to get more programs to be synthesized. The application is to be able to program for more back-ends: CPUs, GPUs,… by modelling libraries for those backends and automatically using the best of them in user code.

The author shows how to synthesize code for a C function void gemv(int m, int n, float *a, float *x, float *y) once a few properties on m, n, a, x, y are given. Then the program synthesizer uses code templates like a for loop, an if then else to try to find the code fulfilling the properties. And the user can use a small logic language with facts, conjunction, unification, negation,… to drive heuristics for the synthesis.

They got results for rewriting some existing code to better libraries for ML, signal processing, string processing, etc…

FreezeML: complete and easy type inference for first-class polymorphism

Can we have something between ML and System F in terms of polymorphism and inference?

  • ML inference is decidable but doesn’t have first class polymorphism
  • that’s the opposite for System F

For example if you have singleton :: forall a . a -> [a], and id :: forall a . a -> a, what is the type of singleton id?

  • in ML: [a -> a] (id is instantiated).
  • in System F: [forall a . a -> a]

The idea? “freeze” which instantiation you want with 2 operators @ or $ to indicate if you want to keep “forall” polymorphism or not. This brings back type inference.

ICFP 2019 Day 1

Blockchains are functional

Manuel Chakravarty did a great presentation of blockchains and how functional programming and formal methods apply to that space. The general approach at IOHK is to go from formal specifications which are verified using Coq (+ Agda and Isabelle) to a direct translation to Haskell.

Since the resulting Haskell code is not performant and lacks concrete considerations like networking and persistence, they program the real implementation in Haskell and use property-based testing to test that implementation against the model.

The other important point is that they use a model of transactions which is called the UTxO (Unspent transaction output) model. That model is a more functional approach than the “account model” which can be found in Ethereum (where you just model the state of wallets). Thanks to this model and a restricted functional language used for on-chain code, they can predict more precisely how much fees will have to be paid for the execution of a contract, contrary to Ethereum again where some contracts can not be executed because they haven’t got enough “gas” (gas is a retribution system to incentivise nodes in the system to validate transactions).

Which language are they using for those validations? Plutus Core: this is based on System F-omega with recursion, which is a very well studied type system. That being said, they realized that going from the System F papers to a real implementation raised many interesting questions that they cover in some upcoming papers this year. They formalized the language in Agda and have an abstract interpreter for it.

The whole “smart contract” language is Plutus where you write Haskell programs, delimiting what needs to be executed on the chain with TemplateHaskell (enabling the generation of Plutus core code thanks to a GHC plugin).

It is still hard to program in Plutus for regular “application developers” so they have a DSL dedicated to financial contracts called Marlowe. Similar DSLs for other domains are in preparation.

Compiling with continuations, or without? Whatever.

https://www.cs.purdue.edu/homes/rompf/papers/cong-preprint201811.pdf

Should we use an intermediate language with continuations or not?

The idea of this paper is that it is possible to introduce a “control” operator, which will decide what is translated to “2nd class continuations” when it makes sense. Experiments on subsets of Scala show that more labels and gotos are emitted this way, and we get more stack allocations than before (so fewer heap allocations).

Upcoming work: adding parametric polymorphism and first class delimited continuations in the user language (by building a type-and-effect system).

Lambda calculus with algebraic simplification for reduction parallelization by equational reasoning

How to parallelize the evaluation of complex expressions?

For example sums over an array, folds of a tree? A lambda calculus with “algebraic simplification” can help there. It consists of:

  • the normal lambda calculus + semi-ring operations (addition / multiplication)
  • a “delta abstraction” where the evaluation of the body can be simplified with the algebraic properties of + and * before passing arguments

This allows the parallel evaluation of functions and arguments. This even works for more complex expressions containing conditionals:

-- "breaking" sum
sum (x : xs) acc =
  if x < 0 then acc else sum xs (acc + x)

This function, applied to [5, 3, 1, 7, -6, 2] for example, will be split in two evaluations, one for [5, 3, 1] and one for [7, -6, 2]. The first list is then partially evaluated with a continuation so that we try to evaluate \f -> if ... else if 5 > 0 && 3 > 0 && 1 > 0 then f (5 + 3 + 1). Then we can simplify 5 + 3 + 1 to 9.

The other important part of this work is that there is a type system based on linearity conditions to say when it is “safe” to expand or factor expressions. Interestingly the best expansions or factoring of expressions happens when variables (in polynomials for example) are used linearly.

Fairness in Responsive Parallelism

For cooperative scheduling of threads a notion of fairness is developed and an algorithm has a provable estimation of the expected execution time of high-priority tasks.

Narcissus: Correct-By-Construction Derivation of Decoders and Encoders from Binary Formats

Encoding and decoding binary formats can have lots of bugs, in particular in network stacks. Those bugs can be exploited by attackers. Some issues: formats are context sensitive (parsing the rest depends on a length field). They are also under-specified.

This framework allows the definition of non-deterministic specifications of formats in Coq and then generates encoders and decoders from that specification.

Interesting considerations:

  • an encoder can fail! Indeed encoders can have invariants which must be respected. If you pass a buffer to write in, the buffer might be too small, or the number of elements in a list can be too big.
  • a decoder might decode more formats than what the encoder can encode
  • encoders need to handle state to support things like the building of compression tables

Key features of Narcissus:

  • readability of the format specifications
  • they are extensible, context-sensitive
  • there can be multiple choices in encoding (to version the payload for example)
  • there’s an extension for checksum

This work was validated for many encoders for the TCP/IP stack: Ethernet, ARP, IPv4, TCP, UDP and used in MirageOS (and Coq can extract OCaml code). A new format like UDP is 10-20 lines of format + 10-20 of special data types. The performance loss is around 10% (compared to manual code)

More work to come: certified extensions to extract code in other languages without an OCaml dependency.

Question: how hard would it be to apply this to protobuf? There is a paper on verifying protobuf (https://www.cs.purdue.edu/homes/bendy/Narcissus/protobuf.pdf)

Closure Conversion is Safe for Space

Formally verified compilers (like CakeML) cannot reason about intensional properties: is my generated program safe for space and time?

There are some theorems for memory-managed languages already. However they don’t work for closures and there are some notorious examples where V8 JavaScript closures are leaking memory.

This work: first formal proof that closure conversion is safe for space by adding profiling semantics for source and target.

The main drawback of this approach IMO: the closure environment they use is proved to be safe but it is also not memory-efficient. On the other hand the closure environment chosen in V8 is unsafe (and fully mutable). This is a known issue with V8 and no one knows how to do it better yet.

Selective Applicative Functors

I’m a big fan of Andrey Mokhov’s work (for example what he has started with algebraic graphs). Andrey presents “Selective Applicative Functors” which are kind of between:

  • applicative functors where you have independent computations that you can combine
  • and Monads where you can have dependent, arbitrary computations where the result of one computation creates a new one.

Selective applicative functors are for the case where you want to start independent computations but you can use the result of one of them if it finishes earlier and cancel the other. So on one hand the “shape” of computations is known in advance but you still have “conditional” computations. He dubs this super-power “Speculative execution”. The many applications of selective functors can be found in the paper.
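For reference, the core interface is a single typeclass (as defined in the paper and the accompanying selective package):

class Applicative f => Selective f where
  -- given an effect producing an Either and an effect producing a handler,
  -- the handler only needs to be applied when the Either turns out to be a Left
  select :: f (Either a b) -> f (a -> b) -> f b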

Coherence of Type Class Resolution

A very well rehearsed presentation showing how to prove coherence of typeclasses with no overlapping instances nor orphan instances. In the presence of super-classes, can we be sure that the same function is always picked when it comes from 2 different type classes inheriting from the same parent? Does the resolution algorithm also work in the presence of the FlexibleContexts extension, where it is possible to locally specify what one type should be if a type class has several type parameters?

The answer seems intuitively “yes” but it turns out to be really hard to prove! https://arxiv.org/pdf/1907.00844.pdf

ICFP 2019 Day 2

Keynote: Rosette - solver-aided tools

Rosette is built on top of Racket and is a platform for implementing DSLs which can be translated to SMT constraints (and solved with Z3 for example). The trick they employ for this is a variant of “bounded model-checking” where they can efficiently explore, using symbolic variables, the possible executions of a program without incurring an exponential number of generated constraints (still, polynomials can be large!).

They have all sort of applications, from languages used for education, OS verification, type systems proofs, and so on. They can show that implementations are correct (they found several bugs in a sub-module of Linux), find counter-examples, do program synthesis to fix programs and so on.

Demystifying differentiable programming: shift/reset the penultimate backpropagator

When doing machine learning and computing gradient descent we need to compute derivatives of functions with multi-variables.

There are 2 ways of doing this: Forward Differentiation which is simple but not very efficient when we compute a large number of variables to get a single result, and Backward Differentiation which is more complicated because more intermediate results have to be kept around (but more performant).

It turns out that the structure of those computations form continuations. The paper shows how to use Scala continuations to write code which can be backward-differentiated automatically using the shift/reset continuation operators for its implementation.

Efficient differentiable programming in a functional array-processing language

This talk is almost the opposite of the previous one! It takes a Forward Differentiation approach for a functional language (a subset of F# with efficient array operations) and shows that a whole lot of techniques can be used to optimise the generated code (sub-expressions elimination, fusion, loop fission, partial evaluation,…) and be on par with the best imperative libraries.

BobKonf summer edition

Relative Functional Reactive Programming

This is a very promising idea which I see possibly replacing actors with something which is better typed. The idea is to extend functional reactive programming which uses events and behaviours across time to events and behaviours across space and time.

Then messages become “perceptions” which travel across space and time (until they reach a node) and a “dialog” between a client and a server becomes a list of perceptions. Using this modelling + CRDTs the talk shows that we can develop a full peer-to-peer todo application.

A library is currently being implemented on top of the reflex framework in Haskell (to show that it works for real :-)) but I would like to see how this could be integrated with Ivan Perez’s “Monadic streams”: http://www.cs.nott.ac.uk/~psxip1/papers/2016-HaskellSymposium-Perez-Barenz-Nilsson-FRPRefactored-short.pdf. There are also a few hard questions to solve which are the subject of the thesis of the presenter, for example how do you “garbage-collect” events which are not necessary anymore (the current approach supposes an infinite memory!).

Statistical testing of software

Can we measure software quality, really? Two development processes, Cleanroom software engineering and Software reliability engineering, give us some answers:

  • Clean room aims to totally prevent bugs, not fix them. You start from a formal specification and the team separates between developers refining the spec while the testers write tests just to measure the quality of the product. The developers can not test, not even access a syntax checker!
  • Software reliability engineering: describe and quantify the expected use of your system to spend most of your resources on the cases which really matter.

This gives us some ideas for doing statistical testing:

  • model the usage of a product
  • what is really called? How often?
  • how often are calls related to each other?

From that you can create QuickCheck/Hedgehog generators which can vary their behaviour based on some current state. This means that we are effectively working on Markov chains (a state diagram containing the probabilities of firing transitions). And there is a lot of literature on how to cover Markov chains, how often we get to a given state and so on.
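A minimal sketch of that idea with Hedgehog: the next “call” is generated with probabilities that depend on the current state (the states and weights here are made up, they are not from the talk):

import Hedgehog (Gen)
import qualified Hedgehog.Gen as Gen

data State = Browsing | CheckingOut deriving (Eq, Show)
data Call  = Search | AddToCart | Pay deriving (Eq, Show)

-- transition probabilities of the usage model, i.e. a small Markov chain
nextCall :: State -> Gen Call
nextCall Browsing    = Gen.frequency [(8, pure Search), (2, pure AddToCart)]
nextCall CheckingOut = Gen.frequency [(3, pure AddToCart), (7, pure Pay)]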

Then we can measure reliability by counting the number of tests passing before the first failure. Even better we can measure how many times an edge in the Markov chain graph is failing, so that failing paths which are not called very often will not count for much.

Liquidate your assets

This is an example of using Liquid Haskell with a “Tick” monad to count the number of times a resource is being accessed. From there we can statically enforce that some algorithms have specific bounds and check some properties such as how long it should take to sort an already sorted list. Niki Vazou also showed how you can more or less interactively write proofs using Liquid Haskell and a small proof library.

A functional reboot for deep learning

Any talk by Conal Elliott is worth attending (cf “The essence of automatic differentiation”, or “Compiling to categories”). This time Conal wants to reboot machine learning by removing all non-essential complexity and generalizing things where they are actually unnecessarily specialized.

For now this is more a statement of intent than an accomplished vision, but he already has some of the bricks thanks to his previous work on automatic differentiation and his Haskell plugin for “compiling to categories”.

Conal argues that modelling neural networks as graphs is wrong. This hides the fact that we are effectively working on functions and forces a sequentiality of computations (in “layers”) which doesn’t have to be there. We also have much better ways to represent the search space we are exploring than forcing everything into multi-dimensional arrays. In particular representable functors offer a much better way to represent our data, while still being isomorphic to arrays, which permits efficient computing on GPUs.
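As a reminder, a representable functor is roughly the following class (a simplified sketch of the one in the adjunctions package, whose real superclass is Distributive):

{-# LANGUAGE TypeFamilies #-}

class Functor f => Representable f where
  type Rep f
  tabulate :: (Rep f -> a) -> f a   -- build a structure from a function on its indices
  index    :: f a -> Rep f -> a     -- look up the value stored at an index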

He is asking for help to pursue his vision, both from the Haskell community (there is still work to do on his plugin) and from the ML community (to better understand some statistical aspects and ways of thinking of ML problems).

ICFP Day 3

Call by need is clairvoyant call by value

It is surprising to me that there are still things to say about the lambda calculus. In particular here we address the difficulty of dealing with the operational semantics of call by need.

Because call by need memoizes the function parameters once they are called, in order to give a proper description of their semantics you need to maintain a heap for those values. It makes it hard to predict the runtime behaviour of programs, as any Haskeller debugging a memory leak would know. It turns out that it gets much easier to have an evaluation strategy that acts as if you had evaluated only the arguments which will be used later on. This is a form of non-determinism: what if we knew the result in the future? I don’t fully understand the details but it seems that it is like programming in a monad evaluating a tree of possibilities, eventually being able to say which path should be taken. Ultimately this simplifies proving the properties of call by need languages.

Lambda: the ultimate sublanguage

What if you taught functional programming using a much simpler language than Haskell? That language is System F-omega, which is effectively the core of Haskell.

By using only this language Jeremy Yallop is teaching an 8-week course, progressively introducing lambda-calculus, polymorphism, modules, existential types,… One of the assignments for example presents the Leibniz equality: “2 things are the same if you cannot distinguish them” and the Church equality: “2 things are equal if they can always be reduced to the same thing” (I don’t think I got this right) and uses the calculus to show that they are the same notion of equality (one gives you the other).

Most participants report that the course was pleasingly challenging and gave them a thorough understanding of the field. You can find the course online here: https://www.cl.cam.ac.uk/teaching/1617/L28/materials.html

Workshops

MiniKanren tutorial

MiniKanren is a “relational programming” library with only 4 operations. It started in Scheme but has now been ported to 15 languages.

It can be put to many uses, from unification in type systems, to program synthesis (it can find quines!), to better querying pharmaceutical data. This should probably be my first thought if I ever stumble on a problem requiring difficult unification in the wild.

Modular effects in Haskell through effect polymorphism and explicit dictionary applications - A new approach and the μVeriFast verifier as a case study

Last year Dominique Devriese proposed an extension to Haskell called DictionaryApplications. With this extension you can locally pass an explicit dictionary where a type class instance is required. More importantly the paper proves that it is possible to have this feature without compromising coherence and global uniqueness of instances. How? By acting “as if” you were creating a “wrapper” newtype on the fly and coercing it back and forth with the data type for which you want an instance.

Ok, that gives you super powers, but in practice how useful is this? In order to show the usefulness of this feature Dominique has taken an existing OCaml application which was using lots of effects lumped together: reader, writer, non-determinism, state, etc… and has rewritten it using typeclasses and dictionary applications.

The resulting code is a lot clearer and quite flexible since it allows the local creation and passing of instances which can rely on IO for their creation. For example you can pass a State s effect around (to computations needing MonadState s) which is effectively implemented with a mutable ref. The paper goes further and even abstracts over the creation of such a “mutable cell” by making it an effect in itself.
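
To give an idea of what such an instance looks like in standard Haskell today, here is a minimal sketch (the names are mine): an IORef-backed MonadState instance, which currently has to hide behind a newtype, and which dictionary applications would let you build locally and pass explicitly instead.

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Monad.IO.Class (liftIO)
import Control.Monad.Reader (ReaderT, ask, runReaderT)
import Control.Monad.State.Class (MonadState (..))
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- a MonadState s "dictionary" backed by a mutable reference
newtype RefState s a = RefState (ReaderT (IORef s) IO a)
  deriving (Functor, Applicative, Monad)

instance MonadState s (RefState s) where
  get   = RefState (ask >>= liftIO . readIORef)
  put s = RefState (ask >>= \ref -> liftIO (writeIORef ref s))

runRefState :: RefState s a -> s -> IO a
runRefState (RefState m) s0 = newIORef s0 >>= runReaderT m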

Haskell Implementors Workshop

Haskell use and abuse

Haskell was used on a secret project at X, the Google incubator. They had 3 months to prove the viability of a system containing both hardware and software (that looked a lot like accelerating machine learning to me).

The result was 800 files, 200,000 lines of code, and more than 70 extensions used. What worked, and what did not?

  • the good: lens library to index complex data structures, a good data model representing the hardware, an array library with statically-typed shapes, smart people attracted to the project

  • the bad: one person full-time to integrate the project into the rest of the infrastructure of Google, too many programming styles, too many DSLs that only a few team members understand, ugly type errors, ugly type signatures (because dependent typing is too weak)

In passing they showed a trick. How to pass parameters to a pattern synonym in Haskell? Use implicit parameters! (and one more extension,…)

Configuration, but without CPP

How to deal with #ifdef in many Haskell libraries?

Problems: it is hard to test, we rarely test all the configurations, it’s text-based (tooling is out), it’s dynamically scoped so hard to write correctly.

A quick analysis shows it is used for:

  • not configurable: ghc version, base version, platform
  • configurable: package dependencies, user defined predicates

How to replace the configurable part?

Solutions: there are already solutions (100 papers) to this problem! One promising idea: “Variational typing”, describe in a type the different possibilities for an expression:

  e :: D <Int, Char>
  e = D <5, 'a'>

For example the syntax for having GHCJS specific code would look like:

  GHCJS< liftIO $ ..., return Nothing>

This type system encodes static choices about a program; a choice applies to the whole program. A few rules can be formalized:

  • distribution: D<A -> B, C -> D> = D<A, C> -> D<B, D>
  • idempotence: D<A, A> = A
  • domination: D<D<A, B>, C> = D<A, C> (if you choose the left side, always choose the left side)

In terms of performance, typechecking should still be fast (in particular because practically speaking there are few people nesting #ifdefs).

There are some variants of this idea for module systems, for example: “Variation aware module system” where import and exports can depend on the variation.

HIE files in GHC 8.8

hiedb indexes all the source files and gives fast access to documentation, types, jump to definition (local or not), find usages, etc… (and there’s a Haskell API).

Future work:

  • show typeclass evidence (generate that into hie and make it searchable)
  • integrate with .hi files to keep only one format
  • capture more type information in the typed AST (some leaves are not currently typed)

Explicit dictionary application: from theory to practice

With explicit dictionary application we

  • remove the need for *By functions like nubBy
  • avoid creating All, Any, … newtypes to get different monoids with foldMap
  • use an IORef s to implement a MonadState s instance
  • replace DerivingVia, for example use a function semigroupFromMonoid to implement a Semigroup instance

Potential problems:

  • global uniqueness of instances: the following should not be allowed

    insert :: Ord a => a -> Set a -> Set a
    insert @{reverseOrd} 1 (insert 1 (insert 2 empty))
    -> [1, 2, 1]

  • coherence: with (Eq a, Ord a) the Eq instance used should be the same as the parent Eq of the Ord instance

    whichEq :: (Eq a, Ord a) => a -> Bool
    whichEq x = x == x

The previous work showed that the use of representational roles allows the type checker to exclude problematic programs. This year a plan for a better implementation was proposed, but that proposal needs some (wo)man-power to be completed.

So the talk is also a big ask for help!

State of GHC

There’s a lot of activity in GHC and its language (in no particular order, and missing many more items):

  • 8.8 (Aug 19)
  • visible kind application: g :: t @Type a -> ...
  • pattern signatures to bring types into scope: Just (x :: a) -> [x, x] :: [a]
  • around 70 new contributors each year (for the past 6 years)
  • 10 PRs per day
  • 1400 new issues per year (not sure if that’s a good sign!)
  • linear types are coming!
  • stand alone kind signatures
  • hole-fit plugin
  • pattern matching with better overlap/coverage checks
  • better compile-time performance
  • visualisations of GHC core

More things cooking:

  • class morphisms: proposal coming up
  • GHCJS: plugins, more like normal GHC, profiling
  • 8.10: better codegen
  • NoFieldSelectors proposal

GHC proposals:

  • 56 proposals accepted, some of them still need to be implemented
  • more proposals accepted than rejected, thanks to the good discussions happening before a proposal is submitted
  • but it is not always smooth. For example ExtraCommas: 166 messages on the thread, unclear outcome, unhappy people

GHC devops:

  • 5 releases since last year
  • everything moved to gitlab
  • build on Hadrian
  • 8.8.1 this weekend
  • 8.10.1 release next February (small release)
  • 8.12.1 branch in May 2020
  • more platforms tested (e.g. Alpine Linux), integer-simple is better tested -> releasing the core libraries is more of a challenge
  • a Hackage repository (head.hackage) to check ~150 packages against GHC HEAD
  • compiler performance is now tested in CI

If Hadrian goes well, the make-based build can be deprecated in 8.12.

AND THAT’S IT!

29 September 2018

ICFP 2018

This year again ICFP and the Haskell Symposium are full of interesting talks that I want to present briefly. I highlighted a single sentence in each section to try to summarize the main idea so that you can skim the whole thing and stop at what looks interesting to you.

ICFP Day 1

Capturing the Future by Replaying the Past (Functional Pearl)

Delimited continuations are incredibly powerful because they allow us to implement any monadic effect. They also allow some elegant implementations, for example the 8 queens problem can be implemented using a non-determinism effect. Unfortunately delimited continuations are not supported by every programming language. Or are they? Actually the authors of this paper show that mutable state + exceptions is all that is required to implement delimited continuations (except in C, read the paper for the details). There is a performance hit if we compare their implementation with direct support in languages which have delimited continuations, but it is not that bad.

For me this paper was the opportunity, one more time, to try to wrap my head around continuations. For example I realized that the notation with reset and shift was a bit confusing to me. In an expression such as reset (1 + shift (\k -> k 2)) I actually need to focus my attention on the body of the shift rather than the body of the reset. The body of the shift says “do k 2”. What’s k? It is whatever is outside of the shift and inside the reset, so that’s 1 + _, and eventually the result is 3. It becomes a bit trickier to interpret expressions like reset (shift (\k -> 1 + k 2) * shift (\k -> 1 + k 3)) but it is still possible. “Do 1 + k 2”. What is k? It is x * shift (\k -> 1 + k 3), which by the previous reasoning is k = \x -> 1 + x * 3. Applied in 1 + k 2 this gives 1 + 1 + 2 * 3, which equals 8. This also explains the term “thermometer continuations”: the authors store a list of executions created by the various shifts in a reset expression.
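
As a sanity check on the arithmetic, here are the two examples written with the shift and reset combinators which, if I remember correctly, are exported by Control.Monad.Trans.Cont from the transformers package (this is my own check, not code from the paper):

import Control.Monad.Trans.Cont (evalCont, reset, shift)

-- reset (1 + shift (\k -> k 2)): k is “the rest of the reset”, i.e. (1 + _)
three :: Int
three = evalCont $ reset $ do
  x <- shift (\k -> pure (k 2))
  pure (1 + x)

-- reset (shift (\k -> 1 + k 2) * shift (\k -> 1 + k 3))
eight :: Int
eight = evalCont $ reset $ do
  x <- shift (\k -> pure (1 + k 2))
  y <- shift (\k -> pure (1 + k 3))
  pure (x * y)

-- three == 3, eight == 8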

Versatile Event Correlation with Algebraic Effects

This paper presents the design of a language supporting “versatile joins”, that is the many ways we would like to correlate different sources of (asynchronous) events: “when you receive a ‘stock price update’ event and a ‘user buy event’ for the same quantity and a ‘market is opened’ event then emit a ‘stock buy’ event”. Conceptually this means making a cartesian product of all the event sources and restricting the resulting tuples to the ones that are “interesting”. Said another way, this can be interpreted as having a way to enumerate this cartesian product, filter out some tuples, combine them in some way and yield the result. It turns out that all those actions can be implemented as effects in a language supporting algebraic effects and effect handlers.

For example selecting the last event for a given event stream corresponds to the handling of a push effect for that stream, and joining several streams corresponds to the interpretation of several push 1, push 2, ..., push n effects. This provides a very modular way to describe combinations like “with the latest occurrence of stock and price for a given article emit an availability event”.

The paper shows that with a limited set of effects, like push but also trigger to trigger the materialization of a tuple or get/set to temporarily store events, we can reproduce most of the behaviours exposed by so-called “complex event processing” (CEP) libraries: windowing, linear/affine consumption, zipping, and so on.

The simple essence of automatic differentiation

One of the distinguished papers of the conference, classic Conal Elliott work, a work of art. The explosion of machine learning methods using neural networks brought back the need to have efficient ways to compute the derivative of functions. When you try to fit the parameters of a neural network to match observed data you typically use “gradient descent” methods which require computing the partial derivatives of functions with thousands of input variables and generally just one output!

The basic idea behind “automatic differentiation” is to build functions together with their associated derivative functions. Since many functions are not differentiable, what we can do is build the ones that are! You start by creating simple functions for which you know the derivative and use some rules for creating more complex functions from the simple ones, calculating the corresponding derivative as you go, using well-known differentiation rules, for example the “chain rule” for the derivative of the composition of 2 functions. In practice the derivative of a function can be built out of a few operations: function composition, parallel product, cartesian product, and linear functions (like +).
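
To illustrate “building functions together with their derivatives”, here is a toy forward-mode sketch using dual numbers; it is not the categorical construction of the paper, just the simplest possible version of the idea (the names are mine):

-- a value paired with its derivative
data Dual = Dual { val :: Double, deriv :: Double } deriving Show

instance Num Dual where
  Dual x dx + Dual y dy = Dual (x + y) (dx + dy)
  Dual x dx * Dual y dy = Dual (x * y) (dx * y + x * dy)  -- product rule
  negate (Dual x dx)    = Dual (negate x) (negate dx)
  fromInteger n         = Dual (fromInteger n) 0          -- constants have derivative 0
  abs                   = error "not differentiable"
  signum                = error "not differentiable"

-- differentiate a function at a point by feeding it the seed Dual x 1
diff :: (Dual -> Dual) -> Double -> Double
diff f x = deriv (f (Dual x 1))

-- diff (\x -> x * x + 3 * x) 2 == 7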

Actually those operations are pretty universal. If we abstract a bit by using category theory concepts we can define the derivative of a function in terms of operations from a category, a “monoidal” category, a “cartesian” category. Then by varying the category, taking other categories than Haskell functions for example, we can derive very useful programs. This is not new and was presented in “compiling to categories”, there is even a Haskell plugin supporting this construction automatically!

The paper builds on this idea for automatic differentiation and shows that using the “continuation of a category”, or the “dual of a category”, or taking matrices as a category, we get straightforward implementations of the differentiation of functions. In particular we get a simple implementation of the “reverse mode” of differentiation which does not mutate state like traditional algorithms do, and which can hence be easily parallelized.

ICFP Day 2

Competitive Parallelism: Getting Your Priorities Right

This is yet another “let’s create a language to solve a specific problem” paper, but an important problem: the definition of priorities in a concurrent program. What we typically want is to be able to specify a partial order defining which thread “has a higher priority”, make sure that high-priority threads don’t depend on low-priority ones (the “priority inversion” problem), get an efficient way to schedule those threads on processors and get some bounds on the total computation time/latency of a high priority thread for a given program.

The authors have defined and implemented a language called “PriML” extending Standard ML with some new primitives (priority, order), a modal type system and a scheduler to support all these objectives. I wonder how we could design languages and compilers so that anyone can benefit from those features sooner rather than later, but this seems to provide a good solution to the priority inversion problem that jeopardized the Mars Pathfinder mission.

Fault Tolerant Functional Reactive Programming (Functional Pearl)

Ivan Perez and his team have already shown how to use a “Monadic Stream” representation to implement all the operators of various FRP (Functional Reactive Programming) libraries in Haskell. FRP can be used to represent the various components of a Mars rover for example, where various sensors and processors control the behaviour of the rover.

Since this “monadic streams” representation allows you to change the “base” monad for streaming values they now use a variant of the Writer monad to represent fault information in computations: “what is the probability that the value returned by a sensor is incorrect?”, “if it is incorrect, what is the likely cause for the failure?”. Then combining different reactive values will accumulate failure probabilities. However you can also introduce redundant components which will reduce the failure rate! Their library also tracks the failure causes at the type level so that you can have ways to handle a failure and the compiler can check that you have handled all possible failures for a given system.

Report on ICFP and Climate Change

Not a technical paper but a report on what SIGPLAN plans to do to reduce carbon emissions. Benjamin Pierce presented the rationale for doing something and announced that next year ICFP 2019 might become carbon neutral by raising the price of tickets to buy carbon offsets. I personally bought my own carbon offset for my trip to ICFP this year and I hope this will inspire other people to do the same, we just don’t have much time left to act on climate change.

What You Needa Know about Yoneda: Profunctor Optics and the Yoneda Lemma (Functional Pearl)

What is category theory good for again? Well some results from category theory show up in many contexts and give us practical ways to transform some problems into others that are easier to deal with. The Yoneda lemma is one such result. It kind of says that the result of “transforming” a value a, noted f a, is entirely determined by how all the objects that have a relation with a are being transformed: forall x . (a -> x) -> f x.

Using Haskell and assuming that f is a functor the equivalence is pretty obvious. If I have forall x . (a -> x) -> f x I can just set x to be a to get an f a. In the other direction, if I have an f a I also have a forall x . (a -> x) -> f x, that’s the fmap operation! In retrospect that’s kind of meh, what’s the big deal? Well we can prove tons of stuff with this lemma. For example monads have a bind operation: bind :: m a -> (a -> m b) -> m b. By using the Yoneda lemma we can prove that this is totally equivalent to having a join operation: join :: m (m a) -> m a!
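
For reference, here are the two directions of the lemma written out in Haskell (this is essentially Data.Functor.Yoneda from the kan-extensions package):

{-# LANGUAGE RankNTypes #-}

newtype Yoneda f a = Yoneda { runYoneda :: forall x. (a -> x) -> f x }

-- “if I have an f a I also have a forall x . (a -> x) -> f x”: that’s fmap
toYoneda :: Functor f => f a -> Yoneda f a
toYoneda fa = Yoneda (\g -> fmap g fa)

-- “just set x to be a”: apply the polymorphic function to id
fromYoneda :: Yoneda f a -> f a
fromYoneda y = runYoneda y id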

Another application of this lemma and the meat of this paper is the derivation of profunctor optics, as found in the Haskell lens library, from the “classical” representation as a getter and setter. Why is that even useful? It is useful because the profunctor formulation uses a simple function to support lens operations. This means that to compose 2 lenses you can just use function composition. The Yoneda lemma helped us go from one representation to a more useful one!

Generic Deriving of Generic Traversals

This is a new Haskell library which gives us lenses for any datatype deriving Generic. One application is to solve the “record problem” in Haskell where there can be clashes when 2 different records have the same name for one of their fields. In that case person ^. field @"name" will return the name of a Person and dog ^. field @"name" will work similarly. But the library can do much more. It allows you to select/update elements in a data structure by type, position, constraint (all the elements of the Num typeclass for example) and even by structure! So you can focus on all the fields of a Scrollbar that make it a Widget and apply a function to modify the relevant Widget fields. This will return a Scrollbar with modified fields.
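
A small usage sketch, assuming the generic-lens package (Data.Generics.Product) together with lens:

{-# LANGUAGE DataKinds, DeriveGeneric, DuplicateRecordFields, TypeApplications #-}

import Control.Lens ((^.), (&), (.~))
import Data.Generics.Product (field)
import GHC.Generics (Generic)

-- two records sharing the field name "name"
data Person = Person { name :: String, age :: Int } deriving (Show, Generic)
data Dog    = Dog    { name :: String }             deriving (Show, Generic)

example :: (String, Dog)
example =
  ( Person "Alice" 30 ^. field @"name"      -- read the Person name
  , Dog "Rex" & field @"name" .~ "Medor"    -- update the Dog name
  )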

ICFP Day 3

Partially-Static Data as Free Extension of Algebras

Some programs can use “multi-stage” programming to create much faster versions of themselves when we know some of the parameters. For example a power function for a given power, say 4, can be specialized into x * x * x * x which requires no conditionals, no recursion. However this misses another optimisation. Once we have computed the first x * x we could reuse it to compute the final result: let y = x * x in y * y.
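
Written as plain Haskell for illustration (this is just the classic staged-power example spelled out, not code from the paper):

-- the generic function
power :: Int -> Int -> Int
power 0 _ = 1
power n x = x * power (n - 1) x

-- specialising n = 4 removes the conditionals and the recursion
power4 :: Int -> Int
power4 x = x * x * x * x

-- the extra optimisation the paper captures: share the intermediate square
power4' :: Int -> Int
power4' x = let y = x * x in y * y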

This paper shows how to use the properties of some algebras to optimize code generated by staged programs. This is a nice improvement and general framework for a domain that is not mainstream yet (staged programming) but we know that this is probably the future for writing programs which are both nicely abstract and very efficient.

Relational Algebra by Way of Adjunctions

Wonderful presentation by Jeremy Gibbons and Nicolas Wu where they decide to present their paper as a conversation around a cup of coffee, with some equations and diagrams on napkins. The essence of their paper is to present operations and optimisations on SQL tables (the “relational algebra”) as given by adjunctions. Adjunctions are a tool from category theory showing how some “problems” in a given domain can be translated and solved in another domain, then the solution can be “brought back” to the original domain (this is all of course a lot more formal than what I just wrote :-)).

Something to note: in their description of table joins they need to use a “graded” monad, where each bind operation aggregates some “tags”, here the keys to the elements of the tables. This is way over my head but circling around the subject over and over I might understand everything one day :-).

Strict and Lazy Semantics for Effects: Layering Monads and Comonads

If you consider throwing exceptions as a monadic effect and laziness (consuming a value or not) as a comonadic effect then you can transform expressions having those effects into a pure language having monads and comonads. Unfortunately they don’t always compose, unless they are “distributive”, that is if there’s a way to permute the monad m and the comonad c. The paper proposes to “force” one or the other interpretation to solve this dilemma. Either have a “monadic priority”, which gives us a “strict semantics” for those expressions having both effects, or use a “comonadic priority”, which gives us a “lazy semantics” for those expressions.

The cool theorem is that if the monad and comonad distribute then both choices give us the same result. The corollary is that if they don’t distribute then choosing a lazy or a strict semantics always gives us different results!

What’s the Difference? A Functional Pearl on Subtracting Bijections

This is more of a mathematical puzzle than something that’s useful for day to day programming. When trying to evaluate some number of combinations (“how many ways can 3 people of a group have the same birthday?”), it can be useful to “remove” parts of some sets that we can count because they are in bijection with other well known sets (like the set of all the natural numbers). This is also a tricky operation.

The paper proves that there is an algorithm for subtracting bijections which makes sense and which always terminates. I still have the naive feeling that the algorithm they presented could be simplified but I’m probably very wrong.

Ready, Set, Verify! Applying hs-to-coq to Real-World Haskell Code (Experience Report)

Good news: the Haskell containers library is bug free! How was that proven? By taking the Haskell code and translating it to Coq code, using typeclass laws and quickcheck properties from the test suite as a specification. This tool, hs-to-coq, is now used in other contexts to write some specifications in Haskell and prove them with Coq. It cannot yet be used to prove concurrency properties but the team is thinking about it.

Haskell Symposium Day 1

Neither Web nor Assembly

Indeed WebAssembly is designed to be a low-level VM, performant and secure. The design of WebAssembly is supported by a formal specification and a mechanical proof (in Isabelle; Coq and K versions are underway). It is entirely standardised and supported, a first, by all major browser vendors. So much so that the proposal process mandates that any proposal must be implemented by at least 2 major implementations.

So far mostly C/C++ and Rust are supported because some higher-level language features required by other languages are still not available, like tail calls (kind of important for functional languages :-)), exception handling or garbage collection. A Haskell compiler (Asterius) is underway at Tweag, and the vision is to be able to replace GHCJS on the mobile client wallets which execute code for the Cardano blockchain.

AutoBench

This is a neat tool coupling QuickCheck and Criterion to benchmark different functions and see which ones have the best runtime behaviour. Some limitations though: benchmarks are only performed across one notion of “size” whereas general functions might depend on various parameters, and this is limited to pure functions for now.

Improving Typeclass Relations by Being Open

A great idea. Introduce a typeclass “morphism” to automatically create instances. For example if you have a typeclass for a partial order you would like to automatically get an instance for it every time you have an Ord a instance for any type a. This new declaration class morphism Ord a => PartialOrd a would have solved many compatibility issues with the introduction of the infamous Functor / Applicative / Monad modification in Haskell.

Rhine: FRP with Type-Level Clocks

This is an addition to the “monadic streaming” framework introduced by Ivan Perez and his team. This provides some clock abstraction and synchronization mechanisms to make sure that different event sources emitting events at different rates can properly work together. It also provides a conceptual unification of “events” and “behaviours” from FRP: “events are clocks and behaviours are clock-polymorphic”.

Ghosts of departed proofs (Functional Pearl)

“Existential types” are a great way to hide information in an API. Indeed a function can give you a value of a given type without you being able to do anything with that value except returning it to the API, proving that you are really returning something you have been given, not something that you have fabricated. But you don’t necessarily need a value to enforce constraints on your API: you can use a “phantom type” to tag a computation with a specific type, for which there is no corresponding value.

One cool thing you can do is encode proofs with these phantom types. For example you can embed the proof that a list has been sorted. This is all free at runtime because those proofs only exist at compile time. Even better you can create a type representing proofs, with all the classical logical combinators and, or, implies,… and associate them with values. This is all implemented in the gdp library.
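
A very rough sketch of the “embed the proof that a list has been sorted” idea (this is not the gdp API, just my own minimal illustration of a phantom proof tag):

import Data.List (sort)

-- an empty type used purely as a phantom tag: no value ever has this type
data SortedProof

-- the constructor is meant to stay un-exported, so the only way
-- to obtain the tag is through `sorted`
newtype Tagged proof a = Tagged a

sorted :: Ord a => [a] -> Tagged SortedProof [a]
sorted = Tagged . sort

-- functions can now require the proof in their type; the tag is erased at runtime
minimumSorted :: Tagged SortedProof [a] -> Maybe a
minimumSorted (Tagged [])      = Nothing
minimumSorted (Tagged (x : _)) = Just x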

Haskell Symposium Day 2

Deriving Via: or, How to Turn Hand-Written Instances into an Anti-pattern

DerivingVia is a way to reduce boilerplate which just landed in GHC 8.6.1. With DerivingVia you can declare trivial instances for data types by reusing known instances for other types. For example you can declare a Monoid instance for the datatype newtype Pair = Pair (Int, Int) with deriving (Semigroup, Monoid) via (Sum Int, Product Int). The generated instance will add the first element of the pair and multiply the second one.
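
A minimal sketch of this example (I wrap a tuple so that the required coercion to (Sum Int, Product Int) exists):

{-# LANGUAGE DerivingVia #-}

import Data.Monoid (Product (..), Sum (..))

newtype Pair = Pair (Int, Int)
  deriving (Semigroup, Monoid) via (Sum Int, Product Int)

-- Pair (1, 2) <> Pair (3, 4) is Pair (4, 8), and mempty is Pair (0, 1)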

Type Variables in Patterns

This is the complement to type applications in expressions: type applications in patterns. Being able to “extract” types from a pattern allows us to properly name a given type and use it to type another expression. For example it will be possible to bind the existential type a in data MyGadt b where Mk :: (Show a) => a -> MyGadt Int in a pattern match by writing

case expr of
  Mk @a x ->
    -- the type `a` has been bound by the pattern match
    let print :: a -> Text
        print = T.pack . show
    in  print x

The implementation should start before the end of the year.

The Thoralf Plugin: For Your Fancy Type Needs

This Haskell plugin uses an SMT solver to solve some of the obvious type constraints which GHC is not able to solve by itself.

Suggesting Valid Hole Fits for Typed-Holes (Experience Report)

Another useful tool for programming in Haskell. In Haskell we can use “typed holes”, which give the type that is expected in a given place and also some values which could fill that hole. Unfortunately typed holes can be pretty useless with some expressions, using lenses for example. Now the suggestions for what can be put in the hole have been vastly improved by looking at all the functions in scope which could create a value for the hole. For example _ :: [Int] -> Int will suggest Prelude.head and Prelude.last. But we can go further and ask for some “refinement”: which function, having itself a hole, could be used to fill in the hole? In the example above foldr could also be used, provided foldr is given the right parameters.

And… this is all available in GHC 8.6.1 already and even able to use the new doc feature adding documentation to the suggestions.

A Promise Checked Is a Promise Kept: Inspection Testing

When the documentation of a library like generic-lens says “the generated code is performant because it is the same as manually written code”, can you really trust it? It turns out that version 0.4.0.0 of that library was breaking that promise and it was fixed in 0.4.0.1.

Fortunately, inspection-testing is a GHC plugin to save the day. With that plugin you can instruct GHC to check the generated GHC Core code: inspect $('genericAgeLens === 'manualAgeLens). You can also test if you have unwanted allocations in your code. For example the Data.Text package has many functions which are supposed to be “subject to fusion” but inspection testing proves that they are not.
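
For reference, the basic usage looks like this (a sketch close to the example in inspection-testing’s own documentation):

{-# LANGUAGE TemplateHaskell #-}
{-# OPTIONS_GHC -O -fplugin Test.Inspection.Plugin #-}
module InspectionSketch where

import Data.Maybe (isNothing)
import Test.Inspection

lhs, rhs :: (a -> b) -> Maybe a -> Bool
lhs f x = isNothing (fmap f x)
rhs _ Nothing  = True
rhs _ (Just _) = False

-- fails compilation if the two definitions do not produce the same GHC Core
inspect $('lhs === 'rhs)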

Branching Processes for QuickCheck Generators

Dragen is a library generating QuickCheck Arbitrary instances for recursive data structures. The cool thing is that you can specify the “depth” and the distribution of all constructors in the generated values. So you can ask for trees of depth 5 with a proportion of Int nodes which is twice the number of Text nodes.
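
For comparison, here is the kind of generator you would write by hand and that Dragen aims to derive for you (a sketch with my own Tree type): the recursion depth is bounded and the constructor mix is tuned with frequency.

import Test.QuickCheck

data Tree = Leaf Int | Node Tree Tree deriving Show

-- bounded depth, Node picked twice as often as Leaf until the depth runs out
genTree :: Int -> Gen Tree
genTree 0     = Leaf <$> arbitrary
genTree depth = frequency
  [ (1, Leaf <$> arbitrary)
  , (2, Node <$> genTree (depth - 1) <*> genTree (depth - 1))
  ]

instance Arbitrary Tree where
  arbitrary = sized (genTree . min 5)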

Coherent Explicit Dictionary Application for Haskell

Finally a way to avoid the “one instance of a typeclass per type only!” limitation. This implemented proposal allows us to materialize a typeclass instance as a “Dictionary” and then do an explicit “dictionary application” to specify which dictionary you want to apply. For example you can write nubCI = nub @{ eqOn (map toLower) } where eqOn produces a dictionary.
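
For contrast, this is what we write today without dictionary application, falling back to the *By variant and passing the comparison as a plain function (the nubCI name just mirrors the talk’s example):

import Data.Char (toLower)
import Data.Function (on)
import Data.List (nubBy)

nubCI :: [String] -> [String]
nubCI = nubBy ((==) `on` map toLower)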

The main concern with this type of proposal is that we could break coherence, which is only guaranteed when there is a single instance per type. Set operations for example cannot be used with 2 different Ord dictionaries. This can actually be checked by looking at the “role” of the type variable a in Set a. It is “nominal”, whereas this proposal only works with “representational” roles. Otherwise, if you try to do a dictionary application where there is any possibility of ending up with 2 different instances for the same typeclass, you get a warning.

Theorem Proving for All: Equational Reasoning in Liquid Haskell (Functional Pearl)

Such a brilliant idea: since theorems are types and proofs are code, let’s write the theorems as Liquid Haskell types and the proofs as Haskell code! Then LiquidHaskell is able to check if your proof of reverse (reverse xs) == xs is correct. You will not get all the support you can get from a general theorem prover environment but I think this is a neat teaching tool at least. And the proofs disappear at runtime, nice.