Data type encodings

In the last post I incorrectly said that the case eliminator for the List type was fold. Of course it isn’t, the type of fold(r) is

b -> (a -> b -> b) -> List a -> b

whereas the case eliminator for List is

b -> (a -> List a -> b) -> List a -> b

The difference is in the second argument. In fold we have already transformed the tail of the list and are given the transformed value b. In the eliminator we are given the list tail directly. In a case expression, we’re just unwrapping one level, so this second form is what we want. When we case match on a list, we have two cases: - the empty list, where we have to produce a value of type b - the head of the list (type a) and the remaining list (List a), from which we have to produce a value of type b.

But you could ask, what if case didn’t do one-level unwrapping but did a full recursive transformation? What if it worked like fold on lists?

someList : List a
case someList {
  [] -> // we have to return something of type b
  x r -> // x : a, r : b, we have to return something of type b
}

In this world, when we case match on a list we have again two cases: - the empty list, as before, where we produce a b - the head of the list (type a), along with a value of type b. This value is the result running our case match on the tail of the list, which has already happened! So operationally a “fold case match” like this behaves very differently: it does a recursive transformation on the structure instead of a single unwrapping.

It turns out that these two styles (“normal” eliminator and “fold” eliminator) are two data type encodings that already have names. The normal eliminator is called the Scott Encoding. The fold eliminator is called Church Encoding.

Here are church encodings of some common data types:

type Bool { True, False }
a -> a -> a

type List a { Nil, Cons a (List a) }
b -> (a -> b -> b) -> b

type Option a { None, Some a }
b -> (a -> b) -> b

type Foo { Bar Bool }
(Bool -> a) -> a

For each type I’ve written the tl syntax for the type, then the encoding. The encoding is, in each case, a function which is observably identical to the type. Any data you could store in the type can equivalently be stored using the function.

For example the value Some 5 can be encoded as _ f -> f 5. Here we ignore the default value passed if we are None, and instead call the function for when we are Some, passing the inner value 5.

The list [3, 4] can be encoded as l1 = z f -> f 3 (f 4 z). This is a function that takes two arguments which correspond to the two branches of the fold case we saw earlier. z takes no arguments and has to produce a value of type b (or whatever the result type is). f takes two arguments: the head of the list (3) and the result of applying f to the tail of the list.

To use this list value, because it’s a function we just use it directly. Let’s say we want to add 2 to the front of it:

l2 = z f -> f 2 (l1 z f)

Or if we want to sum the numbers in the list:

l1 0 (x r -> + x r)

Church encoding embodies the principle that something is what it does. This style seems similar to a “final encoding”, which also expresses a data type in terms of what you can do with it but I’m not sure if these two things are the same or just similar.

If we designed “Church style” eliminators for tl, they’d look like this:

Bool
a -> a -> Bool -> a

List
b -> (a -> b -> b) -> List a -> b

We could have a syntactic case form for this, called cata, which does a recursive case match on its argument. For example, a function that sums a list could be written like this:

sum {
  l -> cata l {
    [] -> 0,
    x r -> + x r
  }
}

A more interesting example is folding a tree

type Tree a {
  L a,
  B (Tree a) (Tree a)
}

t : Tree Int { … }

t_sum : Int {
  cata t {
    L x -> x,
    B l r -> l + r
  }
}

// tree depth
cata t {
  L _ -> 1,
  B l r -> + 1 (max l r)
}

cata is useful anywhere you’re doing a lot of folding or recursive transformations. We can use it to evaluate a small language:

type E {
  I Int,
  Add E E,
  Sub E E,
  Var Nat,
  Let E E
}

at : Nat -> List a -> a {
  // look up elem at index
}

eval : E -> Int {
  e -> let f = {
    // recursively fold over the expression,
    // producing a function from environment 'c'
    // to an integer.
    cata e {
	  // I is just an int
      I n -> (_ -> n)
      // Add and Sub evaluate their arguments and then +/- them
      Add x y -> (c -> + (x c) (y c)),
      Sub x y -> (c -> - (x c) (y c)),
      // Var looks up v in the environment
      Var v -> (c -> at c v)
      // Let evaluates e1 and then adds it to the environment
      Let e1 e2 -> (c -> e2 [e1 c, ..c])
  } {
  // c starts as an empty list
  f []
  }
}

So this construct can definitely be nice for some types of program.

Beyond Scott and Church, there are other encodings. For example the Parigot encoding gives you the transformed value but also the original untransformed value. The Parigot encoding for a list would be something like

b -> (a -> List a -> b -> b) -> b

In the recursive case you get the head of the list a, the tail of the list List a, and also the result from transforming the tail of the list b. This allows you to “remember” what the value was before transformation, which I guess can be useful for some algorithms. I haven’t found an obvious use for it yet.

I thought for a while that it might be interesting to be able to swap out your eliminator in the language for a different one, or maybe define it per type. But I didn’t really get anywhere with that train of thought. I definitely think there’s something valuable about automatically deriving the correct fold for your data types, and making that easy to use. That’s all I’ve got for now though.