April 26, 2018

Deprecating ‘data'

Summary: Generic programming should be based on representational equality. We can obtain this by reformulating “data” in terms of “newtype”: manually or (tentatively) with compiler support

Background #

In modern Haskell we often seek to derive instances, meaning to have the compiler generate them for us, as opposed to writing them by hand. Deriving instances is not just more convenient but also safer and encourages deduplication and code reuse.

Currently GHC provides three different strategies for deriving instances: stock, anyclass and newtype. The stock strategy is the original approach to deriving in Haskell and is “magic”: it is hardcoded in the compiler and only works on a predefined set of classes, such as Eq, Show, Typeable, Functor and so on.

The anyclass strategy is just syntactic sugar for writing empty instance declarations and we will not be focusing on it here. The most flexible approach is newtype, which allow lifting instances through one layer of representational equality.

As a motivating example, let’s consider the following implementation of an invertible map.

data Element a = Element a a
  deriving (Show, Functor, Applicative, Monad)

newtype InvMap a = InvMap [Element a]
  deriving (Show, Functor, Applicative, Alternative)

The above code won’t currently compile. We can modify it to derive just the Functor instance using stock deriving (assuming -XDeriveFunctor is enabled), but the Applicative and instances have to be written by hand. However we can derive the instance using newtype deriving, if we refactor our code a bit:

data Element a = Element ((Identity `Product` Identity) a)
  deriving (Show1, Show, Functor, Applicative, Monad)

newtype InvMap a = MultiMap (([] `Compose` Element) a)
  deriving (Show, Functor, Applicative, Alternative)

Now GHC will happily derive all of the instances for us.

This is possible because newtype deriving works by exploiting representational equality, which allow us obtain instances from the underlying representation, if they exist. In this case the standard library defines the following instances.

instance Monad []
instance Monad Identity

instance (Monad f, Monad g) => Monad (f `Product` g)
instance (Applicative f, Applicative g) => Applicative (f `Product` g)
instance (Applicative f, Applicative g) => Applicative (f `Compose` g)

As the name implies, GeneralizedNewtypeDeriving work on newtypes. For types declared using data we are stuck with the “old-fashioned” approach of writing instances by hand. There is an easy way out however: we simply stop using data.

The observation here is that any data declarations for any Algebraic Data Type can be mechanically translated into a newtype, wrapping a type-level expression define using a closed set of type constructors.

The type constructors we want to use operate on objects in the functor category (Type -> Type). We can find implementations of most of these in base package under Data.Functor. We will present them here in a slightly simplified syntax:

data (f . g) a      = Compose (f (g a))
data (f * g) a      = Pair (f a) (g a)
data (f + g) a      = InL (f a) | InLR (g a)
newtype Identity a  = Identity a
newtype Const a b   = Const a
newtype Fix  f      = Fix (f (Fix f))
data ()             = ()

The translation rules are simple: we replace multiple cases with (possibly nested) applications of +, multiple constructors with (possibly nested) applications of *, and concrete types with Const b. We use () for nullary constructors, and for application of unused variables.

For example we can rewrite the following data declarations like so.

data Bool = False | True
-- >
newtype Bool = (Const () + Const ()) ()

data Maybe a = Nothing | Just a
  deriving (Functor, Traversable)
-- >
newtype Maybe = (Const ()) + Identity
  deriving (Functor, Traversable)

data Pair a b = Pair a b
  deriving (Functor, Traversable)
deriving Monoid a => Applicative (Pair a)
deriving Monoid a => Monad (Pair a)
-- >
newtype Pair a b = Const a * Identity
  deriving (Functor, Traversable)
deriving Monoid a => Applicative (Pair a)
deriving Monoid a => Monad (Pair a)

(We are omitting newtype constructors for readability. Assume there is always a newtype constructor with the same name as the type).

For nested applications of constructors we use . (or Compose).

data Map k v = Map [(k, v)]
-- >
newtype Map k = [] . (Const k * Identity)

We can also add arbitrary types to the mix, such as basic value types such as Int and Double, as well as function types.

data TotalMap k a = TotalMap a (k -> a)
  deriving (Monad)
-- >
newtype TotalMap k a = TotalMap ((Identity * (k -> )) a)
  deriving (Monad)

Recursive types #

For recursive types, the choice of encoding is not as straightforward. The simplest encoding would be to use a single
fixpoint type such as this one from the data-fix package:

data Fix f = Fix (f (Fix f))

Unfortunately this does provide many of the instances we want. For example consider:

data Tree = Leaf a | Branch [Tree a]
-- >
newtype TreeF a r = (Const a) + (Compose [])
newtype Tree a = Fix (TreeF a)

This won’t give us the expected instances (no Monad, not even Functor). On the other hand using Free (from free):

data Tree = Leaf a | Branch [Tree a]
  deriving (Functor, Applicative, Monad)
-- >
newtype TreeF = []
newtype Tree a = Free [] a
  deriving (Functor, Applicative, Monad)

And similarly:

data NonEmpty a = Done a | Cons a (List a)
  deriving (Functor, Applicative, Monad)
-- >
newtype NonEmptyF a = Const a * Identity
  deriving (Functor)
newtype NonEmpty a = Free (NonEmptyF a) a
  deriving (Functor, Applicative, Monad)

In short, when dealing with recursive types we have to pick the recursive type explicitly.

Generic programming for Free #

Assuming algebraic data types are rewritten as described above, we can now provide instances for them generically by simply writing a small, closed set of instances. For example we can write a prism-based serialization library (we’ll omit the actual implementation):

class Binary a where
  binary :: Prism' [Word8] a

class Binary1 f where
  binary1 :: Binary a => Prism' [Word8] (f a)

instance (Binary1 f, Binary1 g) => Binary1 (a + b)
instance (Binary1 f, Binary1 g) => Binary1 (a * b)
instance (Binary1 f, Binary1 g) => Binary1 (a . b)

instance (Binary1 f, Binary1 g, Binary a) => Binary (a + b) a
instance (Binary1 f, Binary1 g, Binary a) => Binary (a * b) a
instance (Binary1 f, Binary1 g, Binary a) => Binary (a . b) a

instance Binary a => Binary (Identity a)
instance Binary a => Binary1 (Const a)

instance Binary a => Binary (Const a b)
instance Binary1 f => Binary (Fix f)

instance Binary1 Identity
instance Binary ()

This will allow us to derive serialization for an arbitrary type.

newtype IntMap a = IntMap [(Int, a)]
  deriving (Eq, Ord, Functor, Traversable, Binary)

This approach completely subsumes the traditional stock and anyclass mechanisms. With a proper set of instances for the base types, we should be able to reimplement all current stock deriving of classes such as Eq, Show, Foldable etc in Haskell itself.

Fixing syntax #

This manual rewriting strategy solves the problem of instance deriving, but at the cost of readabilty and syntactic sugar such as record selectors and pattern matching. The latter be recovered by defining a set of bidirectional pattern synonyms. Like the type representations, these can be infered from original definition in a mechanical fashion. For example:

pattern (Leaf a)    = Tree (Fix (InL (TreeF (Const a))))
pattern (Branch ts) = Tree (Fix (TreeF (InR (Compose xs))))

count :: Tree a -> Integer
count x = case x of
  Leaf _ -> 1
  Branch xs -> sum $ fmap count xs

n = count $ Branch [Leaf 1, Branch [Leaf 2], Branch []]

We have now recovered pattern matching and data constructors application. How about record syntax? We could of course manually write out function definitions in term of pattern matching.

Unlike type and constructor translation, record desugarding this involves several specific decisions. E.g. for the following type should we generate getters, setters, or some lens type?

data Pair a b = Pair { fst :: a, snd :: b }
-- >
newtype Pair a b = ...
fst :: Pair a b -> ??

Similarly, how can we avoid generating partial functions here? Should we use Maybe (arbitrarily chosen), or some prism type?

data List a = Nil | Cons { head :: a, tail :: List a }
-- >
newtype List a = ...
head :: List a -> ??

An obvious answer is to make record syntax into a library as well. Given a few standard typeclasses we can reduce record initialization, updates and so on, into a predictable encoding provided by a library.

We could even use RebindableSyntax to allow user-defined overloading of this syntax, a la do-notation etc. For this to work we have to decorate the representation with constructor and selector names. This is easily accomplished by adding a phantom type parameter to Identity.

data TagType = Constructor | Selector

data Tag (t :: TagType) (s :: Symbol) :: k -> k

--- A la Data.Monoid.All
data All = All { getAll :: Bool }
-- >
newtype All = (Tag Constructor "All" . Tag Selector "getAll") Bool

Compiler support (data-as-newtype) #

We have now seen how the problem of deriving can be solved neatly and elegantly by rewriting all of our our ADTs to newtypes. As this process is entirely mechanical it seems reasonable to ask if the compiler could do it for us. We will term this approach data-as-newtype.

Let’s consider a hypothetical version of GHC that, given any suitable data declaration, rewrites it as a newtype using our predefined set of rules. This has a far-reaching effect: it unifies generics with representational equality. To understand what this means, let us take a deeper look at both concepts.

Generics #

The GHC.Generics module provides magic type class called Generic, which, given any user-defined type, gives you an automatically generated isomorhic type defined using type constructors analogous to the Data.Functor family. For example, given the declaration:

data FooBar = Foo | Bar
  deriving (Generic)

GHC will generate something similar to:

type FooBarRep = Tag "Foo" (Const ()) + Tag "Bar" (Const ())

instance Generic FooBar where
  type Rep FooBar = FooBarRep
  from Foo = InL (Tag (K ()))
  from Bar = InR (Tag (K ()))
  to InL _ = Foo
  to InR _ = Bar

The methods from and to provides an isomorhism between Rep a () and a, which allow us to lift arbitrary instances. For example we could lift Eq:

instance Eq FooBar where
  a == b = from Foo == from Bar

The major drawbacks of generics is that its conversions incur a run-time overhead, and that it doesn’t support deriving as a first class notion: instead we have to write instances to perform the conversions.

Representational equality #

The Data.Coerce module provides a magic typeclass called Coercible, which provides an isomorhism between any two representationally equal types.

Representational equality could be informally defined as nominal equality up to newtype wrappers. In other words, if you strip away all the newtypes from two types A and B and end up with the same underlying type, A and B are representationally equal.

Compared to from/to, coerce incurs no runtime overhead, and also integrates well with the deriving mechanism. It’s major drawback is that it does not work with data declarations. Adopting data-as-newtype would change this.

Reformulation of data-as-newtype #

The data-as-newtype approach can be neatly reformulated as follows:

Amend GHC generics such that:

coerce . to   = from . coerce
coerce . from = to   . coerce

Equivalently: for any type A, the generic representation of A is coercible to A. Because coerce = coerce . coerce, to and from emerge as special cases of coerce.

DerivingVia #

The proposed extension DerivingVia extends newtype deriving to work through arbitrary layers of newtypes. We can derive instances through any type that is representationally equal to the new type.

newtype Count = Count Int
  deriving Show
  deriving Monoid via (Sum Int)

Data-as-newtype could be seen as a logical extension of DerivingVia. Everything is a newtype, and the underlying type provides the default set of instances, with via allowing us to selectively override.

Limitations #

Like traditional generics, data-as-newtype does not work for GADTs. Consider the humble:

data Foo :: Type -> Type where
  Foo :: Foo Int
  Bar :: Foo ()

The closest we can obtain is something like this:

newtype Foo = Const Int + Const

which will generate a type of the correct kind Type -> Type, but unlike the GADT, it does not encode the restriction that Foo :: Type -> Type can only ever be applied to Int and ().

This limitation is inherent in data-as-newtype, but also appear in normal GHC Generics. Data-as-newtype should therefore be restricted to operate on normal ADTs only. Of course, the restriction only applies to true GADTs. The following example will work just fine (assuming ToJSON instances exist for the base type constructors).

data Maybe :: Type -> Type where
  Nothing :: Maybe a
  Just :: a -> Maybe a
  deriving (ToJSON)

Representation issues #

Non-type kinds #

With -XDataKinds enabled, data syntax can be used to declare members of kinds other than Type, referred to as promotion. The data-as-newtype approach only works on declarations of kind ... -> Type, so promoted types would effectively bypass it.

Using the generics/representational formulation this problem does not arise.

Strictness #

Data declaration with strictness annotations will require a separate translation, or some type-level dispatch. This complicates translations somewhat but is straightforward to use.

data Strictness = Strict | Lazy

data family Identity (s :: Strictness) :: Type -> Type
data family Product (s :: Strictness) :: (Type -> Type) -> (Type -> Type) -> Type -> Type

data instance Identity Strict a = Identity !a
data instance Identity Lazy a = Identity a

data instance Product Strict f g a = Product !(f a) !(g a)
data instance Product Lazy f g a = Product (f a) (g a)

data V2 a = V2 !a !a
-- >
newtype V2 = Product Strict (Const Strict a) (Identity Strict)

It’s worth noting that the standard Data.Functor hierarchy provide a somewhat ad-hoc set of choices here, e.g. its Identity is strict and its Product is lazy.

Multiple choice #

A common problem in generic programming is ambiguous choice of representation. For example, the following are all valid representations of data Pair a b = Pair a b:

newtype Pair a b = Const a * Const b
newtype Pair a   = Const a * Identity
newtype Pair a b = (Identity * Const b) a

As we’ve seen previously this problem is exacerbated with recursive types.

We could address it in two ways:

Specifying the translation rules unambiguously, effectively providing a normal form for our ADT translations
Assuring that the basic types conform no matter what representation is chosen. I’m not sure whether this is possible.

Bootstrapping the data hierarchy #

Adopting the data-as-newtype at the compiler level strategy means we run into bootstrappnig issues. What base types do we use? Can they be defined in terms of theselves? There is a couple of strategies we could adopt:

Pick a fixed set of base types and define all other ADTs in terms of them. This is the simplest solution but requires hardcoding certain types in the compiler.
Use a RebindableSyntax-like solution with an escape hatch. Whenever the compiler encounters a non-basic data declaration it looks up the primitives from the current scope. This may result in further transitive resolution of types, and so on, until a basic type is encountered.

Summary #

We have seen how data can be reformulated in terms of newtype assuming a suitable set of base types and rewriting rules. From this we obtained an elegant reformulation of generics based on representational equality. This empowers GeneralizedNewtypeDeriving and DerivingVia enough to subsume all other deriving strategies, including stock.

Implementing this at the compiler level is attractive On the other hand we can also obtain many of the benefits right now, by simple manual translation.

Furher reading #

Kudos