I'm working on a project right now where I'm dealing with the Prim typeclass and I need to ensure that a particular function I've written is specialized. That is, I need to make sure that when I call it, I get a specialized version of the function in which the Prim dictionaries get inlined into the specialized definition instead of being passed at runtime.

Fortunately, this is a pretty well-understood thing in GHC. You can just write:

{-# SPECIALIZE foo :: ByteArray Int -> Int #-}
foo :: Prim a => ByteArray a -> Int
foo = ...

And in my code, this approach is working fine. But, since typeclasses are open, there can be Prim instances that I don't know about yet when the library is being written. This brings me to the problem at hand. The GHC user guide's documentation of SPECIALIZE provides two ways to use it. The first is putting SPECIALIZE at the site of the definition, as I did in the example above. The second is putting the SPECIALIZE pragma in another module where the function is imported. For reference, the example the user manual provides is:

module Map( lookup, blah blah ) where
  lookup :: Ord key => [(key,a)] -> key -> Maybe a
  lookup = ...
  {-# INLINABLE lookup #-}

module Client where
  import Map( lookup )

  data T = T1 | T2 deriving( Eq, Ord )

  {-# SPECIALISE lookup :: [(T,a)] -> T -> Maybe a #-}
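Adapting that pattern to my situation, what I'm attempting looks roughly like this. (The module names, the newtype, and its Prim instance here are hypothetical, just to illustrate the shape; the real function in my library is much larger.)

    -- Library module: expose an unfolding so downstream modules can specialize.
    module BTree (foo) where

    import Data.Primitive (Prim, ByteArray)

    foo :: Prim a => ByteArray a -> Int
    foo = ...
    {-# INLINABLE foo #-}

    -- Downstream module: defines a Prim instance the library never saw,
    -- and asks GHC to specialize the imported function for it.
    {-# LANGUAGE DerivingStrategies, GeneralizedNewtypeDeriving #-}
    module Client where

    import BTree (foo)
    import Data.Primitive (Prim, ByteArray)

    newtype Price = Price Int
      deriving newtype (Prim)

    {-# SPECIALISE foo :: ByteArray Price -> Int #-}

Since Price and its Prim instance don't exist when the library is compiled, the call-site SPECIALISE in Client is the only place this specialization can be requested.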

The problem I'm having is that this is not working in my code. The project is on github, and the relevant lines are:

To run the benchmark, run these commands:

git submodule init && git submodule update
cabal new-build bench && ./dist-newstyle/build/btree-0.1.0.0/build/bench/bench

When I run the benchmark as is, there is a part of the output that reads:

Off-heap tree, Amount of time taken to build: 0.293197796

If I uncomment line 151 of BTree.Compact, that part of the benchmark runs about five times faster:

Off-heap tree, Amount of time taken to build: 5.626834e-2

It's worth pointing out that the function in question, modifyWithM, is enormous. Its implementation is over 100 lines long, but I do not think this should make a difference. The docs claim:

... mark the definition of f as INLINABLE, so that GHC guarantees to expose an unfolding regardless of how big it is.

So, my understanding is that, if specializing at the definition site works, it should always be possible to instead specialize at the call site. I would appreciate any insights from people who understand this machinery better than I do, and I'm happy to provide more information if something is unclear. Thanks.

EDIT: I've realized that in the git commit I linked to in this post, there is a problem with the benchmark code. It repeatedly inserts the same value. However, even after fixing this, the specialization problem is still happening.