Hackage GSoC: Feature graphs and URI trees

Hello Haskellers! I’ve made lengthy strides in the internal structure of the new Hackage server. Amidst implementing features, I’ve also implemented a reasonably solid top-level organization for them. I’ll describe some of the technical details of the structure here.

Feature graphs

Each feature has a listing of the URIs it provides, the user groups it needs to authenticate, and the data it needs to store with methods to back up and restore that data. A feature might also define caches for its pages, IO hooks to execute on certain events (like uploading a package), and pretty much anything else: features are arbitary datatypes that implement a HackageFeature typeclass. If feature A depends on feature B, then feature A can extend B’s URIs with new formats and HTTP methods, use B’s data and user groups, and register for any of B’s hooks.

The barebones features are:

core: the central functionality set for something to reasonably be called a Hackage server. This serves tarballs, cabal files, and basic listings. The data it maintains are the user database and a map from PackageName to [PkgInfo] (see previous post). It is possible to create a core-only server with an archive.tar, but it’s effectively immutable after initialization.

to (see previous post). It is possible to create a core-only server with an archive.tar, but it’s effectively immutable after initialization. mirror: this allows tarballs to be uploaded directly by special clients, and it is intended for use by secondary Hackages (if any) which need to stay up to date without having to support a userbase. This doesn’t use its own data, instead manipulating the core’s.

Now, take a look at the packages, upload, check, users, distros, and build features. Some of then depend on each other. They all depend on core. html depends on all of them. One way to look at the organization is that they provide the model and controller for data and html provides a view. They are interfaces which provide their own data in a way which html/json/xml/yaml/whichever other features can render in their particular format with a minimal amount of effort.

For example, the packages feature doesn’t define any of its own URIs, but has a function, PackageId -> IO (Maybe PackageRender) , which the HTML package page calls. The PackageRender type is essentially the One True Resource Representation of a package, and it looks like this:

data PackageRender = PackageRender { -- using the most recently uploaded package as of now rendPkgId :: PackageIdentifier, -- Vec-0.9.8 rendAllVersions :: [Version], -- [0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.4, 0.9.5, 0.9.6, 0.9.7, 0.9.8] rendDepends :: [[Dependency]], -- [[array, base (≤5), ghc-prim, QuickCheck (2.*)]] rendExecNames :: [String], -- [] (no executables) rendLicenseName :: String, -- BSD 3 rendMaintainer :: Maybe String, -- Just "Scott" rendCategory :: [String], -- ["Data", "Math"] rendRepoHeads :: [(RepoType, String, SourceRepo)], -- [] (no repository) rendModules :: Maybe ModuleForest, -- Just a tree containing Data.Vec.* rendHasTarball :: Bool, -- True rendUploadInfo :: (UTCTime, String), -- (Jun 17 2010, "Scott") rendOther :: PackageDescription -- the package description }

From this, the html feature can make a package page that looks like the current one, where 95% of its work is HTML formatting via Text.XHtml.Strict. A json feature could use the same information to make a data-rich nest of curly braces.

Now, a paragraph or two about a failed approach. I also considered having each feature provide its own HTML. This is perhaps the simplest approach on the face of it. However, it gets tricky to, say, define a package page and then later append to it for newer features. I considered HTML hooks where a feature could provide an interface to anyone who wants to inject Html blocks into its pages. For example, a build reports feature would have to register for a hook so that the main package page can link to the reports page.

This has several disadvantages, the most prominent of which is that it makes it cumbersome to switch to a different HTML-generating library or add new formats. I just accepted that HTML was an exceptionally unmodular thing. Instead, what the HTML feature now does is depend on both the build reports feature and the packages feature, and this also allows free-form HTML instead of copy+paste amalgamations, which I’ve heard can be rather ugly. The metric to go by here is “out of all modifications one could imagine making to the server, how can I make them implementable modifying the minimum number of modules?” (I haven’t considered using partial derivatives to optimize the minimum… yet.)

Here is a brief description of the middle features:

packages: just package pages

upload: authenticated users can upload new packages, with some checking in place: can’t overwrite packages, can only upload a new version if a maintainer, and so on. Adds a maintainer/author group for each package. By contrast, the mirror feature overwrites packages without question.

check: checking packages before indexing them and providing candidate packages (see previous post)

users: user pages, password-changing, currently using core and not storing any data of its own

distros: linking Hackage with Arch, Debian, and any other distribution with package repositories with Haskell binaries. These distributions can PUT and DELETE to Hackage to indicate the addition and removal of these packages.

build: submission of build reports, both anonymous and with full compilation logs

And finally, an ad hoc but nonetheless important feature:

legacy: a pile of 301 redirects so that old URIs can mostly work (in particular, links to /cgi-bin/hackage-scripts/package/foobar posted on mailing lists 4 years ago will still work)

Features each have their own particular init functions. For instance, the function to initialize the HTML module is currently:

initHtmlFeature :: CoreFeature -> PackagesFeature -> IO HtmlFeature

URI trees

I would have written this up yesterday but I’ve spent the last 24 hours implementing a new and improved routing system. All of the magic happens in impl , the ServerPart Response which is given to Happstack’s simpleHTTP .

impl :: Server -> ServerPart Response impl server = renderServerTree (serverConfig server) [] -- ServerTree ServerResponse . fmap serveResource -- ServerTree Resource . foldl' (\acc res -> addServerNode (resourceLocation res) res acc) serverTreeEmpty -- [Resource] $ concatMap resources (serverFeatures server)

This seems pretty terse for what’s effectively the server’s main method, but complexity lurks just beneath the surface. It all starts with lists of resources, each server feature providing its own list, which are concatenated into a [Resource] . A Resource contains a URI, and how to respond when that URI is visited for certain combinations of HTTP methods and content-types. Although I’ve never coded a line of Ruby in my life, I stole some of Rails’ routing syntax for this task (also stolen by the Pylons web framework, apparently). Here’s how it works:

A resource at “/users/login” will be run only when /users/login is visited, assuming it’s a GET request.

A resource at “/package/:package” will be run when /package/HDBC is visited, but also when /package/nonexistent-1.0 is entered. It’s passed [("package", "HDBC")] in the former case, and there are combinators to turn assoc lists into data values ( type DynamicPath = [(String, String)] and a combinator withPackagePath :: DynamicPath -> (PackageId -> PkgInfo -> ServerPart Response) -> ServerPart Response ). It’s up to the resource to return a 404 if it can’t abide by the URI.

in the former case, and there are combinators to turn assoc lists into data values ( and a combinator ). It’s up to the resource to return a 404 if it can’t abide by the URI. A resource at “/package/:package/doc/…” will be run when /package/uvector/doc/ or any subdirectory is visited, and it’s likewise passed an appropriate assoc list.

I can specify “/package/:package/:cabal.cabal”, and when /package/parsec-3.1.0/parsec.cabal is visited, the resource is given [("package", "parsec-3.1.0"), ("cabal", "parsec")] (the extension is stripped off).

(the extension is stripped off). And the most complicated one: “/package/:package.:format”. This works for /package/QuickCheck ( [("package", "QuickCheck"), ("format", "")] ), or /package/llvm-0.8.0.2.json ( [("package", "llvm-0.8.0.2"), ("format", "json")] ). An empty format means to go for the default, in this case HTML.

Server trees provide a way for efficiently serving an entire tree of URIs. Starting from an empty server tree, resources are incrementally added, and when two share the same URI format they are combined. For example, the simplified Hackage URI tree is:

The relevant types are:

data ServerTree a = ServerTree { nodeResponse :: Maybe a, nodeForest :: Map BranchComponent (ServerTree a) }

data BranchComponent = StaticBranch String -- /foo | DynamicBranch String -- /:bar | TrailingBranch -- /...

addServerNode :: Monoid a => [BranchComponent] -> a -> ServerTree a -> ServerTree a

Finally, I have a 100-line function for converting resources into something Happstack can read (to be broken up shortly, I hope). It’s called serveResource , and it’s how I convert a ServerTree Resource into a ServerTree ServerResponse via ServerTree ‘s Functor instance. Then the tree is converted to its final flat form, using Happstack’s path-munching combinators to traverse each node’s forest.

serveResource :: Resource -> ServerResponse

renderServerTree :: Config -> DynamicPath -> ServerTree ServerResponse -> ServerPart Response

If this effort is a success, I won’t have to deal with the ServerTree type in any great detail for the rest of the summer. I’ve pushed all of the above code to the hackage-server repository.

Thanks for perusing my run-through of some of the internal server design and my exploration of the problem domain. I’ve also been reading the other GSOC blogs, including Marco’s progress on Immix. I had idly considered applying for that, but given my near-total unfamiliarity with the GHC RTS, I think it would’ve been more than a bit difficult for me. I can see he’s doing a great job, too. Still, there are some things I appreciate about using Haskell and not C in my project. Not only does the type system prevent a host of runtime errors, it also forces me to consider all possible sorts of values which can inhabit a given type and write a case for each one. This is something that’s come in handy a lot in the past few days. Well, now on to actually implementing features in detail. I’ll keep you all posted.