The Zen of Comprehensive Archive Networks

There seems to be a lot of interest in other languages having archives similar to what CPAN [1] is for Perl. I should know; over the years people from at least the Python, Ruby, and Java communities have approached me or other core CPAN people to ask basically "How did you do it?". Very recently I've seen even more interest from some people in the Perl community wanting to actively reach out a helping hand to other communities. This 'missive' tries to describe my thinking and help people wanting to build their own CANs. Since I hope this message will somehow end up reaching the other language communities, I will explicitly include URLs that are (hopefully) obvious to Perl people. Note that I'm going to describe what worked for Perl; translate appropriately for other languages.

I'll start negatively and end with hopefully more constructive notes; these, however, will build on the denials.

In the following, Mumble and mumble stand for any language other than Perl, or a combination of languages other than Perl.

First, the negative statements.

CPAN shall not 'piggyback' other languages. (In other words, there shall not be a mumble/ top level directory.) Rationale: CPAN is CPAN is CPAN. CPAN carries Perl. This implies all kinds of different contracts, explicit and implicit. Some people in the Mumble community will take offense at CPAN carrying Mumble. Some people in the Perl community will take offense at CPAN carrying Mumble. Some CPAN mirrors will take offense at suddenly having to carry Mumble as well. Some CPAN mirrors will become resource (bandwidth, disk) constrained after suddenly having to carry Mumble as well.

CPAN cannot 'piggyback' other languages. The building blocks or 'plumbing' of CPAN (the basic directory structure, the PAUSE) are a reasonably good match for Perl. I'm not so certain that they are for all the other languages.



Now, on to the hopefully more constructive suggestions.

First and foremost-- I'm not against other language communities having a CPAN. I would love to have such archives. I'm willing to help the other language communities. I'm only against too straightforward "let's just slap it onto the side of CPAN" solutions to the problem. Other languages are not like Perl; they are different, to a smaller or larger degree. Let's allow them their own degree of dignity and careful thought.

Then on to the technical questions, also known as "How did you do it?" Well, people always ask me that and I go speechless... "Errrr, ummm, I kind of pulled all this stuff together and organized it a bit, and put it on an ftp server". After this a brooding silence always falls... "And...?" ... "And what?" ... "That's it?" "That's it."

Components of CPAN

Well, that's not really it, of course. The above is how CPAN started. How it grew is another story. First, Larry designed Perl to grow by letting it have modules (in other words, namespaces). Then we had a couple of wise men (like Tim Bunce) who had the vision of good module naming guidelines. Finally, we had Andreas König, who single-handedly wrote PAUSE [2], the module submission machinery, where Perl module authors can register, submit, and manage their submissions. This allowed for a rapid but still controlled growth of modules.

Installing modules can be difficult, especially if that involves having to glue in C and/or external libraries. Andreas and other people wrote both a frontend and a backend for this: the frontend is known as the CPAN (shell) [3] and the backend is known as the MakeMaker [4]. The shell (also known as CPAN.pm) takes care of downloading the required components, and the backend creates the required Makefiles (or equivalent build tool control files) and then invokes the appropriate build tools.

Incidentally, naming the module installation shell identically with the archive proved to be more than a little bit confusing: people may talk of "CPAN being broken" and you will have no idea whether they are talking of a bug in the shell, their favourite CPAN mirror being down, or whether they are objecting to some design detail of CPAN in general.

Another variant of confusion is that many people think CPAN is "just" the PAUSE, in other words, just the modules submitted by authors using the PAUSE interface. While not wrong (the overwhelming majority of CPAN content does indeed come from PAUSE), this is not exactly right, either. Firstly, CPAN does have other sources than just PAUSE: there are a couple of small sites CPAN merges into itself, and some files (like some rarer binary distributions of Perl) are still fetched manually (since they change infrequently). Secondly, there is the ports page that lists binary distributions for Perl, some in CPAN, most hyperlinked from elsewhere.

An essential feature for (half)automated installation tools is easy extraction of module dependencies. Easy documentation extraction allows for easy online documentation browsing, which in turn makes it easier for people to decide whether they want to use a module, and when they use it, to use it better.
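To make the dependency point concrete, here is a minimal Python sketch, not taken from CPAN.pm or any real tool. It assumes each module's prerequisites have already been extracted into a plain mapping; the module names and the install_order helper are hypothetical. Once the dependencies are that easy to get at, working out a safe installation order is a few lines of standard-library code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency metadata: module name -> modules it requires.
# A real archive would extract this from each distribution's own metadata.
DEPENDENCIES = {
    "Web::Client": {"Net::Socket", "Text::Parser"},
    "Text::Parser": {"Text::Tokens"},
    "Net::Socket": set(),
    "Text::Tokens": set(),
}


def install_order(target, deps):
    """Return the modules needed for `target`, prerequisites first."""
    needed = {}
    stack = [target]
    while stack:  # collect the target and everything it depends on
        name = stack.pop()
        if name in needed:
            continue
        needed[name] = set(deps.get(name, ()))
        stack.extend(needed[name])
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(needed).static_order())


order = install_order("Web::Client", DEPENDENCIES)
```

The resulting list always places a module after everything it requires, so an installer could simply walk it from front to back.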

Since the CPAN shell is starting to show signs of its age and because it doesn't have a good programmable API, a new project called CPANPLUS [5] has been started. It will hopefully be a drop-in replacement for the old trusty CPAN shell, but also allow greater flexibility and extensibility. Similarly, there is a replacement project for MakeMaker, Module::Build [6].

Note that CPAN.pm and MakeMaker come with every Perl distribution, but it is possible to write alternative module installation interfaces: ActiveState has their own interface called ppm (Programmer's package manager, originally known as Perl package manager) for their ActivePerl distribution.

Because of the growth of CPAN, it finally became too arduous to know what was out there, and luckily Graham Barr's scratch to this itch became large enough to be published as search.cpan.org [7]. There are also alternative search engines for CPAN, Randy Kobes' search [8] and WAIT [9], but search.cpan.org seems to be the most popular.

Later backPAN [10] was added by Andreas to hold all the old versions of submissions deleted by their authors; this ties back into simple basic things that the core server(s) must have, like good backups.

The cpan-testers is a mailing list (started by Graham Barr and Chris Nandor) whose subscribers download recent module uploads, try running the regression suites, and report the success or failure back to a mailing list, which gets databased, and of course back to the original author. This has proved to be invaluable in making the modules more portable between operating system platforms and between different releases and configurations of those platforms. Also important to notice is that having regression test suites coming with the modules is essential-- how else can you know whether the code works at all?

Last but not least, module feedback (bug reports, enhancement requests, or praise) can be given through the RT ticketing system [12] set up by Jesse Vincent.

Mirrors

CPAN mirrors [13], then? How did they come about? The original ones, a dozen or so, were easy: I just asked the maintainers of the original ftp sites where I had found the seeds of CPAN whether they might be interested in carrying this slightly bigger amalgamated Perl archive. Well, they foolishly agreed... I have to remind people once again that CPAN was conceived as an FTP archive. Not a website. And it still is that way. search.cpan.org just gives a nice interface. I'm sorry but I'm a dry CS engineer, not a graphic designer. Information, not animation.

Oh, back to the CPAN mirrors. After the original ones, we grew slowly for a while, by word of mouth in the Perl community. However, since this was the time before billions of dollars' worth of fiber was dug into the ground, Internet connections were still a bit dodgy and spotty. Therefore I started doing two things: scanning ftp logs for sites that obviously were mirroring CPAN but were not registered mirrors, and looking for sites that were good representatives for their particular top level domain, especially outside the big seven TLDs. This way I could track down where Perl was used and, by asking those sites to participate, take load off the master site. Later I also filled in missing countries by going for sites like the sunsites, and other vendor/public funded sites that had a good chance of having good connectivity. Usually I could find a sympathetic soul, oftentimes a system administrator.

The status of the CPAN mirrors is monitored four times a day, from two different machines on two different continents. A stale mirror is almost worthless, sometimes even dangerous. Note also that as the number of mirrors grows, you cannot expect to be able to check all of them at each scan: there are always some network or server problems that will stop you from getting all the status information. Getting the full status of all the files on all the mirrors is a fantasy unless the mirrors themselves run integrity checks. CPAN relies on a very simple trick: the CPAN master site updates a certain file once every hour, embedding a (UTC) timestamp in that file. By downloading that file from a mirror and extracting the timestamp we can trivially see when they last updated.

Summary of the mirror tirade: I went for sites that liked and/or used Perl. I have no way of knowing off-hand whether they would like Mumble. The mirrors are donating their network and storage capacity and some amount of their administrative time for the Perl community. If we would like to extend that in any way, we would have to ask each of them individually.
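The timestamp trick is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the stamp file format (epoch seconds as the first whitespace-separated field), the two-day staleness limit, and the function names are all made up for the example and are not CPAN's actual format or policy.

```python
from datetime import datetime, timedelta, timezone


def parse_stamp(text):
    """Parse the assumed stamp format: '<epoch seconds> <free-form text>'."""
    first_field = text.split()[0]
    return datetime.fromtimestamp(int(first_field), tz=timezone.utc)


def mirror_age(mirror_copy, now):
    """How long ago the mirror last picked up the master's stamp file."""
    return now - parse_stamp(mirror_copy)


def is_stale(mirror_copy, now, limit=timedelta(days=2)):
    """True if the mirror's copy of the stamp is older than `limit` (assumed value)."""
    return mirror_age(mirror_copy, now) > limit
```

Here mirror_copy is the text of the stamp file as downloaded from a mirror, and now is a timezone-aware UTC datetime. Because the master rewrites the file every hour, the age of a mirror's copy is a good stand-in for the age of the whole mirror, without comparing a single module file.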

You can learn more about CPAN's history from the Perl timeline [14]. Things didn't happen overnight.

Naming

A quite important thing for both the authors and the users is that the language must get the naming scheme of its modules right, or at least reasonably close. Perl's/CPAN's is far from perfect, but at least it was once designed, and it has been enhanced over the years as new needs have appeared. A good naming scheme allows hierarchical browsing, gives good hints for search engines (a good name is effectively a string of uniquely identifying keywords), and coordinates community efforts. Some sort of conflict resolution mechanism in case of competing and identically named implementations is important. Keeping all those guidelines well documented and all these processes public is important.

One naming issue I think Perl 5 got wrong is that module namespaces are first-come-first-served: two or more different authors cannot have an identically named module. This may lead to unintentional or intentional namespace squatting, and some overly heated exchanges, none of which is good for the community.

When designing your author/module/whatever hierarchy, think scalability. We originally got it wrong in one spot by having all authors as subdirectories in one single directory, which quickly became a bottleneck. (The solution to this was simply to 'hash' based on the leading two characters of the user ids.) Think also of several different views of your data: by author, by module, by category, by date, by keywords, and so forth. Don't think hierarchical views alone will be enough: you will need searching capabilities.
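The directory 'hashing' can be sketched as follows. This is a minimal Python illustration, assuming a layout with one level for the first character of the user id and another for the first two; the author_dir helper and the default root path are assumptions made for the example.

```python
from pathlib import PurePosixPath


def author_dir(user_id, root="authors/id"):
    """Spread author directories over two levels keyed on the leading characters."""
    uid = user_id.upper()
    if len(uid) < 2:
        raise ValueError("user id needs at least two characters")
    # e.g. <root>/<first char>/<first two chars>/<full id>
    return PurePosixPath(root) / uid[0] / uid[:2] / uid
```

With this scheme no single directory has to hold every author: each level fans out by at most the size of the alphabet used in user ids, and the path for any author can be computed without a lookup.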

Licensing

Get your license policy clear from day one. No, day minus one. In this day and age it is very important that every piece of software gets clearly marked as to what license it carries. Build your module packaging tools so that they suggest, maybe even demand that the author picks a license. This way both the users of modules and distributors of software wanting to include the module don't have to keep guessing.

Very much related to the licensing is of course commercial use: CPAN took the easy and clear policy of no commercial software of any kind, not even share/guilt/donateware would be allowed. We felt that any other policy would be open to nitpicking, or maybe even legal challenges, and as a volunteer group we do not have time or other resources for any of that.

Keep Things Safe

That the servers hosting the archive core services should be paranoidically maintained and monitored for security goes without saying, but I'm saying it anyway.

Should you have PGP/GPG keys and triply-written-in-blood signatures? Maybe. Currently CPAN has only MD5 checksums-- but so far they have been enough. Then again, given the recent rise in Trojan attacks against various pieces of open/free software a greater level of trust may be needed. There are ongoing projects that enable using PGP/GPG keys for verifying the origin of the software; but as always with PKI systems, bootstrapping the web of trust is hard, some say even not worth the trouble. Where should you store the public keys? Obviously not in the same place as the module distributions themselves. Which public key servers would you trust? One lightweight way to do without PKI would be simply to distribute the original checksums to enough places so that an attacker couldn't feasibly modify all the copies. But at some point you would be very probably trusting DNS, anyway.
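The "distribute the checksums to enough places" idea might look something like the Python sketch below. It is not how CPAN verifies anything: the verify function, the notion of a quorum, and the source names are hypothetical. It also uses SHA-256 through the standard hashlib module rather than the MD5 mentioned above; the algorithm is a parameter, so either can be plugged in.

```python
import hashlib


def digest(data, algorithm="sha256"):
    """Hex digest of `data` (bytes) using a hashlib algorithm name."""
    return hashlib.new(algorithm, data).hexdigest()


def verify(data, published, quorum, algorithm="sha256"):
    """Accept `data` only if at least `quorum` independent sources agree with it.

    `published` maps a source name (hypothetical) to the checksum it lists.
    """
    actual = digest(data, algorithm)
    matches = sum(1 for value in published.values() if value.lower() == actual)
    return matches >= quorum
```

An attacker who replaces the file and the checksum on one server still fails the check as long as fewer than quorum sources are compromised. How the client reaches those sources (and whether it trusts DNS to do so) is outside this sketch.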

Keep Things Open

Code quality? Ratings/reviews? Moderation/metamoderation? "Approved" SDKs? These all are hotly debated subjects and will not be addressed here since the CPAN is and will stay an open and free forum, where the authors decide what they upload. Any further selection belongs to different fora. Besides, adding any rating or approval processes creates bottlenecks, and bottlenecks are bad.

Be mindful of other platforms than Intel Linux and Windows. There's no need to alienate people of rarer tastes. One day they will help you.

Make your archive accessible via several means. Don't stop at just HTTP: think FTP and rsync, too. On the other hand, do get the basic protocols right first-- don't jump off the deep end and try to create an all-singing all-dancing web service, or whatever is currently fashionable. This ties back to being platform agnostic: try to package your modules so that the maximum number of people can install it.

CPAN Scriptorium

The scripts that maintain the CPAN are dreadfully simple. They are just simple shell scripts that copy sites A, B, ..., Z to the CPAN master site at ftp.funet.fi, launched from cron. Many of them use Ye Olde Original mirror [15], some of them are just rsync [16]. No magic. I really don't have anything to give away, no magic bags full of powerful CPAN spells. The most complex script in the CPAN master site is the script [17] that probes the mirror sites for uptodateness-- and even that is not rocket science, just multiplexing ftp and http downloads and comparing timestamps.
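For a flavour of what "multiplexing downloads and comparing timestamps" can amount to, here is a small Python sketch. It is not the actual probing script [17]: the probe_all function is hypothetical, and the actual download is left to a caller-supplied fetch_stamp function (assumed to return an integer timestamp or raise on failure), so no real ftp or http API is implied.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def probe_all(mirrors, fetch_stamp, master_stamp, workers=8):
    """Fetch every mirror's stamp concurrently and compare it with the master's.

    `fetch_stamp(url)` is a caller-supplied function returning an integer
    timestamp; it may raise on network or server trouble.
    """
    report = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_stamp, url): url for url in mirrors}
        for future in as_completed(futures):
            url = futures[future]
            try:
                lag = master_stamp - future.result()
            except Exception as error:  # one bad mirror must not stop the scan
                report[url] = ("unreachable", str(error))
            else:
                report[url] = ("ok", lag)  # lag in seconds behind the master
    return report
```

Unreachable mirrors are recorded rather than treated as fatal, since some of them will always be down or slow at any given scan.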

Andreas has the webserver code for PAUSE available online [18]. That code is slightly more complex than the core CPAN scripts, or the scripts supporting the PAUSE; but even here, the code is there. Again, no tricks up our sleeves.

Conclusions

There is no magic. All it takes is a few people who sit down and first get something running, a rough cut. Then iteratively enhance it. Don't try to create a master plan that will get everything right in one fell swoop. The only one that will get swooped is you.

One way to summarize most of the above is the priceless KISS principle-- Keep It Simple, Stupid. Avoid too complex setups. Start simple.

Another important credo is: Avoid bottlenecks and interdependencies. Decentralize. Create and encourage alternatives. For example, the most popular search engine of CPAN isn't actually part of CPAN proper: search.cpan.org just mirrors CPAN and from the data builds the search indices and searching/browsing interfaces. That way there can be several search engines of the same CPAN. Similarly, currently we use CPAN.pm + MakeMaker to install modules, but we are not committed to either, and the community is working on replacements. Keep things loosely connected. This allows different people to work on their own enhancements without disturbing the other parts.

Perhaps the most demanding thing is commitment: someone must keep things running. A slowly decaying and dusty archive is almost worse (and certainly more sad) than no archive at all.

Oook and out.

While writing this article I got valuable feedback from many people: from the CPAN core people, and from the readers of use.perl.org (a Perl news and community website). I have to especially mention Neil Kandalgaonkar, who shared his war stories from the ActiveState trenches.

This article is free documentation; you can redistribute it under the same terms as Perl itself. Quoting it, linking to it, translating it to other languages, or using the illustration(s) is allowed as long as the URL of the original article ( http://www.cpan.org/misc/ZCAN.html ) is included.

$Date: 2003/01/09 00:04:23 $

Jarkko Hietaniemi, the CPAN Master Librarian