Google is Maven Central's New Best Friend

Maven Central has long been a critical resource in the Java community. I don’t think there is a single developer or organization using the JVM that has not been impacted by Maven Central. Its importance as an open resource cannot be overstated. That’s why I am glad to announce that Google, a prolific participant in the Java community, is now hosting a complete mirror of Maven Central, with research and data mining opportunities!

Before I get into the details, I think it would be useful to talk a little about Maven Central’s history, because it has had many homes. It spent its first few weeks at the Apache Software Foundation, after which the very generous people at Ibiblio started hosting Maven Central and carried Maven through its early growth spurts. When Maven Central started to tax Ibiblio’s infrastructure, the community was fortunate to have an advocate in Matthew Porter, the CEO of Contegix. Matthew provided a new home for Maven Central, where it lived happily for many years while growing rapidly in size and traffic volume.

Along the way I founded Sonatype and eventually the day-to-day operations of Maven Central were taken over by Brian Fox and others. Today everything is managed by Jason Swank, Joel Orlina and a few other members of Sonatype’s technical operations team. These guys are the unsung heroes of Maven Central. They field all your requests to sign up to Maven Central, support various Nexus instances as the input funnel, write documentation for users and publishers, run search.maven.org, and keep the infrastructure of machines and CDNs humming along. Everything is running well.

So how does Google fit into this equation?

About two years ago a friend of mine at Google, Matt Stephenson, asked if it would be possible to get a copy of Maven Central. Google has contributed a lot to the Java community, so I had no issues with this, and I started working with Matt to figure out how to get Google a full copy of Maven Central. We needed a place everyone could access, which ruled out any private Google infrastructure. Fortunately, Google had recently announced its new publicly available cloud infrastructure, so we decided to give that a try. We ran a compute instance with a client that used the SOLR index available on search.maven.org to synchronize the content, and that gave Google its first full copy of Maven Central.

Now that I had a full content replica of Maven Central, it got me thinking about what was possible. Imagine being able to provide any research group with access to the content of Maven Central, or giving the Java community access to the content from which to generate new data and new analysis tools. Imagine running experiments on different types of transports and dependency pre-computation to improve artifact resolution times. I imagined how much was possible, but wondered how all these experiments could be run without jeopardizing the existing infrastructure. So I asked Google if they would be interested in timely updates of Maven Central, and whether they would be willing to provide the infrastructure for their own needs and for other developers. Given Google’s very pro-developer position, I was not surprised to receive a resounding “yes”. Time to get to work.

My first problem was getting reliable incremental updates for Maven Central. For a one-off copy, using the SOLR index during off hours was fine. I don’t think anyone actually noticed me hogging the index, but slamming the SOLR index repeatedly during the day is not very nice, and it would not scale either. I talked to Mike Hansen, the head of Products at Sonatype, and we figured out a way to make the canonical Maven Central S3 bucket at Amazon available for synchronization. The bucket not only contains the artifacts, but also includes journals of additions produced every two hours. In theory this means that any sink can be updated in two-hour intervals. I worked with Jason Swank to set up the access and wrote a small program to do the synchronization. Now we have a continuously updated content replica. But how can we make a complete replica?
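The journal-driven update loop can be sketched roughly as follows. This is a minimal illustration only, not the actual synchronization program: the journal format (a newline-separated list of added artifact paths) and the idea of a `copyArtifact` step are hypothetical stand-ins, since the real bucket layout isn’t public.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two-hour incremental sync.
// The real journal format and the S3/Cloud Storage calls are not shown.
public class MirrorSync {

    // Assume each journal is a newline-separated list of artifact paths
    // added since the previous journal (an illustrative format).
    static List<String> parseJournal(String journal) {
        List<String> paths = new ArrayList<>();
        for (String line : journal.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // In the real program this journal would be fetched from the
        // canonical S3 bucket every two hours.
        String journal = "org/apache/maven/maven-core/3.2.3/maven-core-3.2.3.jar\n"
                + "junit/junit/4.11/junit-4.11.pom\n";
        for (String path : parseJournal(journal)) {
            // A copyArtifact(path) step would push each file to the mirror.
            System.out.println("sync: " + path);
        }
    }
}
```

Because each journal only lists what changed, any sink that has applied all previous journals stays current by applying just the newest one.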

For a complete replica of Maven Central I needed a way to serve the content via HTTPS, and Google Cloud Storage provided exactly what I needed. With APIs very similar to Amazon’s S3, it is easy to push content and make it generally available. Having the content generally available from Cloud Storage also makes it easier to consume within Google Cloud Platform than trying to move the content around with rsync over ssh. Google Cloud Storage provides some other interesting features as well: it can ultimately make Maven’s transport far more efficient by using HTTP/2 and leveraging the available compression schemes, and we can use existing HTTP headers, and augment them, to do some amazing things. Now we have the makings of an incredibly reliable system: a copy of the canonical Maven Central along with a full replica of the content-serving mechanism. The community can use it and we have full redundancy, which is great given how important Maven Central is. Hats off to both Google and Sonatype for being forward thinking and accommodating while putting this infrastructure together. Now there are all sorts of cool projects we can work on!

Really and truly, the number of tools and forms of data that can be produced from the content in Maven Central is limitless, and so is the innovation they can enable. There is so much data potential in Maven Central that it’s mind-boggling.

One of the first tools is being worked on by Tamás Cservenák (Sonatype) and Fred Bricon (Red Hat). They want to provide a simple service that will replace the need for the Nexus index. The Nexus index was long used in tools like M2Eclipse to search for artifacts, but the size of the indices has become unwieldy, and few people care about artifacts that are seven or more years old. They decided that a fast search service would be more useful, so they are working on one. It might be interesting to see if we can move towards a “Maven Repository API” in one form or another.

As for myself, I would like to start generating semantic version and binary compatibility information and make it generally available. I think this can have a huge positive impact if a developer can tell whether it’s safe to move to a different version of a library.
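To give a flavor of the kind of check such data could power, here is a toy illustration. The class and the rule it encodes (under semantic versioning, an upgrade within the same major version is expected to be binary compatible) are my own illustrative stand-ins, not the actual service or analysis:

```java
// Toy illustration of a semantic-version safety check. Under semver,
// moving within the same major version should be binary compatible,
// while crossing a major boundary may break consumers.
public class SemverCheck {

    // Returns true if moving from 'current' to 'candidate' stays within
    // the same major version (e.g. 3.2.3 -> 3.3.9, but not 3.x -> 4.x).
    static boolean safeUpgrade(String current, String candidate) {
        int curMajor = Integer.parseInt(current.split("\\.")[0]);
        int candMajor = Integer.parseInt(candidate.split("\\.")[0]);
        return candMajor == curMajor;
    }

    public static void main(String[] args) {
        System.out.println(safeUpgrade("3.2.3", "3.3.9")); // true
        System.out.println(safeUpgrade("3.2.3", "4.0.0")); // false
    }
}
```

A real service would of course go further, comparing the actual bytecode of two artifact versions rather than trusting the version string alone.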

What’s possible in the future? Real-time synchronization using Amazon’s S3 notification APIs, and component indices produced for segments of the community like Android, Spring, Scala, or Clojure developers. It should be possible to catalog and categorize artifacts and their metadata for various use cases. Imagine having an up-to-date catalog of all Android components available in Maven Central. Very powerful.

If you want to give the new mirror a try, you can use the following in your $HOME/.m2/settings.xml:

<settings>
  <mirrors>
    <mirror>
      <id>google-maven-central</id>
      <name>Google Maven Central</name>
      <url>https://maven-central.storage.googleapis.com</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>

This is just the beginning and a glimpse of what might lie ahead. I hope that by making the content of Maven Central available to the community, truly amazing things will start to happen. Maven Central is an incredible resource and I’m sure many smart people have ingenious ideas about how to leverage what we have. Please feel free to reach out at JavaOne to talk about what’s possible. I will be talking about Google Maven Central with folks from Sonatype and Google at “Still Rocking It: A Dozen Demos and Great News for Apache Maven Users” if you’re interested. I’d love to hear what you think!
