Magda is made up of a number of small, mostly self-contained services: search, administration, authorization, discussions, a web server and an API gateway. At the center is the registry, an unopinionated database that stores records as sets of “aspects”: various views of a record, each of which conforms to its own JSON schema.

Upon being created, services can subscribe to events (e.g. a new record being added), and can write their own data to the registry, either by creating their own aspect or by patching an existing one.
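As a rough illustration of the aspect model, the sketch below treats a record as a dictionary of named aspects and shows the two ways a service might write data: creating its own aspect, or patching an existing one. The record ID, aspect names, fields and both helper functions are hypothetical. The real registry does this over HTTP and checks each aspect against its JSON schema, which is left out here.

```python
import copy

# A hypothetical record: an ID plus a set of named aspects.
record = {
    "id": "dataset-123",
    "aspects": {
        "basic-info": {"title": "Rainfall 2017", "publisher": "Example Agency"},
    },
}


def put_aspect(record, name, data):
    """Return a copy of the record with a whole aspect created or replaced."""
    updated = copy.deepcopy(record)
    updated["aspects"][name] = copy.deepcopy(data)
    return updated


def patch_aspect(record, name, changes):
    """Return a copy of the record with `changes` merged into one aspect.

    This is a shallow merge: top-level keys in `changes` overwrite
    the matching keys in the aspect and leave the others alone.
    """
    updated = copy.deepcopy(record)
    updated["aspects"].setdefault(name, {}).update(changes)
    return updated


# A service adds its own aspect, then patches a shared one.
with_links = put_aspect(record, "link-status", {"broken": 0, "checked": 4})
with_quality = patch_aspect(with_links, "quality", {"link-check": 1.0})
print(sorted(with_quality["aspects"]))
```

Because each service writes to its own aspect, or to its own key within a shared one, services can add data to a record without needing to know about each other.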

An example of how this facilitates extension is what we call “sleuthers” — these are microservices that sit on the network and listen for records to be added or modified. When this happens they’ll perform some kind of operation and write the result back to the registry. For instance, we have a sleuther that:

Checks whether URLs linked to by datasets actually work

Writes the result back to its own aspect (which is read by the UI)

Patches a corresponding quality rating into a shared “quality” aspect

… this in turn is listened to by the search indexer, which averages out quality ratings and uses them to inform search ordering.
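The averaging step on the indexer side could look something like the sketch below. The shape of the shared “quality” aspect (a `ratings` mapping from sleuther name to a score between 0 and 1) is an assumption made for illustration, as are the function names. The actual indexer may weight or combine ratings differently.

```python
def average_quality(quality_aspect):
    """Average every rating in a (hypothetical) shared quality aspect.

    Returns None when no sleuther has written a rating yet.
    """
    ratings = quality_aspect.get("ratings", {})
    if not ratings:
        return None
    return sum(ratings.values()) / len(ratings)


def order_by_quality(records):
    """Sort records so that higher average quality comes first.

    Records with no ratings are treated as quality 0 in this sketch.
    """
    def score(rec):
        avg = average_quality(rec.get("aspects", {}).get("quality", {}))
        return avg if avg is not None else 0.0

    return sorted(records, key=score, reverse=True)


records = [
    {"id": "a", "aspects": {"quality": {"ratings": {"link-check": 0.5}}}},
    {"id": "b", "aspects": {"quality": {"ratings": {"link-check": 1.0, "format": 0.5}}}},
    {"id": "c", "aspects": {}},
]
print([rec["id"] for rec in order_by_quality(records)])
```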

To be compatible with Magda, a sleuther needs only to make an HTTP call to register itself, expose an HTTP interface that can accept webhooks, then make another HTTP call to record its results — nearly any programming language I can think of is capable of this (sorry QBasic).
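To make those three steps concrete, here is a minimal sketch using only Python's standard library. It builds the HTTP requests without sending them. The registry URL, the endpoint paths, the payload fields and the sleuther ID are all placeholders invented for this example, not Magda's actual API.

```python
import json
from urllib.request import Request

REGISTRY_URL = "http://registry.example/api"  # hypothetical base URL
SLEUTHER_ID = "broken-link-sleuther"          # hypothetical sleuther name


def json_request(method, url, body):
    """Build (but do not send) an HTTP request with a JSON body."""
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def build_registration(callback_url):
    """Step 1: ask the registry to call us back when records change."""
    body = {
        "id": SLEUTHER_ID,
        "url": callback_url,
        "eventTypes": ["record-created", "record-modified"],  # assumed names
    }
    return json_request("POST", f"{REGISTRY_URL}/hooks", body)


def build_result(record_id, result):
    """Step 3: write the result into this sleuther's own aspect."""
    url = f"{REGISTRY_URL}/records/{record_id}/aspects/{SLEUTHER_ID}"
    return json_request("PUT", url, result)


def handle_webhook(payload, inspect):
    """Step 2: called by the sleuther's HTTP endpoint for each webhook.

    `inspect` is whatever operation the sleuther performs on a record;
    one result request is returned per record in the payload.
    """
    return [
        build_result(rec["id"], inspect(rec))
        for rec in payload.get("records", [])
    ]


requests = handle_webhook({"records": [{"id": "ds-1"}]}, lambda rec: {"ok": True})
print(requests[0].get_method(), requests[0].full_url)
```

Passing any of these `Request` objects to `urllib.request.urlopen` would actually send it. The sketch stops short of that so it can run without a registry to talk to.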

The sleuther also:

Can’t hit private APIs

Can be updated independently of the registry (and vice versa)

Can’t take down the whole system unless it somehow manages to kill all of the machines hosting the system, and

Can be removed easily — just shut down the service and remove its webhook registration.

Gluing It All Together: Kubernetes and Helm

Naturally this is impractical in a world where every node on the network requires a new VM to be manually provisioned, dependencies set up, an application installed, network configured etc. Luckily, times are changing.

Welcome to the future.

With Docker and Kubernetes, we can write down our entire system — runtimes, databases, configuration, storage, autoscaling, load balancing, networking and all — as readable, source-controllable text files that can be used to build the system from scratch, automatically, reproducibly and on any cloud or on premises.

Helm then allows us to turn these files into templates, providing specific customizations for various contexts, and track those customizations too. For instance, you might want to provision different storage depending on whether the system is running on Google Cloud or AWS, or run only a single database server in development.

The result is that whether running locally (with minikube), on a cloud provider or on premises, a complex system can be installed with a simple helm install --values customization.yaml . Get something wrong? Fix it and run helm upgrade <installation-name> --values customization.yaml — Helm and Kubernetes will figure out what’s changed and update your cluster automatically.
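One way to picture what a values file does: the chart ships with default values, and the file passed via --values is layered on top of them. The sketch below is a simplified model of that layering, not Helm's implementation (Helm's real merge has more rules than this), and the keys shown are made up for the example.

```python
def merge_values(defaults, overrides):
    """Recursively layer `overrides` on top of `defaults`.

    Nested dictionaries are merged key by key; anything else
    (strings, numbers, lists) is replaced outright.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical chart defaults and a development customization.
chart_defaults = {
    "database": {"replicas": 3, "storageClass": "standard"},
    "autoscaling": {"enabled": True},
}
development = {
    "database": {"replicas": 1},
    "autoscaling": {"enabled": False},
}

print(merge_values(chart_defaults, development))
```

Only the keys that differ need to appear in the customization file, which keeps each environment's file short and easy to review.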

Breaking the Fourth Wall

Where this gets truly crazy is that using the Kubernetes API, you can also modify your cluster’s configuration from within the cluster itself!

We’re currently adding functionality to Magda that allows an administrator to add services to their installation without having to know what Kubernetes even is.

A key feature of Magda is federation — it’s able to connect to external data sources and crawl their contents, bringing the metadata into its own registry and indexing it for search. Naturally, these connectors are implemented as microservices.

Through our admin interface, a user simply needs to provide the ID of an appropriate Docker image and the configuration it needs. Then, through the Kubernetes API, we can create a new container, run the desired image, pass it the right configuration and clean up once it’s finished. Because Kubernetes uses a declarative model for configuration, we can easily roll back to the previous version in the event of a problem, or even pull the new config out so that this cluster can be exactly replicated elsewhere.

We’re even able to integrate with Kubernetes’ jobs API to provide a dashboard for what’s currently being crawled. In this way, we’re able to get a package manager for free from Docker, and a distributed job scheduler for free from Kubernetes.
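As a sketch of what running a connector this way might involve, the function below builds a Kubernetes Job manifest from an image ID and a configuration dictionary. The field names follow the `batch/v1` Job resource, but the function, the image name, the configuration keys and the choice to pass configuration as environment variables are assumptions for illustration. Magda's real implementation may differ.

```python
import json


def connector_job(name, image, config, ttl_seconds=3600):
    """Build a Job manifest that runs `image` once with `config` as env vars.

    `ttlSecondsAfterFinished` asks Kubernetes to delete the finished Job
    (and its pods) after the given delay, which covers the clean-up step.
    """
    env = [{"name": key, "value": str(value)} for key, value in sorted(config.items())]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": name, "image": image, "env": env}],
                },
            },
        },
    }


job = connector_job(
    "example-connector",                 # hypothetical job name
    "example/connector:1.0",             # hypothetical image ID
    {"SOURCE_URL": "https://data.example", "PAGE_SIZE": 100},
)
print(json.dumps(job, indent=2))
```

Submitting the manifest is then a matter of POSTing it to `/apis/batch/v1/namespaces/<namespace>/jobs` on the Kubernetes API server, or using one of the Kubernetes client libraries. Since the manifest is plain data, it can also be stored and replayed elsewhere.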

Kubernetes is our Runtime

Creating this kind of hard dependency on a technology does lock us in to some extent, but because Kubernetes is open-source, vendor-agnostic and well-supported, we’re happy to take that risk for what we gain. Effectively, what PHP is to WordPress and the JVM is to Jenkins, Kubernetes is to Magda.