Looking at the above image of our problem, what is your first thought on resolving the issue?

It’s a rite of passage for any developer: when you’re thrown a software problem, at some point you reach for a cache. But this problem goes beyond caching.

We definitely don’t want every source system to implement its own API object cache. Nor do we want to solve this by throwing more hardware at each system. Having every team build the same cache breaks one of the first things you learn in Computer Science — Don’t Repeat Yourself!

It’s fairly clear to the even mildly observant eye what the common entry point is for all this data. The API Gateway. A single entry point to all source systems. If a request is made to the API, we can capture that response, return it to the client, and then cache it. Simple, no?

Yes! We can be quite naive here. Let’s add a short-lived cache that places any requested object into a Fastly cache for 5 minutes. This feels comfortable. It’s too short for anyone to notice stale data (like a price), and it could alleviate some of our failed-request issues.
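At the gateway, that could look like attaching edge-cache headers to each response. Here is a minimal sketch assuming Fastly’s Surrogate-Control header for the edge TTL (the header choices are illustrative, not our actual configuration):

```python
# The 5-minute window described above.
CACHE_TTL_SECONDS = 5 * 60

def cache_headers(ttl: int = CACHE_TTL_SECONDS) -> dict:
    """Headers for a short-lived edge cache entry.

    Fastly reads Surrogate-Control for its own TTL and strips it
    before the response reaches the client, so browsers can be told
    not to cache independently of the edge.
    """
    return {
        "Surrogate-Control": f"max-age={ttl}",
        "Cache-Control": "no-store",
    }
```

The gateway would merge these into whatever headers the source system already returned.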

Solutions should be simple, clear, and accessible for the team. But … this just feels too naive. First, Fastly runs on hundreds of nodes, and even with shielding enabled, there’s no guarantee a newly-cached resource will always be a HIT.

Second, this still doesn’t help if a system stays down for longer than 5 minutes. If we can’t even fetch the object from the source system, we won’t be able to cache it. Let’s think beyond this. Let’s define a new principle.

No single request to the API will hit a source system. Rather, it will always hit a persistent cache managed by API services.

A persistent, invalidated-when-necessary cache for an API serving over two million objects? Yeah, that sounds pretty darn cool.

We began to review our options for persistent storage. We love Postgres and use it heavily. However, the store that spoke to us was Elastic Search. It spoke to us through these key traits: it’s distributed, it has built-in search, and it’s a general, versioned document store.

So, we spun up a cluster of Elastic Search. Which, by the way, is ridiculously simple. We used Ansible to do so, and it’s essentially running these tasks:

- Ensure Java is installed
- Download the Elastic distribution
- Unarchive the Elastic binary into some desired root
- Write a configuration, or use the Elastic default
- Run Elastic

Repeat on each box, and adjust the config in minor ways per the documentation. Voilà, a cluster!
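Those tasks could be sketched as an Ansible play along these lines. This is an illustrative sketch, not our actual playbook: the modules are standard Ansible modules, but the Java version, paths, and variable names are assumptions:

```yaml
- hosts: elastic_nodes
  become: true
  tasks:
    - name: Ensure Java is installed
      package:
        name: openjdk-8-jdk        # version is illustrative
        state: present

    - name: Download Elastic distribution
      get_url:
        url: "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{{ es_version }}.tar.gz"
        dest: /tmp/elasticsearch.tar.gz

    - name: Unarchive Elastic binary into some desired root
      unarchive:
        src: /tmp/elasticsearch.tar.gz
        dest: /opt
        remote_src: true

    - name: Write, or use default, Elastic configuration
      template:
        src: elasticsearch.yml.j2
        dest: /opt/elasticsearch/config/elasticsearch.yml

    - name: Run Elastic
      command: /opt/elasticsearch/bin/elasticsearch -d
```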

Then, to test things out, we bulk loaded an entire resource set in Elastic. The first set we focused on was our places resource, which contains just under 9 million objects. This resource is essentially all the Places which can be referenced by various objects in our API (hotels, restaurants, activities, …). What we saw right away:

Loading the entire places dataset into Elastic, and reading from it rather than from the source system, improved response times significantly.

Elastic Search natively supports a wide range of search functions on its documents. We were able to build a simple Python service which could translate a query string in the URL to an Elastic Search query.

Let’s talk about these search functions some more. First, here’s how our API looks today, with Elastic Search as part of the picture.

This Sieve app, named after the kitchen tool, is the glue between our API and Elastic Search. It’s a Python 3 app, leveraging aiohttp to handle some serious load. To bind to Elastic, we simply transform query-string parameters into Elastic Search queries, and query the Elastic Search API within the Sieve app, on demand.
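As a rough sketch of that translation (the field names and exact query shape are illustrative; Sieve’s real mapping is richer):

```python
def build_es_query(params: dict) -> dict:
    """Translate URL query-string parameters into an Elastic Search
    bool query: each key=value pair becomes a term filter."""
    filters = [{"term": {field: value}} for field, value in params.items()]
    return {"query": {"bool": {"filter": filters}}}

# e.g. a request like /places/?country=IT would translate to:
query = build_es_query({"country": "IT"})
```

The resulting dict is what the service would POST to Elastic Search’s `_search` endpoint.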

This was good. We could see this becoming a single, consistent search layer not only for the places resource but for, well, everything. No reason for every source system to build its own search functionality. What a relief that’d be for those product owners!

There was one obvious problem before we could route API requests to this new functionality. You can guess what it is, right? Resources change all the time. So, let’s talk about cache invalidation.

Prior to this, we’d already implemented a webhook delivery system (for the record, we call it major tom). major tom acts the way you’d expect a webhook system to act. When a source system modifies data in some way, it sends a simple ping to major tom, telling it to notify subscribers of the change. What major tom receives looks something like this:

{
    "resource": "tour_dossiers",
    "language": "en",
    "data": {
        "id": "24270"
    }
}

major tom then sends out an event to every subscriber of tour_dossiers webhooks, doing some extra work to add timestamps and an event type. The final result major tom produces looks like this:

[
    {
        "event_type": "tour_dossiers.updated",
        "resource": "tour_dossiers",
        "created": "2018-05-07T05:34:38Z",
        "language": "en",
        "data": {
            "id": "24270",
            "href": "https://rest.gadventures.com/departures/24270/"
        }
    }
]
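The enrichment between those two payloads can be sketched like this (the function name and the href construction are illustrative):

```python
from datetime import datetime, timezone

BASE_URL = "https://rest.gadventures.com"  # base URL from the example above

def enrich(ping: dict, event_type: str = "updated") -> dict:
    """Turn the bare ping a source system sends into the event major tom
    delivers: add an event type, a timestamp, and an href."""
    resource = ping["resource"]
    obj_id = ping["data"]["id"]
    return {
        "event_type": f"{resource}.{event_type}",
        "resource": resource,
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "language": ping["language"],
        "data": {
            "id": obj_id,
            "href": f"{BASE_URL}/{resource}/{obj_id}/",
        },
    }
```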

On average, we receive 175,000 object changes from our systems each day. Data changes all. the. time! A data change can be as clear-cut as a customer adding their middle name onto their profile. Or, it can be a collection of many events, e.g. a customer booking a tour, which modifies that customer, a booking, services, departure availability, etc …

We were pretty confident in webhooks; major tom was already delivering millions of events. And with that in mind, an interesting pipeline emerges: a source system changes an object, major tom announces it, and a subscriber refreshes the cached copy in Elastic.
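That pipeline can be sketched as a single subscriber callback, with `fetch` and `index` standing in for the real source-system and Elastic clients (all names here are illustrative):

```python
def on_webhook_event(event: dict, fetch, index) -> None:
    """Cache invalidation as a webhook subscriber: when major tom
    announces a change, re-fetch the object from its source system
    and overwrite the copy in the Elastic store."""
    resource = event["resource"]
    obj_id = event["data"]["id"]
    fresh = fetch(resource, obj_id)   # hit the source system once
    index(resource, obj_id, fresh)    # upsert the document into Elastic
```

Only this subscriber ever touches a source system; every client request keeps hitting the Elastic store.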

Woah! Did we just implement cache invalidation? On that persistent Elastic store? Dang, that felt too easy — it wasn’t an obvious answer at the time! We just make it look easy now.

Yes, we’ve now gained the confidence to continue following our new core principle, again:

No single request to the API will hit a source system. Rather, it will always hit a persistent cache managed by API services.

Now that we had this real-time invalidation in place, we enabled our API to hit our Elastic Store (via Sieve) for every request to our places resource.

Since that day, we’ve been moving more of our 50+ resources to this pattern. Initial migrations came from demand, and then we moved resources that traditionally struggled to return within reliable time frames.

We’ve run into our share of problems. Although we had confidence in major tom to deliver events, we identified that source systems would not always fire the appropriate — or any — webhooks when an object changed. Ultimately, these [minor] issues have helped us fix underlying problems, and improved the consistency of, and trust in, the data viewed in the G API.

We continue to expand the responsibilities of our Sieve app, offering functionality like GraphQL on top of Elastic Search, but we’ll save those details for another day.

Curious about more detail? Feel free to email me, and we can discuss. Thank you for reading.

This article is part of a series introducing the G API to the world.