Introduction

Webhooks are user-defined HTTP callbacks. At WePay, we use webhooks, which we call Instant Payment Notifications (IPNs), to update our partners on the status of transactions happening in our system. IPNs allow our partners to receive notifications whenever something important happens to objects such as checkouts, credit cards, merchant accounts, and users. For example, if the state of the checkout with id 12345 changes from “authorized” to “captured”, we notify our partner with an IPN containing object_type = “checkout” and object_id = 12345. Partners can then make an API call to look up the exact changes. Our IPN delivery system also handles the following use cases:
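As a minimal sketch of the partner side of this flow: the IPN names only the object, so the partner fetches the current state with a follow-up API call. The payload fields `object_type` and `object_id` come from the text above; the lookup function and its return values are hypothetical stand-ins for a real WePay API call.

```python
def handle_ipn(payload, api_lookup):
    """Partner-side IPN handler: the IPN carries no state details,
    so the current state must be fetched with an API lookup."""
    return api_lookup(payload["object_type"], payload["object_id"])

# Hypothetical stand-in for the real API; in practice this would be
# an HTTP call to look up the object.
STATES = {("checkout", 12345): "captured"}

def fake_lookup(object_type, object_id):
    return {"object_id": object_id, "state": STATES[(object_type, object_id)]}

result = handle_ipn({"object_type": "checkout", "object_id": 12345}, fake_lookup)
assert result["state"] == "captured"
```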

IPN Batching - We batch IPNs that are generated within a very short span of time for a given object. For example, for the checkout with id 12345, if the state transitions from “authorized” to “captured” and from “captured” to “reserved” happen in quick succession, instead of sending two IPNs for checkout 12345 we send only one, so that a single lookup is enough to get the latest state.

Retries for failed IPN deliveries - When a partner is unable to respond to our HTTP POST request, we perform telescopic retries so that the partner can receive the IPN whenever their system comes up again.

Problems with existing infrastructure

Previously, IPN deliveries were handled by our monolith using Gearman. This system had several issues:

We were using Gearman in our monolith for many asynchronous tasks, including sending emails, processing payments, sending IPNs, and creating reports. As load increased, we began to encounter operational issues: Gearman failed often, and we experienced worker connection issues with it.

A slow partner could cause a backup in our IPN system due to the limited number of Gearman workers.

Since IPN delivery was handled by the monolith, there was no easy way for other services outside of the monolith to send IPNs.

IPNs with Google Cloud Pub/Sub

Google Cloud Pub/Sub is a publisher/subscriber messaging system that provides many-to-many asynchronous messaging and decouples senders and receivers. We evaluated Pub/Sub and decided to create a microservice for IPN delivery with it. Some things to be aware of when using Pub/Sub include:

The ordering of messages is not guaranteed, so messages received by a subscriber can be out of order. For IPN delivery this was acceptable, because our IPNs don’t contain any information about what changed for the object. For example, even if two IPNs for checkout 12345 (one generated for the “authorized” state and another for the “captured” state) were delivered out of order, each IPN contains only “checkout=12345”, and the actual state is gathered by doing a lookup on that checkout. We can therefore provide correct information without strict ordering of message deliveries.

Cloud Pub/Sub provides an “at least once” delivery guarantee, which means messages can be delivered multiple times. In our case, we were using Redis as temporary storage of IPN data to manage IPN batching and retries for failed IPN deliveries. Since we were batching IPNs at the object level, these duplicate messages would automatically get deduplicated.

Implementation

The implementation of our IPN delivery microservice can be described using the following diagram:

The IPN Delivery Service consists of the following components:

Message Poller - This component is responsible for pulling data from Cloud Pub/Sub and handling duplicates.

Message Processor - This component is responsible for processing pending IPNs at a fixed interval. This processing interval also defines our batching interval: when there are multiple IPNs for the same object, we send only one. This is how IPN batching is implemented in the service.
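The batching and dedup behavior of the processor can be sketched as follows. Pending IPNs are keyed by object (the text mentions Redis for this; a plain dict stands in for it here), so however many state changes land inside one batching interval, and however many duplicate Pub/Sub deliveries occur, the flush sends a single IPN per object. The `send` callable is a hypothetical stand-in for the HTTP POST to the partner.

```python
pending = {}  # (object_type, object_id) -> IPN payload; Redis in the real service

def enqueue(object_type, object_id):
    # Duplicate or repeated IPNs for the same object overwrite the same
    # key, so at most one pending IPN exists per object.
    pending[(object_type, object_id)] = {"object_type": object_type,
                                         "object_id": object_id}

def flush(send):
    """Called once per batching interval (e.g. by a timer thread):
    drain the pending set, sending one IPN per object."""
    count = 0
    for key in list(pending):
        send(pending.pop(key))
        count += 1
    return count

# Two quick state transitions for the same checkout within one interval:
enqueue("checkout", 12345)   # "authorized" -> "captured"
enqueue("checkout", 12345)   # "captured" -> "reserved"

sent = []
assert flush(sent.append) == 1   # one IPN sent, so one partner lookup
assert sent == [{"object_type": "checkout", "object_id": 12345}]
```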

Retry Handler - This component is responsible for retrying any failed IPN later, as per our IPN retry schedule.
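A telescopic retry schedule can be sketched like this: the gap between attempts grows each time, so a partner whose endpoint is down for hours can still pick the IPN up once it recovers. The intervals below are illustrative, not WePay's actual schedule.

```python
from datetime import datetime, timedelta

# Illustrative telescopic schedule: each retry waits longer than the last.
RETRY_INTERVALS = [timedelta(minutes=1), timedelta(minutes=5),
                   timedelta(minutes=30), timedelta(hours=3),
                   timedelta(hours=24)]

def next_retry_at(attempt, now):
    """Return when retry number `attempt` (0-based) should run,
    or None once the schedule is exhausted."""
    if attempt >= len(RETRY_INTERVALS):
        return None
    return now + RETRY_INTERVALS[attempt]

now = datetime(2018, 1, 1)
assert next_retry_at(0, now) == now + timedelta(minutes=1)
assert next_retry_at(5, now) is None  # give up after the last interval
```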

Every Cloud Pub/Sub message received by the subscriber needs to be acknowledged. If a message remains unacknowledged for longer than the acknowledgment deadline, it becomes available for pulling again. Cloud Pub/Sub retains unacknowledged (or not yet pulled) messages for 7 days. This property enables us to recover automatically in case of application failure. Typical scenarios include:

Microservice goes down - In this case Cloud Pub/Sub still retains the data for 7 days. As soon as the Python service comes back, it starts getting IPN messages from Cloud Pub/Sub and starts delivering them.

Redis goes down - We acknowledge a Pub/Sub message only after we have attempted an IPN delivery. If Redis goes down, the IPN delivery doesn’t happen, which means we don’t acknowledge the Pub/Sub message either. Cloud Pub/Sub redelivers that message later. As a result, as soon as Redis comes back, we start processing IPNs again.
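The "ack only after a delivery attempt" recovery pattern behind both scenarios can be sketched as below. A list stands in for the Pub/Sub subscription: a message stays eligible for redelivery until it is acknowledged, so a Redis or delivery failure simply leaves it unacked.

```python
# In-memory stand-in for a Pub/Sub subscription with one IPN message.
queue = [{"object_type": "checkout", "object_id": 12345, "acked": False}]

def poll_once(deliver):
    """Pull an unacked message, attempt delivery, and ack only on success."""
    for msg in queue:
        if msg["acked"]:
            continue
        try:
            deliver(msg)
        except Exception:
            return False       # no ack, so the message will be redelivered
        msg["acked"] = True    # safe to ack: delivery was attempted and succeeded
        return True
    return True

def redis_down(_msg):
    raise ConnectionError("redis unavailable")

poll_once(redis_down)
assert not queue[0]["acked"]       # message survives the outage unacked

poll_once(lambda msg: None)        # Redis is back; delivery succeeds
assert queue[0]["acked"]
```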

Lessons

We have been running our IPN service with Cloud Pub/Sub for the last few months, and these are some of the points that summarize our learnings: