Architecture

When we discussed implementing chat, we usually weighed three options:

1. Use a 3rd-party chat service (e.g., Layer)
2. Use a 3rd-party websocket implementation (e.g., Pusher)
3. Roll our own websocket-based protocol backed by our servers running on AWS or GCE (we were almost entirely hosted on App Engine at the time)

Chat needs to be fast, and polling isn’t good enough (it kills your servers with requests and your users with slowness), so we needed persistent connections (i.e., websockets).

We quickly discarded #1 as an option (Layer), because we wanted to control our own user data and we wanted full control over the stack and the experience (admittedly, we didn’t dive much into Layer’s entire offering, but we felt that rolling our own would be the fastest path).

While some felt #3 was the best option, I believed #2 might actually work out best. I threw together a web-based, internal-only prototype over a weekend based on Pusher.

The architecture was simple. Here’s a (crude) image of a simple request flow.

Wow, I have bad handwriting.

When a user entered a chat room, a private presence channel was created and a connection established with Pusher. This let the user receive notifications from Pusher for the duration of their time in the chat room. The channel was destroyed when the user left the room or backgrounded the app. Luckily, third-party Pusher protocol libraries were available for iOS and Android (albeit we had to modify them), which sped up development.

In this image, there are two users (A and B) present in the same chat room.

1. User A presses send on a message, and a POST request is made to the Secret frontend with the chat ID
2. The server retrieves the chat session data, adds the message, and writes it back
3. The server makes a POST request to Pusher with a payload intended for the client
4. Pusher routes the message to User B via the websocket connection
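The steps above can be sketched in Go. This is a minimal, in-memory sketch, not Secret’s actual code: the `Notifier` and `ChatStore` interfaces and every name here are illustrative stand-ins for Pusher’s REST API and the App Engine datastore.

```go
package main

import "fmt"

// Notifier stands in for the Pusher REST call (step 3); in production this
// would be an authenticated POST to Pusher's events endpoint.
type Notifier interface {
	Trigger(channel, event string, payload interface{}) error
}

// ChatStore stands in for the datastore read-modify-write (step 2).
type ChatStore interface {
	AppendMessage(chatID, from, text string) error
}

// sendMessage sketches the server side of the flow: persist the message
// first, then fan out a real-time notification on the chat's channel.
func sendMessage(store ChatStore, n Notifier, chatID, from, text string) error {
	if err := store.AppendMessage(chatID, from, text); err != nil {
		return err // canonical write failed; don't notify
	}
	return n.Trigger("presence-"+chatID, "new-message", map[string]string{
		"from": from, "text": text,
	})
}

// In-memory implementations, just to make the sketch runnable.
type memStore struct{ msgs map[string][]string }

func (s *memStore) AppendMessage(chatID, from, text string) error {
	s.msgs[chatID] = append(s.msgs[chatID], from+": "+text)
	return nil
}

type memNotifier struct{ events []string }

func (n *memNotifier) Trigger(channel, event string, payload interface{}) error {
	n.events = append(n.events, channel+"/"+event)
	return nil
}

func main() {
	store := &memStore{msgs: map[string][]string{}}
	notif := &memNotifier{}
	_ = sendMessage(store, notif, "chat1", "userA", "hi")
	fmt.Println(notif.events) // notification fired only after the write
}
```

Persisting before notifying is what makes the “one direction” flow safe: the canonical store never lags behind what clients were told in real time.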

Typing notifications and delivery receipts were very similar.

This model, although not perfect from a latency perspective, was good enough and kept things simple because requests only flowed in one direction: Client -> Server -> Pusher -> Client

It’s also important to note that Pusher was only there for real-time notifications, not for canonical data. If at any time the user came back to the chat room, the client would refresh the most recent state from the server and accept new notifications from there. This is analogous to rendering a single-page application and then delivering changes to the client via AJAX or a websocket.

Code and Model

The stored data model in the backend is simple. Secret’s canonical datastore was Google App Engine’s High-Replication Datastore (more on that in other blog posts). Essentially, it’s a schema-less, NoSQL datastore built on top of BigTable and Megastore. Entities are document-structured and stored in rows by a given key. There are no JOINs, and query semantics are very limited, but it allows for the very high read throughput and wide scaling you’d expect from a NoSQL offering.

When the user enters a chat room, a deterministic (idempotent) ID is created on the server that is effectively “<user1_id>:<user2_id>:<secret_id>”, also known as the chat session ID. Important note: the user IDs in this key were always sorted, guaranteeing the same key regardless of which user initiated the chat. That way, given a pair of users and a secret, we can always regenerate the single ID for that chat.
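The ID scheme can be sketched as a small Go helper. The function name is mine, but the key format matches the one described above:

```go
package main

import (
	"fmt"
	"sort"
)

// chatSessionID builds the deterministic chat session key
// "<user1_id>:<user2_id>:<secret_id>". The two participant IDs are sorted
// before joining, so either ordering of (userA, userB) yields the same ID
// for a given secret.
func chatSessionID(userA, userB, secretID string) string {
	ids := []string{userA, userB}
	sort.Strings(ids)
	return fmt.Sprintf("%s:%s:%s", ids[0], ids[1], secretID)
}

func main() {
	// Both calls produce the identical key.
	fmt.Println(chatSessionID("u42", "u7", "s1"))
	fmt.Println(chatSessionID("u7", "u42", "s1"))
}
```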

Chat sessions are keyed by simple ids and contain the data above in a single row. Each time a chat is mutated, the server performs a transactional read-modify-write on the row. The transaction is fine so long as write throughput is kept to <= 1 write/sec per entity.

For example, when a user left a chat, we wanted to alert the recipient that they had done so. The server-side code looked like this:
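A minimal Go sketch of that leave-and-alert flow might look like the following. The types and names are hypothetical stand-ins (not the original snippet), with `pusherClient.Trigger` in place of the real authenticated POST to Pusher:

```go
package main

import "fmt"

// pusherClient stands in for the real Pusher REST client; Trigger would
// POST the event to Pusher's events endpoint in production.
type pusherClient struct{ sent []string }

func (p *pusherClient) Trigger(channel, event string, data map[string]string) {
	p.sent = append(p.sent, channel+"/"+event)
}

// chatSession mirrors the single-row session described above.
type chatSession struct {
	ID           string
	Participants []string
}

// leaveChat removes the user from the session and alerts the remaining
// participant over the chat's presence channel.
func leaveChat(p *pusherClient, s *chatSession, userID string) {
	var remaining []string
	for _, id := range s.Participants {
		if id != userID {
			remaining = append(remaining, id)
		}
	}
	s.Participants = remaining
	// In production the mutated row would be written back inside the
	// datastore transaction before notifying.
	p.Trigger("presence-"+s.ID, "user-left", map[string]string{"user": userID})
}

func main() {
	p := &pusherClient{}
	s := &chatSession{ID: "u1:u2:s1", Participants: []string{"u1", "u2"}}
	leaveChat(p, s, "u1")
	fmt.Println(s.Participants, p.sent)
}
```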

Pretty simple.

For fast queries, such as showing the user all of their ongoing or previous chats, we created indexes on the participant and created-time properties. This allowed fast answers to questions like “fetch the chats user X is a participant in, sorted by creation time in descending order,” with the results then sorted locally by last update time. Because chats only lasted 24 hours after the last message exchanged, we knew the number of chats would be reasonably small to fetch and sort locally on the server (e.g., < 100 in almost every case). If a user exceeded that, chances are they were a bad actor and we could drop some on the floor. That code was simple:
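As a sketch, assuming a chat entity with `Participants`, `Created`, and `Updated` properties (names mine, not the original code), the filter-then-local-sort step might look like:

```go
package main

import (
	"fmt"
	"sort"
)

// chat mirrors the indexed properties described above. On App Engine the
// indexed fetch itself would be a datastore query filtered on the
// participant property and ordered by created time, descending.
type chat struct {
	ID           string
	Participants []string
	Created      int64
	Updated      int64
}

// chatsFor filters to chats the user participates in and then sorts
// locally by last update time, newest first. This is cheap because the
// result set is small (< ~100 chats in almost every case).
func chatsFor(all []chat, userID string) []chat {
	var out []chat
	for _, c := range all {
		for _, p := range c.Participants {
			if p == userID {
				out = append(out, c)
				break
			}
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Updated > out[j].Updated })
	return out
}

func main() {
	all := []chat{
		{ID: "a", Participants: []string{"u1", "u2"}, Updated: 5},
		{ID: "b", Participants: []string{"u1", "u3"}, Updated: 9},
		{ID: "c", Participants: []string{"u2", "u3"}, Updated: 7},
	}
	for _, c := range chatsFor(all, "u1") {
		fmt.Println(c.ID) // b, then a
	}
}
```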

Here’s a small library I just open-sourced for talking directly to Pusher via Go (in the App Engine environment). You can easily tweak this; the only tricky part of the entire process was authenticating requests for private and presence channels. https://github.com/guitardave24/pusher-go-appengine/blob/master/pusher.go
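For reference, Pusher’s private-channel authentication boils down to an HMAC-SHA256 signature over the socket ID and channel name, keyed by the app secret. A minimal sketch (the helper name is mine; presence channels additionally sign the serialized channel data):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// privateChannelAuth computes the auth token a Pusher client needs to
// subscribe to a private channel: an HMAC-SHA256 of
// "<socket_id>:<channel_name>" keyed by the app secret, returned as
// "<app_key>:<hex signature>".
func privateChannelAuth(appKey, appSecret, socketID, channel string) string {
	mac := hmac.New(sha256.New, []byte(appSecret))
	mac.Write([]byte(socketID + ":" + channel))
	return appKey + ":" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	fmt.Println(privateChannelAuth("key", "secret", "1234.5678", "private-chat-1"))
}
```

The server only hands this token to clients it has already authenticated, which is what keeps private and presence channels private.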

Launch

When we launched the redesign, chat was an instant hit. It grew week-over-week to well over 1,000,000 concurrent connections (chat + notifications). Luckily, Pusher worked well and was fairly inexpensive, and our costs scaled linearly to the point where, assuming Pusher could meet our demand, we would have no issues for the foreseeable future.

The key takeaways and reinforcements for me in this experience were:

Start small and be ready to build a throw-away prototype to help force a decision

It’s often unwise to roll your own implementation, no matter how fun it might be (obviously, but we all keep doing it!)

If I were to implement real-time chat in an app again, I’d strongly consider using a simple model like this. The downside of the above architecture is that it’s not as fast as it could be (notably because messages are routed through the Secret frontend to Pusher), but the end result was fast, reliable, and simple enough.

Hopefully you found this interesting and/or useful. If you have any questions, please don’t hesitate to email me at d@secret.ly or follow me on Twitter @davidbyttow.

David Byttow