In this week’s podcast, QCon chair Wesley Reisz talks to Keith Adams, chief architect at Slack. Previously, he was an engineer at Facebook, where he worked on the search typeahead backend, and he is well known for the HipHop VM [hhvm.com]. Adams presented How Slack Works at QCon San Francisco 2016.

Key Takeaways

Group messaging succeeds when it feels like a place for members to gather, rather than just a tool

Having opt-in group membership scales better than having to define a group on the fly, like a mailing list instead of individually adding people to a mail

Choosing availability over consistency is sometimes the right choice for particular use cases

Consistency can be recovered after the fact with custom conflict resolution tools

Latency is important and can be solved by having proxies or edge applications closer to the user

Notes

Challenges at Slack?

Group Messaging

1m:30s Many companies focus on messaging, but persistent group messaging is the key focus of Slack, supporting message search and archival as well as groups

2m:00s Group chats in other messaging clients require you to individually add members, much like sending a group email works today

2m:35s Channels allow opt-in membership of groups, and let members see historic messages sent to that channel

3m:00s A Slack channel feels like a place you belong in

Latency

3m:30s Voice and video interactions are impacted by latency; the same is true of messaging clients

4m:00s The user interface can provide indications of presence, through avatars indicating availability and typing indicators

4m:15s Latency is important; sometimes the difference that matters is between 100ms and 200ms, so the message channel monitors ping timeout between server and client

4m:40s 99th percentile is less than 100ms ping time

5m:15s If the 99th percentile is more than 100ms then it may be server based, such as needing to tune the Java GC
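A minimal sketch of the 99th percentile check described above, assuming ping round-trip times are collected as a plain list of milliseconds. The sample values, the `percentile` helper, and the `P99_BUDGET_MS` name are illustrative, not taken from Slack's monitoring code.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical ping round-trip times in milliseconds.
ping_ms = [20, 25, 31, 40, 42, 55, 60, 75, 90, 250]

# The 100ms figure comes from the notes; the name is an assumption.
P99_BUDGET_MS = 100

p99 = percentile(ping_ms, 99)
# A p99 above budget suggests looking at the server side first,
# for example garbage collection pauses.
needs_investigation = p99 > P99_BUDGET_MS
```

A single slow outlier is enough to push the 99th percentile over budget in a small sample, which is why tail percentiles are a more sensitive signal than the average.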

5m:25s Network conditions of the mobile clients are highly variable

6m:20s Mobile clients can suffer intermittent connectivity

Architecture

7m:15s Slack consists of a sharded LAMP stack: web servers, memcache, and a fleet of MySQL instances

7m:30s Teams are sharded across MySQL instances

8m:20s The real-time part of the client-server communication is handled by the messaging infrastructure

8m:35s Slack is a message amplifier; it takes the message written by an individual and then delivers it to all the clients that are interested in receiving the message, with the lowest latency possible
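The "message amplifier" idea can be sketched as a simple in-memory fan-out: one posted message becomes one delivery per subscribed client. The `ChannelHub` class and its method names are hypothetical, and a real server would write to network connections rather than to lists.

```python
from collections import defaultdict


class ChannelHub:
    """Toy fan-out hub: one incoming message, many outgoing deliveries."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # channel -> client ids
        self.outbox = defaultdict(list)       # client id -> delivered messages

    def join(self, channel, client_id):
        """Opt a client in to a channel."""
        self._subscribers[channel].add(client_id)

    def post(self, channel, sender, text):
        """Deliver one message to every member of the channel.

        Returns the number of deliveries, i.e. the amplification factor.
        """
        recipients = sorted(self._subscribers[channel])
        for client_id in recipients:
            self.outbox[client_id].append((channel, sender, text))
        return len(recipients)


hub = ChannelHub()
for client in ("alice", "bob", "carol"):
    hub.join("general", client)

deliveries = hub.post("general", "alice", "hello")
```

In this sketch the sender also receives their own message, which is a simplification rather than a statement about how Slack echoes messages.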

9m:00s The majority of desktop-based connections are long-lived WebSocket connections

Edge caching

11m:00s Connections from users who are far away from the east coast are terminated at an edge cache called flannel (formerly slackd)

11m:50s The round-trip time is much more tolerable if the edge cache serves content more quickly

12m:15s Local conversations can be optimised with the edge cache

Posting messages

13m:00s Most clients use the WebSocket to post messages via JSON instead of using the API at api.slack.com

14m:00s Write amplification happens in memory in the Java process, which delivers messages to currently connected clients and then sends the message to the backend

15m:00s There is a possibility of failure, in that the Java process may deliver the message to the network clients but then fail to persist it

15m:10s The platform is being redesigned and will hopefully address this in the future
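The failure window mentioned at 15m:00s follows from the ordering of the two steps: deliver first, persist second. The sketch below illustrates that ordering under simplifying assumptions; `post_message`, `PersistError`, and the list-based inboxes are hypothetical stand-ins, not Slack's code.

```python
class PersistError(Exception):
    """Raised by the (hypothetical) backend when a write cannot be stored."""


def post_message(message, connected_inboxes, persist):
    """Deliver to connected clients first, then hand off to the backend.

    Returns True if the message was also persisted, False if clients
    saw it but the backend write failed.
    """
    # Step 1: in-memory fan-out to currently connected clients.
    for inbox in connected_inboxes:
        inbox.append(message)

    # Step 2: persistence. A failure here leaves a delivered but
    # unpersisted message, which is the window the notes describe.
    try:
        persist(message)
    except PersistError:
        return False
    return True


store = []
inboxes = [[], []]


def failing_persist(message):
    raise PersistError("backend unavailable")


persisted_ok = post_message("hi", inboxes, store.append)
persisted_bad = post_message("lost?", inboxes, failing_persist)
```

After the second call both inboxes contain "lost?" while `store` does not, so a client that reconnects and reloads history would no longer see it.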

16m:00s There’s no evidence that this has hit people

Business and community

20m:00s Commercial users of Slack need their teams to be more tightly controlled and defined, or need to selectively enable or disable features for individual users

20m:30s Lots of users have their own logins for each service; there’s interest in improving that while still allowing commercial companies to use single sign-on solutions

MySQL and persistence

21m:30s MySQL has replication and data protection built in; other companies have thousands of man-years of experience operating it without data loss

22m:15s Users care that persistence works and they don’t lose data, not what the storage system is

22m:40s Lots of the data is relational but consistency is not absolute; master-master replication allows for eventual (in)consistency

23m:40s The best fit for the master-master setup is to selectively prefer which master is written to, using the low-order bit of the team identifier; teams with even identifiers prefer to write to one master and teams with odd identifiers prefer to write to the other
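The low-order-bit rule can be written in a couple of lines. In the sketch below, the master indices 0 and 1, the function names, and the fall-back to the other master when the preferred one is unhealthy are assumptions made for illustration; the notes only say that teams "prefer" one master.

```python
def preferred_master(team_id: int) -> int:
    """Even team ids prefer master 0, odd team ids prefer master 1."""
    return team_id & 1  # low-order bit of the team identifier


def choose_master(team_id: int, healthy=(True, True)) -> int:
    """Pick the preferred master, falling back to the other one.

    The fallback is an assumed behaviour that favours availability;
    `healthy` is a hypothetical pair of health flags for the two masters.
    """
    first = preferred_master(team_id)
    if healthy[first]:
        return first
    other = first ^ 1
    if healthy[other]:
        return other
    raise RuntimeError("no master available")
```

Because a given team normally writes to a single master, simultaneous conflicting writes on both masters should only arise when the preference is overridden, for example during a failure.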

24m:30s Availability is being preserved instead of consistency, in terms of the CAP theorem

24m:55s Insert-on-duplicate-key-update semantics allow users to post messages, and if the message has been replicated previously then the subsequent insert will overwrite it

Consistency and conflicts

25m:15s Consistency problems can occur when two rows are inserted in the two masters simultaneously; conflicts need to be resolved in an appropriate way on a query-by-query basis
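The insert-on-duplicate-key-update behaviour noted at 24m:55s makes a repeated write harmless. MySQL spells this `INSERT ... ON DUPLICATE KEY UPDATE`; the sketch below uses SQLite's equivalent `ON CONFLICT ... DO UPDATE` clause as a stand-in so that it runs without a database server. The `messages` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id TEXT PRIMARY KEY, channel TEXT, body TEXT)"
)

# SQLite stand-in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE.
UPSERT = """
INSERT INTO messages (id, channel, body) VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    channel = excluded.channel,
    body = excluded.body
"""


def save_message(message_id, channel, body):
    """Insert the row, or overwrite it if the key already exists."""
    conn.execute(UPSERT, (message_id, channel, body))
    conn.commit()


save_message("m1", "general", "hello")
# The same write arriving a second time, e.g. after replication,
# overwrites the existing row instead of raising a duplicate-key error.
save_message("m1", "general", "hello")

rows = conn.execute("SELECT id, channel, body FROM messages").fetchall()
```

The write is idempotent: applying it once or several times leaves a single row with the same content.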

26m:15s The need for manual conflict resolution indicates an application error, in that the application was not able to resolve the conflict itself

26m:35s Relaxing consistency helps availability for the system

27m:00s Most mutations that happen in Slack are performed at human scale and pace

27m:10s It’s unlikely that a user will update their profile picture twice within a few microseconds and end up in an inconsistent state

27m:25s It’s extremely rare that it happens, and if it does, the user can always set their picture again

28m:10s If there was no conflict resolution then the masters could diverge

28m:15s There is a conflict resolution recipe: masters live for a month, then new read replicas are attached and caught up; once they have caught up, they become the new masters, since they are in sync with each other

MySQL and the future

29m:00s MySQL is used because Slack has operational experience with it; because relational queries are used, other solutions like Cassandra haven’t been explored yet

30m:10s Slack’s architecture is still evolving and it may change in the future

31m:30s As the growth continues and the orders of magnitude increase, there may be rewrites in the future as well

Origins of Slack

32m:20s Slack started as a company called Tiny Speck, which created a massively multiplayer game called Glitch, but the game wasn’t getting the growth that they were looking for

33m:00s The game server had a bot which indexed all messages that had been sent

33m:30s Users were using the built-in IRC server for messages

33m:50s The developers pivoted and came up with the idea of using the IRC server as a standalone product; SLACK, with a backronym of Searchable Linked Archive of Company Knowledge

34m:50s Group messaging succeeds if the users feel like they are part of a shared space

Companies mentioned

Slack

Facebook


More about our podcasts


You can keep up to date with the podcasts via our RSS feed, and they are available via SoundCloud, Overcast and the Google Podcast. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.