Why Slack can’t slow down

Slack reached a $1 billion valuation faster than any startup in history. Now it must make key decisions without disrupting the lives of millions of Slack-addicted users.

Why has Slack won when other group messaging plays lost? By keeping it simple. Thanks to Slack’s sensible chat room design, everything seems one click away. Private channels, easy integration, and effortless file sharing haven’t hurt, either.

The ongoing triumph, however, is behind the scenes. Throughout Slack’s three-year growth from zero to 5 million daily active users, Slack has managed to keep its service reliable and responsive—and has been commendably transparent about the outages and incidents that do occur. Slack has also attracted more than 100,000 external developers who have built 900 third-party apps at last count.

The stakes keep getting higher, though, because Slack has come out of nowhere to become an essential application for more and more organizations. “Large companies—IBM, Capital One—they run their business on Slack,” Julia Grace, Slack’s head of infrastructure engineering, told me in an interview last week. “We can’t go down. We need to be incredibly fast all of the time.”

Since Grace joined Slack in October 2015, the number of Slack users has doubled. Keeping up with that growth would have been much harder had Slack’s founders not decided to build Slack on the public cloud from the start. Interestingly, that decision was made in 2009, when Slack was a game company called Tiny Speck, with a browser-based, massively multiplayer online game called Glitch.

Building on a game foundation

“When they started working on Glitch there weren’t a lot of other [cloud] competitors in the space, especially when you build a business where you need high reliability, high uptime,” Grace says. AWS was the only reasonable choice.

As it turns out, not only cloud scalability but also Glitch’s game architecture has been critical to Slack’s success. In an InfoQ presentation last December, Slack chief architect Keith Adams noted that the original game design persists today:

The actual architecture of Slack resembles the architecture of a massively multiplayer online game. So you kind of have your world that you operate in, which is your team, and in order to kind of make that world seem both persistent and interactively mutable with other things in the world, you end up making a pretty thick cache of what’s going on in that world. And then you’ve got a way of getting the latency updates for the changes in that world. So that mental paradigm of “oh, it’s kind of like an online game” actually explains a lot about Slack.

Adams describes Slack’s house style as “conservative.” In simplified terms, he says that Slack is “a very competently executed LAMP stack that has had to scale out. It’s memcache wrapped around MySQL.” The choice of database was primarily due to “the collective history of the universe of the thousands of server years of operating MySQL without it losing data.”
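To make "memcache wrapped around MySQL" concrete, here is a minimal sketch of the general cache-aside pattern that phrase suggests. It is an illustration only, not Slack's code: the class name `ReadThroughStore`, the key format, and the in-memory dictionaries standing in for memcached and MySQL are all assumptions made for this example.

```python
from typing import Dict, Optional


class ReadThroughStore:
    """Hypothetical cache-aside wrapper: check the cache first, fall back to the database."""

    def __init__(self, database: Dict[str, str]) -> None:
        self.database = database            # stand-in for MySQL, the system of record
        self.cache: Dict[str, str] = {}     # stand-in for memcached
        self.database_reads = 0             # counts how often the slow path is taken

    def get(self, key: str) -> Optional[str]:
        # Fast path: serve from the cache when the key is already there.
        if key in self.cache:
            return self.cache[key]
        # Slow path: read from the database and remember the result.
        self.database_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key: str, value: str) -> None:
        # Write to the system of record, then invalidate the cached copy
        # so the next read picks up the new value.
        self.database[key] = value
        self.cache.pop(key, None)


store = ReadThroughStore({"channel:general": "Welcome"})
first = store.get("channel:general")    # miss: goes to the database
second = store.get("channel:general")   # hit: served from the cache
store.put("channel:general", "Welcome back")
third = store.get("channel:general")    # miss again after invalidation
```

In this sketch, repeated reads of the same key cost one database read, and a write invalidates the cached entry instead of updating it in place. That is one common choice among several, and nothing in Adams's remarks says which one Slack uses.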

Slack has also eschewed a fancy microservices architecture, at least so far. The core application is a PHP monolith that runs on Facebook’s HipHop Virtual Machine (HHVM) and its just-in-time compiler (Adams was formerly a Facebook engineer).

At a high level, Slack is fundamentally a big web application wedded to a messaging bus, the latter written from scratch in Java. Adams offers a classic build-versus-buy explanation for why the bus is home grown: “The effort … that goes into getting an off-the-shelf piece of software to do exactly what you want it to do sometimes is better spent telling the computer what you want it to do in the programming language of your choice.”

Proceeding with caution

On the infrastructure side, Grace shares Adams’ affinity for keeping it simple. “We have a preference for foundational services, such as S3, EC2 and CloudFront [AWS’s content distribution network],” said Grace at her AWS Summit keynote last week.

But Slack is also evolving, albeit carefully. Grace told me that “we are slowly taking parts out and creating services, creating more isolated things that we can run, build, deploy, scale up, etc. So what my team runs is a lot of the services that our monolithic architecture connects to and we’re slowly breaking other things out.”

From the start, Slack has used Apache Solr for search and EMR for crunching log data to yield infrastructure insights. At the AWS Summit, Grace also spoke about Slack’s adoption of Amazon Lex, a new set of APIs and an SDK for tapping AWS’s advanced AI and machine learning capabilities. “With Lex, building conversational bots is going to be so much simpler,” she says, alluding to one of Slack’s differentiating features.

Moreover, Slack’s community of third-party developers has been instrumental in showing the company where it needs to go. “We’re seeing usage patterns in what external developers are doing and the API calls that they’re making—and understanding over time how people have built on our platform and how that has changed. That’s really critical input for how we think about what we need to build in the future.”

Grace sees another critical decision in her immediate future: Should she diversify cloud providers for failover purposes? The burning question:

Can we get the reliability and the speed we need out of Amazon or…do we need to also look to diversify because we need to ensure that, should anything happen—the S3 outage, for example—we keep our customers running? How do we isolate them from having to worry about events in US West?

Duplicating Slack on, say, Google Cloud would be a gargantuan undertaking, particularly if Slack wanted to mirror traffic. “You have to gain experience operating things at scale to understand and have confidence in a failover scenario that you are actually able to fail over gracefully,” says Grace. “So if something did happen, we would feel incredibly comfortable switching over or rerouting traffic or something along those lines.”

Every startup, or enterprise initiative for that matter, must deal with decisions made before the first release. With a real-time messaging system like Slack that’s dominated an important market in record time, those decisions loom ever larger, because changes to those fundamentals amount to the proverbial engine replacement as a car roars down the highway.
