Taking LinkedIn from an Oracle shop to a big data leader

SCALE: You’re best known for co-creating Kafka at LinkedIn, but what was your history there before Kafka?

Jay Kreps: I joined LinkedIn with the intention of being more of a user of data infrastructure, working on recommendation algorithms and that kind of thing. That’s what I had done previously; I had more of a machine learning background.

I did actually work on a bunch of things like that at LinkedIn, but my observation was that a lot of our problems were less about creating fancy algorithms and more about basic data and scaling issues. That was what held us back on almost everything we did.

The first project I did there in the infrastructure space was a key-value store called Voldemort, one of the very early NoSQL systems. It was a distributed key-value store and basically a clone of Amazon Dynamo. I had read that paper, and we did an implementation of it and put it into production. It’s actually still used at LinkedIn at very large scale — I think probably getting north of a million requests per second.
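For context, Voldemort exposes a simple get/put interface over the distributed store. A minimal sketch of a client, based on the project’s published Java API; the bootstrap URL, store name, and key here are placeholder values:

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // Bootstrap URL and store name are placeholders for illustration.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        client.put("member:42", "profile-data");           // write a value
        Versioned<String> value = client.get("member:42");  // read it back, with version metadata
        System.out.println(value.getValue());
    }
}
```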

We open sourced it, and that was really fun, and after that I thought, “Hey, this actually is a much bigger impact to work on these data systems that, in some sense, make every problem better.” So I was kind of hooked on it.

“Over time the systems that we had that did that kind of data stream transport were just melting under scale and growth, so it became clear we had to do something.”

I really liked the open source aspect of it, as well. That was the first time I had had any involvement in open source, and when we did Voldemort there were a bunch of different people contributing code. I got to work with these really smart people in different parts of the world, and that was really fun.

The next thing I did was the Hadoop deployment at LinkedIn. The goal was this data lake idea where you get a copy of everything happening in the organization into Hadoop. We thought that would be a really easy thing to do, so we budgeted 3 weeks to get data in, and then a couple months to rebuild the People You May Know feature.

And we said, “Well, the hard thing will be this cool recommendation algorithm, but the first thing we’ll have to do is just get the data.” It turned out to be a bit of the reverse. We did that, and we did a couple of other projects, and we were just struggling to build a pipeline of data for Hadoop.

Kreps (left) and some former LinkedIn colleagues in 2013. Credit: Derrick Harris

Because of that, I got interested in how data was flowing in the rest of the organization, and my observation was that we had this problem everywhere. We wanted to make use of the same data in real time for security and fraud detection, and for more-real-time recommendations so we don’t show you the same stuff every time. And the same stuff we wanted to get into Hadoop, we also wanted to get into the data warehouse and into search indexes.

We had all these different systems: some we wanted to respond and act in real time, while others were offline data dumps like Hadoop. That was how we came up with the idea of Kafka. We pitched it internally and people were like, “Well, that sounds like a lot of work.” But over time the systems that we had that did that kind of data stream transport were just melting under scale and growth, so it became clear we had to do something.
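The core idea is that one published stream can feed any number of independent readers. A minimal sketch using Kafka’s Java consumer API: each consumer group gets its own complete copy of the stream, so a fraud detector, a Hadoop loader, and a search indexer can all run the same loop under different group IDs. The broker address, topic, and group name below are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FraudDetectionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "fraud-detection");          // each group sees the full stream
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("member-activity")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // A Hadoop loader or search indexer would run this same loop
                    // under a different group.id and receive the same events.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```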

“You always have the advantage of youthful arrogance. With Kafka, we said it would take about 3 months, and then we worked on it for the next 5 years.”

When did you join LinkedIn? I recall you telling me previously that you joined when People You May Know was still an Oracle application.

That’s right. I joined in 2007, and we got to do a bunch of really fun work to scale different parts of the whole site, which was really challenging. It was fun, too. A lot of what came out of really getting all the data together was the ability to do much more sophisticated stuff with it, to use more predictive ingredients on it.

Of course, the rest of the company took that much further after I was no longer much involved in that area. They added a whole team of data scientists doing this stuff who ended up being much better than I ever was.

How difficult was it in those early days to build new tech from scratch and deploy systems like Hadoop at scale?

You always have the advantage of youthful arrogance. With Kafka, we said it would take about 3 months, and then we worked on it for the next 5 years. We did at least ship it after 4 or 5 months, so we got something done in that time period.

Kafka’s a fun one because it tackles one of the core problems in distributed systems: producing an ordered log or stream of data that’s fault-tolerant and replicated over machines. This is something that exists in the world; it’s a common algorithm, something you would learn about if you took a distributed systems class. But there was this opportunity to do it at a really massive scale. Now, I think, LinkedIn has about 800 billion requests going through these logs.
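A minimal sketch of writing to that replicated log with Kafka’s Java producer API: `acks=all` asks the broker not to acknowledge a record until all in-sync replicas have it, and records sharing a key land in the same partition, so they are appended to the log in send order. The broker address, topic, and key are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("acks", "all"); // wait until all in-sync replicas have the record
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key, same partition: this member's events keep their order.
            producer.send(new ProducerRecord<>("member-activity", "member-42", "viewed-profile"));
            producer.send(new ProducerRecord<>("member-activity", "member-42", "sent-invite"));
        }
    }
}
```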

Some things were definitely kind of risky projects, but they paid off in the end.