We launched Triplebyte one month ago, with the goal of improving the way programmers are hired. Too many companies run interviews the way they always have, with resumes, white boards and gut calls. We described our initial ideas about how to do better than this in our manifesto. Well, a little over a month has now passed. In the last 30 days, we've done 300 interviews. We've started to put our ideas into practice, to see what works and what doesn't, and to iterate on our process. In this post, I'm going to talk about what we've learned from the first 300 interviews.

I go into a lot of detail in this post. The key findings are:

Performance on our online programming quiz is a strong predictor of programming interview success Fizz buzz style coding problems are less predictive of ability to do well in a programming interview Interviews where candidates talk about a past programing project are also not very predictive

Process

Our process has four steps:

Online technical screen. 15-minute phone call discussing a technical project. 45-minute screen share interview where the candidate writes code. 2-hour screen share where they do a larger coding project.

Candidates work on their own computers, using their own dev environments and strongest languages. In both of the longer interviews, they pick the problem or project to work on from a short list. We're looking to find strengths, so the idea is that most candidates should be able to pick something they're comfortable with. We keep the list of options short, however, to help standardize evaluation. We want to have a lot of data on each problem.

We're looking for programming process and understanding, not leaps of insight. We do this by offering help with design/algorithm of each problem (and not penalizing candidates for this). We evaluate interviews with a score card. For now we go a little overboard, tracking the time to reach a number of milestones in each problem. We also score on understanding, whether they speak specifically or generally, do they seem nervous, and a bunch of other things (basically everything we can think of). Most of these, no doubt, are horrible measures of performance. We record them now so that we can figure out which are good measures later.

Screening

The first experiment we ran was screening people without looking at resumes. Most job applicants are rejected at the screening stage. The sad truth is that a high percentage of the people applying for any job post on the Internet are bad. To protect the time of their interviewers, companies need a way to filter people early, at the mouth of the hiring funnel. Resumes are the traditional way to do this. However, as Aline Lerner has shown, resumes don't work. Good programmers can't be reliably distinguished from bad ones by looking at their resumes. This is a problem. What the industry needs is a way to screen candidates by looking at their actual ability, not where they went to school or worked in the past[1]. To this end, we tested two screening steps:

A fizzbuzz-like programming assignment. Applicants completed two simple problems. We tracked the time to complete each, and manually graded each on correctness and code quality. An automated quiz. The questions on the quiz were multiple choice, but involved understanding actual code (e.g., look at a function, and select which of several bugs is present).

We then correlated the results of these two steps with success in our subsequent 45 minute technical interview. The following graph shows the correlations after 300 interviews.

We can see that the quiz is a strong predictor of success in our interviews! Almost a quarter of interview performance (23%) can be explained by the score on the quiz. 15% can be explained by quiz completion time (faster is better). Speed and score are themselves only loosely correlated (being accurate means you're only slightly more likely to be fast). This means that they can be combined, into what we're calling the composite score, which has the strongest correlation of all and explains 29% of interview performance![2].

The fizzbuzz-style coding problems, however, did not perform as well. While the confidence intervals are large, the current data shows less correlation with interview results. I was surprised by this. Intuitively, asking people to actually program feels like the better test of ability, especially because our interviews (the measures we're using to evaluate screening effectiveness) are heavily focused on coding. However, the data shows otherwise. The coding problems were also harder for people to finish. We saw twice the drop off rate on the coding problems as we saw on the quiz.

Talking versus coding

Before launching, we spoke to a number of smart people with experience in technical hiring to collect ideas for the interviewing. The one I liked the most was having candidates talk us through a technical project, including looking at source code. This seemed like it’d be the least adversarial, most candidate friendly approach.

As soon as we started doing them however, I saw a problem. Almost everyone was passing. Our filter was not filtering. We tried extending the duration of the interviews to probe deeper and looking at code over Google hangouts. Still, the pass rate remained too high.

The problem was we weren’t getting enough signal from talking about projects to confidently fail people. So we started following up with interviews where we asked people to write code. Suddenly, a significant percentage of the people who had spoken well about impressive-sounding projects failed, in some cases spectacularly, when given relatively simple programming tasks.Conversely, people who spoke about very trivial sounding projects (or communicated so poorly we had little idea what they had worked on) were among the best at actual programming.

In total we did 90 experience interviews, scoring across several factors (did the person seem smart, did they understand their project well, were they confident, and was the project impressive). Then we correlated our factors with performance in the 45 minute programming interview. Confidence had essentially zero correlation. Impressiveness, smartness and understanding each had about a 20% correlation. In other words, experience interviews underperformed our automated quiz in predicting success at coding.

Now, talking about past experience in more depth may be meaningful. This is how (I think) I know which of my friends are great programmers_._ But, we found, 45 minutes is not enough time to make talking about coding a reasonable analog for actually coding.

Interview duration, and interviewer sentiment

A final test we ran was to look at when during the interview we make decisions. Laszlo Bock, VP of People at Google, has written much about how interviewers often make decisions in the first few minutes of an interview, and spend the rest of the time backing up this decision. I wanted to make sure this was not true for us. To test this, we added a pop-up to our interviewing software, asking us every five minutes during each interview if the candidate is performing well, or poorly. Looking at these sentiments in aggregate, we can tell exactly when during each interview we made the decision.

We found that in 50% of our 45-min interviews, we decide (become positive for someone who ends up passing, or negative for someone who does not pass) in the first 20 minutes. In 20%, however, we do not settle on our final sentiment until the last 5 minutes. In the 2-hour interview, the results are similar. We decide 60% in the first 20 minutes (both positively and negatively), but 10% make it almost to the 2-hour mark. (In that case, unfortunately, it's positives turning to negatives, because we can't afford to send people we're unsure about to companies)[3].

Conclusion

It's been a crazy month. Guillaume, Harj and I have spent nearly all our time in interviews. Sometimes, at 10 PM on a Saturday, after a day of interviewing, I wonder why we started this company. But as I write this blog post, I remember. Hiring decisions are important, and too many companies are content to do what they've always done. In our first 30 days, we've come up with a replacement for resume screens, and shown that it works well. We've found that programming experience interviews (used at a bunch of companies) don't work particularly well. And we've written software to help us measure when and why we make decisions.

For now, we're evaluating all of our experiments against our final round interview decisions. This does create some danger of circular reasoning (perhaps we're just carefully describing our own biases). But we have to start somewhere, and basing our evaluations on how people write actual code seems like a good place. The really exciting point comes when we can re-run all this analysis, basing it on actual job performance, rather than interview results. Doing that is why we started this company.

Next, we want to experiment with giving candidates projects to do on their own time (I'm particularly interested in making this an option, to help with interview anxiety), and interviews where candidates are asked to work with an existing codebase. We're also adding harder questions to the quiz, to see if we can improve its effectiveness. We'd love to hear what you think about these ideas. Email us at founders@triplebyte.com.

Thanks to Emmett Shear, Greg Brockman and Robby Walker for reading drafts of this.

An earlier version of this post confused the correlation coefficient R with R^2, and overstated the correlations. Since this blog was posted, however, a new version of the quiz has increased the correlation of the composite score to 0.69 (0.47 R^2)

1. This is a complex issue. There are good arguments for allowing experienced programmers to skip screening steps, and not have to continually re-prove themselves. At some point, track record should be enough. However, this type of screening can also be done in very bad ways (e.g., only interviewing people who have worked at top companies or come from a few schools). Evaluating experience is something we plan to experiment with, but for now we're focusing on how to directly identify programming ability. ↩

2. It’s worth noting the error bars (showing 95% confidence intervals). The true value for each of the correlations in the graph falls in the range shown with 95% confidence. The error bars are large because our sample is small. However, even comparing the bottom of our confidence interval to Aline Lerner’s results on resume screening (she found a correlation close to 0), shows our quiz is a far better first step in a hiring funnel than resumes are.↩

3. We're not perfect, and we certainty reject great people. I always like to mention this when talking about rejections. We know this (and think it's true of all interview processes). We're trying to get better. ↩