“Algorithmic Arrangements”: A Conversation with Tom Quisel, Former CTO of OkCupid

Tom Quisel worked at OkCupid for seven years as a software engineer, data scientist, and for two years, as the company’s Chief Technology Officer (CTO). We talked to him about the algorithms used to match people together, how to design for an inclusive user experience, and the ethics of performing experiments on your users.

OkCupid is famously analytical in its product development, and its data blog has published many stories over the years. The blog has dug deep into the data underlying courtship on the site, ranging from the messaging patterns between users of different racial groups to the algorithmically optimal qualities of the first message you send.

OkCupid is a free dating site that takes an algorithmic approach to romance. When you sign up, you answer an array of questions about your politics, preferences, and personality. You also indicate how important the answers are to you, and what your deal-breakers are. A matching algorithm uses this information to present other users to you as potential matches, along with a predicted “match” and “enemy” percentage. If you pay extra money to get a premium account, you can filter for qualities such as attractiveness, body type, and specific personality traits.

With online dating and matchmaking, what sorts of problems did you deal with that were technically very difficult but might not seem that way on the surface? Or, on the flip side, what sorts of problems seemed hard but turned out to be very easy?

That’s a really interesting question. One of the things I’ll say is that at OkCupid, as with many startups, using really advanced algorithms ends up being a second-order optimization. Often, the more effective thing is just to work on getting the user experience right. It’s much easier to make user experience improvements that have a larger effect on the dynamics of the site.

OkCupid has always been very algorithmically focused. It’s pretty unique among the dating sites in allowing people to participate in defining the matching algorithm—each person picks out exactly which questions are important to them, how important they are, and what their ideal answers would be to each of those questions. That’s unlike any other site, where there’s less of an algorithmic focus, and there is some psychologist that comes up with an opinionated rating system, or there’s no rating system and there’s no attempt to match personality at all.
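To make the idea of user-defined weighting concrete, here is a minimal Python sketch of how such a score might be computed. The interview does not describe the actual formula, so everything here is an assumption: the `Answer` structure, the importance scale, and the choice to combine the two one-directional satisfaction scores with a geometric mean are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Answer:
    """One user's response to a match question (hypothetical structure)."""
    own: str               # the answer the user gave
    acceptable: frozenset  # answers the user would accept from a match
    importance: float      # weight chosen by the user (assumed scale)


def satisfaction(a: dict, b: dict) -> float:
    """Fraction of a's importance-weighted preferences that b's answers meet."""
    shared = a.keys() & b.keys()  # only questions both users answered
    total = sum(a[q].importance for q in shared)
    if total == 0:
        return 0.0
    earned = sum(a[q].importance for q in shared if b[q].own in a[q].acceptable)
    return earned / total


def match_percent(a: dict, b: dict) -> float:
    """Geometric mean of both directions, so a one-sided fit scores low."""
    return 100 * sqrt(satisfaction(a, b) * satisfaction(b, a))
```

In this sketch each user is a dictionary mapping a question id to an `Answer`. A pair only scores well when each person's answers satisfy the other's weighted preferences, which is one simple way to let every user shape their own matching criteria.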

There are a lot of nuances that are pretty tricky there. One of the interesting ones is the human psychology of match questions: understanding what it is that you want.

When you go through the process of answering questions, maybe you’ll answer some questions in a way that is consistent with what you really want, but you might not answer all questions that way. You might provide answers for how you feel that night, which may not be reflective of your larger perspective.

So that’s one of the big challenges: understanding what someone is really trying to say when they’re answering questions about their preferences.

How did you all deal with that?

The first step we took was looking over all the different questions, and identifying which questions lead to confusion from a statistical perspective.

We focused on how effective questions are at splitting the population. The ideal match question is something that people feel very strongly about the answer to, but which also splits the population pretty evenly, so that about half the population feels very strongly yes and half feels very strongly no. Questions like that are perfect for narrowing down the pool of people who are good matches for you.
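One way to express that ideal numerically is sketched below in Python. This is an illustration rather than OkCupid's method: the function name `split_score`, the use of normalized Shannon entropy to measure how even the split is, and the 0-to-1 importance scale are all assumptions.

```python
from collections import Counter
from math import log2


def split_score(answers, importances):
    """Score a question by how evenly it splits users and how much they care.

    answers: list of answer labels, one per user.
    importances: list of weights in [0, 1], one per user (assumed scale).
    """
    if not answers or len(answers) != len(importances):
        raise ValueError("need one importance per answer")
    counts = Counter(answers)
    n = len(answers)
    if len(counts) < 2:
        return 0.0  # everyone agrees, so the question filters nobody
    # Normalized Shannon entropy: 1.0 for a perfectly even split.
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    balance = entropy / log2(len(counts))
    # Average strength of feeling across everyone who answered.
    strength = sum(importances) / n
    return balance * strength
```

A question that splits users 50/50 and that everyone rates as maximally important scores 1.0, while a question everyone answers the same way scores 0.0. A high score on its own cannot tell genuine disagreement apart from ambiguous wording, so it would need to be checked against other evidence.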

But interestingly, some of those questions that appear to be very important to you might be based on a misinterpretation. There could be two different interpretations for the question, and you just answered one of them. Then the population is evenly split on a question not because people feel strongly about the answer, but because they have different interpretations of the question. Knowing this can happen, we use the algorithms to help us understand the statistics behind each question, and we’ll try to identify questions that are the most likely to be mistaken in this way so that we can remove them.

We also examined messaging patterns as a backup, and correlated answers to other questions. So if the question is an outlier compared to many other questions, we’ll tend to count it less; or if messaging patterns don’t line up with answers to the question, we would sometimes use that as a reason to remove the question as well.

Many of the questions are user-generated, so what’s nice is that as people answer the questions, some rise to the top and get popular, and some don’t. That makes our task easier—we’re focusing on filtering through which questions are good and which need to be removed, rather than having to think of what new questions should be added.

So the community plays a role in generating the questions that people find important.

Yeah, and it’s really neat to see how those questions spread. Often a new issue would come up, like a new president or a recent news article, and the way people feel about it can be quite important in understanding their personality. So users will add a new question that touches on it, and the question will quickly become popular and play a role in matching people.

There’s been a lot of discussion around “experiments” on users done by data-driven product development organizations like Facebook and OkCupid. A big question that always comes up is the ethical considerations of these sorts of tests—the impact these tests have on the user, independent of the goal that you’re trying to achieve.

When you set up and ran experiments on users, what were the ethical considerations that went into them? Were there any experiments that were considered “off limits” that you decided not to do because they crossed some ethical line?

That’s a really interesting topic. Running experiments was a very important part of our product and decision-making strategy at OkCupid—as I think it is at almost every tech company, I would hope.

At OkCupid, our philosophy was not to just experiment because we wanted to. Often there is this problem with data science and analytics in general—leaders of the company want answers to a particular question, so they’ll ask for experiments to get at some deeper understanding, but there won’t be specific decisions that they’re trying to make as a result of having that information.

We always took a decision-first approach. We would come to some key question. Do we want the product to be designed this way or that way? Should we make this change or that change?

One change we debated quite a bit was our rating system. Originally we had a rating system that allowed people to score other people from one to five stars. And we thought, well, it would be a simpler user interface to just use a yes or no answer. That would be more straightforward, but then again we would lose a lot of information resolution, and was that really worth it? We were pretty torn on it, and couldn’t come to a decision through discussion alone, so we resorted to an experiment to understand which would lead to better messaging patterns. That’s an example of the kind of product decision we would try to answer with experiments.

The ethics around experimentation really depend on what you’re trying to accomplish with the question. The goal should be improving the product for people, and you should focus on not degrading the experience very much for any one person: don’t hurt someone too much for the experiment. Maybe a little bit of degradation of the experience makes sense because you don’t actually know that it works. But then as soon as you learn which version of an experience is worse, you can resolve the experiment.

I think you get into hot water if what you’re doing looks more like a psychology study than trying to make a product decision. Like: wouldn’t it be interesting from a research perspective to see what happens if people are exposed to this situation? That can be a little bit sketchier unless you go through the standard research routes, like IRB approval (review by an Institutional Review Board, a committee that oversees the treatment of human research subjects under U.S. federal regulations, including the FDA’s) and informed consent.

You’ve mentioned using messaging patterns as an important metric. But for a dating application it seems like there are many different metrics by which you might measure success. What was your guiding light for figuring out whether a feature was successful or not? Were there different metrics that were in conflict?

You can think of a hierarchy of different metrics. They range from being very plentiful and not that informative, to being extremely informative but much more rare.

One bit of data that is plentiful is who you view on the site. If you have a matching algorithm that displays options on a page, and someone clicks one of those options to view the profile, that’s a weak positive signal. There’s a whole lot of that happening, and it tells us a little bit about the user, but not much.

A stronger positive signal is sending a message. Then even stronger than that is having a multi-directional exchange, which implies that both parties most likely were happy about that exchange.

The strongest signal of all is when someone deletes or pauses their account. We ask them if they did it because they met someone, and if they were matched with that person. If they say yes, that allows us to get high-quality information on who were really good matches, because they form entire relationships based on it. But that sort of data is more rare.

We combined these different levels of data depending on the goal we were trying to achieve with the metric. For example, if you want to have a system that reacts more quickly, then you focus on the more common data, like profile views. But for the most part, we settled in the middle, which means focusing on communications that involve three or four messages exchanged back and forth. We felt those were a good sign that two people had a genuine connection, and that’s what we’re trying to focus on for the site.
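The middle-of-the-hierarchy signal described here could be detected with something like the following Python sketch. The function name, the input format, and the default threshold of three turns are assumptions made for illustration; the interview only says "three or four messages exchanged back and forth."

```python
def is_quality_exchange(senders, min_turns=3):
    """Return True if a conversation looks like a real back-and-forth.

    senders: list of sender ids in chronological order, one per message.
    min_turns: assumed threshold for how many alternating turns count.
    """
    # A new turn starts whenever the sender differs from the previous one.
    turns = sum(
        1 for i, sender in enumerate(senders)
        if i == 0 or sender != senders[i - 1]
    )
    # Require exactly two participants and enough alternation between them.
    return len(set(senders)) == 2 and turns >= min_turns
```

Counting turns rather than raw messages means that one person sending several messages in a row with no reply does not register as a connection.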

One other metric that competes with that to some degree is evenness: what fraction of people on the site receive at least one contact every week. You see scenarios where maybe someone is happy receiving lots of messages, and really likes the attention, so you could have an algorithm that directs a lot of people to message that one person. That’s nice for that one person, and great for the three-or-four-message metric, but it’s not so great for evenness. So we tried to spread out the engagement on the site to other people, even if it meant fewer message exchanges. There was often tension between those two goals.
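Evenness as defined here is simple to compute. The Python sketch below assumes a hypothetical input format: a collection of active user ids and a list of `(sender_id, recipient_id)` pairs for one week.

```python
def evenness(active_users, contacts):
    """Fraction of active users who received at least one contact.

    active_users: iterable of user ids active during the week.
    contacts: iterable of (sender_id, recipient_id) pairs from that week.
    """
    users = set(active_users)
    if not users:
        return 0.0
    # Collect each recipient once, ignoring ids outside the active set.
    reached = {recipient for _, recipient in contacts if recipient in users}
    return len(reached) / len(users)
```

With four active users, three messages all sent to the same person give an evenness of 0.25, while three messages sent to three different people give 0.75, even though the message count is identical.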

What sorts of strategies did you take to increase things like evenness?

It’s an interesting challenge. Messaging patterns are fundamentally very uneven if you don’t make an active attempt to sculpt or mold them. There are a few lucky people who get a large number of messages, and a very long tail of people who might get messages once in a while, but overall don’t get much attention. That’s something all these kinds of apps struggle with.

One of the more common techniques is setting a rate limit. If someone is sending a lot of messages, or sending a lot of likes, or thumbs up, or anything like that, they’ll be rate-limited after a certain number of interactions. At OkCupid, we really focused on not doing that too harshly. Rather than hard limiting, we tried to do more of a soft sculpting of the messaging experience. So if someone is sending a large number of lower-quality messages, we would tend to show them other users who get fewer messages, and who maybe would appreciate the message they received more than the typical message recipient.
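The difference between a hard limit and soft sculpting can be sketched as a reranking step. The Python below is illustrative only: the function name, the tuple format, the logarithmic penalty, and the `alpha` and `volume_threshold` values are all invented for this example, and it uses message volume alone as a stand-in for "a large number of lower-quality messages."

```python
from math import log1p


def rerank(candidates, sender_volume, alpha=0.5, volume_threshold=20):
    """Order candidates for display, softly favoring less-contacted users.

    candidates: list of (user_id, match_score, messages_received_this_week).
    sender_volume: messages the viewing user sent recently.
    alpha, volume_threshold: made-up tuning knobs for this sketch.
    """
    # Only heavy senders get the penalty; others see plain match order.
    strength = alpha if sender_volume > volume_threshold else 0.0

    def adjusted(candidate):
        _, score, received = candidate
        # Shrink the score of users who already receive many messages.
        return score / (1.0 + strength * log1p(received))

    return [c[0] for c in sorted(candidates, key=adjusted, reverse=True)]
```

Nobody is blocked from sending messages in this approach. A heavy sender simply sees a different ordering, one that nudges attention toward people who receive fewer messages.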

We found that showing users who had a similar attractiveness level, but also had similar messaging patterns, produced a good balance in terms of both evenness and the total number of quality interactions on the site. I want to emphasize that attractiveness is not the only metric we use. We would always focus on both attractiveness and messaging patterns—when someone sends a message, how often is it responded to, and how many messages someone receives and responds to, which is a good measure for how interested they are in additional messages.

On the site there are questions that involve some amount of self-identified demographic data. Were there other under-the-hood metrics that corresponded to concepts that you had to get at from a roundabout sort of way, like socioeconomic status or class? Things that you couldn’t directly ask people, but would end up in a machine learning model somewhere?

You know, we stayed away from that as much as possible. We did at some point allow people to put what their salary was, but I think we may have gotten rid of that, since it didn’t serve a purpose. And in fact, for a very long time we resisted allowing people to filter by race—we felt it just wasn’t appropriate.

But then we learned about some use-cases from the other side—someone who is Filipino who wants to find other Filipinos easily. We found that that’s a pretty legit reason to search by race, so we added that feature. But in general, we focus on making it an experience that doesn’t discriminate and encourages people to be their best selves.

How much of your approach was trying to enable users to make a selection of matches they felt they wanted, versus trying to encourage people to find matches in ways that a team or the company deemed ethical, like with regard to not being able to filter by race or income?

It’s a mix. For the most part we try to cater to people’s tastes, but in certain cases that are very important to us, like with different protected classes, we would focus on doing what was right.

One example is that people who are bisexual would often receive messages from straight people that were really not desired. Though they were bi, they weren’t interested in that kind of attention. It would be pretty overwhelming, particularly for bi women. So we added a feature to allow them to only be seen by other bi users—that was well-received, and was in response to this pattern we noticed of dissatisfaction and unwanted attention. That helped people of different orientations feel safer on the site.

How did you model users outside of conventional gender norms? What sort of work did you do surrounding supporting people who identify outside of the typical gender binary?

In my experience, OkCupid has always been considered one of the safer sites for people with alternative identities and preferences. That’s something we’ve always been proud of.

Obviously sexual orientation and gender identity are not binary—they’re a continuum. But at first we simplified in terms of gay, straight, and bi orientation. And we were always thinking about the nine different pairings of those groups, and made sure any experience we created made sense for each of those nine different pairings.

On the gender front, for a long time we were aware that people who didn’t identify as either male or female weren’t being completely served on the site, because there was no way for them to enter their identity—the site made you pick male or female. That was a tricky decision, because it was built into the code pretty deeply from the start. We really wanted to make that change, so finally we put in the time and effort and added a much better range of gender options. We were really happy we were able to do that, although it took a lot of work and took us a while to prioritize it.

Honestly, one thing that is interesting is that from a matching perspective we’re pretty gender- and orientation-agnostic. We don’t try to use the algorithms to pair people of certain identities with people of other identities—we really just focus on personality questions and preferences, then allow people to choose how they filter within gender identities. We want people to find other users who are great matches from a personality perspective, possibly in places that they didn’t expect.

You’ve talked about having core principles when you think about the features that you’re willing to develop. In the role of CTO, how did you go about crafting the engineering and product teams around those values? Was there a set of core principles that you aligned around? Were there particular qualities that you looked for when hiring that reflected those values?

Often companies have a more structured set of core values that are baked into company events and communications. Honestly, at OkCupid, the people who worked there came from a certain place of idealism and community, so it just kind of sprung up. Everyone who worked there was encouraged to read feedback, so people would see all kinds of different perspectives from users using the site. When someone would read feedback and find an issue that resonated with them, they could bring it up, and we’d discuss it and think about how to best solve it for the people who sent in the feedback, but in a way that was respectful and helpful to the rest of the users on the site as well.

It was really neat to see the grassroots unification around inclusive ideals, without having to push for an official set of “values.” In a way it was easy because the company was small. It was about thirty or thirty-five people when I left in 2014. In a group that size, it is pretty easy to have value alignment without too much structure.

That’s thirty-five people on the engineering team, or in the whole company?

That was the entire company.

Oh wow, okay. I didn’t realize it was that small.

Yeah, that was what was so neat about OkCupid: how many people are reached and impacted by such a small team.