As part of my work at Counsyl, I interview most of our software engineering candidates who are interested in machine learning and data processing problems. I always ask prospective candidates why they’re interested in Counsyl; with surprisingly high frequency the answer is something like, “I want to work on Big Data, and I’ve heard genomics has the biggest data around, so that must be the place to be!”

This conversation has happened so many times that I decided to devote an entire tech talk to its misconceptions. In the talk, I argue that Big Data is not intrinsically interesting; instead, most of the hype exists because Big Data is relevant to advertising, and advertising drives the consumer Internet. I further argue that although the total volume of data in genomics may be large, it’s better to think of genomics as a large collection of small data problems than as Big Data.

I find that most of the engineers I talk to who are interested in Big Data aren’t interested in it for its own sake (people working on setting TeraSort records aside!). Instead, they’re interested in it because of what they hope it might offer: technologies that are more personalized and useful to the consumer. I argue, both to candidates and in the talk, that genomics’ small data problems actually fit this bill much better than the usual problems in the Big Data space.

You can see a video of the talk here.

Slides here:



Then check out our current job openings.

Imran S. Haque is the Director of Research at Counsyl. Prior to joining Counsyl, he completed his Ph.D. in computer science at Stanford, where he worked on large-scale machine learning for drug design with Vijay Pande and Daphne Koller. His code reviews mostly consist of giving the look of disapproval to questionable constructs.