In a previous post, I argued for the need for a different kind of interview question for data science and machine learning engineers, and listed some questions I thought were good for gauging data science knowledge and cleverness. I extend that list here with a few more questions:

On some online stores, you notice that ratings for multi-installment novels follow a peculiar trend: the average rating creeps slightly up with each installment, even though the number of reviews goes down. What do you think is happening here?

Selection bias, plain and simple. It is mostly those who liked the first installment (or who are hard-core fans of the author) who go on to read subsequent installments, and they are hence more likely to have a favorable opinion of the book and the writer in general.

OK, that makes sense. Now what can we do to de-bias these ratings and get a score on which we can compare novels on an equal footing?

This could be solved by stratified re-sampling, or by creating a new scoring system that combines the average rating and the number of reviews into a single score.
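One simple instance of the second idea is a Bayesian (shrinkage) average that pulls a book's raw mean rating toward a global prior, so a sequel with a handful of enthusiastic reviews cannot outscore a widely reviewed first installment on hype alone. A minimal sketch; the prior values here are illustrative, not from the post:

```python
def bayesian_average(ratings, prior_mean=3.5, prior_weight=25):
    """Shrink a book's average rating toward a global prior mean.

    Books with few reviews are pulled strongly toward prior_mean;
    books with many reviews keep a score close to their raw average.
    prior_mean and prior_weight are illustrative choices.
    """
    n = len(ratings)
    if n == 0:
        return prior_mean
    raw_mean = sum(ratings) / n
    return (prior_weight * prior_mean + n * raw_mean) / (prior_weight + n)

# A sequel with few, enthusiastic reviews...
sequel = bayesian_average([5, 5, 4])
# ...versus a first installment with many mixed reviews.
first = bayesian_average([4] * 80 + [3] * 40)
```

With these numbers, the sequel's raw mean (4.67) shrinks far more than the first installment's (3.67), putting the two on a comparable footing.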

You are chief scientist for the air force in WW II, and you are tasked with making air strikes safer for fighter pilots (i.e. you want more of them to come back). You personally inspect damaged planes after they return from battle (say 70% of the planes make it back on average, and 20% of those are damaged). You find that bullet damage is distributed in a highly non-uniform way (e.g. far more bullet holes in the wings than their area would warrant). What could be the reason for this? What would you do to make the planes less likely to be shot down?

This actually happened during WW II, and the protagonist was Abraham Wald. The pattern is an artifact of survivorship bias: you only inspect the planes that made it back, so the bullet holes mark the places where a plane can be hit and still survive. Wald's recommendation was to reinforce the areas with few or no holes (such as the engines), since planes hit there never returned to be inspected.

The United States Chess Federation (USCF) invites you to devise their new ranking system that will replace Elo. You are free to devise enhancements to the current system or propose a completely new ranking algorithm.

There are many possible ways to do this, which makes for a good discussion question. Also, whether the candidate decides to extend Elo or to start from scratch says something about her/his character.
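As a baseline for that discussion, it helps if the candidate can write down the standard Elo update itself. A minimal sketch; the fixed k=32 is a common illustrative choice, whereas USCF uses rating-dependent K factors in practice:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo rating update for player A after a game against B.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    expected_a is A's expected score given the rating gap.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a)

# Equal players: a win moves the winner up by k/2 = 16 points.
print(elo_update(1500, 1500, 1.0))  # -> 1516.0
```

Note the asymmetry that candidates often propose to tune: losing to a much stronger player costs fewer points than losing to an equal one.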

How would you go about building an ensemble of hundreds of highly diverse models (resulting from different algorithms and different parameter settings)?

This opens the door for the candidate to show off knowledge about bagging and boosting and the benefits of each, but for that scale of diverse models to be beneficial, stacking is a natural choice. Essential here is the awareness that stacking requires an extra hold-out set, and resourcefulness in adapting the stacking scheme if that hold-out set is fairly small.
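To make the hold-out requirement concrete, here is a toy sketch of the stacking scheme on made-up data: the "diverse base models" are hypothetical single-feature scorers, and the meta-model is a logistic regression fit by plain gradient descent on the base models' scores over a separate hold-out set. None of this is from the post; it only illustrates the data flow:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy binary task: the label is 1 when the feature sum is positive.
def make_data(n):
    xs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
    ys = [1 if sum(x) > 0 else 0 for x in xs]
    return xs, ys

# Hypothetical "diverse base models": each one scores on a single feature.
base_models = [lambda x, i=i: sigmoid(5 * x[i]) for i in range(3)]

# Core of stacking: base-model scores on a *separate hold-out set* become
# the training features of a meta-model.
holdout_x, holdout_y = make_data(200)
meta_features = [[m(x) for m in base_models] for x in holdout_x]

w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(300):
    for feats, y in zip(meta_features, holdout_y):
        p = sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b)
        g = p - y                       # log-loss gradient w.r.t. the logit
        w = [wi - lr * g * f for wi, f in zip(w, feats)]
        b -= lr * g

def ensemble_predict(x):
    feats = [m(x) for m in base_models]
    return 1 if sum(wi * f for wi, f in zip(w, feats)) + b > 0 else 0

test_x, test_y = make_data(500)
accuracy = sum(ensemble_predict(x) == y for x, y in zip(test_x, test_y)) / 500
```

The key design point is that the meta-model never sees data the base models were trained on; with a small hold-out set, a cross-validated variant of this scheme (out-of-fold predictions as meta-features) is the usual adaptation.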

How would you sample uniformly from a continuous stream of data?

Reservoir Sampling.
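A minimal sketch of the classic single-pass algorithm: keep the first k items, then replace a random slot with probability k/(i+1) for each later item, which leaves every item in the reservoir with equal probability regardless of stream length.

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream of unknown length.

    After n items have been seen, each one is in the reservoir
    with probability k/n.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5)
```

A nice follow-up is asking the candidate to prove the k/n invariant by induction over the stream.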

Assume you already have a classification model with a great ROC curve, but it produces arbitrary scores that do not map to probability estimates. How would you go about calibrating those scores into probabilities?

There are a few methods for doing that. It is interesting to see whether the candidate understands that the calibration process biases some metrics on the calibration data set.
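One such method is Platt scaling: fit a sigmoid mapping from raw scores to probabilities on a held-out calibration set. A minimal sketch using plain gradient descent on the log loss; the toy scores and labels are invented for illustration:

```python
import math

def platt_scale(scores, labels, lr=0.01, epochs=2000):
    """Fit Platt scaling: p = sigmoid(a * score + b).

    Returns a function mapping a raw score to a calibrated
    probability. Fit this on a separate calibration set, not on
    the model's training data.
    """
    a, b = 1.0, 0.0
    for _ in range(epochs):
        for s, y in zip(scores, labels):
            p = 1 / (1 + math.exp(-(a * s + b)))
            g = p - y              # log-loss gradient w.r.t. the logit
            a -= lr * g * s
            b -= lr * g
    return lambda s: 1 / (1 + math.exp(-(a * s + b)))

# Toy calibration set: higher raw scores tend to be positives.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
calibrate = platt_scale(scores, labels)
```

Because the sigmoid is monotone, ranking metrics such as AUC are unchanged by the calibration, while probability metrics (log loss, Brier score) measured on the calibration set itself become optimistically biased, which is the point the answer above alludes to. Isotonic regression is the usual non-parametric alternative.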

Explain the bootstrap sampling method and when it can be useful.
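A good answer covers the mechanics (resample the data with replacement, recompute the statistic many times) and a canonical use: estimating a confidence interval for a statistic whose sampling distribution is hard to derive analytically. A minimal percentile-bootstrap sketch on invented data:

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for any statistic.

    Resample the data with replacement, recompute the statistic on
    each resample, and take the empirical (alpha/2, 1 - alpha/2)
    percentiles of those recomputed values.
    """
    stats = []
    for _ in range(n_boot):
        resample = [random.choice(data) for _ in range(len(data))]
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

data = [12, 15, 9, 14, 11, 13, 16, 10, 12, 14]
mean = lambda xs: sum(xs) / len(xs)
low, high = bootstrap_ci(data, mean)
```

The same resampling idea underlies bagging from the ensemble question above, which makes this a natural closing thread for the interview.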