Let’s prepare for Machine Learning interviews!

Introduction

What is this article about?

In this article, I share an eclectic collection of interview questions that will help you prepare for Machine Learning interviews. It is aimed at anyone interested in one or more of the following positions in the Machine Learning group of a leading company (Google, Facebook, IBM, Amazon, Microsoft, etc.):

Research Engineer

Software Engineer

Postdoctoral Researcher

Research Scientist

Data Scientist

I will keep adding more questions to this list over time. This project started off as a GitHub repository, which can be found here. I continually update the repository with new questions.

Why use it?

This will be useful to someone who is:

Interested in preparing for Machine Learning interviews

Preparing for Machine Learning interviews but lost in the plethora of resources, and wanting to prioritize what to learn

Looking to hone their skills by attempting some prospective interview questions

What should I learn?

Someone applying to any one of the above positions is expected to know the basics of the following broad topics:

Computer Science

Linear Algebra

Statistics and Probability

Machine Learning

All of these are fairly broad topics, and the sections dedicated to them in this article list specific questions related to some of them. Note that deeper knowledge of one or more of the above topics might be expected of you, depending on the particular position you are interviewing for. This raises our next question.

What is expected of me in the interviews?

Research or Software Engineer: If you are applying to either of these positions in a Machine Learning group, you should know the basics of the above four topics, with emphasis on Computer Science and Machine Learning. In addition, some Machine Learning projects on GitHub will help showcase both your knowledge and your coding skills.

Postdoctoral Researcher and Research Scientist: Apart from the basics, you should know at least one domain of Machine Learning extremely well. You should have published multiple papers in this domain, which will demonstrate your authority on the topic. Since you are applying to this position, you already know which domain that is in your case.

Data Scientist: If you are interested in a Data Scientist position, then after learning the basics, put more emphasis on Statistics and Probability.

List of questions

Now that you have a general idea of what a Machine Learning interview involves, let’s waste no time and go through a list of questions organized by topic (in no particular order).

Linear Algebra

What is broadcasting in connection to Linear Algebra? What are scalars, vectors, matrices, and tensors? What is the Hadamard product of two matrices? What is an inverse matrix? If the inverse of a matrix exists, how do you calculate it? What is the determinant of a square matrix? How is it calculated? What is the connection of the determinant to eigenvalues? Discuss span and linear dependence. What is Ax = b? When does Ax = b have a unique solution? In Ax = b, what happens when A is fat or tall? When does the inverse of A exist? What is a norm? What are the L1, L2 and L-infinity norms? What are the conditions a norm has to satisfy? Why is the squared L2 norm preferred in ML over the plain L2 norm? When is the L1 norm preferred over the L2 norm? Can the number of nonzero elements in a vector be defined as the L0 norm? If no, why? What is the Frobenius norm? What is a diagonal matrix? Why is multiplication by a diagonal matrix computationally cheap? How is the multiplication different for a square vs. a non-square diagonal matrix? Under what conditions does the inverse of a diagonal matrix exist? What is a symmetric matrix? What is a unit vector? When are two vectors x and y orthogonal? In R^n, what is the maximum possible number of mutually orthogonal vectors with non-zero norm? When are two vectors x and y orthonormal? What is an orthogonal matrix? Why is it computationally preferred? What are eigendecomposition, eigenvectors and eigenvalues? How do you find the eigenvalues of a matrix? Write the eigendecomposition formula for a matrix. If the matrix is real symmetric, how will this change? Is the eigendecomposition guaranteed to be unique? If not, then how do we represent it? What are positive definite, negative definite, positive semidefinite and negative semidefinite matrices? What is Singular Value Decomposition (SVD)? Why do we use it? Why not just use eigendecomposition (ED)? Given a matrix A, how will you calculate its Singular Value Decomposition? What are singular values, left singular vectors and right singular vectors? What is the connection of the Singular Value Decomposition of A with functions of A? Why are singular values always non-negative? What is the Moore-Penrose pseudoinverse and how do you calculate it? If we apply the Moore-Penrose pseudoinverse to Ax = b, what solution is provided if A is fat? Moreover, what solution is provided if A is tall? Which matrices can be decomposed by ED? Which matrices can be decomposed by SVD? What is the trace of a matrix? How do you write the Frobenius norm of a matrix A in terms of the trace? Why is the trace of a product of matrices invariant to cyclic permutations? What is the trace of a scalar?
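
As a hint for the Frobenius norm and trace questions above, here is a minimal sketch in plain Python (no NumPy, so the helper functions `transpose`, `matmul` and `trace` are my own illustrative names, not a library API). It checks numerically that the Frobenius norm of A equals the square root of trace(AᵀA) for one small example matrix.

```python
import math


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


def frobenius_direct(a):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


def frobenius_via_trace(a):
    """Same norm, written as sqrt(trace(A^T A))."""
    return math.sqrt(trace(matmul(transpose(a), a)))


# A tall 3x2 example matrix (arbitrary values chosen for illustration).
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(frobenius_direct(A), frobenius_via_trace(A))
```

Both numbers should agree up to floating-point rounding, because the diagonal of AᵀA holds the squared column norms of A.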

Numerical Optimization

What is underflow and overflow? How to tackle the problem of underflow or overflow for softmax function or log softmax function? What is poor conditioning? What is the condition number? What are grad, div and curl? What are critical or stationary points in multi-dimensions? Why should you do gradient descent when you want to minimize a function? What is line search? What is hill climbing? What is a Jacobian matrix? What is curvature? What is a Hessian matrix?
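
To make the softmax question above more concrete, here is one common way to sidestep overflow and underflow, sketched in plain Python: subtract the maximum logit before exponentiating. The function names are illustrative, and this is only a sketch of the idea rather than how any particular framework implements it.

```python
import math


def softmax(logits):
    """Softmax that shifts by the max logit so exp() never sees a large positive value."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)  # at least 1.0, because exp(0) is one of the terms
    return [e / total for e in exps]


def log_softmax(logits):
    """Log-softmax via the log-sum-exp trick, avoiding log(0) from underflow."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]


# math.exp(1000.0) would raise OverflowError, but the shifted version is fine.
big = [1000.0, 1001.0, 1002.0]
print(softmax(big))

# A naive softmax here would divide by zero, since exp(-1000.0) underflows to 0.0.
small = [-1000.0, -1001.0]
print(log_softmax(small))
```

The shift does not change the result mathematically, since the factor exp(-m) cancels between numerator and denominator.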

Basics of Probability and Information Theory

Compare “Frequentist probability” vs. “Bayesian probability”. What is a random variable? What is a probability distribution? What is a probability mass function? What is a probability density function? What is a joint probability distribution? What are the conditions for a function to be a probability mass function? What are the conditions for a function to be a probability density function? What is a marginal probability? Given the joint probability function, how will you calculate it? What is conditional probability? Given the joint probability function, how will you calculate it? State the chain rule of conditional probabilities. What are the conditions for independence and conditional independence of two random variables? What are expectation, variance and covariance? Compare covariance and independence. What is the covariance for a vector of random variables? What is a Bernoulli distribution? Calculate the expectation and variance of a random variable that follows a Bernoulli distribution. What is a multinoulli distribution? What is a normal distribution? Why is the normal distribution a default choice for a prior over a set of real numbers? What is the central limit theorem? What are the exponential and Laplace distributions? What are the Dirac distribution and the empirical distribution? What is a mixture of distributions? Name two common examples of mixtures of distributions. (Empirical and Gaussian Mixture) Is the Gaussian mixture model a universal approximator of densities? Write the formulae for the logistic and softplus functions. Write the formula for Bayes’ rule. What do you mean by measure zero and almost everywhere? If two random variables are related in a deterministic way, how are their PDFs related? Define self-information. What are its units? What are Shannon entropy and differential entropy? What is Kullback-Leibler (KL) divergence? Can KL divergence be used as a distance measure? Define cross-entropy. What are structured probabilistic models or graphical models? In the context of structured probabilistic models, what are directed and undirected models? How are they represented? What are cliques in undirected structured probabilistic models?
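
For the entropy, cross-entropy and KL divergence questions above, a small numerical experiment can help build intuition. The sketch below uses two made-up discrete distributions `p` and `q` (my own example values) and assumes that `q` is nonzero wherever `p` is nonzero. It illustrates that KL divergence is not symmetric, and that cross-entropy decomposes as entropy plus KL divergence.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two illustrative distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

print("KL(p || q) =", kl_divergence(p, q))
print("KL(q || p) =", kl_divergence(q, p))
print("H(p, q) - (H(p) + KL(p || q)) =",
      cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q)))
```

The two KL values differ, which is one reason KL divergence does not qualify as a true distance metric.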

Confidence interval

What are the population mean and the sample mean? What are the population standard deviation and the sample standard deviation? Why does the population s.d. have N degrees of freedom while the sample s.d. has N-1 degrees of freedom? In other words, why is there 1/N inside the root for the population s.d. and 1/(N-1) inside the root for the sample s.d.? What is the formula for calculating the s.d. of the sample mean? What is a confidence interval? What is the standard error?
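
The quantities in the questions above can be computed with Python's standard `statistics` module. The sketch below uses a made-up sample, and the function name `mean_confidence_interval` is my own. It also assumes a normal approximation with z = 1.96 for a roughly 95% interval; for a sample this small, a t-distribution quantile would usually be more appropriate.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (mean, sample s.d., standard error, (low, high)) for a sample.

    z=1.96 is an assumed normal-approximation quantile for ~95% coverage.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)   # divides by N-1 (sample s.d.)
    se = sd / math.sqrt(n)          # s.d. of the sample mean (standard error)
    return mean, sd, se, (mean - z * se, mean + z * se)


# Illustrative data, not taken from any real experiment.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, sd, se, (low, high) = mean_confidence_interval(sample)

print("sample mean:", mean)
print("population-style s.d. (1/N):", statistics.pstdev(sample))
print("sample s.d. (1/(N-1)):", sd)
print("standard error:", se)
print("approx. 95% CI:", (low, high))
```

Comparing `pstdev` with `stdev` on the same data shows the effect of dividing by N versus N-1.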

Learning Theory

Describe bias and variance with examples. What is Empirical Risk Minimization? What are the union bound and Hoeffding’s inequality? Write the formulae for training error and generalization error. Point out the differences. State the uniform convergence theorem and derive it. What is the sample complexity bound of the uniform convergence theorem? What is the error bound of the uniform convergence theorem? What is the bias-variance trade-off theorem? From the bias-variance trade-off, can you derive the bound on training set size? What is the VC dimension? What does the training set size depend on for a finite and an infinite hypothesis set? Compare and contrast. What is the VC dimension for an n-dimensional linear classifier? How is the VC dimension of an SVM bounded although it is projected to an infinite dimension? Considering that Empirical Risk Minimization is an NP-hard problem, how do the logistic regression and SVM losses work?

Model and feature selection

Why are model selection methods needed? How do you do a trade-off between bias and variance? What are the different attributes that can be selected by model selection methods? Why is cross-validation required? Describe different cross-validation techniques. What is hold-out cross-validation? What are its advantages and disadvantages? What is k-fold cross-validation? What are its advantages and disadvantages? What is leave-one-out cross-validation? What are its advantages and disadvantages? Why is feature selection required? Describe some feature selection methods. What is the forward feature selection method? What are its advantages and disadvantages? What is the backward feature selection method? What are its advantages and disadvantages? What is the filter feature selection method? Describe two such methods. What are mutual information and KL divergence? Describe KL divergence intuitively.

Curse of dimensionality

Describe the curse of dimensionality with examples. What is local constancy or smoothness prior or regularization?

Universal approximation of neural networks

State the universal approximation theorem. What is the technique used to prove it? What is a Borel measurable function? Given the universal approximation theorem, why can’t a Multi Layer Perceptron (MLP) still reach an arbitrarily small positive error?

Deep Learning motivation

What is the mathematical motivation of Deep Learning as opposed to standard Machine Learning techniques? In standard Machine Learning vs. Deep Learning, how is the order of the number of samples related to the order of regions that can be recognized in the function space? What are the reasons for choosing a deep model as opposed to a shallow model? How does Deep Learning tackle the curse of dimensionality?

Support Vector Machine

How can the SVM optimization function be derived from the logistic regression optimization function? What is a large margin classifier? Why is SVM an example of a large margin classifier? SVM being a large margin classifier, is it influenced by outliers? What is the role of C in SVM? In SVM, what is the angle between the decision boundary and theta? What is the mathematical intuition of a large margin classifier? What is a kernel in SVM? Why do we use kernels in SVM? What is a similarity function in SVM? Why is it named so? How are the landmarks initially chosen in an SVM? How many and where? Can we apply the kernel trick to logistic regression? Why is it not used in practice then? What is the difference between logistic regression and SVM without a kernel? How does the SVM parameter C affect the bias/variance trade-off? How does the SVM kernel parameter sigma² affect the bias/variance trade-off? Can any similarity function be used for SVM? Logistic regression vs. SVMs: When to use which one?

Bayesian Machine Learning

What are the differences between the “Bayesian” and “Frequentist” approaches to Machine Learning? Compare and contrast maximum likelihood and maximum a posteriori estimation. How do Bayesian methods perform automatic feature selection? What do you mean by Bayesian regularization? When will you use Bayesian methods instead of Frequentist methods?

Regularization

What is L1 regularization? What is L2 regularization? Compare L1 and L2 regularization. Why does L1 regularization result in sparse models? What is dropout? How will you implement dropout during forward and backward pass?
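
For the dropout implementation question above, here is a minimal sketch of one common variant, often called inverted dropout, written in plain Python on a flat list of activations. The function names and the use of a `random.Random` instance are my own choices for illustration; real frameworks operate on tensors and differ in detail.

```python
import random


def dropout_forward(x, p_drop, rng):
    """Zero each activation with probability p_drop and rescale the survivors.

    Scaling by 1/(1 - p_drop) at training time keeps the expected activation
    unchanged, so no rescaling is needed at test time.
    """
    keep = 1.0 - p_drop
    mask = [(1.0 / keep) if rng.random() < keep else 0.0 for _ in x]
    out = [xi * mi for xi, mi in zip(x, mask)]
    return out, mask


def dropout_backward(grad_out, mask):
    """Gradients flow only through the kept units, with the same scaling."""
    return [g * m for g, m in zip(grad_out, mask)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
activations = [1.0] * 1000
out, mask = dropout_forward(activations, p_drop=0.5, rng=rng)
grad_in = dropout_backward([1.0] * 1000, mask)

print("mean activation after dropout:", sum(out) / len(out))
```

The key design point is that the same mask is stored in the forward pass and reused in the backward pass.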

Evaluation of Machine Learning systems

What are accuracy, sensitivity, specificity, ROC? What are precision and recall? Describe t-test in the context of Machine Learning.
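
The metrics in the questions above all derive from the four cells of a binary confusion matrix. Here is a small sketch with made-up labels and predictions; the function name `binary_metrics` is illustrative, and the code assumes every denominator is nonzero (both classes appear in the labels and at least one positive is predicted).

```python
def binary_metrics(y_true, y_pred):
    """Compute common metrics from 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }


# Illustrative labels and predictions (4 positives, 6 negatives).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

An ROC curve is then obtained by sweeping the decision threshold and plotting sensitivity against 1 - specificity at each setting.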

Clustering

Describe the k-means algorithm. What is the distortion function? Is it convex or non-convex? Tell me about the convergence of the distortion function.

EM algorithm

What is the Gaussian Mixture Model? Describe the EM algorithm intuitively. What are the two steps of the EM algorithm? Compare the Gaussian Mixture Model and Gaussian Discriminant Analysis.

Dimensionality Reduction

Why do we need dimensionality reduction techniques? Why do we need PCA and what does it do? What is the difference between logistic regression and PCA? What are the two pre-processing steps that should be applied before doing PCA?

Basics of Natural Language Processing

What is WORD2VEC? What is t-SNE? Why do we use PCA instead of t-SNE? What is sampled softmax? Why is it difficult to train an RNN with SGD? How do you tackle the problem of exploding gradients? What is the problem of vanishing gradients? How do you tackle the problem of vanishing gradients? Explain the memory cell of an LSTM. What type of regularization does one use in an LSTM? What is Beam Search? How do you automatically caption an image?

Some basic questions

Can you state Tom Mitchell’s definition of learning and discuss T, P and E? What different types of tasks can be encountered in Machine Learning? What are supervised, unsupervised, semi-supervised, self-supervised, multi-instance learning, and reinforcement learning? Loosely, how can supervised learning be converted into unsupervised learning and vice versa? Consider linear regression. What are T, P and E? Derive the normal equation for linear regression. What do you mean by affine transformation? Discuss affine vs. linear transformation. Discuss training error, test error, generalization error, overfitting, and underfitting. Compare representational capacity vs. effective capacity of a model. Discuss VC dimension. What are nonparametric models? What is nonparametric learning? What is an ideal model? What is Bayes error? What is/are the source(s) of Bayes error? What is the no free lunch theorem in connection to Machine Learning? What is regularization? Intuitively, what does regularization do during the optimization procedure? What is weight decay? Why is it added? What is a hyperparameter? How do you choose which settings are going to be hyperparameters and which are going to be learned? Why is a validation set necessary? What are the different types of cross-validation? When do you use which one? What are point estimation and function estimation in the context of Machine Learning? What is the relation between them? What is the maximum likelihood estimate of a parameter vector $\theta$? Where does the log come from? Prove that for linear regression, MSE can be derived from maximum likelihood under proper assumptions. Why is maximum likelihood the preferred estimator in ML? Under what conditions does the maximum likelihood estimator guarantee consistency? What is cross-entropy loss? What is the difference between a loss function, a cost function and an objective function?

Optimization procedures

What is the difference between an optimization problem and a Machine Learning problem? How can a learning problem be converted into an optimization problem? What is empirical risk minimization? Why the term empirical? Why do we rarely use it in the context of deep learning? Name some typical loss functions used for regression. Compare and contrast. What is the 0–1 loss function? Why can’t the 0–1 loss function or classification error be used as a loss function for optimizing a deep neural network?

Sequence Modeling

Write the equation describing a dynamical system. Can you unfold it? Now, can you use this to describe an RNN? What determines the size of an unfolded graph? What are the advantages of an unfolded graph? What does the output of the hidden layer of an RNN at any arbitrary time t represent? Is the output of the hidden layer of an RNN lossless? If not, why? RNNs are used for various tasks. From an RNN’s point of view, what tasks are more demanding than others? Discuss some examples of important design patterns of classical RNNs. Write the equations for a classical RNN where the hidden layer has recurrence. How would you define the loss in this case? What problems might you face while training it? What is backpropagation through time (BPTT)? Consider an RNN that has only output-to-hidden-layer recurrence. What are its advantages or disadvantages compared to an RNN having only hidden-to-hidden recurrence? What is teacher forcing? Compare and contrast it with BPTT. What is the disadvantage of using a strict teacher forcing technique? How do you solve this? Explain the vanishing/exploding gradient phenomenon for recurrent neural networks. Why don’t we see the vanishing/exploding gradient phenomenon in feedforward networks? What is the key difference in architecture of LSTMs/GRUs compared to traditional RNNs? What is the difference between LSTM and GRU? Explain gradient clipping. Adam and RMSProp adjust the size of gradients based on previously seen gradients. Do they inherently perform gradient clipping? If no, why? Discuss RNNs in the context of Bayesian Machine Learning. Can we do Batch Normalization in RNNs? If not, what is the alternative?
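
As a starting point for the gradient clipping question above, here is a minimal sketch of clipping by norm in plain Python. The function name `clip_by_norm` and the flat-list representation of the gradient are my own simplifications; this is one of several clipping variants (clipping each element to a fixed range is another).

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# An illustrative "exploding" gradient with L2 norm 5.0.
grads = [3.0, 4.0]
clipped = clip_by_norm(grads, max_norm=1.0)
print(clipped)

# A gradient that is already small is returned unchanged.
print(clip_by_norm([0.1, 0.2], max_norm=1.0))
```

Note that the threshold is applied to the raw gradient, independently of any per-parameter scaling an optimizer applies afterwards.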

Autoencoders

What is an Autoencoder? What does it “auto-encode”? What were Autoencoders traditionally used for? Why has there been a resurgence of Autoencoders for generative modeling? What is recirculation? What loss functions are used for Autoencoders? What is a linear autoencoder? Can it be optimal (lowest training reconstruction error)? If yes, under what conditions? What is the difference between Autoencoders and PCA? What is the impact of the size of the hidden layer in Autoencoders? What is an undercomplete Autoencoder? What is it typically used for? What is a linear Autoencoder? Discuss its equivalence with PCA. Which one is better in reconstruction? What problems might a nonlinear undercomplete Autoencoder face? What are overcomplete Autoencoders? What problems might they face? Does the scenario change for linear overcomplete autoencoders? Discuss the importance of regularization in the context of Autoencoders. Why do generative autoencoders not require regularization? What are sparse autoencoders? What is a denoising autoencoder (DAE)? What are its advantages? How does it solve the overcomplete problem? What is score matching? Discuss its connections to DAEs. Are there any connections between Autoencoders and RBMs? What is manifold learning? How are denoising and contractive autoencoders equipped to do manifold learning? What is a contractive autoencoder (CAE)? Discuss its advantages. How does it solve the overcomplete problem? Why is a contractive autoencoder named so? What are the practical issues with CAEs? How do you tackle them? What is a stacked autoencoder? What is a deep autoencoder? Compare and contrast. Compare the reconstruction quality of a deep autoencoder vs. PCA. What is predictive sparse decomposition? Discuss some applications of Autoencoders.

Representation Learning

What is representation learning? Why is it useful? What is the relation between Representation Learning and Deep Learning? What are one-shot and zero-shot learning (Google’s NMT)? Give examples. What trade-offs does representation learning have to consider? What is greedy layer-wise unsupervised pretraining (GLUP)? Why greedy? Why layer-wise? Why unsupervised? Why pretraining? What were/are the purposes of the above technique? (deep learning problem and initialization) Why does unsupervised pretraining work? When does unsupervised training work? Under which circumstances? Why might unsupervised pretraining act as a regularizer? What is the disadvantage of unsupervised pretraining compared to other forms of unsupervised learning? How do you control the regularizing effect of unsupervised pretraining? How do you select the hyperparameters of each stage of GLUP?

Monte Carlo Methods

What are deterministic algorithms? What are Las Vegas algorithms? What are deterministic approximate algorithms? What are Monte Carlo algorithms?

I will keep on adding more questions to both this list and my GitHub repository. Moreover, my plan is to add answers to these questions as well.

Disclaimer: Views expressed in this post are my personal, individual and unique perspectives, and not those of my employer.