How Khan Academy is using Machine Learning to Assess Student Mastery

See discussion on Hacker News and Reddit.

The Khan Academy is well known for its extensive library of over 2600 video lessons. It should also be known for its rapidly-growing set of now 225 exercises — outnumbering stitches on a baseball — with close to 2 million problems done each day.

To determine when a student has finished a certain exercise, we award proficiency to a user who has answered at least 10 problems in a row correctly — known as a streak. Proficiency manifests itself as a gold star, a green patch on teachers’ dashboards, a requirement for some badges (eg. gain 3 proficiencies), and a bounty of “energy” points. Basically, it means we think you’ve mastered the concept and can move on in your quest to know everything.

It turns out that the streak model has serious flaws.

First, if we define proficiency as your chance of getting the next problem correct being above a certain threshold, then the streak becomes a poor binary classifier. Experiments conducted on our data showed a significant difference between students who take, say, 30 problems to get a streak vs. 10 problems right off the bat — the former group was much more likely to miss the next problem after a break than the latter.

False positives is not our only problem, but also false negatives. One of our largest source of complaints is from frustrated students who lost their streak. You get 9 correct, make a silly typo, and lose all your hard-earned progress. In other words, the streak thinks that users who have gotten 9 right and 1 wrong are at the same level as those who haven’t started.

In Search of a Better Model

These findings, presented by one of our full-time volunteers Jace, led us to investigate whether we could construct a better proficiency model. We prototyped a constant acceleration “rocketship” model (with heavy gnomes that slow you down on wrong answers), but ultimately decided that a prudent first step would be to just abstract away the streak model with the notion of “fill up the bar”.

We went from displaying the user’s current streak (bug not intended; could not find another screenshot):

to this:

and when full:

This gave us greater freedom to experiment with different underlying models without disrupting the interface.

Conversations with the team led me to conceive of applying machine learning to predict the likelihood of getting the next problem correct, and use that as the basis for a new proficiency model. Basically, if we think you’re more than % likely to get the next problem correct, for some threshold , we’ll say you’re proficient.

I started off by hacking together a naive Bayes binary classifier modified to give a probability estimate. I trained this on a few days’ worth of problem logs, and initial results were promising — the most striking being that fewer problem were needed to attain the same level of accuracy.

What do I mean by accuracy? We define it as

P(\text{next problem correct} | \text{just gained proficiency})

which is just notation desperately trying to say ”Given that we just gained proficiency, what’s the probability of getting the next problem correct?”

However, naive Bayes is typically used for classification — the task of determining which discrete category a data point belongs to — rather than for regression — returning a continuous value (in our case, a probability estimate in ).

So, our full-time volunteer Jace, who is much more versed in statistics and machine learning, used R to quickly prototype and evaluate different machine learning algorithms and feature sets. R is the de-facto programming language for statistical computing and comes pre-packaged with data analysis and machine learning tools.

To evaluate the different algorithms, input features, and thresholds, we came up with some metrics to gauge for desirable characteristics:

Mean problems done to reach proficiency — ideally we like to minimize this so that students can spend less time rote grinding on problems they know well, and move on to other concepts.

— ideally we like to minimize this so that students can spend less time rote grinding on problems they know well, and move on to other concepts. — Unfortunately, this is hard to correctly measure in our offline data set due to the streak-of-10 bias: students may loosen up after they gain proficiency and spend less time on subsequent problems.

— Unfortunately, this is hard to correctly measure in our offline data set due to the streak-of-10 bias: students may loosen up after they gain proficiency and spend less time on subsequent problems. Proficiency Rate — The percent of proficiencies attained per user-exercise pair. Again, this is hard to measure because of the streak bias.

— The percent of proficiencies attained per user-exercise pair. Again, this is hard to measure because of the streak bias. Confusion matrix for predicted next problem correct — This is for comparing binary classifiers on their accuracy in predicting the outcome of any answer in a user’s response history. We build up the confusion matrix, and from that extract two valuable measures of the performance of a binary classifier.

We tested various models, including naive Bayes, support vector machines, a simple 10-out-of-last-11-correct model, and logistic regression. Based on the metrics above, we settled on…

Using Logistic Regression as a Proficiency Model

(Feel free to skip this section if you’re not technically inclined.)

Logistic regression is usually used as a classifier that gives a reasonable probability estimate of each category — exactly our requirement. It’s so simple, let’s derive it.

Let’s say we have the values of input features (eg. percent correct), and we stuff them into a vector . Let’s say we also happen to know how much each feature makes it more likely that the user is proficient, and stuff those weights into a vector . We can then take the weighted sum of the input features, plus a pre-determined constant to correct for any constant bias, and call that :

z = w_0 + \sum_{i=1}^n w_ix_i

Now if we set , we can write that compactly as a linear algebra dot product:

z = \mathbf{w}^T\mathbf{x}

Already, you can see that the higher is, the more likely the user is to be proficient. To obtain our probability estimate, all we have to do is “shrink” into the interval . We want negative values of to map into and positive values to fall in . We can do this by plugging into a sigmoid function — in particular, we’ll use the logistic function:

h(z) = \frac{1}{1+e^{-z}}

And that’s it! is the probability estimate that logistic regression spits out.

The tricky bit is in determining the values of the weight vector — that is, training logistic regression so that , aka. the hypothesis function in machine learning terminology, gives us a good probability estimate. For brevity I’ll spare you the details, but suffice to know that there are plenty of existing libraries to do that.

So that raises the question, which features did we use?

ewma_3 and ewma_10 — Exponentially-weighted moving average. This is just math-talk for an average where we give greater weight to more recent values. It’s handy because it can be implemented recursively as ewma_3 and ewma_10 respectively.

and — Exponentially-weighted moving average. This is just math-talk for an average where we give greater weight to more recent values. It’s handy because it can be implemented recursively as and respectively. current_streak — This turned out to be a rather weak signal and we’ll be discarding it in favour of other features in the future.

— This turned out to be a rather weak signal and we’ll be discarding it in favour of other features in the future. log_num_done —

— log_num_missed —

— percent_correct —

As for the proficiency threshold, we chose 94% based on our metrics.

Now for some Python code. To compute the exponentially-weighted moving average:

1 def exp_moving_avg ( self , weight ): 2 ewma = EWMA_SEED 3 4 for i in reversed ( xrange ( self . total_done )): 5 ewma = weight * self . get_answer_at ( i ) + ( 1 - weight ) * ewma 6 7 return ewma

and for the actual logistic regression hypothesis function:

1 class AccuracyModel ( object ): 2 3 # ... snip ... 4 5 def predict ( self ): 6 """ 7 Returns: the probabilty of the next problem correct using logistic 8 regression. 9 """ 10 11 # We don't try to predict the first problem (no user-exercise history) 12 if self . total_done == 0 : 13 return PROBABILITY_FIRST_PROBLEM_CORRECT 14 15 # Get values for the feature vector X 16 ewma_3 = self . exp_moving_avg ( 0.333 ) 17 ewma_10 = self . exp_moving_avg ( 0.1 ) 18 current_streak = self . streak () 19 log_num_done = math . log ( self . total_done ) 20 log_num_missed = math . log ( self . total_done - self . total_correct () + 1 ) 21 percent_correct = float ( self . total_correct ()) / self . total_done 22 23 weighted_features = [ 24 ( ewma_3 , params . EWMA_3 ), 25 ( ewma_10 , params . EWMA_10 ), 26 ( current_streak , params . CURRENT_STREAK ), 27 ( log_num_done , params . LOG_NUM_DONE ), 28 ( log_num_missed , params . LOG_NUM_MISSED ), 29 ( percent_correct , params . PERCENT_CORRECT ), 30 ] 31 32 X , weight_vector = zip ( * weighted_features ) # unzip the list of pairs 33 34 return AccuracyModel . logistic_regression_predict ( 35 params . INTERCEPT , weight_vector , X ) 36 37 # See http://en.wikipedia.org/wiki/Logistic_regression 38 @staticmethod 39 def logistic_regression_predict ( intercept , weight_vector , X ): 40 # TODO(david): Use numpy's dot product fn when we support numpy 41 dot_product = sum ( itertools . imap ( operator . mul , weight_vector , X )) 42 z = dot_product + intercept 43 44 return 1.0 / ( 1.0 + math . exp ( - z ))

There’s another interesting problem here — how do you display that probability value on the progress bar? We try to linearize the display and distribute it evenly across the bar. Since it’s 4 am right now, I’ll just give you the code for it (it’s well-commented) and won’t make any helpful explanatory graphs (unless people request it ;)).

1 class InvFnExponentialNormalizer ( object ): 2 """ 3 This is basically a function that takes an accuracy prediction (probability 4 of next problem correct) and attempts to "evenly" distribute it in [0, 1] 5 such that progress bar appears to fill up linearly. 6 7 The current algorithm is as follows: 8 Let 9 f(n) = probabilty of next problem correct after doing n problems, 10 all of which are correct. 11 Let 12 g(x) = f^(-1)(x) 13 that is, the inverse function of f. Since f is discrete but we want g to be 14 continuous, unknown values in the domain of g will be approximated by using 15 an exponential curve to fit the known values of g. Intuitively, g(x) is a 16 function that takes your accuracy and returns how many problems correct in 17 a row it would've taken to get to that, as a real number. Thus, our 18 progress display function is just 19 h(x) = g(x) / g(consts.PROFICIENCY_ACCURACY_THRESHOLD) 20 clamped between [0, 1]. 21 22 The rationale behind this is that if you don't get any problems wrong, your 23 progress bar will increment by about the same amount each time and be full 24 right when you're proficient (i.e. reach the required accuracy threshold). 25 """ 26 27 def __init__ ( self , accuracy_model , proficiency_threshold ): 28 X , Y = [], [] 29 30 for i in itertools . count ( 1 ): 31 accuracy_model . update ( correct = True ) 32 probability = accuracy_model . predict () 33 34 X . append ( probability ) 35 Y . append ( i ) 36 37 if probability >= proficiency_threshold : 38 break 39 40 self . A , self . B = exponential_fit ( X , Y ) 41 # normalize the function output so that it outputs 1.0 at the 42 # proficency threshold 43 self . A /= self . exponential_estimate ( proficiency_threshold ) 44 45 def exponential_estimate ( self , x ): 46 return self . A * math . exp ( self . B * x ) 47 48 def normalize ( self , p_val ): 49 # TODO(david): Use numpy clip 50 def clamp ( value , minval , maxval ): 51 return sorted (( minval , value , maxval ))[ 1 ] 52 53 return clamp ( self . exponential_estimate ( p_val ), 0.0 , 1.0 )

Now, until Google App Engine supports NumPy, the implementation for exponential_fit is just the derivative of the least-squares cost.

The full, uncut, unaltered code is available at our Kiln repo.

Showdown: Logistic Regression The Streak

The metrics may tell us that logistic regression wins, but being the illogical, squishy human beings that we are, we yearned for an intuitive understanding of the unique behaviour of the different models. I developed an internal tool to simulate exercise responses and visualise the prediction of different models. Here’s a highlight reel of the salient characteristics.

As expected, order matters. Both models will weigh problems done more recently higher than earlier problems. What may be surprising is the relative importance:

Logistic regression seems to care much less than streak.

Both models monotonically increase confidence the more responses of the same type they receive:

Logistic regression also recognises consistently spotty performance:

Logistic regression takes into account prior performance. So, getting lots correct is always a good thing, and you’ll be able to recover faster from a wrong answer if you were previously doing well. Contrast with the streak model, which loses all memory after a single incorrect answer.

This could also work against you. If you’ve gotten lots of wrong answers, you’ll need to do more work to convince logistic regression that you’re actually competent. This mitigates one of the issues we had with the streak, where we found that there was a significant difference in actual proficiency for those getting a streak immediately vs. after 30 problems.

Could this be overly harsh for struggling students? That’s a question we are actively investigating, and as a stopgap measure we only keep the last 20 problems as history. This compromise has an insignificant effect on logistic regression’s predictive accuracy, but it lets us sleep knowing that a student will not be damned for life if they were doing some unusual exploration and got 10 problems in a row wrong.

Due to 4 am, I don’t have an interactive demo on this page, but it won’t be hard to add it. If you would like to play with this please say so.

Results

This was a fairly large change that we, understandably, only wanted to deploy to a small subset of users. This was facilitated by Bengineer Kamen’s GAE/Bingo split-testing framework for App Engine. Crucially, it allowed us to measure conversions as a way of gathering more accurate statistics on actual usage data.

The experiment has been running for 6 days thus far with 10% of users using the new logistic regression proficiency model. Before I reveal anything else, see a screenshot of GAE/Bingo in action (from a few hours ago):

The graph above shows the results over time, so you can see when trends have stabilised.

Now what you’ve been waiting for, our current statistics (5 am PST, Nov. 2) show that, for the new model, we have:

Please enable JavaScript for pretty charts.

20.8% more proficiencies earned:

Proficiencies earned per user: 2.4205 for streak and 2.9237 for accuracy

15.7% more new exercises attempted:

New exercises attempted per user: 2.9843 for streak vs. 3.4533 for accuracy

4.4 less problems done (26% less) per proficiency:

Problems done per proficiency: 16.8 for streak vs. 12.4 for accuracy

Essentially the same accuracy at proficiency:

Accuracy - P(next problem correct | just gained proficiency): 0.951 for streak vs. 0.949 for accuracy

Higher accuracy attained among a set of 3 pre-chosen easy problems. Jace came up with this statistic to gauge any actual differences in learning. The basic idea is: If accuracy as determined by logistic regression is a good approximation of competence, then higher attained accuracies would be indicative of greater competence. Note the precipitous drop at 94% for the accuracy model — this is due to the proficiency threshold set at 94% for logistic regression, so once a user reaches that level, we tell them to move on. (A streak of 10 with no wrong answers nets an accuracy of 96.7%.)

Please enable JavaScript. Easy Problems - Distribution of Accuracy Attained [% of Users, Accuracy Attained] Streak: [[0.85, 0.1064], [0.9, 0.1052], [0.92, 0.1029], [0.94, 0.0973], [0.96, 0.0844]] Accuracy: [[0.85, 0.1207], [0.9, 0.1200], [0.92, 0.1195], [0.94, 0.1181], [0.96, 0.0272]]

Slightly higher accuracy attained for a set of 10 pre-chosen hard problems. Going above and beyond the call of duty seems much less popular here, among accuracy model participants.

Please enable JavaScript. Easy Problems - Distribution of Accuracy Attained [% of Users, Accuracy Attained] Streak: [[0.85, 0.0464], [0.9, 0.0433], [0.92, 0.0395], [0.94, 0.0326], [0.96, 0.0193]] Accuracy: [[0.85, 0.0473], [0.9, 0.0441], [0.92, 0.0407], [0.94, 0.0392], [0.96, 0.0025]]

P(do another problem | just answered incorrectly) not affected

11.7% more proficiencies earned for the hard problems

14.8% more proficiencies earned for the easy problems

In high level terms, we increased overall interest — more new exercises attempted, fewer problems done per proficiency — without lowering the bar for proficiency — P(next problem correct | just gained proficiency) was roughly the same for both groups. Further, it seemed that overall learning, as measured by the distribution of accuracies obtained, went up slightly under the new model.

Optimistically, we hypothesise that our gains are from moving students quicker off exercises they’re good at, while making them spend more time on concepts in which they need more practice. To confirm or deny this…

In the Pipeline

…we will look into truly where the new proficiencies are coming from. We are also interested in seeing if there is any variation in knowledge retention — in particular, we want to know if P(next problem correct | took a few days break) is affected.

This is just the end of the beginning for us. We wish to investigate and possibly implement:

Stochastic gradient descent for online learning of logistic regression

…which would allow adaptive models per user and per exercise. Should we bump up the proficiency threshold for users who find the exercises too easy?

On a similar note, could we define a fitness function that takes into account both accuracy and student frustration, and find the optimal time to tell the student to move on? Could this allow us to maximize student learning by maximizing accuracy across many exercises?

Model improvements. Here are some things we still need to try: Incorporate more features, such as time spent per problem, time since last problem done, and user performance on similar exercises. Experiment with non-linear feature transformations and combinations. Eg. Along with the above, apply regularization to prevent overfitting (thanks Andrew Ng and ml-class!) Train and use separate models for the first 5 problems vs. those after that.



This work to create an accurate predictor has many other applications than just to power the proficiency meter:

Determine if the user is struggling, and if so suggest a video to watch, using some hints, or taking a break.

Determine the optimal date to schedule a review for spaced repetition learning.

Tailor a custom mixed-question review session that addresses weak areas.

Stay tuned for a continuation blog post if we find more interesting results!

Obligatory Recruiting Plug

Think you can do better? Well, I agree! I’m sure you know lots of ways to improve what we did. Good news: we’re open-source and hiring!

We welcome contributors to our exercises and exercise framework on GitHub. Some of our best exercises were created by volunteers: check out this awesome derivative intuition exercise created by Bengineer Eater.

Another reason I love working at the Khan Academy is the passionate and talented team. Our lead developer, Bengineer Kamens, is committed to our productivity and well-being. He Bengineers internal refactorings, tools, and spends much of his time getting new developers up to speed. Without his Bengineering, it would not have been possible to collect all this interesting data. Also, if you ever have a question about jQuery, you could just ask John Resig here.

Do you want to make 0.1% improvements in ad click-thru rates for the rest of your life, or come with us and change the world of education?

Also, if you were wondering, we are not based in the UK, Canada, or Australia… my Canadian heritage compels me to spell “-our” and “-ise” when it’s not code. :P

Thank you all for your suggestions and feedback!

There’s some interesting discussion on Hacker News and Reddit.

Having ran the experiment for more than two weeks now, we analysed 10 days’ data to see if longer-term knowledge retention was affected. It turns out that students are slightly more likely to answer correctly after taking a break under the new model:

P(next correct | proficient and took break) = 0.861 for streak, 0.875 for accuracy P(next wrong | not proficient and took break) = 0.273 for streak, 0.283 for accuracy

These results are encouraging. It shows that the new model attempts to address one of the core problems with the streak — the variability of student success rates after taking a break — while at the same time increasing proficiency rates. Thus, we have reason to conclude that the accuracy model is just a better model of student proficiency.

This information gave us the confidence to roll out from 10% to 100% of users. We have now officially launched the logistic regression proficiency model site-wide!