Which of the most common Duolingo languages are hardest to learn?

Brendan Tomoschuk

I am a longtime user and fan of Duolingo, a platform for learning new languages free of charge. And as a PhD student researching language learning and bilingualism, I have long awaited the opportunity to get my hands on any of the data they accumulate. Well, I finally have. Burr Settles, of the Duolingo team, published a paper in the ACL proceedings discussing spaced repetition in Duolingo data, and with that publication they shared their code and data on GitHub. So I'm going to play with it a bit! Here's what I want to know: which of the most common Duolingo languages are hardest to learn?

Clean the data

So, before doing anything else, let's load some relevant libraries and take a peek at the data.

library(data.table)
library(ggplot2)
library(Rmisc)
library(stringr)
library(stringdist)
library(lme4)
library(SnowballC)

# Data can be found here: https://github.com/duolingo/halflife-regression
# fread is a faster means of loading a big dataset, which we know this will be
data.raw = fread('bigset.csv')

## Read 12854226 rows and 12 (of 12) columns from 1.219 GB file in 00:00:17

And let’s take a brief look at the data…

head(data.raw)

##    p_recall  timestamp    delta user_id learning_language ui_language
## 1:      1.0 1362076081 27649635    u:FO                de          en
## 2:      0.5 1362076081 27649635    u:FO                de          en
## 3:      1.0 1362076081 27649635    u:FO                de          en
## 4:      0.5 1362076081 27649635    u:FO                de          en
## 5:      1.0 1362076081 27649635    u:FO                de          en
## 6:      1.0 1362076081 27649635    u:FO                de          en
##                           lexeme_id                    lexeme_string
## 1: 76390c1350a8dac31186187e2fe1e178 lernt/lernen<vblex><pri><p3><sg>
## 2: 7dfd7086f3671685e2cf1c1da72796d7    die/die<det><def><f><sg><nom>
## 3: 35a54c25a2cda8127343f6a82e6f6b7d         mann/mann<n><m><sg><nom>
## 4: 0cf63ffe3dda158bc3dbd55682b355ae         frau/frau<n><f><sg><nom>
## 5: 84920990d78044db53c1b012f5bf9ab5   das/das<det><def><nt><sg><nom>
## 6: 56429751fdaedb6e491f4795c770f5a4    der/der<det><def><m><sg><nom>
##    history_seen history_correct session_seen session_correct
## 1:            6               4            2               2
## 2:            4               4            2               1
## 3:            5               4            1               1
## 4:            6               5            2               1
## 5:            4               4            1               1
## 6:            4               3            1               1

str(data.raw)

## Classes 'data.table' and 'data.frame': 12854226 obs. of 12 variables:
##  $ p_recall         : num  1 0.5 1 0.5 1 1 1 1 1 0.75 ...
##  $ timestamp        : int  1362076081 1362076081 1362076081 1362076081 1362076081 1362076081 1362076081 1362082032 1362082044 1362082044 ...
##  $ delta            : int  27649635 27649635 27649635 27649635 27649635 27649635 27649635 444407 5963 5963 ...
##  $ user_id          : chr  "u:FO" "u:FO" "u:FO" "u:FO" ...
##  $ learning_language: chr  "de" "de" "de" "de" ...
##  $ ui_language      : chr  "en" "en" "en" "en" ...
##  $ lexeme_id        : chr  "76390c1350a8dac31186187e2fe1e178" "7dfd7086f3671685e2cf1c1da72796d7" "35a54c25a2cda8127343f6a82e6f6b7d" "0cf63ffe3dda158bc3dbd55682b355ae" ...
##  $ lexeme_string    : chr  "lernt/lernen<vblex><pri><p3><sg>" "die/die<det><def><f><sg><nom>" "mann/mann<n><m><sg><nom>" "frau/frau<n><f><sg><nom>" ...
##  $ history_seen     : int  6 4 5 6 4 4 4 3 8 6 ...
##  $ history_correct  : int  4 4 4 5 4 3 4 3 6 5 ...
##  $ session_seen     : int  2 2 1 2 1 1 1 1 6 4 ...
##  $ session_correct  : int  2 1 1 1 1 1 1 1 6 3 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Ok, wow. We have almost 13 million datapoints from over 115 thousand users learning 6 languages, as well as information about every word learned. Let’s look more at what’s in this dataset.

Each line of the dataset is a word for a given user, for a given session. So the first line is the word "lernt" (seen in the lexeme_string) for user u:FO in some session of German. In this particular session they've seen the word twice (session_seen) and gotten it right twice (session_correct). Before this session they'd seen the word 6 times (history_seen) and gotten it right 4 times (history_correct).
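As a side note, p_recall appears to be just session_correct divided by session_seen. A quick sanity check, using numbers copied from the first four rows above (my own sketch, not part of the original analysis):

```r
# Session counts from the first four rows of head(data.raw)
session_seen    <- c(2, 2, 1, 2)
session_correct <- c(2, 1, 1, 1)

# Per-session accuracy: matches the p_recall column (1.0, 0.5, 1.0, 0.5)
session_correct / session_seen
## [1] 1.0 0.5 1.0 0.5
```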

The lexeme_string variable has a lot of juicy information. First we see the surface form, which is the word in question as it actually appears. After the /, we see the lemma, the base form of the word (not inflected for person, tense, or anything like that). The first set of <> gives the part of speech, and the remaining tags tell us how the word is modified: lernt is a lexical verb, present tense, third person, singular (in that order).
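To make that structure concrete, here's a rough sketch (my own, not from the paper's code) of pulling those pieces out of a single lexeme_string with base R:

```r
lex <- "lernt/lernen<vblex><pri><p3><sg>"

surface <- sub("/.*$", "", lex)                 # everything before the /
lemma   <- sub("^.*/([^<]+)<.*$", "\\1", lex)   # between the / and the first <
tags    <- regmatches(lex, gregexpr("<[^>]+>", lex))[[1]]  # all the <...> tags

surface  ## [1] "lernt"
lemma    ## [1] "lernen"
tags     ## [1] "<vblex>" "<pri>" "<p3>" "<sg>"
```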

Because this dataset is massive, and R isn't especially memory-efficient with data this size, we need to clean it up and add some variables in a way that's as memory- and time-efficient as possible.

# Removing all lines with NA data, and keeping only English-UI users for simplicity
data.raw = data.raw[complete.cases(data.raw),]
data.raw = data.raw[data.raw$ui_language == "en"]

# Getting rid of variables that mean nothing to us
data.raw$timestamp = NULL
data.raw$lexeme_id = NULL

So we’ll be focusing on English speaking users learning German (de), Spanish (es), French (fr), Italian (it) and Portuguese (pt).

First, I want to find the total number of times a person has seen a word, and gotten it correct, across sessions. To do this I need to find the highest value of history_seen (for each word and each person) and remove all rows that aren't that maximum. This leaves us with each person's last session for a given word. We can then add that final session's counts to the history counts to calculate a total for each word.

# Create a temporary factor that combines user_id and lexeme_string
data.raw$temp = as.factor(paste(data.raw$user_id, data.raw$lexeme_string, sep = "_"))

# Keep only the rows with the max history_seen for each user/word pair
data.reduced = data.raw[data.raw[, .I[history_seen == max(history_seen)], by = temp]$V1]

# Create total variables
data.reduced$total_seen = data.reduced$history_seen + data.reduced$session_seen
data.reduced$total_correct = data.reduced$history_correct + data.reduced$session_correct

# This one was especially fun to name
data.reduced$total_recall = data.reduced$total_correct / data.reduced$total_seen

Next we'll aggregate over subjects: by averaging every subject's responses to each lexeme_string, we shrink the dataset to a single average value per lexeme string. We lose some variability in the averaging, but the dataset is so big that this shouldn't be a problem.

data.reduced = data.frame(aggregate(cbind(total_seen, total_correct, total_recall) ~ lexeme_string + learning_language,
                                    data = data.reduced, mean))

# Make learning_language a factor
data.reduced$learning_language = as.factor(data.reduced$learning_language)

# Peek
hist(data.reduced$total_recall)

So this is definitely pretty skewed, but nothing strange considering Duolingo is designed to get people to high levels of recall over time.

Add a lemma variable

That aggregation makes our data MUCH more manageable, reducing it to about 1% of its original size. Now we'll add a lemma column. The lexeme_string contains the information about the word that we want to extract. While the first part, the surface form, isn't that important to us, the lemma is: it's the base word we'll be working with, sitting after the / and before the part of speech marked by <. So we'll simply tell R to check every lexeme string and extract the characters after the / and before the first <.

# This removes all information before the lemma
data.reduced$lexeme_string = gsub("^.*?/", "/", data.reduced$lexeme_string)
data.reduced$lemma = substr(data.reduced$lexeme_string, 2,
                            as.numeric(lapply(gregexpr('<', data.reduced$lexeme_string), head, 1)) - 1)

Add item and cognate status variables

Now I want to add a column that represents each word's meaning in the same language. For example, it's better for analysis if I can represent the word chien as a vector containing the semantic information and the language (something like <dog, fr>), so that when I compare it to cane (noted as <dog, it>), I know they're the same item. For this I extracted all of the unique lemmas and fed them through Google Translate. I reupload that as a .csv here. (Using Google's API to get translations directly isn't a free service, but translating a few thousand words in the browser is, so it's a slightly more time-intensive solution than I'd prefer, but it gets the job done.)

# CSV containing all of our translations
trans = read.csv('translations.csv', encoding = "UTF-8")

# Add a column combining learning_language and lemma, so that we can match the two documents together
data.reduced$ll_lemma = paste(data.reduced$learning_language, data.reduced$lemma, sep = "_")
trans$ll_lemma = paste(trans$learning_language, trans$lemma, sep = "_")

# We'll add the actual item column in a minute

Additionally, I think cognate status could impact learning. A cognate is a word that is the same, or very similar, between two languages (like animal in Spanish and English). We know that cognate status improves word learning, so I wanted to implement a simple measure of it using Levenshtein distance.

# Cognate status
# stringsim rescales edit distance (the minimal number of deletions, insertions,
# and substitutions that change one word into another) into a 0-1 similarity score
trans$cognatestatus = stringsim(as.character(trans$item), as.character(trans$lemma))
data.reduced$cognatestatus = with(trans, cognatestatus[match(data.reduced$ll_lemma, ll_lemma)])

# Peek at cognatestatus
hist(trans$cognatestatus)

Cognatestatus looks pretty good: many words have next to no letter overlap (low values), and the rest spread fairly evenly across the range.
# Here I trim the endings off the translations so plural words aren't counted as different from singular words
# This is the simplest way to do that without manually editing all the translations
trans$item = wordStem(trans$item, language = "english")

# Add item category
data.reduced$item = with(trans, item[match(data.reduced$ll_lemma, ll_lemma)])
data.reduced$ll_lemma = NULL
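For some intuition about what those cognatestatus values mean, here's stringsim on a few hand-picked word pairs (illustrative examples of my own, not rows from the dataset). With the default method, the score is 1 minus the edit distance divided by the longer word's length:

```r
library(stringdist)

# Identical words score 1, near-cognates score high, unrelated words score near 0
stringsim(c("animal", "attenzione", "chien"),
          c("animal", "attention",  "dog"))
## [1] 1.0 0.8 0.0
```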

Add a part of speech variable

Now let's add a couple of simple variables that might help us capture differences in the data. I'd like to extract the part of speech of each word, found in the first set of <> in the lexeme_string.

data.reduced$pos = substr(data.reduced$lexeme_string,
                          as.numeric(lapply(gregexpr('<', data.reduced$lexeme_string), head, 1)) + 1,
                          as.numeric(lapply(gregexpr('>', data.reduced$lexeme_string), head, 1)) - 1)

Now I'd like to simplify the part of speech variable: it would be too overwhelming to compare every category to every other one. Categories like nouns have big Ns, but verbs, adjectives, and other categories are broken down into lots of subsections that I want to aggregate together. To do this I took the lexeme_reference.txt found on Duolingo's GitHub and edited it to make simpler categories. All of the categories were distilled into nouns, verbs, function words (like the, on, with, etc.), and "describers", a category I made up to cover things like adjectives and adverbs.

lexref = read.csv('lexeme_reference.csv')

# Add simplePos based on the POS from the lexeme reference guide
data.reduced$simplePos = with(lexref, Type[match(data.reduced$pos, pos)])
data.reduced = data.reduced[complete.cases(data.reduced),]

# Remove the intricate part of speech variable
data.reduced$pos = NULL

# Remove "other" items
data.reduced = data.reduced[data.reduced$simplePos != "other",]