The popularity of online courses with open access and unlimited student participation, the so-called massive open online courses (MOOCs), has been growing rapidly. Students, professors, and universities all have an interest in accurate measures of students' proficiency in MOOCs. However, these measurements face several challenges: (a) assessments are dynamic: items can be added, removed, or replaced by a course author at any time; (b) students may be allowed to make several attempts within one assessment; (c) assessments may include too few items for accurate individual-level conclusions. Therefore, the common psychometric models and techniques of Classical Test Theory (CTT) and Item Response Theory (IRT) are not well suited to measuring proficiency in this setting. In this study we aim to fill this gap and propose cross-classification multilevel logistic extensions of the common IRT model, the Rasch model, that improve the assessment of a student's proficiency by modeling the effect of attempts and by incorporating non-assessment data such as the student's interaction with video lectures and practical tasks. We illustrate these extensions on logged data from one MOOC and check their quality using a cross-validation procedure on three MOOCs. We found that (a) performance changes over attempts depend on the student: whereas performance improves over attempts for some students, it may deteriorate for others; (b) similarly, the change over attempts varies over items; (c) a student's activity with video lectures and practical tasks is a significant predictor of response correctness, in the sense that higher activity leads to higher chances of a correct response; (d) the overall accuracy of predicting students' item responses using the extensions is 6% higher than using the traditional Rasch model. In sum, our results show that the approach improves assessment procedures in MOOCs and could serve as an additional source for accurate conclusions about students' proficiency.

1. Introduction

Massive open online courses (MOOCs) are a recent progressive phenomenon in education. A MOOC is an online course with open access and participation of an unlimited number of students. A MOOC typically consists of pre-recorded video lectures, reading assignments, assessments, and forums. MOOCs are mainly developed by universities and run on platforms such as Coursera, edX, XuetangX, FutureLearn, Udacity, and MiriadaX. In 2017, more than 800 universities offered students more than 9,400 MOOCs (Shah, 2018). The same year, the largest MOOC provider, Coursera, achieved the milestones of 30 million students and 2,700 courses (Shah, 2017).

Students, professors, and universities – the key parties involved in MOOCs – have an interest in accurate measurement of student proficiency. Students take an online course and want to study efficiently. Proficiency measurement locates a student's position in the course, helps him/her identify strong and weak points, and maps areas that need additional work. Professors and their teams develop and optimize the course content. At this stage, aggregated proficiency measures show to what degree the content fosters learning and suggest improvements to video lectures, practical tasks, and support materials. Finally, universities award online course certificates to students. Proficiency measures can provide evidence of whether a student has mastered the course.

Proficiency is, however, a latent variable. We cannot observe it directly; yet we can observe, for example, a student's performance on assessment items. To link the observable side to the latent side, we need specific rules. These rules constitute a measurement, or psychometric, theory. Today, there exist two major psychometric theories – Classical Test Theory (CTT) and Item Response Theory (IRT) – and dozens of models based on them.

CTT appeared around 100 years ago (Borsboom, 2005). It includes three main concepts – test score (observed), true score (latent), and error score (latent) (Lord and Novick, 1968). The theory introduces a simple linear model that links the observable test score to the sum of two latent variables, true score and error score, that is,

$$Y_j = \theta_j + \varepsilon_j. \tag{1}$$

According to this model, the test score of the $j$'th student ($Y_j$) is the result of his/her proficiency ($\theta_j$) plus a random measurement error ($\varepsilon_j$). The error term ($\varepsilon_j$) has an expected value of zero, is assumed to be normally distributed, and is unrelated to the proficiency: $E(\varepsilon_j) = 0$, $\varepsilon_j \sim N(0, \sigma_\varepsilon^2)$, and $\rho_{\varepsilon\theta} = 0$. Thus, the expected value of $Y_j$, $E(Y_j)$, is $\theta_j$. As a result, the average score is normally distributed around $\theta_j$ with variance $\sigma_\varepsilon^2/n$, with $n$ being the number of observations. Hence, the more observations, the closer the average score generally is to the proficiency.
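To make the convergence argument concrete, here is a minimal simulation sketch of Eq. (1); the values chosen for $\theta_j$ and $\sigma_\varepsilon$ are hypothetical, picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

theta_j = 0.7     # hypothetical true score of student j
sigma_eps = 0.3   # hypothetical standard deviation of the error score

# Simulate n observed scores Y_j = theta_j + eps_j, eq. (1), and average them:
# the average is distributed around theta_j with variance sigma_eps**2 / n.
for n in (5, 50, 500):
    scores = theta_j + rng.normal(0.0, sigma_eps, size=n)
    print(f"n = {n:3d}: average score = {scores.mean():.3f}")
```

With growing $n$, the printed averages settle near 0.7, mirroring the shrinking variance $\sigma_\varepsilon^2/n$.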

CTT has several limitations (see Hambleton and Jones, 1993). For MOOCs, the critical limitation is the dependence of proficiency measures on test difficulty: when the test is difficult, the student receives a low proficiency estimate, and when the test is easy, a high one. This is a problem for assessments in MOOCs, since tests change relatively frequently – professors often replace or add items on the fly. Universities run the risk of making unfair certification decisions, professors receive biased information about how the learning materials function, and students get a wrong map of their strengths and weaknesses.

IRT was developed 50 years ago (Lord and Novick, 1968). It presents a class of models with a nonlinear link between a student's item responses (observable variables) and his/her proficiency (latent variable). In the basic IRT model, the Rasch model (Rasch, 1960),

$$\mathrm{Logit}(\pi_{ij} \mid \theta_j) = \ln\!\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \theta_j - \delta_i \quad \text{and} \quad Y_{ij} \sim \mathrm{Bernoulli}(\pi_{ij}), \tag{2}$$

the probability ($\pi_{ij}$) of a correct response of student $j$ to item $i$ is described by a logistic function of the difference between the student's proficiency parameter ($\theta_j$) and the item difficulty parameter ($\delta_i$). To fit the Rasch model, a marginal maximum likelihood procedure is often used to estimate the item difficulty parameters, assuming that the students are a random sample from a population in which proficiencies are normally distributed, $\theta_j \sim N(0, \sigma_\theta^2)$, while the items have fixed difficulties. Individual student parameters can be estimated afterwards using empirical Bayes procedures.
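The data-generating side of Eq. (2) is easy to simulate; the sketch below does so with illustrative, assumed sizes (1,000 students and 15 items, a typical MOOC-scale assessment length) and difficulty values.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_students, n_items = 1000, 15
theta = rng.normal(0.0, 1.0, size=n_students)  # proficiencies theta_j ~ N(0, 1)
delta = np.linspace(-2.0, 2.0, n_items)        # fixed item difficulties delta_i

# Rasch model, eq. (2): P(Y_ij = 1) = logistic(theta_j - delta_i)
logits = theta[:, None] - delta[None, :]       # (n_students, n_items) matrix
pi = 1.0 / (1.0 + np.exp(-logits))
responses = rng.binomial(1, pi)                # Y_ij ~ Bernoulli(pi_ij)

# Observed proportion correct per item falls as difficulty rises.
print(responses.mean(axis=0).round(2))
```

A response matrix of this form is the kind of input from which marginal maximum likelihood estimation would recover the $\delta_i$ and $\sigma_\theta^2$.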

In comparison to CTT, students' proficiency parameters in IRT are independent of the test difficulty. This allows comparing students even in case of partial replacement of items. However, the use of IRT in MOOCs is also challenging. Firstly, IRT requires a relatively large number of items per assessment to provide accurate proficiency measures. Kruyen et al. (2012) stated that the minimal test length for individual-level decisions is 40 items. By contrast, MOOCs often offer 15 or even fewer items in summative assessments. At the same time, MOOCs yield additional observable data, such as a student's activity with video lectures and his/her performance on practical tasks, which might be used as proficiency indicators to compensate for the lack of items in summative assessments. Secondly, the proficiency parameter in IRT, as in CTT, is typically considered constant (Lord and Novick, 1968). This contrasts with the reality of MOOCs, where students get feedback after responding to assessment items. In addition, students may be allowed to make several attempts within one assessment. If a student fails at one attempt, he/she can be provided with help information, review a video lecture, and then make a new attempt. Thus, the student's proficiency may grow with each new attempt to solve a certain item. However, better performance on later attempts does not necessarily imply growth in proficiency, as it might also appear when the student uses attempts to enumerate the item's options. The challenge is to distinguish between these kinds of growth; a simple illustration of how an attempt effect could enter the Rasch logit is sketched below.
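One way such an effect might be expressed, purely as an illustration and not as the authors' final specification, is a student-specific shift of the Rasch logit per extra attempt; the parameter $\eta_j$ below and its values are hypothetical.

```python
import numpy as np

def p_correct(theta_j, delta_i, eta_j=0.0, attempt=1):
    # Illustrative attempt-extended Rasch logit (hypothetical form):
    #   logit = (theta_j + eta_j * (attempt - 1)) - delta_i,
    # where eta_j is a student-specific change in performance per extra attempt.
    logit = theta_j + eta_j * (attempt - 1) - delta_i
    return 1.0 / (1.0 + np.exp(-logit))

# A student whose performance improves over attempts (eta_j > 0) ...
print([round(p_correct(0.0, 0.5, eta_j=0.4, attempt=a), 3) for a in (1, 2, 3)])
# ... versus one whose performance deteriorates over attempts (eta_j < 0).
print([round(p_correct(0.0, 0.5, eta_j=-0.4, attempt=a), 3) for a in (1, 2, 3)])
```

Whether such a shift reflects genuine learning or mere enumeration of options is exactly the ambiguity that must be resolved when modeling attempts.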

As can be seen from the above, the common psychometric models of CTT and IRT are not directly suited to measuring proficiency in MOOCs. At the same time, IRT is a well-elaborated and flexible framework and could serve as the basis for such measurements.

In this study we extend and tune the common IRT model, the Rasch model, for application in MOOCs. We start by proposing four extensions, which model the growth of a student's proficiency over attempts and incorporate data on the student's interaction with video lectures and practical tasks to compensate for the insufficient number of items in the assessment. Then we illustrate these extensions using data from one MOOC. Finally, we check the performance of these extensions in predicting the correctness of students' responses in summative assessments and, using a cross-validation procedure applied to data from three MOOCs, show their advantage in accuracy over the common IRT model.