$\begingroup$

I have a dataset of student scores in an exam. Its structure is as follows:

Group Subgroup StudentID Score(4 Categories) AssessmentDate 001 001 00001 1 20170401 001 002 00002 2 20170401

There are nearly 10000 students, divided into 10 groups, each having 20 subgroups or so. Different raters were assigned to groups or subgroups, so there may be systematic bias among groups (and / or subgroups in them). The exam is carried out four times in total. Each student participated in all the four assessments.

I want to know if the exam performed well in distinguishing students. For example, do all the students tend to have the same score on average? Are different students graded statistically different scores? A well-designed exam should not let the score concentrate on one point (indistinguishable), even if students all did an excellent job.

My knowledge tells me that t-test seems to be a fit. However, the t-test I learned only deals with two series, whereas there as thousands of students. And the observations of each student are limited (only 4). Another difficulty is that the data are categorical.