I received the following question from an education researcher:

I was wondering if I could ask you a question about an HLM model I’m working on. The basic design is that we have 5 years of 8th grade student achievement data (standardized test scores, this is the dependent variable), 4th grade test scores, demographics (e.g., gender and ethnicity), and status with respect to special ed or ELL, etc. In addition, we have some school- or second-level information such as school averages of the student information, type of school (grade configuration), enrollment, and so on. In total there are thousands of students and many schools over the 5 years of information.

The model we’re using is quite parsimonious, using only 7 student-level effects and 4 school-level effects. What’s puzzling us is that the correlation between predicted and actual is unrealistically high…r=0.999. We’re using the HPMIXED procedure in SAS but that shouldn’t matter. By dropping variables, obviously we can get the correlation to go down, but we shouldn’t have to do this in my view. It looks like we’re overfitting things but I don’t see how. Is it important that the coefficient of variation for the dependent variable is about 10? To me, this seems quite low and, coupled with a pretty narrow range of possible values (between 1 and 4.5), maybe we are overfitting?

Anyway, I hope this question is clear. We’re uncomfortable with an unrealistically good fit and are wondering how to fix it.
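
To make the two quantities in the question concrete, here is a minimal sketch of how one might compute the coefficient of variation of the outcome and the correlation between predicted and actual values. It is not the researcher's SAS code. The data are simulated and purely hypothetical, the function names are my own, and I am assuming that "about 10" means a coefficient of variation of about 10 percent.

```python
import math
import random


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return 100.0 * math.sqrt(variance) / mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical stand-in data: scores on a 1 to 4.5 scale with mean 3.0 and
# standard deviation 0.3, so the coefficient of variation is near 10 percent.
random.seed(0)
actual = [min(4.5, max(1.0, random.gauss(3.0, 0.3))) for _ in range(1000)]

# Hypothetical predictions: the actual score plus independent noise.
predicted = [a + random.gauss(0.0, 0.2) for a in actual]

cv = coefficient_of_variation(actual)
r = pearson_r(actual, predicted)
print(f"coefficient of variation: {cv:.1f}%")
print(f"correlation of predicted and actual: {r:.3f}")
```

One thing the sketch makes easy to check: the Pearson correlation does not change if either variable is shifted or rescaled by a positive constant, whereas the coefficient of variation does change when a constant is added to the outcome.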