Pearson, which has a five-year, $468 million contract to create the state’s tests through 2015, uses “item response theory” to devise standardized exams, as other testing companies do. Using I.R.T., developers select questions based on a model that correlates students’ ability with the probability that they will get a question right.
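To make that relationship concrete, here is a minimal sketch of one common form of item response model, the two-parameter logistic curve. The article does not say which I.R.T. model Pearson uses, so the model choice, the function names and the parameter values below are illustrative assumptions, not a description of how the state's tests are built.

```python
import math


def prob_correct(ability, difficulty, discrimination=1.0):
    """Probability of a correct answer under a two-parameter logistic model.

    All three parameters are hypothetical illustrations: `ability` is the
    student's latent skill, `difficulty` is the skill level at which the
    chance of a correct answer is 50 percent, and `discrimination` controls
    how sharply that chance rises with ability.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def item_information(ability, difficulty, discrimination=1.0):
    """Fisher information of one item at a given ability level.

    Higher values mean the item does more to separate students whose
    ability is near this level.
    """
    p = prob_correct(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


if __name__ == "__main__":
    # Compare a weakly and a strongly discriminating question (made-up values).
    for a in (0.5, 2.0):
        for theta in (-1.0, 0.0, 1.0):
            print(a, theta, round(prob_correct(theta, 0.0, a), 3),
                  round(item_information(theta, 0.0, a), 3))
```

In this sketch, a question with higher discrimination separates students of nearby ability more sharply. Favoring such questions is one way a test could end up emphasizing rank order, the property Mr. Stroup describes.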

That produces a test that, Mr. Stroup said, is more sensitive to how students rank against one another than to what they have learned. That design flaw also explains why Richardson students’ scores on the previous year’s TAKS test were a better predictor of performance on the next year’s TAKS test than the benchmark exams were, he said. The benchmark exams were developed by the district, the TAKS by the testing company.

Mr. Stroup, who is preparing to submit the findings to multiple research journals, presented them in June at a meeting of the Texas House Public Education Committee. He said he was aware of their implications for a widely used and accepted method of developing tests, and for how the state evaluates public schools.

“I’ve thought about being wrong,” Mr. Stroup said. “I’d love if everyone could say, ‘You are wrong, everything’s fine.’ But these are hundreds and hundreds of numbers that we’ve run now.”

Gloria Zyskowski, the deputy associate commissioner who handles assessments at the Texas Education Agency, said in a statement that the agency needed more time to review the findings. But she said that Mr. Stroup’s comments in June reflected “fundamental misunderstandings” about test development and that there was no evidence of a flaw in the test.