In the realm of public policy, we live in an age of numbers. To hold teachers accountable, we examine their students’ test scores. To improve medical care, we quantify the effectiveness of different treatments. There is much to be said for such efforts, which are often backed by cutting-edge reformers. But do we hold an outsize belief in our ability to gauge complex phenomena, measure outcomes and come up with compelling numerical evidence? A well-known quotation usually attributed to Einstein is “Not everything that can be counted counts, and not everything that counts can be counted.” I’d amend it to a less eloquent, more prosaic statement: Unless we know how things are counted, we don’t know if it’s wise to count on the numbers.

The problem isn’t with statistical tests themselves but with what we do before and after we run them. First, we count if we can, but counting depends a great deal on previous assumptions about categorization. Consider, for example, the number of homeless people in Philadelphia, or the number of battered women in Atlanta, or the number of suicides in Denver. Is someone homeless if he’s unemployed and living with his brother’s family temporarily? Do we require that a woman self-identify as battered to count her as such? If a person starts drinking day in and day out after a cancer diagnosis and dies from acute cirrhosis, did he kill himself? The answers to such questions significantly affect the count.

Second, after we’ve gathered some numbers relating to a phenomenon, we must reasonably aggregate them into some sort of recommendation or ranking. This is not easy. By appropriate choices of criteria, measurement protocols and weights, almost any desired outcome can be reached. Consider those ubiquitous articles with titles like “The 10 Friendliest Colleges” or “The 20 Most Lovable Neighborhoods.” Such articles would be more than fluff if they answered critical questions. Are there good reasons the authors picked the criteria they did? Why did they weigh the criteria in the way they did? If changes in the criteria were made, would the rankings of the friendliest colleges or most lovable neighborhoods be vastly different?
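The point about weighting can be made concrete with a toy calculation. Here is a minimal sketch, using entirely hypothetical data for two fictional colleges, showing how the choice of criterion weights alone can reverse a ranking:

```python
# Hypothetical scores (0-10) for two fictional colleges on two criteria.
scores = {
    "College A": {"social_events": 9, "small_classes": 4},
    "College B": {"social_events": 4, "small_classes": 9},
}

def rank(weights):
    """Return colleges ordered by weighted score, highest first."""
    def total(s):
        return sum(weights[c] * s[c] for c in weights)
    return sorted(scores, key=lambda name: total(scores[name]), reverse=True)

# Weighting social events heavily puts College A on top:
#   A: 0.8*9 + 0.2*4 = 8.0    B: 0.8*4 + 0.2*9 = 5.0
print(rank({"social_events": 0.8, "small_classes": 0.2}))

# The opposite weighting puts College B on top:
#   A: 0.2*9 + 0.8*4 = 5.0    B: 0.2*4 + 0.8*9 = 8.0
print(rank({"social_events": 0.2, "small_classes": 0.8}))
```

Nothing about either college changed between the two lines; only the weights did, and the “friendliest” school flipped.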

Since the answer to the last question is usually yes, the problem of reasonable aggregation is no idle matter. Recently released e-mail messages from employees at the credit-rating agency Standard & Poor’s indicated a wish to “discuss adjusting criteria” for rating securities and “massage the subprime and alt-A numbers to preserve market share.” The criteria adjustments were analogous to the adjustments that would put an area of abandoned buildings onto the list of the most lovable neighborhoods.