For a number of years, startups in the area of human resource analytics (also called people analytics or talent analytics) have been making dramatic claims. These companies sell software that turns administrative data into predictions. Some of these claims have recently surfaced in Big Media again: The New York Times looks at several startups that claim they can predict which candidates are better matches for jobs, and remove human biases while doing it; The Wall Street Journal highlights some companies that use data to predict if and when employees may quit. If we are to believe these reports, several companies have already built the proverbial artificial intelligence machines that are smarter than humans.

When we see claims of this type, we notice that they are framed in very general terms. This leads us to ask two questions: What is the evidence to back them up? And given the evidence we do see, what are the real-world implications?

In over 2,000 words across the two articles, we failed to find any convincing scientific facts. For example, the Times reporter tells us “Startups … say that software can do the job more effectively and efficiently than people can.” Expecting evidence, we encounter instead the following sentence: “Many people are beginning to buy into the idea.” No justification is offered when the reporter mentions “another potential result: a more diverse workplace” but later, the CEO of Gild, which sells software for automated hiring, is quoted: “Gild finds more diverse candidates than employers typically do. In tech, it surfaces more engineers who are women and older and who come from a wider variety of colleges and socioeconomic backgrounds.” Scarcely a number can be found to quantify the impact of HR analytics.

For its part, the Journal produces a number. The head of talent acquisition and development at Credit Suisse, a bank that has invested in the new HR software, estimates that “a one-point reduction in unwanted attrition rates saves the bank $75 million to $100 million a year.” If you read that too quickly, the sentence may pass for evidence of effectiveness. Slow down, and you realize that the one-point reduction in attrition rate is mere speculation. Even if such improvement could be achieved, it most likely did not result solely from using predictive software.

In the same article, the Credit Suisse executive suggests that their new HR software helped retain some 300 employees who were promoted internally. He says, “We believe we’ve saved a number of them from taking jobs at other banks.” Such is the nature of evidence frequently marshaled for Big Data analytics: “Trust us, because we have data.” This state of affairs is particularly jarring in a field that markets itself as a “data science.” There are few if any numbers and precious little science on display.

We sense that the journalists were holding their doubts in check as they completed assignments to cover the emerging HR analytics sector. The headline in the Times, “Can an algorithm hire better than a human?”, is pointedly a question and not a statement. The person who writes “Some of the software sounds as touchy-feely as the most empathetic personnel director” probably isn’t buying the optimistic claims.

We wish these reporters trusted their instincts more. Their soft approach leaves a number of dubious claims unchallenged. Readers should be made aware that these HR problems are highly complex, administrative data are fraught with defects, and the outcomes are difficult to quantify.

Take the problem of using software in the hiring process. In the Times article, there are no fewer than three incongruous descriptions of what quantity is being predicted. The CEO of Gild says the tool predicts the “likelihood that people would be interested in a job.” Two paragraphs later, he singles out a successful placement of an engineer, disclosing that being hired was the desired outcome. This section suggests that the machine predicts the probability of getting hired. Elsewhere, a Wharton professor criticizes human recruiters for using criteria that are “not predictive of how [employees] perform down the road.”

Each of the three predictions requires a different statistical model. The model that predicts who will get hired starts with all job applicants while the model that predicts interest in a job uses a more general population. The model that predicts job performance must follow employees on their jobs, and the scientists must define clear metrics for measuring performance, which is not a simple task. Further, the model must take into account many other factors that affect job performance, such as on-boarding experience, quality of management, and performance of team members.

The Times devotes several paragraphs to the idea that using machines will result in a more diverse workforce. Software is said to be “free of human biases.” This is a false statement. Every statistical model is a composite of data and assumptions; and both data and assumptions carry biases.

The fact that the data themselves are biased may be shocking to some. Occasionally, the bias is so potent that it could invalidate entire projects. Consider those startups that are building models to predict who should be hired. The data to build such machines typically come from recruiting databases, including the characteristics of past applicants, and indicators of which applicants were successful. But this historical database is tainted by past hiring practices, which reflected a lack of diversity. If these employers never had diverse applicants, or never made many minority hires, there is scant data available to create a predictive model that can increase diversity! Ironically, to accomplish this goal, the scientists should code human bias into the software.
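A toy simulation makes the point concrete. Everything in it is an assumption of ours for illustration: the group labels, the skill score, and the two hiring bars are invented, and the "model" is nothing more than the historical hire rate among similar applicants. It is a sketch of how bias in past decisions carries over into predictions, not a description of any vendor's software.

```python
import random

def simulate_history(n=10_000, seed=0):
    """Hypothetical past applicants as (group, skill, hired) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(["majority", "minority"])
        skill = rng.random()  # same skill distribution in both groups
        # Assumed past practice: a stricter bar for the minority group.
        bar = 0.5 if group == "majority" else 0.8
        rows.append((group, skill, skill > bar))
    return rows

def learned_hire_rate(rows, group, lo=0.6, hi=0.7):
    """A crude 'model': past hire rate among applicants with similar skill."""
    similar = [hired for g, s, hired in rows if g == group and lo <= s < hi]
    return sum(similar) / len(similar)

history = simulate_history()
rate_majority = learned_hire_rate(history, "majority")
rate_minority = learned_hire_rate(history, "minority")
print(rate_majority, rate_minority)  # prints: 1.0 0.0
```

Applicants with identical skill receive opposite predictions, purely because the training data record how past recruiters behaved rather than who was qualified.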

The recent Hillary Clinton episode ought to remind us that official servers do not capture all communications. Many of the HR tools rely on tracking employee communications, computing such metrics as how many emails they send, how often they communicate outside their own teams, and so on. But when certain employees withhold certain types of messages, the data become biased. Different generations of employees prefer different communication devices, with varying levels of data collection, which also creates bias.

Humans can directly inject biases into the software. Most of the featured vendors say they analyze employee survey data. The choice of which questions to ask in these surveys, and which to leave out, determines which factors will be included in the prediction tool. For example, managers may decide that spousal harmony is too sensitive a topic to address in the employee survey, so such data will not be utilized even if they may be correlated with employee attrition.

After overcoming all the challenges of generating those predictions, scientists face a daunting task of evaluating the performance of their machines.

How should one test the claim that HR analytics software can predict the probability of an employee quitting within the next year? Presumably, the goal is to allow managers to intervene. If such actions succeeded, then the employee who was predicted to quit would not resign. If we learn that the employee did not resign within the year, was it due to the intervention or did the software produce a bad prediction in the first place?
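The difficulty can be seen in a small simulation. The setup is entirely hypothetical: we assume a model that flags exactly the employees who would otherwise quit, a 20 percent baseline quit rate, and an intervention that retains half of the flagged employees. The function name and parameters are ours.

```python
import random

def observed_precision(n=5_000, quit_rate=0.2, save_rate=0.5,
                       intervene=True, seed=1):
    """Share of flagged employees who are actually observed to quit."""
    rng = random.Random(seed)
    flagged = 0
    quit_observed = 0
    for _ in range(n):
        would_quit = rng.random() < quit_rate
        if not would_quit:
            continue  # assumed perfect model: it never flags a stayer
        flagged += 1
        # A successful intervention keeps a would-be quitter on the job.
        saved = intervene and rng.random() < save_rate
        if not saved:
            quit_observed += 1
    return quit_observed / flagged

without_action = observed_precision(intervene=False)
with_action = observed_precision(intervene=True)
print(without_action, round(with_action, 2))
```

Without intervention, every flagged employee quits and the model looks perfect. With intervention, roughly half of the flagged employees stay, so the same model appears to be wrong about half the time. The observed outcomes alone cannot separate a bad prediction from a successful intervention; that would take something like a comparison group of flagged employees who receive no intervention.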

The Gild CEO explains that his machine avoids “dismissing tons and tons of qualified people.” How should one validate this claim? An objective standard of “qualified people” does not exist, for otherwise the problem reduces to a checklist, and no sophisticated software is needed. Given the same pool of résumés, a machine and a human generate two lists of qualified candidates. The lists are created to be of equal length. If the machine favors someone on whom recruiters passed, did the recruiters miss a qualified person or did the machine make a false-positive error?
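A short sketch shows why comparing the two lists settles nothing. The résumé identifiers below are made up, and the two pick lists are arbitrary examples of equal length.

```python
# Hypothetical résumé IDs chosen by each screener; both lists have five picks.
machine_picks = {"r01", "r02", "r03", "r07", "r09"}
recruiter_picks = {"r01", "r02", "r04", "r05", "r09"}

both = machine_picks & recruiter_picks            # candidates both selected
machine_only = machine_picks - recruiter_picks    # recruiters passed on these
recruiter_only = recruiter_picks - machine_picks  # the machine passed on these

print(sorted(both))
print(sorted(machine_only), sorted(recruiter_only))
```

Because the lists are the same length, the number of candidates only the machine picked always equals the number only the recruiters picked. The disagreement is symmetric by construction, so counting it cannot tell us which side erred. That requires an outside measure of who was in fact qualified, which is exactly what is missing.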

There is certainly some value in the data that can enhance current human-resources practices. Successful approaches require clear formulation of the goals, careful vetting of the data, and most importantly, a scientific approach to measuring the effectiveness of such tools. Those are the messages we’d have liked to see in the media reporting on HR analytics.

Andrew Gelman and Kaiser Fung are statisticians who deal with uncertainty every working day. In Statbusters they critically evaluate data-based claims in the news, and they usually find that the real story is more interesting than the hype.