Evidence-Based Policy “Lite” Won’t Solve U.S. Social Problems: The Case of HHS’s Teen Pregnancy Prevention Program

Highlights:

In this report, we discuss the U.S. Department of Health and Human Services’ (HHS) Teen Pregnancy Prevention (TPP) program—a federal grant program that, by law, allocates approximately 75 percent of its funds to program approaches (i.e., models) identified as “evidence based.”

A great strength of TPP is that, since the program’s 2009 enactment, it has required rigorous evaluations of the models it funds to determine which are effective in preventing teen pregnancy.

Unfortunately, within TPP’s initial grants funded over 2010-2014, only four of 24 rigorously-evaluated models (17 percent) were found to reduce teen pregnancy rates.

The reason for the low hit rate, we believe, is that HHS unwittingly undermined TPP’s success by using weak evidence standards to identify the “evidence-based” models eligible for TPP funding—a problem that we noted in 2010 would likely result in many disappointing findings for TPP-funded grants.

By contrast, the U.S. Education Department’s Investing in Innovation (i3) Fund, by using strong evidence standards, achieved much greater success: Positive impacts for 40 to 50 percent of its larger grants.

Evidence-based policy “lite”—i.e., allocating funds based on weak evidence standards—is unlikely to produce the hoped-for progress. A much sharper focus on building, identifying, and scaling models with strong evidence of effectiveness is needed to move the needle on U.S. social problems.

In this report, we discuss the U.S. Department of Health and Human Services’ (HHS) Teen Pregnancy Prevention (TPP) program—a federal evidence-based grant program that was enacted in 2009 and is funded at about $100 million per year. By law, approximately 75 percent of TPP funds are allocated for program approaches/strategies (i.e., models) that are “evidence based;” about 25 percent are allocated for the development and testing of new and innovative models; and most grantees are required to participate in a rigorous, independent evaluation to measure their model’s impact on teen pregnancy and related behavioral outcomes.

TPP is an excellent example of evidence-based policymaking in one important respect: It shows how a government spending program can use rigorous evaluations to determine which program approaches work—and which don’t work—to achieve the program’s objectives (in this case, to prevent teen pregnancy). In the first cohort of TPP grants, funded from 2010 to 2014, HHS supported successful evaluations of 24 models[1]—in most cases, high-quality randomized controlled trials (RCTs), in which youth were randomly assigned to either a treatment group that received model services, or a control group that did not. TPP thus represents an important departure from government by guesswork—the typical spending approach in which government programs allocate funds to various projects without conducting or requiring rigorous evaluations, so no one ever learns which projects were truly effective and which were not.

But in another respect, we believe TPP has fallen short of its potential to use rigorous evidence to move the needle on teen pregnancy rates. Based on our careful review of the final evaluation reports:

Only one of the 24 rigorously-evaluated models, Teen Options to Prevent Pregnancy (TOPP), was found to produce sizable, statistically-significant effects on teen pregnancy rates: a reduction in adolescent mothers’ repeat pregnancies from 39 percent in the control group to 21 percent in the treatment group (p=0.01).

Three other models were found to produce modest effects that approached, but did not reach, statistical significance at conventional levels (0.05). The Safer Sex model reduced the teen pregnancy rate from 19 percent in the control group to 16 percent in the treatment group; the Positive Prevention PLUS model reduced the rate from 3 percent to 2 percent; and the AIM 4 Teen Moms model reduced [2]

The remaining 20 models were found to produce weak or no effects on teen pregnancy or related behavioral outcomes. HHS, in its summary of the evidence, says that eight of these models were found to produce at least one statistically-significant positive effect on a targeted behavioral outcome. While this is technically true, in these eight cases the effects were of little practical significance (e.g., because the study found a significant effect on sexual behavior shortly after program completion that faded within the following six months;[3] or found a significant effect at only one site in a multi-site RCT whose main finding was of no significant effects across the pooled multi-site sample[4]).

In short, despite TPP’s focus on funding “evidence-based” models, the program’s rate of positive or promising impact findings – four out of 24 models (17 percent) – is no better than the usual 10-20 percent success rate in high-quality RCTs in social policy, medicine, and business [a],[b],[c]. Furthermore, within the overall group of 24 models, the success rate for those models HHS had earlier identified as “evidence-based” (i.e., 13 percent) was slightly lower than the success rate for “new and innovative” models (19 percent).
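The percentages quoted above can be checked with simple arithmetic. The short Python sketch below is illustrative only: the function and variable names are our own, and it uses only the rates stated in the text. Because the text does not give sample sizes, it does not attempt any significance test.

```python
# Illustrative check of the figures quoted in the text.
# All names here are hypothetical; only the numbers come from the report.

def relative_reduction(control_pct, treatment_pct):
    """Drop from the control rate to the treatment rate, as a share of the control rate."""
    return (control_pct - treatment_pct) / control_pct

models_evaluated = 24
models_positive_or_promising = 4  # TOPP plus the three models with p = 0.07

hit_rate = models_positive_or_promising / models_evaluated
print(f"TPP hit rate: {hit_rate:.0%}")

# (control %, treatment %) pairs as reported for each model
reported_rates = {
    "TOPP (repeat pregnancies)": (39, 21),
    "Safer Sex": (19, 16),
    "Positive Prevention PLUS": (3, 2),
}

for name, (control, treatment) in reported_rates.items():
    points = control - treatment
    share = relative_reduction(control, treatment)
    print(f"{name}: {points} percentage points, {share:.0%} relative reduction")
```

Expressing each result both in percentage points and as a relative reduction makes it easier to see why the TOPP effect is described as sizable while the other three are described as modest.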

Why did so few of the “evidence-based” models produce the hoped-for effects on teen pregnancy outcomes?

The likely answer, we believe, is that HHS unwittingly undermined TPP’s success by adopting weak evidence standards to designate models as “evidence based.” As we wrote in 2010, when HHS published its initial list of 28 models that were eligible for award under TPP’s evidence-based funding tier:

“Only 2 [of the 28 models] are backed by strong evidence of a sustained effect on teen pregnancy … The other 26 models are backed by preliminary evidence…[in most cases] randomized controlled trials or quasi-experimental studies showing only short-term effects on intermediate outcomes such as condom use and number of sexual partners, but not the final, most policy-relevant outcomes (pregnancies, births, sexually-transmitted disease). When interventions backed by such preliminary evidence are evaluated in more definitive RCTs with longer-term follow-up, sometimes they are found to produce impacts on the final, policy-relevant outcomes, but too often they are not. “Bottom line: Although the [TPP] program is a major step forward compared to previous efforts, much of its funding will still likely support activities that do not have a meaningful impact on teen pregnancy.”

Another federal evidence-based program launched in 2010—the Department of Education’s Investing in Innovation (i3) Fund—illustrates that greater success is possible. In its initial rounds of funding, i3 made its largest (“scale-up”) grants to only four educational program models that met a high evidence bar, namely:

“Strong evidence[5] … that the proposed practice, strategy, or program will have a statistically significant effect on improving student achievement or student growth, closing achievement gaps, decreasing dropout rates, increasing high school graduation rates, or increasing college enrollment and completion rates, and that the effect of implementing the proposed practice, strategy, or program will be substantial and important.”

Note that, in contrast to the TPP approach, the i3 evidence standard listed outcomes that are all of clear policy importance, and required that the effects on these outcomes be “substantial and important,” not merely statistically significant.

i3 also awarded mid-sized (“validation”) grants to 15 models that met a standard for moderate evidence, which required substantial and important effects on the above outcomes but allowed studies with limitations (e.g., small samples) to qualify. And i3 awarded smaller (“development”) grants, to 48 models with high potential but only preliminary evidence. Like TPP, i3 required funded projects in all its evidence tiers to be rigorously evaluated, with RCTs used for most scale-up and validation grants.

The final results of those i3 evaluations are now in. A careful review of the studies by the Institute of Education Sciences determined that 49 were high-quality impact evaluations,[6] with the following overall findings:

50 percent of the models receiving scale-up grants (average size $49 million) produced a statistically-significant positive effect on a primary, pre-specified student academic outcome;[7]

43 percent of the models receiving validation grants (average size $19 million) produced such an effect;

13 percent of the models receiving development grants (average size $4 million) produced such an effect.

In short, i3—by using truly rigorous evidence standards to select its larger grantees—succeeded where TPP did not in focusing a sizable portion of its funding on models that produced positive impacts.

TPP could do the same. Its latest funding announcement includes the same weak evidence standards as in the past. Why not instead focus its largest grant awards on replicating models that meet a high evidence standard: Rigorous evidence of substantial and important effects on actual teen pregnancy rates? These models might include, for example:

TPP could then take another page out of the i3 playbook and make (i) mid-sized grants to models whose prior evidence suggests they might plausibly reduce pregnancy rates (e.g., because the studies found especially large effects on self-reported risky sexual behavior); and (ii) small grants for the development and testing of a large number of additional models backed by preliminary evidence or compelling logic. If TPP continues requiring rigorous evaluations of its funded projects, and uses the results to continually update the models with strong evidence that qualify for the largest grants, TPP will, over time, become an increasingly potent force for reducing teen pregnancy rates nationally.

But TPP’s experience to date should serve as a warning to other government and nonprofit initiatives seeking to use evidence to address social problems: Evidence-based policy “lite”—i.e., allocating funds based on weak evidence standards—is unlikely to produce the hoped-for progress. A much sharper focus on building, identifying, and scaling models with strong evidence of substantial and important impacts is needed to move the needle on the nation’s problems.

References:

[1] HHS’s Office of Adolescent Health found that 28 studies, evaluating a total of 24 models, (i) met the HHS evidence review standards for a moderate or high rating for research quality (e.g., based on low or corrected-for sample attrition, baseline equivalence of the treatment and control or comparison groups, strong contrast in services received by these groups, and appropriate study power to detect the intended effects); and (ii) evaluated the model as implemented successfully (e.g., with fidelity to the program model, and high dosage). Farb and Margolis 2016.

[2] In all three of these cases, the p-value for the effect on the teen pregnancy rate was 0.07.

[3] Final evaluation reports for Love Notes, Reducing the Risk-adaptation, and Crossroads

[4] Final evaluation report for Reducing the Risk

[5] “Strong evidence” was defined as (i) more than one well-designed and well-implemented RCT or well-designed and well-implemented quasi-experimental study; or (ii) one large, well-designed and well-implemented multisite RCT.

[6] These were the 49 i3 impact evaluations (from a larger pool of 67) that the Institute of Education Sciences rated as meeting What Works Clearinghouse (WWC) evidence standards with or without reservations. These were unofficial ratings since only the WWC can officially identify a study as meeting WWC standards.

[7] We carefully reviewed the positive findings in all three of i3’s evidence tiers—as we did for the TPP evaluation results—to make sure that they were unambiguously positive (e.g., were not a positive finding in an interim study follow-up that faded out at the final follow-up).

[8] HHS funded a substantively-different version of the Carrera program in which children entered the program at average age 11 versus age 14 in the original program model. A key factor that appeared to drive the impacts on pregnancy in the original model—greater use of long-acting, reversible contraceptives in the treatment versus control group—was not a main element of the program as adapted for delivery to younger children.