Having originated as a naturalistic description of how adults help toddlers learn to solve problems (Wood, Bruner, & Ross, 1976), scaffolding has expanded into a technique used among diverse learners and in the context of many problem-centered instructional approaches (Hawkins & Pea, 1987; Hmelo-Silver, Duncan, & Chinn, 2007; Stone, 1998). Along with this expansion, many scaffolding approaches, forms, and empirical studies have emerged. For example, scaffolding now encompasses one-to-one interactions with classroom teachers (van de Pol, Volman, & Beishuizen, 2010), interaction with similarly abled peers (Pifarre & Cobos, 2010), and computer-based tools (Devolder, Van Braak, & Tondeur, 2012; Reiser, 2004). Scaffolding is used among students of diverse educational levels and demographic backgrounds (Cuevas, Fiore, & Oser, 2002; Hadwin, Wozney, & Pontin, 2005). Furthermore, scaffolding is often designed to affect knowledge and skills beyond problem-solving ability, including argumentation ability (Jeong & Joung, 2007) and deep content knowledge (Davis & Linn, 2000). Synthesizing work on this expanded conceptualization of scaffolding is important to help researchers and designers determine what works best in scaffolding among particular populations and contexts. Scaffolding synthesis work has been done, but all of it focuses on between-subjects differences (Belland, Walker, Kim, & Lefler, 2017; Belland, Walker, Olsen, & Leary, 2015; Ma, Adesope, Nesbit, & Liu, 2014; Steenbergen-Hu & Cooper, 2013, 2014; Swanson & Deshler, 2003; Swanson & Lussier, 2001; VanLehn, 2011), leaving unaddressed the important question of how much within-subject growth one might expect among average students.
In this article, we address this gap by using Bayesian network meta-analysis to synthesize pre–post growth among networks of student populations, STEM (science, technology, engineering, and mathematics) disciplines, educational levels, and assessment levels (Berger, 2013; Lumley, 2002; Mills, Thorlund, & Ioannidis, 2013).

Two techniques that can help researchers establish equivalence on response variables before the treatment is introduced are random selection and random assignment ( Higgins et al., 2011 ). But little education research incorporates true random selection and assignment of participants, leading to high risk of bias in randomization ( Higgins et al., 2011 ). Another method is to use students as their own controls through the use of a pretest that is equivalent to the posttest. Network meta-analysis allows one to synthesize pre–post differences across studies in order to make indirect comparisons between treatments that may not have been compared directly in any single study ( Lumley, 2002 ; Mills et al., 2013 ). When taking a frequentist approach to network meta-analysis, all included studies need to contain a treatment and a control condition ( Puhan et al., 2014 ). Thus, studies with multiple versions of a scaffolding treatment but no lecture control treatment cannot be included in a frequentist network meta-analysis. Taking a Bayesian approach to network meta-analysis allows researchers to include multiple treatment studies as long as each study has a treatment in common with another study ( Bhatnagar, Lakshmi, & Jeyashree, 2014 ; Goring et al., 2016 ). Furthermore, taking a Bayesian approach sets up a decision-making framework that scaffolding researchers and funders can use to indicate which contexts hold the greatest promise for scaffolding ( Jansen et al., 2011 ).
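The connectivity requirement described above can be illustrated concretely: a study is usable in a Bayesian network meta-analysis as long as its treatments link, directly or through shared comparators, into the rest of the evidence network. The following is a minimal sketch in Python, using hypothetical study data (the treatment names are illustrative, not drawn from this synthesis), that finds the connected components of such a network:

```python
from collections import defaultdict

def connected_components(studies):
    """Group treatments into connected components of an evidence network.

    studies: list of tuples, one per study, naming that study's treatments.
    Treatments within a study are directly compared; components are linked
    through treatments shared across studies."""
    adj = defaultdict(set)
    for arms in studies:
        for a in arms:
            for b in arms:
                if a != b:
                    adj[a].add(b)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        # Depth-first traversal to collect one component
        stack, comp = [node], set()
        while stack:
            t = stack.pop()
            if t in comp:
                continue
            comp.add(t)
            stack.extend(adj[t] - comp)
        seen |= comp
        components.append(comp)
    return components

# Hypothetical studies: scaffolding variant S2 is never compared with
# control directly, but connects through S1; S3 and S4 are disconnected.
studies = [("control", "S1"), ("S1", "S2"), ("S3", "S4")]
comps = connected_components(studies)
```

In this hypothetical network, S1 and S2 join the control component through their shared comparator, so both multiple-treatment studies are includable; the S3/S4 study shares no treatment with any other study and would be excluded.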

Crucial to examining scaffolding outcomes is determining whether the magnitude of pre–post gains of scaffolding depends on assessment level, defined as the nature of the learning outcome targeted by an assessment. Assessment levels include concept (ability to state definitions of basic knowledge), principles (ability to describe or use relationships between facts), and application (ability to use concept- and principles-level knowledge to address a new problem; Sugrue, 1995 ). Traditional meta-analysis indicated that scaffolding’s effect was greater when measured at the principles level than at the concept level ( Belland et al., 2017 ).

The context in which scaffolding is used can vary widely, and this variation is associated with real differences in scaffolding strategy ( Belland, 2017 ). Differences in context of use can be considered from two perspectives—the problem-centered instructional model with which scaffolding is used, and the subject matter in which the instruction is situated. Problem-centered instructional models with which scaffolding is used include project-based learning, problem-based learning, inquiry-based learning, design-based learning, case-based learning, and problem solving ( Belland, 2017 ). These models all involve addressing an ill-structured problem, but the nature of the problem and what should be produced, as well as inherent structure for student learning, varies between the models. For example, problem-based learning is the most open-ended in that students are expected to produce and argue for a conceptual solution to the problem ( Hmelo-Silver, 2004 ), while design-based learning and project-based learning constrain the solution type (e.g., video or designed product) students need to produce. The stages through which students need to progress vary according to model as well. With such variation in process and product, it is natural to question if corresponding within-subject effect sizes vary. This can be addressed through network meta-analysis.

The concept of instructional scaffolding originated in describing one-to-one interactions with an ever-present tutor ( Wood et al., 1976 ). Soon, researchers began to think about how the technique could be leveraged in other settings. One such way was one-to-one interactions from a classroom teacher who provided individualized help as students engaged with problems ( van de Pol et al., 2010 ). Scaffolding is now used in the context of many instructional approaches, including project-based learning, problem-based learning, inquiry-based learning, and design-based learning ( Belland, 2017 ). At the center of each is an ill-structured problem, defined as a problem that does not have just one correct solution, and which has multiple solution paths ( Jonassen, 2011 ). To address such a problem, it is necessary to represent the problem qualitatively so as to recognize the critical factors and how they interact ( Jonassen, 2003 ). Still, each problem-centered approach involves a different set of expectations, both in terms of process and product. For example, in design-based learning, students iterate designs that address the central problem (e.g., levee to prevent beach erosion of barrier islands; Kolodner et al., 2003 ); meanwhile, in inquiry-based learning, students pose and address their own questions (Keys & Bryan, 2001).

The goal of Bayesian network meta-analysis is to model a network of evidence pertaining to scaffolding treatments and common treatments—sometimes lecture-based controls and sometimes other scaffolding treatments. Because not all scaffolding treatments will have been compared directly with control, it does not make sense to calculate a two-node network computing one effect size estimate for all scaffolding treatments versus control ( Lumley, 2002 ).
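The indirect comparisons that such a network enables follow the standard logic of adjusted indirect comparison: when treatments A and B have each been compared with a common comparator C, the A-versus-B effect can be estimated as the difference of the two direct effects, with their variances summing. A minimal sketch, using hypothetical effect sizes rather than values from this synthesis:

```python
import math

def indirect_effect(g_ac, se_ac, g_bc, se_bc):
    """Bucher-style adjusted indirect comparison of A vs. B through a
    common comparator C: effects subtract, variances add."""
    g_ab = g_ac - g_bc
    se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)
    return g_ab, se_ab

# Hypothetical direct estimates: scaffolding A vs. control g = 0.80
# (SE = 0.10); scaffolding B vs. control g = 0.50 (SE = 0.12).
g_ab, se_ab = indirect_effect(0.80, 0.10, 0.50, 0.12)
# g_ab = 0.30; se_ab = sqrt(0.10**2 + 0.12**2), roughly 0.156
```

Note how the indirect estimate is necessarily less precise than either direct estimate; this is why the reliability of a network meta-analysis depends on the number of direct comparisons it contains.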

MCMC simulations generate the posterior distribution, which represents the range of true ESs for each moderator. Using Bayesian probability, one can calculate the probability that each moderator level is the best ( Jansen et al., 2011 ). We report this as “probability of the best.” One can also calculate the probability that each moderator level is second best, third best, and so on. Averaging all such probability levels together for each moderator level allows one to arrive at a rank order for the levels of the moderator. We report this as “ranking.”
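The "probability of the best" and "ranking" computations described above can be sketched directly from posterior samples: within each MCMC draw, the moderator levels are ranked by effect size, and the ranks are then summarized across draws. A toy illustration with simulated (not actual) posterior draws for three hypothetical moderator levels:

```python
import numpy as np

rng = np.random.default_rng(1234)

# Hypothetical posterior draws of pre-post ES for three moderator levels
# (rows = MCMC samples, columns = levels); level C has fewer coded
# outcomes, hence a wider posterior
samples = np.column_stack([
    rng.normal(0.9, 0.15, 20_000),   # level A
    rng.normal(0.8, 0.15, 20_000),   # level B
    rng.normal(0.7, 0.30, 20_000),   # level C
])

# Rank levels within each draw (rank 1 = largest ES in that draw)
order = np.argsort(-samples, axis=1)
ranks = np.empty_like(order)
rows = np.arange(samples.shape[0])[:, None]
ranks[rows, order] = np.arange(1, samples.shape[1] + 1)

p_best = (ranks == 1).mean(axis=0)   # "probability of the best"
mean_rank = ranks.mean(axis=0)       # basis for the reported "ranking"
```

Averaging the ranks across draws (the last line) yields the rank order reported for each moderator level, while `p_best` corresponds to the share of draws in which a level has the largest effect size.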

The presence of similar pretests and posttests within the same study can present a risk of testing bias. Within the overall project on Bayesian network meta-analysis of scaffolding in STEM education, we also wrote an article covering scaffolding characteristics and risk of bias—a lens with which to code research quality that does not make assumptions when data are not present ( Higgins et al., 2011 ). Results showed that there was no substantial risk of bias due to testing effect ( Walker, Belland, Kim, & Piland, 2017 ).

The wide range of participants, contexts of use, study measures, and educational levels makes it unlikely that each outcome represents an approximation of a single true ES. This led us to use a random effects model ( Borenstein, Hedges, Higgins, & Rothstein, 2009 ). Analyses were conducted using the metan package of STATA 14 and WinBUGS 1.4.3. Specifically, WinBUGS 1.4.3 was used to run MCMC simulations using Gibbs sampling. We used 20,000 MCMC samples for each analysis. This study used the 2-level model θi = β0 + β1xi1 + β2xi2 + ⋯ + βpxip + δi + ei, including within- and between-study-level covariates for every moderator ( Raudenbush, 2009 ), where xip identifies study-level coding and βp represents the corresponding regression coefficient. The random effect of studies, δi, is distributed δi ~ N(0, τ²), and the sampling error, ei, has a mean of zero and a sampling variance of σi². The seed for the random number generator was the default setting of 1234, and the starting values for the beta and gamma parameters were zero. In every moderator model, MCMC generated a total of 22,500 iterations for estimation of the posterior distribution, of which the initial 2,500 iterations were discarded as burn-in to remove the influence of the randomized initial values. Furthermore, we validated our models with graphical summaries (i.e., trace, autocorrelation, histogram, and density plots). The trace pattern was stable as the iteration number increased, and the autocorrelation approached 0 as the lag increased.
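As a rough, self-contained illustration of how Gibbs sampling approximates a random effects model of this kind, the following toy sampler pools hypothetical study-level ESs. It uses conjugate priors (including an inverse-gamma stand-in for the uniform-on-tau prior) and omits covariates, so it is a simplified sketch rather than the article's actual WinBUGS model:

```python
import numpy as np

rng = np.random.default_rng(1234)

# Hypothetical study-level data: observed ESs and sampling variances
g = np.array([0.6, 0.9, 1.1, 0.4, 0.8])
v = np.array([0.04, 0.06, 0.05, 0.08, 0.03])
k = len(g)

n_iter, burn_in = 22_500, 2_500
mu, tau2 = 0.0, 1.0              # starting values
keep_mu = []

for it in range(n_iter):
    # theta_i | rest: precision-weighted compromise of g_i and mu
    prec = 1 / v + 1 / tau2
    theta = rng.normal((g / v + mu / tau2) / prec, np.sqrt(1 / prec))
    # mu | rest (flat prior on mu)
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / k))
    # tau2 | rest, under a vague inverse-gamma prior (a conjugate
    # stand-in for the uniform-on-tau prior used in the article)
    a = 0.001 + k / 2
    b = 0.001 + ((theta - mu) ** 2).sum() / 2
    tau2 = 1 / rng.gamma(a, 1 / b)
    if it >= burn_in:
        keep_mu.append(mu)

posterior_mean = np.mean(keep_mu)  # pooled pre-post ES estimate
```

After discarding the burn-in draws, the retained samples of mu empirically approximate the posterior distribution of the pooled effect size, mirroring the 22,500-iteration, 2,500-burn-in scheme described above.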

Consensus codes were used in all analyses. An earlier version of the coding scheme was developed in two ways—through synthesis of the scaffolding literature and development of in vivo codes; this was then used for a pilot scaffolding meta-analysis project ( Belland et al., 2015 ). We presented the coding scheme and our suggested additions to encompass a broader swath of literature to our advisory board. They then either confirmed that the coding categories and their associated levels were reasonable or suggested revisions. The revised coding scheme was then used in a comprehensive, traditional meta-analysis ( Belland et al., 2017 ), and, with the exception of the calculation of ESs, the coding categories used in this article were the same.

Alternating pairs of coders from a pool of four researchers with expertise in scaffolding, meta-analysis, or both, coded the studies. Two researchers independently coded each study, and then met to discuss coding discrepancies and come to consensus. We used Krippendorff’s alpha to assess interrater reliability on initial coding because it can handle the range of scales (nominal, ordinal, and ratio) present in our coding data, and it adjusts for chance agreement ( Krippendorff, 2004 ). Because Krippendorff’s alpha adjusts for chance agreement, is appropriate to use with multiple scales, and can account for unused scale points, its values are typically lower than those of other popular indices of agreement such as percentage agreement and Cohen’s kappa, and thus should not be interpreted in light of such statistics. Two coders were drawn from a pool of 4, and 218 data points were used for the interrater reliability analysis. All alphas were greater than .667 (see Supplementary Table S1 in the online version of the journal), which represents the minimum standard for acceptable reliability ( Krippendorff, 2004 ). The lowest Krippendorff’s alpha values (.731 for assessment level and .761 for context of use) were further analyzed using the q test bootstrapping method to examine the probability that the statistics were actually lower than .667 ( Hayes & Krippendorff, 2007 ). q test results for assessment level coding show that the chance of obtaining an alpha value below .667 was 3.13%; in other words, if the population of units were coded, reliability would likely be somewhere within the confidence interval for αtrue of .67 to .79 (see Supplementary Table S1 in the online version of the journal). q test results for context of use show a 90% probability that the alpha value was above .667 (see Supplementary Table S1 in the online version of the journal).
While the probability of obtaining an alpha value below .667 was 10%, the distribution of alpha values was normal, p > .05; thus, low reliability between coders was not a concern.
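As a concrete illustration of the reliability statistics above, Krippendorff's alpha for two coders on nominal data, together with a simple bootstrap of the chance that alpha falls below .667, can be sketched as follows (with hypothetical codes, not the study's data):

```python
import numpy as np

def kripp_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha for two coders, nominal data, no missing
    values, computed from pooled category frequencies."""
    c1, c2 = np.asarray(coder1), np.asarray(coder2)
    N = 2 * len(c1)                       # total pairable values
    d_o = np.mean(c1 != c2)               # observed disagreement
    _, counts = np.unique(np.concatenate([c1, c2]), return_counts=True)
    d_e = (N ** 2 - (counts ** 2).sum()) / (N * (N - 1))  # expected
    if d_e == 0:                          # degenerate: no variation
        return 1.0
    return 1 - d_o / d_e

# Hypothetical coding of 10 units; the coders disagree on one unit
a = np.array(["concept"] * 4 + ["principles"] * 4 + ["application"] * 2)
b = np.array(["concept"] * 4 + ["principles"] * 3 + ["application"] * 3)
alpha = kripp_alpha_nominal(a, b)

# Bootstrap resampling of units: chance that alpha falls below .667
rng = np.random.default_rng(1234)
idx = rng.integers(0, len(a), size=(5_000, len(a)))
boot = np.array([kripp_alpha_nominal(a[i], b[i]) for i in idx])
p_below = float(np.mean(boot < 0.667))
```

This mirrors the logic of the bootstrapping approach of Hayes and Krippendorff (2007) in simplified form; the published q test implementation involves additional machinery beyond this sketch.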

Assessments were labeled on the basis of what students were asked to know and do with the target knowledge ( Sugrue, 1995 ). Concept-level assessments measured whether participants knew basic knowledge. For example, a pretest and a posttest in one study asked declarative knowledge questions about scientific instruments, the solar system, and planet characteristics ( Bulu & Pedersen, 2010 ). Principles-level assessment was coded when participants were asked to identify relationships/connections between facts, either in terms of directionality or scale. For example, an assessment invited students to read a scenario in which scientists were investigating a phenomenon, and students needed to indicate the hypothesis that was being tested ( Tan et al., 2005 ). Application-level assessment was coded when participants needed to apply concept-level knowledge and principles-level knowledge to a new holistic/authentic problem. For example, high school students needed to use physics knowledge and principles to describe how a shuffle stone moves across a shuffleboard ( Gijlers, 2005 ).

We coded this category according to the problem students were addressing, rather than the discipline of the class in which participants were enrolled. We always coded according to a broad category (e.g., engineering), and a narrower category (e.g., electrical engineering). This decision was made for two reasons: (a) the subject matter of the class did not always align with the nature of the problem being addressed, and the nature of the problem being addressed was deemed to be more important to an examination of scaffolding; (b) participants were not always drawn from a formal class. As an example of the first point, participants in Magana (2014) were in an introductory educational computing course, but were addressing a problem related to scale (nanoscale, microscale, and macroscale). Because the goal was that students be able to order, classify, and sort shapes according to scale, the STEM discipline was coded as mathematics. As an example of the second point, participants in Chen, Kao, and Sheu (2005) engaged in a mobile butterfly watching activity. Within the study, participants needed to compare photographs they took with database photos of butterflies; for this reason, it was coded as science–ecology. Another study focused on the engineering implications of electrical current ( de Jong, Härtel, Swaak, & van Joolingen, 1996 ); thus, although the participants were high school students enrolled in physics and engineering courses, the study was coded as electrical engineering.

Education level was coded as (a) primary when the majority of participants were enrolled in Grades K–5, (b) middle level when the majority of participants were enrolled in Grades 6 to 8, (c) secondary when the majority of participants were enrolled in Grades 9 to 12, (d) college/vocational/technical when the majority of participants were enrolled in a 4-year bachelor’s program or 2-year associate’s program, (e) graduate when the majority of participants were enrolled in a graduate degree program (e.g., master’s or doctorate), or (f) adult when the majority of participants were over the age of 18 years but not enrolled in a college or graduate-level program.

All included studies had at least two treatments—usually one scaffolding treatment and a lecture control condition, but sometimes two different scaffolding treatments. For each treatment group, the sample size, pretest mean, pretest standard deviation, posttest mean, and posttest standard deviation were inputted into a free online tool ( http://esfree.usu.edu/ ) to calculate effect size. All reported effect sizes used the Hedges’s g calculation. Hedges’s g was chosen because it (a) uses pooled standard deviation, which has the potential to be less biased than effect size estimates that use the control group standard deviation, and (b) is weighted according to sample size ( Hedges, 1982 ).
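The Hedges's g computation can be sketched as follows. Note that this simplified version pools the pretest and posttest standard deviations as if they came from independent groups; it mirrors the structure of the pooled-SD formula but may differ in detail from the online tool used in the study:

```python
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference with pooled SD and Hedges's
    small-sample bias correction."""
    df = n1 + n2 - 2
    # Pooled standard deviation, weighting each SD by its sample size
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / sp
    j = 1 - 3 / (4 * df - 1)      # bias-correction factor
    return j * d

# Hypothetical single-group pre-post data (n = 30):
# posttest M = 82, SD = 10; pretest M = 75, SD = 12
g = hedges_g(82, 10, 30, 75, 12, 30)
```

With these illustrative numbers, g is roughly 0.63; the correction factor j shrinks the raw standardized difference slightly, with the shrinkage diminishing as sample size grows.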

In Stage 2, alternating pairs of researchers read each article remaining after Stage 1 and applied the inclusion criteria. This resulted in the exclusion of a further 1,062 studies, leaving a final sample of k = 56 included studies. The number of included outcomes varies slightly by moderator analysis, as detailed in the Results section (see Supplementary Table S1 in the online version of the journal for a list of included studies).

Application of inclusion criteria proceeded in a two-stage manner. In Stage 1, the inclusion criteria were applied in a first-pass manner to winnow the list of studies that resulted from the literature search (see Figure 2 for the number of studies dropped at each element of the exclusion process). Specifically, one researcher applied the inclusion criteria and only removed a study from consideration if it clearly did not meet them. Stage 1 resulted in dropping k = 6,471 of the k = 7,589 studies that resulted from the literature search.

Inclusion criteria were that (a) participants addressed an ill-structured problem in one of the STEM fields (science, technology, engineering, and mathematics); (b) participants used a computer-based scaffolding intervention; (c) participants took a similar pretest and posttest covering a cognitive variable; (d) sufficient statistics were reported to calculate effect size; and (e) there were at least two treatments. We defined ill-structured problems as those for which qualitative representation of the problem was necessary, and not all the information necessary to do so was presented to students ( Jonassen, 2011 ). All included studies had to have a treatment in common with at least one other study ( Mills et al., 2013 ). Thus, if a study compared two scaffolding types that were not examined in any other study, then it would be excluded. When more than one study reported the same data, the one with the most information (e.g., dissertation) was retained.

We used a three-pronged literature search to identify 7,589 potential studies, which were published between January 1, 1993, and December 31, 2015 (see Figure 2 ). The databases searched were ProQuest, Education Source, PsycINFO, CiteSeer, ERIC, Digital Dissertations, PubMed, Academic Search Premier, IEEE, and Google Scholar, and search terms used were various combinations of the following terms: scaffold* , tutor* , computer* , intelligent tutoring system* , and cognitive tutor* . Hand searches were conducted in journals that were recommended by experts or in which we had found articles related to scaffolding in mathematics and engineering education: Journal for Research in Mathematics Education , International Journal of Mathematical Education in Science and Technology , Journal of Professional Issues in Engineering Education and Practice , and Computer Applications in Engineering Education . To gain additional coverage in the areas of special education and adult learning, we conducted hand searches in the following journals: Journal of Special Education , Journal of Special Education Technology , BMC Medical Education , and Journal of Medical Education . We found no potentially includable studies in BMC Medical Education or the Journal of Medical Education . Referrals were studies identified in the reference lists of included studies.

Next, one collects current data—in this study, our coding of articles collected through the literature search. Then, one runs MCMC simulations informed by the prior distribution and the current data to empirically approximate the posterior distribution, defined as the distribution of true parameters. We did this using WinBUGS ( Lunn et al., 2000 ; see Supplementary Table S1 in the online version of the journal for our WinBUGS code). Readers interested in how to run the calculations for a Bayesian network meta-analysis with the combination of STATA and WinBUGS are directed to the screencast available in Supplementary Video S2 in the online version of the journal. Readers interested in the foundations and application of coordinating Bayesian analysis between STATA and WinBUGS are directed to Thompson (2014) . Many of the principles behind the commands and processes would be similar if combining WinBUGS with other statistical packages like R or SAS.

Following a Bayesian approach requires that one establish a prior distribution, defined as the distribution of the parameters in question according to prior research. All relevant prior meta-analyses about computer-based scaffolding focused on between-subjects, rather than within-subject, differences. Therefore, existing meta-analysis results are ill-equipped to form an informative prior distribution in this study. Furthermore, we wanted the current coding, rather than a prior distribution informed by between-subjects effects, to primarily drive the approximation of the posterior distribution ( Jansen, Crawford, Bergman, & Stam, 2008 ). Therefore, this article employs a noninformative prior distribution model, which can be used when there is insufficient information about a treatment’s effectiveness or no consensus about the effectiveness among scholars. Among several possible noninformative prior distribution models, which make different assumptions about the variance between studies (e.g., maximum and minimum tau values), a Uniform(0, 5) prior distribution on tau was selected using deviance information criterion statistics (see Supplementary Table S1 in the online version of the journal), which evaluate and compare generated Bayesian models ( Spiegelhalter, Best, Carlin, & van der Linde, 2002 ).

For this synthesis effort, we followed a network meta-analysis approach from a Bayesian perspective. Network meta-analyses allow researchers to make direct and indirect comparisons of pre–post gains of different interventions that have a common comparator ( Mills et al., 2013 ). Two principal advantages of network meta-analysis are its capacity to allow researchers to (a) make indirect comparisons among treatments that were never compared in a single study and (b) rank treatments according to effectiveness ( Mills et al., 2013 ). However, the reliability of the indirect comparisons and rankings depends on the number of direct comparisons that are included in the network ( Lumley, 2002 ; Mills et al., 2013 ). Furthermore, when the number of studies that represent a certain level of a moderator is low, the results for those moderator levels can be overweighted or biased. When (a) the number of direct comparisons among moderator levels is low and (b) there is no common comparator between moderator levels, one may opt to take a Bayesian approach to analysis. At a high level, in Bayesian approaches, rather than simply calculating the distribution of a collected sample without reference to what is already known (as one would do with a frequentist approach), one (a) determines possible prior distributions (considers what is already known about the distribution of the construct in the population of interest), (b) collects data from a sample, and (c) empirically approximates the posterior distribution (through, e.g., Markov Chain Monte Carlo [MCMC] sampling; see Figure 1 ; Carlin & Chib, 1995 ; Little, 2006 ; Lunn, Thomas, Best, & Spiegelhalter, 2000 ). For a relatively comprehensive and user-friendly introduction to Bayesian data analysis approaches, readers are directed to Gelman et al. (2013) .
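The three steps above (prior, data, posterior approximation) can be made concrete with a toy model. In the conjugate normal-normal case the posterior is available in closed form, so a simple Metropolis sampler (a generic MCMC stand-in; the study itself used Gibbs sampling in WinBUGS) can be checked against it. All numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1234)

# Step (a): a vague prior on the pooled ES, mu ~ N(0, 10**2)
mu0, sigma0 = 0.0, 10.0

# Step (b): hypothetical observed study ESs with a known, common
# sampling SD (a deliberately simplified fixed-effect toy model)
g = np.array([0.7, 0.9, 0.5, 0.8])
s = 0.2

# In this conjugate model the posterior mean has a closed form,
# which lets us check the MCMC approximation below
post_prec = 1 / sigma0 ** 2 + len(g) / s ** 2
post_mean = (mu0 / sigma0 ** 2 + g.sum() / s ** 2) / post_prec

# Step (c): approximate the same posterior by Metropolis sampling
def log_post(mu):
    return (-(mu - mu0) ** 2 / (2 * sigma0 ** 2)
            - ((g - mu) ** 2).sum() / (2 * s ** 2))

mu, draws = 0.0, []
for _ in range(20_000):
    prop = mu + rng.normal(0, 0.3)    # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(mu):
        mu = prop
    draws.append(mu)
mcmc_mean = np.mean(draws[2_500:])    # discard burn-in
```

Because the prior is vague, the posterior is driven almost entirely by the data, which parallels the rationale for the noninformative prior used in this synthesis.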

The magnitude of difference among the assessment levels is minor. Comparing assessment levels through ranking similarly shows that there is little evidence to say that scaffolding is more effective at a particular assessment level than another (see Table 5 ).

Scaffolding led to strong pre–post gains across assessment levels, with the lowest effect size estimate at the application level ( ḡ = 0.74) and the highest at the concept level ( ḡ = 0.87; see Figure 7 ). The credible intervals, which represent a range of true effect sizes, were relatively tight, reflecting the large number of trials for each possible comparison, with the exception of application versus control. Accordingly, the credible interval for application was quite wide.

The network of evidence included a substantial number of direct comparisons among the assessment levels and between each assessment level and control, with the exception of between application and control (see Supplementary Figure S3 in the online version of the journal).

Mathematics and technology had the highest pre–post effect sizes: ḡ = 1.29 and ḡ = 1.06, respectively (see Figure 6 ). Most outcomes coded as technology were from computer science instruction ( n = 2), with the remaining outcome coming from information technology. Mathematics and technology also had the highest and second highest probability of the best (see Table 4 ).

Project-based learning had the highest probability of the best (see Table 3 ). The ranking of problem solving is close behind that of project-based learning, but problem solving has a much lower probability of being the best.

The highest pre–post effect size was for project-based learning ( ḡ = 1.21; see Figure 5 ). Due to the low number of coded outcomes for this model, the range of true effects (credible interval) is wide; thus, this ES needs to be interpreted cautiously. Most pre–post effect sizes were quite large, with the exceptions of inquiry-based learning ( ḡ = 0) and modeling/visualization ( ḡ = 0.28). This implies that scaffolding can lead to strong pre–post effect sizes across a wide range of problem-centered instructional approaches.

With the exception of problem solving, the number of coded outcomes for each problem-centered instructional model was very small (see Supplementary Figure S3 in the online version of the journal). This resulted in a very large range of true effects as calculated through Bayesian simulations.

When examining ranking and probability of the best, one finds scaffolding to have a high probability of having the best ranking when used among students with learning disabilities (see Table 2 ). Indeed, the probability of the best is virtually nil for all other education populations.

The pre–post gains are consistently positive and substantial across educational populations (see Figure 4 ). The number of outcomes for scaffolding used by traditional students was the greatest, leading the group to have the tightest credible interval. Note that N = 36 for control in the education level network, while N = 35 for control here and for other moderators. This is because one study contained outcomes associated with two different educational levels—middle level and secondary; for the education level analysis, such outcomes could not be combined, while for the other moderator analyses, the outcomes needed to be combined. Scaffolding for students with learning disabilities had the largest effect size estimate ( ḡ = 3.13) by a large margin. This effect size estimate should be considered tentative, as the MCMC sampling was based on four outcomes from a single study. English language learners (ELL) also had a large pre–post effect size ( ḡ = 0.92).

The evidence is strongest for the comparison of traditional students using scaffolding versus control (25 outcomes; see Supplementary Figure S3 in the online version of the journal). There are some studies that contained multiple educational populations. For example, traditional students using scaffolding co-occurred with underrepresented students, high-performing students, underperforming students, and ELL in at least one study for each combination.

Using a Bayesian network meta-analysis approach allows estimation of the true effect size, rank ordering of treatments, and calculation of the probability that each treatment is the best. Scaffolding led to the highest pre–post gains at the graduate and college levels (see Table 1 ), which ranked first and second with a 47% and a 35% chance of being the best, respectively.

Pre–post effect size estimates are highest among college- and graduate-level learners, at ḡ = 1.16 and ḡ = 1.20, respectively (see Figure 3 ). The 95% credible intervals represent ranges of true pre–post effects of scaffolding in each respective category. The credible intervals included true effects below zero for all educational levels except college. This is a function of the number of coded outcomes on which the Bayesian simulations estimated the posterior distribution.

When interpreting the network plot (see Supplementary Figure S3 in the online version of the journal), one can see the number of unique outcomes for each level (e.g., middle level) of the target characteristic (e.g., education level). Each solid line between two circles represents the number of direct comparisons between the two levels of the target characteristic. For example, the solid line between middle level and control shows that there were eight direct comparisons of middle-level students using scaffolding with students in a control condition. Of note, for educational level, there are no studies that compared students at different educational levels, which is to be expected. Dotted lines indicate indirect comparison information that can be ascertained among treatment characteristics that were never directly compared in a single study. The number of outcomes was (a) greatest at the college/vocational/technical level ( k = 12); (b) roughly equivalent among primary ( k = 7), middle level ( k = 8), and secondary ( k = 6); and (c) lowest among graduate/professional ( k = 3). Because this is a Bayesian network meta-analysis, the number of outcomes refers to actual coded outcomes; the degree of precision of effect size estimates depends on the number of coded outcomes. Also, not all included studies had a control condition. Thus, the number of control outcomes does not equal the number of included studies.

Discussion