Search Process

The initial databases we used to search for articles were Education Source, PsycInfo, Digital Dissertations, CiteSeer, ERIC, PubMed, Academic Search Premier, IEEE, and Google Scholar. These databases were recommended by a librarian and by experts in fields related to this study. However, some articles were duplicated across databases, and no articles satisfying our inclusion criteria were found in PubMed, Academic Search Premier, or IEEE. Various combinations of the following search terms were used in the databases listed above: “scaffold, scaffolds, computer-based scaffolding/supports,” “problem-based learning,” “cognitive tutor,” “intelligent tutoring systems,” “Science, Technology, Engineering, Mathematics,” and subcategories of higher-order thinking skills. The search terms were determined by researcher consensus and the advice of advisory board members.

Inclusion Criteria

The following inclusion criteria were used: studies needed to (a) be published between January 1, 1990, and December 31, 2015; (b) present sufficient information to conduct Bayesian meta-analysis (statistical results revealing the difference between treatment and control groups, number of participants, study design); (c) be conducted in the context of problem-based learning within STEM education; (d) clearly indicate which types of scaffolding were used; and (e) address higher-order thinking skills as the intended outcome of the scaffold itself. We found a total of 21 studies with 47 outcomes (see Appendix). The numbers of studies and outcomes differ because some studies had multiple outcomes at different levels of the moderators used in this meta-analysis (i.e., scaffolding intervention, higher-order thinking skills, scaffolding customization and its method, scaffolding strategy, and discipline).

Moderators for Meta-analysis

Scaffolding Intervention

Conceptual scaffolding provides expert hints, concept mapping and/or tools to engage in concept mapping, and visualizations depicting concepts to help students identify what to consider when solving the problem (Hannafin et al. 1999). For example, scaffolding in Su (2007) was designed to focus attention on key content and to help students stay organized in a way that met project requirements. Strategic scaffolding helps students identify, find, and evaluate information for problem-solving, and guides them toward a suitable approach to solve the problem (Hannafin et al. 1999). An example can be seen in the scaffolding in Rosen and Tager (2014), which enabled students to construct a well-integrated structural representation (e.g., about the benefits of organic milk). Conceptual scaffolding can be distinguished from strategic scaffolding in that conceptual scaffolding helps students consider tasks from different angles through the reorganization and connection of evidence, whereas strategic scaffolding tells students how to use the evidence for problem-solving (Saye and Brush 2002). Metacognitive scaffolding allows students to reflect on their learning process and encourages them to consider possible problem solutions (Hannafin et al. 1999). For example, the reflection sheet in Su and Klein (2010) encouraged students to summarize what they had learned, reflect upon it, and then debrief that information. Motivational scaffolding aims to enhance students’ interest, confidence, and collaboration (Jonassen 1999a; Tuckman and Schouwenburg 2004).

Higher-Order Thinking Skills

Scaffolding is often designed to enhance higher-order thinking skills (Aleven and Koedinger 2002; Azevedo 2005; Quintana et al. 2005). The definition of higher-order thinking and its subcategories differ among scholars. Higher-order thinking can be defined as “challenge and expanded use of the mind” (Newmann 1991, p. 325), and students can enhance higher-order thinking skills through active participation in activities such as making hypotheses, gathering evidence, and generating arguments (Lewis and Smith 1993). According to Bloom’s taxonomy (Bloom 1956), higher-order thinking lies beyond understanding and declarative knowledge; therefore, analyzing, synthesizing, and evaluating can be classified as higher-order skills. Analysis means the ability to identify the components of information and ideas and to establish the relations between elements (Lord and Baviskar 2007). For example, scaffolding in Bird Watching Learning (Chen et al. 2003) provided pictures, questions, and other information that learners could use to identify bird species. Synthesis refers to recognizing patterns among components and creatively forming a new whole; through this ability, learners can formulate a hypothesis or propose alternatives (Anderson et al. 2001). The mapping software in Toth et al. (2002) helped students formulate scientific statements using hypotheses and data, along with a summary of information found in web-based materials. Defined as the ability to judge the value of material against definite criteria, evaluation allows learners to judge the value of data and experimental results and to justify conclusions (Krathwohl 2002). For example, several scaffolding strategies (question prompts, expert advice, and suggestions) in Simons and Klein (2007) supported learners in rating the reliability and believability of each evidence item and the extent to which they believed the statements they had generated.
Based on the above illustrative phrases, critical thinking and logical thinking can be combined under the “Analysis” category, creative thinking and reflective thinking under the “Synthesis” category, and problem-solving skills and decision-making under the “Evaluation” category (Bloom 1956; Hershkowitz et al. 2001).

Therefore, in this study, I defined higher-order thinking skills as the cognitive skills that allow students to function at the analysis, synthesis, and evaluation levels of Bloom’s taxonomy; accordingly, the intended outcomes were categorized as analysis, synthesis, or evaluation.

Scaffolding Customization and Its Methods

By effectively controlling the timing and degree of scaffolding, students can reach the final learning goal through their own learning strategies and processes (Collins et al. 1989). In this sense, scaffolding customization is defined as a change in scaffolding frequency or nature based on a dynamic assessment of students’ current abilities (Belland 2014). There are three kinds of scaffolding customization: fading, adding, and fading/adding. Fading means that scaffolds are introduced and then pulled away. As an example of fading, the Web-based Inquiry Science Environment (Raes et al. 2012) faded scaffolding according to students’ learning progress (e.g., full scaffolding in the beginning step but none in advanced steps). Adding, on the other hand, is defined as increasing the frequency of scaffolds, reducing the interval between scaffolds, or adding new scaffolding elements as the intervention goes on. In Chang et al. (2001), students who continued to struggle could request greater quantities and intensities of scaffolding through a hint button. The third type, fading/adding, is defined as increasing or pulling away scaffolds depending on students’ current learning status and their requests. Scaffolding that neither increased nor decreased in nature or frequency was categorized as none. According to the meta-analysis reported by Belland et al. (2017), 65% of the included studies involved no scaffolding customization. Similarly, in a review of 43 scaffolding-related articles, Lin et al. (2012) noted that few studies (9.3%) adopted a fading function. This means that while many scholars have maintained that fading is an important element of scaffolding (Collins et al. 1989; Dillenbourg 2002; Puntambekar and Hübscher 2005; Wood et al. 1976), scaffolding customization has largely been overlooked in scaffolding design.

There are three ways to determine scaffolding fading, adding, and fading/adding: performance-adaptation, self-selection, and fixed time interval. Performance-adaptation means that the frequency and nature of scaffolding change based on students’ current learning performance and status. Self-selection, by contrast, is customization driven by students’ own decisions to request fading, adding, or both. In intelligent tutoring systems that can monitor students’ abilities, fading is often performance-adapted while adding is self-selected. In addition to performance and self-selection, scaffolding customization can also be fixed, defined as adding or fading after a predetermined number of events or a fixed time interval has passed (Clark et al. 2012). Among these methods (i.e., performance-adapted, self-selected, and fixed), performance-adapted customization was the most frequent (Belland et al. 2017), but few studies have investigated which customization method has the strongest effect on students’ learning performance in the context of problem-based learning.

Scaffolding Strategies

Scaffolding strategies include feedback, question prompts, hints, and expert modeling (Belland 2014; Van de Pol et al. 2010). Feedback is the provision of information to students regarding their performance (Belland 2014). In Siler et al. (2010), a computer tutor that covered experimental design evaluated students’ designs and provided feedback about their selection of the variables of interest. Question prompts help students draw inferences from their evidence and encourage elaborative learning (Ge and Land 2003). Students read question prompts that directed their attention to important problem elements and encouraged them to conduct certain tasks (Ge et al. 2010). Hints are clues or suggestions that help students move forward (Melero et al. 2011). For example, when students tried to change paragraph text, the computer system showed word definitions and provided audio support for reading those words. Expert modeling presents how experts perform a given task (Pedersen and Liu 2002). In Simons and Klein (2007), when students struggled with balloon design, expert advice was provided to help them distinguish between valuable and useless information. In addition, several types of strategies can be used within one study to satisfy students’ different needs according to context (Dennen 2004; Gallimore and Tharp 1990). For example, in Butz et al. (2006), students received expert modeling, question prompts, feedback, and hints to solve a real-life problem in their introductory circuits class.

Discipline

In this paper, “STEM” refers to two things: (a) the individual disciplines of Science, Technology, Engineering, and Mathematics, in which scaffolding was utilized, and (b) integrated STEM curricula. Integrated STEM education began with the aim of enhancing performance in science and mathematics education as well as cultivating engineers, scientists, and technicians (Kuenzi 2008; Sanders 2009). Applying integrated STEM education increased students’ motivation and interest in science learning and contributed to positive attitudes toward STEM-related areas (Bybee 2010). For example, the results of two meta-analyses indicated that integrative approaches to the STEM disciplines had larger effects (d > 0.8) on students’ performance than separate-discipline approaches (Becker and Park 2011; Lam et al. 2008). However, few studies have investigated the effects of computer-based scaffolding, which is commonly utilized in each STEM field, in the context of problem-based learning for integrated STEM education. Therefore, it may be worthwhile to compare scaffolding effects between integrated STEM education and the individual STEM fields. In this regard, integrated STEM education and each STEM discipline (i.e., Science, Technology, Engineering, and Mathematics) are included as the discipline moderator.

Table 2 shows the moderators and subcategories of each moderator in this meta-analysis.

Table 2 Moderators in Bayesian meta-analysis

Prior Distribution in This Study

In Bayesian analysis, the estimation of the posterior distribution can be substantially affected by how the prior distribution is specified. Typically, there are three methods for determining the prior distribution. The first is to follow experts’ opinions about parameter information related to a certain topic. Experts’ opinions reflect the results of existing studies, and they can represent current trends regarding the effects of a certain treatment. Unfortunately, there are few summarized expert opinions regarding the effects of computer-based scaffolding in PBL, and those that do exist can be seen as highly subjective. This study therefore excluded expert opinion as a possible prior distribution model. The second method is to use the results of a meta-analysis as a prior distribution. There are two representative meta-analyses related to computer-based scaffolding, including intelligent tutoring systems (ITS): Belland et al. (2017) and Kulik and Fletcher (2016). Kulik and Fletcher’s meta-analysis focused on how the effects of computer-based scaffolding, including ITS, vary across learning contexts characterized by factors such as sample size, study duration, and evaluation type; their results did not emphasize the characteristics of ITS. Therefore, it is difficult to use those results as the prior distribution for this study, which focuses on the characteristics of scaffolding. Recently, a National Science Foundation-funded project (Belland et al. 2017) synthesized quantitative research on computer-based scaffolding in STEM education through a traditional meta-analysis (TMA). The moderators in that project overlap with many of the moderators in this study. However, the big difference between the previous TMA and this paper is the learning context.
This paper focuses only on problem-based learning, whereas the contexts in the TMA included several problem-centered instructional models (e.g., inquiry-based learning, design-based learning, project-based learning). Such models incorporate many different teacher roles, learning goals and processes, student learning strategies, and scaffolding usage patterns (Savery 2006). This makes it difficult to apply the results of the TMA as an informative prior distribution in this paper, which handles only problem-based learning.

The last method of setting up the prior distribution is to use a non-informative prior. If one does not have enough prior information about the parameter θ, one can hope that the selected prior has little influence on the inferences when generating the posterior distribution of the parameter. In other words, a non-informative prior distribution contains only minimal information about the parameters. For example, if the parameter is assumed to range from 0 to 1, one can set μ (the mean of the parameter) to 0 and the variance of μ to 1. If, however, the parameter lies on an infinite interval, the between-study variance should be large enough (τ² → ∞) to have little influence on the posterior. These techniques make the prior distribution non-informative. In this paper, several non-informative prior distribution models (i.e., uniform, DuMouchel, and inverse gamma), which weight the between-study variance τ² differently, were used to identify the best model fit for the given data.
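The influence of a vague versus an informative prior can be illustrated with a simple conjugate normal-normal update (a sketch only, not the WinBUGS models used in this study; the summary numbers are made up for illustration):

```python
# Sketch: conjugate normal-normal updating for a mean with known sampling
# variance, showing that a vague (high-variance) prior barely moves the
# posterior, while a tight prior pulls the estimate toward the prior mean.

def posterior_normal(y_mean, y_var, n, prior_mean, prior_var):
    """Posterior mean and variance for a normal mean with known variance."""
    prec = 1.0 / prior_var + n / y_var          # precisions add under conjugacy
    post_var = 1.0 / prec
    post_mean = post_var * (prior_mean / prior_var + n * y_mean / y_var)
    return post_mean, post_var

# Hypothetical summary: mean effect 0.45 from n = 21 studies, variance 0.04.
informative = posterior_normal(0.45, 0.04, 21, prior_mean=0.0, prior_var=0.01)
vague       = posterior_normal(0.45, 0.04, 21, prior_mean=0.0, prior_var=1e6)

print(informative[0])  # pulled toward the prior mean of 0
print(vague[0])        # essentially the observed mean, 0.45
```

As the prior variance grows, the prior's contribution to the posterior precision vanishes, which is exactly the sense in which such a prior is non-informative.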

Data Coding

All study features were coded using theory-driven constructs regarding scaffolding characteristics, roles, and contexts of use. These codes were then validated by experts in the fields of scaffolding, problem-based learning, and STEM education. All effect sizes were calculated as Hedges’s g, a measure of effect size that corrects for small-sample bias through the use of weighted standard deviations. In addition, considering the variance between individual studies across a wide range of educational populations, subject areas, and scaffolding interventions, this study utilized a random-effects model, which assumes that the true effect size may vary from study to study. The quantitative results available from the individual studies were F statistics, t statistics, mean differences, and chi-square values. From these, all effect sizes of computer-based scaffolding corresponding to the moderators were calculated using the metan package in STATA 14. Two graduate students with extensive knowledge of scaffolding and problem-based learning, as well as coding experience in meta-analysis, participated in the coding. The primary coder selected candidate studies based on the inclusion criteria and generated initial codes for the moderators (i.e., scaffolding type, scaffolding strategy, scaffolding customization, scaffolding customization methods, higher-order thinking skills, and disciplines). The second coder coded the data independently, and the two coders’ codes were then compared. When coding was inconsistent between the coders, consensus codes were determined through discussion. Inter-rater reliability was calculated using Krippendorff’s alpha once the initial coding was finished.
Krippendorff’s alpha measures coders’ agreement on the values of variables (nominal, ordinal, or ratio) in the coding rubric by computing the weighted percent agreement and weighted percent chance agreement (Krippendorff 2004). Krippendorff (2004) recommended 0.667 as the minimum acceptable alpha value to avoid drawing wrong conclusions from unreliable data. Krippendorff’s alpha values across all moderators (α ≥ 0.8) were above this minimum standard, indicating strong agreement between the two coders (see Fig. 1).
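The small-sample correction in Hedges's g can be sketched with the standard formula (an illustrative implementation under textbook assumptions, not the authors' actual STATA computation):

```python
# Sketch of Hedges's g: Cohen's d from pooled standard deviations, shrunk by
# the small-sample correction factor J = 1 - 3 / (4 * df - 1).
import math

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference with small-sample bias correction."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / s_pooled                 # Cohen's d
    j = 1 - 3 / (4 * df - 1)                 # correction factor J
    return j * d

# Hypothetical study: treatment M = 10, SD = 2, n = 20 vs. control M = 8, SD = 2, n = 20.
print(round(hedges_g(10, 8, 2, 2, 20, 20), 3))  # 0.98 (a Cohen's d of 1.0, shrunk)
```

The correction matters most for the small samples (n < 10) noted later in this section, where an uncorrected d noticeably overstates the effect.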

Fig. 1 Krippendorff’s alpha for inter-rater reliability (dotted line indicates minimum acceptable reliability (Krippendorff 2004))
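For the simplest case here (two coders, nominal codes, no missing data), Krippendorff's alpha reduces to one minus the ratio of observed to expected disagreement in the coincidence matrix. The following is a minimal sketch of that case (a hypothetical helper, not the authors' actual computation):

```python
# Nominal Krippendorff's alpha for exactly two coders with no missing data.
# Assumes at least two categories appear in the codes.
from collections import Counter

def krippendorff_alpha_nominal(coder1, coder2):
    # Coincidence matrix: each coded unit contributes both ordered value pairs.
    pairs = Counter()
    for a, b in zip(coder1, coder2):
        pairs[(a, b)] += 1
        pairs[(b, a)] += 1
    n = sum(pairs.values())                  # total pairable values (2 * units)
    marginals = Counter()
    for (a, _), count in pairs.items():
        marginals[a] += count
    d_observed = sum(c for (a, b), c in pairs.items() if a != b)
    d_expected = sum(marginals[a] * marginals[b]
                     for a in marginals for b in marginals if a != b) / (n - 1)
    return 1 - d_observed / d_expected

print(krippendorff_alpha_nominal(list("aabb"), list("aabb")))  # 1.0 (perfect agreement)
print(round(krippendorff_alpha_nominal(list("aabb"), list("aaba")), 3))  # 0.533
```

Unlike simple percent agreement, alpha discounts agreement expected by chance, which is why 0.667 rather than a raw agreement rate serves as the reliability floor.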

Data Analysis

For data analysis, STATA 14 and WinBUGS 1.4.3 were utilized. WinBUGS 1.4.3 provided Bayesian estimation, including prior distribution options, via MCMC; STATA 14 imported the results and code from WinBUGS and generated graphical representations.

Markov chain Monte Carlo (MCMC) simulation was used to sample from the probability distributions of the Bayesian model (Dellaportas et al. 2002). Markov chains replace unstable or fluctuating initial values of random variables with more accurate values through repeated steps, in which the next state (i.e., the value of a variable) is influenced only by the current one, not by preceding ones (Neal 2000). In this process, 22,000 MCMC iterations were generated to estimate the posterior distribution, and the first 2000 iterations were discarded to eliminate the influence of randomly assigned initial values. After analysis, the deviance information criterion (DIC) was used to assess model fit (Spiegelhalter et al. 2002). The model with the lowest DIC value best predicts a reproduction of the observed data (Spiegelhalter et al. 2004). The DuMouchel prior for “scaffolding customization methods” and uniform priors for all remaining moderators yielded the smallest DIC values (see Table 3). The uniform and DuMouchel priors assume different between-study variances (i.e., τ²), and the results can differ according to which prior distribution is used, even though the underlying dataset is the same. After MCMC generated the posterior distribution of each moderator, model validity was investigated through four types of graphs: trace plots, autocorrelation plots, histogram plots, and density plots.
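The iterate-then-discard procedure above can be illustrated with a toy random-walk Metropolis sampler (a sketch only, not the paper's WinBUGS models; the data, proposal scale, and flat prior are assumptions for the example):

```python
# Toy random-walk Metropolis sampler: 22,000 iterations with the first 2,000
# discarded as burn-in, mirroring the procedure described in the text.
import math
import random

def log_posterior(mu, data, sigma=1.0):
    # Flat prior on mu; normal likelihood with known sigma.
    return -sum((y - mu) ** 2 for y in data) / (2 * sigma ** 2)

def metropolis(data, iters=22000, burn_in=2000, prop_sd=0.5, seed=1):
    rng = random.Random(seed)
    mu = 0.0                                   # arbitrary starting value
    samples = []
    for i in range(iters):
        prop = mu + rng.gauss(0, prop_sd)      # random-walk proposal
        if math.log(rng.random()) < log_posterior(prop, data) - log_posterior(mu, data):
            mu = prop                          # accept the proposal
        if i >= burn_in:                       # keep only post-burn-in draws
            samples.append(mu)
    return samples

data = [2.1, 1.9, 2.3, 2.0, 1.7]               # toy data, sample mean = 2.0
samples = metropolis(data)
print(sum(samples) / len(samples))             # posterior mean near 2.0
```

Discarding the burn-in removes draws still influenced by the arbitrary starting value; the retained 20,000 draws approximate the posterior, whose fit is then compared across priors via DIC.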

Table 3 DIC values of prior distributions

Observed Data Characteristics

The number of observed data points across subcategories within moderators is unbalanced (see Table 4). No included study involved motivational scaffolding; thus, motivational scaffolding could not be included in this paper. Moreover, around 10.6% of the outcomes included in this paper had small sample sizes (n < 10), raising the possibility of small-study effects. Smaller studies often show larger effect sizes than larger ones, leading to overestimation of treatment effects (Schwarzer et al. 2015).

Table 4 Number of outcomes according to subcategories

To investigate empirically whether there were small-study effects, I conducted Egger’s regression to test the null hypothesis that there are no small-study effects (see Table 5). The null hypothesis was rejected (p < 0.05), indicating the presence of small-study effects.
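Egger's test regresses the standardized effect (g divided by its standard error) on precision (one over the standard error); an intercept far from zero indicates funnel-plot asymmetry, i.e., small-study effects. A minimal sketch of the regression mechanics, with constructed illustrative data rather than the study's actual effect sizes:

```python
# Sketch of Egger's regression: fit z = intercept + slope * precision by
# ordinary least squares. The intercept (tested with a t-test in practice)
# is the small-study-effects indicator.

def egger_regression(effects, std_errors):
    z = [g / se for g, se in zip(effects, std_errors)]   # standardized effects
    x = [1 / se for se in std_errors]                    # precisions
    n = len(x)
    x_bar, z_bar = sum(x) / n, sum(z) / n
    slope = (sum(xi * zi for xi, zi in zip(x, z)) - n * x_bar * z_bar) / \
            (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
    intercept = z_bar - slope * x_bar
    return intercept, slope

# Data constructed so that z = 1.5 + 2.0 * precision exactly:
# the nonzero intercept (1.5) signals funnel asymmetry.
ses = [0.1, 0.2, 0.4, 0.5]
effects = [(1.5 + 2.0 / se) * se for se in ses]
print(egger_regression(effects, ses))  # approximately (1.5, 2.0)
```

In practice the decision rests on the significance of the intercept, which is what the p < 0.05 result in Table 5 reports.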

Table 5 Egger’s regression test for small-study effects

More than 80% of recently published meta-analyses may contain results biased by small-study effects (Kicinski 2013). This means there is a high probability of biased results if a traditional meta-analysis approach were applied to these data. However, a Bayesian approach can address potential small-study effects by shrinking overweighted effect sizes through interval estimation and the appropriate use of priors (Kay et al. 2016; Kicinski 2013; Mengersen et al. 2016).

Interpretation of Bayesian Meta-analysis

Bayesian inference is based on the posterior probability, which in turn is based on the likelihood of the observed data, not on the point or interval estimates of the frequentist approach (i.e., the confidence interval (CI)), whose standard error approaches 0 given the large number of samples generated through MCMC simulation (Robins and Wasserman 2000). The Bayesian 95% credible interval (CrI) resembles the frequentist 95% CI in some ways, but the two differ fundamentally in principle and interpretation. A 95% confidence interval is the range that would include the true effect size in 95% of cases across all possible samples from the same population; in the frequentist approach, population parameters are fixed and samples are random (Edwards 1992). The Bayesian approach, by contrast, regards parameters as random and samples as fixed (Ellison 2004). Therefore, the Bayesian 95% CrI indicates a range covering 95% of the probability of the posterior distribution of the parameters, which is generated by combining the predetermined prior distribution (containing information about the parameters) with the observed data. For example, a wide CI reflects a large standard error caused by limited knowledge of effects or small samples, whereas a CrI is a range of true treatment effects at the population level. Accordingly, most Bayesians are reluctant to use frequentist hypothesis testing with p values (Babu 2012; Bayarri and Berger 2004; Kaweski and Nickeson 1997). Nevertheless, many scholars interpret the results of Bayesian analysis from a frequentist perspective, which causes misunderstanding of the results (Gelman et al. 2014).
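Operationally, an equal-tailed 95% CrI is read directly off the posterior draws as the 2.5th and 97.5th percentiles (a sketch with illustrative draws, not the study's actual posterior samples):

```python
# Sketch: equal-tailed credible interval from posterior draws, taken as the
# lower and upper tail percentiles of the sorted samples.

def credible_interval(samples, level=0.95):
    s = sorted(samples)
    tail = (1 - level) / 2
    lo = s[int(tail * (len(s) - 1))]          # 2.5th percentile for level=0.95
    hi = s[int((1 - tail) * (len(s) - 1))]    # 97.5th percentile
    return lo, hi

# 1000 evenly spread "draws" from 0..999: the 95% CrI spans roughly 24..974.
draws = list(range(1000))
print(credible_interval(draws))  # (24, 974)
```

Because the interval is a direct summary of the posterior, the statement "the parameter lies in this range with 95% probability" is legitimate here, whereas it is not a valid reading of a frequentist CI.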