Sensitivity analyses refer to investigations of the degree to which the results of a meta-analysis remain stable when conditions of the data or the analysis change. To the extent that results remain stable, one can refer to them as robust. Sensitivity analyses are rarely conducted in the organizational science literature. Despite conscientiousness being a valued predictor in employment selection, sensitivity analyses have not been conducted with respect to meta-analytic estimates of the correlation (i.e., validity) between conscientiousness and job performance.

Publication bias analyses demonstrated that the validity of conscientiousness is moderately overestimated (by around 30%; a correlation difference of about .06). The misestimation of the validity appears to be due primarily to the suppression of small effect sizes in the journal literature. These inflated validity estimates result in an overestimate of the dollar utility of personnel selection by millions of dollars and should be of considerable concern for organizations.

Meta-analytic findings are viewed as a primary means for generating cumulative knowledge and bridging the often lamented gap between research and practice [1–4]. However, concerns regarding meta-analytic results and our cumulative knowledge remain [5–10]. Sensitivity analyses address the degree to which the results of a meta-analysis remain stable when conditions of the data or the analysis change [11]. To the extent that results remain stable, they can be considered robust. Unfortunately, the vast majority of meta-analyses in the organizational sciences fail to conduct sensitivity analyses and thus do not report the robustness of their findings [12], despite the fact that scientific organizations such as the American Psychological Association [13, 14] and the Cochrane Collaboration [15] require or recommend such analyses.

Sensitivity analyses in meta-analytic studies include publication bias and outlier analyses. Publication bias occurs to the extent that the research findings available on a particular relation are not representative of all research findings on that relation [6, 16]. Although publication bias analyses are rare in the organizational sciences, they are much more common in other disciplines. For example, van Lent, Overbeke, and Out examined the role of review processes in the publication of drug trials in medical journals [17]. Kicinski examined publication bias in several meta-analyses in four major medical journals [18]. Publication bias has also been addressed in animal research [19, 20]. In both the medical sciences [21–23] and the social sciences [24], publication bias appears to be driven primarily by authors who do not submit null or otherwise undesirable findings [16, 25]. These authors are likely responding to journal policies that discourage the publication of research with non-significant findings as well as replications, which would enable the evaluation of the credibility of previous research findings [10, 26]. In addition to publication bias, outliers can have a noticeable effect on meta-analytic results [27, 28]. Unfortunately, although outlier analyses are a type of sensitivity analysis [11], only around 3% of all meta-analyses in the organizational sciences report assessments of outliers [29].

Our analysis addresses the personality trait conscientiousness, one of the “Big 5,” a term that refers to five broad dimensions that succinctly describe human personality [30]. Shaffer and Postlethwaite [31] conducted the most comprehensive meta-analysis to date of the correlation (i.e., validity) between conscientiousness and job performance (k = 113). Of the Big 5, conscientiousness was found to have the largest magnitude validity (the observed validity range for conscientiousness was .13 to .20 [31]). The other Big 5 personality traits had observed mean validities that were less meaningful from a practical perspective in a selection context (i.e., where job performance is the criterion). A concern with the Shaffer and Postlethwaite study is that the authors concluded that the validity estimates for conscientiousness are not affected by publication bias [31] even though they did not perform any sensitivity analysis. This paper applies sensitivity analyses, specifically publication bias and outlier analyses, to evaluate the robustness of their conclusions. To facilitate this task, we replicated their approach and crossed the frame-of-reference variable with all other moderators, which allowed us to reduce moderator-induced heterogeneity and to assess whether the influence of outliers and/or publication bias varied across sub-distributions.

We used data from Shaffer and Postlethwaite that included 113 correlation coefficients [31]. Unless otherwise noted, our sensitivity analyses were conducted using Comprehensive Meta-Analysis (CMA, version 2.0 [32]) and follow the recommendations of Greenhouse and Iyengar [11] and Kepes et al. [33]. Because CMA is based on the Hedges and Olkin [34] tradition of meta-analysis, our results differ slightly from those obtained with the psychometric meta-analysis method [35] used by Shaffer and Postlethwaite [36]. We note that the reliability coefficients of the personality scales in this data set range from .79 to .87 (based on measures in the data set that reported at least three reliability coefficients).

To facilitate understanding of our analysis for readers in varying disciplines, we use the term “distribution” to refer to a set of effect sizes. When the effect sizes are sub-divided into smaller groups based on their values on one or more moderator variables, we refer to these subsets of effect sizes as “sub-distributions.” Consistent with the personnel selection literature, we use the term “validity” to describe the correlation between one measure, in this case a self-report assessment of conscientiousness, and a measure of job performance.

First, we derived random-effects (RE) meta-analytic estimates. Second, we conducted one-sample-removed analyses to examine the influence of each individual sample on the meta-analytic results [37]. Next, we performed publication bias analyses using multiple methods to triangulate the effect size estimate [38] and to identify the possible range of point estimates (i.e., mean correlations) rather than relying on a single estimate [33]. We used contour-enhanced funnel plots [39], the trim and fill analysis with the L0 estimator [40], selection models [41], and cumulative meta-analysis by precision [37] to perform our publication bias analyses. A modified confunnel command in Stata was used to create the contour-enhanced funnel plots [33]. A priori selection models were conducted in R [42] with the p-value cut-points suggested by Vevea and Woods [41] to model moderate and severe instances of publication bias. In addition, using R, we ran tests of excess significance (P-TES; [43, 44]), PET-PEESE (precision-effect test, precision effect estimate with standard error) analyses [45], whereby PET is Stanley and Doucouliagos’ (formula 6 [45]) modified version of Egger’s test of the intercept [46], and p-uniform analyses [47]. P-TES estimates the probability of the obtained results given the statistical power of the primary studies. Thus, contrary to the other analyses, P-TES does not provide an effect size estimate that is adjusted for publication bias. When estimating power for the primary studies, we used the random-effects mean from the distribution as the estimate of the population correlation (ρ) and set the significance level at .05. A set of effects with a probability of less than .1 is typically considered to lack credibility [44]. Finally, we used Viechtbauer and Cheung’s outlier and influence diagnostics to identify potential outliers [48]. This procedure was conducted in R; it includes seven ‘leave-one-out’ diagnostic measures, specifically adapted or developed for the meta-analytic context, that examine the influence of each individual study. Viechtbauer [49] describes these diagnostics and the criteria for determining which study may be considered an outlier. We ran all of our analyses with and without the identified outlier(s). As recommended, our results are presented with and without outliers, and we only assess the presence of bias in distributions consisting of at least 10 samples because conclusions from smaller distributions are questionable due to the lack of statistical power and second-order sampling error [33, 35, 50].
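For readers who wish to see the mechanics of PET-PEESE, a minimal sketch in Python follows (our analyses were conducted in R, as noted above; this is an illustration, not the code we ran). The sampling variance approximation var(r) = (1 − r²)² / (n − 1) and all variable names are our assumptions.

```python
# Illustrative PET-PEESE sketch on untransformed correlations.
# Assumption: var(r) ~= (1 - r^2)^2 / (n - 1).
import numpy as np
import statsmodels.api as sm

def pet_peese(r, n):
    r, n = np.asarray(r, float), np.asarray(n, float)
    se = np.sqrt((1 - r**2) ** 2 / (n - 1))   # approximate SE of each r
    w = 1.0 / se**2                           # precision weights

    # PET: WLS regression of r on SE; the intercept is the estimated
    # effect for a hypothetical, infinitely precise study (SE = 0).
    pet = sm.WLS(r, sm.add_constant(se), weights=w).fit()
    b0 = pet.params[0]
    p_one = pet.pvalues[0] / 2 if b0 > 0 else 1 - pet.pvalues[0] / 2

    # PEESE: regression on the variance (SE^2); used as the adjusted
    # mean only when the PET intercept is significant (one-tailed p < .05).
    peese = sm.WLS(r, sm.add_constant(se**2), weights=w).fit()
    adjusted = peese.params[0] if p_one < .05 else b0
    return {"pet": b0, "pet_p_one_tailed": p_one,
            "peese": peese.params[0], "adjusted_mean": adjusted}
```

The conditional step mirrors Stanley and Doucouliagos [45]: the PEESE estimate replaces the PET intercept only when PET signals a nonzero underlying effect.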

We relied on the decision rules offered in Kepes et al. to determine the range of mean validity estimates and the magnitude of publication bias [33]. These decision rules are summarized here. First, we estimated the baseline validity, defined as the RE meta-analytic mean (r̄). Next, we performed several sensitivity analyses, including the one-sample-removed analysis (osr), the trim and fill analysis (t&f), and selection models with moderate (sm_m) and severe (sm_s) assumptions of publication bias, to derive additional mean validity estimates. We also conducted P-TES, PET-PEESE, and p-uniform analyses. We defined the highest validity estimate as the highest value from any analysis that provided an adjusted effect size estimate (r̄, r̄_osr, r̄_t&f, r̄_sm_m, r̄_sm_s, and r̄_PET-PEESE) [33]. We excluded the results from p-uniform due to their lack of convergence with the results from the other, more established methods; this is likely due to the heterogeneity of our data [47].
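To illustrate the one-sample-removed analysis, the sketch below recomputes the random-effects mean with each sample excluded and reports the minimum, median, and maximum of the resulting means. It is a Python approximation (the published analyses used CMA) and assumes DerSimonian-Laird estimation in the Fisher z metric.

```python
# Illustrative one-sample-removed (osr) analysis.
import numpy as np

def dl_mean(z, v):
    """DerSimonian-Laird random-effects mean of Fisher z effects."""
    w = 1.0 / v
    mu_fe = np.sum(w * z) / np.sum(w)          # fixed-effect mean
    q = np.sum(w * (z - mu_fe) ** 2)           # heterogeneity statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c)    # between-study variance
    w_star = 1.0 / (v + tau2)
    return np.sum(w_star * z) / np.sum(w_star)

def one_sample_removed(r, n):
    z = np.arctanh(np.asarray(r, float))       # Fisher z transform
    v = 1.0 / (np.asarray(n, float) - 3)       # sampling variance of z
    means = np.tanh([dl_mean(np.delete(z, i), np.delete(v, i))
                     for i in range(len(z))])  # back-transform to r
    return means.min(), np.median(means), means.max()
```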

We defined the lowest validity estimate as the smallest value from any of these analyses (i.e., r̄, r̄_osr, r̄_t&f, r̄_sm_m, r̄_sm_s, and r̄_PET-PEESE). We defined the baseline range estimate (BRE) as the absolute difference between r̄ and the validity estimate farthest away from it (either the lowest or the highest value). We defined the maximum range estimate (MRE) as the absolute difference between the lowest and the highest value. When calculating the relative difference of the range estimates, we used r̄, the potentially best mean estimate, as the base (i.e., as 100%). Consistent with Kepes et al., we characterized the magnitude of publication bias as negligible if the relative range (BRE or MRE) was smaller than 20%, as moderate if it was between 20% and 40%, and as large if it was larger than 40%. For the P-TES estimates, we used the decision rules from Francis [44] to determine whether the data were suspect (i.e., a probability of .1 or less is consistent with an inference that the data should be viewed with skepticism). We find the decision rules from Kepes and colleagues reasonable and note that other researchers have used them [54] and that, to date, no critiques of them have been offered. However, readers may choose to adopt other decision rules. We provide the data and results needed to assist the reader in such an effort.
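A worked example of these decision rules, using hypothetical values rather than results from our tables:

```python
# Hypothetical adjusted mean estimates; r_bar is the RE meta-analytic mean.
r_bar = 0.16
adjusted = {"osr": 0.16, "t&f": 0.13, "sm_m": 0.15,
            "sm_s": 0.12, "PET-PEESE": 0.13}
vals = [r_bar, *adjusted.values()]
lo, hi = min(vals), max(vals)

bre = max(abs(r_bar - lo), abs(r_bar - hi))  # distance to farthest estimate
mre = hi - lo                                # lowest-to-highest spread
rel_bre, rel_mre = 100 * bre / r_bar, 100 * mre / r_bar  # r_bar = 100%

def label(pct):  # decision rules from Kepes et al. [33]
    return "negligible" if pct < 20 else ("moderate" if pct <= 40 else "large")

print(f"BRE = {bre:.2f} ({rel_bre:.0f}%, {label(rel_bre)})")
print(f"MRE = {mre:.2f} ({rel_mre:.0f}%, {label(rel_mre)})")
# BRE = 0.04 (25%, moderate); MRE = 0.04 (25%, moderate)
```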

Each sensitivity analysis has limitations, which is why we ran multiple analyses and sought convergence across methods. Next, we address the strengths and weaknesses of trim and fill, given issues raised by reviewers. A PubMed search for “trim and fill” (or “trim & fill”) from 1999, the date of the dissertation that introduced the method, through 2014 yielded 142 citations; a search of ProQuest dissertations yielded 187. We offer this as an indication that the method has many adherents. The primary weakness of the method is that its results can be inaccurate in the presence of heterogeneity (i.e., variance not due to random sampling error) [40, 51, 52]. Thus, in our analyses, the credibility of the trim and fill results is strongest in those sub-distributions in which we control for moderators (and thus control, to some degree, for heterogeneity) [33]. Regarding the influence of heterogeneity on PET-PEESE, Moreno et al. conducted a comprehensive simulation study that included variants of Egger’s test of the intercept [53]. Two of these variants (the fixed-effects model and the fixed-effects variance model) correspond to the two components of PET-PEESE. They concluded that these variants can be inappropriate in very heterogeneous settings [53]. Similarly, as noted in the description of p-uniform [47], this method overestimates the mean effect as heterogeneity increases. To the extent that our data set is heterogeneous, p-uniform, and perhaps also PET-PEESE, could be inappropriate. However, because both methods are relatively new, we argue that it is informative to apply them to our data set to see the extent to which their results converge with the results of the other, more established methods.

Finally, we note that some analyses use Fisher’s z transformation of Pearson’s correlation coefficient (i.e., r). The transformation is used in some statistical methods, in part, because it makes the sampling distribution symmetrical. Given the relatively small magnitude of our correlations, the Fisher z coefficients and the untransformed correlation coefficients were nearly identical. Still, in the interest of making our analyses clear and our results fully replicable, we detail which statistical methods transformed correlation coefficients into Fisher z. All methods that used Fisher z coefficients in their calculations back-transformed the results into untransformed correlation coefficients. The meta-analyses that yielded the random-effects mean, the confidence interval for the mean, the Q test, the I² statistic, and the tau estimate were conducted with CMA, which uses Fisher z coefficients in its calculations. Although CMA does not provide the prediction interval, we calculated it using output from CMA, which again does its calculations using Fisher z coefficients. Likewise, the one-sample-removed analyses and the trim and fill analyses were conducted using CMA and were thus based on Fisher z coefficients. The selection models use Fisher z transformed correlations as well, and p-uniform was also conducted on Fisher z coefficients. The PET-PEESE and outlier analyses were conducted using untransformed correlation coefficients. We emphasize that all reported coefficients are in the metric of untransformed correlations and thus can be compared.
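The sketch below illustrates this workflow end to end: transform to Fisher z, estimate the random-effects mean, form the 90% prediction interval, and back-transform to the r metric. The DerSimonian-Laird estimator and a t-based prediction interval with k − 2 degrees of freedom are our assumptions; CMA's implementation may differ in detail.

```python
# Illustrative Fisher z workflow with a 90% prediction interval.
import numpy as np
from scipy import stats

def re_mean_and_prediction_interval(r, n, level=0.90):
    z = np.arctanh(np.asarray(r, float))     # Fisher z transform
    v = 1.0 / (np.asarray(n, float) - 3)     # sampling variance of z
    w = 1.0 / v
    mu_fe = np.sum(w * z) / np.sum(w)
    q = np.sum(w * (z - mu_fe) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c)  # between-study variance
    w_star = 1.0 / (v + tau2)
    mu = np.sum(w_star * z) / np.sum(w_star)
    se_mu = np.sqrt(1.0 / np.sum(w_star))
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=len(z) - 2)
    half = t_crit * np.sqrt(tau2 + se_mu**2)  # PI half-width in z metric
    # back-transform the mean and interval bounds into the r metric
    return np.tanh(mu), (np.tanh(mu - half), np.tanh(mu + half))
```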

Results

Using the approach detailed by Viechtbauer and Cheung [48] and the diagnostics and criteria for determining whether a particular study is an outlier described by Viechtbauer [49], we identified one outlier (the correlation coefficient from Lao [55]). We verified that the study was correctly coded (see [55], p. 32, Table 1). The sample is composed of police officers (“State Troopers”). Other research has found lower than typical prediction of law enforcement job performance from measures of general cognitive ability and employment interviews [33, 56]. Hirsh and colleagues speculated that the lower magnitude correlations may be due to supervisors having limited opportunity to observe the work of police officers [56]; police officers typically patrol alone in their cars, out of the view of their supervisors. Our results by sub-distribution are presented in Table 1 for all primary samples; S1 Table contains the results without the one identified outlier.
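To make the screening step concrete, the sketch below implements one member of this family of leave-one-out diagnostics, the externally studentized residual, on untransformed correlations (consistent with our outlier analyses). The DerSimonian-Laird estimator, the variance approximation, and the |t| > 1.96 flagging threshold are our illustrative assumptions; the full procedure uses several additional measures [48, 49].

```python
# Illustrative leave-one-out outlier diagnostic (studentized deleted residuals).
import numpy as np

def dl_fit(y, v):
    w = 1.0 / v
    mu_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fe) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)
    w_star = 1.0 / (v + tau2)
    return np.sum(w_star * y) / np.sum(w_star), tau2, 1.0 / np.sum(w_star)

def studentized_deleted_residuals(r, n, cut=1.96):
    r = np.asarray(r, float)
    v = (1 - r**2) ** 2 / (np.asarray(n, float) - 1)  # approximate var of r
    t = np.empty(len(r))
    for i in range(len(r)):
        # refit the RE model without sample i, then standardize its deviation
        mu_i, tau2_i, var_mu_i = dl_fit(np.delete(r, i), np.delete(v, i))
        t[i] = (r[i] - mu_i) / np.sqrt(v[i] + tau2_i + var_mu_i)
    return t, np.where(np.abs(t) > cut)[0]  # residuals and flagged indices
```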

Table 1 contains the results of the conscientiousness analyses conducted for the full distribution; publication bias results are offered for all sub-distributions with at least 10 correlations. The first two columns in Table 1 show the distribution analyzed and the number of samples (k) in the distribution. Columns three through nine display the results from the meta-analytic RE model: the mean observed correlation (r̄), the associated 95% confidence interval (95% CI), the associated 90% prediction interval (90% PI), the Q statistic, I², τ, and the one-sample-removed analysis (minimum, maximum, and median mean validity estimates). The next four columns (10 through 13) contain the results from the trim and fill analysis, including the side of the funnel plot where the samples were imputed (FPS; a left-hand side imputation is consistent with an inference of publication bias resulting from the suppression of small magnitude effect sizes [33, 40]), the number of imputed samples (ik), the trim and fill adjusted observed mean correlation (r̄_t&f), and the trim and fill adjusted 95% confidence interval (t&f 95% CI). Columns 14 and 15 display the results from the moderate and severe selection models, including their respective adjusted observed estimates for instances of moderate and severe publication bias (r̄_sm_m and r̄_sm_s) and their respective variance components. Column 16 provides the probability for the test of excess significance (P-TES). We report the probability of the chi-square test as the P-TES value and note that this is a probability of excess significance, not an effect size. The next two columns, 17 and 18, display the PET (precision-effect test) and PEESE (precision effect estimate with standard error) adjusted observed mean estimates (i.e., r̄_PET and r̄_PEESE, respectively; the PET column also includes its associated one-tailed p-value, which is used to determine whether the PET or the PEESE value is the adjusted observed mean for the meta-analytic distribution [45]). The final column contains the p-uniform adjusted estimate of the mean effect size and its 95% confidence interval (p-uniform [95% CI]).

We note that the P-TES values changed substantially in a few distributions when the sole outlier was dropped, suggesting that the outlier substantially influenced the P-TES value. These differences can be examined by comparing Table 1 with S1 Table. For example, for the non-journal article sub-distribution of effect sizes, the P-TES value was .53 with the outlier included but .95 with the outlier dropped. For the non-journal articles that used a non-contextualized measure, the P-TES was .65 with the outlier and .76 without it. When the purpose of the measure was classified as general purpose, the P-TES was .32 with the outlier and .84 without it. When the research design was concurrent, the P-TES value with the outlier included was .12 and thus approached the value (.10) at which one might infer a non-credible data set; however, the P-TES rose to .49 when the outlier was removed. Based on these results, when P-TES is used as a sensitivity analysis in a meta-analysis, we recommend that it be conducted with and without outliers to determine the robustness of the results. Using the typical criterion of .10 or less [44], neither the full distribution nor the sub-distributions were judged to be non-credible sets of data. Concerning the results reported in Table 1, we found varying degrees of robustness in the meta-analytic mean (i.e., validity) estimates for conscientiousness. For the entire distribution (k = 113), the RE meta-analytic mean estimate (.16) was robust to the one-sample-removed analyses (i.e., the mean estimate did not change). However, the 90% prediction interval, which indicates the likely range of “true” effect sizes, is relatively wide (.03, .29). Furthermore, the RE meta-analytic mean estimate was not robust to all publication bias analyses. Specifically, the trim and fill estimate of .13 and the severe selection model estimate of .12 were noticeably smaller in magnitude than the RE estimate. Confirming these results, the PET-PEESE estimate was .13 (because PET was significant, the PEESE adjusted mean estimate was selected [45]). The PET test (.09, p < .001) supports the results from the trim and fill analysis by indicating that the effect size distribution is asymmetric, that is, that small magnitude effect sizes are likely to be missing from the meta-analytic distribution.
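Because the chi-square form of the test of excess significance may be unfamiliar, a simplified sketch follows. Power is approximated through the Fisher z test of a correlation with α = .05, and ρ is the assumed population correlation (the RE mean, as described in the method section); this illustrates the logic rather than reproducing the exact code behind Table 1.

```python
# Illustrative chi-square test of excess significance (P-TES).
import numpy as np
from scipy import stats

def p_tes(r, n, rho, alpha=0.05):
    r, n = np.asarray(r, float), np.asarray(n, float)
    z_crit = stats.norm.ppf(1 - alpha / 2)         # two-tailed critical value
    ncp = np.arctanh(rho) * np.sqrt(n - 3)         # noncentrality per study
    power = (1 - stats.norm.cdf(z_crit - ncp)      # power to detect rho
             + stats.norm.cdf(-z_crit - ncp))
    sig = np.abs(np.arctanh(r)) * np.sqrt(n - 3) > z_crit
    O, E, k = sig.sum(), power.sum(), len(r)       # observed vs expected counts
    A = (O - E) ** 2 / E + (O - E) ** 2 / (k - E)  # chi-square statistic (1 df)
    return stats.chi2.sf(A, df=1)                  # small p with O >> E is suspect
```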

The contour-enhanced funnel plot (see Fig 1a) shows that all but one of the 23 imputed samples fell in the area of statistical insignificance, which is consistent with an inference of publication bias stemming from the suppression of small magnitude correlations [33, 50]. The forest plot for the cumulative meta-analysis by precision, shown in Fig 2a, suggests that as sample sizes decrease, there is a noticeable drift toward higher validities. The cumulative point estimate starts at .07 (N_cum [cumulative sample size] = 2,717; k_cum [cumulative number of samples] = 4) with relatively large samples and increases to .13 (N_cum = 9,250; k_cum = 28) with the addition of smaller samples. Finally, the validity estimate increases to .16 (N_cum = 19,625; k_cum = 113) with the addition of even smaller samples. This pattern is consistent with an inference of publication bias resulting from the suppression of small magnitude correlations (from small samples). These patterns, especially the one from the contour-enhanced funnel plot, are also inconsistent with the notion that small sample bias (i.e., small sample studies showing systematic differences from larger sample studies due to assessing different populations or using measures of different sensitivity) is the cause of the observed results [33, 50]. We conclude that publication bias has likely affected the observed mean validity of conscientiousness for predicting job performance such that it is likely smaller in magnitude than the RE meta-analytic mean of .16. We note that most of the bias stems from journal articles (see Table 1 as well as Figs 1 and 2), which is consistent with an inference of the suppression of statistically non-significant results. Thus, it is the literature published in journals that is largely responsible for distorting the research on the validity of conscientiousness.
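The drift described above can be reproduced with a simple procedure: sort the samples from largest to smallest n and recompute the random-effects mean as each sample is added. The sketch below assumes DerSimonian-Laird estimation in the Fisher z metric.

```python
# Illustrative cumulative meta-analysis by precision.
import numpy as np

def dl_mean(z, v):
    w = 1.0 / v
    mu_fe = np.sum(w * z) / np.sum(w)
    q = np.sum(w * (z - mu_fe) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c) if len(z) > 1 else 0.0
    w_star = 1.0 / (v + tau2)
    return np.sum(w_star * z) / np.sum(w_star)

def cumulative_by_precision(r, n):
    order = np.argsort(n)[::-1]                  # largest samples first
    z = np.arctanh(np.asarray(r, float))[order]
    v = 1.0 / (np.asarray(n, float)[order] - 3)
    # running RE mean (back-transformed to r) after each added sample;
    # an upward drift as small samples enter is consistent with the
    # suppression of small-magnitude effects
    return [float(np.tanh(dl_mean(z[:k], v[:k]))) for k in range(1, len(z) + 1)]
```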

Fig 1. Three contoured funnel plots for the validity of conscientiousness by data source. (A) Conscientiousness data from all data sources. (B) Conscientiousness data from journal articles. (C) Conscientiousness data from non-journal sources. Correlations are graphed as circles with an X-axis of correlation magnitude and a Y-axis of the inverse standard error of the correlation. The filled black circles represent the observed correlations and the clear circles represent the trim-and-fill imputed correlations. The clear area contains correlations that are not statistically significant (p > .05). The darkest gray area contains correlations that may be described as marginally significant (p-values ranging from .05 to .10). The lighter gray area contains correlations that are statistically significant (p < .05). Note that most of the imputed correlations are found in the data distribution drawn from studies published in journals; relatively few of the imputed correlations are found in the data distribution drawn from unpublished studies. This fact is consistent with an inference that publication bias in the full data distribution is largely due to the suppression of statistically insignificant correlations in journal published articles. Thus, it is the journal articles that are largely responsible for distorting the research on the validity of conscientiousness. https://doi.org/10.1371/journal.pone.0141468.g001

Fig 2. Three forest plots for the validity of conscientiousness by data source. (A) Conscientiousness data from all data sources. (B) Conscientiousness data from journal articles. (C) Conscientiousness data from non-journal sources. Forest plots for the cumulative meta-analyses by precision for the validity of conscientiousness (i.e., the correlation between conscientiousness and job performance) are displayed. To obtain the plots, validities were sorted from the largest sample size to the smallest and entered into the meta-analysis one at a time in an iterative manner. The lines around the plotted means are the 95% confidence intervals for the meta-analytic means. For panels A and B, the mean validities drift from smaller to larger as correlations from smaller and smaller sample size studies are added to the distribution being analyzed. For panel C, no noticeable drift is observed. The drifts from smaller to larger meta-analytic means are consistent with an inference of statistically insignificant correlations from smaller sample size studies being suppressed (i.e., publication bias). The lack of meaningful drift in panel C suggests that the data suppression is largely in the journal published articles (see panel B). Thus, it is the data published in journal articles that are largely responsible for distorting the research on the validity of conscientiousness. https://doi.org/10.1371/journal.pone.0141468.g002

The results from the analyses without the outlier were similar. Therefore, we do not discuss them separately; the results for all analyses without the outlier are provided in the supplementary materials (see S1 Table).

Our findings, including the range estimates (BRE and MRE) and conclusions, are summarized in Table 2 (S2 Table contains the conclusions for the sub-distributions without the sole outlier). We note that the range estimates are not necessarily perfectly comparable when the severe selection model did not provide a sensible solution (indicated by n/a in Table 1; see [41]). For these distributions, the observed range estimates may be smaller than those for distributions where the full range of estimates is available. In addition, the results of the p-uniform analyses did not converge well with the results from the other, more established methods, most likely due to the heterogeneity in the data. The article that introduced p-uniform [47] provided simulation evidence that it noticeably overestimates the effect size as heterogeneity increases. Our I² values are typically near 50%, indicating non-trivial heterogeneity, which adversely affects the performance of p-uniform [47]. Correspondence with one of the authors of the article introducing the p-uniform method, while informative, did not yield a decision rule concerning the magnitude of I² values at which p-uniform should not be used [57]. Because of this nonconvergence and our substantial uncertainty about the appropriateness of the p-uniform approach for these data, we excluded the p-uniform results from our conclusions in Table 2 and S4 Table (for conclusions that retain the sole outlier and include the p-uniform results, see S3 Table).
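For reference, the heterogeneity statistics discussed here (Q, I², and τ) can be computed as sketched below; the Fisher z metric and the DerSimonian-Laird estimator are our assumptions, so values may differ slightly from CMA's output. An I² near 50% indicates that roughly half of the observed variability in effect sizes reflects more than random sampling error.

```python
# Illustrative heterogeneity statistics in the Fisher z metric.
import numpy as np

def heterogeneity(r, n):
    z = np.arctanh(np.asarray(r, float))
    v = 1.0 / (np.asarray(n, float) - 3)
    w = 1.0 / v
    mu_fe = np.sum(w * z) / np.sum(w)
    q = np.sum(w * (z - mu_fe) ** 2)        # Cochran's Q
    df = len(z) - 1
    i2 = max(0.0, (q - df) / q) * 100       # % variance beyond sampling error
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau = np.sqrt(max(0.0, (q - df) / c))   # between-study SD (z metric)
    return q, i2, tau
```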

Based on the sum of evidence, we conclude that the conscientiousness data are not meaningfully influenced by the sole outlier. We also found that, in general, the data on conscientiousness are noticeably affected by publication bias. Thus, the apparent suppression of small magnitude effect sizes, which the contour-enhanced funnel plots indicated to lie predominantly in the area of statistical insignificance, has likely led to an overestimation of the validity of conscientiousness. The results for the subgroup distributions of samples from journal articles (k = 67) and non-journal sources (k = 46) support this notion: samples published in journal articles reported larger average effect size estimates (r̄ = .19) than samples from non-journal sources (r̄ = .12; see Table 1). Distributions involving journal articles also tended to be the most non-robust, typically with differences of at least .10 and overestimations of more than 60% (see Table 2). For illustrative purposes, we also provide the contour-enhanced funnel plots for both of these distributions as well as the forest plots from the respective cumulative meta-analyses by precision (see Fig 1b and 1c as well as Fig 2b and 2c). The contour-enhanced funnel plots and the cumulative meta-analyses by precision support an inference of publication bias and an overestimation of the mean validity for data from journal articles as well [33]. By contrast, the data from non-journal sources seem to be relatively robust to publication bias (see Table 2). Thus, it is the data from journal articles that are largely responsible for distorting the research on the validity of conscientiousness.

In addition, we found that the RE mean validity estimates for distributions involving contextualized measures of conscientiousness were sometimes more robust than the mean estimates for distributions involving non-contextualized measures. For the distribution of all non-contextualized measures of conscientiousness (k = 91), the 90% prediction interval ranged from .00 to .29. By contrast, the prediction interval for the distribution of contextualized measures (k = 22) ranged only from .16 to .22. However, for many other distributions, the contextualization of conscientiousness measures did not matter; contextualized and non-contextualized sub-distributions were often non-robust to a similar degree (moderately or even largely non-robust; see Table 2).

Although one may argue that the absolute differences between the RE meta-analytic mean estimates and the publication bias adjusted mean estimates tend to be rather small in magnitude (i.e., approximately .06 for most distributions), the relative differences tend to be noticeable (i.e., typically greater than 30%) and may be interpreted as moderate in size [6, 33]. Furthermore, for data from journal articles, the overestimation appears to be large for contextualized as well as non-contextualized measures of conscientiousness (see Tables 1 and 2).

At a reviewer's request, statistical significance tests for the moderator subgroups analyzed in Table 1 are provided in Table 3; results in S4 Table are for the data set with the sole outlier removed.