Besides issues related to the reliability of data and the fit to data, there is the general problem of induction, going back to the philosopher David Hume. Hume's problem is, roughly, the worry whether nature is uniform and whether things will continue to go on in the same ways (Ref 57, 1.3.6). The specific worry in connection with arguments for a model's adequacy for long‐term climate projections is whether the model's success in representing past and present climate provides good reasons to assume that it is adequate for projecting future climate. Here are two instances of this worry.

The first concerns calibration or tuning. The values of the parameters involved in parameterizations are often poorly constrained by an understanding of the underlying processes and need to be calibrated against observational data. Model calibration is unavoidable in climate modeling and routinely done, but it has so far rarely been discussed and documented systematically.58, 59 Model calibration consists in choosing a parameter configuration so that the model results better fit data about past and present climate.60 The worry is that if the fit to data is due to calibration, it does not provide a strong reason for the adequacy of the model for long‐term projections.3, 4, 9, 49, 61 One reason is that parameters are often not tuned to their 'correct' values; calibration makes it possible to compensate for structural errors by introducing compensating biases (e.g., in climate sensitivity and radiative forcing) during the calibration process.39, 62 The calibration of a model may thus guarantee success with respect to past and present climate irrespective of whether the model correctly accounts for the underlying processes that are relevant to the long‐term evolution of the climate system. A second reason why model fit that is due to tuning does not provide a strong reason for a model's adequacy for projections is that the choice of parameters or model structure may be underdetermined by the data used in calibration. At least in some cases, different sets of parameter values result in equally good fits to the data.63 Models that perform equally well on the calibration dataset can disagree in out‐of‐sample applications and thus in long‐term projections. For example, calibrating a model to short trends in global mean surface temperature (GMST) provides only weak constraints on projections of future climate.24, 64 Both reasons substantiate the worry that the performance of a model with respect to the future might not resemble its performance with respect to the data to which it was tuned. It is thus unclear whether model success can be extrapolated from past and present to the future. A priori we should not expect that it can, yet this is often done implicitly.
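Both points can be made vivid with a minimal sketch, assuming nothing beyond a zero‐dimensional energy‐balance model with invented numbers (none of the parameter values or forcings below come from the literature cited here): two configurations pair higher climate sensitivity with stronger aerosol cooling and vice versa, fit the same idealized historical record about equally well, and yet diverge markedly under a high‐forcing scenario in which aerosols are phased out.

```python
import numpy as np

def ebm(forcing, lam, C, dt=1.0):
    """Zero-dimensional energy-balance model: C dT/dt = F(t) - lam*T,
    with T the GMST anomaly (K), lam a feedback parameter (W m^-2 K^-1),
    and C an effective heat capacity."""
    T = np.zeros(len(forcing))
    for i in range(1, len(forcing)):
        T[i] = T[i-1] + dt * (forcing[i-1] - lam * T[i-1]) / C
    return T

yrs = np.arange(100.0)
f_ghg = 0.03 * yrs     # idealized greenhouse-gas forcing ramp (W m^-2)
f_aer = 0.012 * yrs    # idealized, uncertain aerosol cooling (W m^-2)

# Two configurations: higher sensitivity (smaller lam) is paired with
# stronger aerosol cooling ('scale'), so errors compensate in calibration.
configs = {"A": dict(lam=1.3, C=13.0, scale=1.00),
           "B": dict(lam=0.9, C=9.0, scale=1.46)}

for name, p in configs.items():
    hist = ebm(f_ghg - p["scale"] * f_aer, p["lam"], p["C"])
    print(name, "historical warming:", round(hist[-1], 2), "K")
# -> nearly identical fits: the historical record cannot discriminate.

# High-forcing scenario: GHG forcing keeps rising, aerosols are phased out.
fut_ghg = 0.03 * (100.0 + yrs)
fut_aer = 1.2 * np.clip(1.0 - yrs / 50.0, 0.0, None)
for name, p in configs.items():
    full = np.concatenate([f_ghg - p["scale"] * f_aer,
                           fut_ghg - p["scale"] * fut_aer])
    print(name, "projected warming:",
          round(ebm(full, p["lam"], p["C"])[-1], 2), "K")
# -> projections diverge: calibration hid compensating errors.
```

Nothing in the fitted record distinguishes the two configurations; only information that goes beyond the fit, for instance independent constraints on the individual forcings or on climate sensitivity, could.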

The familiar strategy to avoid this problem is to split the data and use one part of a dataset to calibrate a model and the other part to test it.3 This has triggered a debate about whether data used in calibrating a model can nonetheless also be used in evaluating it; that is, whether double‐counting is legitimate, or whether data used for evaluation should not be used in calibration and need, in this sense, to be use‐novel (see Ref 65 for an overview). Climate scientists often declare double‐counting illegitimate.d This is an overstatement if it implies that the accommodation of data through calibration provides no epistemic support for the model. Successful calibration can confirm a model at least to some extent, because it is far from trivial that a model can be successfully calibrated: climate models are evaluated and calibrated on a large number of variables and scales, but calibration usually involves only a limited number of parameters (Ref 24, p. 174). Philosophers, on the other hand, have argued that from a Bayesian perspective there is no difference between calibrating and confirming, and thus no problem with using the same data to calibrate and confirm a model. For the Bayesian, calibration 'is simply the common practice of testing hypotheses against evidence' (Ref 66, p. 615). Frisch24 has shown that this claim does not follow from Bayesian formalism alone, and that in the case of complex climate models, fit to data not used in calibration is a better test of a model's adequacy for long‐term projections than fit to data used to calibrate it.e Thus, to assess the strength of a non‐deductive argument from model fit to the adequacy of the model for long‐term projections, we need information about the extent to which the fit in question depended on tuning and whether the tuned elements on which the fit depends have been tested out‐of‐sample (Ref 9, p. 245). The strength of such an argument also depends on how independent the data used to calibrate the model are from the data used to test it, and on the extent to which data not used in explicit calibration guided model construction in some other way, for example, by influencing choices of model structure and design.
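As a concrete, if simplified, picture of data splitting, the following sketch calibrates a single parameter on the first part of a synthetic temperature record and evaluates the result on a held‐out, use‐novel segment (the record, the toy model, and all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
yrs = np.arange(120.0)
forcing = 0.02 * yrs
# Synthetic 'observed' record: a simple response plus observational noise.
observed = forcing / 1.1 + rng.normal(0.0, 0.08, yrs.size)

def model(forcing, lam):
    # Deliberately crude: instantaneous equilibrium response T = F / lam.
    return forcing / lam

# Split: first 80 'years' for calibration, last 40 held out as use-novel data.
cal, ev = slice(0, 80), slice(80, 120)

# Calibrate lam by grid search on the calibration segment only.
lams = np.linspace(0.5, 2.0, 151)
rmse = [np.sqrt(np.mean((model(forcing[cal], l) - observed[cal]) ** 2))
        for l in lams]
best = lams[int(np.argmin(rmse))]

# Evaluate on data the calibration never saw.
rmse_ev = np.sqrt(np.mean((model(forcing[ev], best) - observed[ev]) ** 2))
print(f"tuned lam = {best:.2f}, held-out RMSE = {rmse_ev:.3f} K")
```

Note that a low held‐out error of this kind still only tests the model within the observed regime, which matters for the more fundamental worry discussed below.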

The second worry is a more fundamental one. Even if climate model results fit data that were not used in calibration, this may not provide strong reasons for a model's adequacy for long‐term projections. Long‐term projections for high‐forcing scenarios lie outside the range of boundary conditions previously observed in the instrumental record. At least prima facie, there is no reason to assume that successful performance under current boundary conditions is a good guide to successful performance under the future boundary conditions that describe high‐forcing scenarios. We are confident that the physical principles on which climate models are based can be extrapolated beyond the range where they are evaluated. However, this is less clear for parameterizations that are empirically derived from observations mostly covering the narrow climate regime of the past century, and for the interaction between parameterizations and physical principles. For long‐term projections, additional processes and feedbacks (e.g., methane emissions from thawing permafrost) may become relevant and take the system out‐of‐sample with respect to existing observations. If a model does not account for these processes and feedbacks, it could fit almost perfectly even to data about past and present climate not used for calibration and still be biased for projections. Success with respect to past and present climate alone is thus no assurance that the model will also be successful in projecting future climate (Ref 2, p. 828).7, 12, 24, 67 Some climate scientists conclude from this that it is hard to tell how relevant past data are, or even that they are not relevant at all, for evaluating a model's adequacy for climate projections (Ref 68, p. 2146). This conclusion may be too hasty, but the considerations behind it show that to further strengthen an argument from model fit to the adequacy of the model for long‐term projections, we need independent reason to assume that the model captures the relevant climate processes and feedbacks.f

Worries related to calibration and missing feedbacks can also be mitigated by testing model results against data about paleoclimate epochs.69-72 Paleoclimate states provide partly independent information not used in model development, and they were driven by forcings quite different from those of modern climate. However, the boundary conditions and the data are limited and derived from proxies, which introduces large uncertainties; moreover, such data are increasingly used in model development and evaluation, which weakens the argument of an independent test.

Reliably indicating past and present climate cannot ensure that a model is adequate for long‐term projections, but under certain conditions it warrants increased confidence in those projections. Claims about the empirical accuracy of model results should therefore be understood as premises of a non‐deductive argument for the model's adequacy for projections of the desired kind. Non‐deductive strength is non‐monotonic; that is, adding premises can yield a stronger or weaker argument (see Box 2). Hence, the evaluation of such arguments needs to assess whether all relevant information has been taken into account and whether the conditions for increased confidence are met. Whether this is indeed the case is often hard to decide, which makes the evaluation of non‐deductive arguments difficult.

FIT TO RESULTS OF OTHER MODELS: ROBUSTNESS

Climate model projections are too long‐term to allow repeated direct comparison with data, but they can be compared with the projections of other models or model versions. Climate modelers do this extensively in ensemble studies because of uncertainty about how to represent the climate system so that models yield accurate projections of future climate.

In an ensemble study, each of several climate models or model versions is run with the same or similar initial and boundary conditions. There are two main types of such studies (Ref 67, p. 582). Perturbed physics (or parameter) ensemble studies employ different versions of the same model that differ in the values of their uncertain parameters; they are, in effect, parameter sensitivity tests. In this way, the ensemble explores how climate projections are affected by uncertainty about the values that should be assigned to model parameters. Multimodel ensemble studies employ several models that differ in a number of ways, for example, in the number and complexity of processes included, parameterizations, spatiotemporal resolution, numerical methods, and computing platforms. In this way, the ensemble explores how climate projections are affected by structural as well as parametric uncertainty; that is, uncertainty about the form the modeling equations should take and how they should be solved computationally. The most ambitious multimodel ensemble study to date is Phase 5 of the Coupled Model Intercomparison Project (CMIP5), which has collected results from about 60 models from nearly 30 modeling centers around the world.73 Both types of ensemble studies often also include a limited investigation of initial‐condition uncertainty by running the same experiments multiple times with different initial conditions.
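A perturbed physics ensemble can be sketched in a few lines. In the following minimal example (again a zero‐dimensional energy‐balance model with invented parameter ranges and forcing, not any operational setup), uncertain parameters are drawn from plausible ranges and the spread of the resulting projections summarizes parametric uncertainty:

```python
import numpy as np

def ebm(forcing, lam, C, dt=1.0):
    """Zero-dimensional energy-balance model: C dT/dt = F(t) - lam*T."""
    T = np.zeros(len(forcing))
    for i in range(1, len(forcing)):
        T[i] = T[i-1] + dt * (forcing[i-1] - lam * T[i-1]) / C
    return T

rng = np.random.default_rng(42)
forcing = np.linspace(0.0, 8.5, 85)   # idealized high-forcing ramp (W m^-2)

# Perturbed physics ensemble: one model structure, uncertain parameters
# drawn from plausible ranges (ranges here are invented for illustration).
ensemble = [ebm(forcing,
                lam=rng.uniform(0.8, 1.8),  # feedback parameter (W m^-2 K^-1)
                C=rng.uniform(6.0, 16.0))   # effective heat capacity
            for _ in range(100)]

end = np.array([run[-1] for run in ensemble])
print(f"end-of-scenario warming: mean {end.mean():.1f} K, "
      f"5-95% range {np.percentile(end, 5):.1f}-{np.percentile(end, 95):.1f} K")
```

A multimodel ensemble would instead vary the model structure itself (equations, parameterizations, numerics), which, as discussed below, cannot be captured by sampling a fixed parameter vector.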

Ensembles help to deal with these uncertainties either by producing robust projections or by providing estimates of uncertainty about future climate change. A model projection is robust if all or most models in the ensemble agree on it. If all models in an ensemble show more than a 4°C increase in GMST by 2100 when run under a certain forcing scenario, this projection is robust. In what follows, we focus on multimodel ensemble studies, but similar arguments can be made for perturbed physics ensembles. We discuss three inferences from the robustness of projections: to their likely truth, to the warranted confidence in the projections, and to the correctness of the underlying causal assumptions.
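In code, the robustness check just described is simply an agreement count over ensemble members (the projection values below are invented for illustration):

```python
import numpy as np

# Hypothetical end-of-century GMST increases (K) projected by the members
# of an ensemble under one forcing scenario; values invented for illustration.
projections = np.array([4.6, 4.2, 5.1, 4.8, 4.4, 4.9, 5.3, 4.1])

threshold = 4.0
agreement = np.mean(projections > threshold)   # fraction of members agreeing
print(f"{100 * agreement:.0f}% of members project more than {threshold} K")
# agreement == 1.0 here, so the projection 'warming exceeds 4 K' counts
# as robust in the sense defined above (all members agree).
```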

Inference to the Likely Truth of a Projection

An inference from the robustness of projections to their likely truth is legitimate if we have reasons to assume that it is likely that at least one model in the multimodel ensemble correctly projects the quantity of interest within the specified error margin. A premise to this effect could be justified in two different ways (Ref 67, p. 584-589). One is to cite the success of the models in simulating past and present climate to support the claim that it is likely that at least one simulation in the ensemble correctly projects the quantity of interest within the specified error margin. The considerations in the last section pointed out the limits of such an argument. A second way to justify the required premise refers to the construction of the models rather than to their performance. It argues that the multimodel ensemble samples enough of the current uncertainty about how to represent the climate system for the projection at issue that it is likely that at least one simulation correctly projects the quantity of interest within the specified error margin. The problem with this line of argument is that today's multimodel ensembles group together existing models and are thus 'ensembles of opportunity,' 'not designed to span an uncertainty range' (Ref 62, p. 2653). One of the main sources of uncertainty is the parameterization of subgrid processes such as cloud formation. Each state‐of‐the‐art climate model includes some representation of clouds, but ensemble studies do not attempt to ensure that the ensemble as a whole adequately samples (or spans) current uncertainty about how clouds should be represented; the same holds for other subgrid processes (Ref 67, p. 585). Moreover, it is unclear how such a sampling could be achieved. In the case of parameter uncertainty, the space of possibilities in which plausible alternatives are to be identified is clear, since it is a space of numerical values (although exploring it is computationally intractable due to its dimensionality); but in the case of structural uncertainty, as addressed in multimodel ensemble studies, the space of possibilities is indeterminate, since it ranges over model structures (Ref 74, p. 216). One may argue that, in the presence of limited understanding and potential unknown unknowns, it is fundamentally impossible to sample the uncertainty in how to build and calibrate a model, since we do not really know what the uncertainty is.
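To see why the sampling premise carries the weight here, consider a deliberately idealized calculation (our illustration, not drawn from the cited sources): if an ensemble had n independent members, each with probability p of projecting the quantity adequately within the margin, then

```latex
% Idealized: n independent members, each with probability p of projecting
% the quantity of interest adequately within the specified margin.
P(\text{at least one member adequate}) = 1 - (1 - p)^{n}
% e.g., p = 0.2 and n = 20 give 1 - 0.8^{20} \approx 0.99
```

Even a modest per‐member probability would make the premise plausible for genuinely independent members; but for an ensemble of opportunity with near‐duplicate members, the effective n is much smaller and unknown, so the calculation offers little support.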

Inference to the Warranted Confidence in a Projection

Similar difficulties beset sampling‐based arguments from the robustness of model projections to the warranted confidence (or degree of belief) in these projections. Such arguments combine robustness considerations with additional criteria of adequacy. Suppose that S is the set of all theoretically possible models that meet basic criteria sufficiently well, such that each model in S has a significant chance of being adequate for projection P within some specified margin of error; for example, they simulate relevant aspects of past and present climate sufficiently well, include particular physical assumptions, and have an appropriate spatiotemporal resolution. The models in S are currently considered the best theoretically possible models for P. In the absence of overriding evidence, the warranted confidence in P can then be identified with the fraction f of models in S whose simulations agree with respect to P within the specified error margin. Now, if the models in an ensemble constituted a random sample from S, then the fraction of models in the ensemble whose simulations agree with respect to P within the specified error margin would provide a good estimator of f, and thus of the warranted degree of confidence in P (cf. Ref 67, p. 593-595). The problem with an argument along these lines is that today's multimodel ensembles are not random samples from the set of all theoretically possible models that meet basic criteria of adequacy. It is unclear what the space of possible models that meet the required criteria actually is, and climate scientists do not select today's models from this space by randomized procedures. As ensembles of opportunity, today's ensembles are not the kind of sample from which statisticians would usefully estimate uncertainty, since their 'sampling is neither systematic nor random' (Ref 75, p. 2068). Currently available multimodel ensembles such as CMIP5 are not designed to systematically explore the space of models that meet the required criteria, and the statistical interpretation of the ensemble is unclear.76 All we have is a very limited space of practically possible models, and this space contains near duplicates, since models are used several times with only minor modifications. Even if these duplications were eliminated and the remaining model space randomly sampled, there would still be structural dependencies between the models.54, 77-79 The models are of course based on the same physical understanding and use the same basic equations, but they also partly use the same parameterizations, make similar simplifications, and use the same computational methods; in many cases, they even share large fractions of code. As a result, the models inevitably share common errors (e.g., in the simulation of the Intertropical Convergence Zone, ITCZ80). Moreover, some climate processes that will significantly influence future climate change are not represented in any of today's models; some of these are recognized (e.g., the effect of methane hydrates), and perhaps some are not. Both points raise the worry that simulations from today's climate models might agree on a projection more often than warranted, even when the projection is false (or biased),6 because most models share similar deficiencies.
Furthermore, the interdependence of models within today's small ensemble studies makes it likely that the models do not differ enough to provide a representative sample of the set of all theoretically possible models that meet the basic criteria of adequacy (whatever exactly these are). Given current uncertainty about how to represent the climate system adequately, the set of possible models that meet the basic criteria of adequacy is likely to include models that differ significantly from today's models. If today's models differ from one another much less than random samples from S would, they are biased estimators of the fraction of models in S whose simulations agree with respect to P within some error margin, and thus of the warranted degree of confidence in the projection at issue (Ref 67, p. 594). The IPCC (Ref 1, ch. 12) acknowledges this and downgrades the probabilities based on the frequency of ensemble results.g The robustness of model projections cannot be directly translated into probabilities without strong assumptions about the ensemble, about model dependence, and about the criteria for adequate models.84 But to the extent that the models of an ensemble are independent and thus differ in ways that are relevant for the projections at issue, their robustness warrants increased confidence in the projections and can thus figure as an additional premise in a non‐deductive argument for a model's adequacy for those projections. Such an argument combines premises about the robustness of model results with premises about their empirical accuracy; that is, it combines premises about the model's success in reproducing (use‐novel) data about past and present climate with premises about the agreement of its projections with those of other models in the ensemble that are equally successful in reproducing such data. The strength of such a non‐deductive argument depends on whether all relevant information has been taken into account. To assess its strength, we therefore need more information about the extent to which the models in the ensemble are independent of each other and thus differ in relevant ways, for example, regarding equations, parameterizations, parameter values, resolution, boundary conditions, and numerical coding.85
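The effect of near‐duplicate, interdependent members on estimates of the agreement fraction f can be made vivid with a toy Monte Carlo experiment (entirely illustrative: the 'space of possible models' is reduced to an invented distribution of projection values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 'space of possible models': each model is summarized by its
# end-of-century warming projection (K); numbers are purely illustrative.
population = rng.normal(4.0, 1.0, 100_000)
agrees = lambda x: (x > 3.5) & (x < 4.5)
f_true = agrees(population).mean()   # warranted confidence, by assumption

def random_ensemble(n):
    return rng.choice(population, n)

def opportunity_ensemble(n_parents, n_copies):
    """Near-duplicate members: a few 'parents', each reused several
    times with only minor modifications (small perturbations)."""
    parents = rng.choice(population, n_parents)
    return (np.repeat(parents, n_copies)
            + rng.normal(0.0, 0.05, n_parents * n_copies))

trials = 2000
est_rand = [agrees(random_ensemble(30)).mean() for _ in range(trials)]
est_opp = [agrees(opportunity_ensemble(5, 6)).mean() for _ in range(trials)]
print(f"true fraction f = {f_true:.2f}")
print(f"random sample of 30:      spread (sd) = {np.std(est_rand):.2f}")
print(f"5 parents x 6 duplicates: spread (sd) = {np.std(est_opp):.2f}")
# The clustered ensemble's estimate of f scatters far more widely:
# 30 near-duplicates carry roughly the information of 5 models.
```

In this toy setting the clustered 'ensemble of opportunity' scatters far more widely around f than a genuinely random sample of the same size; and if the parent models additionally shared a structural error that shifted them all in the same direction, its estimates would be systematically biased as well.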