The representation of cloud radiative effects and of the atmospheric heating and moistening due to moist convection remains a major challenge in current‐generation climate models, leading to a large spread in climate predictions. Here we show that neural networks trained on a high‐resolution model in which moist convection is resolved offer an appealing technique for tackling and better representing moist convection in coarse‐resolution climate models.

Representing unresolved moist convection in coarse‐scale climate models remains one of the main bottlenecks of current climate simulations. Many of the biases present with parameterized convection are strongly reduced when convection is explicitly resolved (i.e., in cloud‐resolving models run at horizontal resolutions of approximately a kilometer). We here present a novel approach to convective parameterization based on machine learning, using an aquaplanet with prescribed sea surface temperatures as a proof of concept. A deep neural network is trained with a superparameterized version of a climate model in which convection is resolved by thousands of embedded 2‐D cloud‐resolving models. The machine learning representation of convection, which we call the Cloud Brain (CBRAIN), can skillfully predict many of the convective heating, moistening, and radiative features of superparameterization that are most important to climate simulation, although an unintended side effect is to reduce some of the superparameterization's inherent variance. Since as few as three months of high‐frequency global training data prove sufficient to provide this skill, the approach presented here opens up a new possibility for a future class of convection parameterizations in climate models that are built “top‐down,” that is, by learning salient features of convection from unusually explicit simulations.

1 Introduction Convective parameterization remains one of the main roadblocks to weather and climate prediction (Bony et al., 2015; Medeiros et al., 2014; Sherwood et al., 2014; Stevens & Bony, 2013). In fact, most of the intermodel spread in equilibrium climate sensitivity can be traced back to the representation of clouds (Schneider et al., 2017). Convective schemes exhibit systematic biases in the vertical structure of heating and moistening, in precipitation intensity, and in cloud cover (Daleu et al., 2015, 2016). These errors, in turn, feed back onto larger‐scale circulations, deteriorating general circulation model (GCM) simulations and prediction skill (Bony et al., 2015). A challenge for current convective schemes is representing the transitions between different types of convection, such as the transition from shallow to deep convection (Couvreux et al., 2015; D'Andrea et al., 2014; Khouider et al., 2003, 2010; Khouider & Majda, 2006; Peters et al., 2013; Rochetin, Couvreux, et al., 2014; Rochetin, Grandpeix, et al., 2014), which is especially crucial for predicting both continental precipitation and modes of climate variability (Arnold et al., 2014). In addition, most convective parameterizations do not represent important processes, such as convective aggregation, that are essential to accurately predicting the response of clouds and precipitation to global warming, as well as modes of climate variability (Arnold & Randall, 2015; Bony et al., 2015; Bretherton & Khairoutdinov, 2015; Coppin & Bony, 2015; Jeevanjee & Romps, 2013; Muller & Bony, 2015; Wing & Emanuel, 2014). Current‐generation climate models (and typical weather forecast models) with parameterized convection capture little of this organization, nor do they represent mesoscale convective systems (MCSs; Hohenegger & Stevens, 2016).
MCSs and the impact of shear are, however, crucial for a correct representation of rainfall and radiative feedbacks (Cao & Zhang, 2017; Houze, 2004; Moncrieff, 2010; Moncrieff et al., 2012, 2017; Tan et al., 2015). Finally, another challenge is that climate sensitivity is strongly related to the interaction between deep and shallow convection (Bony et al., 2015) and to the coupling between clouds, convection, and the large‐scale circulation, which is currently poorly captured by parameterized convection (Bony et al., 2015; Daleu et al., 2016; Hohenegger & Stevens, 2016; Nie et al., 2016). Many of the previously mentioned problems with the representation of convection are partly alleviated at convective‐permitting resolutions, that is, at horizontal grid spacings of ~2 km or less. For instance, the transition between shallow and deep convection can be correctly captured at convective‐permitting scales (Khairoutdinov et al., 2009; Khairoutdinov & Randall, 2006). Convective aggregation emerges at convective‐permitting scales (Hohenegger & Stevens, 2016), so cloud‐resolving models (CRMs) have been the tool of choice for understanding convective aggregation (Arnold & Randall, 2015; Bony et al., 2015; Bretherton & Khairoutdinov, 2015; Coppin & Bony, 2015; Jeevanjee & Romps, 2013; Muller & Bony, 2015; Wing & Emanuel, 2014). CRMs (at convective‐permitting scales of <2 km) also correctly reproduce MCSs and squall lines (Moncrieff & Liu, 2006; Taylor et al., 2009), as well as extreme precipitation events driven by larger‐scale anomalies. Convective‐permitting simulations better represent modes of tropical climate variability (Arnold et al., 2014), the shallow‐to‐deep convection transition (Guichard et al., 2004), and mesoscale propagation (Hohenegger et al., 2015). Modeling at convective‐permitting scales is therefore transformative for the representation of convection.
It is, however, currently impractical to use convection‐resolving resolution at the global scale for climate prediction, given its computational requirements (Satoh et al., 2008). While global cloud‐resolving models (GCRMs) can easily be run for months, multidecadal simulations remain computationally challenging. To alleviate this problem, an interesting approach has been cloud “superparameterization” (SP), which computes the subgrid vertical heating and moistening profiles within a GCM grid cell by sampling a curtain of an embedded 2‐D CRM run at convective‐permitting resolution (Grabowski, 1999; Khairoutdinov et al., 2005). This has led to many successes, such as rectifying the continental diurnal cycle, improving the representation of the Madden‐Julian Oscillation (MJO), representing both some MCS propagation and some degree of aggregation, and reducing overly strong land‐atmosphere coupling (Benedict & Randall, 2009; Grabowski, 2001; Holloway et al., 2012, 2015; Khairoutdinov et al., 2005; Kooperman et al., 2016a, 2016b; Pritchard & Somerville, 2009; Pritchard et al., 2011; Qin et al., 2018; Randall, 2013; Sun & Pritchard, 2016). While promising, SP is not without its own idealizations, which also limit its predictive ability and usefulness for climate simulation. For instance, restricting explicit convection to two dimensions makes it difficult to represent momentum transport (Arakawa, 2011; Jung & Arakawa, 2010; Tulich, 2015; Woelfle et al., 2018), and the limited CRM domain extent artificially constrains vertical mixing efficiency (Pritchard et al., 2014). Meanwhile, the typical use of 1–4 km CRM horizontal resolution and 250‐m vertical resolution cannot resolve the important boundary layer turbulence, lower‐tropospheric inversions, and associated entrainment that are critical to low cloud dynamics (Parishani et al., 2017).
In light of this ongoing deadlock, we propose an alternative approach to convective parameterization in which convection is represented by a machine‐learning algorithm based on artificial neural networks (ANNs) trained on superparameterized simulations, called the Cloud Brain (CBRAIN). ANNs can approximate any nonlinear deterministic function, a property known as universal approximation (Schmidhuber, 2015). Parameterizing convection therefore appears to be an ideal problem for machine‐learning algorithms, and especially for ANNs. Indeed, machine‐learning algorithms have been used in many applications where a clear physically based algorithm could not be defined. Applications have included self‐driving cars, board games (chess and Go; Silver et al., 2016), speech recognition (Hinton et al., 2012), object recognition and detection, medical detection of cancers (Karabatak & Ince, 2009; Khan et al., 2001; Zhou et al., 2002), and genomics. There are also applications of ANNs in the geosciences, such as rainfall prediction (Miao et al., 2015; Moazami et al., 2013; Tao et al., 2016), weather forecasting, soil moisture retrieval (Kolassa et al., 2013, 2016; Kolassa, Gentine, et al., 2017; Kolassa, Reichle, & Draper, 2017), and surface turbulent flux retrievals (Alemohammad et al., 2017; Jimenez et al., 2009; Jung et al., 2011). In particular, the development of deep learning and deep neural networks, that is, networks with multiple hidden layers, has led to important advances in many different fields, such as object detection and game strategy learning (Dahl et al., 2011; Hinton et al., 2012; LeCun et al., 2015; Silver et al., 2016; Tao et al., 2016). One advantage of ANNs is that, once trained, they are computationally efficient, as most of the computational burden is confined to the training phase.
Our aim here is to use such ANN techniques to better parameterize convection in coarse‐scale climate simulations by learning from cloud‐permitting SP simulations, while keeping the computational cost far below that of the cloud‐permitting simulations themselves, which remain prohibitive at climate timescales.

2 Data 2.1 SuperParameterized Community Atmosphere Model To evaluate this idea, we use a well‐validated version of the SuperParameterized Community Atmosphere Model (SPCAM3) in a simplified aquaplanet configuration with zonally symmetric SSTs following a realistic meridional distribution (Andersen & Kuang, 2012). The global model uses a spectral dynamical core with approximately two‐degree horizontal resolution (T42 triangular truncation) and 30 vertical levels. The CRM uses a simplified bulk one‐moment microphysics scheme and a Smagorinsky 1.5‐order subgrid‐scale turbulence closure, as described by Khairoutdinov et al. (2003), and shares the host GCM's vertical grid. For computational efficiency and convenience, we use the “micro‐CRM” (8‐column) CRM domain discussed by Pritchard et al. (2014) for this proof of concept. Following a three‐month spin‐up period, we save global data at the host global model's time step frequency (every 30 min), representing arterial inputs to (and outputs from) each of the 8,192 cloud‐resolving arrays embedded in SPCAM. The simulation is run for two years, yielding around 140 million training samples per year. One year of data represents 375 GB.

3 Neural Network Setup We use an ANN to predict SPCAM's total physics package tendencies, that is, the cumulative tendency produced by turbulence, convection, and radiation. Rather than isolating any of these subtendencies from the CRM or GCM parameterizations, we chose a holistic approach and represent their sum—that is, the arterial total heating and moistening profiles that ultimately link a GCM's subgrid physics to its dynamical core. This has practical advantages, because the individual physical subprocesses—turbulence, convection, microphysics, and radiation—can interact in complex, nonlinear ways, and approximating the net effect of such interactions is one of the great strengths of ANNs. The ANN does not interact with the dynamical core and uses the same inputs as SPCAM at each time step. The ANN is written using the Python library Keras (https://keras.io), a high‐level wrapper around TensorFlow (http://www.tensorflow.org). The code for the ANN training, as well as for the validation and analysis below, can be found at https://github.com/raspstephan/CBRAIN‐CAM. Training took on the order of 12 hr on a graphics processing unit (Nvidia GTX 970). The first year of SP‐CAM data was used for training, while the second year was used for independent validation. The feedforward ANNs consist of interconnected layers, each of which has a certain number of nodes (Figure S1). The input and output variables are listed in Table 1. The first layer is the input layer, which in our case is a stacked vector containing the input variables, including their vertical variation, for a specific column. No latitude or longitude information is passed to the neural network, meaning that we train a single neural network to be used for every column. The last layer is the output layer, which again is a stacked vector of the four output vertical profile variables. All layers in between are called hidden layers; deep neural networks have more than one hidden layer.
The values in the nodes of each layer are weighted sums of all node values in the previous layer plus a bias, passed through a nonlinear activation function. Here we used the Leaky Rectified Linear Unit (LeakyReLU), a(x) = max(0.3x, x), which resulted in better scores than other common activation functions such as tanh, sigmoid, or the regular ReLU. The output layer is purely linear, without an activation function.

Table 1. List of Input and Output Variables Used for the Neural Network

Input variables                          Levels   Output variables                                 Levels
Temperature at beginning of time step      30     Convective and turbulent temperature tendency      30
Humidity at beginning of time step         30     Convective and turbulent humidity tendency         30
Surface pressure                            1     Longwave heating tendency                          30
Sensible heat flux                          1     Shortwave heating tendency                         30
Latent heat flux                            1
Temperature tendency from dynamics         30
Humidity tendency from dynamics            30
Incoming solar radiation                    1
Size of stacked array                     124                                                       120

Training an ANN means optimizing the weight matrices and bias vectors that define it to minimize a loss function—in our case the mean squared error—between the ANN outputs and the truth for a given input. The loss is computed for a minibatch of the training data, shuffled in space and time, with a batch size of 1,024 samples. To reduce the loss, the gradient of the loss function with respect to all weights and biases is computed using the backpropagation algorithm, followed by a step down the gradient—that is, stochastic gradient descent. In particular, we use a variant of stochastic gradient descent called Adam (Kingma & Ba, 2014). The size of each step down the gradient is determined by the learning rate. We started with a learning rate of 10−3 and divided it by 5 every 5 epochs (i.e., every five passes through the entire training data set). In total, we trained for 30 epochs. Regularization techniques were not necessary because, given the large number of training samples, we did not see any signs of overfitting.
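The architecture and training configuration described in the text can be sketched in Keras, the library used for this study. This is an illustrative sketch rather than the authors' exact code: the function and variable names are ours, and the hyperparameters follow the text (the best‐performing configuration of eight hidden layers of 512 nodes reported in section 4, LeakyReLU with slope 0.3, a linear output layer, mean squared error loss, and the Adam optimizer with an initial learning rate of 10−3 divided by 5 every 5 epochs).

```python
from tensorflow import keras
from tensorflow.keras import layers

N_INPUTS, N_OUTPUTS = 124, 120  # sizes of the stacked vectors (Table 1)

def build_cbrain_model(n_hidden=8, width=512):
    """Feedforward network: n_hidden dense layers with LeakyReLU
    activations, followed by a purely linear output layer."""
    inputs = keras.Input(shape=(N_INPUTS,))
    x = inputs
    for _ in range(n_hidden):
        x = layers.Dense(width)(x)
        x = layers.LeakyReLU(0.3)(x)  # a(x) = max(0.3x, x)
    outputs = layers.Dense(N_OUTPUTS)(x)  # linear, no activation
    return keras.Model(inputs, outputs)

def lr_schedule(epoch, initial_lr=1e-3):
    """Divide the learning rate by 5 every 5 epochs."""
    return initial_lr / 5 ** (epoch // 5)

model = build_cbrain_model()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
# Training would then look like (x_train/y_train are the normalized data):
# model.fit(x_train, y_train, batch_size=1024, epochs=30, shuffle=True,
#           callbacks=[keras.callbacks.LearningRateScheduler(lr_schedule)])
```

Note that minibatch shuffling in space and time comes for free here, since each sample is a single column at a single time step.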
Despite the random initialization of the ANN weights and biases, the final result proved robust across training realizations. For an ANN to train efficiently, all input values should be of the same order of magnitude. We therefore subtracted the mean from each input variable and divided by its standard deviation, independently for each vertical level; not normalizing did not change the final results but lengthened the training process. To make the outputs comparable, we converted the output variables (i.e., the convective and radiative heating as well as the convective moistening rates) to common energy units.
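The per‐level standardization of the inputs amounts to a few lines of NumPy; this is a sketch with our own names, where each column of the sample matrix is one input variable at one vertical level:

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Standardize each input feature (each variable at each vertical
    level) independently. x has shape (n_samples, n_features); if no
    statistics are supplied they are computed from x itself."""
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    return (x - mean) / std, mean, std

# The training-set statistics must be reused for the validation year, e.g.:
# x_val_norm, _, _ = standardize(x_val, mean=train_mean, std=train_std)
```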

4 Results 4.1 Sensitivity to ANN Architecture and Amount of Training Data We start by testing how the number of ANN parameters and their configuration impact performance. Table S1 summarizes the 12 separate ANN architectures tested. As a first metric of skill, we assess a mean squared error statistic computed across all four output variables, all space, and all time during the second simulated year. That is, given knowledge of the inputs to each CRM, we measure the error across 143 million separate ANN predictions of the CRM heating and moistening output profiles received by SPCAM's dynamical core, during a one‐year period that was not included in the training data set. Figure S2a shows strong sensitivities to network architecture that underscore the importance of the ANN design: more parameters generally produce better scores, and deeper networks give better results because they allow for more nonlinear interactions. For all subsequent analyses we thus use only our best performing network—a large, deep network with eight hidden layers of 512 nodes each. A key question for the generalizability of our approach is how much training data are needed. To answer it, we incrementally increase the length of continuous simulation data used for training, up to one year (Figure S2b). As expected, more training data do lead to better scores on the validation set. Interestingly, however, three months appear sufficient to yield most of the information (Figure S2b). This suggests promising potential to generalize our approach beyond an SPCAM demonstration test bed to other simulation strategies that do even more justice to the true physics of moist convection; indeed, three‐month simulations are practical even for GCRMs or high‐resolution, 3‐D variants of SP. 4.2 Evaluation of NN Predictions Latitude‐longitude and pressure‐latitude snapshots (Figures 1 and 2) provide a good qualitative starting point for evaluating the NN predictions (see the supplementary videos).
Overall, the NN predictions agree remarkably well with the SP‐CAM truth in terms of horizontal and vertical structure. Lower‐tropospheric convective (turbulent and latent) heating and moistening associated with the intertropical convergence zone and extratropical cyclones occur at approximately the correct geographic locations (Figures 1a and 1d). The radiative heating rates show very good agreement, which is particularly impressive given that there is no cloud condensate information in the input; that is, the cloud‐radiative feedback is entirely internal to the ANN. For instance, the ANN skillfully predicts the geographic location of shortwave absorption by water vapor and regional cloud anomalies (Figures 1g and 1h) as well as the vertical location of longwave cooling maxima at the tops of subtropical boundary layer clouds and deep tropical clouds (Figures 2e and 2f). However, one issue for the convective heating and particularly the moistening rates is that the NN predictions are smoother and do not exhibit as much of SP‐CAM's internal stochastic variability. Indeed, the ANN is by definition deterministic and thus cannot reproduce any stochasticity.

Figure 1. Latitude‐longitude snapshot of neural network predictions and the corresponding SP‐CAM truth at model level 20 (roughly 700 hPa) for one time step in the validation set.

Figure 2. Pressure‐latitude snapshot at 180° longitude corresponding to Figure 1.

To assess the quality of the predictions in more detail, we analyze R2 averaged over both time and horizontal dimensions to yield statistics for each level and predicted variable (Figure 3). R2 is defined as one minus the ratio of the sum of squared errors to the true variance. The radiative heating rates are well represented throughout the column, particularly the shortwave heating.
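The R2 statistic defined above can be computed per output variable and vertical level with a few lines of NumPy; this is a sketch under our own naming, with predictions stacked as (samples, outputs):

```python
import numpy as np

def r2_per_level(y_true, y_pred):
    """R^2 = 1 - (sum of squared errors) / (sum of squared anomalies),
    computed independently for each output column (one variable at one
    vertical level). y_true and y_pred have shape (n_samples, n_outputs)."""
    sse = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - sse / ss_tot
```

A perfect prediction gives R2 = 1, while always predicting the mean gives R2 = 0, which is why a loss‐minimizing deterministic network that falls back on mean values in noisy regimes scores poorly there.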
The convective tendencies interestingly show a distinct profile, with less predictive skill in the boundary layer and the stratosphere. In the stratosphere, this lower skill is simply due to the near absence of convection at upper levels and is likely not a concern. In the boundary layer, the reasons for the reduced skill are discussed below.

Figure 3. R2 computed for each model pressure level and variable as described in the text.

First, for a closer analysis of the skill in the troposphere, we also look at spatial statistics. Pressure‐latitude maps of R2 and the standard deviation (Figure 4) reveal patches of especially high skill at midlevels at the equator and in the midlatitudes, which correspond to the locations of the Intertropical Convergence Zone and the midlatitude storm tracks. Since these are the locations of the latent heating most fundamental to forcing the free‐tropospheric general circulation, this is reassuring regarding the potential of CBRAIN to reproduce important heating and moistening tendencies in future tests that allow it to feed back on a dynamical core.

Figure 4. Pressure‐latitude maps of (a and b) R2 and (c and d) true standard deviation averaged over time and longitude. Regions where the variance was less than 0.05% of the global variance were masked out.

Again, the skill in the boundary layer is significantly lower. One possibility is that this reflects the difficulty of representing mesoscale effects and subcloud‐layer organization, as well as their memory (D'Andrea et al., 2014; Mapes & Neale, 2011). SPCAM does include some degree of convective aggregation (Arnold et al., 2015) and also carries memory of CRM organization from one time step to the next through the embedded CRMs (Pritchard et al., 2011). Our ANN does not include memory, as our objective was to mimic most current practice in convective parameterization, which is local in space and time.
Another source of lower R2 is the higher internal variability of the SPCAM simulations compared to the ANN predictions, evident in Figures 1 and 2. This may be less of an issue in configurations that use larger or 3‐D CRMs; the small‐extent 2‐D CRMs used here are known to throttle deep updrafts and lead to unrealistically intense extremes (Pritchard et al., 2014). SPCAM does have some internal stochasticity (Subramanian & Palmer, 2017), which, by definition, a deterministic ANN cannot reproduce. The boundary layer and shallow convection tendencies, particularly the moistening rates, are much noisier and thus appear much more stochastic than those at higher levels. At these lower levels, the ANN predictions consequently have significantly less variability; this is a consequence of the mean squared error loss function, which encourages the ANN to predict an average value in cases where it is not certain.

5 Discussion and Conclusion We have demonstrated that machine learning, and neural networks in particular, can skillfully represent many of the effects of unresolved clouds and convection, including their vertical transport of heat and moisture and the interaction of radiation with clouds and water vapor. The concept was proven in an idealized test bed using SPCAM over an aquaplanet. The implication of this success is that an approach like CBRAIN could glean the advantages of GCRMs or high‐resolution, 3‐D SPs that are not yet practical for multidecadal climate simulations. There are, however, important steps required before CBRAIN can be fully implemented in a GCM. First, neural networks do not intrinsically conserve energy and moisture. This may be acceptable in a weather forecast model, but energy and moisture conservation are required for climate prediction. Second, neural networks are inherently deterministic. We showed that the resulting CBRAIN representation of heating and moistening tendencies is too smooth compared to the original SPCAM fields used for training, which are more variable, especially in the lower levels of the atmosphere (below 700 hPa). An important next test is to examine how CBRAIN feeds back on the GCM's resolved‐scale dynamics and surface fluxes. A final challenge is that a machine‐learning algorithm is, by construction, trained on existing data. For climate prediction, the algorithm should be able to generalize to situations it has potentially never seen, such as changes in trace gas profiles and concentrations or in aerosols, and should be able to represent convection over continents. Notwithstanding these challenges, we believe that our preliminary results make the case that machine learning is a powerful alternative to GCRMs or embedded 2‐D CRM parameterizations. It is computationally efficient, even for relatively large networks.
For instance, without specific optimization, a preliminary test showed that CBRAIN was 10 times faster than the micro‐CRM form of SP used in our study while producing comparable tendencies of unresolved physics. It would thus be several orders of magnitude faster than an SP equipped with large, 3‐D, high‐resolution domains, or than a GCRM. CBRAIN is also naturally suited to data assimilation, since computation of the adjoint is straightforward and analytical, making it a natural candidate for operational weather forecasting. CBRAIN could represent a useful alternative to current parameterizations, which have followed a “bottom‐up” deterministic strategy that still exhibits too many biases for satisfying prediction of the future hydrological cycle. A “top‐down” strategy that instead learns the realistic complexity of simulated convection, as captured in short multimonth simulations at convective‐permitting resolution, is an attractive alternative. As the global temperature sensitivity to CO2 is strongly linked to the representation of convection, this might also improve our estimates of future temperature change.
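The claim that the adjoint is readily available can be illustrated with automatic differentiation in TensorFlow, the framework underlying the network used here. The sketch below (our own function name, for any trained Keras model) returns the sensitivity of one chosen output component to all inputs, which is the kind of gradient information a variational data assimilation system requires:

```python
import tensorflow as tf

def output_sensitivity(model, x, output_index=0):
    """Gradient of one output component with respect to the inputs,
    obtained by reverse-mode automatic differentiation (the adjoint
    of the network evaluated at x)."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)[:, output_index]
    return tape.gradient(y, x)  # shape: (n_samples, n_inputs)
```

Because samples are processed independently, each row of the result is the exact gradient of that sample's selected output with respect to that sample's inputs.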

Acknowledgments The neural network and analysis code can be found at https://github.com/raspstephan/CBRAIN‐CAM. The exact version of the code base used for this study is tagged grl_submission. The raw SP‐CAM output is very large (several TB) and available upon request to Mike Pritchard. M. P. acknowledges funding from the DOE SciDAC and Early Career Programs (DE‐SC0012152 and DE‐SC00‐12548) as well as the NSF (AGS‐1734164). Stephan Rasp was funded by the German Research Foundation (DFG) Transregional Collaborative Research Center SFB/TRR 165 “Waves to Weather”. Computational resources for our SPCAM3 simulations were provided through the NSF Extreme Science and Engineering Discovery Environment (XSEDE) under allocation TG‐ATM120034.

Supporting Information S1 (grl57488-sup-0001-2018GL078202-S01.docx)