Experiment

The experimental apparatus is described in detail in ref. 37. Initially, ⁸⁷Rb atoms are cooled in a combined 2D and 3D MOT system and subsequently cooled further by RF (radio frequency) evaporation. The cloud is then loaded into a cross beam optical dipole trap for the final evaporation stage. It is this stage that is the subject of the optimization process. At this point, the sample contains 4 × 10⁷ atoms at a temperature of ~5 μK with a phase space density of ~0.05. The cross dipole trap is formed from two intersecting laser beams at 1090 nm and 1064 nm, with approximate waists of 350 μm and 300 μm respectively, producing a trap with frequencies 185 × 185 × 40 Hz. The depth of the cross trap is determined by the intensity of the two beams and is found to be approximately 70 μK. The 1064 nm beam is controlled by varying the current to the laser, while the 1090 nm beam is controlled using the current and a waveplate rotation stage combined with a polarizing beamsplitter to provide additional power attenuation while maintaining mode stability. A diagram of the experimental setup is shown in Fig. 1. Normally the power to these beams is ramped down over time, thereby lowering the walls of the trap and allowing the higher energy atoms to leak out. The remaining atoms rethermalize to a lower temperature, enabling cooling. Once the gas has been cooled to temperatures on the order of nK, a phase transition occurs and a macroscopic number of atoms start to occupy the same quantum state. This transition is called Bose-Einstein condensation (ref. 38). We hand over control of these ramps to the MLOO. We consider two parameterizations: a simple one, where we only control the start and end points of a linear interpolation; and a complex one, where we add variable quadratic, cubic and quartic corrections to the simple case (see Supplemental Equations).
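As an illustration of the two ramp parameterizations, the sketch below evaluates a linear ramp with optional polynomial corrections. The exact functional form is given in the Supplemental Equations and is not reproduced here; the choice of correction terms `c_k * (s**k - s)`, which vanish at both ends of the ramp so that the start and end points are preserved, is an assumption made for this example, as are the function and argument names.

```python
def ramp(t, duration, start, end, corrections=()):
    """Evaluate a hypothetical cooling ramp at time t, with 0 <= t <= duration.

    start, end  : ramp values at t = 0 and t = duration (the 'simple' case).
    corrections : optional coefficients (c2, c3, c4) for quadratic, cubic and
                  quartic terms (the 'complex' case). ASSUMPTION: each term is
                  c_k * (s**k - s), which is zero at s = 0 and s = 1.
    """
    s = t / duration                      # normalized time in [0, 1]
    value = start + (end - start) * s     # linear interpolation
    for k, c in enumerate(corrections, start=2):
        value += c * (s**k - s)           # endpoint-preserving correction
    return value


# Simple case: only the end points are controlled.
midpoint_simple = ramp(1.0, 2.0, start=5.0, end=1.0)
# Complex case: a quadratic correction bends the ramp between the same end points.
midpoint_complex = ramp(1.0, 2.0, start=5.0, end=1.0, corrections=(0.4, 0.0, 0.0))
```

With this form, the simple parameterization is recovered by setting all correction coefficients to zero.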

Figure 1: The experiment and the ‘learner’ form a closed loop. The learner produces a parameter set, X, for the experiment to test; these parameters are converted into cooling ramps and used to perform an experiment. After the evaporation process is finished, an image of the cold atoms is taken and used to calculate a cost function, C(X), based on the quality of the cloud as a resource. C(X) is then fed back to the learner. The learner uses a GP to model the relationship between the input parameter values and the cost function values produced by the experiment. This model depends on a set of correlation lengths, or hyperparameters. Part (a) of the figure plots a set of observed costs (black circles with bars for uncertainty) with three possible GP models fit to the data: one with a long correlation length (red dotted), one with a medium correlation length (blue solid) and one with a short correlation length (green dashed). Each GP is illustrated by a mean cost function bracketed by two curves indicating the function ± one standard deviation from the mean. The correlation lengths affect both the mean and uncertainty of the model; note that the uncertainty approaches zero near the observed points. A final cost function is produced as a weighted average over the correlation lengths. This model is used to pick the next parameters X for the experiment.

Performance Measure

The approach we propose is a form of supervised learning, meaning that we provide the learner with a number that quantifies the quality of the resource produced or, in optimization terminology, a cost that must be minimized. Naïvely, one might try to use a measure based on temperature and particle number. However, determining these quantities accurately near condensation is difficult when constrained to very few runs per parameter set. Instead, a technique was created to measure the width of the edges of the cloud. For thermal clouds this edge is broad, but as the sample cools and condenses these edges become sharper. To quantify this, an absorption image of the final state of the quantum gas is taken after a 30 ms expansion of the cloud, with the image providing the optical depth as a function of space. This absorption image is taken at resonance, resulting in saturation of the image (see Fig. 2). Whilst this makes determining peak density difficult, it ensures that the edges of the cloud are accurately determined. The cost is then calculated from all data between a lower and an upper threshold optical depth. The lower threshold is determined by the noise in the system. The upper threshold is set slightly lower than the saturation level of the image. Only data between these bounds is used, and the cost is simply the average of these values. In practice this means the sharper the edges of the cloud, the lower the cost. Indeed, low quality thermal clouds have broad edges, whereas the ideal BEC has much sharper edges. Each parameter set is tested twice, with the average of the two runs used for the cost. Tests of the run-to-run variation in cost for a fixed set of parameters indicate that it obeys a Gaussian distribution. As such, we are able to estimate the uncertainty from two runs as twice the range (the absolute difference between the two costs). In doing so, the chance that we have underestimated the uncertainty is 27%. We therefore also apply bounds to the uncertainty to prevent outliers from overly affecting the modeling process. The cost function can be evaluated as long as some atoms are present at the end of the evaporation run. In cases where a set of parameters produced no cloud in both runs, we set the cost to a default high value.
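The two-run estimate and the quoted 27% can be checked with a short calculation. The sketch below assumes that the quantity being estimated is the single-run standard deviation σ of a Gaussian cost. The difference of two runs is then Gaussian with standard deviation √2 σ, so twice the range falls below σ with probability P(|Z| < 1/(2√2)) = erf(0.25), where Z is a standard normal variable. The function names and the bound values `u_min` and `u_max` are hypothetical.

```python
import math


def cost_and_uncertainty(c1, c2, u_min, u_max):
    """Combine two runs of the same parameter set.

    Returns the mean cost and an uncertainty equal to twice the range,
    clipped to [u_min, u_max] (the bound values are hypothetical here).
    """
    cost = 0.5 * (c1 + c2)
    uncertainty = 2.0 * abs(c1 - c2)
    uncertainty = min(max(uncertainty, u_min), u_max)
    return cost, uncertainty


def prob_underestimate():
    """Chance that twice the two-run range is smaller than sigma.

    ASSUMPTION: both runs are independent draws from N(mu, sigma^2), so
    c1 - c2 ~ N(0, 2 sigma^2) and 2|c1 - c2| < sigma iff |Z| < 1/(2 sqrt(2)).
    For a standard normal, P(|Z| < a) = erf(a / sqrt(2)), giving erf(0.25).
    """
    return math.erf(0.25)


cost, unc = cost_and_uncertainty(1.0, 1.2, u_min=0.05, u_max=0.3)
p = prob_underestimate()   # roughly 0.276, consistent with the quoted 27%
```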

Figure 2: The optimization of the evaporation stage of creating a BEC using the complex 16 parameter scheme. The first 20 evaluations are an initial training run using a simple Nelder-Mead algorithm. The machine learning algorithm (green) then quickly optimizes to BEC. The insets show the different regimes experienced by the experiment, from a completely Gaussian thermal distribution, through the bimodal distribution containing a thermal background, to the sharp-edged BEC. The included cross-section illustrates how the cost decreases as the edges of the cloud get sharper.

Algorithm

We treat the experiment as a stochastic process $\mathcal{C}(X)$ which is dependent on the parameters $X = (x_1, \ldots, x_M)$. When we make a measurement and determine a cost, we interpret this as a sample $C(X)$ of this process with some associated uncertainty $U(X)$. We define the sets of all parameters, costs and uncertainties previously measured as $\mathcal{X} = (X_1, \ldots, X_N)$, $\mathcal{C} = (C_1, \ldots, C_N)$ and $\mathcal{U} = (U_1, \ldots, U_N)$ respectively, and collectively refer to these sets as our observations $\mathcal{O} = (\mathcal{X}, \mathcal{C}, \mathcal{U})$. The aim of OO is to use previous observations $\mathcal{O}$ to plan future experiments in order to find a set of parameters that minimizes the mean cost of the stochastic process, $M_{\mathcal{C}}(X)$. Unique to the MLOO approach, we first make an estimate $\hat{\mathcal{C}}$ of the stochastic process given our observations $\mathcal{O}$, which is then used to determine what parameters to try next.

We model $\hat{\mathcal{C}}$ as a GP (a distribution over functions) with a constant mean function and a covariance defined by a squared exponential correlation function $K(X_p, X_q; H) = \exp\left(-\sum_{k=1}^{M} (x_{k,p} - x_{k,q})^2 / h_k^2\right)$, where $H = (h_1, \ldots, h_M)$ is a set of correlation lengths, one for each of the parameters. The mean function of our GP conditional on the observations $\mathcal{O}$ and correlation lengths $H$ is $\mu_{\hat{\mathcal{C}}}(X|\mathcal{O}, H)$, which is evaluated through a set of matrix operations (ref. 26; see Supplemental Equations). As we are using a GP, we can also get the variance of the functions conditioned on $\mathcal{O}$ and $H$: $\sigma^2_{\hat{\mathcal{C}}}(X|\mathcal{O}, H)$ (ref. 26). Both of these estimates depend on the correlation lengths $H$, normally referred to as the hyperparameters of our estimate. We assume that $H$ is not known a priori and needs to be fitted online.
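The matrix operations behind the conditional mean and variance can be sketched with the standard GP regression formulas. This is a minimal illustration, not the authors' implementation. It assumes a unit prior variance, a constant mean set to the average of the observed costs, a kernel of the form exp(-Σ_k Δx_k² / h_k²), and measured uncertainties entering as noise variances on the diagonal of the covariance matrix. All function names are hypothetical.

```python
import numpy as np


def sq_exp_kernel(A, B, h):
    """Squared exponential correlation between rows of A and rows of B.

    ASSUMPTION: K[p, q] = exp(-sum_k (A[p, k] - B[q, k])**2 / h[k]**2).
    """
    diff = (A[:, None, :] - B[None, :, :]) / np.asarray(h, dtype=float)
    return np.exp(-np.sum(diff**2, axis=-1))


def gp_predict(X_obs, c_obs, u_obs, X_new, h):
    """Posterior mean and variance of the cost at X_new for fixed lengths h."""
    X_obs = np.asarray(X_obs, dtype=float)
    c_obs = np.asarray(c_obs, dtype=float)
    u_obs = np.asarray(u_obs, dtype=float)
    X_new = np.asarray(X_new, dtype=float)

    m = c_obs.mean()                                   # constant mean (assumption)
    K = sq_exp_kernel(X_obs, X_obs, h) + np.diag(u_obs**2)
    K_s = sq_exp_kernel(X_new, X_obs, h)

    L = np.linalg.cholesky(K)                          # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, c_obs - m))
    mean = m + K_s @ alpha

    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(v**2, axis=0)                   # unit prior variance (assumption)
    return mean, np.maximum(var, 0.0)                  # clip tiny negative round-off
```

Near an observed point the returned variance falls towards the measurement noise, and far from all observations the mean relaxes to the constant mean and the variance to the prior value.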

The correlation lengths $H$ control the sensitivity of the model to each of the parameters and relate to how much a parameter needs to be changed before it has a significant effect on the cost (see Fig. 1). A standard approach to fit $H$ is maximum likelihood estimation (ref. 26). Here, the hyperparameters are globally optimized over the likelihood of the parameters $H$ given our observations $\mathcal{O}$, or $L(H|\mathcal{O})$ (ref. 26; see Supplemental Equations). However, when the data set is small there will often be multiple local optima for the hyperparameters whose likelihoods are comparable to the maximum. We term these hyperparameters the hypothesis set $\mathcal{H} = (H_1, \ldots, H_P)$, with corresponding likelihood set $\mathcal{L} = (L_1, \ldots, L_P)$.
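To make the idea of a hypothesis set concrete, the sketch below scores candidate correlation lengths with the standard GP log marginal likelihood and keeps every candidate whose likelihood is close to the best one. This is a simplification: it scans a fixed list of candidates instead of searching for local optima of the likelihood. The kernel form, the unit prior variance, the constant mean, and the `max_log_drop` threshold are assumptions for this example, and all names are hypothetical.

```python
import numpy as np


def sq_exp_kernel(A, B, h):
    """ASSUMED kernel: exp(-sum_k (A[p, k] - B[q, k])**2 / h[k]**2)."""
    diff = (A[:, None, :] - B[None, :, :]) / np.asarray(h, dtype=float)
    return np.exp(-np.sum(diff**2, axis=-1))


def log_likelihood(X_obs, c_obs, u_obs, h):
    """Log marginal likelihood of correlation lengths h given the observations."""
    r = c_obs - c_obs.mean()                           # constant mean (assumption)
    K = sq_exp_kernel(X_obs, X_obs, h) + np.diag(u_obs**2)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))         # log|K| from the Cholesky factor
    return -0.5 * r @ alpha - 0.5 * log_det - 0.5 * len(r) * np.log(2.0 * np.pi)


def hypothesis_set(X_obs, c_obs, u_obs, candidates, max_log_drop=2.0):
    """Keep candidates whose log likelihood is within max_log_drop of the best.

    Simplification: candidates come from a fixed list, not a local-optimum search.
    """
    scored = [(h, log_likelihood(X_obs, c_obs, u_obs, h)) for h in candidates]
    best = max(ll for _, ll in scored)
    return [(h, ll) for h, ll in scored if ll >= best - max_log_drop]
```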

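To produce our final estimates for the mean function and variance we treat each hypothesis as a particle (ref. 30) and perform a weighted average over $\mathcal{H}$. The weighted mean function is now defined as $M_{\hat{\mathcal{C}}}(X) \equiv \sum_{i=1}^{P} w_i \, \mu_{\hat{\mathcal{C}}}(X|\mathcal{O}, H_i)$ and the weighted variance of the functions is $\Sigma^2_{\hat{\mathcal{C}}}(X) \equiv \sum_{i=1}^{P} w_i \left[ \sigma^2_{\hat{\mathcal{C}}}(X|\mathcal{O}, H_i) + \mu^2_{\hat{\mathcal{C}}}(X|\mathcal{O}, H_i) \right] - M^2_{\hat{\mathcal{C}}}(X)$, where $w_i = L_i / \sum_{j=1}^{P} L_j$ are the relative weights for the hyperparameters. Now that we have our final estimate for $\hat{\mathcal{C}}$, we need to determine an optimization strategy for picking the next set of parameters to test.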
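A likelihood-weighted average over hypotheses can be sketched as follows. The example assumes the weighted variance takes the standard mixture form, that is, the weighted within-hypothesis variance plus the spread between the hypothesis means. It also assumes the likelihoods are supplied as logarithms for numerical stability. Both are choices made for this illustration, and the function name is hypothetical.

```python
import numpy as np


def combine_hypotheses(means, variances, log_likelihoods):
    """Likelihood-weighted mean and variance over P hypotheses.

    means, variances : arrays of shape (P, n_points), one row per hypothesis.
    log_likelihoods  : array of shape (P,).
    ASSUMPTION: the combined variance is the mixture variance
        sum_i w_i * (var_i + mean_i**2) - M**2.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    log_l = np.asarray(log_likelihoods, dtype=float)

    w = np.exp(log_l - log_l.max())    # subtract the max to avoid overflow
    w /= w.sum()                       # relative weights, summing to one

    M = w @ means
    S2 = w @ (variances + means**2) - M**2
    return M, S2, w


# Two equally likely hypotheses that disagree about the mean at one point:
M, S2, w = combine_hypotheses([[1.0], [3.0]], [[0.5], [0.5]], [-4.0, -4.0])
```

In this example the combined variance (1.5) exceeds either individual variance (0.5) because the two hypotheses disagree, which is what makes high-variance regions useful for discriminating between hypotheses.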

Consider the following two strategies. We could always test parameters that are predicted to minimize the mean cost $M_{\hat{\mathcal{C}}}(X)$, making our learner act as an ‘optimizer’, but this learner could get trapped in local minima and reinforce its ignorance. Alternatively, we could test parameters that maximize the uncertainty $\Sigma_{\hat{\mathcal{C}}}(X)$ (i.e. where we are most uncertain). This would provide us with experimental data that helps us best refine our model and discriminate between the hypotheses, making our learner act like a ‘scientist’, but this learner may require a large number of trials to map the space and would not prioritize refinement of the global minimum. We chose to implement a balanced strategy that repeatedly sweeps between these two extremes by minimizing a biased cost function: $B_{\hat{\mathcal{C}}}(X) \equiv b \, M_{\hat{\mathcal{C}}}(X) - (1 - b) \, \Sigma_{\hat{\mathcal{C}}}(X)$, where the value of $b$ is linearly increased from 0 to 1 in a cycle of length $Q$. During testing with synthetic data, we found that sweeping the learner between acting like a ‘scientist’ ($b = 0$) and an ‘optimizer’ ($b = 1$) was more robust and efficient than fixing the learner to one strategy. When we minimized $B_{\hat{\mathcal{C}}}(X)$ we also put bounds on the search relative to the last best measured $X$, set to 20% of each parameter's range (maximum minus minimum). We call these bounds a leash, as they restricted how fast the learner could change the parameters but did not stop it from exploring the full space (similar to trust regions, refs 39 and 40). This was a technical requirement for our experiment: when a set of parameters was tested that was very different from the last set, the experiment almost always produced no atoms, meaning we had to assign a default cost that did not provide meaningful gradient information to the learner. Once the next set of parameters is determined, it is sent to the experiment to be tested. After the resultant cost is measured, it is added to the observation set with $N \rightarrow N + 1$ and the entire process is repeated.
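The sweep between ‘scientist’ and ‘optimizer’ and the leash can be sketched as below. The biased cost is taken to be b times the predicted mean minus (1 − b) times the predicted standard deviation, which matches the two limits described above (b = 0 maximizes uncertainty, b = 1 minimizes the mean). Two assumptions are made here: the cycle includes both end points, and the 20% leash is a half-width either side of the best point, clipped to the allowed range. The function names are hypothetical.

```python
def bias_value(n, cycle_length):
    """b for the n-th learner-chosen run: rises linearly from 0 to 1, then repeats.

    ASSUMPTION: a cycle of length Q visits b = 0, 1/(Q-1), ..., 1.
    """
    if cycle_length < 2:
        return 1.0
    return (n % cycle_length) / (cycle_length - 1)


def biased_cost(mean, std, b):
    """b = 1 minimizes the predicted mean; b = 0 maximizes the predicted uncertainty."""
    return b * mean - (1.0 - b) * std


def leash_bounds(best_x, lower, upper, fraction=0.2):
    """Search bounds around the best measured parameters.

    ASSUMPTION: the leash is +/- fraction * (upper - lower) for each parameter,
    clipped so it never leaves the allowed range.
    """
    bounds = []
    for x, lo, hi in zip(best_x, lower, upper):
        half_width = fraction * (hi - lo)
        bounds.append((max(lo, x - half_width), min(hi, x + half_width)))
    return bounds


# One sweep of a cycle of length 5, from pure 'scientist' to pure 'optimizer':
sweep = [bias_value(n, 5) for n in range(5)]
```

Because the leash is re-centred on the best point after each improvement, the learner can still reach any part of the parameter space over several iterations.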

As a benchmark for comparison, we also performed OO using a Nelder-Mead solver (ref. 41), which has previously been used to optimize quantum gates (ref. 25).