Unified quantum exploration algorithm

In this work, we seek to obtain pulse sequences that can unitarily steer a quantum system towards given desired dynamics. For our purposes, we quantify this task through the state-averaged overlap fidelity \({\mathcal{F}}(U(t))\) with respect to a target unitary \({\hat{U}}_{\text{target}}\),

$${\mathcal{F}}(\hat{U}(t))={\left|\frac{1}{\text{dim}}{\text{Tr}}\left[{\hat{U}}^{\dagger }(t){\hat{U}}_{\text{target}}\right]\right|}^{2}.$$ (1)

Here, \(\hat{U}(t)\) denotes the time-evolution operator of the system, which solves the Schrödinger equation. We fix for concreteness our physical architeture as superconducting circuit QED,49 being both a highly tunable and potentially scalable architecture, with potential near-term applications.50 The system is chosen to be a resonator-coupled two-transmon system, as depicted in Fig. 1a. Here the transmon qubits are mounted on either side of a linear resonator and we drive the first qubit with an external control \(\Omega\), which could be a piecewice constant pulse as depicted in the bottom of the figure. The system dynamics are governed by the Hamiltonian51

$$\hat{H}(t)=\Delta {\hat{b}}_{1}^{\dagger }{\hat{b}}_{1}+J({\hat{b}}_{1}^{\dagger }{\hat{b}}_{2}+{\hat{b}}_{1}{\hat{b}}_{2}^{\dagger })+\Omega (t)({\hat{b}}_{1}^{\dagger }+{\hat{b}}_{1}),$$ (2)

where \({b}_{j}\) is the qubit-lowering operator for transmon \(j\), and the external control \(\Omega (t)\) is shaped by the optimization algorithm to maximize (1), with \({\hat{U}}_{\text{target}}=\sqrt{ZX}\) being a standard entangling gate (see Methods). \({\hat{U}}_{\text{target}}\) with single-qubit gates form a universal gate set, e.g., for quantum computation on a surface code circuit-QED layout. We fix the parameters to be within typical experimental values (see e.g. refs 52,53) for the qubit–qubit coupling \(J/2\pi\) = 5 MHz and the detuning \(\Delta /2\pi\) = 0.35 GHZ.

Fig. 1: Overview of circuit schematic and the AlphaZero algorithm. a Circuit-QED architecture consisting of two qubits (colored boxes) mounted on either side of a transmission line resonator. The first qubit is directly driven at the resonance frequency of the second one for a cross-resonance gate. An example of a piecewise-constant pulse is depicted below the setup. b The schematics of a Monte Carlo tree search. Here the nodes are depicted as pulse sequences and the edges as lines. A single search consists of a forward propagation, expansion, and a back-pass (see text). During the expansion phase, the corresponding state (unitary) is fed as input to the neural network. The edges are initialized using the probabilities \({\bf{p}}\) from the neural network and the back-pass uses the value \(v\) also given by the neural network. c The neural network architecture used in AlphaZero. The network takes the state (unitary with \(N=4\)) encountered in the tree search, as input and outputs probabilities for selecting individual actions \({\bf{p}}=\{{p}_{{a}_{1}}\!,{p}_{{a}_{2}}\!, \ldots \}\) and an estimate of the final score (fidelity) \(v\). Full size image

We consider three optimization classes to test a unified AlphaZero algorithm and benchmark it against both domain-specialized and domain-agnostic algorithms. These three correspond to control parameters \(\Omega (t)\) that are digital, i.e. taken from a discrete set of possibilities; that can vary continuously as a function of continuous but highly filtered controls; and lastly, piecewise-constant controls, which is standard in the QOCT approximation.

Within the RL framework, an autonomous agent must interact with an environment that at each time step \(t\) inhabits a state \({s}_{t}\). Here we choose the unitary \(\hat{U}(t)\) to represent this state. The agent then alters the unitary at each time step \(t\) by applying an action \({a}_{t}\) (here \(\Omega (t)\)) that transforms the unitary \(\hat{U}(t)\to \hat{U}(t+\Delta t)\). For instance if the task is to create a piecewise-constant pulse sequence as depicted in Fig. 1a., the different actions would correspond to a finite set of predefined pulse amplitudes. This is unlike standard QOCT, where the amplitudes are allowed to vary continuously. In the following, we will refer to the generation of an entire pulse sequence as an episode.

The purpose of the agent is to maximize an expected score \(z\) at the end of an episode, which we choose to be the fidelity \(z={\mathcal{F}}(U(T))\). Here \(T\) denotes the final time. This is done by implementing a probabilistic policy \({\boldsymbol{\pi }}({s}_{t})=({\pi }_{{a}_{1}}\!,{\pi }_{{a}_{2}}\!,\ldots )\), which maps states \({s}_{t}\) to probabilities of applying actions, i.e. \({\pi }_{{a}_{j}}=\,{\text{Pr}}\,({a}_{j}| {s}_{t})\). The agent attempts to improve the policy by gradually updating it with respect to the experience it gains.

Figure 1b, c illustrate the tree search and the neural network for AlphaZero, respectively. The unitary found in the tree search is used as input for the neural network. The upper output of the neural network approximates the present policy for a given input state, i.e. \({p}_{a} \sim {\pi }_{a}\). Meanwhile, the lower output provides a value function which estimates the expected final reward, that is \(v({s}_{t}) \sim {\mathcal{F}}(T)\). In our work, we have found that providing AlphaZero with complete information of the physical system in form of the unitary to benefit its performance, though this may scale poorly for systems with larger Hilbert spaces.

Both functions use only information about the current state and suffer from being lower-dimensional approximations of extremely high-dimensional state and action spaces. The insight of the AlphaZero algorithm is to supplement the predictive power of the value function \(v({s}_{t})\) with retrodictive information coming from future action decisions in a Monte Carlo search tree. The tree depicted in Fig. 1b consists of nodes, which represent states (here depicted as pulses) and edges, which are state-action pairs (depicted as lines). At each branch in the tree, the algorithm chooses actions based on a combination of those with the highest expected reward and the highest uncertainty, a measure of which edges remain unexplored. Whenever new states (called leaf-nodes) are explored, the neural network is used to estimate the value of that node, and the information is propagated backward in the tree to the root node. The forward and backward traversals of the tree are described in greater detail in Methods. For the interested reader we have also provided a step-by-step walkthrough of the algorithm in the Supplementary Materials.

In the manner described above, the predictive nature of the network is able to inform choices in the tree while the retrodictive information coming back in time is able to give better estimates of the state values already explored, which are then used to train the network, i.e. update the network parameters in order to improve its predictions. The training occurs after each completed episode. This reinforcing mechanism is thus able to globally learn about the parameter landscape by choosing the most promising branches while effectively culling the vast majority of the rest. The result is neither an exhaustive sampling at full depth, which would yield the true landscape albeit at a computationally untenable cost, nor is it an exhaustive sampling at shallow depth, which would require a prohibitively slow learning rate for information from the full depth of the tree to propagate back. Instead, AlphaZero intelligently balances the depth and the breadth of the search below each node. While the hidden-variable approximation given by the neural network and MC tree are certainly not exhaustive and cannot find solutions with exponentially small footprint, it is nonetheless able to discover patterns and learn an effective global policy strategy that produces robust, heterogeneous classes of promising solutions. In our implementation we restrict AlphaZero such that it can only find new unique solutions, which is done by cutting off branches in the tree that have previously been fully explored. Hence, each solution found by AlphaZero is different from any previously found solution.

In what follows we apply the algorithm with a unified set of algorithmic parameters (hyperparameters) to three optimization classes: discrete, continuous, and continous with strong constraints. We have found these hyperparameters by fine-tuning them with respect to the continuous problem. The three problem types accentuate different optimization strategies. In the discrete optimization case, we show how AlphaZero stands up against other domain-agnostic methods (where the gradient is not defined) and compare their abilities to learn structures in the parameters. For the constrained continuous pulses, we validate the hypothesis that the analytical gradient, while computable, is highly inefficient and indeed unable to find near global solutions that are at least as good as those found by AlphaZero. Finally, in the continuous-valued piecewise-constant case, we show the balance between state-of-the-art physics-specialized and agnostic AlphaZero approaches. We show that the combination of exploration and exploitation is able to produce new clusters of high-quality solutions that are otherwise highly unlikely to be found, while learning hidden problem symmetry.

Digital gate sequences

As a first application with AlphaZero, we demonstrate optimal control using Single Flux Quantum (SFQ) pulses.45,54,55 The aim is to control the quantum system by using a pulse train that consists of individual, very short pulses typically in the pico-second scale. This technology originated as way of utilizing superconductors for large-scale, ultrafast, digital, classical computing.56 At each time slice there either is a pulse or not, which implies that the unitary evolution is governed by two unitaries \({\hat{U}}_{1}\) and \({\hat{U}}_{0}\). Hence, the pulse train can be stored as a digital bit string with 0 and 1 denoting no pulse and a single pulse respectively. SFQ devices are interesting candidates for quantum computation since they potentially allow for ultrafast gate operations as well as scalable quantum hardware.55 We model the pulses as \(\Delta t\) = 2.0 ps Gaussian functions \(\Omega (t)=\frac{a}{\sqrt{2\pi }\tau }{{\rm{e}}}^{-\frac{{(t-\Delta t/2)}^{2}}{2{\tau }^{2}}},\) where \(\tau\) = 0.25 ps and \(a=2\pi /1000\). The pulse is depicted to the right in Fig. 2a.

Fig. 2: Optimization in the digital case (SFQ). a Illustration of how a SFQ pulse train can be encoded into a bit string, with bits displayed above the figure and gate number below. The figure also contains a zoom-in that depicts the exact shape of a single pulse. b The best obtained infidelity (\(1-{\mathcal{F}}\)) at each gate duration for different discrete optimization algorithms. Each algorithm had 50 h wall time for each gate duration. c A comparison between the infidelities obtained by AlphaZero and the GA at 60 ns. For AlphaZero, each dot represents the infidelity obtained at the end of a unique episode, i.e. generation of an entire pulse-sequence, while for the GA each dot represents the highest scoring member in the population after each iteration. Full size image

The optimization task is to find the input string that maximizes the fidelity functional (1). The current approach for this type optimization is to apply a genetic algorithm (GA)45,57,58. Besides GA and AlphaZero, we also compare a conventional algorithms, stochastic descent (SD) as in ref. 24 SD is a time-local, greedy optimizer that changes the pulse at a randomly chosen time if this results in an increasing fidelity.

Our unified AlphaZero algorithm has an action space of 60 for the neural network, and thus we group together binary SFQ action choices of multiple time steps. For this purpose, we take larger steps in time, and the 60 action choices are given using bit strings from a randomly chosen basis (see Methods). We benchmark the different algorithms by using equal wall-time simulations. For all simulations presented in this paper, we used a wall-time of 50 h on an Intel Xeon X5650 CPU (2.7 GHz) processor. Similar to ref. 45 we use a population size of 70 with a mutation probability of 0.001 for the GA (see Methods). In the Supplementary Material, we provide an alternative analysis of the wall time consumption needed to reach a predefined infidelity.

The results are plotted in Fig. 2b. Amongst conventional approaches, we see the SD algorithm performs slightly better than the GA. We attribute this to the fact that the SD algorithm is a greedy exploitation algorithm, while the GA is an exploration algorithm performing random permutations. As with many exploration algorithms, learning can be quite slow. We emphasize that AlphaZero contains a deep lookahead tree search, which we found crucial to the success of our RL implementation (having originally tested a simpler RL algorithm, Deep Q-Network (DQN),31 on a less complicated problem). We see in Fig. 2b that AlphaZero indeed performs dramatically better than the greedy approach, with up to over an order of magnitude improvement in the low error regime. Interestingly, AlphaZero’s solutions begin to fluctuate in the low infidelity regime. We attribute this to the neural network’s incapability of distinguishing between different unitaries belonging to the very low infidelity regime. However, AlphaZero still maintains a systematic improvement over the other algorithms.

AlphaZero and GA are both learning algorithms in the sense that they utilize previous obtained solutions in order to form new ones. We compare the learning curves for the two algorithms in Fig. 2c, where we have plotted the infidelity as a function of wall time at 60 ns. For AlphaZero, we use the infidelity after each episode, where each data point is unique. For GA, we use the best performing solution in the population after each iteration. Since GA is a relatively greedy algorithm it performs very well initially, but fails to explore the larger solution space as the members in the population converge upon a single class of solution and the learning curve flattens out. In contrast, AlphaZero keeps a high level of exploration that ultimately allows it to reach a very large number of different high-fidelity solutions.

Constrained analog pulses

A common challenge within quantum optimization is achieving realistic and efficient controls when experimental limitations constrain the underlying dynamics. Such constraints become very important when high precision is required, e.g. for very high-fidelity operation of quantum technologies. Here, we consider standard constraints on duration, bandwidth, and maximum energy. Such constraints can be expected to greatly increase the computational cost of Hessian approximation-based solutions, which are otherwise known to converge quickly16 and generally outperform other greedy methods.21,22 The workhorse algorithm for this is GRAPE,8 with quasi-Newton20 and exact derivative59 enhancements being crucial to the state of the art and its super-linear convergence. Note that GRAPE, unlike AlphaZero, can handle pulses that are continuous in amplitude.

We model the bandwidth constraints via a convolution with a Gaussian filter function

$$\tilde{\Omega }(t)={\int_{-\infty}^{\infty}}{\rm{e}}^{-\frac{{(t-t^{\prime} )}^{2}}{{\sigma }^{2}}}\Omega (t^{\prime} ){\rm{d}}t^{\prime} ,$$ (3)

where \(\tilde{\Omega }(t)\) denotes the filtered control function. Figure 3a illustrates the effect of this filter. Here, a piecewise-constant pulse (dark blue) with amplitudes \({a}_{1-4}\) is convoluted into a smooth pulse (light orange) via Eq. (3). Throughout the remainder of this paper, we constrain the pulse amplitude to lie between 0 and \({\Omega }_{\text{max}}/2\pi\) = 1.0 GHz.

Fig. 3: Optimization of constrained continuous pulses (Gaussian filtering). a A piecewise-constant pulse (dark blue) convoluted by a Gaussian filter (light orange). The vertical orange lines depict the resolution, i.e. number of substeps per 4.0 ns. Here \(\sigma\) = 0.7 ns. b The infidelity (error) between the exact and the calculated unitary as a function of its resolution. c Comparison between AlphaZero and GRAPE on the cross-resonance gate using Gaussian filtered pulses. Each point represents the best obtained infidelity during a 50 h wall time simulation at that specific gate duration. Full size image

Most commonly, GRAPE is applied to piecewise-constant pulses, but it can be modified to include filtering,59,60 as we also do here. Each time-step is divided into a number of substeps (giving the resolution) and the filtered pulse is then approximated as being constant within each substep. This subdivision is depicted in Fig. 3a as light orange vertical lines. In order to obtain the gradient, GRAPE calculates the time-evolution unitary using matrix exponentiation at each substep.

Figure 3b shows the error (infidelity) between the exact and discretized unitaries as a function of the resolution. If we seek errors well below the desired gate error (\(1{0}^{-2}\)), the resolution should be well above a hundred. This significantly impedes the performance of GRAPE for this type of problem, since it requires considerably more matrix multiplications. A different strategy is to limit the control to a set of discretized amplitudes whose corresponding unitary can be calculated in advance and then apply a discretized optimization algorithm such as AlphaZero. In order to do so, we apply a two-action update strategy, where we propagate from half the previous pulse to halfway into the next one. So, if the previous action was \({a}_{2}\) and the next one \({a}_{3}\) then the unitary \({U}_{2,3}\) would correspond to the shaded region in Fig. 3a. Here we ignore negligible contributions from adjacent pulses. For instance, calculating \({U}_{2,3}\) would be independent of \({a}_{1}\) and \({a}_{4}\). Here we limit the amplitude to 60 different values (out of a continuous set), hence this methods requires calculating \(6{0}^{2}=3600\) unitaries, which we do in the beginning of the simulation.

In our comparison between AlphaZero and GRAPE, we choose 4.0 ns convoluted pulses using \(\sigma\) = 0.7 ns. For GRAPE, we choose a resolution of 200, i.e., every 4.0 ns interval is divided into 200 substeps. Figure 3c shows the results of an equal wall-time simulation, where GRAPE was applied with random seeding. Here, AlphaZero obtains a systematic improvement over its domain-specialized counterpart. At 96 ns, AlphaZero outperforms GRAPE with an improvement that is significantly above one order of magnitude. Interestingly, both graphs show significant fluctuations, which we attribute to the difficulty of the optimization task itself caused by the highly constrained dynamics. This is likely compounded by the random initialization of the neural network which can effect the convergence properties of AlphaZero. Despite these fluctuations, AlphaZero performs significantly better in the regime of interest corresponding to infidelities below \(1{0}^{-2}\).

Piecewise-constant analog pulses

So far, we have considered problems where gradient searches have not been applicable (digital sequence) or where gradient searches become inefficient (constrained analog pulses). For specific tasks where highly specialized algorithms exist and are known to perform relatively well, domain-agnostic algorithms typically perform inadequately. Thus, to properly benchmark our algorithm we have also considered the domain of piecewise-constant pulses, a scenario where GRAPE typically performs extremely well due to the presence of high-frequency components and the limited number of matrix multiplications. In the following we hence focus on picewise constant pulses where we choose a single-step duration of 2 ns. In this scenario, we characterize the performance of the exploitation and exploration algorithms in terms of both the variety of solutions found and the quality of the solutions. Here we also introduce a new RL algorithm, named Q-learning. Q-learning was one of the first RL algorithms developed, and applied recently to quantum control.24,37 It is a tabular-based algorithm that applies one-step updates in order to solve the optimal Bellman equation61 (see Methods). Here we apply an extended version that uses eligibility traces and regular replays, similar to ref. 37 Originally, we attempted to apply Q-learning to digital gates (SFQ) pulses, but ultimately the algorithm failed due to the very large search space, where tabular-based methods are known to break down. This is one reason why modern RL algorithms use deep neural networks instead, motivating also our use of AlphaZero.

At first, we compare the algorithms already discussed, namely Q-learning, Stochastic Descent, AlphaZero, and GRAPE. Figure 4a shows GRAPE is able to outperform the other algorithms for piecewise-constant pulses. However, AlphaZero still performs well despite its limitation of only having amplitude-discretized controls. To improve the AlphaZero algorithm further we conceive a hybrid algorithm where GRAPE optimizes the solutions found by AlphaZero. The hybrid algorithm, which is given the same wall-time as the others, is also plotted in Fig. 4a. Here the hybrid algorithm shows a significant improvement over GRAPE near 60 ns, which we relate to the presence of a quantum speed limit (QSL) where the optimization task becomes difficult due to induced traps in the fidelity landscape.22,24,26 It is also worth noting that the optimization curve flattens out and the two algorithms again perform equally well when the pulse duration goes beyond 62 ns. Interestingly, there is an entirely different time scale, beyond 200 ns (not plotted here), for reaching gate infidelities below \(1{0}^{-4}\).

Fig. 4: Optimization on continuous (piecewise-constant) pulses. a An equal wall-time comparison between the various algorithms. The AlphaZero (here abbreviated AZ) Hybrid is presented in the text. b The fraction of successful solutions found by AlphaZero Hybrid and the GRAPE algorithm. Full size image

We also quantify the number of successful solutions found by either GRAPE or the hybrid AlphaZero algorithm, which we define as solutions having infidelities within four times the lowest infidelity obtained. The fraction of successful solutions are plotted in Fig. 4b. Here the improvement is even more substantial. At 60 ns, we find almost three orders of magnitude more successful solutions compared to GRAPE with random seeding. The fact that the GRAPE-curve dips around 60 ns seems to confirm our previous statement about the QSL in the sense that this is a combinatorially harder region to obtain relatively good solutions. We believe this marks a transition into a combinatorially harder optimization region in the control landscape similar to the glassy phase transition studied in ref. 62 Having a large number of good solutions is especially important because experimentally it may be that some are better suited or some provide additional advantages.

To further investigate the differences between the two algorithms, we perform a clustering analysis of the obtained solutions similar to ref. 11 We compare the exploration of the control parameter landscape using a two-dimensional embedding provided by the t-SNE (t-Distributed Stochastic Neighbor Embedding) visualization method.63

The t-SNE algorithm represents a set of high-dimensional vectors by a set of low-dimensional vectors, e.g. two dimensions as used here. If the Euclidean distance between any pair of vectors in the high-dimensional set is low (high) then the Euclidean distance between the same pair in the low-dimensional set should also be low (high). This allows one to visualize and investigate clustering tendencies. Specifically, the distances are weighted by the Student’s t-distribution whereby nearby points are strongly favored. The algorithm then calculates joint probability distributions that any pair of vectors are “close” to each other both in the high- and low -dimensional set, and then minimizes the Kullback-Leibler divergence between the two distributions. See ref. 63 for a more detailed description.

We do a single t-SNE analysis at 60 ns, plotted in Fig. 5, which we have separated for clarity into different figures for GRAPE (a), AlphaZero i.e. Hybrid before optimization (b), and after optimization (c). Here the color scale depicts the infidelity. We used a perplexity of 30 (a paramater t-SNE uses that estimates how many nearest neighbors each point has), which we found gave reasonable clustering-plots.

Fig. 5: Two-dimensional representation of the final pulse vectors at 60 ns using the t-SNE algorithm. t-SNE is a visualization method that here represents the final pulse vectors by a set of two-dimensional vectors, in such a way that if any two pulse vectors are close to each other then the coordinates of their two-dimensional representations should be relatively close to each other as well. The color scale shows the infidelity of the pulses. a GRAPE with random seeding, b AlphaZero, c The Hybrid, i.e. AlphaZero solutions after being optimized with GRAPE. In the latter case, some example high-fidelity pulses are shown. Full size image

Strikingly, the two algorithms seem to prefer entirely different portions of the landscape. GRAPE mostly finds solutions clustered to the left in the t-SNE representation, but its high performing solutions are actually clustered to the right. Interestingly, AlphaZero primarily finds solutions in the right region, which implies that AlphaZero has identified an underlying basic generic structure of good solutions. When all the AlphaZero solutions are optimized this leads to a large quantity of high performing solutions that inhabit the same region in the t-SNE representation.

We also see that the hybrid solutions naturally cluster towards some general basins of attraction. This suggests that AlphaZero has not converged on a single class but multiple different classes of solutions with different underlying physics. Some pulses from different clusters are depicted, showing some resemblance to typical bang-bang sequences. The different clustering that occurs demonstrates that a global exploration has indeed taken place, effectively finding different classes of solutions in different parts of the landscape.

We further test the hypothesis that AlphaZero has found underlying structure that supersedes a shallow heuristic search. Note that the solutions seem to have at least some symmetry with respect to a reflection around the center of the time-axis. In fact, this symmetry already exists in the control problem. Since the Hamiltonian is real and the target its own transpose, the fidelity is unchanged if the pulse sequence is reversed i.e. \({\mathcal{F}}({\Omega }_{1},{\Omega }_{2},\ldots ,{\Omega }_{N-1},{\Omega }_{N})={\mathcal{F}}({\Omega }_{N},{\Omega }_{N-1},\ldots ,{\Omega }_{2},{\Omega }_{1})\). This property defines a discrete \({{\mathbb{Z}}}_{2}\) symmetry similar to what was studied in ref. 64

However, it is not a priori clear that satisfying this symmetry is a good control strategy. We quantify the degree of time-asymmetry in the pulses via the measure

$$\begin{array}{r}C(\{\Omega (t)\})=\displaystyle\frac{1}{N}\sqrt{\sum _{j=1}^{N}| {\Omega }_{j}-{\Omega }_{N-j}{| }^{2}},\end{array}$$ (4)

where \(C=0\) implies pulses that are completely palindromic, i.e. symmetric with respect to reversion of the sequence.

We plot in Fig. 6 the infidelity and the asymmetry for the two algorithms, i.e. in (a) we plot random seeds and the same seeds optimized by GRAPE; in (b) we plot the AlphaZero solutions and the same solutions further optimized by GRAPE, which is the Hybrid algorithm.

Fig. 6: Asymmetry of solutions. The figure depicts the infidelity (\(1-{\mathcal{F}}\)) as a function of the asymmetry measure (4) at 60 ns for: a Random seeds (plotted in red-yellow scale) followed by an optimization with GRAPE (plotted in blue-green scale) and b AlphaZero solution (plotted in red-yellow scale) followed by an optimization with GRAPE (i.e. the Hybrid algorithm, plotted in blue-green scale). The color scale depicts the iteration of the algorithm. Full size image

Here the color scale depicts the iteration number. The first thing to notice is that high-fidelity solutions tend to maintain this symmetry. The second feature is that GRAPE often only partially satisfies this symmetry. In contrast, AlphaZero learns over its training to increasingly prefer this symmetry, moving towards the bottom left of the plot. After post-optimization using GRAPE, the solutions improve significantly in infidelity and move ever further to the bottom left emphasizing this trend. Note that we are not arguing that \(C=0\) is optimal, but merely that increasing the symmetry to some extent seems preferable. We conclude that AlphaZero has identified this underlying symmetry specific to the problem instance we have chosen. Naturally, hard-coding such heuristics would not only be inefficient, but for many problems finding symmetries is nontrivial. Using deep learning, AlphaZero is able to learn these hidden symmetries without the need for human intervention. We therefore expect that AlphaZero’s ability to learn hidden problem structures generalizes to other problems as well.