Google’s superconducting-qubit architecture that allows tunable qubit-qubit coupling36 is called the gmon architecture. To the lowest order of approximation, the system Hamiltonian of gmon qubits consists of one-body and nearest-neighbor-two-body terms represented by bosonic creation and annihilation operators, \(\hat a_j^\dagger\) and \(\hat a_j\), and bosonic number operator \(\hat n_j\), for the j-th bosonic mode. In the rotating-wave approximation (RWA), with a constant rotation rate chosen as the harmonic frequency of the Josephson junction resonator (see Supp. A), the two-qubit gmon Hamiltonian takes the form:

$$\begin{array}{*{20}{l}} {\hat H_{{\mathrm{RWA}}}(t)} \hfill & = \hfill & {\frac{\eta }{2}\mathop {\sum}\limits_{j = 1}^2 {\hat n_j} (\hat n_j - 1) + g(t)(\hat a_2^\dagger \hat a_1 + \hat a_1^\dagger \hat a_2) + \mathop {\sum}\limits_{j = 1}^2 {\delta _j} (t)\hat n_j} \hfill \\ {} \hfill & {} \hfill & { + \mathop {\sum}\limits_{j = 1}^2 i f_j(t)\left( {\hat a_je^{ - i\varphi _j(t)} - \hat a_j^\dagger e^{i\varphi _j(t)}} \right),} \hfill \end{array}$$ (1)

where the time-independent parameter η represents the anharmonicity of the Josephson junction, and the seven time-dependent control parameters are: (1) the amplitude \(f_j(t)\) and (2) the phase \(\varphi _j(t)\) of the microwave control pulse on each qubit; (3) the qubit detuning \(\delta _j(t)\); and (4) the tunable capacitive coupling, or g-pulse, \(g(t)\). The computational subspace is spanned by the two lowest energy levels of each bosonic mode: \({\cal H}_2 = {\mathrm{Span}}\{ |0\rangle _j,|1\rangle _j\}\), where \(|n\rangle _j\) represents a Fock state with n excitations in the j-th mode.
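As an illustration of Eq. (1), the following minimal sketch assembles the two-mode Hamiltonian as a matrix on a truncated Fock space. The function name `gmon_hamiltonian`, the truncation parameter `levels`, and the numerical values in the usage line are assumptions of this sketch, not quantities fixed by the text.

```python
import numpy as np

def gmon_hamiltonian(eta, g, delta, f, phi, levels=3):
    """Matrix of Eq. (1) at one instant, on a truncated two-mode Fock space.

    `levels` (Fock states kept per mode) is an assumption of this sketch.
    Basis ordering: |n1, n2> has index n1 * levels + n2.
    """
    # Single-mode annihilation operator: a|n> = sqrt(n)|n-1>
    a = np.diag(np.sqrt(np.arange(1, levels)), k=1).astype(complex)
    eye = np.eye(levels, dtype=complex)
    a1, a2 = np.kron(a, eye), np.kron(eye, a)
    dim = levels ** 2
    H = np.zeros((dim, dim), dtype=complex)
    for j, aj in enumerate((a1, a2)):
        nj = aj.conj().T @ aj
        H += 0.5 * eta * nj @ (nj - np.eye(dim))   # anharmonicity term
        H += delta[j] * nj                          # qubit detuning
        H += 1j * f[j] * (aj * np.exp(-1j * phi[j])
                          - aj.conj().T * np.exp(1j * phi[j]))  # microwave drive
    H += g * (a2.conj().T @ a1 + a1.conj().T @ a2)  # tunable coupling (g-pulse)
    return H

# Illustrative values only (arbitrary units)
H = gmon_hamiltonian(eta=-200.0, g=5.0, delta=[1.0, 2.0], f=[3.0, 4.0], phi=[0.1, 0.2])
```

Keeping three levels per mode is the smallest truncation that retains the doubly excited states through which leakage out of the computational subspace occurs.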

An effective control cost function is crucial to efficient control optimization and to guaranteeing the full controllability over the quantum system. We propose a control cost function that includes leakage errors, control constraints, total runtime, and gate infidelity as soft penalty terms that are readily optimizable using RL techniques without compromising system controllability. We illustrate the design of a UFO cost function in the tunable gmon superconducting-qubit architecture.36

A unitary gate is realizable through the control of the time-dependent Hamiltonian defined in Eq. (1) according to \(U(T) = {\Bbb T}\left[ {\exp \left( { - i{\int}_0^T {\hat H_{{\mathrm{RWA}}}} (t)dt} \right)} \right]\), with \({\Bbb T}\) denoting the time-ordering operator. The inaccuracy of the controlled two-qubit unitary gate U(T) with respect to a target unitary gate \(U_{{\mathrm{target}}}\) is measured by the gate infidelity: \(1 - F[U(T)] = 1 - (1/16)\left| {{\mathrm{Tr}}(U^\dagger (T)U_{{\mathrm{target}}})} \right|^2\),20,21,22,23 which vanishes only when \(U(T) = U_{{\mathrm{target}}}\) up to a global phase. This definition of control inaccuracy is widely used in quantum control optimization1,16,20,21,22,29,30,34 for its modest computational overhead during iterative optimizations. Diamond distance and average gate infidelity are alternative measures of control inaccuracy. The former provides a better measure of the coherent error but is harder to calculate, and the latter can be measured through randomized benchmarking.36 As shown in refs. 37,38, the gate infidelity is related to both diamond distance and average gate infidelity. To reduce computational overhead, we choose gate infidelity as the first part of our UFO cost function to penalize the control inaccuracy.
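The gate infidelity defined above can be evaluated in a few lines. The sketch below is a direct transcription of the formula, generalized from the two-qubit prefactor 1/16 to 1/d² for a d-dimensional gate; the function name `gate_infidelity` is an assumption of this sketch.

```python
import numpy as np

def gate_infidelity(U, U_target):
    """1 - |Tr(U^dagger U_target)|^2 / d^2, with d = 4 for two qubits."""
    d = U_target.shape[0]
    return 1.0 - abs(np.trace(U.conj().T @ U_target)) ** 2 / d ** 2

CZ = np.diag([1, 1, 1, -1]).astype(complex)
# A global phase on the realized gate does not change the infidelity.
print(gate_infidelity(np.exp(0.3j) * CZ, CZ))
```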

The second part is a penalty term on the accumulated leakage errors derived in Supp. B.2. The last two terms of the control cost function penalize the total runtime T and violation of control boundary conditions. Boundary conditions are chosen to facilitate convenient gate concatenations: microwave pulses and the g-pulse should vanish at both boundaries such that the computational bases and the Fock bases coincide. This is enforced by adding \(\mathop {\sum}\nolimits_{t \in \{ 0,T\} } [g^2(t) + f^2(t)]\) to the control cost function. Such boundary constraints also help to minimize the errors caused by deviations from the RWA arising from the fast-oscillating nature of the non-RWA terms; see Supp. A for details. We thus obtain the full UFO cost function:

$$C(\chi ,\beta ,\mu ,\kappa ) = \chi [1 - F[U(T)]] + \beta L_{{\mathrm{tot}}} + \mu \mathop {\sum}\limits_{t \in \{ 0,T\} } {\left[ {g^2(t) + f^2(t)} \right]} + \kappa T$$ (2)

where χ penalizes the gate infidelity, β penalizes leakage errors, μ penalizes violation of the boundary constraints, and κ penalizes the total runtime. These hyper-parameters are optimized to achieve satisfactory control outcomes. To apply to quantum computing platforms other than gmon qubits, each term of the UFO cost function can be modified to best describe the optimization target based on the platform’s underlying physics.
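A minimal sketch of how the four penalty terms of Eq. (2) combine is given here. The function name `ufo_cost` and the array layout are assumptions of this sketch: `g` holds the g-pulse per time step, `f` holds the microwave amplitudes per time step (one column per qubit, with \(f^2\) read as the sum over both qubits), and the default weights are the grid-search values reported in this paper for the two-qubit gate family.

```python
import numpy as np

def ufo_cost(infidelity, leakage, g, f, total_time,
             chi=10.0, beta=10.0, mu=0.2, kappa=0.1):
    """Eq. (2): weighted sum of infidelity, leakage, boundary and runtime penalties."""
    g = np.asarray(g, dtype=float)
    f = np.asarray(f, dtype=float)
    # Boundary penalty: g^2 + f^2 at the first and last time steps only.
    boundary = sum(g[k] ** 2 + np.sum(f[k] ** 2) for k in (0, -1))
    return chi * infidelity + beta * leakage + mu * boundary + kappa * total_time

g = np.array([0.0, 5.0, 0.0])   # g-pulse vanishing at both boundaries
f = np.zeros((3, 2))            # microwave amplitudes switched off
cost = ufo_cost(infidelity=1e-3, leakage=1e-4, g=g, f=f, total_time=50.0)
```

Because every term enters as a soft penalty, a pulse that violates the boundary condition is not rejected; it simply pays a cost proportional to μ.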

Leakage error bound

To identify different sources of leakage errors, we decompose Eq. (1) into three parts: \(\hat H_{{\mathrm{RWA}}}(t) = \hat H_0 + \hat H_1(t) + \hat H_2(t)\), where \(\hat H_0 = (\eta /2)\mathop {\sum}\limits_{j = 1}^2 {\hat n_j} (\hat n_j - 1)\) accounts for the large constant-energy gaps separating the qubit subspace from higher energy subspaces. It also determines the minimum energy gap Δ, separating the qubit subspace from the nearest higher energy subspace. Henceforth, we set the Planck constant h = 1 for the convenience of discussion, and the energy scale is measured in units of MHz. The block-diagonal Hamiltonian

$$\begin{array}{*{20}{l}} {\hat H_1(t)} \hfill & = \hfill & {\mathop {\sum}\limits_{j = 1}^2 {\delta _j} (t)\hat n_j + if_1(t)\left( {|0\rangle _1\langle 1|_1e^{ - i\varphi _1(t)} - |1\rangle _1\langle 0|_1e^{i\varphi _1(t)}} \right) \otimes {\Bbb I}_2} \hfill \\ {} \hfill & {} \hfill & { + if_2(t){\Bbb I}_1 \otimes \left( {|0\rangle _2\langle 1|_2e^{ - i\varphi _2(t)} - |1\rangle _2\langle 0|_2e^{i\varphi _2(t)}} \right)} \hfill \\ {} \hfill & {} \hfill & { + g(t)\left( {|1\rangle _1|0\rangle _2\langle 1|_2\langle 0|_1 + |0\rangle _1|1\rangle _2\langle 0|_2\langle 1|_1} \right)} \hfill \\ {} \hfill & {} \hfill & { + 2g(t)\left( {|2\rangle _1|1\rangle _2\langle 2|_2\langle 1|_1 + |1\rangle _1|2\rangle _2\langle 1|_2\langle 2|_1} \right)} \hfill \end{array}$$ (3)

accounts for the coupling within the qubit subspace Ω 0 = Span{|00〉, |10〉, |01〉, |11〉} and within the first excited energy subspace Ω 1 = Span{|20〉, |21〉, |12〉, |02〉}, and the block-off-diagonal \(\hat H_2(t) = \hat H_{{\mathrm{RWA}}}(t) - \hat H_0 - \hat H_1(t)\) accounts for the couplings between different energy subspaces. \(\hat H_2(t)\) is the culprit behind leakage errors. But, because \(\hat H_1(t)\) and \(\hat H_2(t)\) both derive from microwave pulses and the g-pulse, one cannot turn off \(\hat H_2(t)\) without turning off control over the single-qubit Pauli X and Y unitaries from \(\hat H_1(t)\) that are crucial for obtaining full controllability of the qubit system. In order to suppress and evaluate coherent leakage errors induced by \(\hat H_2(t)\), we adopt a rotated basis given by the TSWT framework, under the assumption that inter-subspace and intra-subspace couplings are much smaller than the energy gap separating different subspaces: \(|f_j(t)|\sim |\delta _j(t)|\sim |g(t)|\sim \varepsilon \ll \eta \sim {\mathrm{\Delta }}\), see Supp. B. The effective block-off-diagonal Hamiltonian \(\hat {\Bbb H}_{{\mathrm{od}}}(t)\) after the TSWT can thus be suppressed to any chosen order by applying the correct order of TSWT.

There are two independent sources of leakage errors for TSWT-based quantum control that dominate in superconducting-qubit gate controls. The first is the direct coupling leakage caused by the non-zero block-off-diagonal Hamiltonian after the second-order TSWT. The second is the leakage caused by unwanted excitations due to fast modulation of the system Hamiltonian. We derive in Supp. B the following bound for the coherent leakage errors at time T:

$$L_{{\mathrm{tot}}} = \frac{{\left\Vert {\hat {\Bbb H}_{{\mathrm{od}}}(0)} \right\Vert}}{{{\mathrm{\Delta }}(0)}} + \frac{{\left\Vert {\hat {\Bbb H}_{{\mathrm{od}}}(T)} \right\Vert}}{{{\mathrm{\Delta }}(T)}} + \mathop {\int}\limits_0^T {\frac{1}{{{\mathrm{\Delta }}^2(t)}}} \left\Vert {\frac{{d^2\hat {\Bbb H}_{{\mathrm{od}}}(t)}}{{dt^2}}} \right\Vert dt,$$ (4)

where \(\hat {\Bbb H}_{{\mathrm{od}}}(t)\) is of magnitude \(O\left( {\frac{{\varepsilon ^3}}{{\Delta ^2}}} \right)\) after the second-order TSWT.
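The bound in Eq. (4) can be evaluated numerically once \(\hat {\Bbb H}_{{\mathrm{od}}}(t)\) is available on a time grid. The sketch below is one way to do so under stated assumptions: the norm is taken to be the spectral norm (Eq. (4) does not fix this choice here), the second derivative is approximated by central finite differences, and the names `leakage_bound`, `H_od`, and `gap` are hypothetical. Computing \(\hat {\Bbb H}_{{\mathrm{od}}}(t)\) itself requires the TSWT of Supp. B and is not shown.

```python
import numpy as np

def leakage_bound(H_od, gap, dt):
    """Finite-difference estimate of Eq. (4).

    H_od : array (steps, d, d), block-off-diagonal Hamiltonian on a uniform grid
    gap  : array (steps,), energy gap Delta(t) on the same grid
    dt   : grid spacing
    """
    spectral = lambda M: np.linalg.norm(M, ord=2)   # assumed choice of norm
    edge = spectral(H_od[0]) / gap[0] + spectral(H_od[-1]) / gap[-1]
    # Central second difference on the interior grid points
    d2 = (H_od[2:] - 2.0 * H_od[1:-1] + H_od[:-2]) / dt ** 2
    integrand = np.array([spectral(M) for M in d2]) / gap[1:-1] ** 2
    return edge + np.sum(integrand) * dt

# Toy input: H_od(t) = t^2 * sigma_x on [0, 1] with a constant gap of 10
t = np.linspace(0.0, 1.0, 101)
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
L_tot = leakage_bound(t[:, None, None] ** 2 * sx, np.full(t.size, 10.0), t[1] - t[0])
```

The two boundary terms explain why the cost function rewards pulses that vanish at t = 0 and t = T, while the integral term rewards smooth, band-limited modulation.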

In addition to the coherent leakage errors bounded by (4), there also exist incoherent leakage errors due to the violation of adiabaticity from the time-dependent nature of our control quantum dynamics in the off-resonant regime. We derive a generalized adiabatic theorem to bound the non-adiabatic leakage error in Supp. B.2. We show there that such non-adiabatic leakage is not dominant in the off-resonant frequency regime, i.e., (4) accounts for the dominant leakage errors in both the resonant and off-resonant regimes.

Deep trusted-region reinforcement learning

The control space for the two-qubit quantum gate is parametrized at each time step t by a real-valued vector \(\vec u(t) = \{ f_1,f_2,\varphi _1,\varphi _2,\delta _1,\delta _2,g\}\) specifying the 7 amplitudes of the controllable system Hamiltonian. Our use of the policy NN is based on a piecewise constant (PWC) representation of the control trajectory, which contains around one thousand time steps for each gate-control sequence. Such PWC encoding was previously considered disadvantageous39 for the following reasons: (1) the lack of an analytic gradient expression can lower the accuracy of the control optimization given the same computational resources; (2) PWC control may introduce unwanted high-frequency components that cause leakage errors. We show with this work that these limitations are largely obviated in practice: (1) the inputs to experimental quantum-control DACs are also PWC signals, with the time step limited by the sample rate and the control amplitude accuracy limited by transfer function uncertainties;1 an analytic control function therefore has to be truncated during experimental implementation and suffers from discretization errors not accounted for by its original control optimization; (2) control filter design can be easily integrated into the PWC control optimization to guarantee a desired frequency bandwidth of the control-pulse sequence (see Supp. C); and (3) the PWC representation can be directly transferred to closed-loop system calibration and control optimization to interface directly with control DACs.
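Under a PWC representation, the time-ordered exponential reduces to an ordered product of short-time propagators, one per time step. The following sketch illustrates this with a generic list of Hermitian matrices; the name `pwc_propagator` is an assumption, and the Hamiltonian is taken to be expressed in angular-frequency units consistent with `dt`.

```python
import numpy as np

def pwc_propagator(hamiltonians, dt):
    """U(T) = prod_k exp(-i H_k dt), with later time steps multiplied on the left."""
    dim = hamiltonians[0].shape[0]
    U = np.eye(dim, dtype=complex)
    for H in hamiltonians:
        # exp(-i H dt) through the eigendecomposition of the Hermitian H
        w, V = np.linalg.eigh(H)
        step = (V * np.exp(-1j * w * dt)) @ V.conj().T
        U = step @ U
    return U

# Toy check: a constant (pi/2) * sigma_x drive for unit total time
sx = np.array([[0, 1], [1, 0]], dtype=complex)
U = pwc_propagator([0.5 * np.pi * sx] * 10, dt=0.1)
```

With around one thousand time steps per gate, each episode costs about one thousand small matrix exponentials, which is why a cheap fidelity measure matters during training.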

Our RL agent comprises two neural networks (NNs): one maps a given state containing the information about the simulated unitary gate \(U(t_i)\) at the current step \(t_i\) to the mean and variance of the Gaussian distribution of the proposed control actions \(\vec u\left( {t_{i + 1}} \right)\) for the next step (the policy NN); the other takes the simulated unitary gate \(U(t_i)\) as input and outputs the predicted reward associated with the current unitary (the value function NN).12 The salient difference between the on-policy RL used in this work and the off-policy RL studied in previous work15,16 is that the control trajectory (embedded in the policy NN) is represented independently from the control cost (value function NN). Off-policy RL, such as Q learning,13 on the other hand, uses a single NN to represent both the control trajectory and the associated reward.

Both the policy and value function NNs are fully connected three-layer NNs with layer dimensions 64, 32, and 32. Intuitively, the first NN, the policy NN, encodes the analytic and non-local features of control solutions. Such encoding, which is traditionally captured by a carefully chosen analytic function,40 is now represented by a model-independent NN without any prior knowledge of the target cost function. The value function NN encodes the projected future interactions with a stochastic environment and the associated control cost, which is used to adjust the learning rate of the policy NN’s gradient descent.
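To make the policy NN concrete, the sketch below builds a fully connected network with hidden widths 64, 32, and 32 and samples a 7-component control action from the resulting Gaussian. Several details are assumptions of this sketch rather than statements from the text: the input is the flattened real and imaginary parts of the 4 × 4 unitary, the activation is tanh, the output layer emits a mean and a log standard deviation per control, and the weights are random (untrained).

```python
import numpy as np

def init_policy(rng, in_dim=32, hidden=(64, 32, 32), n_controls=7):
    """Random weights for an MLP: in_dim -> 64 -> 32 -> 32 -> 2 * n_controls."""
    sizes = (in_dim, *hidden, 2 * n_controls)
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def sample_action(params, U, rng):
    """Sample controls (f1, f2, phi1, phi2, delta1, delta2, g) given the unitary U."""
    x = np.concatenate([U.real.ravel(), U.imag.ravel()])
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)                 # hidden layers (assumed activation)
    W, b = params[-1]
    mean, log_std = np.split(x @ W + b, 2)     # Gaussian policy head
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
params = init_policy(rng)
u_next = sample_action(params, np.eye(4, dtype=complex), rng)
```

The value function NN would share the same input and hidden structure but end in a single scalar output.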

Both of the RL agent’s NNs interact with a training environment that evaluates the quantum dynamics under the RL agent’s proposed control action and returns the updated unitary gate and the corresponding control cost (as reward); see Fig. 1. Optimization consists of many episodes, each of which contains all the time steps of a complete quantum-control trajectory. The duration of such a sampled control trajectory is determined by the minimum of a predefined runtime upper bound and the time it takes to meet a termination condition. In our case, the termination condition is measured by a satisfactory value of the UFO cost function. After sampling a batch of 20,000 episodes, the policy NN is updated to maximize the expected discounted future reward based on the proposed policy variation within the trusted region, and the value function NN is updated to fit the expected discounted future reward based on the newly added samples. A detailed algorithm is presented in refs. 12,41. We found that the control robustness against control errors is significantly improved by simulating experimentally relevant Gaussian fluctuations in the control amplitudes using a stochastic RL training environment. Our discovery differs from recent results in the sampling-based method for obtaining control robustness in that we specifically include optimization over leakage errors in the presence of control fluctuations.

Fig. 1 Overview of the RL implementation: at the iteration time step n + 1, the policy NN proposes a control action in the form of the system Hamiltonian \(\hat H_{n + 1}\); the training environment takes the proposed action and evaluates the Schrödinger equation under a noisy implementation \(\hat H_{n + 1} + \delta \hat H_{n + 1}\) for time duration \(\Delta t\) to obtain a new unitary gate \(U_{n + 1}\) and calculates the associated cost function, both of which are fed into the RL agent. The policy NN and value NN of the RL agent are updated jointly based on the trajectory of the simulated unitary gate, control action, and associated control cost.

We verified the quality and robustness of our control scheme by evaluating the average fidelity of the noise-optimized control solution under different control-noise model parameters in the next section. There, we compare the performance of our RL-optimized control solution with that of the optimal gate synthesis. The latter provides the minimum number of required gates from a finite universal gate set to realize the same unitary transformation. Our RL control solutions achieve: (1) up to a one-order-of-magnitude improvement in gate time over the optimal gate synthesis approach based on the best known experimental gate parameters in superconducting qubits; (2) a two-orders-of-magnitude reduction in fidelity standard deviation over solutions from both the noise-free RL counterpart and a baseline stochastic gradient descent (SGD) method; and (3) around two-orders-of-magnitude reduction in average infidelity over control solutions from the SGD method.

Two-qubit gate-control optimization

We now apply the UFO framework to find fast and high-fidelity two-qubit gate controls that are robust against control errors. We define gate-control robustness under a given control-noise model as a bounded deviation of the average gate fidelity \(\bar F({\cal E},U_{{\mathrm{target}}})\) from an ideal average gate fidelity F ideal :

$$\left| {\bar F({\cal E},U_{{\mathrm{target}}}) - F_{{\mathrm{ideal}}}} \right| < \varepsilon _0,{\kern 1pt} {\mathrm{for}}\,\varepsilon _0 > 0,$$ (5)

where the average gate fidelity

$$\bar F({\cal E},U_{{\mathrm{target}}}) = {\int} d \psi \left\langle \psi \right|U_{{\mathrm{target}}}^\dagger {\cal E}\left( {\left| \psi \right\rangle \left\langle \psi \right|} \right)U_{{\mathrm{target}}}\left| \psi \right\rangle$$ (6)

embodies the quality of the gate-control quantum channel by averaging over the whole state space under a uniform Haar measure,42 with the trace-preserving quantum operation \({\cal E}\) accounting for the noisy implementation of a target unitary U target ; see Supp. D for detail. The average gate infidelity is defined accordingly as \(1 - \bar F({\cal E},U)\).
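For a single noise realization, the channel \({\cal E}\) is itself unitary, and the Haar integral has a well-known closed form: \(\bar F = (d + |{\mathrm{Tr}}(U_{{\mathrm{target}}}^\dagger U)|^2)/(d(d + 1))\), equivalently \(\bar F = (dF + 1)/(d + 1)\) with F the gate fidelity used in the cost function. The sketch below evaluates this closed form; the function name is an assumption, and averaging over noise realizations is left to the caller.

```python
import numpy as np

def average_gate_fidelity(U, U_target):
    """Haar-averaged fidelity between a unitary channel U and U_target."""
    d = U_target.shape[0]
    overlap = abs(np.trace(U_target.conj().T @ U)) ** 2
    return (d + overlap) / (d * (d + 1))

CZ = np.diag([1, 1, 1, -1]).astype(complex)
print(average_gate_fidelity(CZ, CZ))                            # perfect gate
print(average_gate_fidelity(np.eye(4, dtype=complex), CZ))      # identity instead of CZ
```

This closed form avoids sampling states altogether, so the cost of a robustness check is dominated by propagating the noisy control trajectories.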

Such a robustness criterion can be validated for a given control scheme using a number of computational steps that is linear in the total degrees of freedom of control parameters. However, it differs from the canonical definition in optimal control theory,10,11 where the number of computational steps for the analysis of robustness using control Hessians scales cubically with the control parameters’ total degrees of freedom. For special cases, such as closed-system single-qubit control, there exist analytic expressions for the control Hessian.10,11 But in the current work we choose a more practical definition of robustness that is scalable to multi-qubit control problems.

Traditional quantum-control trajectory optimization depends on complete knowledge of the underlying physical model. In contrast, the success and robustness of RL persist with incomplete and potentially flawed modeling. It is often the case in experiments that the exact control error model is unknown. Given partial information about the control error model, can we leverage RL optimization to find robust control solutions against not just one but a set of control error models? In our case, we deployed RL agents, trained by trust-region policy optimization12 in the OpenAI platform,43 to find near-optimal control solutions to the UFO cost function described in Eq. (2). We incorporated a pertinent control-noise model for gmon superconducting-qubit Hamiltonian36 into a stochastic training environment. At each time step, amplitude fluctuations sampled from a zero-mean Gaussian distribution with 1 MHz standard deviation, which amounts to around 5% control parameter uncertainty, were added to Hamiltonian parameters that are known to be prone to fluctuation: qubit anharmonicity, qubit detuning amplitudes, microwave control amplitudes, and qubit g-pulse amplitude. See Supp. A for the details. Harnessing the sample-noise resilience of RL optimization, we expected the optimized control to be robust against a family of control-noise models despite being trained under a single model. This was indeed proven to be the case as evidenced by our numerical simulations, see Fig. 4.

In gmon superconducting qubits, the energy gap that separates the qubit subspace from the nearest higher energy subspace is Δ(s) ≈ 200 MHz. We apply control frequency filters (Supp. C) to piecewise constant analog-control signals such that the bandwidth of the proposed Hamiltonian modulation is limited to 10 MHz. Given that our off-diagonal Hamiltonian after the second-order TSWT is of order 100 kHz (Supp. B.1), the integral term of the leakage bound in Eq. (4), written in the rescaled time s = t/T as \({\int}_0^1 \frac{1}{{{\mathrm{\Delta }}^2(s)}}\frac{1}{T}\left\Vert {\frac{{d^2\hat {\Bbb H}_{{\mathrm{od}}}(s)}}{{ds^2}}} \right\Vert ds\), is of order \(10^{-4}\), which is close to the fault-tolerant threshold for leakage error of the near-term surface code.44 Although the gmon Hamiltonian is fully controllable under our UFO paradigm, we targeted a family of two-qubit gates parametrized by

$${\cal N}(\alpha ,\alpha ,\gamma ) = {\mathrm{exp}}[i(\alpha \sigma _1^x\sigma _2^x + \alpha \sigma _1^y\sigma _2^y + \gamma \sigma _1^z\sigma _2^z)],$$ (8)

where \(\sigma _j^k\) for k ∈ {x, y, z} is the jth qubit’s Pauli matrix. Optimal gate synthesis45 provides the optimal decomposition of such a unitary transformation into a minimum number of arbitrary single-qubit rotations and CZ gates, and yields a depth-seven circuit containing three two-qubit gates and five single-qubit gates; see Fig. 2. This gate family includes the SWAP, ISWAP, CNOT, and CZ gates, the fermionic SWAP gate, and the Givens rotation, up to single-qubit rotations. Both the fermionic SWAP gate and Givens rotations are used for realizing Jordan-Wigner transformations in fermionic Hamiltonian simulation.46,47,48 Identifying continuous controls that outperform their optimal gate synthesis counterparts for this family of gates thus has far-reaching applications across quantum chemistry and quantum simulation. The UFO cost function’s parameters were optimized through a grid search, yielding χ = β = 10, μ = 0.2, κ = 0.1, values that are applicable to all target gates.
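The target family \({\cal N}(\alpha ,\alpha ,\gamma )\) is straightforward to construct numerically, which also makes it easy to confirm two members named above. In the sketch below (function name `gate_family` is an assumption), α = π/4, γ = 0 reproduces the ISWAP matrix, and α = γ = π/4 reproduces SWAP up to a global phase.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def gate_family(alpha, gamma):
    """N(alpha, alpha, gamma) = exp[i(alpha XX + alpha YY + gamma ZZ)]."""
    G = alpha * (np.kron(X, X) + np.kron(Y, Y)) + gamma * np.kron(Z, Z)
    w, V = np.linalg.eigh(G)                     # G is Hermitian
    return (V * np.exp(1j * w)) @ V.conj().T

ISWAP = np.array([[1, 0, 0, 0], [0, 0, 1j, 0], [0, 1j, 0, 0], [0, 0, 0, 1]])
SWAP = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=complex)

is_iswap = np.allclose(gate_family(np.pi / 4, 0.0), ISWAP)
swap_overlap = abs(np.trace(SWAP.conj().T @ gate_family(np.pi / 4, np.pi / 4))) / 4
```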

Fig. 2 Optimal gate synthesis for realizing unitary gate \({\cal{N}} (\alpha, \alpha, \gamma)\)

We compared the overall runtime of our noise-optimized control obtained by the RL agent with its optimal gate synthesis counterpart. Based on state-of-the-art experimental implementations, we set the gate time for each single-qubit gate to 20 ns and CNOT to 45 ns. The depth-seven optimal gate synthesis circuit in Fig. 2, with three two-qubit layers and four single-qubit layers, thus has a runtime of 3 × 45 ns + 4 × 20 ns = 215 ns.

The gate times of our noise-optimized control schemes for three different values of γ are shown in Fig. 3. There, different data points for the same γ are obtained by the same RL agent with an adaptive step size in α to guarantee a constant upper bound on the total optimization time: target gate α will be increased by one step α = α + 0.1, either when the agent obtains a control solution with a low enough overall cost, or when the optimization time for a given α exceeds a predefined value. We discovered that it takes significantly less time for an RL agent to learn a new target unitary gate, based on the successful learning of a nearby target, than to learn a new target gate afresh, which provides heuristic evidence for the transfer learning facilitated by RL using a deep NN. The use of an adaptive step size can be replaced by parallel RL agents, each dedicated to a fixed target unitary gate, but that was not the focus of the current study.

Fig. 3 Gate run time of the two-qubit gate family \({\cal{N}} (\alpha, \alpha, \gamma)\) for \(\gamma = \pi/2\) (blue curve), \(\gamma = \pi/6\) (green curve) and \(\gamma = \pi/3\) (yellow curve). The standard optimal gate synthesis run time for this gate family is around 200 ns, marked by the dashed red line. Total leakage errors and gate infidelity are upper bounded by \(O(10^{-4})\) and \(O(10^{-3})\), respectively, for all cases.

Figure 3 shows that an RL optimization provides a one-order-of-magnitude runtime improvement for the two-qubit gate family parametrized by \({\cal N}(\alpha ,\alpha ,\pi /2)\) with α ∈ [1.2, 1.7] over the optimal gate synthesis. Such significant improvement originates from the decomposition of this two-qubit unitary, right at the center of this region with α = γ = π/2, into a direct product of single-qubit unitaries. This demonstrates the hardware efficiency of our control optimization in finding the underlying unitary relations to automatically reduce gate time. In particular, the target unitary gate can be rewritten, up to a global phase, as \({\cal N}(\alpha ,\alpha ,\pi /2) = - {\mathrm{exp}}\left[ {i(\alpha \sigma _1^x\sigma _2^x + \alpha \sigma _1^y\sigma _2^y)} \right]{\mathrm{exp}}\left[ { - i\frac{\pi }{2}\sigma _1^z} \right]{\mathrm{exp}}\left[ { - i\frac{\pi }{2}\sigma _2^z} \right]\), whose two-qubit entangling part is directly realizable through a time evolution under the gmon Hamiltonian defined in Eq. (1) without detuning or microwave controls: \(\delta _j(t) = f_j(t) = 0\) with j ∈ {1, 2}. Our RL control optimization is thus able to detect such an inherent regularity, which relates a given system Hamiltonian to the family of target unitary gates that are efficiently implementable. Isolated peaks in the gate time plot in Fig. 3 are potentially due to control singularities, which suggests the need for further studies into the hardness of the analog-control landscape in the presence of leakage and control errors.
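The factorization of \({\cal N}(\alpha ,\alpha ,\pi /2)\) into an entangling part and single-qubit z rotations can be checked numerically. The sketch below compares the two sides through their normalized trace overlap, whose modulus equals 1 exactly when the two unitaries agree up to a global phase; the helper names `expi` and `phase_overlap` are assumptions of this sketch.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def expi(G):
    """exp(iG) for a Hermitian matrix G."""
    w, V = np.linalg.eigh(G)
    return (V * np.exp(1j * w)) @ V.conj().T

def phase_overlap(alpha):
    """Tr(rhs^dagger lhs) / 4 for the two sides of the factorization."""
    xy = alpha * (np.kron(X, X) + np.kron(Y, Y))
    lhs = expi(xy + 0.5 * np.pi * np.kron(Z, Z))
    rhs = (-expi(xy)
           @ expi(-0.5 * np.pi * np.kron(Z, I2))
           @ expi(-0.5 * np.pi * np.kron(I2, Z)))
    return np.trace(rhs.conj().T @ lhs) / 4

overlap = phase_overlap(1.45)   # any alpha works; 1.45 lies in [1.2, 1.7]
```

Since the gate infidelity used in the cost function is insensitive to a global phase, the two forms are equivalent targets for the optimizer.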

We verified the robustness of the noise-optimized control solution \(\hat H_{{\mathrm{RWA}}}(t)\) from RL by evaluating its average fidelity \(\bar F({\cal E},U_{{\mathrm{target}}})\) and the standard deviation of the control gate fidelities \(F[U(\hat H_{{\mathrm{RWA}}}(t))]\) under different control-noise instances \(\delta \hat H_{{\mathrm{RWA}}}(t)\) sampled from the same Gaussian distribution N(0, σ noise ):

$$\sigma _{{\mathrm{fidelity}}} = \sqrt {{\Bbb E}_{\delta \hat H_{{\mathrm{RWA}}}(t)\sim N(0,\sigma _{{\mathrm{noise}}})}\left( {F[U(\hat H_{{\mathrm{RWA}}}(t) + \delta \hat H_{{\mathrm{RWA}}}(t))] - F_{{\mathrm{ave}}}} \right)^2} ,$$ (9)

$$F_{{\mathrm{ave}}} = {\Bbb E}_{\delta \hat H_{{\mathrm{RWA}}}(t)\sim N(0,\sigma _{{\mathrm{noise}}})}F[U(\hat H_{{\mathrm{RWA}}}(t) + \delta \hat H_{{\mathrm{RWA}}}(t))].$$ (10)

We consider a Gaussian family of stochastic control error models: the amplitude fluctuations of control parameters are described by zero-mean Gaussian distributions with a standard deviation \(\sigma _{{\mathrm{noise}}}\) ranging from 0.1 to 3.5 MHz. The gate-control performance under the noise model with 1 MHz standard deviation is a reasonable indicator for experimental implementations. Nevertheless, the exact value of the standard deviation is hard to determine and can drift over time. The blue curve in Fig. 4 represents the average fidelity of the noise-optimized control by RL, which stays within the range of [98%, 99.5%] under the given noise model parameter range, satisfying our control robustness definition with \(\varepsilon _0 = 0.007\) at \(\sigma _{{\mathrm{noise}}} = 1\) MHz.
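Equations (9) and (10) amount to a Monte Carlo estimate over noise realizations. The sketch below shows one way to compute both statistics under stated assumptions: `build_h` is a hypothetical user-supplied function mapping a control vector to a Hamiltonian matrix, noise is added independently to every control at every time step, and the default of 60 samples mirrors the number of trajectories per data point quoted for Fig. 4. The usage example is a single-qubit toy problem, not the gmon system.

```python
import numpy as np

def propagate(hams, dt):
    """Ordered product of exp(-i H_k dt) for piecewise constant H_k."""
    U = np.eye(hams[0].shape[0], dtype=complex)
    for H in hams:
        w, V = np.linalg.eigh(H)
        U = (V * np.exp(-1j * w * dt)) @ V.conj().T @ U
    return U

def noisy_fidelity_stats(controls, build_h, U_target, dt, sigma, rng, n_samples=60):
    """Return (F_ave, sigma_fidelity) of Eqs. (10) and (9) by sampling."""
    d = U_target.shape[0]
    fids = []
    for _ in range(n_samples):
        noisy = controls + rng.normal(0.0, sigma, controls.shape)
        U = propagate([build_h(u) for u in noisy], dt)
        fids.append(abs(np.trace(U.conj().T @ U_target)) ** 2 / d ** 2)
    fids = np.array(fids)
    return fids.mean(), fids.std()

# Toy usage: a single-qubit X gate driven by one control amplitude
sx = np.array([[0, 1], [1, 0]], dtype=complex)
controls = np.full((10, 1), 0.5 * np.pi)
rng = np.random.default_rng(1)
f_ave, f_std = noisy_fidelity_stats(controls, lambda u: u[0] * sx, sx, 0.1, 0.5, rng)
```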

Fig. 4 Average fidelities of the optimized quantum control schemes vs the Gaussian control noise variance for the gate \(\cal{N}\)(2.2, 2.2, π/2). The blue line represents the performance of the noise-optimized control obtained by an RL agent trained under a noisy environment. The green line marked by diamond shapes represents the performance of the control obtained by an RL agent with a noise-free environment. The red dashed line represents the performance of the control trajectory obtained by SGD. Subplot a: zoomed in comparison of the average fidelities of the noise-optimized and noise-free RL control solutions under different values of Gaussian control noise variance. Subplot b: comparison of fidelity variances of three different control schemes under different control noise variances σ noise , where each data point is taken from 60 different control trajectories with control amplitude error at every time step sampled from the Gaussian distribution N(0, σ noise ).

In Fig. 4, we compare noise-optimized control with a noise-free control solution obtained by an RL agent without a stochastic environment, represented by the green curve marked by diamonds, and with that obtained by a baseline SGD technique using the Adam optimizer,49 represented by the red dashed lines. For the gradient calculation in each SGD iteration, we utilize the averaged gradient over a minibatch of 10 control trajectories. Each of these trajectories is obtained by adding a perturbation, sampled from a zero-mean Gaussian distribution with standard deviation 1 MHz, to the original Hamiltonian control variables at the corresponding time step. We provide our SGD solver the same amount of optimization wall time as the corresponding RL solvers, which amounts to around 2000 random restarts per target gate.

The noise-optimized control solution manifests up to a one-order-of-magnitude improvement in average gate infidelity over the noise-free control solution using RL, and around two-orders-of-magnitude improvement in average gate infidelity over SGD baseline solutions. Moreover, the sampled fidelity standard deviation of the noise-optimized RL solver is consistently two-orders-of-magnitude lower than that of the two other methods throughout the tested noise model parameter range. This result validates the improved stability of our control solution obtained by a policy-gradient trained RL agent against experimentally relevant Gaussian control-noise models.

The major difference between the baseline SGD approach and our on-policy RL is the model dependence: SGD relies on the calculation of the gradient of the control cost function, while the on-policy RL is model-independent and does not directly utilize the physical models to calculate the gradients of its two neural networks. Instead, on-policy RL only requires the calculation of the control cost function at each time step. Because the control cost function is easier to compute than the gradient relevant to SGD, on-policy RL possesses more potential than SGD for scaling up to many qubits. Our work demonstrates the advantageous performance of the RL method over SGD in two-qubit gate-control optimization in the face of realistic control errors, including leakage and stochastic control fluctuations. However, it remains an open question whether this advantage persists when gradient estimation is computationally inexpensive, such that SGD or other gradient-based optimization also applies.