The network architecture

The smallest building block of a quantum neural network is the quantum perceptron, the quantum analogue of perceptrons used in classical machine learning. In our proposal, a quantum perceptron is an arbitrary unitary operator with m input qubits and n output qubits, i.e., an arbitrary unitary applied to the m + n input and output qubits, which depends on \({({2}^{m+n})}^{2}-1\) parameters. The input qubits are initialised in a possibly unknown mixed state \({\rho }^{\text{in}}\) and the output qubits in a fiducial product state \({\left|0\cdots 0\right\rangle }_{\text{out}}\) (note that this scheme can easily be extended to qudits). For simplicity, in the following we focus on the case where our perceptrons act on m input qubits and one output qubit, i.e., they are (m + 1)-qubit unitaries.

Now that we have a quantum neuron, we can describe our quantum neural network architecture. Motivated by analogy with the classical case and consequent operational considerations, we propose that a QNN is a quantum circuit of quantum perceptrons organised into L hidden layers of qubits, acting on an initial state \({\rho }^{\text{in}}\) of the input qubits, and producing an, in general, mixed state \({\rho }^{\text{out}}\) for the output qubits according to

$${\rho }^{{\rm{out}}}\equiv {{\rm{tr}}}_{{\rm{in}},{\rm{hid}}}\left({\mathcal{U}}({\rho }^{{\rm{in}}}\otimes {\left|0\cdots 0\right\rangle }_{{\rm{hid}},{\rm{out}}}\left\langle 0\cdots 0\right|){{\mathcal{U}}}^{\dagger }\right),$$ (1)

where \({\mathcal{U}}\equiv {U}^{{\rm{out}}}{U}^{L}{U}^{L-1}\cdots {U}^{1}\) is the QNN quantum circuit and \({U}^{l}\) are the layer unitaries, each comprised of a product of quantum perceptrons acting on the qubits in layers l − 1 and l. It is important to note that, because our perceptrons are arbitrary unitary operators, they do not, in general, commute, so that the order of operations is significant. See Fig. 1 for an illustration.

Fig. 1: A general quantum feedforward neural network. A quantum neural network has an input, output, and L hidden layers. We apply the perceptron unitaries layerwise from top to bottom (indicated with colours for the first layer): first the violet unitary is applied, followed by the orange one, and finally the yellow one.

It is a direct consequence of the quantum-circuit structure of our QNNs that they can carry out universal quantum computation, even for two-input one-output qubit perceptrons. More remarkable, however, is the observation that a QNN comprised of quantum perceptrons that act on 4-level qudits and commute within each layer is still capable of carrying out universal quantum computation (see Supplementary Note 1 and Supplementary Fig. 1 for details). Although commuting qudit perceptrons suffice, we have found it convenient in practice to exploit noncommuting perceptrons acting on qubits. In fact, the most general form of our quantum perceptrons can implement any quantum channel on the input qudits (see Supplementary Fig. 2), so one could not hope for any more general notion of a quantum perceptron.

A crucial property of our QNN definition is that the network output may be expressed as the composition of a sequence of completely positive layer-to-layer transition maps \({{\mathcal{E}}}^{l}\):

$${\rho }^{\text{out}}={{\mathcal{E}}}^{{\rm{out}}}\left({{\mathcal{E}}}^{L}\left(\ldots {{\mathcal{E}}}^{2}\left({{\mathcal{E}}}^{1}\left({\rho }^{\text{in}}\right)\right)\ldots \right)\right),$$ (2)

where \({{\mathcal{E}}}^{l}\left({X}^{l-1}\right)\equiv {\text{tr}}_{l-1}\left({\prod }_{j={m}_{l}}^{1}{U}_{j}^{l}\left({X}^{l-1}\otimes {|0\cdots 0\rangle }_{l}\langle 0\cdots 0| \right){\prod }_{j=1}^{{m}_{l}}{{U}_{j}^{l}}^{\dagger }\right)\), \({U}_{j}^{l}\) is the jth perceptron acting on layers l − 1 and l, and \({m}_{l}\) is the total number of perceptrons acting on layers l − 1 and l. This characterisation of the output of a QNN highlights a key structural characteristic: information propagates from input to output and hence naturally implements a quantum feedforward neural network. This key result is the fundamental basis for our quantum analogue of the backpropagation algorithm.
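The layer-to-layer map can be simulated directly for small widths. The following is a minimal NumPy sketch, not the authors' implementation: the helper names `random_unitary`, `embed`, and `layer_channel` are hypothetical, the perceptrons are drawn at random purely for illustration, and each perceptron is assumed to act on all qubits of layer l − 1 plus one qubit of layer l.

```python
import numpy as np

def random_unitary(dim, rng):
    """Draw a Haar-random unitary via QR of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    phases = np.diag(r) / np.abs(np.diag(r))
    return q * phases  # rescale columns to fix the phase ambiguity of QR

def embed(u, targets, n_qubits):
    """Lift a unitary on the qubits listed in `targets` to all n_qubits."""
    rest = [q for q in range(n_qubits) if q not in targets]
    full = np.kron(u, np.eye(2 ** len(rest)))  # acts in (targets, rest) ordering
    inv = np.argsort(list(targets) + rest)     # position of each qubit in that ordering
    axes = list(inv) + [n_qubits + a for a in inv]
    full = full.reshape([2] * (2 * n_qubits)).transpose(axes)
    return full.reshape(2 ** n_qubits, 2 ** n_qubits)

def layer_channel(rho_prev, perceptrons, m_prev, m_cur):
    """Attach |0...0> on layer l, apply U_1, ..., U_{m_l}, trace out layer l-1."""
    n = m_prev + m_cur
    zeros = np.zeros((2 ** m_cur, 2 ** m_cur), dtype=complex)
    zeros[0, 0] = 1.0
    state = np.kron(rho_prev, zeros)
    for j, u in enumerate(perceptrons):  # U_1 acts first, U_{m_l} last
        targets = list(range(m_prev)) + [m_prev + j]
        big_u = embed(u, targets, n)
        state = big_u @ state @ big_u.conj().T
    state = state.reshape(2 ** m_prev, 2 ** m_cur, 2 ** m_prev, 2 ** m_cur)
    return np.trace(state, axis1=0, axis2=2)  # partial trace over layer l-1

rng = np.random.default_rng(0)
m_prev, m_cur = 2, 3
psi = rng.normal(size=2 ** m_prev) + 1j * rng.normal(size=2 ** m_prev)
psi /= np.linalg.norm(psi)
rho_in = np.outer(psi, psi.conj())
perceptrons = [random_unitary(2 ** (m_prev + 1), rng) for _ in range(m_cur)]
rho_next = layer_channel(rho_in, perceptrons, m_prev, m_cur)
print(rho_next.shape, np.trace(rho_next).real)
```

Composing `layer_channel` once per layer reproduces Eq. (2); the largest matrix ever stored has dimension \({2}^{{m}_{l-1}+{m}_{l}}\), i.e., it depends on two adjacent layers only.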

As an aside, we can justify our choice of quantum perceptron for our QNNs by contrasting it with a recent notion of a quantum perceptron as a controlled unitary36,44, i.e., \(U={\sum }_{\alpha }\left|\alpha \right\rangle \left\langle \alpha \right|\otimes U(\alpha )\), where \(\left|\alpha \right\rangle\) is some basis for the input space and U(α) are parametrised unitaries. Substituting this definition into Eq. (2) implies that the output state is the result of a measure-and-prepare, or cq, channel. That is, \({\rho }^{\text{out}}={\sum }_{\alpha }\langle \alpha | {\rho }^{\text{in}}| \alpha \rangle U(\alpha )\left|0\right\rangle \left\langle 0\right|U{(\alpha )}^{\dagger }\). Such channels have zero quantum channel capacity and cannot carry out general quantum computation.
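The measure-and-prepare form can be checked in one step for a single controlled-unitary perceptron, assuming that \(\left|\alpha \right\rangle\) is an orthonormal basis. Conjugating the initial state gives

$$U\left({\rho }^{\text{in}}\otimes \left|0\right\rangle \left\langle 0\right|\right){U}^{\dagger }=\sum _{\alpha ,\beta }\left|\alpha \right\rangle \langle \alpha | {\rho }^{\text{in}}| \beta \rangle \left\langle \beta \right|\otimes U(\alpha )\left|0\right\rangle \left\langle 0\right|U{(\beta )}^{\dagger },$$

and tracing out the input register, using \(\text{tr}\left(\left|\alpha \right\rangle \left\langle \beta \right|\right)={\delta }_{\alpha \beta }\), removes every term with α ≠ β. Only the diagonal elements \(\langle \alpha | {\rho }^{\text{in}}| \alpha \rangle\) survive, so all coherences of the input in the basis \(\left|\alpha \right\rangle\) are lost.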

Box 1 Training algorithm

The training algorithm

Now that we have an architecture for our QNN we can specify the learning task. Here, it is important to be clear about what part of the classical scenario we quantize. One possibility is to replace each classical sample of an unknown underlying probability distribution by a different quantum state. Hence, in the quantum setting, the underlying probability distribution will then be a distribution over quantum states. The second possibility is to identify the distribution itself with a quantum state, which we assume in this work, in which case it is justified to say that N samples correspond to N identical quantum states. We focus on the scenario where we have repeatable access to training data in the form of pairs \(\left(\left|{\phi }_{x}^{\,\text{in}\,}\right\rangle ,\left|{\phi }_{x}^{\,\text{out}\,}\right\rangle \right)\), x = 1, 2, …, N, of possibly unknown quantum states. (It is crucial that we can request multiple copies of a training pair \(\left(\left|{\phi }_{x}^{\,\text{in}\,}\right\rangle ,\left|{\phi }_{x}^{\,\text{out}\,}\right\rangle \right)\) for a specified x in order to overcome quantum projection noise in evaluating the derivative of the cost function.) Furthermore, the number of copies per training round needed grows quickly with the number of neurons (linearly with the number of network parameters), i.e., as \({n}_{\text{proj}}\times {n}_{\text{params}}\), where \({n}_{\text{proj}}\) is the factor coming from repetition of measurements to reduce projection noise, and \({n}_{\text{params}}\) is the total number of parameters in the network given by \({\sum }_{l=1}^{L+1}({4}^{({m}_{l-1}+1)}-1)\times {m}_{l}\), where \({m}_{l}\) is the number of perceptrons acting on layers l − 1 and l, and the −1 term appears because the overall phase of the unitaries is unimportant. See Supplementary Note 5 for more details and a comparison to state tomography. This means that in the near term, for large networks, only sparsely connected networks may be practical for experimental purposes.
An exception would be if the problem being considered is such that the training data is easy to produce, e.g., if the output states are produced simply by allowing the input states to thermalize through interaction with the environment. For concreteness, from now on we focus on the restricted case where \(\left|{\phi }_{x}^{\,\text{out}\,}\right\rangle =V\left|{\phi }_{x}^{\,\text{in}\,}\right\rangle\), where V is some unknown unitary operation. This scenario is typical when one has access to an untrusted or uncharacterised device which performs an unknown quantum information processing task and one is able to repeatably initialise and apply the device to arbitrary initial states.
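To get a feel for how quickly the copy requirement grows, the parameter-count formula above can be evaluated for concrete layer widths. This is a small illustrative sketch: the function names are hypothetical, `widths` lists the number of qubits per layer from input to output, and a fully connected network with one perceptron per qubit in layer l is assumed. The value of `n_proj` is a free input here; the source does not fix it.

```python
def count_parameters(widths):
    """Total parameters for widths = [m_0, m_1, ..., m_{L+1}] (input first)."""
    total = 0
    for m_prev, m_cur in zip(widths[:-1], widths[1:]):
        # each perceptron is an (m_prev + 1)-qubit unitary, global phase dropped
        per_perceptron = 4 ** (m_prev + 1) - 1
        total += per_perceptron * m_cur  # assumed: one perceptron per qubit in layer l
    return total

def copies_per_round(widths, n_proj):
    """Copies of each training pair needed per round: n_proj * n_params."""
    return n_proj * count_parameters(widths)

print(count_parameters([3, 3, 3]))  # 1530
print(count_parameters([2, 3, 2]))  # 699
```

Even these small networks need on the order of a thousand parameters, which illustrates why sparse connectivity matters for near-term experiments.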

To evaluate the performance of our QNN in learning the training data, i.e., how close is the network output \({\rho }_{x}^{\,\text{out}\,}\) for the input \(\left|{\phi }_{x}^{\,\text{in}\,}\right\rangle\) to the correct output \(\left|{\phi }_{x}^{\,\text{out}\,}\right\rangle\), we need a cost function. Operationally, there is an essentially unique measure of closeness for (pure) quantum states, namely the fidelity, and it is for this reason that we define our cost function to be the fidelity between the QNN output and the desired output averaged over the training data:

$$C=\frac{1}{N}\sum _{x=1}^{N}\langle {\phi }_{x}^{\,\text{out}}| {\rho }_{x}^{\text{out}}| {\phi }_{x}^{\text{out}\,}\rangle .$$ (3)

Note that the cost function is a direct generalisation of the risk function considered in training classical deep networks and we can efficiently simulate it. Also note that it takes a slightly more complicated form when the training data output states are not pure (in that case, we simply use the fidelity for mixed states: \(F(\rho ,\sigma ):={\left[{\rm{tr}}\sqrt{{\rho }^{1/2}\sigma {\rho }^{1/2}}\right]}^{2}\)), which may occur if we were to train our network to learn a quantum channel.
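Both forms of the fidelity are straightforward to evaluate numerically for small systems. The sketch below is illustrative only; the function names are hypothetical, and the matrix square root is taken through an eigendecomposition, which assumes the inputs are Hermitian and positive semidefinite.

```python
import numpy as np

def psd_sqrt(a):
    """Square root of a Hermitian positive semidefinite matrix."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # remove tiny negative eigenvalues from round-off
    return (v * np.sqrt(w)) @ v.conj().T

def fidelity_pure(phi, rho):
    """<phi| rho |phi> for a pure target state phi."""
    return float(np.real(np.vdot(phi, rho @ phi)))

def fidelity_mixed(rho, sigma):
    """F(rho, sigma) = (tr sqrt(rho^{1/2} sigma rho^{1/2}))^2."""
    s = psd_sqrt(rho)
    return float(np.real(np.trace(psd_sqrt(s @ sigma @ s))) ** 2)

def cost(targets, outputs):
    """Eq. (3): fidelity averaged over the training data."""
    return float(np.mean([fidelity_pure(p, r) for p, r in zip(targets, outputs)]))

ket0 = np.array([1.0, 0.0], dtype=complex)
ket1 = np.array([0.0, 1.0], dtype=complex)
rho_out = np.diag([0.75, 0.25]).astype(complex)  # toy network output
c = cost([ket0, ket1], [rho_out, rho_out])
f_mixed = fidelity_mixed(np.outer(ket0, ket0.conj()), rho_out)
print(c, f_mixed)
```

For a pure first argument the mixed-state formula reduces to the overlap used in Eq. (3), which the last line checks on a toy example.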

The cost function varies between 0 (worst) and 1 (best). We train the QNN by optimising the cost function C. This, as in the classical case, proceeds via update of the QNN parameters: at each training step, we update the perceptron unitaries according to \(U \rightarrow e^{i \epsilon K}U\), where K is the matrix that includes all parameters of the corresponding perceptron unitary and \(\epsilon\) is the chosen step size. The matrices K are chosen so that the cost function increases most rapidly: the change in C is given by

$$\Delta C=\frac{\epsilon }{N}\sum _{x=1}^{N}\sum _{l=1}^{L+1}\,\text{tr}\,\left({\sigma }_{x}^{l}\Delta {{\mathcal{E}}}^{l}\left({\rho }_{x}^{l-1}\right)\right),$$ (4)

where L + 1 = out, \({\rho }_{x}^{l}={{\mathcal{E}}}^{l}\left(\cdots {{\mathcal{E}}}^{2}\left({{\mathcal{E}}}^{1}\left({\rho }_{x}^{\,\text{in}\,}\right)\right)\ldots \right)\), \({\sigma }_{x}^{l}={{\mathcal{F}}}^{l+1}\left(\cdots {{\mathcal{F}}}^{L}\left({{\mathcal{F}}}^{{\rm{out}}}\left(\left|{\phi }_{x}^{{\rm{out}}}\right\rangle \left\langle {\phi }_{x}^{{\rm{out}}}\right|\right)\right)\cdots \ \right)\), and \({\mathcal{F}}(X)\equiv {\sum }_{\alpha }{A}_{\alpha }^{\dagger }X{A}_{\alpha }\) is the adjoint channel for the CP map \({\mathcal{E}}(X)={\sum }_{\alpha }{A}_{\alpha }X{A}_{\alpha }^{\dagger }\). From Eq. (4), we obtain a formula for the parameter matrices (this is described in detail in Supplementary Note 2). At this point, the layer structure of the network comes in handy: To evaluate \({K}_{j}^{l}\) for a specific perceptron, we only need the output state of the previous layer, \({\rho }^{l-1}\) (which is obtained by applying the layer-to-layer channels \({{\mathcal{E}}}^{1},{{\mathcal{E}}}^{2},\ldots ,{{\mathcal{E}}}^{l-1}\) to the input state), and the state of the following layer \({\sigma }^{l}\) obtained from applying the adjoint channels to the desired output state up to the current layer (see Box 1). A striking feature of this algorithm is that the parameter matrices may be calculated layer-by-layer without ever having to apply the unitary corresponding to the full quantum circuit on all the constituent qubits of the QNN in one go. In other words, we need only access two layers at any given time, which greatly reduces the memory requirements of the algorithm. Hence, the size of the matrices in our calculation only scales with the width of the network, enabling us to train deep QNNs.
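Two ingredients of this scheme can be checked numerically: the duality \(\text{tr}\left(\sigma \,{\mathcal{E}}(\rho )\right)=\text{tr}\left({\mathcal{F}}(\sigma )\,\rho \right)\) that lets the target state be propagated backwards, and the fact that the update \(U\to {e}^{i\epsilon K}U\) keeps U unitary when K is Hermitian. The sketch below is a toy illustration with a single random unitary acting on two adjacent layers; the matrix `K` is a random Hermitian placeholder, not the gradient-derived parameter matrix of Supplementary Note 2, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d_prev, d_cur = 4, 2  # two qubits in layer l-1, one qubit in layer l

def random_unitary(dim):
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def random_state(dim):
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    psi /= np.linalg.norm(psi)
    return np.outer(psi, psi.conj())

U = random_unitary(d_prev * d_cur)
# Kraus operators A_alpha = (<alpha|_{l-1} ⊗ 1_l) U (1_{l-1} ⊗ |0>_l)
kraus = list(U.reshape(d_prev, d_cur, d_prev, d_cur)[:, :, :, 0])

def channel(x):
    """Forward map E: state on layer l-1 -> state on layer l."""
    return sum(a @ x @ a.conj().T for a in kraus)

def adjoint_channel(y):
    """Adjoint map F: operator on layer l -> operator on layer l-1."""
    return sum(a.conj().T @ y @ a for a in kraus)

rho = random_state(d_prev)   # forward-propagated state
sigma = random_state(d_cur)  # backward-propagated target
lhs = np.trace(sigma @ channel(rho))
rhs = np.trace(adjoint_channel(sigma) @ rho)
print(abs(lhs - rhs))

# Update step U -> exp(i * eps * K) U with a placeholder Hermitian K.
h = rng.normal(size=U.shape) + 1j * rng.normal(size=U.shape)
K = (h + h.conj().T) / 2
eps = 0.1
w, v = np.linalg.eigh(K)
step = (v * np.exp(1j * eps * w)) @ v.conj().T  # matrix exponential via eigh
U_new = step @ U
```

Only operators on layers l − 1 and l appear, which mirrors the statement above that two layers at a time suffice.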

Simulation of learning tasks

It is impossible to classically simulate deep QNN learning algorithms for more than a handful of qubits due to the exponential growth of Hilbert space. To evaluate the performance of our QML algorithm we have thus been restricted to QNNs with small widths. We have carried out pilot simulations for input and output spaces of m = 2 and 3 qubits and have explored the behaviour of the QML gradient descent algorithm for the task of learning a random unitary V (see Supplementary Note 4 and Supplementary Figs. 4–6 for the implementation details). We focussed on two separate tasks: In the first task we studied the ability of a QNN to generalise from a limited set of random training pairs \((\left|{\phi }_{x}^{\,\text{in}\,}\right\rangle ,V\left|{\phi }_{x}^{\,\text{in}\,}\right\rangle )\), with x = 1,…, N, where N was smaller than the Hilbert space dimension. The results are displayed in Fig. 2a. Here we have plotted the (numerically obtained) cost function after training alongside a theoretical estimate of the optimal cost function for the best unitary possible which exploits all the available information (for which \(C \sim \frac{n}{N}+\frac{N \, - \, n}{ND(D \, + \, 1)}\left(D+\min \{{n}^{2}+1,{D}^{2}\}\right)\), where n is the number of training pairs, N the number of test pairs and D the Hilbert space dimension). Here we see that the QNN matches the theoretical estimate, which demonstrates the remarkable ability of our QNNs to generalise.
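The theoretical estimate quoted above is easy to tabulate. The following sketch evaluates it for the setting of Fig. 2a (a three-qubit space, so D = 8, with 10 test pairs); the function name is hypothetical and the code reproduces only the closed-form estimate, not the trained-network data.

```python
def estimated_optimal_cost(n, n_test, dim):
    """C ~ n/N + (N - n) / (N D (D + 1)) * (D + min(n^2 + 1, D^2))."""
    return (n / n_test
            + (n_test - n) / (n_test * dim * (dim + 1))
            * (dim + min(n ** 2 + 1, dim ** 2)))

dim = 2 ** 3   # three qubits
n_test = 10    # N in the formula
estimates = {n: estimated_optimal_cost(n, n_test, dim) for n in range(1, 9)}
for n, c in estimates.items():
    print(n, round(c, 3))
```

Once \({n}^{2}+1\ge {D}^{2}\) the bracket equals D(D + 1), so the estimate saturates at 1; for D = 8 this happens at n = 8.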

Fig. 2: Numerical results. In both plots, the insets show the behaviour of the quantum neural network under approximate depolarizing noise. The colours indicate the strength t of the noise: black t = 0, violet t = 0.0033, orange t = 0.0066, yellow t = 0.01. For a more detailed discussion of the noise model see Supplementary Note 3 and Supplementary Fig. 3. Panel (a) shows the ability of the network to generalize. We trained a 3-3-3 network with ϵ = 0.1, η = 2∕3 for 1000 rounds with n = 1, 2, …, 8 training pairs and evaluated the cost function for a set of 10 test pairs afterwards. We averaged this over 20 rounds (orange points) and compared the result with the estimated value of the optimal achievable cost function (violet points). Panel (b) shows the robustness of the QNN to noisy data. We trained a 2-3-2 network with ϵ = 0.1, η = 1 for 300 rounds with 100 training pairs. In the plot, the number on the x-axis indicates how many of these pairs were replaced by noisy (i.e. random) pairs, and the cost function is evaluated for all “good” test pairs.

The second task we studied was aimed at understanding the robustness of the QNN to corrupted training data (e.g., due to decoherence). To evaluate this we generated a set of N good training pairs and then corrupted n of them by replacing them with random quantum data, where we chose the subset that was replaced by corrupted data randomly each time. We evaluated the cost function for the good pairs to check how well the network has learned the actual unitary. As illustrated in Fig. 2b the QNN is extraordinarily robust to this kind of error.
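The data preparation for this robustness test can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the helper names are hypothetical, and a corrupted pair is assumed to consist of two independent random states, so that its output is unrelated to V applied to its input.

```python
import numpy as np

def random_state(dim, rng):
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return psi / np.linalg.norm(psi)

def random_unitary(dim, rng):
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def make_pairs(V, n_pairs, rng):
    """Good training pairs (|phi_in>, V|phi_in>)."""
    pairs = []
    for _ in range(n_pairs):
        phi_in = random_state(V.shape[0], rng)
        pairs.append((phi_in, V @ phi_in))
    return pairs

def corrupt(pairs, n_bad, rng):
    """Replace a randomly chosen subset of n_bad pairs with random data."""
    dim = pairs[0][0].shape[0]
    bad = set(rng.choice(len(pairs), size=n_bad, replace=False).tolist())
    noisy = [(random_state(dim, rng), random_state(dim, rng)) if i in bad else p
             for i, p in enumerate(pairs)]
    return noisy, bad

rng = np.random.default_rng(2)
V = random_unitary(4, rng)               # unknown two-qubit unitary
good = make_pairs(V, 100, rng)
train, bad = corrupt(good, 30, rng)
good_subset = [p for i, p in enumerate(train) if i not in bad]
n_consistent = sum(np.allclose(V @ a, b) for a, b in train)
print(len(bad), len(good_subset), n_consistent)
```

The network would be trained on `train`, while the cost function is evaluated only on `good_subset`, matching the evaluation on good pairs described above.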

A crucial observation from our numerical investigations was the absence of a “barren plateau” in the cost function landscape for our QNNs45. There are two key reasons for this. Firstly, according to McClean et al.45, “The gradient in a classical deep neural network can vanish exponentially in the number of layers […], while in a quantum circuit the gradient may vanish exponentially in the number of qubits.” This point does not apply to our QNNs because the gradient of a weight in the QNN does not depend on all the qubits but rather only on the number of paths connecting that neuron to the output, just as it does classically. (This is best observed in the Heisenberg picture.) Thus, indeed, the gradient vanishes exponentially in the number of layers, but not in the number of qubits. Secondly, our cost function differs from that of McClean et al.45: they consider energy minimisation of a local Hamiltonian, whereas we consider a quantum version of the risk function. Our quantity is not local, and this means that Levy’s lemma-type argumentation does not directly apply. In addition, we always initialised our QNNs with random unitaries and we did not observe any exponential reduction in the value of the parameter matrices K (which arise from the derivative of our QNN with respect to the parameters). This may be intuitively understood as a consequence of the nongeneric structure of our QNNs: at each layer we introduce new clean ancillas, which lead, in general, to dissipative output.