Data encoding

Classification consists of assigning a category to an observation. In machine learning, an inference model is trained to minimize the classification error on a finite set of data, also known as the training set. The actual performance of the classifier, the generalization error, is then estimated on a set of data points not used for training, also known as the test set. The functional form of the inference model is often critical to the success of the classifier. State-of-the-art models for high-dimensional datasets with complex structure are typically hierarchical or compositional.1 These ideas can be translated to the paradigm of quantum computation using the framework of tensor networks. Before describing the tensor network architectures used in this work, namely TTN and MERA, it is important to first clarify what datasets are considered in this paper to gauge the performance of these networks, and how they are prepared.

Let us first consider the case of classical data. A classical dataset for binary classification is a set \({\cal D} = \left\{ {\left( {{\boldsymbol{x}}^d,y^d} \right)} \right\}_{d = 1}^D\), where \({\boldsymbol{x}}^d \in {\Bbb R}^N\) are N-dimensional input vectors, and \(y^d \in \{0, 1\}\) are the corresponding class labels. Classifying classical data on a quantum computer requires that the input vectors be encoded in a quantum state. There are a variety of ways to accomplish this, and different algorithms require different encoding methods. The most space-efficient approach is to encode classical data in the amplitudes of a superposition, that is, to use N qubits to encode a \(2^N\)-dimensional data vector. However, in the general case and depending on the quantum classifier used, the computational cost of preparing data as a superposition can negate the speedup obtained during classification.7 A simpler method is to encode each element of a classical data vector in the amplitudes of a single qubit. This type of encoding requires N qubits to encode an N-dimensional data vector and is therefore less efficient in terms of space. However, the state preparation is efficient in terms of time, as it requires only single-qubit rotations. We opt for this type of encoding for classical data. In particular, we first rescale the data vectors element-wise to lie in \(\left[ {0,{\textstyle{\pi \over 2}}} \right]\). Then, we encode each vector element in a qubit using the following scheme:20

$$\psi _n^d = {\mathrm{cos}}\left( {x_n^d} \right)\left| 0 \right\rangle + {\mathrm{sin}}\left( {x_n^d} \right)\left| 1 \right\rangle .$$ (1)

The final data vector is written as \(\psi ^d = \otimes _{n = 1}^N\psi _n^d\), and is ready to be used in a quantum algorithm.
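As a minimal sketch of this encoding, the NumPy snippet below rescales each feature to \(\left[ 0, \pi/2 \right]\), maps each element to the amplitudes of Eq. (1), and forms the tensor product explicitly. The function names, the per-feature min-max rescaling, and the toy input values are illustrative assumptions rather than details taken from this work; building the full \(2^N\)-dimensional vector is only practical for small N.

```python
import numpy as np

def rescale(X):
    """Min-max rescale each feature (column) of X to [0, pi/2] (assumed scheme)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (X - lo) / span * (np.pi / 2)

def encode_qubits(x):
    """Return an (N, 2) array whose n-th row is (cos x_n, sin x_n), as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.cos(x), np.sin(x)], axis=1)

def product_state(qubits):
    """Kronecker product of the single-qubit states: a vector of length 2**N."""
    psi = np.array([1.0])
    for q in qubits:
        psi = np.kron(psi, q)
    return psi

# Made-up data: three examples with four features each.
X = np.array([[0.0, 2.0, 4.0, 1.0],
              [1.0, 3.0, 0.5, 2.0],
              [2.0, 1.0, 2.5, 0.0]])
psi = product_state(encode_qubits(rescale(X)[0]))
```

Each encoded qubit is normalized because cos² + sin² = 1, so the product state has unit norm by construction.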

Let us now consider the case of quantum data. A quantum dataset for binary classification is a set \({\cal D} = \left\{ {\left( {\psi ^d,y^d} \right)} \right\}_{d = 1}^D\), where \(\psi ^d \in {\Bbb C}^{2^N}\) are \(2^N\)-dimensional input vectors of unit length, and \(y^d \in \{0, 1\}\) are the corresponding classes. In contrast to classical data, quantum data, such as the output of a quantum circuit or a quantum sensor, may already be in superposition. That is, the quantum states are used as-is, and there is no relevant cost for the preparation.

Circuit architecture

We now discuss the quantum circuit architectures for classification. The first circuit architecture is inspired by TTNs, specifically binary trees. The TTN circuit begins by applying a set of two-qubit nearest-neighbor unitaries to the input. We then discard one of the qubits output from each unitary, halving the number of qubits in the next layer of the circuit. In the following layer we again apply two-qubit unitaries to the remaining qubits before discarding half of them. This process is repeated until only one qubit remains. The network in full consists of measuring a single-qubit expectation value on this remaining qubit

$$M_{\boldsymbol{\theta }}\left( {\psi ^d} \right) = \left\langle {\psi ^d} \right|\hat U_{{\mathrm {QC}}}^\dagger \left( {\left\{ {U_i\left( {\theta _i} \right)} \right\}} \right)\hat M\hat U_{{\mathrm {QC}}}\left( {\left\{ {U_i\left( {\theta _i} \right)} \right\}} \right)\left| {\psi ^d} \right\rangle ,$$ (2)

where \(\hat U_{{\mathrm {QC}}}\left( {\left\{ {U_i} \right\}} \right)\) is the quantum circuit made up of unitaries \(U_i(\theta_i)\), \({\boldsymbol{\theta }} = \{\theta_i\}\) is the set of parameters that define the unitaries, and \(\hat M\) is the single-qubit operator whose expectation we are calculating. A circuit diagram of an eight-qubit TTN is shown in Fig. 1a. The solid lines encompass the circuit, while the dashed lines represent its conjugate transpose.
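The tree structure can be illustrated with a short classical simulation. The sketch below assumes product-state inputs as in Eq. (1), so that the two inputs of every tree node are uncorrelated and the network can be contracted node by node with 2 × 2 density matrices. The random unitaries, the choice of always keeping the first output qubit, and all function names are assumptions made for illustration; they are not the trained circuits of this work.

```python
import numpy as np

def random_unitary(dim, rng):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))  # fix the column phases

def node(U, rho_a, rho_b, keep=0):
    """Apply a two-qubit unitary to rho_a (x) rho_b and trace out one qubit."""
    rho = U @ np.kron(rho_a, rho_b) @ U.conj().T
    rho = rho.reshape(2, 2, 2, 2)              # indices: a, b, a', b'
    if keep == 0:
        return np.einsum('ibjb->ij', rho)      # trace over qubit b
    return np.einsum('aiaj->ij', rho)          # trace over qubit a

def ttn_expectation(qubits, unitaries, M):
    """Contract a binary tree; len(qubits) must be a power of two."""
    layer = [np.outer(q, np.conj(q)) for q in qubits]
    it = iter(unitaries)
    while len(layer) > 1:
        layer = [node(next(it), layer[i], layer[i + 1])
                 for i in range(0, len(layer), 2)]
    return float(np.real(np.trace(M @ layer[0])))

rng = np.random.default_rng(0)
angles = rng.uniform(0, np.pi / 2, size=8)
qubits = [np.array([np.cos(t), np.sin(t)]) for t in angles]
unitaries = [random_unitary(4, rng) for _ in range(7)]  # 4 + 2 + 1 nodes
Z = np.diag([1.0, -1.0])
value = ttn_expectation(qubits, unitaries, Z)
```

An eight-qubit tree uses seven two-qubit unitaries, matching the blocks \(U_1, \ldots, U_7\) of Fig. 1a. For entangled (quantum) inputs this node-by-node shortcut no longer applies and the full state would have to be propagated.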

Fig. 1 TTN and MERA classifiers for eight qubits. The quantum circuit is illustrated by the regions outlined in solid lines comprising inputs ψ, unitary blocks \(\left\{ {U_i} \right\}_{i = 1}^7\) and \(\left\{ {D_i} \right\}_{i = 1}^4\), and a measurement operator M. The dashed lines represent its conjugate transpose. The solid and dashed regions together describe a tensor network operating on inputs \(\psi_1, \ldots, \psi_8\) and evaluating to the expectation value of observable M

The MERA network is closely related to the TTN. All of the unitaries that make up a tree network are maintained with an additional layer of two qubit unitaries added before each layer of the TTN. These additional unitaries, {D i }, each operate on one qubit of neighboring unitaries in the upcoming TTN layer. In a conventional MERA network, the addition of these unitaries allows quantum correlations on a particular length scale to be captured at the same layer of the network.10 A circuit diagram of an eight-qubit MERA is shown in Fig. 1b.

Unitary parameterization

We have explored a number of different ways to parameterize the unitaries used in these circuits. Some of the input data used is purely real; we therefore tested the effect of restricting the unitaries to be real too. That is, we chose unitaries such that \(U_i \in SO(\cdot) \subset SU(\cdot)\). We also considered general, complex-valued unitaries \(U_i \in SU(\cdot)\). As has been observed in the context of the time-dependent variational principle applied to tensor networks, the use of complex weights often prevents optimization from getting stuck in local minima.21,22

We also explored a number of other methods for parameterizing the unitaries; Fig. 2 illustrates three such parameterizations. In Fig. 2a, the unitary block is composed of two arbitrary single-qubit rotations and a \(\mathrm{CNOT}_{ij}\) gate, where i and j are the control and target qubits, respectively. Note that in some cases the direction of the \(\mathrm{CNOT}_{ij}\) may be reversed in order to respect the causal structure. For example, in our eight-qubit implementations we reverse the control and target qubits for blocks \(U_2\), \(U_4\), and \(U_6\) lying in the lower part of the circuit. In the case of the restriction to SO(4), the single-qubit rotations are simply Y-rotations.
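The real-valued version of the simple block can be written down explicitly. The sketch below assumes the convention \(R_y(\theta) = e^{-i\theta Y/2}\) and the basis ordering \(|q_0 q_1\rangle\); the function names and the `reverse` flag are illustrative assumptions.

```python
import numpy as np

def ry(theta):
    """Real single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# CNOT with the first qubit as control, and with the second qubit as control.
CNOT_01 = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
CNOT_10 = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]], dtype=float)

def simple_block(theta_a, theta_b, reverse=False):
    """Two Y-rotations followed by a CNOT; `reverse` swaps control and target."""
    rotations = np.kron(ry(theta_a), ry(theta_b))
    cnot = CNOT_10 if reverse else CNOT_01
    return cnot @ rotations

U = simple_block(0.3, 1.1)
U_reversed = simple_block(0.3, 1.1, reverse=True)
```

Both matrices are real and orthogonal, so a real input state stays real throughout the circuit. Each block carries only two parameters, which is what makes this setting easy to run on available hardware.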

Fig. 2 Three alternative parameterizations of the unitary blocks in Fig. 1. a Two arbitrary single-qubit rotations followed by a CNOT. The direction of the CNOT may be reversed to preserve the causal structure of the network. This simple setting can be readily implemented on available quantum computers. b An arbitrary two-qubit gate. Such a general setting would in practice require compilation into low-level hardware-dependent gates. c An arbitrary three-qubit gate involving an ancilla qubit. The ancilla is traced out, which allows a rich set of non-linear operations to be performed. Implementation of the latter on currently available hardware would require a compilation step

In Fig. 2b, the unitary block consists of an arbitrary two-qubit gate. It is interesting to explore this much more general setting in simulations, although a practical implementation of such a unitary may be costly because the two-qubit unitary needs to be compiled into low-level hardware-dependent gates.

Finally, Fig. 2c shows a three-qubit gate involving an ancilla qubit. By tracing out the ancilla qubit we can effectively implement a rich class of non-linear functions, e.g. step functions,23 closely resembling the operations of classical neural networks. Again, in practice a significant overhead is expected due to compilation.

The measurement \(\hat M\) is performed on a specific qubit and consists of a simple Pauli measurement in a chosen direction. This can be implemented in practice by an additional single-qubit rotation followed by the projective measurement onto \(\left| 0 \right\rangle \left\langle 0 \right|\). This is sufficient for a binary classification task; by computing and thresholding the expectation value of \(\hat M\), TTN and MERA classify the input \(\psi^d\) into one of the two classes. In our example in Fig. 1, the measurement is performed on qubit number six.
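The equivalence between a Pauli measurement and a rotation followed by a projection onto \(\left| 0 \right\rangle \left\langle 0 \right|\) can be checked numerically. The sketch below measures Pauli X as an example; the threshold of 0.5 and the mapping from probability to class label are assumptions, since the text does not fix them.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def prob_zero(rho, R):
    """Probability of outcome 0 after applying the rotation R to the state rho."""
    return float(np.real((R @ rho @ R.conj().T)[0, 0]))

def classify(p, threshold=0.5):
    """Threshold the measured probability (label convention is an assumption)."""
    return int(p >= threshold)

# Measuring Pauli X: rotate with Ry(-pi/2), then project onto |0><0|.
R = ry(-np.pi / 2)
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(plus, plus)
p0 = prob_zero(rho, R)
pauli_expectation = 2 * p0 - 1   # equals Tr(rho X) for this choice of R
label = classify(p0)
```

The relation used here is \(R^\dagger Z R = X\), so the probability of outcome 0 after the rotation determines the Pauli expectation through \(\langle X \rangle = 2p_0 - 1\).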

Learning process and complexity

We now discuss the learning process. In principle, the circuit parameters would be adjusted to directly maximize the classification accuracy on the training set or, in other words, minimize the classification error. Optimizing such an objective function is highly non-trivial and it is common to optimize a bound instead. Here we choose to minimize the mean square error between predictions and true class labels

$$J({\boldsymbol{\theta }}) = \frac{1}{D}\mathop {\sum}\limits_{d = 1}^D \left( {M_{\boldsymbol{\theta }}\left( {\psi ^d} \right) - y^d} \right)^2,$$ (3)

where \(\psi^d\) are inputs, \(y^d\) are class labels, D is the number of training data points, and \({\boldsymbol{\theta }}\) groups all the adjustable parameters of the circuit as described above. Although there exist several approaches to carry out this optimization, artificial neural networks are commonly optimized by stochastic gradient descent algorithms. At each iteration t, we estimate the gradient \(\nabla J^{(t)}\) and choose a learning rate \(\eta^{(t)}\). Parameters are then updated via a rule of the kind \({\boldsymbol{\theta }}^{(t + 1)} \leftarrow {\boldsymbol{\theta }}^{(t)} - \eta^{(t)}\nabla J^{(t)}\), where the minus sign moves the parameters against the gradient so that the cost decreases. This algorithm is stochastic because at each iteration the gradient is estimated on a small batch rather than on the full training set. Besides speeding up the calculation, this noisy gradient may help in escaping from local minima. Much literature and experimentation has been dedicated to improving stochastic gradient descent algorithms. In this work, we employ a variant called Adaptive Moment Estimation (Adam).24
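To make the training loop concrete, the following sketch minimizes the mean squared error of Eq. (3) with mini-batch Adam updates. The one-parameter model \(\sin^2(x + \theta)\) is a toy stand-in for the circuit output, and the data, learning rate, and batch size are assumptions chosen only so the example runs quickly; the Adam update itself follows the standard published form with default decay rates.

```python
import numpy as np

def model(theta, x):
    """Toy stand-in for M_theta: the response sin^2(x + theta)."""
    return np.sin(x + theta) ** 2

def cost(theta, x, y):
    """Mean squared error between predictions and labels, as in Eq. (3)."""
    return float(np.mean((model(theta, x) - y) ** 2))

def grad(theta, x, y):
    """Analytic gradient; d/dtheta sin^2(x + theta) = sin(2 (x + theta))."""
    return float(np.mean(2 * (model(theta, x) - y) * np.sin(2 * (x + theta))))

def adam_step(theta, g, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
x = rng.uniform(0, np.pi / 2, size=1000)   # synthetic inputs
y = (x > np.pi / 4).astype(float)          # synthetic binary labels

theta, m, v = 1.0, 0.0, 0.0
initial_cost = cost(theta, x, y)
for t in range(1, 301):
    batch = rng.choice(len(x), size=20, replace=False)  # mini-batch of 20
    g = grad(theta, x[batch], y[batch])
    theta, m, v = adam_step(theta, g, m, v, t)
final_cost = cost(theta, x, y)
```

The gradient is estimated on 20 examples per step, so consecutive updates are noisy; Adam's running moment estimates smooth this noise and rescale the step size per parameter.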

The cost function is a function of the measurement outcome of the circuit being trained. In order to obtain these measurement outcomes, the circuit itself must be evaluated. In Table 1 we summarize the complexity of obtaining the measurement outcomes at the end of the different types of circuits in this paper. The complexity is stated in terms of the number of multiplications of scalar numbers required to perform the task. The complexities in the two-dimensional cases are stated for a grid of N × N qudits and use the network architecture introduced in refs. 10,25.

Table 1 Computational complexity of hierarchical quantum classifiers under different data encoding

In the case of efficiently contractable networks we can compute the exact gradient using off-the-shelf automatic differentiation software (e.g., TensorFlow26). This applies to many 1D networks including TTNs and MERA. For networks that cannot be efficiently contracted a finite-difference method or an approximation to the true gradient must be used.27 These strategies introduce additional noise due to finite-sampling error, and intrinsic noise of near-term quantum devices. We begin exploring the impact of these with simulations in Section 2.7. Note that all of the circuits we train in this paper can be evaluated efficiently on quantum hardware.

Experimental results: Iris dataset

In this experiment, we tested the ability of a TTN to classify varieties of Iris. The Iris dataset17 consists of 150 examples in total of three varieties of Iris flowers. Each example of Iris is described by four real-valued attributes \(x_1, \ldots, x_4\). We encoded the four attributes into four qubits using Eq. (1). We then parameterized unitaries using the simple gate shown in Fig. 2a, and restricted the single-qubit rotations to be real (i.e., Y-rotations). To allow for binary classification, three binary datasets were extracted from the original set. In each subset, each class comprised 1/2 of the examples. For each class, 1/3 of the examples were held out as a test set and used to compute the accuracy. Mean accuracy and one standard deviation computed on five random initializations are given in Table 2. As shown, TTN performed extremely well in all cases.

Table 2 Binary classification accuracy on the Iris dataset

Experimental results: Handwritten digits (MNIST)

In this experiment we tested the ability of TTN and MERA classifiers on a number of handwritten digit recognition tasks and compared the performance of different parameterizations. MNIST18 is a canonical dataset consisting of 70,000 labeled gray-scale images of handwritten digits from 0 to 9. From this dataset we generated four binary classification tasks. In the first we kept only images containing 0 or 1, and for the second task, only 2 or 7. For the third task we re-labeled all images as even or odd. For the final task we divided the images into those that were >4 or not. MNIST images are 28 × 28 pixels. To allow for simulation using eight qubits, we performed principal component analysis on the images for each task and kept only the eight components with highest variance. Finally, we used Eq. (1) to encode the data.
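The dimensionality reduction step can be sketched with a plain SVD-based PCA. The random array below merely stands in for flattened 28 × 28 images, and the function names and the min-max rescaling are assumptions; the original preprocessing code is not given in the text.

```python
import numpy as np

def pca_reduce(X, k=8):
    """Project centered data onto the k principal components of highest variance."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def rescale(X):
    """Min-max rescale each column to [0, pi/2] before applying Eq. (1)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span * (np.pi / 2)

rng = np.random.default_rng(0)
images = rng.random((100, 28 * 28))       # placeholder for flattened images
features = rescale(pca_reduce(images, k=8))
```

Each row of `features` then holds eight angles, one per qubit of the eight-qubit classifiers.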

Of the 70,000 examples 55,000 were used for training, 5000 for validation and 10,000 for testing. Training was performed using the Adam optimizer24 with batches of 20 examples. Validation and test accuracy were recorded every 10 training batches, and training was stopped when validation set accuracy did not increase for 30 consecutive tests. Figure 3 shows typical learning curves for train and test datasets.
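The stopping rule described above can be sketched as a small training driver. The callables `train_step` and `validation_accuracy` and the `max_batches` safety limit are hypothetical placeholders; only the evaluation interval of 10 batches and the patience of 30 evaluations come from the text.

```python
def train_with_early_stopping(train_step, validation_accuracy,
                              eval_every=10, patience=30, max_batches=100_000):
    """Train until validation accuracy fails to improve for `patience` evaluations."""
    best, stale, batches = -1.0, 0, 0
    while batches < max_batches:
        train_step()                       # one mini-batch update
        batches += 1
        if batches % eval_every == 0:
            acc = validation_accuracy()
            if acc > best:
                best, stale = acc, 0       # improvement: reset the counter
            else:
                stale += 1
                if stale >= patience:
                    break
    return best, batches

# Dummy run: accuracy improves three times and then plateaus.
accuracies = iter([0.5, 0.6, 0.7] + [0.7] * 100)
best, batches = train_with_early_stopping(lambda: None, lambda: next(accuracies))
```

In this dummy run the loop stops after 33 evaluations: three improvements followed by 30 evaluations without an increase.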

Fig. 3 Train and test accuracy vs. number of training steps. Here we show typical results for a MERA classifier parametrized using general gates and complex rotations, applied to the “Is > 4” task on the MNIST dataset with the dimension of each example reduced to eight using PCA

Mean accuracy and one standard deviation computed on five random initializations are given in Table 3. The ‘Classifier’ column describes whether the circuit was a TTN, a MERA, or a hybrid, that is, a MERA pre-trained with TTN. The ‘Unitaries’ column describes whether the circuit was parameterized using a simple, general, or ancilla gate set as described in Fig. 2. The ‘Rotations’ column specifies the type of rotation used, either real, SO(4), or complex, SU(4).

Table 3 Binary classification accuracy on the MNIST dataset

Some remarks are in order. First, we note that the restriction to simple unitaries led to significantly lower accuracy than when using general unitaries. Complex rotations improved the accuracy of the classifiers in all tasks except for task ‘0 or 1’ where accuracy was already >99.5% with real rotations. It is notable that this is the case despite the input data being real-valued. Second, the MERA classifiers achieved higher accuracy than TTN classifiers in all cases, demonstrating the power of the additional unitaries. Third, the hybrid classifier achieved accuracy comparable to that of the standard MERA. On average, hybrid classifiers required 2.45 times more training steps until convergence than standard MERA. However, the number of post-training steps required was only 0.825 times the number of training steps of standard MERA. This indicates that classical pre-training may lead to a reduction in the number of training steps carried out on the quantum computer, a potential advantage in the near-term. Finally, all networks outperformed the logistic regression benchmark except those using the simple gate-set.

One may wonder whether the accuracy on some of the tasks can be made more competitive with the state-of-the-art results on MNIST. In order to efficiently simulate all the circuits, each 28 × 28 image was reduced to an eight-dimensional vector using PCA, thereby discarding a lot of information that could be useful for classification. To verify this we ran logistic regression without PCA on the most difficult of the four tasks, “Is > 4”. This model achieved a test accuracy of 87.09%, a significant improvement over the logistic regression on the PCA reduced data which achieved 70.7% instead (Table 3). We concluded that reducing the dimensionality of the data can have a detrimental effect on the model accuracy and therefore we expect TTN and MERA classifiers to perform better when using more principal components, or even raw data.

Experimental results: Quantum data

We now consider the problem of classifying quantum data, that is, quantum states generated by different physical processes. A physical process can be simulated by a quantum circuit. By setting up two different quantum circuit layouts, we can generate synthetic classification tasks. Let us first define the building block for our quantum circuit layouts.

Our building block consists of single-qubit rotations \(U_i\) for all qubits i ∈ {1, …, N}, followed by all the possible \(\mathrm{CNOT}_{ij}\) gates where i and j are control and target qubits, respectively, and i < j. The angles of the single-qubit rotations are the only parameters of our building block. By stacking several of these building blocks, we can generate deeper and more complex circuits. In particular, we chose to identify the class with the number of building blocks in the stack (e.g., class 5 consists of 5 building blocks).

Now, for each class, we can generate a quantum state by randomizing all the single-qubit gates, and then executing the circuit on initial state \(\left| 0 \right\rangle\). This is repeated many times in order to generate a dataset. As discussed in Section 2, we assume that each quantum state in the dataset can be directly fed into the quantum computer where the classifier is executed, hence not requiring any pre-processing. The task of the classifier is to determine which of two circuit layouts a state was generated from.
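A state-vector sketch of this data generator is given below. It assumes that each single-qubit rotation is parameterized by three Euler angles drawn uniformly at random, which the text does not specify; the function names and qubit indexing from 0 are likewise illustrative.

```python
import numpy as np

def rotation(a, b, c):
    """Single-qubit rotation Rz(a) Ry(b) Rz(c); the Euler form is an assumption."""
    rz = lambda t: np.diag([np.exp(-0.5j * t), np.exp(0.5j * t)])
    ry = np.array([[np.cos(b / 2), -np.sin(b / 2)],
                   [np.sin(b / 2),  np.cos(b / 2)]])
    return rz(a) @ ry @ rz(c)

def apply_1q(psi, U, q):
    """Apply U to qubit q of a state tensor of shape (2,) * N."""
    psi = np.tensordot(U, psi, axes=([1], [q]))  # new axis lands in position 0
    return np.moveaxis(psi, 0, q)

def apply_cnot(psi, control, target):
    """Flip the target qubit on the slice where the control qubit is 1."""
    psi = psi.copy()
    idx = [slice(None)] * psi.ndim
    idx[control] = 1
    axis = target if target < control else target - 1  # control axis is dropped
    psi[tuple(idx)] = np.flip(psi[tuple(idx)], axis=axis).copy()
    return psi

def random_state(n_qubits, n_blocks, rng):
    """Stack n_blocks building blocks on |0...0> with random rotation angles."""
    psi = np.zeros((2,) * n_qubits, dtype=complex)
    psi[(0,) * n_qubits] = 1.0
    for _ in range(n_blocks):
        for q in range(n_qubits):
            psi = apply_1q(psi, rotation(*rng.uniform(0, 2 * np.pi, 3)), q)
        for i in range(n_qubits):
            for j in range(i + 1, n_qubits):     # all CNOTs with i < j
                psi = apply_cnot(psi, i, j)
    return psi.reshape(-1)

rng = np.random.default_rng(0)
state = random_state(n_qubits=8, n_blocks=5, rng=rng)  # one example of class 5
```

Each building block on N qubits contains N rotations and N(N − 1)/2 CNOT gates, so the class label directly controls circuit depth.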

Here, we work with circuits of N = 8 qubits. We generated datasets of D = 5000 quantum states for each of the classes y ∈ {1, 2, 3, 5, 10}. To make sure that the synthetic classification task was well defined, we first looked for a strategy that could correctly classify the states most of the time. For each state, we computed the maximum bipartite entanglement entropy, \(\max _AS\left( {\rho _A} \right) = \max _BS\left( {\rho _B} \right)\), over all possible partitions A, B of the eight qubits. Figure 4 shows histograms of this quantity for three classification tasks. By inspecting the overlap of distributions we can find an optimal threshold that would classify states correctly most of the time. This shows that the classification task is meaningful. We would like to stress that this is an intractable strategy. The only purpose is to demonstrate that, in principle, there is a feature of the state that correlates with the class. The hope is that a hierarchical quantum classifier can find equally successful strategies in a tractable way.
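For small systems the maximum bipartite entanglement entropy can be computed by brute force, as in the sketch below. It uses the singular values of the reshaped state vector for each bipartition and exploits \(S(\rho_A) = S(\rho_B)\) to enumerate only subsets containing at most half of the qubits. The base-2 logarithm and the function names are assumptions.

```python
import numpy as np
from itertools import combinations

def entropy_of_cut(psi, subset, n):
    """Von Neumann entropy (in bits) of the qubits in `subset` for a pure state."""
    rest = [q for q in range(n) if q not in subset]
    tensor = psi.reshape((2,) * n).transpose(list(subset) + rest)
    matrix = tensor.reshape(2 ** len(subset), -1)
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s ** 2                      # Schmidt coefficients squared
    p = p[p > 1e-12]                # drop numerical zeros before taking logs
    return float(-np.sum(p * np.log2(p)))

def max_bipartite_entropy(psi, n):
    """Maximum entropy over all bipartitions A, B of n qubits."""
    best = 0.0
    for k in range(1, n // 2 + 1):
        for subset in combinations(range(n), k):
            best = max(best, entropy_of_cut(psi, subset, n))
    return best

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
product = np.kron([1.0, 0.0], [0.6, 0.8])
s_bell = max_bipartite_entropy(bell, 2)
s_product = max_bipartite_entropy(product, 2)
```

The number of bipartitions grows exponentially with the number of qubits, and the state vector itself has \(2^N\) entries, which is why this strategy is only a sanity check and not a practical classifier.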

Fig. 4 Distribution of the maximum bipartite entanglement entropy for synthetic quantum datasets. Quantum data points were generated by random circuits with different numbers of building blocks y ∈ {1, 2, 3, 5, 10} as explained in the main text. From this data we created three classification tasks: a 1 vs. 10, b 3 vs. 10, and c 2 vs. 5. The subplots show histograms of maximum bipartite entanglement entropy for the three classification tasks. Such a property could be used to separate classes and classify data with high accuracy, hence the synthetic classification tasks are well-posed. We stress that the computation of such a property is intractable and we do not expect the hierarchical classifiers to be able to exploit it when classifying input data

The classifier used for this task was a TTN like the one shown in Fig. 1. We considered two parameterizations. The first uses general gates such as the one shown in Fig. 2b. The second uses arbitrary three-qubit gates where one of the qubits is an ancilla initialized in the state \(\left| 0 \right\rangle\), as illustrated in Fig. 2c. The data described above was divided into training, validation, and test sets. Each of these sets was balanced, that is, it had an equal number of states from each class. A set of 1000 examples from each class was held out as a test set. Training was performed for 4000 iterations with batches of 40 states and test accuracy was recorded every 50 iterations. The best test accuracy was recorded for each task.

Table 4 reports mean classification accuracy and one standard deviation computed on five random initializations. Results for the TTN with general two-qubit gates are no better than random class assignment in all tasks, indicating the need for a more expressive model. Indeed, when using gates augmented by an ancilla qubit, TTN was able to classify quantum states with some accuracy, suggesting that those may play a key role. The classification accuracy is higher for the ‘1 or 10’ task; this is somewhat expected as the overlap of classes 1 and 10 shown in Fig. 4a is less than that of the other tasks shown in Fig. 4b, c.

Finally, as a proof of principle, we verified the performance of a classical logistic regression model. We fed the vector of amplitudes to the model and trained with off-the-shelf software. The test accuracy was close to 50%, that is, no better than random. We stress that this approach is not feasible in practice, since merely providing the input in classical form would require full tomography of the quantum dataset.

Table 4 Binary classification test accuracy on synthetic quantum datasets

Experimental results: Characterizing the effect of noise on classification performance

Many machine learning models including neural networks are highly robust against the negative effects of noise. Some kinds of noise can even help with convergence and generalization.28,29 In this experiment, we tested the effect of depolarizing noise on the quantum classifier by simulating a depolarizing channel. It consists of a completely positive map \(\Delta_\lambda\), parametrized by λ, from a \(2^N\)-dimensional state ρ to a linear combination of ρ and a maximally mixed state

$${\mathrm{\Delta }}_\lambda (\rho ) = (1 - \lambda )\rho + \frac{\lambda }{{2^N}}I.$$ (4)
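A depolarizing channel is straightforward to apply to a density matrix. The sketch below treats `lam` as the noise strength, so `lam = 0` leaves the state untouched and `lam = 1` returns the maximally mixed state, which is how λ is used in the noise sweep of this section. The function name is an assumption.

```python
import numpy as np

def depolarize(rho, lam):
    """Mix rho with the maximally mixed state; lam is the noise strength."""
    d = rho.shape[0]
    return (1 - lam) * rho + lam * np.eye(d) / d

# Single-qubit example: the pure state |0><0| under 10% depolarizing noise.
rho = np.array([[1.0, 0.0], [0.0, 0.0]])
noisy = depolarize(rho, 0.1)
```

The map preserves the trace for any `lam`, and its output approaches the maximally mixed state as `lam` grows, which pushes all measurement probabilities toward 0.5.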

We used one of the TTN classifiers for classes 1 and 2 of the Iris dataset (see Section 2.5) and simulated the noisy circuit using the IBM Quantum Experience. The depolarizing channel was applied to the system after the application of each unitary gate in the circuit, that is, after each single-qubit rotation and CNOT gate. The entire test set was used to evaluate accuracy.

In order to make a realistic case, we used a finite number of measurements to estimate the class predictions. For each data point, we took 401 measurements in the computational basis and obtained the most likely class by majority vote. The 401 measurements may not be sufficient to estimate the output of the circuit with high confidence when the probability assigned to both classes is close to 0.5. In other words, repeating the 401 measurements and taking the majority vote could lead to a different class assignment for the very same data point. Therefore, we repeated the computation of the accuracy 200 times and obtained error bars. Finally, we increased the amount of noise λ from 0 to 0.2 in increments of 0.01.
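The effect of finite sampling on the majority vote can be reproduced with a few lines of NumPy. In the sketch below, the output probabilities `p_ones` and the labels are hypothetical values chosen for illustration; only the 401 shots and the 200 repetitions come from the text.

```python
import numpy as np

def majority_vote(p_one, shots=401, rng=None):
    """Sample `shots` binary outcomes with P(1) = p_one; return the majority class."""
    rng = np.random.default_rng() if rng is None else rng
    ones = rng.binomial(shots, p_one)
    return int(ones > shots // 2)    # for an odd number of shots, no ties occur

def repeated_accuracy(p_ones, labels, repeats=200, shots=401, seed=0):
    """Mean and standard deviation of the accuracy over repeated majority votes."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels)
    accuracies = []
    for _ in range(repeats):
        preds = np.array([majority_vote(p, shots, rng) for p in p_ones])
        accuracies.append(np.mean(preds == labels))
    return float(np.mean(accuracies)), float(np.std(accuracies))

# Hypothetical circuit outputs: two confident and two borderline data points.
p_ones = [0.95, 0.10, 0.52, 0.45]
labels = [1, 0, 1, 0]
mean_acc, std_acc = repeated_accuracy(p_ones, labels)
```

Data points whose probability is close to 0.5 are the ones whose predicted class changes between repetitions, and they account for the spread in accuracy.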

Figure 5 shows mean and one standard deviation of the classification accuracy on the test set. We first noticed that finite sampling led to some error even when no depolarizing noise was used. Indeed, we obtained a mean accuracy of 96.5% with λ = 0; the very same model achieved 100% accuracy under exact computation (see results for “1 or 2” in Table 2). Second, the mean accuracy reduced as we injected depolarizing noise, but it remained above 95% for depolarizing noise up to λ = 0.07 showing some level of resilience. Finally, as we increased the noise further, the standard deviation of the accuracy increased as well. This is expected: as the output state gets closer to the maximally mixed state according to Eq. (4), the probability assigned to both classes gets closer to 0.5. Hence, a larger number of measurements would be needed to estimate the class.

Fig. 5 Effect of depolarizing noise and finite sampling noise on the accuracy of the TTN Iris classifier. We show mean and one standard deviation of the classification accuracy computed on the test set. The mean accuracy remains above 95% for depolarizing noise up to λ = 0.07 showing some level of resilience in the model. As we increase the depolarizing noise further, (i) the model gets worse and mean accuracy reduces, and (ii) the standard deviation increases indicating the need for more measurements to overcome the finite sampling noise

Experimental results: Deployment on a quantum computer

In this experiment, we deployed the Iris classifier for classes 1 and 2 (see Section 2) on the ibmqx4 quantum computer available in the IBM Quantum Experience. As shown in Fig. 6, this TTN classifier has three CNOT gates and seven rotations in the Y direction. A test set of 34 unseen examples was used to determine accuracy. For each example, the circuit was run 401 times, and the samples were used to compute the most likely class. The circuit correctly classified 100% of the test set, and achieved a test cost function value of 0.0811 (Eq. (3)).