Experimental data sets on TF-DNA binding for a specific TF consist of N sequences of fixed length L and N values that express a measure of the binding affinity of the chosen TF to each sequence: \(\left\{ {\left( {\vec x_n,y_n} \right)} \right\}_{n = 1}^N\). In other words, the nth sequence is represented by the vector \(\vec x_n = \left( {x_{n,1},x_{n,2}, \ldots ,x_{n,L}} \right)\) with \(x_{n,j} \in \{A, C, G, T\}\), for j = 1, …, L, and \(y_n\) is the corresponding measure of binding affinity. For instance, \(\vec x_n\) may be ACAACTAA, with \(y_n = 4.95\). In this work we used binding data from three genomic-context protein binding microarray (gcPBM) experiments, which use fluorescence intensity as a measure of binding affinity,40 and two high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX)41,42,43 experiments, which report relative binding affinity. After preprocessing, the three gcPBM data sets consisted of N ≈ 1600 sequences of L = 10 base pairs. The two HT-SELEX data sets consisted of N ≈ 3200 and 1800 sequences of length L = 12 after preprocessing (see Methods for a brief description of the preprocessing procedure). We used the following one-hot encoding to represent the sequence as a vector of binary variables: A = 1000, C = 0100, G = 0010, T = 0001, and thus transformed \(\vec x_n\) into a feature vector \(\vec \phi _n \equiv \left( {\phi _{n,1}, \ldots ,\phi _{n,4L}} \right)^{\sf T}\). This encoding scheme44 was used so that all combinations of inclusion and exclusion of the four nucleotides may be identified. As in previous studies,44,45,46,47 the goal of the present work is to identify patterns within the data in order to assess whether the strength of a TF binding to a particular unseen sequence is above a certain threshold (classification) or to rank sequences in terms of binding affinity (ranking).
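For concreteness, a minimal sketch of this encoding in Python (function and variable names are ours, for illustration only):

```python
import numpy as np

# One-hot codes as defined above: A = 1000, C = 0100, G = 0010, T = 0001.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode_sequence(sequence):
    """Map a DNA sequence of length L to a binary feature vector of length 4L."""
    return np.array([bit for base in sequence for bit in ONE_HOT[base]])

# Example: the 8-bp sequence ACAACTAA becomes a 32-dimensional binary vector.
phi = encode_sequence("ACAACTAA")
assert phi.shape == (4 * len("ACAACTAA"),)
```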

To identify conditions in which machine learning with existing QA devices may be of use for studying a simplified biological problem, we report results obtained by solving a learning protocol with six different strategies: (i) an adiabatic quantum machine learning approach formulated in refs. 4,5 (DW), (ii) simulated annealing48 (SA) (using the implementation given in ref. 49), (iii) simulated QA50 (SQA), a classical algorithm that can represent the potential of a noiseless thermal quantum annealer, (iv) \(L_2\)-regularized multiple linear regression (MLR), (v) Lasso,51 and (vi) a scalable machine learning tool known as XGBoost (XGB).52 DW, SA and SQA are probabilistic approaches. SQA is a (classical) path-integral Monte Carlo method that has performed very similarly to QA and captures some of its main advantages.53 MLR is a deterministic method with a closed-form solution that returns the weights that minimize the objective function (defined below). Lasso is a linear-regression method that uses an \(L_1\) norm for regularization (see the description of the objective function below for more details). XGB uses boosted trees and has been applied to a variety of machine learning tasks in physics, natural language processing and ad-click prediction (e.g., ref. 54).
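For reference, the three classical baselines have standard open-source implementations; a minimal sketch with placeholder data and placeholder hyper-parameter values (not the calibrated values used in this work):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from xgboost import XGBRegressor

# Placeholder data standing in for the encoded sequences (N x 4L) and binding scores.
rng = np.random.default_rng(0)
Phi_train = rng.integers(0, 2, (150, 40)).astype(float)
y_train = rng.normal(size=150)

models = {
    "MLR": Ridge(alpha=1.0),      # L2-regularized multiple linear regression
    "Lasso": Lasso(alpha=0.1),    # L1-regularized linear regression
    "XGB": XGBRegressor(n_estimators=100, max_depth=2, reg_lambda=1.0, gamma=0.1),
}
for name, model in models.items():
    model.fit(Phi_train, y_train)
    preds = model.predict(Phi_train)   # predicted binding scores
```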

Given a transformed feature vector \(\vec \phi _n\) that represents a DNA sequence, the goal of each method is to compute a predicted binding score \(f\left( {\vec \phi _n} \right)\) that best matches the actual binding score. To carry out the task, an objective function must be optimized. The objective function consists of two parts: a training loss function and a regularization term that helps avoid overfitting. We may write the objective function as

$$Obj(\vec w) = R(\vec w) + \Omega (\vec w),$$ (2)

where R is the training loss, Ω is the regularization term, and \(\vec w\) is the set of feature weights to be determined by the six learning algorithms: DW, SA, SQA, MLR, Lasso and XGB. The mean squared error was used as the loss function for all six methods; namely, \(R(\vec w) = \mathop {\sum}\nolimits_n \left( {y_n - f_{\vec w}\left( {\vec \phi _n} \right)} \right)^2\), where \(y_n\) is the actual binding score of the nth sequence, and \(f_{\vec w}\left( {\vec \phi _n} \right)\) is the predicted binding score. The regularization term was \(\Omega (\vec w) = \lambda \left\| {\vec w} \right\|_1\) for DW, SA, SQA and Lasso; \({\mathrm{\Omega }}(\vec w) = \lambda \left\| {\vec w} \right\|_2^2\) for MLR; and \({\mathrm{\Omega }}(\vec w) = \gamma S + \frac{1}{2}\lambda \mathop {\sum}\nolimits_{j = 1}^S w_j^2\) for XGB, where the \(\left\| \cdot \right\|_1\) norm is the sum of the absolute values of the weights (for binary weights, the number of 1's, i.e., the Hamming weight), the \(\left\| \cdot \right\|_2^2\) norm is the square of the Euclidean norm, and S is the number of leaves.52 The calibration of the hyper-parameters λ, γ and S is discussed below. The loss function is to be minimized, while the regularization term controls model complexity by penalizing complicated models; the strength of the regularization was determined using a 100-fold Monte Carlo cross-validation. All six methods assume a linear model for the predicted binding affinity, i.e., \(f_{\vec w}\left( {\vec \phi _n} \right) = \vec w^{\sf T} \vec \phi _n = \mathop {\sum}\nolimits_j w_j\phi _{n,j}\). DW, SA and SQA return binary weights and are probabilistic methods; that is, they return a distribution of weights with different energies [values of the Hamiltonian in Eq. (1)]. In order to utilize the distribution of weights returned, while not sacrificing the discrete nature of the QUBO approach, up to twenty of the best weight vectors were averaged (see Supplementary Material, Sec. SID for a description of how excited-state solutions were included and Fig. S2 for an example of the decrease in the objective function).
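Because DW, SA and SQA operate on binary weights, the objective of Eq. (2) can be expanded into a quadratic unconstrained binary optimization (QUBO) problem, whose coefficients determine the \(h_i\) and \(J_{ij}\) of Eq. (1) after the usual change of variables; the paper's exact mapping is given in Methods. A minimal sketch of the expansion, with our own naming conventions:

```python
import numpy as np

def qubo_matrix(Phi, y, lam):
    """Expand Eq. (2) with binary weights w_j in {0, 1} into QUBO form.

    sum_n (y_n - phi_n . w)^2 + lam * ||w||_1
      = w^T (Phi^T Phi) w + sum_j (lam - 2 (Phi^T y)_j) w_j + sum_n y_n^2.
    Since w_j^2 = w_j for binary variables, the linear terms fold onto the
    diagonal, leaving a single matrix Q with w^T Q w to minimize.
    """
    Q = (Phi.T @ Phi).astype(float)
    linear = lam - 2.0 * (Phi.T @ y)
    Q[np.diag_indices_from(Q)] += linear
    return Q  # the constant offset sum_n y_n^2 does not affect the minimizer
```

Minimizing \(w^{\sf T} Q w\) over binary \(w\) is then the task handed to the annealers, whose low-energy samples can be post-processed, e.g., by averaging the best solutions as described above.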

Our computational procedure consisted of three main phases: (1) calibration of hyper-parameters, (2) training, and (3) testing (Fig. 2). About 10% of the data were held out for testing during the testing phase ('test data' or \({\cal D}^{{\mathrm{TEST}}}\)); these test data were not seen during the calibration and training phases. Calibration and training were carried out using the remaining 90% of the data ('training data' or \({\cal D}^{{\mathrm{TRAIN}}}\)). Due to the discrete nature of the weights returned in the QUBO approach, as well as technological limitations of the DW2X device, calibration of the hyper-parameter λ was carried out by repeatedly sampling a small number of sequences, about 2% and 10% of \({\cal D}^{{\mathrm{TRAIN}}}\), corresponding to about 30 and 150 sequences, respectively. In particular, in the calibration phase we determined the hyper-parameters by using 100-fold Monte Carlo (or split-and-shuffle) cross-validation with training splits of 2% and 10% of the training data, varying λ from \(2^{-3}\) to \(2^{6}\). Monte Carlo cross-validation was used so that hyper-parameters would be tuned on a similar number of sequences as used in the training phase (in contrast, n-fold cross-validation trains on \(\frac{{n - 1}}{n} \times 100\%\) of the data). The same calibration procedure was applied to tune λ for SA, SQA, MLR and Lasso; the resulting values of λ are listed in the Supplementary Material, Tables S1 and S2. In order to achieve good performance for XGB, γ, S, and several additional parameters needed to be tuned (see Methods). In the training phase we used a bagging (bootstrap aggregating) procedure,55 randomly sampling with replacement 2% and 10% of the training data, namely about 30 and 150 sequences for the gcPBM data sets. Each subset of about 30 or 150 sequences formed a training 'instance', and the mapping of a subset of data to the \(h_i\) and \(J_{ij}\) seen by DW and SA is given in the Methods. Each learning approach (DW, SA, SQA, MLR, Lasso and XGB) was trained on the same set of instances. To collect statistics, 50 instances were randomly selected with replacement for each training size. In the testing phase, the predictive power was assessed in terms of classification performance (the mean area under the precision-recall curve, or AUPRC) and ranking performance (the median Kendall’s τ) on the test data unseen during the calibration and training phases. The AUPRC is a measure of classification performance that can help discern between similar algorithms when there is a high degree of class imbalance, i.e., when the data set contains many more false labels than true labels.56 Kendall’s τ is a rank correlation coefficient that counts all mismatches equally.57 Additional methodological details are given in Methods and in the Supplementary Material, Sec. SI.
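For illustration, the two resampling schemes used here, Monte Carlo cross-validation (calibration) and bootstrap sampling of small training instances (training), might be sketched as follows (a simplified implementation under our own names, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility of the sketch

def monte_carlo_cv_splits(n_train, fraction, n_folds=100):
    """Yield (fit_idx, val_idx) index pairs for Monte Carlo (split-and-shuffle)
    cross-validation: each fold fits on a small random subset and validates
    on the remainder."""
    n_fit = max(1, round(fraction * n_train))
    for _ in range(n_folds):
        perm = rng.permutation(n_train)
        yield perm[:n_fit], perm[n_fit:]

def bootstrap_instances(n_train, fraction, n_instances=50):
    """Draw training 'instances' by sampling with replacement, as in bagging."""
    n_fit = max(1, round(fraction * n_train))
    return [rng.choice(n_train, size=n_fit, replace=True) for _ in range(n_instances)]
```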

Fig. 2 Schematic overview of the data handling procedure. About 10% of the data were held out for testing (\({\cal D}^{{\mathrm{TEST}}}\)) and were unseen during the calibration and training phases. The remaining 90% of the data were used for calibration and training (\({\cal D}^{{\mathrm{TRAIN}}}\)). For DW, SA, and MLR, in the calibration step, the hyper-parameter λ was tuned using 100-fold Monte Carlo cross-validation. For XGB, several other hyper-parameters were tuned (see Methods for a list). During the training step, a procedure similar to bagging (bootstrap aggregating) was used, randomly sampling with replacement 2% and 10% of the training data 50 times to give 50 training instances. In the testing step, the area under the precision-recall curve (AUPRC) of the best performing weights for each of the 50 training instances was evaluated on \({\cal D}^{{\mathrm{TEST}}}\), which was unseen during training and calibration, to evaluate generalization of classification performance. The calibration, training, and testing procedure was identical for ranking tasks, with the exception that Kendall’s τ was used as the performance metric instead of the AUPRC

Performance on gcPBM Data

To quantify the relative performance of the algorithms in capturing DNA–protein binding preferences, we first present results for high-quality gcPBM40 data of three TFs from the basic helix-loop-helix (bHLH) family: the Mad1/Max heterodimer (‘Mad’), the Max homodimer (‘Max’), and the c-Myc/Max heterodimer (‘Myc’).44 bHLH proteins typically recognize and bind as dimers to the enhancer box (E-box), which is of the form CANNTG, where N denotes any of the four nucleotides (A, C, G, or T). Mad, Max, and Myc are part of a gene network that controls transcription in cells; a mutation of Myc has been associated with many forms of cancer.58 For the work here, these three data sets were modified to consist of about 1600 sequences of ten base pairs (bp) in length with the E-box located at the central 6 bp.

In Fig. 3 we present the AUPRC and Kendall’s τ obtained with the different algorithms when training with about 30 (2%) and 150 (10%) sequences. To compute the AUPRC, a threshold on the data was introduced: for a threshold at the pth percentile of the data, the p% of sequences with binding affinities below the threshold were set as negatives ('false'), and the (100 − p)% of sequences with binding affinities above the threshold were set as positives ('true'); see Supplementary Material, Sec. SID for a more detailed explanation of the procedure used to threshold the data and to generate and calculate the AUPRC. During the calibration phase we tuned hyper-parameters with a single threshold at the 80th percentile of the data, and during the testing phase we evaluated performance between the 70th and the 99th percentiles. Kendall’s τ was evaluated between the predicted and measured binding affinities. A higher AUPRC indicates a better ability to correctly classify sequences that would be strongly bound by a TF, and a higher τ indicates a better ability to accurately rank the binding affinities of different sequences.
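A minimal sketch of the two test metrics, assuming scikit-learn and SciPy conventions (the thresholding follows the percentile rule described above; function names are ours):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import auc, precision_recall_curve

def auprc_at_percentile(y_measured, y_predicted, p):
    """Label sequences above the p-th percentile of measured affinity as positives,
    then compute the area under the precision-recall curve of the predictions."""
    labels = y_measured > np.percentile(y_measured, p)  # top (100 - p)% are 'true'
    precision, recall, _ = precision_recall_curve(labels, y_predicted)
    return auc(recall, precision)

def kendall_tau(y_measured, y_predicted):
    """Rank correlation between measured and predicted binding affinities."""
    tau, _ = kendalltau(y_measured, y_predicted)
    return tau
```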

Fig. 3 Quantitative performance on the held-out experimental test set for two different types of tasks on the three high-quality gcPBM data sets. a The mean AUPRC for Mad, Max, and Myc plotted versus the threshold percentile of the data, when training with 2% of the data (left) and 10% of the data (right). In both cases 50 instances were randomly selected for training, and the performance of the 50 trained weight vectors was evaluated on the same held-out test set. Error bars are standard deviations. b Boxplot of Kendall’s τ on the held-out test set. Red ‘+’ symbols indicate outliers; the gray line represents the median. The bottom and top edges of each box represent the 25th and 75th percentiles, respectively

For the AUPRC, when training on instances with 2% of the data (left column in Fig. 3a), DW, SA and SQA perform very similarly, with DW slightly outperforming SA on the Myc data, and are somewhat better than MLR at the 70th and 80th percentiles. MLR tends to do better at the higher thresholds; this behavior may reflect the fact that, during the calibration phase, we selected the λ that gave the best performance at the 80th percentile. Lasso, which uses the same \(L_1\) norm as DW, SA, and SQA, performs better than XGB but worse than the other methods. XGB, which has been successfully applied to a growing number of learning tasks, does poorly with small training sizes. When training with 10% of the data (right column in Fig. 3a), the trends of relative classification performance are quite different. XGB and MLR perform very similarly, though XGB does slightly better for the Max data set. DW tends to perform better than SA and SQA, especially at higher thresholds, but DW’s mean AUPRC is typically lower than MLR’s and XGB’s, though the error bars overlap. SA and SQA generally perform worse than the other methods, but not conspicuously so. A more thorough analysis of DW’s classification performance in comparison to SA and SQA with the same problem parameters is reported in the Supplementary Material, Figs. S3–S6 and related text in Sec. SIIA. Lasso’s performance is in general comparable to that of DW, SA, and SQA, and it tends to be the worst of the methods with 10% of the training data.

For Kendall’s τ (Fig. 3b), the performance of Lasso and XGB is the least favorable when training with 2% of the data. SQA generally performs the best over the three TFs, though MLR’s median τ is marginally greater than SQA’s for Mad. SA’s performance is very close to SQA’s, and DW’s performance is slightly worse than that of the other two annealing schemes, though generally better than that of the conventional machine learning algorithms. With 10% of the data, DW performs the worst; SA and SQA perform very similarly, with SQA slightly better on two of the three data sets; MLR and Lasso perform very similarly, though MLR appears slightly better; and XGB performs the best.

The fact that, for Mad and Max with 2% of the training data, there is very little variation in Kendall’s τ for SA and SQA (and, to a lesser extent, DW) is a consequence of the choice of hyper-parameters. The specific values of the hyper-parameters that gave the optimal value of Kendall’s τ during the calibration phase are shown in the Supplementary Material, Table S2, but we note here that the value of λ is quite high. λ controls the model complexity and is closely related to the bias-variance tradeoff, which states that it is impossible to simultaneously minimize errors from both bias and variance. A large value of λ introduces a large bias;59 consequently, in the cases where there is little or no variance, SA and SQA are essentially extracting the same pattern from all the training data. For the ranking tasks shown here, with training on about 30 sequences, this gives the best performance for SA and SQA. It is perhaps unsurprising that a large value of λ is appropriate for small data sets, since over-fitting is a greater concern with smaller amounts of data.

The results presented in Fig. 3 suggest a specific regime in which current quantum technology may offer slight performance advantages relative to classical computational approaches: when there is only a small amount of experimental training data available (about 30 sequences in our specific cases). In both classification and ranking tasks, DW performs comparably to SA and SQA and better than Lasso and XGB. MLR performs comparably with the annealing methods, but its error bars are much larger, indicating that its performance is less stable and more dependent on the training data. Moreover, the similarity between DW and SQA suggests that for small training sizes DW functions very nearly like a noiseless quantum annealer, as captured by quantum Monte Carlo simulations. With a larger training size, DW’s performance decreases relative to the classical approaches for all three TFs, though its results remain competitive. The decrease in the performance of all annealing methods (DW, SA, and SQA) seems to indicate a limitation of methods with discrete weights, which enforce simpler models. Such models may be more advantageous with a small number of training samples because they prevent overfitting; with larger amounts of training data, however, a simpler model may not adequately capture the variation within the data and hence suffers worse performance. Nevertheless, the fact that Lasso uses the same \(L_1\) norm as the annealing methods (i.e., DW, SA and SQA), yet does not perform as well, indicates an advantage of such annealing methods when training with a small number of sequences. This is consistent with the finding reported in ref. 18.

Weight Logos from Feature Weights

Since the one-hot encoding was used with a linear model to represent the DNA sequence, the weights returned by DW, SA, and MLR reflect the relative importance of the corresponding nucleotide at each position in the sequence for the binding score. The magnitudes of these feature weights for DW, SA, and MLR can be visualized as 'weight logos', presented in Fig. 4 for the Mad, Max, and Myc gcPBM data sets. XGB, which finds an ensemble of trees, does not assign weights to individual nucleotides and hence does not easily lend itself to such visualization. Similar plots for SQA and Lasso are shown in the Supplementary Material, Sec. SIIB and Fig. S7. The weight logos show the contribution of nucleotides at particular positions to the strength of binding. The contribution of a nucleotide at a particular position in the sequence is represented by its height; nucleotides with the smallest weights are at the bottom and those with the largest weights are at the top. The weight logos in Fig. 4 were obtained by averaging the weights from the 50 training instances of the same number of sequences with the AUPRC as the objective; in other words, each logo represents the average of the weights that give the AUPRCs shown in Fig. 3a. DW, SA, and MLR all perform very similarly and give weight logos in good agreement with the expected consensus sequence, CANNTG, demonstrating that all methods are able to capture biologically relevant information.
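As a sketch, a weight logo of this kind could be rendered from the averaged weight vector as follows (a simplified, hypothetical plotting routine; the paper's actual figures may differ in styling):

```python
import matplotlib.pyplot as plt
import numpy as np

BASES = "ACGT"  # column order matches the one-hot encoding A, C, G, T

def plot_weight_logo(avg_weights, ax=None):
    """Draw a simple stacked-bar 'weight logo': at each position the four bases
    are stacked by the magnitude of their averaged weight, smallest at the
    bottom and largest at the top."""
    W = np.abs(np.asarray(avg_weights).reshape(-1, 4))  # L x 4, position-major
    ax = ax or plt.gca()
    for pos, row in enumerate(W):
        bottom = 0.0
        for k in np.argsort(row):               # ascending: smallest drawn first
            ax.bar(pos, row[k], bottom=bottom, color=f"C{k}")
            ax.text(pos, bottom + row[k] / 2, BASES[k], ha="center", va="center")
            bottom += row[k]
    ax.set_xlabel("position in sequence")
    ax.set_ylabel("|average weight|")
    return ax
```

For example, `plot_weight_logo(weights.mean(axis=0))`, where `weights` stacks the 50 trained weight vectors row-wise.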

Fig. 4 Comparison of feature weights visualized as weight logos for DW, SA, and MLR. Weights represent the relative importance of a nucleotide at each position for the binding affinity. These weight logos were obtained using the Mad, Max, and Myc gcPBM data sets when training with the AUPRC as the objective

Performance on HT-SELEX Data

HT-SELEX41,43 is an in vitro method, complementary to PBM, for investigating the relative binding affinity of a TF for particular DNA sequences. We present results for the Max homodimer and for TCF4, another member of the bHLH family with consensus sequence CANNTG, using data from HT-SELEX experiments.47 The Max data set consisted of 3200 sequences of 12 bp in length, and the TCF4 data set was modified to contain 1800 sequences of 12 bp in length.

The procedure for splitting each data set into test and training data was similar to that described earlier for the gcPBM data sets (see Fig. 2); there was no overlap between training and testing data. The quantitative results for classification and ranking performance of the six machine learning approaches are summarized in Fig. 5a,b, and the weight logos for DW, SA, and MLR are shown in Fig. 5c. As with the gcPBM data, when training with about 30 sequences (1% of the training data for Max and 2% for TCF4), DW, SA and SQA exhibit the best performance on the test data set. MLR matches the annealing protocols with a threshold at the 70th and 80th percentiles of the data, but does worse at the higher percentiles (left column in Fig. 5a,b). Lasso and XGB have the poorest performance. When training with about 150 sequences (5% of the training data for Max and 10% for TCF4), XGB performs very well, as on the gcPBM data sets with more training data; MLR does well on the Max data set but rather poorly on the TCF4 data set, and Lasso is comparable to MLR. DW’s performance is worse than that of the best performing method (XGB), but comparable to the other methods (right column in Fig. 5a,b). XGB’s performance on the TCF4 data set is much better than that of the other methods, except when thresholding at the 99th percentile.

Fig. 5 Summary of results for HT-SELEX data. a Comparison of the AUPRC (top) and Kendall’s τ (bottom) when training with 1% of the data (left) and 5% of the data (right) for Max. Error bars are standard deviations over 50 instances. b Comparison of the AUPRC (top) and Kendall’s τ (bottom) when training with 2% of the data (left) and 10% of the data (right) for TCF4. c Weight logos for Max and TCF4 from training with the AUPRC as the objective

In terms of Kendall’s τ (Fig. 5a,b, bottom), all methods have similar performance when training with about 30 sequences, with the exception of XGB, which does not do as well on the Max data set. When training with about 150 sequences, XGB gives the best ranking performance, as it did with the gcPBM data, and the other methods all perform similarly. Finally, the weight logos in Fig. 5c indicate that DW, SA, and MLR capture patterns in the data that are in good agreement with the expected consensus sequence. The weight logos for the Max and TCF4 HT-SELEX data sets from SQA and Lasso are reported in the Supplementary Material, Fig. S8.