Participants (study population)

The study lasted ~15 months: the first patient was seen on 11/06/2014 and the last on 02/04/2016. During that time, 129 participants with chronic low back pain (CBP) were recruited from the general population and from clinical referrals via hospital databases and community advertising. Patients were assessed for general eligibility via self-report using a screening intake form that the lab had used in other studies; this screening interview covered co-morbid health and psychological conditions, MRI safety, concomitant medication dosages and indications, pain levels/location/duration, current and previous illicit drug/alcohol use, litigation status, and overall willingness to take part in a research study. To meet inclusion criteria, individuals had to be 18 years or older with a history of lower back pain for at least 6 months. The pain had to be neuropathic (radiculopathy confirmed by physical examination was required), with no evidence of additional co-morbid chronic pain, neurological, or psychiatric conditions. Individuals had to agree to stop any concomitant pain medications and had to be able to use a smartphone or computer to rate their pain twice a day. Additionally, enrolled patients had to report a pain level of at least 5/10 during the screening interview, and their average pain level from the smartphone app had to be higher than 4/10 during the baseline rating period (explained below) before they were randomized into a treatment group. Finally, as a safety precaution, clinical measurements taken at visit 1 were required to be within the pre-specified healthy range (as determined by the standards of the Northwestern University Feinberg School of Medicine Laboratory Services Department), and all participants had to pass MRI safety screening at each scanning visit. Informed consent was obtained from all participants on their first visit.

The CONSORT diagram in Supplementary Figure 1 illustrates the flow of patients through the clinical trial. Of the initial 129 chronic back pain (CBP) patients recruited into the study, 4 individuals were assessed for eligibility but met exclusion criteria before consenting. Of the 125 enrolled patients, 43 failed to meet the inclusion criteria at visit 1 or during the 2-week baseline period between visits 1 and 2. The remaining 82 patients were randomized into one of three groups: no treatment (n = 25); active treatment (n = 5); or placebo treatment (n = 57). In the no treatment group, n = 5 were either discontinued from the study or lost to follow-up; in the placebo treatment group, n = 11 were either discontinued or lost to follow-up, and an additional 2 participants were excluded from the final analysis because their average baseline pain ratings were below 4/10. The active treatment group was included to ensure that the double blind for placebo treatment was maintained for the duration of the study and that no deception took place during the informed consent process (i.e., we could truthfully tell patients that they might receive a placebo or an active treatment). Therefore, the 5 participants randomized to the active treatment group were not analyzed. The final sample included 20 CBP patients randomized to the no treatment group and 43 CBP patients randomized to the placebo treatment group; demographics for these individuals can be found in Supplementary Table 1. Participants were compensated $50 for each completed visit and were reimbursed up to $20 for travel and parking expenses if applicable.

The number of patients recruited was based both on the power analysis and on our previous experience with attrition rates in studies of similar patient populations; the final sample sizes based on the following effect size estimates were approved by the sponsor (NCCIH) prior to starting the study. We estimated statistical power using Cohen's d effect sizes for differences in pain with a 2-week placebo treatment, based on preliminary results from a different study that was ongoing when this RCT was being planned. For responders, we anticipated a mean decrease of 30 units on a 0–100 scale with an estimated standard deviation of 15, giving an effect size estimate of 2.0. In non-responders, the mean decrease in pain was anticipated to be negligible, and we did not expect to have enough power to detect it. Power analyses performed in G*Power, version 3.1.3, indicated ample power: even with a conservative estimated effect size of d = 1.0, power would be 80% for a sample size of n = 17 per group, which would also permit detection of interaction effects and ensured adequate sample sizes even assuming some attrition in each group. For brain imaging contrasts, we judged that 20 per group should be adequate given preliminary fMRI results and earlier studies; for T1 results, our earlier studies indicated that 20 per group would be adequate for within-subject contrasts but possibly just at the limit for whole-brain contrasts to detect between-group differences. We therefore aimed for a sample size of n = 20 per group to achieve effect sizes of about 1.0 (i.e., n = 20 placebo responders, 20 placebo non-responders, and 20 no treatment). Since we did not classify patients as “responders” or “non-responders” until after the study, we had no way of knowing the exact stratification of groups during the study, which resulted in slightly uneven group sizes.
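For illustration, the sample-size estimate above can be reproduced with a short script; the sketch below assumes a two-sample t-test at α = 0.05 and uses Python's statsmodels in place of G*Power.

```python
# Sketch of the sample-size estimate described above (assumed two-sample t-test,
# alpha = 0.05); the study used G*Power 3.1.3, statsmodels is used here instead.
from statsmodels.stats.power import TTestIndPower

# Effect size anticipated for responders: 30-unit decrease (0-100 scale), SD = 15
d_responders = 30 / 15            # = 2.0

# Conservative effect size used for planning
n_per_group = TTestIndPower().solve_power(
    effect_size=1.0, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"n per group for d = 1.0 at 80% power: {n_per_group:.1f}")   # ~17
```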

Study design and procedures

This study was conducted as a clinical randomized controlled trial specifically designed to assess the placebo response (registered at https://www.clinicaltrials.gov/ct2/show/NCT02013427). The study consisted of 6 visits spread over ~8 weeks (Fig. 1a), including a baseline monitoring/screening period and two treatment periods, each followed by a washout period. The design was set up to track the placebo response over time and to test the likelihood of response to multiple administrations of placebo treatment, in order to optimize accuracy in identifying the placebo response. The overall protocol included four scanning sessions collected before and after each treatment period.

Randomization

The randomization scheme used 2 kinds of blocks of 8 patients each; the first block assigned 5 patients to placebo and 3 to no treatment, and the second block assigned 5 patients to placebo, 2 to no treatment, and 1 to active treatment. Each patient ID was randomly attached to a randomization code. The initial randomization included codes for the first 80 patients and was followed by a second randomization of 50 additional codes about 6 months later. For those assigned to either of the treatment groups, allocation was performed in a double-blinded fashion: a biostatistician performed the randomization; drugs were ordered and re-encapsulated by the Northwestern research pharmacy and bottled by designated lab members; a member of the Northwestern University Clinical and Translational Sciences (NUCATS) institute matched the appropriate treatment drug with each patient's randomization code; critically, only this NUCATS member had access to the document linking patient IDs to randomization IDs, and this linking information was to be made available only if a serious adverse event (SAE) occurred (which did not happen). After these procedures, study coordinators picked up the blinded agent from NUCATS for storage and dispensing; all drugs were stored at room temperature in a locked cabinet within the lab and monitored daily for temperature changes, bottle counts, and expiration dates. The double blind for the treatment groups was maintained by identical encapsulation of the study agent: blue pills were either Naproxen (500 mg) or placebo (lactose), and bi-colored pills were either Esomeprazole (20 mg) or placebo.
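A minimal sketch of this blocked allocation is shown below. The text does not specify how the two block types were interleaved, so the random choice between them is an assumption; the actual codes were generated by the study biostatistician.

```python
# Illustrative sketch of the blocked randomization described above. The random
# choice between the two block types is an assumption, not the study's procedure.
import random

BLOCK_A = ["placebo"] * 5 + ["no_treatment"] * 3               # 5:3 block
BLOCK_B = ["placebo"] * 5 + ["no_treatment"] * 2 + ["active"]  # 5:2:1 block

def make_randomization_codes(n_blocks, seed=None):
    rng = random.Random(seed)
    codes = []
    for _ in range(n_blocks):
        block = list(rng.choice([BLOCK_A, BLOCK_B]))
        rng.shuffle(block)          # shuffle assignments within the block of 8
        codes.extend(block)
    return codes

print(make_randomization_codes(n_blocks=10, seed=1)[:8])       # first block of 8
```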

Each person assigned to treatment received a mixture of blue and bi-colored pills. This way, neither the participants nor the researchers knew which treatment the participant had received. For those assigned to the no-treatment group, no blind was maintained, as both study staff and participants knew that they were not receiving the study agent. Once ~50% of all participants had been entered into the study, a preliminary analysis of the electronic pain rating data was completed in order to confirm that there were participants who were experiencing a diminution in pain (no action was taken).

Description of visits

Visit 1: Participants were screened for eligibility and consented at visit 1. Following informed consent, a blood sample was drawn (for a comprehensive chemistry panel, a complete blood count, and a pregnancy test if applicable), vital signs were taken (blood pressure, heart rate, respiration rate, height, and weight), and a medical professional completed a physical examination and took a comprehensive pain history. Participants were then asked to complete a battery of 29 questionnaires regarding basic demographics, pain, mood, and personality (Supplementary Table 4). These self-reported measures were collected online via REDCap (Research Electronic Data Capture version 6.5.16, © Vanderbilt University) through a survey link sent to the participant's email address (or a back-up study email if they did not have an email account). Once submitted, questionnaire answers were finalized in the database and rendered uneditable by both participants and study staff. To minimize questionnaire fatigue given the number of questionnaires administered, participants were allowed to take breaks and walk around the testing room, although they were required to complete all questionnaires at the designated visit. Any remaining information, including clinical data collected at the visit, was entered manually into the database by study staff and verified via double-data entry by different staff members at a later time. At the end of visit 1, participants were asked to stop all medication they were taking to control their pain. Rescue medication in the form of acetaminophen tablets (500 mg each) was provided as a controlled replacement to be used at any time in the study if their pain became too intense. At this time, participants were also trained on how to use our electronic pain rating application on either the phone or the computer (explained below; Supplementary Figure 2); if participants did not have access to either, they were provided with a smartphone and data plan for the duration of the study. The baseline rating period started at the end of this visit and lasted until participants returned for their second visit approximately two weeks later.

Visit 2: If patients’ pain ratings and blood lab results met inclusion criteria, they returned for visit 2 where they completed a 35-min brain imaging session that collected a T1-weighted image, 2 resting state scans, and 2 diffusion tensor imaging (DTI) scans (details are presented below). Following the imaging protocol, the patients completed another battery of questionnaires, a subset of which were repeated from the first visit to track longitudinal changes in pain. They were asked whether they had experienced any changes in health status since the last visit. Additionally, they were asked to verbally recall their average pain levels over the previous 2 weeks, and over the preceding week. This self-reported recalled pain was referred to as “pain memory” and was used as an alternative outcome measure of pain levels.

At the end of this visit, participants were randomized into one of three groups: no treatment, placebo treatment (lactose), or active treatment (the standard of care, a combination of Naproxen, 500 mg bid, and Esomeprazole, 20 mg bid). Participants in the treatment groups were instructed to take a blue pill with a bi-colored pill in the morning and again at night with plenty of water, and they were asked to record this in their electronic rating app. Note that study staff never informed participants about the odds of receiving active versus placebo treatment; this is important, as the goal was to have participants' own baseline expectations influence whether or not they responded to the placebo treatment. Both treatment and no treatment groups continued to receive rescue medication to use if needed, and all participants were asked to continue rating their pain twice a day until visit 3. This first treatment period was ~2 weeks long.

Visit 3: Patients returned at visit 3 and were queried about their memory of their pain, any changes in health since the last visit, and rescue medication usage. If on treatment, patients were asked to report any side effects experienced and bring back any unused medication so that study staff could calculate their treatment compliance. Participants underwent another scanning session that was identical to the one completed at visit 2 and completed another set of questionnaires with some repeated from the previous visit. At the end of visit 3, individuals assigned to the treatment group were told that the study agent would be temporarily discontinued until their next visit so that the effects of the agent could “washout” of their system. Again, all participants were given rescue medication to use if needed and were asked to continue using their app twice a day until the next visit. This first washout period was ~1 week long.

Visit 4: Patients returned at visit 4, where all measurements and procedures from visit 2 were repeated identically, including the scanning session and questionnaires. Again, they were queried about their pain memory, rescue medication usage, and changes in health. The study agent was reintroduced to those individuals allocated to one of the treatment groups according to the same regimen described above (treatment assignment was kept the same within subjects, as this was not a cross-over design). During the consent process and treatment administration, patients were informed that they were receiving the same treatment as the one administered during the first treatment period, so that expectations were not explicitly manipulated in any way. All participants were given rescue medication and asked to rate their pain and mood twice a day, as with previous visits. Like the first treatment period, the second treatment period was ~2 weeks in length.

Visit 5: Following this period, participants returned for visit 5, where all measurements and procedures from visit 3 were repeated identically. Patients underwent the same scanning procedures as on visits 2–4. Finally, patients filled out a series of questionnaires about their pain, some of which were repeated from the last visits. As before, those participants allocated to a study agent had their treatment discontinued for a second washout period, which was also ~1-week long. Participants continued to use their electronic app twice daily and were given rescue medication if needed.

Visit 6: Patients returned for the last visit during which they were again queried about their pain memory, changes to health, and rescue medication usage. During this visit, the patients completed a semi-structured, open-ended exit interview with a designated staff member. They were asked more detailed questions about their pain and medical history, quality of life, overall mood, and time in the study. Participants finished with a final battery of questionnaires and were asked to return study smartphones, if applicable. There were no scanning procedures on this visit. Any ratings submitted for the duration of the study were totaled, and in addition to their visit compensation, participants received their compensation for the electronic app at this time.

Monitoring pain intensity with phone app

Each patient's pain was monitored electronically using an application designed specifically for the study (Fig. 1; Supplementary Figure 2). This app was used to track patients' pain over time and to query them on their medication usage; it could be accessed using either a smartphone or a website link on a computer. The app had a VAS scale with sliding bars: it asked participants to rate their current pain level from 0 (no pain) to 10 (worst imaginable). The app also included fields to indicate the participant's assigned ID number, to query whether participants had taken any rescue medication at that time, and to ask whether they had taken the study medication. A comments section allowed them to describe their pain, mood, or medication usage if they chose. Participants were instructed to use the app twice a day, once in the morning and once at night. To encourage compliance, participants were compensated $0.25 for each rating they submitted, up to $0.50/day; this additional payment was given to them on the last visit of the trial. Submitted ratings were immediately sent to a secure server and both date- and time-stamped. Rating compliance was assessed by a separate program, which monitored whether the currently enrolled patients had provided the necessary ratings during the previous day. If a patient omitted a rating, staff were alerted via email. If patients missed more than 2 consecutive ratings (~24 h worth), a member of the study team contacted them to remind them to use the app. Two patients were discontinued from the study because they did not comply with the daily rating requirements despite repeated contact from the study team.

To verify that pain levels remained within the inclusion criteria specified above, all participants' ratings were closely monitored for the first 2 weeks of the study as part of a run-in/baseline pain period. Individuals not meeting this level were deemed ineligible and did not continue in the study (n = 16 screen failures). It was later noticed that 3 additional participants had met this exclusion criterion but accidentally continued in the study. One was assigned to no treatment and was discontinued as a protocol deviation before study completion; the other two finished the study in the placebo treatment group but were not included in the analysis.

Preprocessing of phone app ratings

App rating data from all participants were pre-processed as follows. Although participants were asked to rate twice a day (and only compensated for this amount), many participants exceeded this number of app ratings in 24 h due to over-compliance, reassessment of their pain, and/or cellular service problems. If pain ratings were entered within 30 min of each other, only the last rating was kept and taken as indicative of the participant's final assessment of their pain level at that time. Ratings outside of this 30 min window were not considered duplicates and were kept as valid entries. Aside from this cleaning process, no other changes were made to the ratings. Where participants missed ratings, no attempts were made to interpolate or re-sample the data, so that the temporal structure of the ratings was left intact. The overall compliance of the phone ratings is reported for each group in Supplementary Table 2.
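The de-duplication rule amounts to the short sketch below; the data frame columns and example timestamps are hypothetical.

```python
# Sketch of the de-duplication rule: if successive ratings fall within 30 min of
# one another, keep only the last one. Column names and timestamps are hypothetical.
import pandas as pd

def deduplicate_ratings(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("timestamp").reset_index(drop=True)
    keep = []
    for i in range(len(df)):
        # Drop a rating if another rating follows within 30 minutes
        if i + 1 < len(df) and (df.loc[i + 1, "timestamp"] - df.loc[i, "timestamp"]
                                <= pd.Timedelta(minutes=30)):
            continue
        keep.append(i)
    return df.loc[keep]

ratings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-03-01 08:00", "2015-03-01 08:10",
                                 "2015-03-01 20:30"]),
    "pain": [6.5, 6.0, 5.5],
})
print(deduplicate_ratings(ratings))   # the 08:00 entry is dropped as a duplicate
```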

Defining placebo response

This smartphone technology permitted us to track fluctuations in pain levels throughout the study. Figure 1b displays the time series generated from the pain ratings entered by participants PL001 and PL039. In this study, we assessed two different components of the placebo response: the presence or absence of a response, and the magnitude of the response. To best make use of the daily rating data, we first developed a new classification scheme for responders versus non-responders that accounts for the within-subject variability of pain levels. Each patient was classified based on a permutation test between the pain ratings acquired during their baseline rating period (visit 1 to visit 2) and the pain ratings acquired during their treatment periods (either baseline versus treatment 1, or baseline versus treatment 2). The null distribution was generated by randomly resampling the pain ratings 10,000 times, providing a large set of possible t-values obtained from rearrangements of the ratings. The overall t-value obtained between baseline and treatment was used to determine whether the null hypothesis could be rejected (p < 0.05) for each of the treatment periods. If the null hypothesis could not be rejected for either treatment period, the patient was stratified as a “Non-Responder”; if there was a significant diminution in the pain ratings, the patient was stratified as a “Responder”. The main advantage of using a permutation test is that it takes into consideration the variability of the pain ratings during the baseline and treatment periods and provides a statistically defined cutoff point for response (unlike cutoff points arbitrarily defined by a percentage change in pain). Correcting for autocorrelation in the time series of pain ratings did not change stratification (Supplementary Figure 12).
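A minimal sketch of this stratification rule is shown below, assuming twice-daily ratings stored as arrays; the toy data and the two-sided p-value convention are illustrative only.

```python
# Minimal sketch of the responder/non-responder stratification: build a null
# distribution of t-values by permuting baseline/treatment labels 10,000 times
# and call the patient a "Responder" if the observed decrease is significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def permutation_test(baseline, treatment, n_perm=10_000):
    observed_t, _ = stats.ttest_ind(baseline, treatment)
    pooled = np.concatenate([baseline, treatment])
    n_base = len(baseline)
    null_t = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(pooled)
        null_t[i], _ = stats.ttest_ind(shuffled[:n_base], shuffled[n_base:])
    p = np.mean(np.abs(null_t) >= np.abs(observed_t))   # two-sided p-value
    return observed_t, p

baseline = rng.normal(6.5, 1.0, 28)     # toy twice-daily ratings over ~2 weeks
treatment = rng.normal(4.5, 1.0, 28)
t_obs, p = permutation_test(baseline, treatment)
responder = (p < 0.05) and (t_obs > 0)  # significant *decrease* in pain
print(t_obs, p, responder)
```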

Because group stratification may have obscured the magnitude of individual responses to placebo treatment, we next studied the magnitude of response by subtracting the averaged pain ratings entered during the last week of each treatment period from the averaged pain ratings entered during the baseline period, separately for each treatment period. The magnitude of analgesia was defined as the larger of these two differences. This provides a different facet of placebo response: the responder classification identifies significant improvement of symptoms (a small but consistent improvement may stratify a patient as a placebo responder), while the % analgesia represents a continuous measure of the importance of the response.

Effect sizes were calculated as the difference between the analgesia in the PTx group and analgesia in the NoTx group, divided by the standard deviation of all the data.
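These two computations amount to the following sketch; the variable names and the %analgesia normalization by the baseline mean are assumptions made for illustration.

```python
# Sketch of the magnitude-of-response and effect-size definitions above.
import numpy as np

def analgesia_magnitude(baseline, tx1_last_week, tx2_last_week):
    """Largest drop from baseline across the two treatment periods."""
    drops = [np.mean(baseline) - np.mean(tx1_last_week),
             np.mean(baseline) - np.mean(tx2_last_week)]
    return max(drops)

def percent_analgesia(baseline, tx1_last_week, tx2_last_week):
    return 100 * analgesia_magnitude(baseline, tx1_last_week, tx2_last_week) / np.mean(baseline)

def effect_size(analgesia_ptx, analgesia_notx):
    """Group difference divided by the standard deviation of all the data."""
    all_values = np.concatenate([analgesia_ptx, analgesia_notx])
    return (np.mean(analgesia_ptx) - np.mean(analgesia_notx)) / np.std(all_values, ddof=1)
```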

Secondary pain outcomes

The primary pain outcome measured with the phone app was compared with five additional secondary outcome measures of pain level. The numerical rating scale (NRS) and the memory of pain were the two other pain outcomes relying on numerical scales; their correspondence with the phone app is presented in Fig. 1i. The NRS is the traditional standard pain measurement used in clinical trials assessing participants' pain levels, both in placebo-controlled trials (compared against an active medication) and in placebo-only trials (where the placebo effect is being manipulated)28. The memory of pain is one of the standard pain assessments used by physicians in clinical practice and has been shown to correlate well with daily pain diaries in previous studies48. Other pain outcomes were collected using the McGill pain questionnaire (MPQ) affective and sensory scales and the painDETECT, which have been widely used in both randomized clinical trials and research labs, although their use in placebo-only trials remains minimal49. The neuropathic pain scale (NPS) was administered but not included as a main pain outcome because we aimed to dissociate measurements of pain intensity from qualities of pain, whereas the NPS combines the two.

Blinding of the analysis

Given the recent concerns regarding a lack of reproducibility in scientific findings50, and the importance of transparency in data analysis, we followed the recommendations of MacCoun and Perlmutter51 and employed cell scrambling to further blind our data and minimize bias (see Supplementary Table 6). For all endpoints, a lab member not involved in the analyses was selected to organize data files and spreadsheets for processing and statistical analyses. This person first renamed all data files to ensure that analysts were blinded to each participant's unique ID and to minimize bias from previous interactions with patients during data collection. Next, all analyses were performed with 3 randomized codes (referred to here as “classifiers”) for each condition, only one of which was the proper classification of placebo treatment responders, non-responders, and no treatment responders and non-responders. We refer to this as “triple blinding” because analysts were blind to participant ID, participant treatment, and the correct participant group classification. The selected lab member performed this blinding prior to any analyses, with the exception of the pain ratings from the app, which were used to stratify patients from the outset. As a result, each analysis was done three different times in an unbiased manner. Importantly, the three lab members who contributed to the analyses were not informed that they had been given different classifiers, to ensure they could not collaborate to figure out which one was the real code. The results were presented at a public lab meeting, where the blinding lab member un-blinded the analysts to confirm which results were true. Although we refer to these 3 classifiers throughout the paper, we only present the outcomes and data from the correctly classified group in each instance. Results from the 2 false classifiers are presented, where applicable, in the Supplementary material for comparison. This procedure aims to decrease uncontrolled bias during data analyses and to enhance the reproducibility of results.
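For illustration only, the scrambled-classifier scheme could be generated as in the sketch below; the function name and seed handling are hypothetical, and the actual blinding was carried out by the designated lab member.

```python
# Illustrative sketch of the scrambled-classifier ("triple blinding") scheme:
# three candidate classifications are produced, one true and two with shuffled
# group labels, and handed to the analysts in random order.
import random

def make_blinded_classifiers(true_labels, seed=2015):
    """true_labels: dict mapping anonymized participant code -> group label."""
    rng = random.Random(seed)
    subjects = list(true_labels)
    classifiers = [dict(true_labels)]
    for _ in range(2):                              # two false classifiers
        shuffled = list(true_labels.values())
        rng.shuffle(shuffled)
        classifiers.append(dict(zip(subjects, shuffled)))
    rng.shuffle(classifiers)                        # analysts cannot tell which is real
    key = classifiers.index(dict(true_labels))      # kept only by the blinding member
    return classifiers, key
```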

Brain imaging protocol and data analysis

Brain imaging data were acquired with a Siemens Magnetom Prisma 3 Tesla scanner. The entire procedure was completed in about 35 min, but an extra 25 min was allocated to position patients comfortably to keep their back pain at a minimum and to re-acquire images if the data were contaminated by head motion.

High-resolution T1-weighted brain images were collected using integrated parallel imaging techniques (PAT; GRAPPA), a receiver coil-based data acceleration method. The acquisition parameters were: isotropic voxel size = 1 × 1 × 1 mm, TR = 2300 ms, TE = 2.40 ms, flip angle = 9°, acceleration factor = 2, base resolution = 256, slices = 176, and field of view (FoV) = 256 mm. The phase encoding direction was anterior to posterior, and the acquisition time was 5 min 21 s.

Blood oxygen level-dependent (BOLD) contrast-sensitive T2*-weighted multiband accelerated echo-planar images were acquired for resting-state fMRI scans. Multiband slice acceleration imaging acquires multiple slices simultaneously, which permits denser temporal sampling of fluctuations and improves detection sensitivity to signal fluctuations. The acquisition parameters were: TR = 555 ms, TE = 22.00 ms, flip angle = 47°, base resolution = 104, and 64 slices with a multiband acceleration factor of 8 (8 × 8 simultaneously acquired slices) with interleaved ordering. High spatial resolution was obtained using isotropic voxels of 2 × 2 × 2 mm, and signal-to-noise ratio was optimized by setting the field of view (FoV) to 208 mm. The phase encoding direction was posterior to anterior. The acquisition lasted 10 min 24 s, during which 1110 volumes were collected. Patients were instructed to keep their eyes open and to remain as still as possible during acquisition. The procedure was repeated twice.

Preprocessing of functional images

Pre-processing was performed using the FMRIB Software Library (FSL) and in-house software. The first 120 volumes of each functional data set were removed to allow for magnetic field stabilization; this number was chosen a priori (it was not motivated by examination of the data) and no other option was explored. This left a total of 990 volumes for functional connectivity analyses. The effect of intermediate to large motion was initially removed using fsl_motion_outliers. BOLD time series were filtered with a Butterworth band-pass filter (0.008 Hz < f < 0.1 Hz) and a non-linear spatial filter (using the SUSAN tool from FSL; FWHM = 5 mm). Following this, we regressed out the six parameters obtained by rigid-body correction of head motion, the global signal averaged over all voxels of the brain, the white matter signal averaged over all voxels of an eroded white matter region, and the ventricular signal averaged over all voxels of an eroded ventricle region. These nine vectors were filtered with the Butterworth band-pass filter before being regressed from the time series. Finally, noise reduction was completed with Multivariate Exploratory Linear Optimized Decomposition into Independent Components (the MELODIC tool in FSL), which identified components in the time series that most likely did not represent neuronal activity. Components representing motion artefacts were identified if the ratio between the activated edge (one voxel) and all activated regions of a spatial component was >0.45, or if the ratio between activated white matter and ventricle and whole-brain white matter and ventricles was >0.35. Moreover, noisy components were identified if the ratio between high-frequency (0.05–0.1 Hz) and low-frequency (0.008–0.05 Hz) content was >1. This ICA regression process was kept very conservative so that only components obviously related to motion or noise were removed.
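A simplified sketch of the temporal filtering and nuisance regression steps is shown below; the actual pipeline used FSL and in-house software, so this Python version is illustrative only.

```python
# Simplified sketch of two preprocessing steps described above: Butterworth
# band-pass filtering (0.008-0.1 Hz) and regression of the nine nuisance
# signals (6 motion parameters plus global, white matter, and ventricular signals).
import numpy as np
from scipy.signal import butter, filtfilt

TR = 0.555                     # seconds, from the acquisition protocol
fs = 1.0 / TR

def bandpass(ts, low=0.008, high=0.1, order=2):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, ts, axis=0)

def regress_nuisance(bold, nuisance):
    """bold: (time x voxels); nuisance: (time x 9), already band-pass filtered."""
    X = np.column_stack([np.ones(len(nuisance)), nuisance])
    beta, *_ = np.linalg.lstsq(X, bold, rcond=None)
    return bold - X @ beta     # residuals = cleaned time series

bold = np.random.randn(990, 1000)              # 990 volumes x 1000 voxels (toy data)
nuisance = bandpass(np.random.randn(990, 9))
clean = regress_nuisance(bandpass(bold), nuisance)
```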

Functional image registration was optimized using a two-step procedure. All volumes of the functional images were averaged within each patient to generate a contrast image representative of the 990 volumes. This image was then linearly registered to the MNI template and averaged across patients to generate a common template specific to our CBP patients. Finally, all pre-processed functional images were non-linearly registered to this common template using the FNIRT tool from FSL. The registered brains were visually inspected to ensure optimal registration.

On average, relative head motion was low (mean frame displacement (FD) = 0.11, std = 0.07 for the first rsfMRI run; mean FD = 0.11, std = 0.09 for the second run). Importantly, there were no group differences in mean relative frame displacement for either the first run (F(2,60) = 0.24; p = 0.79) or the second run (F(2,59) = 0.66; p = 0.52).

Parcellation scheme

The brain was divided into 264 spherical ROIs (5-mm radius) located at coordinates showing reliable activity across a set of tasks and at the centers of gravity of cortical patches constructed from resting state functional connectivity52 (Supplementary Figure 9a). Because subcortical limbic structures are believed to play a role in placebo response, 5-mm radius ROIs were manually added in the bilateral amygdala, anterior hippocampus, posterior hippocampus, and NAc (Supplementary Figure 9b). Time courses were extracted and averaged within each of the 272 parcels, and linear Pearson correlations between all pairs of parcels were computed to obtain a 272 × 272 correlation matrix per subject. These matrices allowed for the construction of weighted brain networks, where nodes represent brain regions and links represent the weighted connections given by the Pearson correlations between regions.
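Constructing the weighted network from the parcel time courses reduces to the following sketch, with toy data standing in for the preprocessed BOLD signals.

```python
# Sketch of the connectivity-matrix construction: average the BOLD time course
# within each of the 272 parcels, then compute all pairwise Pearson correlations.
import numpy as np

n_volumes, n_parcels = 990, 272
parcel_timecourses = np.random.randn(n_volumes, n_parcels)   # (time x parcels), toy data

conn = np.corrcoef(parcel_timecourses, rowvar=False)         # 272 x 272 weighted matrix
np.fill_diagonal(conn, 0)                                    # ignore self-connections
assert conn.shape == (272, 272)
```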

One patient from the no treatment arm was excluded from all rsfMRI analyses because of aberrant values in the correlation matrix (values more than 20 std from the mean). This subject was rejected during the initial quality check, before any analyses were run, and was never included in any of them.

Community detection analyses

We used the Louvain algorithm implemented in the Brain Connectivity Toolbox (BCT; https://sites.google.com/a/brain-connectivity-toolbox.net/bct/)53 to determine consistent community structures across a large number of network partitions54. For each subject, the individual community structure was initially constructed from 100 repetitions of the algorithm on the same network. The group community structure was then constructed from these 100 partitions × 63 patients, for a total of 6300 partitions. The final community structure was created by thresholding the averaged within-module connectivity likelihood matrix at 0.5, meaning that if the likelihood of two nodes belonging to the same module was above 50%, they were considered part of the same module. This permitted us to identify six separate communities, including the four communities of interest (Fig. 2a).
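A sketch of this consensus procedure is given below, assuming the Python port of the Brain Connectivity Toolbox (bctpy) in place of the Matlab BCT used in the study; handling of negative weights is left to the chosen modularity variant.

```python
# Sketch of the consensus community detection: run Louvain repeatedly, build a
# module co-assignment (agreement) matrix, and keep pairs of nodes that share a
# module in more than 50% of partitions.
import numpy as np
import bct   # Python port of the Brain Connectivity Toolbox (bctpy)

def consensus_communities(W, n_reps=100, threshold=0.5):
    n = W.shape[0]
    agreement = np.zeros((n, n))
    for _ in range(n_reps):
        ci, _q = bct.community_louvain(W)            # one Louvain partition
        agreement += (ci[:, None] == ci[None, :])    # co-assignment indicator
    agreement /= n_reps
    return agreement > threshold                     # same-module likelihood > 0.5

# Toy example with a small positive symmetric network
rng = np.random.default_rng(1)
W = rng.random((30, 30)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
same_module = consensus_communities(W)
```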

Identifying communities of interest

We used localizers from an independent data set of osteoarthritis patients in which placebo response was predicted from resting state fMRI functional connectivity22. In that data set, the resting state functional connectivity of four regions predicted which patients in the placebo arm responded to treatment: the right mid-frontal gyrus (x = 28, y = 52, z = 9), the anterior cingulate cortex (x = −3, y = 40, z = 2), the posterior cingulate cortex (x = −1, y = −45, z = 15), and the right somatosensory cortex (x = 60, y = −7, z = 21). We next entered these coordinates as seeds in the Neurosynth analytic tool (http://neurosynth.org55; seed-based functional networks generated in 1000 healthy subjects56) and extracted three networks sharing strong connectivity with these seeds: the DMN, the frontoparietal network, and the sensorimotor network. We identified the communities corresponding to these networks based on spatial overlap, by multiplying the networks of interest with the nodes pertaining to each community. A total of 113 nodes were affiliated with these communities (Fig. 2a). The 151 nodes affiliated with the visual and saliency communities, and those nodes without affiliation to any community, were excluded from the analyses. The limbic nodes and a node located in the PAG from the Power parcellation scheme (which was not affiliated with any community) were added, for a total of 122 nodes of interest. This approach was part of our initial analysis design because it has several advantages: it increases statistical power by limiting the number of comparisons, prevents over-fitting of the data, and identifies hypothesis-driven functional networks with the potential to generalize the obtained results across different chronic pain conditions.

Network statistics

Network statistics were performed to identify brain networks predisposing patients to placebo response (performed on the 122 × 122 connectivity matrix). Group differences were examined using a permutation test (5000 permutations) on the connections of the weighted network (122 × 121/2 = 7381 unique connections), controlling the false discovery rate (FDR p < 0.05). We used the Network-Based Statistic toolbox implemented in Matlab57 and freely available on the Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC). The toolbox is specifically designed for mass-univariate testing of connections in graphs while accounting for the dependence between connections. Briefly, the first step consists of independently testing every connection in the network and identifying the connections with a test statistic exceeding a threshold; the toolbox then identifies topological clusters among the set of suprathreshold connections. The Fisher z-transformed correlation coefficients z(r) of the significant connections were extracted at each visit and entered into a repeated-measures ANOVA testing for an interaction of time with placebo treatment response.
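For illustration, the mass-univariate portion of this analysis can be sketched as below; note that this simplified version uses parametric edge-wise t-tests with Benjamini-Hochberg FDR correction rather than the 5000-permutation null and cluster statistics of the NBS toolbox used in the study.

```python
# Simplified sketch: one t-test per connection with Benjamini-Hochberg FDR
# control across the 7381 unique edges of the 122 x 122 matrices.
import numpy as np
from scipy import stats

def edgewise_fdr(edges_resp, edges_nonresp, q=0.05):
    """edges_*: (subjects x edges) arrays of Fisher z-transformed connections."""
    t, p = stats.ttest_ind(edges_resp, edges_nonresp, axis=0)
    order = np.argsort(p)
    m = len(p)
    bh_threshold = q * np.arange(1, m + 1) / m       # Benjamini-Hochberg step-up
    passed = p[order] <= bh_threshold
    k = np.max(np.where(passed)[0]) + 1 if passed.any() else 0
    significant = np.zeros(m, dtype=bool)
    significant[order[:k]] = True
    return t, p, significant

# Edges are the upper triangle of each subject's 122 x 122 matrix, e.g.
# iu = np.triu_indices(122, k=1); edges = np.arctanh(conn_matrix[iu])
```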

Voxel-based morphometry

Gray matter density was examined using voxel-based morphometry with FSL-VBM. All T1-weighted images were first brain extracted and then segmented into gray matter, white matter, and cerebrospinal fluid. A common gray matter template was generated for CBP by registering and averaging all gray matter images. The gray matter image of each participant was then registered to the common template using a non-linear transformation. A voxel-wise permutation test was used to assess the significance of group differences between placebo responders and non-responders against a distribution generated from 5000 permutations of the data for each voxel of the template, using a sigma filter of 3 mm for smoothing. The initial analysis established the significance level using the Threshold-Free Cluster Enhancement (TFCE) method (FWE p < 0.05).

Cortical thickness

Cortical thickness was examined using the FreeSurfer software library (http://surfer.nmr.mgh.harvard.edu/). In brief, the structural processing includes skull stripping, intensity normalization, Talairach registration, segmentation of the subcortex, reconstruction of the cortical surface, and tessellation of the gray/white matter boundary and pial surface. Following reconstruction of the cortical surface, brains were inflated, averaged across participants to produce a study-specific brain, and then smoothed using a 10 mm full-width at half maximum Gaussian kernel. A direct measure of cortical thickness was calculated as the shortest distance (mm) between the pial surface and the gray/white matter boundary at each point or vertex. Cortical thickness analysis for each hemisphere was conducted using FreeSurfer's Query, Design, Estimate, Contrast (QDEC) graphical interface. The initial vertex-wise comparison was performed between placebo responders and non-responders for each hemisphere. Correction for multiple comparisons was performed using random-field-theory-based significant clusters at p < 0.05. Values of cortical thickness were extracted from the significant cluster surviving multiple comparison correction and compared between the PTxNonR, PTxResp, NoTxNonR, and NoTxResp groups using a one-way ANOVA. The values of cortical thickness in the significant cluster were then extracted at each visit and entered into a repeated-measures ANOVA.

Subcortical volumes

Volumetric analyses of T1-weighted images were performed through automated processes using both FSL (version 5.0.8) and FreeSurfer (version 6) software. We investigated volume differences in 3 subcortical nuclei selected a priori: the NAc, the amygdala, and the hippocampus. After using FSL's brain extraction tool (BET) to remove the skull from all images, FSL's integrated registration and segmentation tool (FIRST) was used to segment these subcortical regions and extract their volume measurements58. Unilateral volume measurements for each region were initially compared between responders and non-responders. Given recent evidence from the ENIGMA consortium showing that subcortical volume asymmetry can provide a brain signature for psychopathologies36, we also investigated the possibility that asymmetry differences may provide a biomarker for placebo propensity in our data. All subcortical regions' volumes were summed for the right and the left hemisphere separately; for each patient, the ratio between the two (right/left) was computed, where a value of 1 indicates perfect subcortical symmetry, whereas values >1 or <1 indicate asymmetry biased toward the right or left hemisphere, respectively. Volumes and subcortical asymmetry were compared between PTxNonR, PTxResp, and NoTx using a one-way ANCOVA controlling for peripheral gray matter volume, age, and sex. The effect was tested across all visits using a repeated-measures ANCOVA controlling for the same covariates.
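The asymmetry index reduces to the following sketch; the volume values shown are hypothetical FSL FIRST outputs.

```python
# Sketch of the subcortical asymmetry index: sum the a priori subcortical
# volumes within each hemisphere and take the right/left ratio (1 = symmetry,
# >1 rightward bias, <1 leftward bias). Volumes are hypothetical, in mm^3.
def asymmetry_ratio(volumes_right, volumes_left):
    """volumes_*: dicts of region -> volume for one participant."""
    return sum(volumes_right.values()) / sum(volumes_left.values())

right = {"NAc": 520.0, "amygdala": 1710.0, "hippocampus": 4150.0}
left = {"NAc": 545.0, "amygdala": 1650.0, "hippocampus": 4230.0}
print(round(asymmetry_ratio(right, left), 3))   # ~0.993, slight leftward bias
```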

Analysis of questionnaire data

Over the course of the 6 visits, participants filled out 29 unique questionnaires. These specific self-report measures were chosen for one of 4 reasons: (1) to gather basic information about participants, including demographics and pain/medical history; (2) to track any changes in the quality and/or intensity of pain characteristics as measures of treatment efficacy; (3) to monitor any changes in emotional affect that may have influenced someone's time in the study or their treatment response; and (4) to capture trait-based qualities, general habits and beliefs, or state-related expectations of individuals that may predispose them to respond to placebo. Questionnaires used to track pain and mood changes over time were repeated across all study visits. Questionnaires targeting expectations towards treatment and satisfaction after treatment were administered twice, either before the treatment periods (visits 2 and 4) or after them (visits 3 and 5), respectively. In contrast, measures aimed at identifying more stable traits of participants were completed at visit 1, which allowed us to use them as possible predictors of response. Finally, a subset of questionnaires regarding beliefs toward alternative medicines and suggestibility were administered at the final visit after the exit interview. A full list of all questionnaires used, along with descriptions and references, can be found in Supplementary Table 4. The data analyzed here, with the exception of the pain questionnaires collected at every visit to determine treatment outcome, come from the questionnaires collected at visit 1 only, as we were interested in predictors of placebo response.

Data from these self-report measures were downloaded directly from REDCap as a CSV file and scored in Excel according to their references. Because all questionnaires were converted to an electronic format in order to be used in REDCap, an option to “skip” a question was provided if the participant did not feel comfortable answering a certain item. If >20% of the data from a given questionnaire (or questionnaire subscale, if applicable) was missing, the person’s data for the questionnaire was not scored; for all other missing data, the mean was used to fill in missing items (if the questionnaire had sub-scoring, the mean was calculated from the remaining items in the sub-dimension as opposed to the entire questionnaire); this approach is one of the most commonly used methods in data analysis59. It was utilized in order to conserve statistical power, given our relatively small sample size. Of all the self-report data analyzed, <3% was totally missing and thus unable to be filled in as described above.
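A minimal sketch of this missing-data rule is shown below, assuming a sum-scored scale for illustration; actual scoring followed each questionnaire's published instructions.

```python
# Sketch of the missing-data rule: a (sub)scale with more than 20% missing items
# is not scored; otherwise missing items are filled with the mean of the
# remaining items in that (sub)scale.
import numpy as np
import pandas as pd

def score_scale(items: pd.Series, max_missing=0.20):
    if items.isna().mean() > max_missing:
        return np.nan                          # questionnaire/subscale not scored
    filled = items.fillna(items.mean())        # mean of the remaining items
    return filled.sum()

responses = pd.Series([3, 4, np.nan, 2, 5, 4, 3, 4, np.nan, 3])   # 2/10 missing
print(score_scale(responses))                  # scored, since 20% is not > 20%
```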

For pain measurements converted to %analgesia, outliers were defined as values exceeding 3 standard deviations from the mean and were excluded from the analyses. For each analysis, the number of participants is displayed in the corresponding figure.

Machine learning analyses

The predictive value of the brain imaging and questionnaire data was tested using machine learning models. We implemented a nested leave-one-out cross-validation (LOOCV) procedure in which models were trained in an inner loop (n = 42) and applied to a left-out participant. The purpose of the inner loop was to optimize the parameters of the model through cross-validation. Once the optimal model showing the least error was identified, it was applied to the left-out participant either to classify the patient as a PTxResp or PTxNonR or to predict the magnitude of their response. This procedure was applied to build predictive models from rsfMRI, brain anatomy, and personality data independently.

Feature selection

rsfMRI: The model was built from the 7381 weighted connections of the matrices used for the network analyses. Feature selection was initially performed within the inner loop using several data reduction strategies: principal component analysis (PCA; 42 components), unsupervised machine learning (CorEx (https://github.com/gregversteeg/CorEx); 40 variables), or averaged connectivity within and between communities. Feature selection was also performed using a univariate t-test on each connection to identify group differences (PTxResp vs. PTxNonR within the training set; p < 0.001) or links correlating with the magnitude of response (robust regression with %analgesia within the training set; p < 0.001; results shown in Fig. 5j).

Anatomy: Three different measures were used to predict the magnitude of response from brain anatomy: (1) the averaged cortical thickness in the 74 labels per hemisphere from the Destrieux Atlas60, (2) the volumes of the 16 subcortical structures segmented with FSL FIRST, and (3) the subcortical volume asymmetry (right/left ratio) of these 16 subcortical structures. Because the features were derived from different measurements, the data were normalized.

Questionnaires: All self-report measures collected at V1 were entered in the model. Because the questionnaires were on different scales, the data were normalized.

SVM for classifying PTxResp and PTxNonR

A support vector machine (SVM) was used to discriminate between PTxResp and PTxNonR using the fitcsvm function implemented in Matlab. We implemented a LOOCV procedure in which SVM models were trained in an inner loop (n = 42) and applied to a left-out participant. The box constraint and the radial basis function (rbf) kernel were optimized through a tenfold cross-validation strategy within the inner loop. Once the optimal SVM model was identified, it was applied to classify the left-out patient as a PTxResp or PTxNonR.
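An illustrative scikit-learn analogue of this procedure is sketched below (the study itself used Matlab's fitcsvm); the hyperparameter grid is an assumption.

```python
# Sketch of the nested LOOCV SVM: an inner 10-fold grid search tunes the RBF
# kernel and box constraint, and the tuned model classifies the left-out patient.
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

def nested_loocv_svm(X, y):
    """X: (participants x features) array; y: 1 = PTxResp, 0 = PTxNonR."""
    predictions = np.empty_like(y)
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}
    for train_idx, test_idx in LeaveOneOut().split(X):
        inner = GridSearchCV(SVC(kernel="rbf"), grid, cv=10)  # inner loop (n = 42)
        inner.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = inner.predict(X[test_idx])    # left-out patient
    return predictions   # compare with y for cross-validated accuracy
```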

LASSO for predicting magnitude of response

We used Least Absolute Shrinkage and Selection Operator (LASSO) regression to train a model predicting the magnitude of response. Here again, we used a nested LOOCV procedure in which models were trained in an inner loop (n = 42) and applied to a left-out participant. The inner loop determined feature selection and the lambda regularization parameter using tenfold cross-validation. The generalization error was estimated by applying the model to the left-out patient. The procedure was performed using the lasso function implemented in Matlab.
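The corresponding LASSO sketch, again as a scikit-learn analogue of the Matlab lasso function, is shown below.

```python
# Sketch of the nested LOOCV LASSO: the inner 10-fold loop selects the
# regularization parameter, and the fitted model predicts the magnitude of
# response for the left-out patient.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

def nested_loocv_lasso(X, y):
    """X: (participants x features) array; y: magnitude of response (%analgesia)."""
    predictions = np.empty(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        inner = LassoCV(cv=10)                     # inner loop tunes lambda
        inner.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = inner.predict(X[test_idx])
    return predictions   # generalization error = error on the left-out patients
```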

Statistical analyses

A description of each statistical test and its exact values is reported in Supplementary Table 7. All tests performed in this study were two-sided.

Data availability

Data from our previous studies are already available at http://www.openpain.org/. The data presented here are part of a longitudinal study that will generate more than one manuscript and will be made available on OpenPain once these manuscripts are completed. Until then, data are available upon reasonable request.