Obtaining informed consent

All work was carried out under an IRB-approved protocol (BCM IRB H-32711), which used a main consent document for general participation in TCRB and an opt-in consent addendum for OA data release. Of >2,500 total TCRB participants, 194 were offered the option of signing the opt-in addendum for OA sharing, and more than half of these agreed to open access data sharing at the time of consent. The annotated TCRB specimen and data collection consent and the OA opt-in consent documents are available at http://txcrb.org/resources.html. To address concerns about whether patients can provide truly informed consent regarding the potential risks of genomic data sharing, a subset of the OA participants (n=37) was educated on the risks and societal benefits of data sharing. The educational materials are available at http://txcrb.org/privacy.html. Participants were surveyed to assess their comprehension, risk tolerance, and subjective comfort with OA data release. Each participant was queried again, post-survey, to reconfirm their choice to take part in the OA data sharing option. The majority demonstrated adequate understanding of the possible privacy and discrimination risks, yet still elected to allow their data to be openly shared. The work described in Pereira et al. (ref. 9) is one clear example showing that many, though not all, cancer patients do wish to participate in activities that could have broad-reaching, positive impacts on public health by reducing cancer mortality and morbidity, and that they are capable of making an informed choice.

Selecting the OA cohort

Approximately 20% of the 37 participants (n=7) who were surveyed and still consented to OA data sharing were selected for inclusion in the TCRB cancer OA dataset. To further reduce risk of reidentification, none of the OA participants had rare ethnicities or tumor types as defined by SEER (Surveillance, Epidemiology, and End Results program) statistics. Figure 1 shows the process for this and subsequent steps in creation of this dataset.

Tumor and normal collection

Blood was collected from participants in PAXgene Blood DNA tubes, and DNA was isolated using the PAXgene Blood DNA kit (PreAnalytiX, Qiagen, Valencia, CA). The pancreatic tumor tissue specimens were collected shortly after resection and either stored in a protease inhibitor solution (Roche Applied Science, Indianapolis, IN), stored in RNAlater (Qiagen), or snap frozen, and were then kept at −80 °C. The blood specimen was used as the matched normal control. DNA was isolated from 50–100 mg tissue fragments using the Gentra Puregene kit (Qiagen). The quality of the DNA samples was ascertained by electrophoresis and determined to be high (size >23 kb), with no visible degradation in blood or tumor samples.

To implement genomic sequencing approaches in ‘real world’ specimens, it is imperative to detect variants in clinical samples that have reduced tumor cellularity, e.g., as a consequence of neoadjuvant or other prior therapy. We devised methodologies to overcome the challenges associated with the extensive desmoplastic stroma that is characteristic of the majority of pancreatic tumors, and these strategies facilitated the discovery of novel molecular mechanisms in the pathophysiology of this disease. The cellularity of each primary sample was estimated through pathological review, deep amplicon-based sequencing of exons 2 and 3 of KRAS (average depth of 1,000×), and single nucleotide polymorphism (SNP) array-based cellularity estimates using a novel algorithm, qpure (ref. 20). Clinical and pathological annotations for each case are shown in Table 1.

Table 1 TCRB open access clinical and pathological annotations by case.

Whole exome sequencing

Library preparation

DNA samples were constructed into Illumina paired-end pre-capture libraries according to the manufacturer’s protocol (Illumina Multiplexing_SamplePrep_Guide_1005361_D) with modifications as described in the BCM-HGSC Illumina Barcoded Paired-End Capture Library Preparation protocol. Libraries were prepared using Beckman robotic NXp and FXp model workstations. The complete protocol and oligonucleotide sequences are accessible from the HGSC website (https://hgsc.bcm.edu/sites/default/files/documents/Illumina_Barcoded_Paired-End_Capture_Library_Preparation.pdf). Briefly, 1 µg of DNA in 100 µl volume was sheared into fragments of approximately 300–400 base pairs in a Covaris plate with the E210 system (Covaris, Inc., Woburn, MA), followed by end-repair, A-tailing, and ligation of the Illumina multiplexing PE adaptors. Pre-capture Ligation Mediated-PCR (LM-PCR) was performed for 7 cycles of amplification using the Phusion High-Fidelity PCR Master Mix (NEB, Cat. no. M0531L). Universal primer IMUX-P1.0 and a pre-capture barcoded primer IBC were used in the PCR amplification. In total, a set of 12 such barcoded primers was used on these samples. Purification was performed with Agencourt AMPure XP beads after enzymatic reactions. Following the final XP bead purification, the quantity and size distribution of the pre-capture LM-PCR product were determined using the LabChip GX electrophoresis system (PerkinElmer).

Exome capture

For exome capture, four pre-capture libraries were pooled together (~300 ng/sample, 1.2 µg/pool) and hybridized in solution using the VCRome 2.1 design (ref. 21) supplied by NimbleGen, according to the manufacturer’s protocol, NimbleGen SeqCap EZ Exome Library SR User’s Guide (Version 2.2), with minor revisions. Human COT1 DNA and full-length Illumina adaptor-specific blocking oligonucleotides were added to the hybridization to block repetitive genomic sequences and the adaptor sequences. Post-capture LM-PCR amplification was performed using the Phusion High-Fidelity PCR Master Mix with 14 cycles of amplification. After the final AMPure XP bead purification, the quantity and size of the capture library were analyzed using the Agilent Bioanalyzer 2100 DNA Chip 7500. The efficiency of the capture was evaluated by performing a qPCR-based quality check on the four standard NimbleGen internal controls. Successful enrichment of the capture libraries was estimated as a ΔCt value of 6 to 9 relative to the non-enriched samples.
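As a rough guide to interpreting the qPCR check, a ΔCt value can be converted to an approximate fold enrichment. The sketch below assumes ideal amplification efficiency (a doubling of product per cycle), which the protocol does not state; the function name and the efficiency parameter are illustrative and not part of the NimbleGen procedure.

```python
def fold_enrichment(delta_ct: float, efficiency: float = 1.0) -> float:
    """Approximate fold enrichment implied by a qPCR delta-Ct value.

    Assumes each cycle multiplies product by (1 + efficiency); efficiency=1.0
    means perfect doubling. This is an idealization, not a measured value.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return (1.0 + efficiency) ** delta_ct


# Delta-Ct range reported for the capture libraries (6 to 9 cycles).
for delta_ct in (6, 9):
    print(f"dCt = {delta_ct}: ~{fold_enrichment(delta_ct):.0f}-fold")
```

Under that assumption, a ΔCt of 6 to 9 corresponds to roughly 64- to 512-fold enrichment of the control loci; lower real-world PCR efficiency would reduce these figures.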

Sequencing

Library templates were prepared for sequencing using Illumina’s cBot cluster generation system with TruSeq PE Cluster Generation Kits (Cat. no. PE-401-3001). Briefly, these libraries were denatured with sodium hydroxide and diluted to 6–9 pM in hybridization buffer in order to achieve a load density of ~800 K clusters/mm². Each library pool was loaded in a single lane of a HiSeq 2000 flow cell, and each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode using the Illumina HiSeq 2000 platform. Using the TruSeq SBS Kits (Cat. no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300–400 million successful reads on each lane of a flow cell, yielding 7–13 Gb per sample. Samples achieved an average depth of coverage of 200× over exonic regions.
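The per-sample yield follows from the lane output and the pooling scheme. The back-of-the-envelope sketch below assumes that the 300–400 million reads are individual 101 bp reads and that each lane carries one four-library capture pool, as described under Exome capture; neither assumption is stated explicitly for the yield figures, and the function and variable names are illustrative.

```python
def per_sample_yield_gb(reads_per_lane: float, read_length_bp: int,
                        samples_per_lane: int) -> float:
    """Estimate sequence yield per sample in gigabases (1 Gb = 1e9 bases).

    Assumes reads are split evenly across the pooled samples, which real
    pools only approximate.
    """
    if samples_per_lane < 1:
        raise ValueError("samples_per_lane must be at least 1")
    return reads_per_lane * read_length_bp / samples_per_lane / 1e9


READ_LENGTH_BP = 101   # cycles per read end, from the text
SAMPLES_PER_LANE = 4   # one four-library capture pool per lane (assumed)

low = per_sample_yield_gb(300e6, READ_LENGTH_BP, SAMPLES_PER_LANE)
high = per_sample_yield_gb(400e6, READ_LENGTH_BP, SAMPLES_PER_LANE)
print(f"Estimated yield: {low:.1f}-{high:.1f} Gb per sample")
```

With these inputs the estimate is about 7.6–10.1 Gb per sample, which falls within the reported 7–13 Gb range.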

Whole genome sequencing

Library preparation

For most cases, there were insufficient quantities of biospecimen available to perform whole genome sequencing (WGS); for these cases, only whole exome sequencing (WES) data exist. For two cases, however, it was possible to perform WGS. DNA (0.5 µg) in 70 µl volume was sheared into fragments of approximately 500–700 base pairs using the Covaris S2 system (Covaris, Inc., Woburn, MA). Illumina multiplexing PE adaptors with barcode sequences were added to the sample at the time of ligation. Pre-capture Ligation Mediated-PCR (LM-PCR) was performed for 6–8 cycles using the Library Amplification Readymix containing Kapa HiFi DNA Polymerase (Kapa Biosystems, Inc., Cat. no. KK2612) and the universal primer pair IMUX-P1.0 and IMUX-P3.0. For purification of the fragmented DNA, 0.8X AMPure XP (Beckman, Cat. no. A63882) was used, as opposed to the 1.8X used for preparation of WES libraries. Library templates were then prepared for sequencing using Illumina’s cBot cluster generation system with TruSeq PE Cluster Generation Kits (Cat. no. PE-401-3001). Briefly, these libraries were denatured with sodium hydroxide and diluted to 6–9 pM in hybridization buffer in order to achieve a load density of ~800 K clusters/mm².

Sequencing

Tumor libraries were sequenced in four lanes, and normal libraries were sequenced in two lanes, of a HiSeq 2000 flow cell, yielding approximately 60× and 30× coverage, respectively. Each lane was spiked with 2% phiX control library for run quality control. The sample libraries then underwent bridge amplification to form clonal clusters, followed by hybridization with the sequencing primer. Sequencing runs were performed in paired-end mode using the Illumina HiSeq 2000 platform. Using the TruSeq SBS Kits (Cat. no. FC-401-3001), sequencing-by-synthesis reactions were extended for 101 cycles from each end, with an additional 7 cycles for the index read. Sequencing runs generated approximately 300–400 million successful reads on each lane of a flow cell, yielding ~11 Gb per sample.

Read mapping and variant calling

Binary bcl files output from the HiSeq 2000 were processed with BclConvertor 1.7.1 software available from Illumina. All reads from the prepared libraries that passed the Illumina Chastity filter were formatted into FASTQ files, which were aligned to the human reference genome (build hg19, version GRCh37-lite) using BWA (Burrows-Wheeler Aligner) 0.5.9rc1 (ref. 22). Default parameters were used for alignment, except for a 40 bp seed sequence, with 2 mismatches allowed in the seed and a total of 3 mismatches allowed overall. Subsequent base quality recalibration and local realignment around known indel sites were performed with GATK 3.3 (ref. 23). Variant calling was performed on samples using Atlas-SNP2 (ref. 24), Atlas-Indel2 (ref. 25), and PInDel (ref. 26). Variants from tumor and matched normal sample pairs were compared with each other and with the NCBI human reference genome version 19 (hg19) to generate primary calls, which were then annotated using Annovar (ref. 27), COSMIC (ref. 28), and dbSNP (ref. 29) and split into germline or somatic calls in VCF files. The germline and somatic calls were filtered by curators for quality and are available in MAF files. FASTQ, BAM, VCF, and filtered MAF files, together with the available clinical annotations, are freely available on the TCRB website (Data Citation 1), and FASTA and BAM files are also available within SRA, the Sequence Read Archive (Data Citations 2 and 3).
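For readers who wish to approximate the alignment settings, the sketch below assembles `bwa aln` and `bwa sampe` command lines. The mapping of the described settings to options (`-l` for seed length, `-k` for maximum differences in the seed, `-n` for maximum differences overall) is one reading of the BWA manual and not necessarily the exact invocation used for this dataset; note that `-k` and `-n` count edit distance and therefore include gaps as well as mismatches. All file names are hypothetical placeholders, and the commands are only constructed, not executed.

```python
import shlex
from typing import List


def bwa_aln_cmd(reference: str, fastq: str, seed_len: int = 40,
                seed_diffs: int = 2, max_diffs: int = 3) -> List[str]:
    """Build a `bwa aln` command for one read file (output goes to stdout)."""
    return ["bwa", "aln",
            "-l", str(seed_len),    # seed length in bp
            "-k", str(seed_diffs),  # max differences within the seed
            "-n", str(max_diffs),   # max differences over the whole read
            reference, fastq]


def bwa_sampe_cmd(reference: str, sai1: str, sai2: str,
                  fastq1: str, fastq2: str) -> List[str]:
    """Build a `bwa sampe` command that pairs the two per-end alignments."""
    return ["bwa", "sampe", reference, sai1, sai2, fastq1, fastq2]


# Hypothetical file names, for illustration only.
REF = "GRCh37-lite.fa"
R1, R2 = "sample_R1.fastq", "sample_R2.fastq"

commands = [
    bwa_aln_cmd(REF, R1),
    bwa_aln_cmd(REF, R2),
    bwa_sampe_cmd(REF, "sample_R1.sai", "sample_R2.sai", R1, R2),
]
for cmd in commands:
    print(shlex.join(cmd))
```

In practice the output of each `bwa aln` call would be redirected to the corresponding `.sai` file, and the reference would need to be indexed with `bwa index` beforehand.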

Code availability

All software used to generate the sequence data and to manage biospecimen and clinical annotations is freely available. Specific software versions and references for code are provided as in-line references above.