Network protocol

We modeled a prototype phase II clinical trial. Upon approval of the study protocol and initiation of phase II, we propose that a future regulator could instantiate a private blockchain and register all participating parties in a portal providing authenticated, controlled web-based and API (application programming interface) access to the blockchain. All parties would be required to use the portal service for all exchange of information related to the trial, and only the information present on the blockchain would be used for review when considering approval of the treatment. The network and representative transactions are illustrated in Fig. 1.

Fig. 1 The idealized clinical trial network in the context of a blockchain-based record system. The various transactions (along each arrow) and key participants (boxed) within a clinical trial are shown

For new patient recruitment and consent acquisition by the clinical site, we propose that an Interactive Voice Response System (IVRS) generates unique verification codes for each subject to give to the trial investigator at the clinical site, and posts encrypted decoding keys for later unblinding. The decoding keys describe the various treatment types that a patient can receive, and will be saved in a password protected environment by the IVRS service provider. The trial sponsor then sends a blinded treatment distribution scheme to the trial investigator at the clinical site. The unique verification codes for each subject are appended to that subject’s CRF at the clinical site upon the office visit. All CRFs would be completed digitally and considered valid if the proper verification code is present. Once completed, the CRF will be directed to the Clinical Research Organization (CRO) involved in the trial, and this transaction will be stamped onto the growing blockchain.

When adverse events are reported, we envision that the Data Safety Management Board (DSMB) or regulatory agency would be able to see these events continuously through each chain-posted CRF, and then potentially propagate them onto that trial's page for public view, if appropriate. Once the CRO receives raw CRF data, data cleaning and statistical analysis can begin and proceed transparently. Upon completion, cleaned data and analysis scripts are sent to the trial sponsor through the portal, and these transactions are subsequently recorded onto the blockchain. Any outside data collection sources enrolled in the clinical trial would also have to send data to the sponsor through the portal, all marked on the growing ledger.

When the trial sponsor wishes to apply for approval of the drug, the sponsor would send all of their finalized data and in-house statistical analysis results to the regulator through the portal, and these, like all other elements, would subsequently be added to the chain. When reviewing for approval, the regulator will only consider data present on this secure blockchain, and has full read access to everything that has occurred since the blockchain's instantiation. All data ever transmitted in the network would be easily accessible, and their integrity and the time at which each transaction occurred would be assured.

Data transaction details

Whenever a transaction occurs, the sender, receiver, timestamp, file attachment, and hash of the previous block, are all recorded onto a new block. These elements are then concatenated together, and hashed using the SHA256 algorithm12, with the result instantiated as the hash string of the current block. The blockchain is constructed by creating a linked list of such blocks (Fig. 2). The previous block’s hash is kept for ordering and to make each block dependent on all blocks that preceded it in the chain, which is a useful property for quickly validating a chain6. Data storage of the blockchain will be accomplished by duplicating and distributing the chain to physically separate machines and data warehouses to be managed by the regulator (see Supplementary Methods).
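The block construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual implementation: the field names, the concatenation order, and the all-zero genesis sentinel are assumptions made for the sketch; only the SHA256-of-concatenated-fields scheme comes from the text.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Block:
    sender: str
    receiver: str
    timestamp: str
    file_bytes: bytes
    prev_hash: str

    def hash(self) -> str:
        # Concatenate the block's elements and hash them with SHA-256;
        # the result serves as the hash string of the current block.
        payload = (self.sender + self.receiver + self.timestamp +
                   self.prev_hash).encode() + self.file_bytes
        return hashlib.sha256(payload).hexdigest()

GENESIS_HASH = "0" * 64  # illustrative sentinel for the genesis block

def append_block(chain, sender, receiver, timestamp, file_bytes):
    # Each new block stores the previous block's hash, making it
    # dependent on every block that preceded it in the chain.
    prev = chain[-1].hash() if chain else GENESIS_HASH
    chain.append(Block(sender, receiver, timestamp, file_bytes, prev))

chain = []
append_block(chain, "site", "CRO", "2017-01-01T09:00Z", b"CRF subject 73491")
append_block(chain, "CRO", "sponsor", "2017-01-02T10:00Z", b"cleaned data")
```

Because each block embeds its predecessor's hash, validating the whole chain reduces to recomputing and comparing fixed-length strings in order.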

Fig. 2 The growing blockchain. With each new transaction that occurs, a block is appended and keeps track of information such as the timestamp, sender, receiver, file contents, hash of the previous block, and current hash, all in an immutable data structure. On-chain storage of these elements requires memory that grows only linearly with the amount of data being uploaded, since the additional bookkeeping elements are fixed-length strings. Hence, scalability is possible, especially given adequate allocation of hardware made possible by growing cloud storage capabilities (see Supplementary Discussion). A summary, compressed blockchain is shown. The actual blockchain has a beginning genesis block and an individual block for each transaction (such as a new block for each CRF, rather than a single block for all CRFs as shown). The compressed chain is shown to illustrate the chronology of the trial and the information that constitutes a block

Encryption through a password-based key derivation function is offered and, should the user choose it, ensures that sensitive information is protected; this is especially relevant to maintaining the integrity of health and medical information and eliminating exposure of information to unwanted parties. Data are thus stored as an unintelligible series of bytes at the storage level, ensuring that any sensitive information in the network is obfuscated and will not be compromised in the event of a data breach.
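Password-based key derivation can be sketched with PBKDF2-HMAC-SHA256 from Python's standard library. This is only an illustration of the derivation step; the iteration count and key length shown are assumptions, and a real deployment would feed the derived key into an authenticated cipher (e.g., AES-GCM) rather than use it directly.

```python
import hashlib
import os

def derive_key(password: str, salt: bytes, length: int = 32) -> bytes:
    # PBKDF2-HMAC-SHA256 stretches a user-chosen password into a
    # fixed-length encryption key; the salt and iteration count make
    # brute-force guessing expensive. 200,000 iterations is illustrative.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                               200_000, dklen=length)

salt = os.urandom(16)                       # stored alongside the ciphertext
key = derive_key("correct horse battery staple", salt)

# The same password and salt always reproduce the same key,
# while a different password yields an unrelated key.
assert derive_key("correct horse battery staple", salt) == key
assert derive_key("wrong password", salt) != key
```

Without the password, the stored bytes remain unintelligible, which is what keeps data obfuscated at the storage level.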

File storage of any type onto the blockchain is supported, and the user is able to encrypt, send, and extract files easily. For regulators, the full transaction history since the chain's genesis is readily available with precise timestamps (Fig. 3a), and the auditing process can be done swiftly and with the confidence that all data are original or version controlled. Content from the earliest phases of the trial is sorted, fully transparent, and easily compressed and downloaded.

Fig. 3 Portal functionality. a The public ledger shows the full transaction history since the start of the trial. Blocks are timestamped, indexed, and attached with the file of the transaction and the identities of the participating parties. The regulator can easily download individual files, or all elements in bulk, and inspect when and between whom files are shared. b New versions of files are given a version number, which increments with each new version of that file. In the screenshot above, the original CRF was modified by the sponsor and automatically appended with a (v2) by the system (boxed in red). The responsible party and time of modification are readily apparent. c Internal validation and automated hash checks verify the integrity of the data without the need to manually read through the data's contents. Hash checks are performed chronologically starting with the genesis block, and verify that the hash of each block's contents matches what is expected (see Supplementary Methods). Validation fails when treatmentDistribution.csv is modified at the storage level. The precise origin and location of the fault can be readily discerned. d Adverse events are auto-populated from investigator-uploaded CRFs to the pages of the regulator and DSMB, circumventing the normal, slower, and more error-prone pathway usually taken for adverse event reporting. Such instances are available for inspection at the soonest possible time

Version controls

If a user needs to edit content that is already present on the blockchain, as when an honest mistake is made and needs to be corrected, the user could make an update known by submitting a new transaction with the corrected data without overwriting the old data. By nature, blockchains are append-only, so editing data directly on the blockchain is not possible. We propose combining blockchain's append-only criterion with version controlling similar to the functionality of GitHub to accommodate this issue. When a new file is uploaded in a transaction, its contents are hashed and compared to existing files on the blockchain. If there is a conflict, the system initiates a scheme in which subsequent, differing versions of a file are automatically given incrementing version numbers. Hence, a user can be assured that any downstream modifications to that user's file by anyone else in the network will be documented and cannot be made discreetly. No trust in any other parties in the network is needed for data purity, as any tampering will be version controlled and any editors of the file will be easily identified (Fig. 3b).
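The versioning scheme can be sketched as a content-hash comparison over an append-only ledger. The `register_upload` helper and the ledger's field names are hypothetical constructs for this sketch; only the hash-compare-then-increment behavior is taken from the text.

```python
import hashlib

def register_upload(ledger, filename, content: bytes, uploader):
    """Append-only upload: if a file with the same name but different
    content already exists on the ledger, record the new copy with an
    incremented version number instead of overwriting anything."""
    digest = hashlib.sha256(content).hexdigest()
    versions = [e for e in ledger if e["name"] == filename]
    if versions and versions[-1]["sha256"] == digest:
        return versions[-1]      # identical content: nothing new to record
    version = len(versions) + 1  # originals are version 1 (shown unnumbered)
    entry = {"name": filename, "version": version,
             "sha256": digest, "by": uploader}
    ledger.append(entry)
    return entry

ledger = []
register_upload(ledger, "CRF_73491.pdf", b"original CRF", "investigator")
e = register_upload(ledger, "CRF_73491.pdf", b"tampered CRF", "sponsor")
# e records version 2, with the responsible party ("sponsor") attached
```

Because the original entry is never removed, both the old and new versions, and the identity of whoever changed the file, remain permanently auditable.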

Simulation of a previously completed clinical trial

To test how blockchain software technologies could be used to manage the governance and data management aspects of a clinical trial, we simulated how a previously completed clinical trial testing the efficacy and safety of omalizumab13 could have been executed using blockchain. We downloaded the completed trial's data, including all necessary components, such as raw data, case report form (CRF) components, and protocols, from the open clinical trials data repository ImmPort14 (see Data Availability). The trial simulation sequence of events and corresponding files are shown in Fig. 2. Of the 159 actual patients in the trial with CRF data, we mimicked one subject from each of the four treatment arms for the sake of clarity. Only a few selected categories from the large wealth of CRF information were mirrored for the same reason (see Methods). The statistical scripts in this simulation are not the real Python analyses because we did not have access to the originals.

Here, we show a simulation of how different types of clinical trial events were implemented using our blockchain-based data portal. The first event occurs during encounters between a clinical investigator and patients after they are enrolled in the clinical trial. The second event we simulated is the mutation of CRF data by the trial sponsor. The third event is a storage-level corruption on the machine housing the data. Finally, we demonstrate an improved and expedited version of adverse event reporting.

Patient and clinical investigator encounters

We composed truncated and digitized CRFs for the four patients we mimicked on the portal using the publicly available CRF component data. For instance, Subject 73,491 from the study came to the clinical site on the first day of the trial period (day 0). The investigator collected immunological data, such as a white blood cell concentration of 5.9 × 103 cells per μL, eosinophil percentage of 4.6%, and platelet count of 223 × 103 cells per μL (Supplementary Note 1). In our proposed schema, if the CRF were paper based, then it would be scanned in; if electronically captured, it would be directly added to the growing blockchain via the portal, as is the case in our simulation. Verification codes are appended to each CRF.

User mediated corruption

We then simulated two types of hostile conditions for the trial. The first was an effort to manipulate data that were uploaded by other trial staff. While logged in as the trial sponsor, we attempted to modify the adverse events reported in selected CRFs that recorded subjects 73,491 and 73,511 receiving the treatment drug omalizumab, so as to deceptively bolster treatment approval in a potentially untrustworthy network. Subject 73,491 showed many adverse reactions during the treatment period, such as muscle strain, injection site swelling, sinus headaches, and nasal congestion among other events (Supplementary Note 1), while Subject 73,511 exhibited events such as chest tightness, injection site reactions, sinus congestion, decreased blood pressure, and a lower respiratory tract infection among other ill effects (Supplementary Note 2). Logged in as the trial sponsor, we mutated these CRFs so that no adverse events were listed (Supplementary Notes 3, 4). The tampered replacement files are automatically appended with a version number (Fig. 3b), and the corrupting party, time of modification, and changes are all easily visible. The system can handle multiple versions of files in case the original is part of a later transaction, or in case further revisions or illegitimate mutations are made. These are designated with incrementing version numbers for each new unique version; original documents carry no version number.

With an append only transaction scheme and version controlling, we have a means of keeping a full record of everything that happens to a file, and can easily refer to the author and old and new versions of data, similar to the concept and flexibility of GitHub. This is integral to the auditing process as regulators can track precisely what was changed, by whom, and when with the immutable timestamp. Hence, we simultaneously accommodate the user’s need for making changes clear, the regulator’s desire for monitoring data easily, and also abide by blockchain’s append only schema, which allows for the maintenance and persistence of older data.

Storage corruption

The second hostile condition we simulated was an intentional fault or data corruption at the storage level. In this simulation, we purposefully corrupted the treatment distribution outlining which medication plan was given to which patients (Supplementary Tables 1, 2) to check whether the infrastructure would detect and guard against such changes. In the blockchain ledger section of the portal, a validation check successfully shows exactly where the fault and corrupted file lie (Fig. 3c). Due to the sensitivity of a hash function's output to its input, changing the data in a block in even the smallest way, such as modifying a single character in the block's attached file, will result in a completely different hash string. This string is fed into the input of the next block's hash function, and the resulting string will be completely different from what it was prior to the data modification. Hence, data integrity can be checked by simply comparing the hash strings of a proposed blockchain under audit with a set of verified and correct hashes. Since each transaction is given its own block, the corrupted block and file can be precisely located by finding the first block with an incorrect hash. Storage of the desired and correct hashes is therefore necessary, and we advocate centralized and secured storage by the trusted regulator who will be performing the audit (see Supplementary Methods). Since only hash strings are required to verify integrity, the regulator need not allocate much hard disk space for the audit process. Verifying integrity can be done quickly, as the regulator need only check for string equivalence, which is a trivial operation.
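The audit described above can be sketched as a chronological walk over the chain, recomputing each block's hash and comparing it against the regulator's trusted list. The dictionary layout, field names, and `validate` helper are assumptions of this sketch; the recompute-and-compare logic mirrors the text.

```python
import hashlib

def block_hash(block: dict) -> str:
    # Same SHA-256-of-concatenated-fields scheme used when blocks were made.
    payload = (block["sender"] + block["receiver"] + block["timestamp"] +
               block["prev_hash"]).encode() + block["file_bytes"]
    return hashlib.sha256(payload).hexdigest()

def validate(chain, trusted_hashes):
    """Check the chain from the genesis block onward against the
    regulator's stored hashes; return the index of the first corrupted
    block, or None if the chain is intact."""
    prev = "0" * 64
    for i, block in enumerate(chain):
        if block["prev_hash"] != prev or block_hash(block) != trusted_hashes[i]:
            return i
        prev = trusted_hashes[i]
    return None

def make_chain(transactions):
    chain, prev = [], "0" * 64
    for sender, receiver, ts, data in transactions:
        block = {"sender": sender, "receiver": receiver, "timestamp": ts,
                 "prev_hash": prev, "file_bytes": data}
        chain.append(block)
        prev = block_hash(block)
    return chain

chain = make_chain([
    ("site", "CRO", "t0", b"CRF A"),
    ("sponsor", "site", "t1", b"treatmentDistribution.csv"),
    ("CRO", "sponsor", "t2", b"cleaned data"),
])
trusted = [block_hash(b) for b in chain]      # held centrally by the regulator
chain[1]["file_bytes"] = b"corrupted bytes"   # storage-level tampering
assert validate(chain, trusted) == 1          # fault pinpointed at block 1
```

Since only 64-character hash strings are compared, the audit needs neither the files' contents nor significant disk space on the regulator's side.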

Zero-knowledge proof of purity

In this proof-of-concept model, we illustrate the ease with which a data repository can be checked and verified for tampering without manually reviewing each file. SHA256 hashing is a quick and highly optimized process12, and comparing two hash strings for equality is trivial. Hence, verifying originality can be done quickly and without actually opening and inspecting data, which is useful if confidential data are being audited. This serves as a zero-knowledge proof15 of data integrity, because the auditor need not know the exact configurations and detailed information within a file and yet can still verify its originality. This is particularly useful for the data handled in clinical processes. This methodology provides another layer of security and respect for confidential data, all while quickly and automatically verifying the purity of the data. Furthermore, by giving the user the option of storing encrypted data on the servers, we ensure that sensitive information cannot be compromised even in the event of a data breach.

Expedited adverse event reporting

As part of the simulation in the portal, we also scrutinized adverse event reporting and how to improve it. In keeping with the goal of making the management of a clinical trial easier and more effective than the current standard, adverse events from investigator-uploaded CRFs are automatically parsed and populated to the pages of the regulator and DSMB (Fig. 3d). This not only serves as a fast means of assessing safety as the trial continues, but also circumvents the slower and potentially error-prone route that adverse event reporting would normally take before reaching the regulator. In the current way clinical trials are run, CRFs are sent to the CRO, which parses out the adverse reactions and reports them to the sponsor, which in turn sends the reports to the DSMB. This process takes time and is subject to modification or loss through human error or malice. In the proposed scheme and proof-of-concept service, this vulnerable process is circumvented, and the regulator or DSMB can be notified of each subject's adverse reactions at the soonest possible time, which can be crucial for maintaining public safety. Since the adverse events are extracted directly from the CRFs on the immutable blockchain, the regulator and DSMB can be assured that the events are legitimate.
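The auto-population step can be sketched as a parser that reads adverse events from an uploaded CRF and posts them straight to the oversight parties' pages. The `propagate_adverse_events` helper, the CRF field names, and the page structure are all hypothetical; only the direct CRF-to-regulator/DSMB routing reflects the proposed scheme.

```python
def propagate_adverse_events(crf: dict, pages: dict) -> None:
    """Parse the adverse-event entries of an uploaded CRF and post each
    one directly to the regulator's and DSMB's pages, bypassing the
    usual CRF -> CRO -> sponsor -> DSMB relay. Field names are
    illustrative, not the portal's actual schema."""
    for event in crf.get("adverse_events", []):
        notice = {"subject": crf["subject_id"], "event": event}
        for party in ("regulator", "DSMB"):
            pages[party].append(notice)

pages = {"regulator": [], "DSMB": []}
crf = {"subject_id": "73491",
       "adverse_events": ["muscle strain", "injection site swelling"]}
propagate_adverse_events(crf, pages)
# Both oversight pages now list each of the subject's adverse events.
```

Because the events are read from the chain-posted CRF itself, there is no intermediate hand-off in which a report can be delayed, altered, or lost.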