With the growing scale of today's IT installations, component failure is becoming an ever larger problem. Yet virtually no data on failures in real systems is publicly available, forcing researchers working on system reliability to base their work on anecdotes and back-of-the-envelope calculations rather than empirical data.

The Computer Failure Data Repository (CFDR) aims to accelerate research on system reliability by filling the nearly empty collection of public data with detailed failure data from a variety of large production systems.

Please join us, either by contributing data or by downloading it.

News

You are viewing a first draft of the CFDR. For feedback and comments, please contact the moderators.

Available data

The table below provides an overview of the available data sets.

Name | Time period | System type | Type of data
LANL | Dec 96 - Nov 05 | HPC clusters | The data covers node outages at 22 cluster systems at LANL, including a total of 4,750 nodes and 24,101 processors. Usage logs and error logs are available as well.
HPC1 | Aug 01 - May 06 | HPC cluster | The data covers hardware replacements at a 765 node cluster with more than 3,000 hard drives.
HPC2 | Jan 04 - Jul 06 | HPC cluster | Hard drive replacements in a 256 node cluster with 520 drives.
HPC3 | Dec 05 - Nov 06 | HPC cluster | Hard drive replacements observed in a 1,532-node HPC cluster with more than 14,000 drives.
HPC4 | 2004 - 2006 | HPC cluster | Event logs collected at 5 supercomputing systems at SNL and LLNL, ranging from 512 to 131072 processors.
PNNL | Nov 03 - Sep 07 | HPC cluster | Hardware failures recorded on the MPP2 system (a 980 node HPC cluster) at PNNL.
NERSC | 2001 - 2006 | HPC cluster | I/O specific failures collected at a number of production systems at NERSC.
COM1 | May 2006 | Internet services cluster | Hardware failures recorded by an internet service provider and drawing from multiple distributed sites.
COM2 | Sep 04 - Apr 06 | Internet services cluster | Warranty service log of hardware failures aggregating events in multiple distributed sites.
COM3 | Jan 05 - Dec 05 | Internet services cluster | Aggregate quarterly statistics of disk failures at a large external storage system.
ask.com | Dec 06 - Feb 07 | Internet services cluster | Memory error data collected on a 212 node server farm at ask.com.
Cray | N/A | Cray systems | Event logs, console logs, and syslog from Cray XT series machines running Linux.
Intrepid | Jan 09 - Aug 09 | Blue Gene/P | RAS log.

How to contribute

First of all, thank you for your interest in contributing to the CFDR.

If your data is already public on your own web page, so that anyone can download it, all you need to do is send us a pointer to that page and a brief description of the data.

Otherwise, if you want to make the first release of your data through the CFDR, the data contribution procedure is as follows:

1. We need to have the necessary paperwork on file to show that we actually have permission to host this data. You need to sign, or find someone at your organization to sign, our contributor's agreement.
2. If the data contains sensitive information such as user or vendor names, you need to sanitize (anonymize) it. If you don't have proper sanitization tools, we will try to help you.
3. Please provide any available documentation or description of the data you are contributing. If no documentation is readily available, it would be helpful to create some in the form of a FAQ answering frequently asked questions about the data. You can take a look at the FAQ accompanying the LANL data sets to get an idea of the kind of questions people commonly ask about failure data.
4. Make your data accessible to us, and we will host it on the CFDR server.
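As an illustration of the sanitization step described above, the following is a minimal sketch of how user and vendor names in a log could be replaced with stable pseudonyms. It is not a CFDR tool. The log layout (fields written as user=<name> and vendor=<name>), the key, and the function names are all assumptions made for this example, and real data will usually need additional checks for sensitive values in free-text fields.

```python
import hashlib
import hmac
import re

# Hypothetical secret key (an assumption for this sketch). Keep it private so
# that pseudonyms cannot be reversed by hashing guessed names.
SECRET_KEY = b"replace-with-a-private-random-key"

# Hypothetical log layout: sensitive values appear as user=<name> or
# vendor=<name>. Adapt the pattern to the actual format of your data.
SENSITIVE_FIELD = re.compile(r"\b(user|vendor)=(\S+)")


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a sensitive value to a stable token using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "anon-" + digest[:12]


def sanitize_line(line: str, key: bytes = SECRET_KEY) -> str:
    """Replace each sensitive field value with its pseudonym, keeping the field name."""
    return SENSITIVE_FIELD.sub(
        lambda m: f"{m.group(1)}={pseudonym(m.group(2), key)}", line
    )


if __name__ == "__main__":
    # Made-up sample line, not taken from any CFDR data set.
    sample = "2005-03-14 02:11:07 node=17 user=alice vendor=acme event=disk_failure"
    print(sanitize_line(sample))
```

Because the same name always maps to the same token, analyses that group failures by user or vendor still work on the sanitized data, while the original names are not recoverable without the key.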

Thanks!

Best practices

Currently, data collection and analysis are complicated by the fact that there is no widely accepted format for anomaly data and no guidelines on what data to collect and how. We hope that the experience of working with a variety of sites on collecting and analyzing failure data will lead to some best practices for failure data collection. Providing such guidelines will make it easier for sites to collect data that is useful and comparable across sites.

If you would like to contribute your experiences on collecting or working with failure data, please contact the moderators.

Contribution to CFDR

How can I upload and release my data or tools through CFDR?

If your data or tools are already public on your own web page, so that anyone can download them, just let us know. We will download your data, create metadata, and host them in the CFDR. Otherwise, if you want to make the first release of your data through the CFDR, see the "How to contribute" section above.

About the computer failure data repository

The Computer Failure Data Repository (CFDR) started as an initiative at CMU in 2006 and was motivated by the fact that hardly any failure data from real, large-scale production systems is available to researchers. The goal of the CFDR is to collect and make available failure data from a large variety of sites, enabling researchers to gain a better understanding of the characteristics of failures in the real world.

The CFDR started to become reality when Los Alamos National Laboratory (LANL) decided to publicly release a large set of failure data collected on LANL's HPC systems. The data was collected over 9 years, covers more than 23,000 outages, and was the first to become publicly available as part of the CFDR.

The current moderators of this site are Garth A. Gibson and Bianca Schroeder. You can contact them by e-mail.

Contact us

We would like to hear your feedback, comments, insights and experiences! Please e-mail them to the moderators.