Well, after reading the Google study, I have to question the containment of the drives or the way they were operated. Tags: disk, failure, google, magnetic, paper, research, smart, by Benjamin Schweizer. In a white paper published in February 2007, Google presented data based on an analysis of hundreds of thousands of disk drives.


Manufacturers do not want you to return a drive every two months because SMART reported a problem, and certainly not before the warranty runs out.

» Google disk reliability paper

We obtained quarterly hardware purchase records covering this time period to estimate the size of the disk population in our ARR analysis. Another aspect of the failure process that we will study is long-range dependence. The poor fit of the exponential distribution might be due to the fact that failure rates change over the lifetime of the system, creating variability in the observed times between disk replacements that the exponential distribution cannot capture.
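One way to see why lifecycle-dependent failure rates break the exponential model is a small simulation (a hypothetical sketch, not the paper's data): mixing two exponential phases with different mean times between failures inflates the squared coefficient of variation above the value of 1 that a single exponential yields.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure exponential: a constant failure rate gives a squared coefficient
# of variation (C^2 = variance / mean^2) of 1 in expectation.
pure = rng.exponential(scale=10.0, size=100_000)

# Hypothetical lifecycle effect: early "infant mortality" failures arrive
# quickly (mean 2), later wear-out failures slowly (mean 20). Mixing the
# two rates inflates the variability of the observed times between failures.
mixed = np.concatenate([
    rng.exponential(scale=2.0, size=50_000),
    rng.exponential(scale=20.0, size=50_000),
])

def c2(x):
    """Squared coefficient of variation: variance / mean^2."""
    return x.var() / x.mean() ** 2

print(c2(pure))   # close to 1
print(c2(mixed))  # well above 1: too variable for an exponential model
```

This is exactly the kind of extra variability an exponential fit cannot capture.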

We observe that the expected number of disk replacements in a week varies by a factor of 9, depending on whether the preceding week falls into the first or third bucket, while we would expect no variation if failures were independent.
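The bucket comparison above can be sketched with synthetic data (purely illustrative; the autocorrelated latent rate below is an assumption, not the paper's model): bucket each week by the preceding week's count and compare the expected counts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weekly replacement counts with positive autocorrelation:
# an AR(1) latent rate drives a Poisson count each week.
n = 5000
rate = np.empty(n)
rate[0] = 5.0
for t in range(1, n):
    rate[t] = 0.8 * rate[t - 1] + 0.2 * 5.0 + rng.normal(0.0, 1.0)
rate = np.clip(rate, 0.1, None)
counts = rng.poisson(rate)

# Bucket each week by the size of the PRECEDING week's count, then compare
# the expected count after a "low" week vs. after a "high" week.
prev, cur = counts[:-1], counts[1:]
low, high = np.quantile(prev, [1 / 3, 2 / 3])
after_low = cur[prev <= low].mean()
after_high = cur[prev >= high].mean()
print(after_low, after_high)  # independence would make these roughly equal
```

With independent failures the two conditional means coincide; any sizable gap is evidence of correlation between weeks.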

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. The time between replacements has a squared coefficient of variation of 2.

This leads us to believe that even during shorter segments of HPC1's lifetime the time between replacements is not realistically modeled by an exponential distribution. We thank Ray Scott and Robin Flaus from the Pittsburgh Supercomputing Center for collecting and providing us with data and for helping us to interpret it.

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component-specific factors. Below we summarize the key observations of this section.


Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.

Note that we only see customer-visible replacements.

In order to compare the reliability of different hardware components, we need to normalize the number of component replacements by the component's population size. In our pursuit, we have spoken to a number of large production sites and were able to convince several of them to provide failure data from some of their systems. A natural question is therefore what the relative frequency of drive failures is, compared to that of other types of hardware failures.
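As a sketch, the normalization is just replacements per disk-year, expressed as a percentage (the fleet names and numbers below are made up for illustration, not taken from the paper):

```python
# Hypothetical fleets (illustrative numbers only).
fleets = {
    "HPC_example": {"disks": 3000, "replacements": 90, "years": 1.0},
    "COM_example": {"disks": 26000, "replacements": 780, "years": 2.0},
}

# Annual replacement rate (ARR): replacements per disk-year, as a percent.
arrs = {
    name: 100.0 * f["replacements"] / (f["disks"] * f["years"])
    for name, f in fleets.items()
}

for name, arr in arrs.items():
    print(f"{name}: ARR = {arr:.2f}%")  # 3.00% and 1.50%
```

Normalizing by disk-years rather than raw counts is what makes fleets of different sizes and observation windows comparable.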

To account for the missing disk replacements we obtained numbers for the periodic replenishments of on-site spare disks from the internet service provider.

In particular, customers and vendors might use different definitions of what constitutes a failure. Second, datasheet MTTFs are typically determined based on accelerated stress tests, which make certain assumptions about the operating conditions under which the disks will be used. While the datasheet AFRs are below 1%, the replacement rates observed in the field are often significantly higher.
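The gap between datasheet and field rates is easiest to see by converting a datasheet MTTF into the nominal annual failure rate it implies (assuming around-the-clock operation; the MTTF values below are typical datasheet figures, used here as assumptions):

```python
# Nominal AFR implied by a datasheet MTTF, assuming 24/7 operation.
HOURS_PER_YEAR = 24 * 365  # 8760

for mttf_hours in (1_000_000, 1_500_000):  # typical datasheet MTTFs
    afr_percent = 100.0 * HOURS_PER_YEAR / mttf_hours
    print(f"MTTF {mttf_hours:>9,} h -> nominal AFR {afr_percent:.2f}%")
    # prints 0.88% and 0.58% respectively
```

Field ARRs of several percent per year are thus many times the nominal datasheet rate.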

Unfortunately, we do not have, for any of the systems, exact population counts of all hardware components. The HPC4 data set is a warranty service log of disk replacements. A Hurst parameter between 0.5 and 1 indicates long-range dependence.


Unfortunately, many aspects of disk failures in real systems are not well understood, probably because the owners of such systems are reluctant to release failure data or do not gather such data. Each failure record contains a repair code. The question at this point is how and when. A series exhibits long-range dependence if the Hurst exponent, H, is between 0.5 and 1. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
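The Hurst exponent can be estimated with a minimal rescaled-range (R/S) sketch (a textbook construction, not the paper's exact procedure): compute the range of cumulative deviations divided by the standard deviation at several window sizes, and fit the slope of log(R/S) against log(window size).

```python
import numpy as np

def hurst_rs(x, window_sizes):
    """Estimate the Hurst exponent of series x by rescaled-range (R/S)
    analysis: the slope of log(R/S) versus log(window size)."""
    log_n, log_rs = [], []
    for n in window_sizes:
        rs_vals = []
        for start in range(0, len(x) - n + 1, n):
            block = x[start:start + n]
            dev = block - block.mean()
            z = np.cumsum(dev)
            r = z.max() - z.min()  # range of cumulative deviations
            s = block.std()        # block standard deviation
            if s > 0:
                rs_vals.append(r / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(rs_vals)))
    slope, _ = np.polyfit(log_n, log_rs, 1)
    return slope

rng = np.random.default_rng(2)
white = rng.normal(size=20_000)  # no long-range dependence
h = hurst_rs(white, [16, 32, 64, 128, 256, 512])
# close to 0.5 for white noise (R/S has a known upward small-sample bias);
# H clearly above 0.5 would indicate long-range dependence
print(h)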


Illustration of decreasing hazard rates. To answer this question we consult data sets HPC1, COM1, and COM2, since these data sets contain records for all types of hardware replacements, not only disk replacements.

Our data set providers all believe that their disks are powered on and in use at all times. The cause was attributed to the breakdown of a lubricant leading to unacceptably high head flying heights. A chi-square test reveals that we can reject the hypothesis that the number of disk replacements per month follows a Poisson distribution at the 0. Our analysis of life cycle patterns shows that this concern is justified, since we find failure rates to vary quite significantly over even the first two to three years of the life cycle.
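A chi-square test of Poisson fit can be sketched on synthetic overdispersed counts (a hypothetical construction; the negative-binomial parameters are assumptions, not the paper's data): bin the observed monthly counts, compute the frequencies a Poisson distribution with the same mean would predict, and compare the statistic to the critical value.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical monthly replacement counts that are overdispersed
# (negative binomial: mean 6, variance 24) -- more variable than Poisson.
months = rng.negative_binomial(2, 0.25, size=600)

lam = months.mean()

def poisson_cdf(k, lam):
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

# Bin observed counts; expected frequencies under Poisson(lam).
edges = [0, 2, 4, 6, 8, 10, 12]  # bins [0,2), [2,4), ..., [12,inf)
obs = np.histogram(months, bins=edges + [np.inf])[0]
cum = [0.0] + [poisson_cdf(e - 1, lam) for e in edges[1:]] + [1.0]
exp = len(months) * np.diff(cum)

chi2 = ((obs - exp) ** 2 / exp).sum()
# 7 bins minus 1 minus 1 estimated parameter -> 5 degrees of freedom;
# the 5% critical value of chi-square with 5 dof is about 11.07.
print(chi2 > 11.07)  # True -> reject the Poisson hypothesis
```

Overdispersion of this kind is exactly what makes the Poisson model rejectable at the 0.05 level.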

In our study, we focus on the HPC1 data set, since this is the only data set that contains precise timestamps for when a problem was detected rather than just timestamps for when the repair took place. Often it is hard to correctly attribute the root cause of a problem to a particular hardware component. The data records, for each of the 13, drives, when it was first shipped and when, if ever, it was replaced in the field.

Failure Trends in a Large Disk Drive Population

We also thank the other people and organizations who have provided us with data but would like to remain unnamed. Changes in disk replacement rates during the first five years of the lifecycle were more dramatic than often assumed. While visually the exponential distribution now seems a slightly better fit, we can still reject the hypothesis of an underlying exponential distribution at a significance level of 0.05.
