tation (such as work distribution, load balancing, toler-ance of failures), allowing the user to focus simply onthe algorithms that make up the heart of the computa-tion.The analysis pipeline used for this study consists of a Mapreduce job written in the Sawzall language andframework [15] to extract and clean up periodic SMARTdata and repair data related to disks, followed by a passthrough R [1] for statistical analysis and final graph gen-eration.
2.2 Deployment Details
The data in this study are collected from a large num-ber of disk drives, deployed in several types of systemsacross all of Google’s services. More than one hundredthousand disk drives were used for all the results pre-sented here. The disks are a combination of serial andparallel ATA consumer-grade hard disk drives, rangingin speed from 5400 to 7200 rpm, and in size from 80 to400 GB. All units in this study were put into productionin or after 2001. The population contains several modelsfrom many of the largest disk drive manufacturers andfrom at least nine different models. The data used forthis study were collected between December 2005 andAugust 2006.As is common in server-class deployments, the diskswere powered on, spinning, and generally in service foressentially all of their recorded life. They were deployedin rack-mounted servers and housed in professionally-managed datacenter facilities.Before being put into production, all disk drives gothrough a short burn-in process, which consists of acombination of read/write stress tests designed to catchmany of the most common assembly, configuration, orcomponent-level problems. The data shown here do notinclude the fall-out from this phase, but instead beginwhen the systems are officially commissioned for use.Therefore our data should be consistent with what a reg-ular end-user should see, since most equipment manu-facturers put their systems through similar tests beforeshipment.
2.3 Data Preparation
Definition of Failure.
Narrowly defining what consti-tutes a failure is a difficult task in such a large opera-tion. Manufacturers and end-users often see differentstatistics when computing failures since they use differ-ent definitions for it. While drive manufacturers oftenquoteyearlyfailureratesbelow2%[2], userstudieshaveseen rates as high as 6% [9]. Elerath and Shah [7] reportbetween 15-60% of drives considered to have failed atthe user site are found to have no defect by the manu-facturers upon returning the unit. Hughes
et al.
[11] ob-serve between 20-30% “no problem found” cases afteranalyzing failed drives from their study of 3477 disks.From an end-user’s perspective, a defective drive isone that misbehaves in a serious or consistent enoughmanner in the user’s specific deployment scenario thatit is no longer suitable for service. Since failures aresometimes the result of a combination of components(i.e., a particular drive with a particular controller or ca-ble, etc), it is no surprise that a good number of drivesthat fail for a given user could be still considered op-erational in a different test harness. We have observedthat phenomenon ourselves, including situations wherea drive tester consistently “green lights” a unit that in-variably fails in the field. Therefore, the most accuratedefinition we can present of a failure event for our studyis:
a drive is considered to have failed if it was replaced as part of a repairs procedure
. Note that this definitionimplicitly excludes drives that were replaced due to anupgrade.Sinceitisnotalwaysclearwhenexactlyadrivefailed,we consider the time of failure to be when the drive wasreplaced, which can sometimes be a few days after theobserved failure event. It is also important to mentionthat the parameters we use in this study were not in useas part of the repairs diagnostics procedure at the timethat these data were collected. Therefore there is no risk of false (forced) correlations between these signals andrepair outcomes.
Filtering.
With such a large number of units monitoredover a long period of time, data integrity issues invari-ably show up. Information can be lost or corrupted alongour collection pipeline. Therefore, some cleaning up of the data is necessary. In the case of missing values, theindividual values are marked as not available and thatspecific piece of data is excluded from the detailed stud-ies. Other records for that same drive are not discarded.In cases where the data are clearly spurious, the entirerecord for the drive is removed, under the assumptionthat one piece of spurious data draws into question otherfields for the same drive. Identifying spurious data, how-ever, isatrickytask. Becausepartofthegoalofstudyingthe data is to learn what the numbers mean, we must becareful not to discard too much data that might appearinvalid. So we define spurious simply as
negative countsor data values that are clearly impossible
. For exam-ple, some drives have reported temperatures that werehotter than the surface of the sun. Others have had neg-ative power cycles. These were deemed spurious andremoved. On the other hand, we have not filtered anysuspiciously large counts from the SMART signals, un-der the hypothesis that large counts, while improbable as
Leave a Comment