
HathiTrust is a Solution

The Foundations of a Disaster Recovery Plan for the Shared Digital Repository

This report presents recommendations made by Michael J. Shallcross, 2009 Digital Preservation Intern, University of Michigan School of Information

Executive Summary

This report seeks to establish the framework of a Disaster Recovery Plan for the HathiTrust Digital Library. While professional best practices and institutional needs have provided a clear mandate for HathiTrust's Disaster Recovery Program, common parlance has often obscured two prominent features of such initiatives. First, a Disaster Recovery Plan actually comprises a suite of documents which detail a range of issues, from crisis communications and the continuity of administrative activities to the restoration of hardware and data. Second, there is no conclusion to the planning process; it is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance.

The primary goal of the present document is to provide a foundation on which future planning efforts may build. To that end, it examines the strategies by which HathiTrust has anticipated and mitigated the risks posed by ten common scenarios which could precipitate a disaster:
o Hardware failure and data loss
o Network configuration errors
o External attacks
o Format obsolescence
o Core utility or building failure
o Software failure
o Operator error
o Physical security breach
o Media degradation
o Man-made as well as natural disasters

As this list reveals, a disaster within the digital repository refers not merely to data loss, the destruction of equipment, or damage to its environment, but to any event which has the potential to cause an extended service outage. For each scenario, the report discusses possible threats, summarizes the potential severity of related events, and then details the solutions HathiTrust has enacted, using direct quotations from the HathiTrust Web site and TRAC self-assessment, Service Level Agreements, and literature from service providers and vendors. Attached appendices provide relevant information and include contacts for important HathiTrust resources, an annotated guide to Disaster Recovery Planning references, and an overview of key steps in the Disaster Recovery Planning process.
The concluding section of the report provides recommendations and action items for HathiTrust as it proceeds with its Disaster Recovery Initiative. These are divided into Short (0-6 mos.), Intermediate (6-12 mos.), and Long-Term (12+ mos.) objectives and are arranged in a suggested order of accomplishment.
o Short-term goals include:
    o Describing the nature and extent of HathiTrust's insurance coverage
    o Testing and validation of current tape backup procedures
    o Improved physical and intellectual control over system hardware
    o Establishment, distribution, and maintenance of phone trees
    o Increased documentation of institutional knowledge
    o Identification of Disaster Recovery measures in place at the Indianapolis site
o Intermediate-term objectives focus on:
    o Creation of a Disaster Recovery Planning Committee
    o Initiation of the data collection and analysis essential to the creation of recovery strategies (This section provides a high-level breakdown of various tasks and includes the coordination of activities between the Ann Arbor and Indianapolis sites as well as with service providers and vendors.)
o Long-term action items deal with:
    o Completion and implementation of the suite of Disaster Recovery documents
    o Initiation of staff training and tests of organizational compliance
    o Storage of an additional copy of backup tapes at a remote third location
    o Investigation of an alternate hot site in Ann Arbor in the event a disaster renders the MACC unusable
    o Consideration of a third instance of the repository
    o Avoidance of vendor lock-in if a key supplier should go out of business

This report demonstrates that various risk management strategies, design elements, operating procedures, and support contracts have endowed HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a disaster. The establishment of the Indianapolis mirror site, the performance of nightly tape backups to a remote location, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. Unfortunately, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.


Acknowledgements

The author would like to thank Shannon Zachary for her encouragement and guidance; Cory Snavely and Jeremy York for their generous expenditure of time, energy, and knowledge; and Nancy McGovern and Lance Stuchell for access to their outstanding Disaster Recovery Planning resources. The following individuals have also been invaluable sources of advice, support, and information: John Wilkin, Bob Campe, Cyndi Mesa, Ann Thomas, John Weise, Larry Wentzel, Lara Unger Syrigos, Bill Hall, Emily Campbell, Sebastien Korner, Jessica Feeman, Phil Farber, Chris Powell, Cameron Hanover, Stephen Hipkiss, Tim Prettyman, Rene Gobeyn, and Krystal Hall. Thanks also to Dr. Elizabeth Yakel, Magia Krause, and Veronica and Cora Fambrough. The work in this report was made possible by an IMLS Grant.


Table of Contents

Executive Summary
Acknowledgements
Introduction
    o Goals for HathiTrust's Disaster Recovery Program
    o The Mandate for Disaster Recovery Planning in Digital Preservation
    o Disaster Preparedness in the Design and Operation of HathiTrust
    o Essential HathiTrust Business Functions
HathiTrust's Disaster Recovery Strategies
    o Basic Requirements for Disaster Recovery
    o Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites
    o Disaster Recovery Strategy #2: Nightly Automated Tape Backups
Scenario 1: Hardware Failure or Obsolescence and Data Loss
    o Review: Risks Involving Hardware Failure or Obsolescence and Data Loss
    o HathiTrust's Solutions for Hardware Failure and Data Loss
    o Redundant Components and Single Points of Failure in the HathiTrust Infrastructure
    o Key Features of HathiTrust's Isilon IQ Clustered Storage
    o Hardware Support and Service
    o Equipment Tracking
    o Hardware Replacement Schedule
    o Timeline for Emergency Replacement of HathiTrust Infrastructure
    o HathiTrust and Insurance Coverage at the University of Michigan
Scenario 2: Network Configuration Errors
    o Review: Risks Involving Network Configuration Errors
    o HathiTrust's Solutions for Network Configuration Errors
    o Extent of ITCom Support
    o ITCom Responsibilities
    o ITCom Services in Response to Outages or Degradation Impacting the Network
    o HathiTrust Responsibilities
Scenario 3: Network Security and External Attacks
    o Review: Risks Involving Network Security and External Attacks
    o HathiTrust's Solutions for Network Security
Scenario 4: Format Obsolescence
    o Review: Risks Involving Format Obsolescence
    o HathiTrust's Solutions for Format Obsolescence
    o Selection of File Formats
    o Format Migration Policies and Activities
Scenario 5: Core Utility and/or Building Failure
    o Review: Risks Involving Core Utility or Building Failure
    o HathiTrust's Solutions for Utility or Building Failure
    o General Maintenance and Repairs in University of Michigan Facilities
    o The Michigan Academic Computing Center (MACC)
    o Arbor Lakes Data Facility (ALDF)
Scenario 6: Software Failure or Obsolescence
    o Review: Risks Involving Software Failure or Obsolescence
    o HathiTrust's Solutions for Software Issues
Scenario 7: Operator Error
    o Review: Risks Involving Operator Error
    o HathiTrust's Solutions for Operator Error
    o Ingest
    o Archival Storage
    o Dissemination
    o Data Management
Scenario 8: Physical Security Breach
    o Review: Risks Involving a Physical Security Breach
    o HathiTrust's Solutions for Physical Security
    o Security at the MACC
    o Security at the ALDF
Scenario 9: Natural or Man-made Disaster
    o Review: Risks Involving a Natural or Man-made Disaster
    o HathiTrust's Solutions for Natural or Man-made Catastrophic Events
    o Basic Disaster Recovery Strategies
Scenario 10: Media Failure or Obsolescence
    o Review: Risks Involving Media Failure or Obsolescence
    o HathiTrust's Solutions for Media Failure
    o Remaining Vulnerabilities
Conclusions and Action Items
    o Conclusions
    o Short-Term Action Items
    o Intermediate-Term Action Items
    o Long-Term Action Items
APPENDIX A: Contact Information for Important HathiTrust Resources
APPENDIX B: HathiTrust Outages from March 2008 through April 2009
APPENDIX C: Washtenaw County Hazard Ranking List
APPENDIX D: Annotated Guide to Disaster Recovery Planning References
APPENDIX E: Overview of the Disaster Recovery Planning Process
APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard Service Agreement (2006)
APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)
APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)

**Appendices F-I are embedded PDF files.**


Introduction

In the realm of print libraries, a disaster is a fairly unambiguous event: it is a fire, a broken pipe, an infestation of pests; in short, anything which threatens the continued use and existence of texts or the environment in which they are stored. This basic definition may also be applied to the digital library, in which a disaster refers not merely to the loss of content or corruption of data, the destruction of equipment or damage to its environment, but to any event which has the potential to cause an extended service outage. This last part proves to be the greatest difference between the print and digital worlds because there are a great many threats which can leave data intact but incapacitate the primary functions of a digital library. The daily operation of an institution such as HathiTrust involves the anticipation and resolution of a variety of problems (crashed servers, software bugs, networking errors, etc.) which only rise to the level of a disaster when they exceed the capacity of normal operating procedures and/or the maximum allowable outage periods. Disaster Recovery Planning thus prompts us to develop robust strategies to mitigate and limit the effects of common problems and at the same time forces us to think the unthinkable. Nevertheless, confronting worst-case scenarios is a vital activity; the belief that an event will never happen simply because it has never happened is an invitation to the very disaster we seek to avoid. Herein lies a conundrum, in that the creation of detailed plans for every eventuality is nearly impossible and also impractical, since the results of such an endeavor would be needlessly complex as well as expensive. At its basis, then, Disaster Recovery Planning demands an astute assessment of risk so that we may weigh the costs of preparations and solutions against the costs of a potential event.

So where to begin? When the subject of Disaster Recovery Planning arises, common parlance often obscures two prominent features of such initiatives. First, a Disaster Recovery Plan actually comprises a suite of documents which detail a variety of related issues, from crisis communications and the continuity of administrative activities to the recovery of hardware and data and the restoration of core functions. Second, there is no conclusion to the planning process or a point at which a plan is "done"; there is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance. The essential first step is therefore a thorough knowledge of the organization, its goals, and its mandate for a Disaster Recovery Program so that later efforts can focus on the articulation of policies and the development of solutions. As a preliminary step in this effort, this report looks to establish a basic foundation from which future planning efforts may grow.

Goals for HathiTrust's Disaster Recovery Program

While a more formal statement of HathiTrust's goals and requirements for its Disaster Recovery Program must be elucidated, the repository's mission statement provides a good indication of its main objective in the formation of a Disaster Recovery Plan. As part of its aim "to contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge," HathiTrust seeks to "help preserve these important human records by creating reliable and accessible electronic representations."[1] This statement clearly joins the twin imperatives of preservation and access with an additional requirement: reliability. The development and implementation of a Disaster Recovery Plan will ensure that digital objects retain their authenticity and integrity over the long term and that partner libraries and designated users may rely on HathiTrust services (or their timely resumption) and content in the face of catastrophic events.
[1] HathiTrust. Mission & Goals (2009), retrieved from http://www.hathitrust.org/mission_goals on 8 July 2009.

20090824

The Mandate for Disaster Recovery Planning in Digital Preservation

HathiTrust's mandate for a comprehensive and proactive Disaster Recovery Plan stems from a number of significant sources, among which we may include its mission and goals. The Institutional Data Resource Management Policy (2008) of the University of Michigan's Standard Practice Guide also provides an impetus for the creation of a Disaster Recovery Program. While not necessarily inclusive of the Michigan Digitization Project materials stored in HathiTrust, this document underscores how important it is that "data resources be safeguarded [and] protected" and "contingency plans [...] be developed and implemented."[2] In its discussion of the latter point, the policy specifies that:

    "Disaster Recovery/Business Continuity plans and other methods of responding to an emergency or other occurrences of damage to systems containing institutional data [...] will be developed, implemented, and maintained. These contingency plans shall include, but are not limited to, data backup, Disaster Recovery, and emergency mode operations procedures. These plans will also address testing of and revision to disaster recovery/business continuity procedures and a criticality analysis."[3]

While data backup procedures and a host of risk management practices are already an integral part of HathiTrust's operation, the repository now looks to formalize the other strategies suggested by the Institutional Data Resource Management Policy. Beyond the example laid out by this document, HathiTrust's mandate for Disaster Recovery derives from the professional literature detailing best practices in the field of digital preservation. The Reference Model for an Open Archival Information System (OAIS) identifies Disaster Recovery as an essential component of its Archival Storage function and highlights the importance of such plans in achieving the goal of long-term preservation of a digital archive's holdings. As outlined in the OAIS document, the Disaster Recovery function "provides a mechanism for duplicating the digital contents of the archive collection and storing the duplicate in a physically separate facility."[4] HathiTrust has successfully met this requirement by performing nightly tape backups and establishing a mirror site at Indiana University in Indianapolis. The Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) is even more explicit in its requirement that repositories document their policies and procedures with "suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s)."[5] Professional best practices as well as internal needs and goals thus provide the mandate which underlies HathiTrust's development of a formal Disaster Recovery Plan.

Disaster Preparedness in the Design and Operation of HathiTrust

One of the primary goals of HathiTrust is to provide transparency in all of its operations, including its work to comply with digital preservation standards and review processes.[6] Nowhere is this commitment clearer than in its efforts to anticipate and mitigate risks which could threaten the contents and functions of the Shared Digital Repository. As a first step in addressing the disaster preparedness requirement in section C3.4 of the TRAC Criteria and Checklist,[7] this document serves two purposes. First, it provides an overview of the policies, procedures, resources, and contracts that enable HathiTrust to address the challenges and threats endemic to the field of digital preservation. Material is therefore cited directly from the HathiTrust Web site (http://www.hathitrust.org), the most recent version of HathiTrust's review of its compliance with the minimum required elements of the TRAC Criteria and Checklist,[8] and relevant literature provided by key vendors and service providers.[9] Second, this report examines HathiTrust's current level of disaster preparedness and defines current and forthcoming efforts in its development of a dynamic and proactive Disaster Recovery Program. Per the recommendations of the TRAC Criteria and Checklist, this document records the measures and precautions already in place with regard to specific types of disasters that could befall HathiTrust. These events include hardware failure, data loss, network configuration errors, external attacks, core utility failure, format obsolescence, software failure, physical security breach, and man-made as well as natural disasters. While a formal, written plan detailing individual roles and responsibilities in the repository's response to each of these scenarios is still forthcoming, the evidence gathered in this report reveals that crucial elements of a Disaster Recovery Plan are already in place within HathiTrust.[10]

Essential HathiTrust Business Functions

As the development of the Disaster Recovery Plan proceeds, it is important to bear in mind that its goal is not merely the restoration of hardware and data but also the recovery and continuity of essential repository functions. The following list represents core functions that need to be addressed by HathiTrust's Disaster Recovery Plan and as such should not be considered a comprehensive representation of the repository's functions. By directing planning efforts toward specific functions (rather than the organization's activities as a whole), HathiTrust may prioritize and focus its recovery responses and resources to ensure that the most essential functions go back online first. Subsequent discussions of Disaster Recovery strategies and risk management solutions in this report are presented under the assumption that the continuity of these functions is a primary objective. The prioritization of these functions remains to be determined by an appropriate authority.[11]

[2] University of Michigan. Institutional Data Resource Management Policy (2008), Standard Practice Guide, retrieved from http://spg.umich.edu/ on 8 July 2009.
[3] Ibid.
[4] Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (2002) p. 4-8.
[5] OCLC and CRL. Section C3.4, Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
[6] HathiTrust. Accountability (2009), retrieved from http://www.hathitrust.org/accountability on 25 June 2009.
[7] "Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s). The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan and the specific risks addressed need to be appropriate to the repository's location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building." OCLC and CRL. Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
[8] HathiTrust Digital Library Review of Compliance with Trustworthy Repositories Audit & Certification: Criteria and Checklist Minimum Required Elements, revised May 20, 2009. Available at http://hathitrust.org/documents/trac.pdf
[9] Contact information for relevant University of Michigan departments and service providers as well as for external vendors may be found in Appendix A.
[10] A list of resources related to disaster recovery and the planning process may be found in Appendix D (Annotated Guide to Disaster Recovery Planning References).
[11] This list of essential HathiTrust business functions was developed in conjunction with Jeremy York.


o Ingest
    o Ingest digital objects (SIPs) via GRIN, the Google Return Interface (or a modified ingest portal for local content)
    o Validate ingested content with GROOVE, the Google Return Object-Oriented Validation Environment (or a modified version for localized ingest)
o Archival Storage
    o Preserve digital objects and metadata (AIPs) indefinitely in the Shared Digital Repository (includes ensuring the integrity and authenticity of materials). This function addresses the needs of partner libraries as well as individual users.
    o Record changes to and actions on items while they are in the repository
    o Maintain a persistent object address for items within the repository
o Dissemination
    o Provide access to digital objects for users
    o Allow for text searches through a variety of fields
    o Enable large-scale full-text searches
    o Permit the creation of public and private content collections
    o Disseminate digital objects (DIPs) to users (via the page-turner access system and data API)
    o Distribute data sets and HathiTrust APIs to developers
    o Research and develop additional applications and resources for HathiTrust
o Administration
    o Provide transparent and up-to-date information to users and the general public via http://www.hathitrust.org/
    o Communicate information and coordinate activities amongst partner libraries and HathiTrust boards and committees
o Data Management
    o Update and manage the Rights and GeoIP databases
    o Build and maintain the Collection Builder and Large Scale Search Solr indexes
    o Determine appropriate user access to texts via database queries
    o Sync content with the Indianapolis site and back up content to tape


HathiTrust's Disaster Recovery Strategies

Basic Requirements for Disaster Recovery

Roy Tennant has identified three requisite components of a digital Disaster Recovery Plan: (1) the use of an effective data protection system (e.g., RAID), (2) redundant power and environmental systems, and (3) regular backup of information to tape and, ideally, to a remote mirrored site.[12] HathiTrust has incorporated all these elements into its design and operation. Its Isilon IQ storage cluster provides a high degree of data redundancy with its N+3 parity protection; the Michigan Academic Computing Center provides fully redundant power and environmental systems for HathiTrust infrastructure; and nightly tape backups and the replication of data to a fully operational mirror site located at Indiana University in Indianapolis (with the same levels of power and environmental conditioning) provide multiple copies as well as geographic distribution of content.
o "HathiTrust is intended to provide persistent and high availability storage for deposited files. In order to facilitate this, the initiative's technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage will be located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate Ann Arbor facility). Each of these storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel. Each separate storage system is also equipped with mechanisms to provide mirrored management and access functionality, and employ 100% data redundancy in an effort to prevent data loss."[13]

Details on parity protection and the HathiTrust server environment are available below (see Scenario 1 and Scenario 5, respectively).

Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites

HathiTrust's first line of defense in the event of a disaster is its hot mirror site in Indianapolis. While ingest of material is restricted to the Ann Arbor location, both sites possess two web servers, a MySQL database server, and an Isilon IQ storage cluster (currently composed of 21 nodes, each a server that combines central processing units with storage). During normal operations, this arrangement allows HathiTrust to balance a high volume of web traffic across both sites such that individual user requests may be handled by either site in a transparent manner. Should the tolerances for failure be exceeded at a site (as in a disaster situation), the failover capability built into the HathiTrust architecture enables the remaining site to provide access to the designated community without noticeable service disruptions. As noted in the May 2009 HathiTrust Update, with the full operation of both locations, "We are now ensuring that users do not feel the effects of single-site outages, such as routine maintenance, by taking advantage of site redundancy."[14] However, because ingest takes place only in Ann Arbor, the loss of key components there would inhibit the repository's ability to acquire new content.

HathiTrust utilizes Isilon Systems' SyncIQ application software to synchronize data at the Indianapolis site with newly ingested or updated material from the Ann Arbor site. The sync to Indianapolis runs on 24 separate subsets of the data, with a new subset starting every 2 hours except on Sundays. In other words, subset 1 runs at midnight on Monday, subset 2 runs at 2 a.m., and so on. A full pass through the 24 subsets thus takes 48 hours, and the Sunday pause can add up to another 24. The maximum time for data to be replicated from Ann Arbor to Indianapolis would therefore be three days plus the run time of the sync process (which tends to take less than three hours).[15]
o "SyncIQ is an asynchronous replication application that fully leverages the unique architecture of Isilon IQ storage to efficiently copy data from a primary cluster to one located at a secondary location."[16]
o "All nodes [in both the source and target Isilon IQ clusters] concurrently send and receive data during replication jobs in real time, without impacting users reading and writing to the system."[17]
o "A robust wizard-driven web-based interface is fully integrated into [Isilon's proprietary] OneFS management tool to control all the functionality, including scheduling, policy settings, monitoring and logging of data transferred and bandwidth utilization."[18]
o "Only files that have changed will be replicated to the target clusters. This will optimize transfer times and minimize bandwidth used."[19]
o "In the event the secondary system is not available due to a system or network interruption, the replication job will be able to roll back and restart at the last successful copy operation."[20]
o "Upon a critical failure or loss of network connection, an alert will be sent to all recipients configured to receive critical alerts."[21]

Disaster Recovery Strategy #2: Nightly Automated Tape Backups

HathiTrust's ability to recover from a disaster is also ensured by the nightly automated tape backups performed by the Tivoli Storage Manager (TSM) client application installed on the ingest servers connected to the HathiTrust storage cluster and managed by Michigan's ITCS TSM Group. The TSM Backup Service Standard Service Level Agreement[22] outlines the obligations and responsibilities of both the service provider and HathiTrust:
o "The progressive incremental methodology used by Tivoli Storage Manager only backs up new or changed versions of files, thereby greatly reducing data redundancy, network bandwidth and storage pool consumption as compared to traditional methodologies based on periodic full backups."[23]
o "ITCS is responsible for all of the central server hardware, tape hardware, networking hardware, and related components. ITCS is also responsible for hardware maintenance as well as software maintenance, administration, and security audits on the central (non-client) TSM servers." (TSM Backup Service SLA, sec. 4.1)
o "ITCS provides 7x24 on-call monitoring and support, and strives to keep the servers up in production at all times. The target uptime is 99.9% of the time. The TSM hardware design is modular and should allow us to take pieces out of service without affecting customers. Whenever possible, system maintenance will be performed during standard weekend maintenance windows as defined by ITCS." (sec. 4.2)
o "In an emergency, customers can contact tsm@beepage.itd.umich.edu (this will go to the on-call staff's pager in real time)." (sec. 4.6)
o "ITCS is responsible for physical security. Machine access audits, OS security, and network security on the TSM server end are also the responsibility of ITCS." (sec. 4.9)
o "The service [...] includes data compression, data encryptions, and data replication." (sec. 1.0)
o "ITCS will maintain at least two TSM sites and will mirror data between the sites to provide redundancy in the event of a disaster. Currently those sites are the Arbor Lakes Data Facility (ALDF) at 4251 Plymouth Rd. and the Michigan Academic Computing Center (MACC) located at 1000 Oakbrook Dr." (sec. 4.10)
o "Both facilities are secure, climate controlled sites designed and built for high available production services."[24]
o "In the event of a customer disaster with large-scale (a full server or more) data loss, ITCS will work with the customer to optimize the restore time to best of our ability. We will only be able to devote resources to the extent that other customers are not affected. Restoring large file servers (multiple Terabytes) can take several days. If customers want to minimize this amount of time to restore, we can purchase additional resources for this purpose. Contact us directly, and we'll work out a scenario with costing information. In the event of a MAJOR campus outage affecting a large number of customers, ITCS management will work with customers to determine how to prioritize customer restores." (sec. 4.11)
o "Disaster Recovery planning is the responsibility of the customer unit." (sec. 5.8)

Having established the main Disaster Recovery strategies employed by HathiTrust, we may now proceed to investigate the means by which it anticipates and mitigates the most common threats facing digital repositories.

[12] Tennant, Roy. Digital Libraries: Coping with Disasters. Library Journal, 15 November 2001. Retrieved from http://www.libraryjournal.com/article/CA180529.html on 13 July 2009.
[13] HathiTrust. Technology, retrieved from http://www.hathitrust.org/technology on 15 June 2009.
[14] HathiTrust. Update on May 2009 Activities (2009), retrieved from http://www.hathitrust.org/updates_may2009 on 2 July 2009.
[15] Snavely, Cory (Head, UM Library IT Core Services). Personal email on 13 July 2009.
[16] Backup and Recovery With Isilon IQ Clustered Storage, 2007, p. 11.
[17] Ibid.
[18] Ibid.
[19] Ibid.
[20] Ibid.
[21] Ibid.
[22] Please refer to Appendix F (TSM Backup Service Standard Service Level Agreement).
[23] IBM. IBM Tivoli Storage Manager: Features and Benefits (2009), retrieved from http://www-01.ibm.com/software/tivoli/products/storagemgr/features.html?S_CMP=rnav on 16 June 2009.
[24] Information Technology Central Services at the University of Michigan. Frequently Asked Questions about the TSM Backup Service (2009), retrieved from http://www.itcs.umich.edu/tsm/questions.php on 16 June 2009.
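The three-day worst case described under Disaster Recovery Strategy #1 can be checked with a short simulation. This is a sketch under stated assumptions, not a description of how SyncIQ is actually configured: it assumes that the 24 subsets run in strict rotation, that one subset starts every two hours, and that the rotation simply pauses on Sundays and resumes at midnight on Monday. The names `sync_starts` and `longest_gap` are hypothetical and were chosen for illustration only.

```python
from datetime import datetime, timedelta

SUBSETS = 24                 # number of data subsets, per the report
SLOT = timedelta(hours=2)    # one subset starts every two hours
SUNDAY = 6                   # datetime.weekday() value for Sunday


def sync_starts(week_start, weeks=2):
    """Return (subset_number, start_time) pairs for an assumed rotation.

    Assumption: subsets run in order 1..24, one per two-hour slot, and the
    rotation pauses on Sundays and picks up where it left off on Monday.
    """
    if week_start.weekday() != 0:
        raise ValueError("week_start must be a Monday")
    runs = []
    slot_index = 0
    current = week_start
    end = week_start + timedelta(weeks=weeks)
    while current < end:
        if current.weekday() != SUNDAY:
            runs.append((slot_index % SUBSETS + 1, current))
            slot_index += 1
        current += SLOT
    return runs


def longest_gap(runs):
    """Longest interval between two consecutive starts of the same subset."""
    last_start = {}
    worst = timedelta(0)
    for subset, start in runs:
        if subset in last_start:
            worst = max(worst, start - last_start[subset])
        last_start[subset] = start
    return worst


if __name__ == "__main__":
    schedule = sync_starts(datetime(2009, 7, 13))  # a Monday
    print("Longest gap between syncs of one subset:", longest_gap(schedule))
```

Under these assumptions the longest interval between two starts of the same subset is 72 hours, which agrees with the three-day figure quoted from Cory Snavely; the run time of the sync job itself comes on top of that interval.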


Scenario 1: Hardware Failure or Obsolescence and Data Loss

Review: Risks Involving Hardware Failure or Obsolescence and Data Loss

The following table highlights the various events which pose a risk to the hardware and data of HathiTrust. These threats may stem from flaws or malfunctions in the equipment itself or from external events that include physical security breaches and natural or man-made disasters. The arrangement of these potential risks reflects the relative severity of their respective consequences.


Severity: High Impact
Event: Loss at a single point of failure; an additional failure past tolerances when only one site is operational
    o Service is unavailable and cannot be restored until the component is repaired/restored

Severity: Moderate Impact
Event: Failure of a component past redundancy tolerance
    o System no longer has redundancy: additional loss or failure of components will result in loss of system. This is a particular problem if one site is already down.
    o Loss of db server (home of Rights db) or of both Web servers at a site will render that location inaccessible
    o Loss of four drives or nodes in either Isilon storage cluster will result in the loss of that instance. The cluster will be offline and unable to handle read or write requests; all traffic would have to be handled by the remaining site.
    o Loss of UM Arbor Lakes site would prevent performance of tape backups
    o Loss of UM MACC site would deprive IU site of data redundancy
    o Loss of ingest servers would prevent new content from entering repository

Severity: Low Impact
Event: Failure of redundant system components
    o Includes redundant components within each site as well as general redundancy between the IU and UM sites
    o HT infrastructure has been designed to avoid single points of failure and to ensure data and equipment redundancy
    o Service continues in an uninterrupted and transparent manner
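The severity levels in the table above can be expressed as a small model. The sketch below is a simplified reading of the table rather than HathiTrust's own tooling: the `Site` class, its field names, and the `severity` function are hypothetical, and the model ignores network switches, the tape backup facility, and other components. It treats a site as able to serve users when at least one web server and the database server are up and no more than three storage drives or nodes have been lost.

```python
from dataclasses import dataclass

# N+3 parity protection: a cluster tolerates three lost drives or nodes;
# a fourth loss takes that cluster offline (per the table above).
STORAGE_FAILURE_TOLERANCE = 3


@dataclass
class Site:
    """Hypothetical, simplified state of one HathiTrust site."""
    name: str
    web_servers_up: int = 2          # each site has two web servers
    db_server_up: bool = True        # one database server per site
    failed_storage_units: int = 0    # lost drives or nodes in the cluster
    ingest_servers_up: int = 0       # only Ann Arbor performs ingest

    def storage_online(self) -> bool:
        return self.failed_storage_units <= STORAGE_FAILURE_TOLERANCE

    def serves_users(self) -> bool:
        return (self.web_servers_up >= 1
                and self.db_server_up
                and self.storage_online())


def severity(ann_arbor: Site, indianapolis: Site) -> str:
    """Classify the current state as 'high', 'moderate', or 'low'.

    'low' also covers the case in which nothing has failed at all.
    """
    sites_up = sum(s.serves_users() for s in (ann_arbor, indianapolis))
    if sites_up == 0:
        return "high"        # no site can serve users
    if sites_up == 1 or ann_arbor.ingest_servers_up == 0:
        return "moderate"    # a site or the ingest function has been lost
    return "low"             # failures absorbed by redundant components


if __name__ == "__main__":
    ann_arbor = Site("Ann Arbor", ingest_servers_up=5)
    indianapolis = Site("Indianapolis", failed_storage_units=4)
    print(severity(ann_arbor, indianapolis))
```

In the usage example, the Indianapolis cluster has lost a fourth storage unit, so that site goes offline while Ann Arbor continues to serve users; the model therefore reports a moderate-impact event.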

HathiTrustsSolutionsforHardwareFailureandDataLoss ThethreatsfacedbyHathiTrustshardware(andassociatedapplicationsaswellasthedata storedtherein)arecomprisedofthefailureofredundantfeatures,failurethatexceedscomponents toleranceforredundancy,andsinglepointsoffailure.Whilethefailureofredundantcomponentsmay happenmorefrequently(i.e.,thelossofanindividualdrivewithintheIsilonIQcluster),suchlossesdo nothavealargeimpactontherepository;eventswhichcompromisesinglepointsoffailurewillhave muchgreaterconsequencesforthecontinuityofHathiTrustoperations.Atthesametime,whilea componentmayhaveredundancyononelevel(forexample,therearefiveserversdedicatedtoingest), thatcomponentsimultaneouslymaybeconsideredatahigherleveltobeasinglepointoffailure(i.e., becausetheingestserversarehousedinasinglechassis,theentireunitisvulnerabletoaneventsuch asafire).Thisdualityhighlightstheneedforvigilanceandforesightinmanagingtherepositorys infrastructure. BecauseHathiTrustreliesheavilyuponhardwaretofulfillitsmissionanddeliverservicestoits designatedcommunityofusers,theselectionofequipmentanddevelopmentofsystemarchitecture 20090824 8

hasaimedatminimizingthedangersposedbysinglepointsoffailurethroughtheintroductionof strategicredundancies.Thebasicmeansforavoidingthedisastrouseffectsofhardwarefailureordata losshavebeentheestablishmentoftheIndianapolismirrorsiteandthenightlybackupofcontentto tape.(Formoredetail,pleaserefertotheprecedingsection).Whilethesestrategiesaccountfor extraordinaryevents,HathiTrustsserverreplacementscheduleallowstherepositorytoanticipatethe resultsofnormalequipmentuseanddepreciation.Stepstosafeguardthelongtermfunctionalityof HathiTrusthavethereforebeencomplementedbyaconsiderationofbestpracticesfordisaster preparedness. RedundantComponentsandSinglePointsofFailureintheHathiTrustInfrastructure ThefollowingsectionsprovideageneraloutlineofHathiTrustsredundantcomponentsand singlepointsoffailure.Giventhecomplexityoftherepositorysinfrastructure,unknownor unanticipatedscenariosmayexist;futureDisasterRecoveryPlanningwillthusinvolveaperiodicreview ofkeyfeaturesandvulnerabilities. o SiteRedundancy:TheestablishmentofthemirrorsiteinIndianaprovidesHathiTrust withafullyredundantoperation.Becausebothinstancesprovidefullaccesstocontent inadditiontootherrepositoryfunctions,userswillnotexperiencealossordegradation ofserviceintheeventthatserviceislostfromonesite.KeyexceptionstoHathiTrusts siteredundancyarenotedbelow. o RedundantComponentsatEachSite:Thefollowingcomponentsprovideeachsitewitha toleranceunderwhichlimitedfailureswillnotdisruptmajorHathiTrustfunctionsand userservices. Webservers:eachsitehastwoserverssothatifonefails,theothermay continuetohandletraffic.ThesealsohosttheGeoIPdatabase. IsilonIQclusters:thecurrentconfigurationof21nodesfeaturesN+3parity protection;thisdataredundancypermitsthesimultaneousfailureof3driveson separatenodesorthelossofthreeentirenodeswithoutservicedegradation. Ingestservers:theAnnArborsitepossessesfiveserverssothatingestmay continue(albeitataslowerrate)intheeventofanyfailures. 
    - Large Scale Search (LSS) Solr index: currently housed on the web servers, but will soon be maintained on five new servers in Ann Arbor.
o Single Points of Failure: [25] These are components of a system which, if lost, will prevent the entire system from functioning. Even those components with wholly redundant peer devices (such as the web or ingest servers) may be considered single points of failure if they have exceeded their capacity to sustain losses (e.g., if one web server at a site has already been lost).
    - Single Points of Failure at the Component Level: Because only one of these components exists at each HathiTrust site, a loss will result in system failure.
        - MySQL database server: houses the rights database, ingest tracking database, and the Collection Builder Solr index
        - Server network switches
        - Outbound network switches
    - Single Points of Failure at the System Level: While any given component may have various degrees of internal redundancy (such as multiple power supplies or multiple drives) it might still fail as a whole and thus result in the loss of a particular instance of HathiTrust. The following are components located at each site which, while possessed of internal redundancies, are still subject to complete loss (as in the event of a fire) and may thus render a site inoperable.
        - Isilon IQ storage cluster: the entire cluster could be lost in a large-scale event. Additionally, the loss of a fourth drive or node will exceed the cluster's failure tolerance and result in a service disruption.
        - Web servers: should one fail, the remaining server will be a single point of failure.
        - Blade server chassis: since web, ingest, and database servers are housed in one chassis, the entire unit could potentially fail.
        - LSS index: in the near future, the servers in Ann Arbor will be the sole instance of the Large Scale Search index.
        - Mirlyn database and Mirlyn2 Solr index [26]: these are currently key components of the UM Library infrastructure; should these be unavailable, access to and use of HathiTrust will be compromised.

[25] Content in this section is courtesy of Cory Snavely (personal email from 13 July 2009).

Key Features of HathiTrust's Isilon IQ Clustered Storage

The Isilon IQ storage cluster stores and provides digital objects for HathiTrust's partner libraries and members of its designated community. The cluster provides a high degree of inherent redundancy, which gives both HathiTrust sites a considerable degree of tolerance in regards to the failure of various aspects of the storage units. As one example, Isilon's proprietary OneFS operating system permits the individual storage nodes, the individual servers that are the building blocks of the cluster, to function as coherent peers so that any one node knows everything contained on the other units in the cluster.
o "Isilon's OneFS operating system [...] intelligently stripes data across all nodes in a cluster to create a single, shared pool of storage." [27]
o "Because all files are striped across multiple nodes within a cluster, no single node stores 100% of a file; if a node fails, all other nodes in the cluster can deliver 100% of the files within that cluster." [28]
o "A distributed clustered architecture by definition is highly available since each node is a coherent peer to the other. If any node or component fails, the data is still accessible through any other node, and there is no single point of failure as the file system state is maintained across the entire cluster." [29]
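The striping behaviour described in the quotations above can be illustrated with a deliberately simplified sketch. It assumes a single XOR parity chunk, which tolerates the loss of only one node; it is not OneFS or FlexProtect, whose N+3 protection relies on stronger error correction that the sources quoted here do not specify. The function names and the four-node layout are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, data_nodes):
    """Split data into one chunk per data node and append one XOR parity chunk."""
    chunk_len = -(-len(data) // data_nodes)  # ceiling division
    padded = data.ljust(chunk_len * data_nodes, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(data_nodes)]
    return chunks + [xor_blocks(chunks)]


def rebuild(chunks, lost_index):
    """Recompute the chunk held by one failed node from all surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return xor_blocks(survivors)


def read_file(chunks, original_len):
    """Reassemble the file from the data chunks (the last chunk is parity)."""
    return b"".join(chunks[:-1])[:original_len]


if __name__ == "__main__":
    payload = b"HathiTrust digital object"
    stored = stripe(payload, data_nodes=4)   # hypothetical 4 data nodes + 1 parity
    stored[2] = rebuild(stored, lost_index=2)  # simulate replacing a failed node
    print(read_file(stored, len(payload)))
```

No single chunk holds the whole file, yet any one missing chunk can be recomputed from the others, which is the property the vendor literature attributes to striping with parity.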
[26] Mirlyn is the name of the University of Michigan's current Online Public Access Catalog, which is supported by the Aleph integrated library system. Mirlyn2 is a beta version of UM's recently implemented next-generation catalog, based on the VuFind platform, which will become the main library catalog on August 3, 2009.
[27] Isilon Systems, Inc. Isilon IQ OneFS Operating System (2009) retrieved from http://www.isilon.com/products/OneFS.php on 17 June 2009.
[28] Isilon Systems. Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems (2008) p. 7. "In computer data storage, data striping is the technique of segmenting logically sequential data, such as a single file, so that segments can be assigned to multiple physical devices. [...] if one drive fails and the system crashes, the data can be restored by using the other drives in the array." (http://en.wikipedia.org/wiki/Data_striping, retrieved on 16 August, 2009).
[29] Isilon Systems. Breaking the Bottleneck: Solving the Storage Challenges of Next Generation Data Centers (2008) p. 8


HathiTrust's Isilon IQ clusters ensure a high degree of data redundancy with their N+3 parity protection. N+3 provides triple simultaneous failure protection so that up to three drives on separate Isilon IQ nodes, or three entire nodes, can fail at the same time and all data will still be fully available.
o "Traditional RAID 5 parity protection results in data loss if multiple components fail prior to the completion of a rebuild. FlexProtect, in contrast, automatically distributes all data and error correction information across the entire Isilon cluster and with its robust error correction techniques efficiently and reliably ensures that all data remains intact and fully accessible even in the unlikely event of simultaneous component failures." [30]
o "Each file is striped across multiple nodes within a cluster, with [three] parity stripes for each data block." [31]

The file system may also perform a Dynamic Sector Repair (DSR) at the time of any file writing. If it encounters a bad disk sector, the file system will use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block elsewhere on the drive. The bad sector will be remapped by the drive so that it is never used again and the write operation will be completed.

The Isilon restriper is a meta-process/infrastructure that has four primary phases to help manage and protect data in the event that components of the cluster sustain a partial failure or malfunction. The processes run as background operations and do not require system downtime. [32] [33]
o FlexProtect repairs data (e.g., in the event of a drive loss) using parity.
    - "Isilon OneFS with FlexProtect can boast the industry-leading Mean Time to Data Loss (MTTDL) for petabyte clusters." [34]
    - "FlexProtect introduces state-of-the-art functionality, which rebuilds failed disks in a fraction of the time, harnesses free storage space across the entire cluster to further insure against data loss, and proactively monitors and preemptively migrates data off of at-risk components." [35]
o AutoBalance "rebalances the data in a cluster according to business rules, in real time, non-disruptively." [36]
    - "As soon as the [new or repaired] node is turned on and network cables are connected, AutoBalance immediately begins to migrate content from the existing storage nodes to the newly added node across the cluster interconnect back-end switch, rebalancing all of the content across all nodes in the cluster and maximizing utilization." [37]
[30] Isilon Systems, Inc. Isilon IQ OneFS Operating System (2009) retrieved from http://www.isilon.com/products/OneFS.php on 30 June 2009.
[31] Isilon Systems. Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems (2008) p. 7
[32] Isilon X-Series Specifications (product brochure)
[33] Information on the Isilon restriper comes from a personal email sent by Kip Cranford of Isilon Systems, Inc. on 1 June 2009.
[34] Isilon Systems. Data Protection for Isilon Scale-Out NAS (2009) p. 4
[35] Isilon Systems, Inc. Isilon IQ OneFS Operating System (2009) retrieved from http://www.isilon.com/products/OneFS.php on 15 June 2009.
[36] McFarland, Anne. Isilon Accelerates Delivery of Digital Content. The Clipper Group Navigator (2003).
[37] Isilon Systems. The Clustered Storage Revolution (2008) p. 13


o Collect cleans up orphaned nodes and data blocks to prevent fragmentation of data.
o MediaScan verifies disk sectors.
    - The function of MediaScan is to scan every block in the file system looking for bad disk sectors. If it encounters a bad sector, it will perform a Dynamic Sector Repair (DSR) and use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block somewhere else on the drive.
    - MediaScan periodically reviews data blocks and disk sectors that may not have been accessed, from a file level, in months or years and thereby helps to keep the drives as healthy as possible.
o As of the OneFS 5.0 release, all file system metadata can be checked by the IntegrityScan restriper phase. This process will allow HathiTrust to completely check file data and metadata via associated checksums.

Other instances of inherent redundancy include non-volatile RAM, a fully journaled file system, and software applications that manage client connections in the event of a node's failure.
o "OneFS is a fully journaled file system with large amounts of battery-backed non-volatile random-access memory (NVRAM) within each node, which ensures the integrity of the file system in the event of unexpected failures during any write operation." [38]
o "The Isilon SmartConnect module [ensures] that when a node failure occurs, all in-flight reads and writes are handed off to another node in the cluster to finish its operation without any user or application interruption. [...] If a node is brought down for any reason, including a failure, the virtual IP addresses on the clients will seamlessly fail over across all other nodes in the cluster. When the offline node is brought back online, SmartConnect automatically fails back and rebalances the NFS clients across the entire cluster to ensure maximum storage and performance utilization." [39]

Hardware Support and Service

HathiTrust equipment is covered by support and service agreements with its various vendors (Sun Microsystems, Dell, CDW-G, etc.). A good example of one such agreement is found in the Platinum support provided by Isilon Systems, which includes:
o Extended 24x7x365 Telephone & Online Hardware and Software Support
o 24x7 Proactive Monitoring & Alerts "Email Home" (for Hardware and Software)
o Return Parts to Factory for Repair and 4-hour Replacement Parts Delivery
o SupportIQ (Enhanced Serviceability Diagnostics) and System Event Tracking
o On-site Troubleshooting
o Isilon Hardware Installation
o Software Product Documentation, Release Notes, and access to Product Technical Notes
o Remote Diagnosis (Provided User Grants Access)
o Maintenance & Patch Releases
[38] Isilon Systems. Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems (2008) p. 9
[39] Isilon Systems. Data Protection for Isilon Scale-Out NAS (2009) p. 6


o Minor and Major Upgrade Releases (Includes Performance Improvements, New Features, Serviceability Improvements). [40]

Equipment Tracking

LIT Core Services (CS) maintains an inventory of servers on a wiki page accessible to its staff. Details include each server's name, location, online and retire dates, upgrades, notes on storage, and its primary service. Additional information is provided related to specifications, support contracts, and key contact information. The CS server inventory is currently out of date.

Hardware Replacement Schedule
o "HathiTrust replaces storage regularly, approximately every 3-4 years or as the usable life of storage equipment dictates" (HT TRAC C1.7)
o "HathiTrust staff upgrade hardware on a regular basis (i.e., every three or four years), and to help detect more rapid growth in demands, the web server and storage infrastructures have their own performance monitoring that indicate overload conditions." (HT TRAC C1.10)

Timeline for Emergency Replacement of HathiTrust Infrastructure

Should a serious event require the replacement of part (or all) of the HathiTrust technical infrastructure, the following timeline provides a general estimate of the time required to order, ship, and install new equipment. A cursory review of the time necessary for HathiTrust to recover from a major disaster at the main Ann Arbor or Indianapolis data centers suggests that a large event could idle an instance of the repository for at least a month and a half. In addition to the servers and switches mentioned above, critical components include four 30A power distribution units (PDUs) per rack and four racks per data center as of this writing.
o Submission of Purchase Orders:
    - For orders under $5,000, the M-Pathways application allows the University Library's business manager to send purchase orders directly to vendors.
    - For orders over $5,000, Procurement Services normally takes one to two business days to approve the purchase, but the process may take up to a week if questions arise or additional purchase information is needed.
o Delivery of Equipment:
    - Products the vendor has in stock and available for immediate shipment take 1-3 days to be delivered.
    - Items that need to be configured (such as servers) usually take 1-2 weeks.
    - Isilon storage will take 3 weeks to be delivered in a worst-case scenario.
o Installation:
    - 3 days FTE for Isilon IQ cluster in addition to the time required for other servers, switches, PDUs and rack units.
[40] Isilon Systems. Support Advantage Offerings (2009) retrieved from http://www.isilon.com/support/?page=plans on 30 June 2009.


o Data Restoration: about 0.5 TB/hour (15 days, as of June 2009) [41]
    - While HT has about 110 TB of data in its storage, the backup tapes maintained by the TSM Group contain roughly 176 TB of information due to the data encryption used to protect the intellectual rights of the material (as of 06/2009).
    - The length of time required for a bare-metal restoration will be influenced by tape mounts, network speed, restoring to the NFS shares, decryption, et cetera.
    - If the library/HT were to purchase an additional tape drive (at roughly $20,000), the process could be sped up, perhaps to about 1 TB/hour.
    - In the event of a large-scale disaster in which multiple campus units require extensive data restoration, the TSM Backup Service SLA states that "ITCS management will work with customers to determine how to prioritize customer restores." (sec. 4.11) This determination will reflect the University of Michigan's organizational priorities [42]:
        - Priority 1: Health and safety of faculty, staff, students, hospital patients, contractors, renters, and any other people on University premises.
        - Priority 2: Delivery of health care and hospital patient services
        - Priority 3: Continuation and maintenance of research specimens, animals, biomedical specimens, research archives.
        - Priority 4: Delivery of teaching/learning processes and services
        - Priority 5: Security and preservation of University facilities/equipment.
        - Priority 6: Maintenance of community/University partnerships.
o Fractional restores would, for the most part, run at comparable speeds unless there was a need to restore a large number of random files, in which case there would be a decrease in speed due to tape seek and mount times.
o Delays in recovery could be increased dramatically if the MACC data center or its infrastructure has sustained damage and needs repair.
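The figures in the timeline above can be checked with some simple arithmetic. The sketch below uses only numbers quoted in this section (176 TB on tape, restore rates of 0.5 and 1 TB/hour, and the worst-case durations for ordering, delivery, and installation of the Isilon cluster). It assumes, purely for illustration, that the steps run strictly one after another, that a week is seven calendar days, and that restoration runs around the clock; none of these assumptions is stated in the sources.

```python
def restore_days(tape_tb, tb_per_hour):
    """Days of continuous restoring needed to read tape_tb terabytes from tape."""
    return tape_tb / tb_per_hour / 24


# Figures quoted in the report (as of June 2009).
TAPE_TB = 176            # volume held on the TSM backup tapes
CURRENT_RATE = 0.5       # TB/hour with the existing tape drives
SECOND_DRIVE_RATE = 1.0  # TB/hour estimate if an additional drive were purchased

# Worst-case duration of each step in days. Running the steps back to back
# is an assumption of this sketch, not a statement from the sources.
worst_case_days = {
    "purchase order approval": 7,
    "Isilon delivery": 21,
    "Isilon installation": 3,
    "data restoration": restore_days(TAPE_TB, CURRENT_RATE),
}
total_days = sum(worst_case_days.values())

if __name__ == "__main__":
    for step, days in worst_case_days.items():
        print(f"{step}: {days:.1f} days")
    print(f"total: {total_days:.1f} days")
    print(f"restoration with a second drive: "
          f"{restore_days(TAPE_TB, SECOND_DRIVE_RATE):.1f} days")
```

Under these assumptions the restoration step alone accounts for roughly 15 days, and the steps together come to about a month and a half, which is consistent with the estimate given in the text.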

HathiTrust and Insurance Coverage at the University of Michigan

The Office of Financial Operations reviews and adds financial assets greater than $5,000 to the asset management system of the University of Michigan. The Property Control Office is then responsible for tagging financial assets with unique University of Michigan identifiers and tracking them. Risk Management Services administers the University's property insurance and will provide the reimbursement of replacement costs for items self-insured by Michigan. As of July 2009, the nature and extent of the University of Michigan's insurance coverage for HathiTrust hardware remained under review. The main contact with Risk Management Services in this matter has been Cyndi Mesa, Head of UM Library Finance.


[41] Hanover, Cameron (ITCS TSM Group Storage Engineer). Personal email on 23 June 2009.
[42] University of Michigan Administrative Information Services. Emergency Management, Business Continuity, and Disaster Recovery Planning (2007) retrieved from http://www.mais.umich.edu/projects/drbc_methodology.html on 6 July 2009.


Scenario 2: Network Configuration Errors

Review: Risks Involving Network Configuration Errors

The following table summarizes the risks facing HathiTrust as the result of network configuration errors. Consideration is given to network connections within UM data centers as well as at UM's Hatcher Graduate Library (site of key administrative and development activities). The arrangement of these events reflects the relative severity of their respective consequences.

Severity levels: High impact, Moderate impact, Low impact
Events (arranged by severity, highest first):
o Loss of server network switch or outbound network switch
o Loss of access to UMnet Backbone
o Extended loss of power at Hatcher Library could lead to loss of local servers and disruption of administrative and operational activities.
o Loss of power that threatens ability to connect to Local Area Network (LAN)/Backbone
    - The library remains (for now) a priority recipient of electricity from the UM power plant
    - Campus data centers have UPSs and redundant backup power
o Failure of local/server-side connections
    - Should problems arise with connections to individual nodes, the clustered architecture of the Isilon system will allow read/write requests to be handled by alternate nodes.
    - If connections fail at one HT site, traffic can be handled by the remaining site.

HathiTrust's Solutions for Network Configuration Errors

HathiTrust's continued access to the Internet via the UMnet Backbone is essential for its continued provision of service. The repository receives network infrastructure maintenance through UM's ITCS/ITCom; with its robust disaster planning in addition to the lessons learned from the Midwest blackout of 2003, ITCom guarantees continued network access in all but the most catastrophic scenarios. In the event of a widespread power outage, HathiTrust would be able to maintain access to the UMnet Backbone since data centers are equipped with redundant power supplies and the Hatcher Graduate Library is currently categorized as a priority recipient of power from the university. ITCS also has 17 generators which can be used to maintain power to network switches in the event of a blackout. The responsibilities and obligations of both parties are outlined in the Customer Network Infrastructure Maintenance Service Agreement. [43]

Extent of ITCom Support
o "ITCom agrees to provide the Unit Network Infrastructure Maintenance to include data switches, routers, access points, hubs, uninterruptible power supplies (UPSs), firewalls, and other identified and agreed upon components." (ITCS sec. 1.0)
[43] Please refer to Appendix G (ITCS/ITCom Customer Network Infrastructure Maintenance Service Agreement).


ITCom Responsibilities
o "Provide and maintain the necessary materials and electronic components to operate the Unit Network Infrastructure." (sec. 5.2)
o "Provide configuration and Network Infrastructure Administration support necessary to repair and maintain the Unit Network Infrastructure hardware and software covered by this agreement." (sec. 5.3)
o "Monitor 24 hours/day and 365 days/year (24x365), supported protocols to the backbone interface of the Unit's network up to and including the extension to the first hub or switch." (sec. 5.6)
o "Monitor 24 hours/day and 365 days/year (24x365), network interfaces on uninterruptible power supplies (UPS) that support the Unit network switches. Provide notification in the event that a UPS is activated, (input power is lost or degraded and system switches to battery power), deactivated, (input power is restored), or unreachable. Provide notification to the Unit Network Administrator when batteries degrade to the point of needing replacement." (sec. 5.7)
o "Provide maintenance on the station cabling as installed by ITCom, or an approved UM vendor which met ITCom installation specifications." (sec. 5.8)
o "Provide Preventative Maintenance (clean & vacuum) on each Customer Unit switch covered in this agreement yearly." (sec. 5.9)

ITCom Services in Response to Outages or Degradation Impacting the Network
o "A response within 30 minutes of the ITCom NOC notification or the Unit's call, to provide information to the Unit on specific steps that have been/will be taken to resolve the problem." (sec. 7.2.1)
o "An on-site visit, if necessary, within two (2) hours of the response (i.e., the maximum on-site response time will be two and a half (2 1/2) hours). An update will be provided to the Unit Network Administrator if on-site and a best guess ETR will be provided based on available facts. ITCom will continue to provide the Unit with updates every two hours during an outage." (sec. 7.2.1)
o "If an outage is identified within the agreement service hours ITCom will resolve the outage even if the repair time extends beyond the service agreement hours." (sec. 7.2.1) (Repairs outside of the agreement hours result in additional labor expenses.)
o "Conduct monitoring via SNMP POLLING at one-minute intervals." (sec. 7.2.1)

HathiTrust Responsibilities

ITCom's responsibilities end at the first network switch; from there to its servers, HathiTrust is responsible for maintaining network connectivity and security. The repository uses Internet2 for communication and synchronization between the Ann Arbor and Indianapolis sites. Each Isilon node has dual 10 GB InfiniBand ports for internal (i.e., intra-cluster) communication and dual 1 GB Ethernet for external communication.

Scenario 3: Network Security and External Attacks

Review: Risks Involving Network Security and External Attacks

The following table gives a general overview of the basic threat an external attack or network security breach poses to HathiTrust; entries are arranged by severity. The list, however, is not exhaustive and no attempt has been made to publicize potential vulnerabilities.

Severity levels: High impact, Moderate impact, Low impact
Events (arranged by severity, highest first):
o Unauthorized access to HathiTrust content leads to the infringement of copyrights.
o Loss of data or functionality for an extended period of time as a result of malicious activity.
o HathiTrust services are temporarily unavailable as a result of malicious activity.
o The delivery of HathiTrust services slows as the result of malicious activity.
o A security weakness exists within the system but remains unexploited.

HathiTrust's Solutions for Network Security

Malicious activity against HathiTrust could involve unauthorized access to a system or data, denial of service, or unauthorized changes to the system, software, or data. As an academic entity, the repository is seen as less of a target for such actions than commercial or governmental targets; despite this perceived lower risk, HathiTrust has not been lulled into a false sense of security. The repository takes seriously the potential for violations of its network and operating system security and therefore has instituted a program of periodic software updates in addition to the maintenance of an ITCom-supported firewall, authentication-required access, and other measures (such as throttling software to deter denial of service attacks). Because content is currently accepted from trusted sources (namely, Google and legacy digital collections from HathiTrust partners) the GROOVE process does not include a virus detection phase. As digital objects are ingested from a greater number of sources, additional security measures should be considered.
o "HathiTrust staff apply security updates to the operating system and to networking devices as soon as they become available in order to minimize system vulnerability. As with new software releases, security updates are tested in a development environment before being released to production. Software packages that present a lower security risk and that have a greater potential to affect application behavior (web servers, language interpreters, etc.) are generally installed, configured and tested manually to allow for greater control in managing updates. Software updates are not applied automatically; moreover, updates that present a potential for having an impact on system behavior are applied and tested first in the development environment. If no impacts are seen, HathiTrust staff apply these updates in production after a testing period of at least one week." (HT TRAC C1.10)
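The section above mentions throttling software as one deterrent against denial of service attacks but does not identify the product or algorithm in use. As a generic illustration only, the following sketch shows a token bucket, one common throttling technique; the class name, rate, and capacity values are hypothetical and are not drawn from HathiTrust's configuration.

```python
import time


class TokenBucket:
    """Allow short bursts while capping the sustained request rate."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=5)  # hypothetical per-client limits
    print([bucket.allow() for _ in range(7)])
```

In practice one bucket would typically be kept per client address, so that a single abusive client exhausts only its own allowance.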


Scenario 4: Format Obsolescence

Review: Risks Involving Format Obsolescence

The following table outlines the threats posed by format obsolescence and arranges them according to their potential severity.

Severity levels: High impact, Moderate impact, Low impact
Events (arranged by severity, highest first):
o Applications and hardware are no longer able to read or display digital objects.
o Errors in translating and reading files are not understood or acknowledged by repository users.
o Problems with the translation of file formats result in DIPs that do not faithfully reflect the original digital objects.
o Formats and associated applications change but retain compatibility with older versions of the file formats.

HathiTrust's Solutions for Format Obsolescence

An awareness and acknowledgement of the dangers of format obsolescence has led HathiTrust to implement proactive policies and procedures to ensure long-term access to the repository's content. The repository only accepts specific formats that meet rigorous specifications and, through the prior experience of University of Michigan personnel, has developed protocols for the successful migration of content from one format to another. In addressing the threat of format obsolescence, the preservation of the integrity and authenticity of deposited content has been an overarching concern.

Selection of File Formats
o "HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance and layout of materials digitized for deposit. HathiTrust stores and preserves metadata detailing the sequence of files for the digital object. HathiTrust has extensive specifications on file formats, preservation metadata, and quality control methods, included in the University of Michigan digitization specifications, dated May 1, 2007. [44]" (HT TRAC B1.1)
o "HathiTrust currently ingests only documented acceptable preservation formats, including TIFF ITU G4 files stored at 600 dpi, JPEG or JPEG2000 files stored at several resolutions ranging from 200 dpi to 400 dpi, and XML files with an accompanying DTD (typically METS). HathiTrust supports these formats because of their broad acceptance as preservation formats and because the formats are documented, open and standards-based, giving HathiTrust an effective means to migrate its contents to successive preservation formats over time, as necessary. The Repository Administrators have undertaken such transformations in the past; moreover, HathiTrust offers end-user services that routinely transform digital objects stored in HathiTrust to presentation formats using many of the widely available software tools associated with HathiTrust's preservation formats. HathiTrust gives attention to data integrity (e.g., through checksum validation) as part of format choice and migration." [45]
o "Each format conforms to a well-documented and registered standard (e.g., ITU TIFF and JPEG2000) and, where possible, is also non-proprietary (e.g., XML)." (HT TRAC B4.2)

[44] Specifications are available at http://www.lib.umich.edu/lit/dlps/dcs/UMichDigitizationSpecifications20070501.pdf

Format Migration Policies and Activities
o "HathiTrust is committed to migrating the formats of materials created according to [its] specifications as technology, standards, and best practices in the digital library community change." (HT TRAC B1.1)
o "HathiTrust staff members conduct migrations from one storage medium to another using tools that validate checksums internally. (Digital objects are stored both online and on tape, and the online storage system conducts regular scans to detect and correct data integrity problems.) A total file count is done following a large data transfer, and regularly scheduled integrity checks follow." (HT TRAC C1.7)
o "[HathiTrust] has migrated large SGML-encoded collections to XML, and Latin-1 character encodings to UTF-8 Unicode. Our success in migrating from older formats to newer formats demonstrates our commitment to our collections and our ability to keep materials in our repository viable. All migrations are documented in change logs." (HT TRAC B4.2)
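The migration safeguards quoted above (checksum validation and a total file count after a large transfer) can be sketched in a few lines. This is a generic illustration rather than HathiTrust's actual tooling: the choice of SHA-256, the function names, and the directory layout are assumptions, since the sources do not name the checksum algorithm or the tools involved.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_transfer(source_root, target_root):
    """Compare file counts and checksums between a source and a target tree."""
    source, target = manifest(source_root), manifest(target_root)
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing": sorted(set(source) - set(target)),
        "mismatched": sorted(
            name for name in source
            if name in target and source[name] != target[name]
        ),
    }
```

A transfer would be considered clean under this sketch when the two counts agree and both the `missing` and `mismatched` lists are empty; the same manifest could be stored and recomputed later as a scheduled integrity check.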

[45] HathiTrust. Preservation (2009) retrieved from http://www.hathitrust.org/preservation on 16 June 2009.


Scenario 5: Core Utility and/or Building Failure

Review: Risks Involving Core Utility or Building Failure

The following table summarizes the dangers a utility or building failure poses to HathiTrust and ranks events by their potential severity.

Severity levels: High impact, Moderate impact, Low impact
Events (ranked by severity, highest first):
o Extensive structural damage renders the MACC (or key elements of its infrastructure) unusable and necessitates the establishment of a hot site to recover and continue operations.
o Additional failure past tolerance in backup cooling or power infrastructure
o Failure of backup power past redundancy tolerance (failure of 2 generators)
    - Data center coordinator may initiate load shed and shut down half of the MACC (but library racks will remain operational)
o Structural damage renders facility temporarily unsafe and/or unusable.
o Loss of power
o Loss of environmental control units within redundancy

HathiTrust's Solutions for Utility or Building Failure

The continued delivery of HathiTrust's services depends upon the maintenance of power, environmental control, and security in its server environment at the Michigan Academic Computing Center (MACC) and other locations that host components of the repository. In this respect, HathiTrust is heavily reliant upon the infrastructure of the MACC as well as that of the Arbor Lakes Data Facility, home to one instance of the TSM Group's backup tape library. Both locations provide closely monitored and highly redundant environments that help ensure that HathiTrust's infrastructure remains secure and operable. At the same time, administrative and data management functions critical to the development and maintenance of the repository take place in the University of Michigan's Hatcher Graduate Library. The service and cooperation of Michigan's Plant Operations Division are therefore critical for the continued access to and use of this structure in the operation of HathiTrust.

General Maintenance and Repairs in University of Michigan Facilities

Facilities and maintenance issues on the University of Michigan campus are reported to the Plant Operations Division, the Department of Public Safety (DPS), and Occupational Safety and Environmental Health (OSEH) in addition to the impacted facility's manager. Repair work is coordinated by the University Library facilities manager in conjunction with administrators and workers from Plant Operations.

The Michigan Academic Computing Center (MACC)

The MACC hosts many of the key components of Michigan's University Library system as well as the technical infrastructure of HathiTrust. The University of Michigan does not own the building in which the data center is located but instead operates the MACC in conjunction with the Michigan Information Technology Center (MITC) Foundation and other partners. The MACC Server Hosting Service Level Agreement [46] lists the responsibilities of the data center as well as the repository; of particular significance are the MACC's agreements to:
o "Provide a controlled physical environment to support servers [with] room average temperature of between 65 and 75 degrees and 35-50% relative humidity [and] monitored environmentals (temperature, humidity, smoke, water, electrical)." (sec. 4.1)
o "Provide adequate, conditioned, 60-cycle electrical service with adequate backup electrical capacity to support circuits, service, and outlets [and also to] provide Uninterruptible Power Supply (UPS) and generator backup" (sec. 4.2)
o "Provide 7x24 telephone contact for emergencies and for emergency access to facility." (sec. 4.4)

In addition to features such as redundant electrical and environmental systems, the MACC maintains a full-time coordinator and staff who provide 24x7 responses to failures or malfunctions in the server environment. Alerts prompted by issues with the environmental systems or power are sent to the University of Michigan Network Operations Center (NOC) during non-business hours.
o Overview:
    - "The MACC's redundancy is designed to ensure the safety and security of the data housed within. It consists of:
        - A dual power path from the property line to the power distribution units
        - Diesel-powered generators for electrical backup
        - Flywheels (not batteries) to provide power while the generators come on
        - State-of-the-art generators and flywheels for backup power
        - Three extra computer room air conditioners
        - Two extra dry coolers
        - Glycol loop for cooling with two parallel pathways with crossover valves at regular intervals." [47]
    - "A state-of-the-art monitoring system keeps track of 1,700 different parameters and automatically notifies staff of any irregularity." [48]
o Environmental Controls and Monitoring
    - "The MACC has 18 Computer Room Air Conditioning units (CRACs). At any given time, only 15 are necessary to maintain the required temperature and humidity. [Thus, the computer room has N5+1 redundancy in its cooling ability.] It also is equipped with a number of portable coolers to address specific cooling needs. The heat from the room is transferred to an under-floor glycol loop that releases the heat to the outdoors." [49]
[46] Please refer to Appendix H (MACC Server Hosting Service Level Agreement).
[47] Michigan Academic Computing Center. Vital Statistics (2009) retrieved from http://macc.umich.edu/about/vitalstatistics.php on 16 June 2009.
[48] Michigan Academic Computing Center. Michigan Academic Computing Center (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
[49] Michigan Academic Computing Center. Vital Statistics (2009) retrieved from http://macc.umich.edu/about/vitalstatistics.php on 16 June 2009.


    - "The layout of the facility allows the front on the computer racks to be facing the cold aisles. These aisles have perforated floor tiles through which the cool air is pumped directly to the computers located there. Heat is discharged from the backs of the computers, which creates the hot aisles. This alternating arrangement facilitates the cooling process, as the hot air produced by the computers can be siphoned off before it mingles too much with the cooler air of the facility." [50]
    - "Two separate smoke detection and fire alarm systems protect the MACC. One is for the building; the other is for the MACC itself. The two systems work together to activate alarm systems and notify the fire department and key personnel. In the event of an actual fire, the fire suppression system pipes will not fill with water unless there is a pressure drop caused by melting of one or more of the sprinkler heads." [51]
o Backup Power
    - "Three generators, each roughly the size of a rail car, provide backup power. Only two of the three are required to run the facility in the event of a power outage." [52]
    - "The MACC uses environmentally responsible flywheels instead of batteries for power backup while the generators come online. The combination of generators and flywheels provides the facility with a fully redundant uninterruptible power system (UPS)." [53]
    - The MACC has a contract with the UM Plant Operations Division for the delivery of diesel fuel for its generators in the event of an extended blackout. [54]
    - In the event that a backup generator is disabled, the MACC coordinator will initiate load shed, in which one half of the MACC will be shut down so that the other half (and requisite environmental systems) may continue to operate. The HathiTrust and UM Library racks are among those which will retain power should this response prove necessary. [55]

Arbor Lakes Data Facility (ALDF)

The ALDF houses the TSM Group's infrastructure and one instance of the backup tape library that forms an integral part of HathiTrust's Disaster Recovery strategy. As the home of critical components of the UMnet Backbone, the ALDF provides a safe and secure location for one set of the repository's backup tapes. In the interest of security, this report will omit further information on the exact nature of the facility's power and environmental systems.
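The MACC redundancy figures quoted in this section (three generators of which two are required, and 18 CRAC units of which 15 are required) lend themselves to a simple tolerance check. The sketch below is illustrative only; the class, its field names, and the status labels are hypothetical and do not describe the MACC's monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundantGroup:
    """A set of identical units of which only some are required to operate."""

    name: str
    installed: int
    required: int

    @property
    def spare(self):
        """Number of units that can fail before capacity falls short."""
        return self.installed - self.required

    def status(self, failed):
        """Classify the group's condition for a given number of failed units."""
        if failed > self.spare:
            return "tolerance exceeded"
        if failed == self.spare:
            return "no remaining redundancy"
        return "within tolerance"


# Figures taken from the MACC descriptions quoted in this section.
generators = RedundantGroup("backup generators", installed=3, required=2)
cracs = RedundantGroup("CRAC units", installed=18, required=15)

if __name__ == "__main__":
    for group, failed in [(generators, 1), (generators, 2), (cracs, 2), (cracs, 4)]:
        print(f"{group.name}: {failed} failed -> {group.status(failed)}")
```

The same check applies to any component described in this report by an installed count and a required count, which makes it a convenient way to flag when a nominally redundant group has become a single point of failure.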
[50] Ibid.
[51] Ibid.
[52] Michigan Academic Computing Center. Michigan Academic Computing Center (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
[53] Ibid.
[54] Gobeyn, Rene (MACC Data Center Coordinator). Personal interview on 23 June 2009.
[55] Ibid.


Scenario6:SoftwareFailureorObsolescence Review:RisksInvolvingSoftwareFailureorObsolescence Thefollowingtabledetailsvariousrisksinherenttosoftwarefailureorobsolescenceandranks themaccordingtotheirseverity.


Severity levels: High Impact, Moderate Impact, Low Impact
Events (in descending order of severity):
o Software bug escapes detection in development environment and results in crash of application.
o Software bug escapes detection in development environment and prevents full access to digital objects.
o Improper version of software is introduced to system (could have a greater or lesser impact depending on results of error and repository's ability to detect it).
o Software bug escapes detection in development environment and prevents full use of system capabilities (i.e., rotation of images or additional functionality).

HathiTrust's Solutions for Software Issues
The development and use of HathiTrust's tools and resources depends on highly functional software applications. Repository policies have therefore been crafted to ensure that these applications are thoroughly tested and regularly updated to minimize the threat of service outages as a result of software failure or obsolescence. HathiTrust furthermore employs open source applications that are well supported and enjoy widespread use and development within the digital library community.
o Changes in software releases of all components of the system (from ingest to access) are developed and tested in an isolated development environment to prepare for release to production. When ready for release, developers record the changes made and increment version numbers of system components as appropriate using a version control system. New versions of software are released using automated mechanisms (in order to prevent manual errors). Major changes and upgrades in hardware architecture are recorded in monthly reports of unit activity, and thus are traceable to that level of detail. (HT TRAC C1.8)
o Additionally, subsets of production data are available in the development environment to allow developers to ensure proper system behavior before releasing changes to production. (HT TRAC C1.9)
o In order to design, build and modify software for the designated end user community, HathiTrust conducts an active usability program and seeks input from the Strategic Advisory Board of HathiTrust. Similarly, with regard to software development in support of the archiving needs of the Participating Libraries, HathiTrust focuses on the development of highly functional ingest and validation mechanisms. HathiTrust also seeks and responds to guidance from the Strategic Advisory Board with regard to archiving services. (HT TRAC C2.2)


Scenario 7: Operator Error
Review: Risks Involving Operator Error
The following table summarizes risks to HathiTrust posed by operator error; events are ranked according to their potential severity.


Severity levels: High Impact, Moderate Impact, Low Impact
Events (in descending order of severity):
o Operator error results in the irreparable loss of data or damage to equipment.
o Operator error results in loss of key repository functions (ingest, storage, dissemination, etc.) for an extended period of time.
o Operator error remains undetected and causes persistent problems in the system but has no long-term consequences.
o Operator error is detected by normal procedures or via an activity log and can be readily corrected.

HathiTrust's Solutions for Operator Error
In any human enterprise, occasional operator error is unavoidable; HathiTrust strives to ensure that any such events are detected and resolved in a timely fashion.56 To help avoid occurrences and mitigate their potential impact, HathiTrust has automated many procedures and also relies upon application assertions, which can notify administrators when processes are not operating correctly. Even if an error is introduced to the file system and then backed up, the TSM client saves up to seven versions of a file for up to six months so that an earlier version can be retrieved.

Ingest: The Google Return (Object-Oriented) Validation Environment (GROOVE) process is entirely automated to avoid the introduction of operator error to the process; steps include:
o Identification of material for ingest
o Decryption and unzipping of files
o Format verification and validation with JHOVE
o Lun barcode and MD5 checksum validation
o Creation of HathiTrust METS documents
o Establishment of HathiTrust handles (persistent URLs)
o Extension of the pairtree file directory (as new material enters the system)

Archival Storage: Files stored within the repository are not accessed directly or manipulated by staff so that neither the zipped image and OCR files nor the METS document may be accidentally altered or deleted.

Dissemination: The page-turner application references the stored image and then creates a .png (for TIFFs) or .jpg (for JPEG 2000s) file for display to the viewer.

Data Management: New versions of software are released using automated mechanisms (in order to prevent manual errors). (HT TRAC C1.8)
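The retention rule described above (up to seven versions of a file, kept for up to six months) can be illustrated with a small sketch. This is not the TSM client's actual algorithm or configuration; the class and function names are hypothetical, and "six months" is approximated as 183 days purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_VERSIONS = 7                # "up to seven versions", per the report
MAX_AGE = timedelta(days=183)   # "up to six months"; the 183-day figure is an assumption


@dataclass(frozen=True)
class FileVersion:
    """One backed-up copy of a file (hypothetical record, not a TSM structure)."""
    backed_up_at: datetime
    label: str


def retained_versions(versions, now):
    """Return the versions such a policy would keep, newest first."""
    recent = [v for v in versions if now - v.backed_up_at <= MAX_AGE]
    recent.sort(key=lambda v: v.backed_up_at, reverse=True)
    return recent[:MAX_VERSIONS]


def latest_version_before(versions, now, error_time):
    """Return the newest retained version backed up before an error was
    introduced, or None if every retained version postdates the error."""
    for version in retained_versions(versions, now):
        if version.backed_up_at < error_time:
            return version
    return None
```

The sketch also shows the limit of this safeguard: with nightly backups, an error that goes unnoticed for more than seven backup cycles would leave no clean version among those retained.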
56 Please refer to Appendix B (HathiTrust Outages from March 2008 through April 2009).


Scenario 8: Physical Security Breach
Review: Risks Involving a Physical Security Breach
Maintaining the physical security of the HathiTrust infrastructure is yet another crucial element in the repository's efforts to manage risks and thereby lessen the chance that a disaster-type event occurs. Risks involve the damage and destruction of equipment and could even extend to unauthorized system access. Multiple levels of security exist at both the Michigan Academic Computing Center (MACC) and the Arbor Lakes Data Facility (ALDF) to protect HathiTrust from the acts of vandalism, destruction or malicious tampering. Details on the potential impacts of a physical security breach are covered in Scenario 1: Hardware Failure and Scenario 3: Network Security.

HathiTrust's Solutions for Physical Security
o Each of [the HathiTrust] storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel.57

Security at the MACC
The MACC Server Hosting SLA states the data center staff will:
o Provide services necessary to maintain a safe, secure, and orderly environment for all tenants of the MACC. (sec. 4.7)
o Provide access control via HiD card and biometric readers for those listed on the Tenant Staff Authorized for Access list. (sec. 4.5)
The MACC Web site and the Michigan Academic Computing Center Operating Agreement58 provide additional details concerning the resources and procedures that help protect HathiTrust's equipment at the MACC. The MACC Data Center Coordinator personally oversees the enforcement of security protocols and conducts regular audits of security logs and, when necessary, reviews surveillance video footage.
o Security Systems
  State-of-the-art security devices such as iris scanners, cameras, closed circuit television and on-call staff keep the data and machines housed in the MACC safe.59
  Access to the data center will be by two-factor authentication (access card and iris scan) or escorted, supervised access. Access to the building will be by access card. (MACC OA, sec. 5.3.1)
  Cameras throughout the corridor, security trap, and facility will be monitored and maintained by the Data Center Coordinator. (sec. 5.2.1)
o Security Procedures
57 HathiTrust. Technology (2009) retrieved from http://www.hathitrust.org/technology on 15 June 2009.
58 Please refer to Appendix I (Michigan Academic Computing Center Operating Agreement).
59 Michigan Academic Computing Center. Vital Statistics (2009) retrieved from http://macc.umich.edu/about/vitalstatistics.php on 17 June 2009.


  The Operations Advisory Committee will establish procedures for granting access cards to the facility to those whose jobs require hands-on access to systems. All requests for access cards will be vetted and approved by the Operations Advisory Committee at their next meeting. (sec. 5.3.2)
  Everyone on the access list for the data center will be required to attend a training session before working in the data center and sign an access agreement stating policies they must observe while in the data center. (sec. 5.3.8)

Security at the ALDF
As noted in the TSM Backup Service SLA, the University of Michigan's ITCS is responsible for physical security at the ALDF. (sec. 4.9) While this document will not detail specific features of the ALDF's operation, multiple levels of security and oversight are employed.


Scenario 9: Natural or Manmade Disaster
Review: Risks Involving a Natural or Manmade Disaster
The following table details the risks to HathiTrust posed by a natural or manmade disaster; events are ranked by order of their severity. Due to possible overlap between this scenario and Scenario 1 (Hardware Failure), readers are encouraged to consult that earlier section.


Severity levels: High Impact, Moderate Impact, Low Impact
Events (in descending order of severity):
o Widespread damage to a data center and/or its infrastructure that forces an instance of the repository to find a new hot site with sufficient power supply, environmental controls, and security.
o Damage to work areas forces staff to relocate to a new center of operations.
o Extensive loss or damage to hardware requires large-scale replacement.
o With the extended loss of one site, HathiTrust loses redundancy (and possibly some functionality: i.e., the ability to ingest new material in Ann Arbor) and thus a central component of its disaster recovery and backup plans.
o An act of violence or terrorism occurs at or near HathiTrust facilities.
o An event results in an extended outage at one site that exceeds the recovery time objective.
o Hardware sustains some damage and site is able to continue operation in a reduced capacity.
o An actual or threatened act of violence or terrorism forces the temporary evacuation or quarantine of HathiTrust facilities.
o Local conditions result in a temporary outage at a HathiTrust site.

HathiTrust's Solutions for Natural or Manmade Catastrophic Events
The University of Michigan Ann Arbor Campus Emergency Procedures (revised January 2008) has set procedures to address building evacuations (in the event of fire), tornadoes, severe weather, flooding, chemical/biological/radioactive spills, as well as bomb threats, civil disturbances, and acts of violence or terrorism.60 In all cases, staff will follow the directions of Public Safety and not reenter buildings or resume work until advised to do so by DPS or OSEH or someone from on-site incident command.

In the event of a severe natural or manmade disaster, the repair and restoration of the physical locations of HathiTrust infrastructure would need to be coordinated between the repository and the appropriate facility managers. Such activity would rely upon the disaster recovery plans in place at the MITC Building (home of the MACC) and University of Michigan (which includes the Hatcher Graduate Library and the ALDF). It must be noted that an event which causes significant damage to an important structure or to a building's infrastructure could result in the loss of an instance of the repository for an extended period of time. In such a case, HathiTrust would need to set up an alternate hot site until structural restoration is complete (or a new facility has been found).

60 Please see Appendix C (Washtenaw County Hazard Ranking List).


Basic Disaster Recovery Strategies
In the immediate aftermath of a large-scale manmade or natural disaster, the repository's immediate recovery will be enabled by its basic system architecture:
o the initiative's technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage are located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate facility outside of Ann Arbor).61
The establishment of the mirror site in Indianapolis and the retention of multiple backup tapes at two locations in Ann Arbor ensure that a serious event at either location will not impede the continued functioning of the repository at the other. Consideration must be given as to how data at the Indianapolis site will be backed up and how key repository functions (such as ingest) will proceed if the Ann Arbor instance is offline for an extended period of time. Likewise, a long-term outage at the IU location would require HathiTrust to establish a third site for data backup (i.e., a location where additional copies of backup tapes could be stored).
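The redundancy argument above can be made concrete with a short sketch that lists which copies of the repository would remain after an event affecting an entire region. The copy names and the two-region grouping are simplifying assumptions drawn from this report's description (storage in Ann Arbor and Indianapolis, backup tapes at the MACC and the ALDF); the function is illustrative only.

```python
# Copies of repository content and the region each sits in. The grouping of
# the MACC and ALDF under "Ann Arbor" is a simplification made for this sketch.
COPIES = {
    "Ann Arbor storage (MACC)": "Ann Arbor",
    "Indianapolis storage": "Indianapolis",
    "Backup tape (MACC)": "Ann Arbor",
    "Backup tape (ALDF)": "Ann Arbor",
}


def surviving_copies(copies, affected_regions):
    """Return the names of copies located outside every affected region."""
    return sorted(
        name for name, region in copies.items()
        if region not in affected_regions
    )


if __name__ == "__main__":
    for scenario in ({"Ann Arbor"}, {"Indianapolis"}):
        print(sorted(scenario), "->", surviving_copies(COPIES, scenario))
```

Under these assumptions, a region-wide event in Ann Arbor would leave the Indianapolis storage as the only remaining copy, while an event in Indianapolis would leave all three Ann Arbor copies intact.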

61 HathiTrust. Technology retrieved from http://www.hathitrust.org/technology on 15 June 2009.


Scenario 10: Media Failure or Obsolescence
Review: Risks Involving Media Failure or Obsolescence
The following table summarizes risks to HathiTrust posed by the failure of the media used for its data backups. While the risks from this are limited (both copies of the tape backups would have to be impacted for data to be unavailable), the issue should nonetheless be addressed with regular test restorations and/or inspections of the media.


Severity and Events:
o High Impact: Physical degradation (i.e., in tape binder, substrate, or magnetic content) affects both copies of older backup tapes.
o Moderate Impact: Because backup tapes are not regularly tested or audited, the physical substrate of tapes may degrade over time.
o Low Impact: Bad tape is detected during a tape backup.

HathiTrust's Solutions for Media Failure
Given the nature of HathiTrust's storage system, this scenario is only a concern in regard to the digital magnetic tapes used by the TSM Group for backups.
o Two tape copies of all backup data are made and these are stored in separate climate-controlled conditions in tape libraries at the MACC and the ALDF.
o Content is transferred to new tape during data defragmentation (which occurs when existing tapes are 80% full).
o If a degraded or otherwise bad section of tape is detected during a backup procedure, that tape is immediately marked as read-only.
  Data is thenceforth written to a different tape; existing data on the bad tape will be copied to properly functioning media.
  If data cannot be reclaimed from bad tape, the TSM Group would contact HathiTrust so that the backup of content can be properly completed.

Remaining Vulnerabilities
There is some reason for concern in this area because the TSM Group does not have a regular program to monitor its media for physical degradation or impairment after data defragmentation. While the tapes are reported to be highly dependable, problems such as sticky-shed (the hydrolysis of the tape's binder) could become an issue with older tapes. A regular program of tape validation or test restorations would provide an opportunity to check on the physical condition and data integrity of the tapes. Likewise, the creation of a schedule for the replacement of older tapes could avoid future problems with media degradation.
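A test restoration of the kind recommended above could be checked along the following lines: files restored from tape are compared against a manifest of expected checksums. This is a minimal sketch under stated assumptions. The manifest format (relative path mapped to a hex digest) and the function names are hypothetical, and MD5 is used only because the ingest process already records MD5 checksums; nothing here describes an existing TSM Group or HathiTrust procedure.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Compute a file's hex digest, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restore_dir, manifest, algorithm="md5"):
    """Compare restored files against expected checksums.

    manifest maps a relative path to its expected hex digest (an assumed
    format). Returns a list of (relative_path, problem) tuples; an empty
    list means every file was restored intact.
    """
    problems = []
    for rel_path, expected in manifest.items():
        target = Path(restore_dir) / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif file_digest(target, algorithm) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

Running such a check on a sample of volumes restored from each tape copy would exercise both the restore process and the physical media.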


Conclusions and Action Items
Conclusions
As this report demonstrates, a variety of risk management strategies in addition to design elements, operating procedures, and service and support contracts endow HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a range of disasters. The establishment of the Indianapolis mirror site, the performance of nightly tape backups, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. As it is, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.

In the effort to secure HathiTrust's long-term continuity, the present document stands merely as a preliminary step in the establishment of a legitimate Disaster Recovery Plan. The data on HathiTrust's policies, procedures, and contracts consolidated herein should facilitate the data collection requisite to the initial phases of the planning process, but the core activities of formulating technical and administrative response strategies and delegating roles and responsibilities remain to be undertaken.

The following section outlines recommendations and action items derived from research into the repository as well as from discussions with Cory Snavely and other HathiTrust staff members. Items have been separated into an approximate timeline of activity ranging from Short Term through Long Term and the arrangement within each category represents a suggested (but by no means definitive) order of accomplishment. For a more detailed explanation of action items related explicitly to Disaster Recovery Planning, please refer to the overview of the planning process in Appendix E or consult Appendix D for a list of more comprehensive guides and resources. (NB: * = Denotes an ongoing activity.)

Short Term Action Items (0-6 months)
a. Resolve the nature and extent of the insurance coverage for HathiTrust equipment.
b. Arrange with TSM Group administrators to periodically perform a volume audit of backup tapes to ensure data integrity.
c. Institute periodic test restores with TSM Group to ensure that the process will run smoothly in the event of a disaster.
d. Discuss the creation of a long-term replacement schedule for backup tapes with the TSM Group to avoid the possibility of media degradation.
e. Improve control over system components
   i. Update the hardware inventory to include all important system components; document models, serial numbers, UM IDs, associated software and version number, date of purchase, original cost, as well as vendor contact information and product support contracts.*


   ii. Establish a software inventory to document necessary applications in the event of hardware loss; should include purpose, acquisition date, cost, license number, and version number.*
   iii. Create a map identifying where components are in the MACC and within individual racks.*
   iv. Review and assess points of failure as well as the adequacy of redundant components.*
f. Establish phone trees
   i. Include key contacts for different types of disaster
   ii. Prioritize phone trees to target individuals who
       1. Make decisions
       2. Have vital information
       3. Can offer assistance in resolving situations
   iii. Distribute information and explain protocols to all relevant staff*
   iv. Develop a regular maintenance/update schedule (once every 4-6 months)*
g. Thoroughly document and make available (as needed) important institutional knowledge so that HathiTrust may continue to function in the event of the extended absence or loss of key staff.*
h. Identify disaster preparedness and disaster recovery measures in place at Indianapolis.

Intermediate Term (6-12 months)
a. Form a Disaster Recovery Planning Committee to research and develop plans and to oversee their implementation.
b. Communicate and coordinate planning activities between Ann Arbor and Indianapolis.*
   i. Consider the formation of subcommittees for localized research and development of plans and an executive committee to oversee the implementation and management of plans.
c. Draft a Disaster Recovery Planning policy statement to define the mandate, responsibilities, and objectives for the plan.
d. Initiate the data collection and analysis phase of the planning process.
   i. Identify core repository functions and associated hardware and infrastructure elements.
   ii. Determine the potential impact from the loss of those functions.
   iii. Define the levels of functionality required for partial as well as full recovery. Establish what level is needed for HT to fulfill its mission and the needs of its users.
   iv. Define HathiTrust's Recovery Time Objective (RTO: the maximum allowable outage period for services) and Recovery Point Objective (RPO: the point in time to which data stores must be returned following a disaster).
   v. Determine the availability of resources in the event of a disaster and establish the repository's prioritization with major service providers and vendors (i.e., TSM Group, ITCom, Isilon, etc.).


e. Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.*
f. Develop recovery strategies to bring core functions back online as soon as possible within a set cost range.
   i. Establish a logical progression in the restoration of services and associated components.
   ii. Identify the resources required for these efforts.
   iii. Consider alternative solutions, including partial (vs. full) recovery.
g. Communicate planning goals and efforts to key contacts from service providers and vendors to better coordinate recovery efforts.*
h. Initiate the production of core Disaster Recovery documents (see Appendix E for more information). The following list is not exhaustive; data collection and analysis will help determine if all or other plans (i.e., a web continuity plan) are needed.
   i. Business Continuity Plan: details HathiTrust's core functions and the priorities for reestablishing each in the event of a disruption.
   ii. Continuity of Operations Plan: focuses on restoring an organization's (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.
   iii. IT Contingency Plan: addresses explicitly the disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
   iv. Crisis Communications Plan: establishes procedures for internal and external communications during and after an emergency.
   v. Cyber Incident Response Plan: defines the procedures for responding to cyber attacks against the HathiTrust IT system.
   vi. Occupant Emergency Plan: defines response procedures for staff in the event of a situation that poses a potential threat to the health and safety of HathiTrust personnel or their environment. (This requirement is addressed by University of Michigan Building Emergency Action Plans.)
   vii. Disaster Recovery Plan: brings together guidance and procedures from the other plans to enable the restoration of core information systems, applications, and services. This plan defines roles and responsibilities within Disaster Response Teams.
   viii. Disaster Recovery Training Plan: establishes the situations and procedures to be covered by HathiTrust's Disaster Recovery training.

Long Term (12+ months)
a. Complete and implement Disaster Recovery Plans.
   i. Distribute physical copies of the plans as needed and include at least one copy in an off-site location.
   ii. Integrate elements of response strategies into system architecture to facilitate their deployment in the event of a disaster.*


b. Disaster Recovery Committee should monitor changes in best practices and technology, update plans, and oversee organizational readiness.*
   i. Initiate staff training so that individuals are familiar with Disaster Recovery procedures and communication protocols.*
   ii. Institute regular tests of disaster preparedness with simulated disasters involving different components of HathiTrust operations.*
   iii. Establish a schedule for maintenance and revisions to the Disaster Recovery documents.*
   iv. Coordinate Disaster Recovery Plan implementation, training, and review with Indianapolis.*
c. Store an additional copy of backup tapes at a third site to increase geographic dispersion and limit the chance that a widespread event in Ann Arbor could impact both local copies.
d. Explore the possibility of establishing a third site for HathiTrust's digital objects to increase geographic dispersion and address concerns over the relative geographical proximity of Indianapolis and Ann Arbor.
e. Determine the feasibility of moving operations to a hot site in Ann Arbor should a disaster render the MACC unusable.
   i. Identify suitable sites and consider making preliminary arrangements.
   ii. Identify and price out equipment/infrastructure necessary to continue operations.
f. Plan for integration of new system components should the sudden collapse of Isilon leave HathiTrust without service/support.
g. Consider an increase to system security measures as content becomes accepted from a wider range of sources and as HathiTrust becomes a higher-profile organization.
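The hardware inventory called for among the Short Term action items (item e.i, with the rack map of item e.iii) could be kept as structured records so that gaps are easy to find. The sketch below is one possible shape, not an existing HathiTrust schema; every field name is a hypothetical rendering of the attributes the action item lists.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class HardwareRecord:
    """One inventoried component; field names are illustrative assumptions."""
    component: str
    model: str
    serial_number: str
    um_id: str
    software_and_version: str
    purchase_date: Optional[date] = None
    original_cost_usd: Optional[float] = None
    vendor_contact: str = ""
    support_contract: str = ""
    rack_location: str = ""  # supports the component map of item e.iii


def missing_fields(record):
    """Return the names of fields that are still empty (None or "")."""
    return [
        f.name for f in fields(record)
        if getattr(record, f.name) in (None, "")
    ]
```

A periodic report of `missing_fields` across all records would fit the ongoing (*) nature of the inventory task.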


APPENDIX A: Contact Information for Important HathiTrust Resources

Indiana University Mirror Site
Andrew Poland (Staff, Information Technology Services)
o ajpoland@iupui.edu
o (317) 274-0746
Troy Dean Williams (Vice President for Information Technology, IU at Bloomington)
o trowill@indiana.edu
o (812) 856-5323

University of Michigan
Michigan Academic Computing Center (MACC): Houses much of the technical infrastructure of the University Library's digital resources.
Rene Gobeyn (MACC Data Center Coordinator)
o rgobeyn@umich.edu
o (734) 936-2654
ITCom UMNOC (Network Operations Center)
o TROUBLE@UMICH.EDU
o (734) 647-8888

ITCS ITCom: Responsible for maintaining network connections to the UMnet Backbone and Internet; ITCom provides maintenance and support services for hardware and software.
Mike Brower (Senior Project Manager, UM Libraries)
o mbrower@umich.edu
o (734) 936-9736
Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
o kahall@umich.edu
o (734) 647-3214
ITCom UMNOC (Network Operations Center)
o TROUBLE@UMICH.EDU
o (734) 647-8888

Tivoli Storage Manager Group: Responsible for nightly automated tape backups of storage servers.
Andrew Inman (Service Manager)
o ainman@umich.edu
o (734) 615-6286
Cameron Hanover (Storage Engineer)
o chanover@umich.edu
o (734) 764-7019
General Support: tsmadmin@umich.edu
Emergency contact: adsm@beepage.itd.umich.edu
o Message will go to on-call staff's pager in real time
Notification of TSM and related outages via UMOD group fln@umich.edu

Arbor Lakes Data Facility: Houses one instance of the TSM backup tape library.
ITCom UMNOC (Network Operations Center)
o trouble@umich.edu
o (734) 615-4209
Ken Pritchard (ALDF facility manager)
o kenprit@umich.edu
o (734) 615-2812

Procurement Services: Approves departmental purchases over $5,000; buyers also work as intermediaries with vendors.
Steve Worden (UM Hardware Purchasing Specialist)
o sfworden@umich.edu
o (734) 645-8972
Shelly Eauclaire (Senior Buyer, Purchasing Services)
o seauclai@umich.edu
o (734) 615-8767
Ian Pepper (UM Dell Computers Contract Administrator)
o ipepper@umich.edu
o (734) 647-4981
Jeff Rabbitt (Alternate Dell Contract Administrator)
o rabbit@umich.edu
o (734) 644-9232

Property Control: Responsible for tracking and tagging the university's assets.
Mary Ellen Lyon (Business Operation Manager)
o melyon@umich.edu
o (734) 647-3351 (t, th)
o (734) 763-1197 (m, w, f)

Office of Financial Analysis:
David Storey (Inventory Coordinator): Delivers UM property tags to equipment at the MACC.
o dstorey@umich.edu
o (734) 647-4264

Risk Management Services: Provides insurance coverage of university assets.
Kathleen Rychlinski (Assistant Director, Risk Management Services)
o kmrychli@umich.edu
o (734) 763-1587

Non-University Contact Information
Isilon Systems
Jim Ramberg (Regional Territory Manager)
o jim.ramberg@isilon.com
o Desk: (847) 330-6399
o Cell: (630) 561-2463

Sun Microsystems
Christine Sluman (Service Sales Rep, Education)
o Christine.Sluman@Sun.COM
o (303) 557-3660, ext. 60519
o (303) 949-1567 (Cell)
Larry Zimmerman (Michigan Account Manager, Sales)
o larry.zimmerman@sun.com
o (248) 880-3756

CDW-G
University of Michigan Account Team
o hansenandadam@cdwg.com
Hansen Chennikkra (Account Manager)
o hansche@cdwg.com
o (866) 339-3639
Adam Sullivan (Account Manager)
o adamsul@cdwg.com
o (866) 339-4118

Dell Computers
Brian Ullestad (Higher Education Account Manager)
o Brian_Ullestad@Dell.com
o 1-800-274-7799 ext. 7249522

APPENDIX B: HathiTrust Outages from March 2008 through April 2009 62
April 2009: HathiTrust experienced reduced performance from 11:00pm EDT on Thursday, April 23 to 8:22am EDT on Friday, April 24 due to a database problem at one of the sites and from 5:30pm to 9:00pm EDT on Thursday, April 30 due to unintended consequences from a networking configuration change.
March 2009: HathiTrust was unavailable on Tuesday, March 3 from 7:00-8:00am EST and on Thursday, March 5 from 7:00-7:45am EST for operating system and database software upgrades.
February 2009: On Sunday, February 22 at 8:40am EST, a power surge resulting from electrical system maintenance caused HathiTrust database and web servers to go offline. Staff learned of the problem at approximately 6:00pm EST, and service was restored by 6:30pm EST.
January 2009: A brief outage is scheduled in January for a storage system software upgrade.
December 2008: On Friday, December 19 at 7:30am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:40am EST.
November 2008: On Tuesday, November 4 at 7:30am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:45am EST.
October 2008: No outages reported.
September 2008: On Thursday, September 18 at approximately 9:30am EDT, HathiTrust became inaccessible due to a software problem on a storage system; the problem was related to our work with data synchronization. Support was contacted and the problem was resolved at 10:45am EDT.
August 2008: On Tuesday, August 26 at approximately 9:00am EDT, a database server was brought down to move to Indianapolis. Prior to shutting this server down, we did not update a manual failover configuration, causing volumes to be inaccessible to some users. The problem was resolved at 11:15am EDT.
July 2008: Service was unavailable on Thursday, July 31 from 7:00-7:30am EDT for a storage system software upgrade.
June 2008: No outages reported.
May 2008: No outages reported.
April 2008: No outages reported.
March 2008: No outages reported.
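The outage log above can serve as a baseline when HathiTrust defines its Recovery Time Objective. The sketch below computes each event's duration from the start and end times reported in this appendix and flags those exceeding a candidate RTO. The four-hour threshold is purely hypothetical (the report sets no RTO), several start times are approximate in the source, and the first April 2009 entry was a period of reduced performance rather than a full outage.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# (start, end, cause) in local Eastern time, transcribed from the log above.
OUTAGES = [
    ("2009-04-23 23:00", "2009-04-24 08:22", "database problem (reduced performance)"),
    ("2009-04-30 17:30", "2009-04-30 21:00", "networking configuration change"),
    ("2009-03-03 07:00", "2009-03-03 08:00", "OS and database upgrades"),
    ("2009-03-05 07:00", "2009-03-05 07:45", "OS and database upgrades"),
    ("2009-02-22 08:40", "2009-02-22 18:30", "power surge"),
    ("2008-12-19 07:30", "2008-12-19 07:40", "security updates"),
    ("2008-11-04 07:30", "2008-11-04 07:45", "security updates"),
    ("2008-09-18 09:30", "2008-09-18 10:45", "storage system software problem"),
    ("2008-08-26 09:00", "2008-08-26 11:15", "failover configuration not updated"),
    ("2008-07-31 07:00", "2008-07-31 07:30", "storage system software upgrade"),
]

HYPOTHETICAL_RTO = timedelta(hours=4)  # illustrative threshold only


def duration(start, end):
    """Return the elapsed time between two timestamps in FMT."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def exceeding(outages, rto):
    """Return (start, cause) for every event lasting longer than rto."""
    return [(s, cause) for s, e, cause in outages if duration(s, e) > rto]


if __name__ == "__main__":
    total = sum((duration(s, e) for s, e, _ in OUTAGES), timedelta())
    print("Total logged disruption:", total)
    for start, cause in exceeding(OUTAGES, HYPOTHETICAL_RTO):
        print("Exceeds hypothetical RTO:", start, "-", cause)
```

Only two of the logged events would exceed this hypothetical threshold; in the February 2009 case most of the elapsed time passed before staff learned of the problem, which points to detection rather than repair as the bottleneck.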

62 HathiTrust. Updates from http://www.hathitrust.org/updates retrieved on 16 June 2009.


APPENDIX C: Washtenaw County Hazard Ranking List
The following list ranks a variety of natural and manmade events within Washtenaw County, Michigan, based upon their frequency of occurrence and the extent of their potential impact (on the county's population).

Rank  Hazard                                                              Frequency           Population Impacted
1     Convective weather (severe winds, lightning, tornados, hailstorms)  Once or more/yr.    250,000
2     Hazardous materials incidents: transportation                       Once or more/yr.    2,000
3     Hazardous materials incidents: fixed site                           Once or more/yr.    10,000
4     Severe winter weather hazards (ice/sleet/snow storms)               Once or more/yr.    250,000
5     Infrastructure failures                                             Once every 5 yrs.   30,000
6     Transportation accidents: air and land                              Once or more/yr.    100
7     Extreme temperatures                                                Once every 5 yrs.   10,000
8     Flood hazards: riverine/urban flooding                              Once every 10 yrs.  2,000
9     Nuclear attack                                                      Has not occurred    250,000
10    Petroleum and natural gas pipeline accidents                        Once every 10 yrs.  1,000
11    Fire hazards: wildfires                                             Once or more/yr.    0

Source: Washtenaw County Hazard Mitigation Plan (available online at http://www.ewashtenaw.org/government/departments/planning_environment/planning/planning/hazard_html)


APPENDIX D: Annotated Guide to Disaster Recovery Planning References
The topic of disaster recovery planning for the print and analog resources of libraries has been widely dealt with in professional literature, but comparatively little information exists concerning the development and implementation of plans for the digital content of cultural institutions. The following bibliography details resources which provide guidance, examples, and explanations of the objectives and strategies for digital Disaster Recovery Plans. It consists primarily of material compiled by Lance Stuchell (ICPSR Intern) and Nancy McGovern (ICPSR Digital Preservation Officer) and is included here with their permission.

University of Michigan Resources
University of Michigan Administrative Information Services (MAIS): Emergency Management, Business Continuity, and Disaster Recovery Planning.
o http://www.mais.umich.edu/projects/drbc_methodology.html
o This site broadly outlines the need for and functions of Emergency Management, Business Continuity, and Disaster Recovery Planning at UM. It also contains templates designed to help units plan, test, and audit disaster and continuity programs.
Provost and Executive Vice President for Academic Affairs: Standard Practice Guide: Institutional Data Resource Management Policy
o http://spg.umich.edu/
o This policy defines institutional data resources as University assets and makes recommendations on identifying, preserving, and providing access to these assets. The digital resources of the library may be identified as such, based upon their use by departments across the university.
ICPSR Disaster Planning Resources:
o Digital Preservation Officer Nancy McGovern is part of a Disaster Recovery initiative at ICPSR and over the past several years her team (including Lance Stuchell) has produced a variety of documents and templates to help other institutions work through the planning process.
o Documents are available upon request and should be posted in the near future (as of July 2009) to the ICPSR Web site (http://icpsr.umich.edu/).
Disaster Recovery Experts:
o Rene Gobeyn (MACC Data Center Coordinator)
  Managed and coordinated Disaster Recovery for U.S. military data centers
  rgobeyn@umich.edu
o Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
  Helped develop current ITCS Disaster Recovery plans
  kahall@umich.edu

External Resources
General Guide to Disaster Planning
o Contingency Planning Guide for Information Technology Systems: Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-34, June 2002.
  http://csrc.nist.gov/publications/nistpubs/80034/sp80034.pdf
  An indispensable resource which was used heavily by ICPSR in its Disaster Recovery planning. It covers everything from initial data collection and policy formation to the structure of disaster response teams and the articulation of recovery strategies.

Examples and Tools for the Documentation Outlined by NIST Guide:
o Full Disaster Recovery Plan:
  United States Department of Agriculture Disaster Recovery and Business Resumption Plans
  http://www.ocio.usda.gov/directives/doc/DM3570001.htm
o Business Continuity Plan (BCP):
  MAIS: Emergency Management, Business Continuity, and Disaster Recovery Planning
  http://www.mais.umich.edu/projects/drbc_templates.html
  This site provides several resources that deal with continuity planning.
o Continuity of Operations Programs (COOP):
  FEMA: Continuity of Operations (COOP) Programs
  http://www.fema.gov/government/coop/index.shtm
  Contains a lot of useful information on government policy, templates, and training resources to assist in the creation of a COOP.
  Ready.gov: Continuity of Operations Planning
  http://www.ready.gov/business/plan/planning.html
  Guidelines for composing a business COOP, including what outside actors should be involved in the planning process.
  The Florida Department of Health: Continuity of Operations Plan for Information Technology
  http://www.naphit.org/global/library/basement_docs/FL_DisasterRecovery_template.doc
  Lengthy (40 pages) and detailed COOP template written for an IT environment.
  Florida Atlantic University Libraries: Continuity of Operations Plan
  http://www.staff.library.fau.edu/policies/coop2007.pdf
  A detailed working COOP, which includes reactions to specific disaster scenarios.
o IT Contingency Plan:

  See the USDA Disaster Recovery Plan for an example of an IT Contingency Plan.
o Cyber Incident Response Plan:
  Multi-State Information Sharing and Analysis Center Cyber Incident Response Guide
  http://www.msisac.org/localgov/documents/FINALIncidentResponseGuide.pdf
  The guide provides a step-by-step process for responding to incidents and developing an incident response team. It may also serve as a template for drafting a Cyber Incident Response Policy and Plan.
o Crisis Communication Plan:
  Ready.gov: Write a Crisis Communication Plan
  http://www.ready.gov/business/talk/crisisplan.html
  This site provides guidelines for composing a business disaster communication plan and includes suggestions for the plan's Web presence.
  NC State University: Crisis Communication Plan
  http://www.ncsu.edu/emergencyinformation/crisisplan.php
  This is the policy and plan for the University as a whole. While much of this policy deals with communication at a high level, useful sections detail vital contacts within the organization (including who to contact first), and how to manage external communications.
  Other thorough university policies and plans include the LSU: Crisis Communication Plan and the Missouri S&T: Crisis Communication Plan.
  Heritage Microfilm Flood Update Email
  This email was sent in response to the June 2008 flooding that occurred in the Midwest. It updates clients on the outage of NewspaperArchive.com which resulted from a flood-induced widespread power failure. It is an excellent example of an external crisis communication to users.
o Disaster Recovery Plans (DRP):
  The University of Iowa: IT Services Disaster Recovery Plan
  http://cio.uiowa.edu/ITplanning/Plans/ITSdisasterPrep.shtml
  This policy details the data collection and assessment which informs the UI plan and also includes emergency procedures, response strategies, and a crisis communication plan.
  University of Arkansas: Computing Services Disaster Recovery Plan
  http://www.uark.edu/staff/drp/
  A complete and thorough plan that outlines the initiation of emergency and recovery procedures, and addresses how the plan will be maintained.
  Adams State College (CO): Information Technology Disaster Recovery Plan
  http://www.adams.edu/administration/computing/drplan100206.pdf

20090824

Thisplanhasathoroughsectiononriskassessment. DigitalPreservationEuropeRepositoryPlanningChecklistandGuidance http://www.digitalpreservationeurope.eu/platter.pdf DesignedforusewiththePlanningToolforTrustedElectronic Repositories(PLATTER),thisdocumentoutlinesconsiderationsfora DisasterRecoveryStrategicObjectivePlan(SOP)andplacesthemin contextwithotherrepositoryplans. OccupantEmergencyPlan(OEP): ThisrequirementisaddressedbyUniversityofMichiganBuildingEmergency ActionPlans(EAP). http://www.umich.edu/~oseh/guideep.pdf DisasterRecoveryTrainingGuides: dPlan.org Providesusefulinformationontrainingandanonlineformthatwould beusefulinassigningtrainersandmonitoringthetrainingprocess. CalPreservation.org:DisasterPlanExercise http://calpreservation.org/disasters/exercise.html Providesrolesandteachingpointsforaroleplaytrainingexercisethat focusesonadisasterinalibrary.

Policy Planning Tools:
o Association of Public Treasurers of the United States and Canada: Disaster Policy Certification Guidelines
  www.aptusc.org/includes/getpdf.php?f=Disaster_Policy.pdf
  This planning document and template for disaster management policies provides outlines and example language on several facets of a strong policy, including the possible loss of a building, the replacement of computer resources, and testing and training for the disaster plan. It also outlines the need to identify possible threats to assets.

Examples of Disaster Planning Policies:
o Arkansas Secretary of State: Disaster Planning Policy
  http://www.sos.arkansas.gov/elections/elections_pdfs/register/oct_reg/016.14.01020.pdf
  This policy outlines areas of responsibility between departments and units, and includes training, communication, and recovery plan updates.
o Washington State Department of Information Services: Disaster Recovery and Business Resumption Planning Policy
  http://isb.wa.gov/policies/portfolio/500p.doc
  This document illustrates policy formation for an IT Disaster Recovery Plan. It provides guidelines for Disaster Recovery Planning as well as maintenance, testing, and training involved with the recovery plan.


o Florida State University: Information Technology Disaster Recovery and Data Backup Policy
  http://oti.fsu.edu/oti_pdf/Information%20Technology%20Disaster%20Recovery%20and%20Data%20Backup%20Policy.pdf
  This document includes policy for data backup as well as Disaster Recovery. Part of the policy includes a definition of Best Practice Disaster Recovery Procedures, as well as an outline of the university's own IT recovery planning and implementation procedures.

Example of a Relevant Disaster Planning Program:
o OCLC Digital Archive Preservation Policy and Supporting Documentation
  http://www.oclc.org/support/documentation/digitalarchive/preservationpolicy.pdf
  This document has a clear articulation of OCLC's disaster policy, along with an outline of disaster prevention and recovery procedures and a timeframe for the restoration of services in the event of a disaster.
  The policy includes a good definition of a disaster prevention and recovery plan: "A set of responses based on sound principles and endorsed by senior management, which can be activated by trained staff with the goal of preventing or reducing the severity of the impact of disasters and incidents."
  OCLC embeds its disaster plan within its overall preservation policy, stating: "The goal of disaster prevention is to safeguard the data (content and metadata) in the Digital Archive and to safeguard the Digital Archive's software and systems. For disaster prevention and recovery, all data (content and metadata) is considered of equal value."

Designing a Disaster Planning Program:
o Michigan State University: Step-by-Step Guide to Disaster Recovery Planning
  http://www.drp.msu.edu/Documentation/StepbyStepGuide.htm
  This program breaks down the disaster planning process into steps, and provides information relevant to individual units within a university setting. The MSU Disaster Recovery Planning Homepage (http://www.drp.msu.edu/) also offers a variety of resources.
o Minnesota State Archives: Disaster Preparedness
  http://www.mnhs.org/preserve/records/docs_pdfs/disaster_000.pdf
  This document is a detailed guide to the disaster planning process. While mostly dealing with paper records, the document clearly identifies different roles and responsibilities for members of the planning and recovery team.
o Cisco Systems: Disaster Recovery Best Practices White Paper
  http://www.cisco.com/warp/public/63/disrec.pdf


  The paper outlines Disaster Recovery using the framework of the above resources, but tailors it to an IT point of view. It has useful information on how to prepare and recover both hardware and software assets.
o AT&T: Key Elements to an Effective Business Continuity Plan
  http://www.business.att.com/content/article/Key_to_Effective_BC_Plan.pdf
  A short paper that summarizes business continuity planning in the private sector.

General Information
o Federal Emergency Management Agency: Emergency Management Guide for Business & Industry
  http://www.fema.gov/business/guide/index.shtm
  A practical guide with step-by-step advice on creating a Disaster Recovery program. Includes information on the formation of a planning committee, organizational analysis, and details on specific hazards.
o Special Libraries Association Information Portal: Disaster Planning and Recovery
  http://www.sla.org/content/resources/infoportals/disaster.cfm
  An exhaustive list of resources, this page includes articles on digital disaster recovery strategies as well as information on planning, examples of plans, and links to a wide range of resources in the public and private sector.

Written Resources:
Wellheiser, Johanna and Jude Scott. An Ounce of Prevention: Integrated Disaster Planning for Archives, Libraries, and Record Centres. Lanham, MD: Scarecrow Press, 2002.
o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004233950&local_base=AA_PUB
Cox, Richard J. Flowers After the Funeral: Reflections on the Post-9/11 Digital Age. Lanham, MD: Scarecrow Press, 2003.
o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004341258&local_base=AA_PUB
Matthews, Graham and John Feather, eds. Disaster Management for Libraries and Archives. Burlington, VT: Ashgate, 2003.
o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004354795&local_base=AA_PUB


APPENDIX E: Overview of the Disaster Recovery Planning Process

Various resources agree that there is no one way to go about initiating a Disaster Recovery program or drafting a DR plan. An organization must proceed according to its functions and resources as well as the needs of its designated community of users. The following discussion draws heavily upon the ICPSR Disaster Planning Policy Framework (written by Nancy McGovern and Lance Stuchell) and the Contingency Planning Guide for Information Technology Systems published by NIST (2002). As such, it represents a consolidation and simplification of information presented in more depth elsewhere. A list of planning resources (with link information to full texts) is available in Appendix D.

Basic Precepts of Disaster Recovery Planning
1) Disaster Recovery Planning is a continuous activity that involves monitoring internal conditions as well as evolutions in technology and threats; responding to new developments that arise; revising plans so that they remain relevant and effective; training staff according to plans; and testing organizational readiness.
   a. There is no single document which contains the plan; rather, a Disaster Recovery Plan consists of a suite of documents that require a regular schedule of testing and revision to be effective.
   b. There is no point at which a Disaster Recovery Plan is finished.
2) Disaster Recovery Planning needs to be an organization-wide activity.
   a. Disaster recovery must be one of the basic functions of HathiTrust.
   b. An effective plan needs full administrative support.
   c. Policies and procedures must complement and conform to disaster response plans established by the university, city, and Department of Homeland Security.
3) Disaster recovery cannot be limited to the hardware and software components or data collections of HathiTrust; planning must also account for the impact of human emergencies on the repository's operations.

Essential Steps in Disaster Recovery Planning
1) Establish a Disaster Recovery Planning Committee.
   a. This group will research and develop the plan and help with its implementation as well as monitor the training, testing, and revising of plans to ensure organizational compliance and readiness.
   b. The committee should involve individuals representing the various mission-critical units within the library (from administration to Core Services to the Digital Preservation Librarian) who will participate in the development of policy and recovery planning.
   c. It is essential that the committee involve individuals with the authority to support and enforce recommendations.
   d. The committee's activities should initiate the formation of a Disaster Response Program.
2) Draft a Disaster Recovery Planning Policy Statement.


   a. Enables the organization and others to understand the scope and nature of the Disaster Recovery Plan.
   b. Establishes the organizational framework and responsibilities for the planning process.
   c. Key policy elements (as detailed in the NIST report):
      i. Roles and responsibilities within the organization with regard to planning
      ii. Mandate for Disaster Recovery as well as any statutory or regulatory requirements
      iii. Scope as applies to the type(s) of platform(s) and organizational functions subject to Disaster Recovery Planning
      iv. Resource requirements for the Disaster Recovery program
      v. Training requirements
      vi. Exercise and testing schedules (at least one major annual test)
      vii. Plan maintenance schedule (elements should be reviewed annually)
      viii. Frequency of backups and storage of backup media.
3) Conduct Data Collection and Analysis (i.e., Business Impact Analysis)
   a. Determine critical functions and identify specific system resources required to perform them. Minimum requirements for functionality should be established.
   b. Determine risks and vulnerabilities facing the repository's systems and infrastructure.
   c. Identify and coordinate with internal and external points of contact to determine how they depend on or support the repository and its functions; consider how one failure might cascade into others.
      i. Identify resources that are crucial to HathiTrust (e.g., Mirlyn)
      ii. Determine the allowable outage/disruption time for these resources
   d. Develop recovery priorities; balance the cost of inoperability against the cost of recovery.
      i. Determine HathiTrust's position within the priorities of the university as well as with its major service providers and vendors (e.g., TSM Group, ITCom, Isilon, etc.) to better understand how that prioritization will impact recovery efforts.
      ii. Establish the most crucial functions which must be restored first.
      iii. Determine HathiTrust's Recovery Time Objective (RTO, i.e., the maximum allowable outage period) and Recovery Point Objective (RPO, i.e., the point in time to which data files must be restored after a disaster).
      iv. Review potential resources (financial, personnel, etc.) within HathiTrust as well as those available via contracts, service providers, and product support. This step should involve the clarification of HathiTrust's position within the priorities of the university as well as those of key service providers and vendors.
4) Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.
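The RTO and RPO introduced in step 3(d) can be checked mechanically once an incident timeline is known. The following is a minimal sketch in Python, not a HathiTrust procedure: the class and function names, the 48-hour RTO, the 24-hour RPO, and the timestamps are all hypothetical values chosen for illustration. It also assumes the RPO is expressed as the maximum tolerable age of the most recent restorable backup at the moment of the disaster.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the objectives set during data collection."""
    rto: timedelta  # maximum allowable outage period
    rpo: timedelta  # assumed form: maximum tolerable age of the last good backup


def meets_objectives(objectives, last_backup, outage_start, service_restored):
    """Compare one incident timeline against the stated RTO and RPO."""
    data_loss_window = outage_start - last_backup      # work not captured by a backup
    outage_duration = service_restored - outage_start  # time the service was down
    return {
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": outage_duration <= objectives.rto,
    }


# Illustrative values only; none of these figures come from the report.
objectives = RecoveryObjectives(rto=timedelta(hours=48), rpo=timedelta(hours=24))
result = meets_objectives(
    objectives,
    last_backup=datetime(2009, 8, 23, 2, 0),
    outage_start=datetime(2009, 8, 24, 1, 0),
    service_restored=datetime(2009, 8, 26, 9, 0),
)
print(result)
```

In this made-up timeline the last backup is 23 hours old when the outage begins, so the RPO holds, while the 56-hour outage exceeds the RTO. Running such a check against test exercises is one way to see whether the objectives chosen in step 3 are realistic.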


5) Develop recovery strategies that respond to the potential impacts and maximum allowable outage times established in the data collection phase. Efforts should focus on solutions that are cost-effective and technically viable.
   a. Strategies should be designed to bring core functions back online as soon as possible within an established cost range.
   b. Recovery efforts must be prioritized according to the nature of core functions as well as logical order of procedures.
   c. Alternative solutions should be considered based upon cost, availability of resources, outage times, levels of functionality (partial vs. full), and ability to integrate methods with existing infrastructure.
   d. Determine the practicality of partial (vs. full) recovery in order to bring services back online in a timely and cost-effective manner.
   e. Recovery strategies and resources should be incorporated (as possible) into the repository's system architecture so that in the event of a disaster, the response may proceed in an efficient and straightforward manner.
6) Formalize and record collected data and recovery strategies in Disaster Recovery Documents. In the process of producing this wide range of documents, an organization is forced to consider and document policies and procedures related to a variety of key administrative and technical issues. The decision of which plans to include (and which to exclude) must be determined based upon a review of HathiTrust's needs and objectives. Additional documents (a Web continuity plan, for instance) may be necessary based upon data collection and analysis.
   a. Business Continuity Plan
      i. Business continuity is the ability of a business to continue its operations with minimal disruption or downtime in the event of natural or manmade disasters.
      ii. Such planning allows an organization to ensure its survival by considering potential business interruptions and establishing appropriate, cost-effective responses.
      iii. The Business Continuity Plan details HathiTrust's core functions and the priorities for reestablishing each in the event of a disruption. It should address key administrative and support functions as well as those which directly involve the repository's designated community.
      iv. The plan should thoroughly document the nature of key functions, interdependences, the impact of their loss, and alternative means to ensure their continuation in the event of a disaster. MAIS offers a useful Business Continuity planning template at http://www.mais.umich.edu/projects/drbc_templates.html.
   b. Continuity of Operations Plan (COOP)
      i. The COOP focuses on restoring an organization's (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.


      ii. This plan may include the Business Continuity Plan and Disaster Recovery Plan as appendices.
   c. IT Contingency Plan
      i. The IT Contingency Plan addresses disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
      ii. It should account for the following:
         1. Document hardware and software
         2. Develop an emergency contact list
         3. Back up and store all data files off-site
         4. Proactively monitor equipment and data
         5. Install and update antivirus software on both computers and servers
         6. Develop recovery scenarios
         7. Communicate and monitor the plan
      iii. The plan allows HathiTrust to formalize and document procedures and policies already in place and details the repository's adherence to these goals.
   d. Crisis Communications Plan
      i. Communication is a vitally important aspect of Disaster Recovery Planning and an organization's actual response in a disaster.
      ii. The Crisis Communications Plan establishes procedures for internal and external communications during and after an emergency.
      iii. The different phases of crisis communication encompass the initial notification of an event, damage assessment, and plan activation as well as status reports (as needed) and the eventual completion of recovery efforts.
      iv. Activation of the communications plan must be the responsibility of a specific individual.
      v. The Disaster Response Team coordinates with the Crisis Communication Team to ensure that information provided about an emergency is clear, concise, and consistent.
   e. Cyber Incident Response Plan
      i. This plan defines the procedures for responding to cyber attacks against the HathiTrust IT system.
      ii. It provides a formal framework for the identification of, mitigation of, and recovery from malicious computer incidents, such as unauthorized access to a system or data, denial of service, or unauthorized changes to system hardware, software, or data.
   f. Occupant Emergency Plan
      i. The Occupant Emergency Plan defines response procedures for library staff in the event of a situation that poses a potential threat to the health and safety of personnel, the environment, or HathiTrust property.
      ii. HathiTrust may utilize the framework provided by UM Building Emergency Action Plans for this element.

   g. Disaster Recovery Plan
      i. The primary focus of the Disaster Recovery Plan is the restoration of core information systems, applications, and services.
      ii. The plan brings together guidance and procedures from the other plans (i.e., Business Continuity Plan, IT Contingency Plan, Crisis Communications Plan, etc.) pertaining to emergencies that result in interruptions of service that exceed acceptable downtimes, as defined in the BCP.
      iii. The plan should detail established recovery strategies for specific disaster situations as well as the teams involved in their execution.
      iv. Personnel should be chosen to staff disaster response teams based on their skills and knowledge. Ideally, teams would be staffed with the personnel responsible for the same or similar operation under normal conditions. It is also important that team members be familiar with the goals and procedures of other teams to facilitate inter-team coordination. Each team is led by a team leader (with a suitable alternate) who directs overall team operations, acts as the team's representative to management, and liaises with other team leaders. Disaster Response cannot be individual-specific or overly reliant on specific people. Teams must assign each role at least one alternate in the event that core people are unavailable at the time of a disaster.
      v. NIST suggests that a capable strategy will require some or all of the following functional groups. For HathiTrust, many of these are already in place in the form of University of Michigan units and service providers.
         1. An authoritative role for overall decision-making responsibility
         2. Senior Management Official
         3. Management Team
         4. Damage Assessment Team
         5. Operating System Administration Team
         6. Systems Software Team
         7. Server Recovery Team (e.g., client server, Web server)
         8. LAN/WAN Recovery Team
         9. Database Recovery Team
         10. Network Operations Recovery Team
         11. Application Recovery Team(s)


         12. Telecommunications Team
         13. Hardware Salvage Team
         14. Alternate Site Recovery Coordination Team
         15. Original Site Restoration/Salvage Coordination Team
         16. Test Team
         17. Administrative Support Team
         18. Transportation and Relocation Team
         19. Media Relations Team
         20. Legal Affairs Team
         21. Physical/Personnel Security Team
         22. Procurement Team (equipment and supplies)
   h. Disaster Recovery Training Plan
      i. This plan will establish the situations and procedures to be covered by HathiTrust's Disaster Recovery training.
      ii. The contents of the plan should reflect the range of responsibilities held between administrators, department heads, and staff within HathiTrust.
      iii. The plan should accommodate Disaster Recovery Planning Committee members as well as those of the Disaster Response Team. For the latter, it should identify key roles and responsibilities in recovery efforts.
      iv. The plan should allow in-house training to be supplemented by external opportunities.
      v. Regularly scheduled emergency drills should also be included to test the readiness of staff and the appropriateness of response procedures.
7) Implement elements developed in the planning process. Procedures and policies related to communication, technological solutions, etc. must be incorporated into HathiTrust's overall design and operation so that Disaster Recovery becomes a critical organizational function.
8) Institute a regular program of training and testing to be sure that staff understand and accept policies and procedures and to ensure that HathiTrust is prepared for a disaster.
9) Conduct regular review and maintenance of Disaster Recovery documents to respond to changes in personnel, organizational structure or functions, and evolutions in technology and/or threats.

Main Phases in a Disaster Response:
1) Notification/Activation: This phase covers the initial actions once a situation has been detected or is threatened. It includes damage assessment and the implementation of an appropriate response strategy.
   a. Proper diagnosis and communication (both internal and external) of a disaster is essential.


   b. The nature of individual events will determine who needs to be involved (e.g., facilities management, core services, etc.).
2) Recovery: This phase focuses on the return to a preestablished level of functionality (plans should detail partial as well as full recoveries).
   a. Response teams implement recovery strategies and adhere to procedures and protocols outlined in Disaster Recovery Documents.
3) Reconstitution: After recovery efforts are complete, normal operations must be restored. This may involve the reconstruction of facilities and/or infrastructure as well as the testing of restored elements to ensure their full functionality.
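Step 5(b) of the planning process asks that recovery efforts be prioritized by the nature of core functions and by the logical order of procedures, and step 3(c) asks planners to consider how one failure might cascade into others. One way to express both constraints during the Recovery phase is a dependency-aware ordering. The sketch below is an illustration only: the function names, dependencies, and priority values are hypothetical and do not describe HathiTrust's actual architecture. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def recovery_order(dependencies, priority):
    """Return functions in an order that restores prerequisites first.

    dependencies maps each function to the set of functions it relies on.
    priority maps each function to a rank (lower means restore sooner);
    it only breaks ties among functions whose prerequisites are all restored.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        # Everything returned here has all of its prerequisites restored.
        ready = sorted(sorter.get_ready(), key=lambda name: (priority[name], name))
        for name in ready:
            order.append(name)
            sorter.done(name)
    return order


# Hypothetical functions and rankings, for illustration only.
dependencies = {
    "storage": set(),
    "network": set(),
    "database": {"storage"},
    "web_access": {"network", "database"},
    "ingest": {"storage", "database"},
}
priority = {"storage": 1, "network": 2, "database": 1, "web_access": 1, "ingest": 2}

order = recovery_order(dependencies, priority)
print(order)
```

A circular dependency raises an error rather than producing an order, which is useful in planning: it flags functions that cannot be restored independently and need a manual decision.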


APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
(Right-click to open the Adobe Document Object located below)


APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard SA (2006)
(Right-click to open the Adobe Document Object located below)


APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)
(Right-click to open the Adobe Document Object located below)


APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)
(Right-click to open the Adobe Document Object located below)

