
Prepared for NERCOMP Annual Conference March 2014

Digital Stewardship and Higher Education IT:
Lessons from the National Agenda
Presented by: Micah Altman, <escience@mit.edu>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution

DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer: “It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Preview
Who are the NDSA?
Why develop an agenda for digital stewardship?
What should national stewardship priorities be?
… research & foundations of stewardship
… digital content
… technical infrastructure
… organizational roles
Lessons for Higher Ed IT

Collaborators & Co-Conspirators
• The 160+ institutional members of NDSA, and the 10000+ hours contributed by their representatives to NDSA working groups, meetings and reports
• National Agenda Authors: Micah Altman, Jefferson Bailey, Karen Cariani, Jim Corridan, Jonathan Crabtree, Blaine Dessy, Michelle Gallinger, Andrea Goethals, Abigail Grotke, Cathy Hartman, Butch Lazorchak, Jane Mandelbaum, Carol Minton Morris, Trevor Owens, Meg Phillips, John Spencer, Helen Tibbo, Tyler Walters, Kate Wittenberg, Kate Zwaard

Who are the NDSA?


About the NDSA
• Founded in 2010, the National Digital Stewardship Alliance (NDSA) is a consortium of institutions that are committed to the long-term preservation of digital information.
• Our mission is to establish, maintain, and advance the capacity to preserve our nation's digital resources for the benefit of present and future generations.
• NDSA member institutions represent all sectors, and include universities, consortia, professional associations, commercial enterprises, and government agencies at the federal, state, and local levels. The Library of Congress provides organizational support and substantive collaboration as Secretariat.
• Based on collaborative community effort: there are no fees for NDSA membership. Each member institution commits to NDSA principles, and contributes effort to working groups, reports, surveys, meetings and other NDSA initiatives.

NDSA Initiatives
Working Groups
Extending Knowledge
• Preservation Storage Survey
• Web Harvesting Survey
• Preservation Staffing Survey
• Geospatial Selection & Appraisal report
• Content case studies
Tools for Practice
• Levels of Preservation
• Digital Preservation in a Box
• Digital Preservation on Wikipedia
Dissemination
• NDSA Interview Series
• National agenda for digital stewardship
• NDSA Innovation Awards
• NDSA Social Media

Recent Outputs

NDSA Member Organizations
• 165 Member Organizations
• From all sectors
• Committed to digital stewardship

digitalpreservation.gov/ndsa/memberslist.html

Why develop an agenda for digital stewardship?

Why a national agenda for digital stewardship?
• Effective digital stewardship is vital for:
– maintaining authentic public records
– growing a reliable scientific evidence base
– providing durable access to our cultural heritage

• Knowledge of ongoing research, practice, and organizational collaborations is distributed widely across disciplines, sectors, and communities of practice

How was this accomplished?
• Contributed community effort
– Development: contributions from the (now 150+) institutional members through working group participation, workshop discussion, commentary
– Writing: LC staff, chairs of NDSA working groups, coordination committee
– Reviewing: expert reviewers in the preservation community
• Integrating diverse perspectives from multiple disciplines & sectors
• The persistence, organization, and commitment of the Library of Congress in its role as Secretariat

Why Now - Climate
Strong trends towards:
• More production of digital content
• More publishing, filtering and access
• More learners and collaborators
• More attention to public information

Trends in Higher Education Technology will Increase Need for Information Stewardship
• More information, in new forms, created by more people
• Need to manage, understand, and retain information for teaching, research, and evaluation
• Requires curation at scale

• Adoption Trends
– Growing Ubiquity of Social Media
– Integration of Online, Hybrid, and Collaborative Learning
– Rise of Data-Driven Learning and Assessment
– Shift to Students as Creators
– Evolution of Online Learning

• Significant Challenges
– Low Digital Fluency of Faculty
– Scaling Teaching Innovations

• Important Developments
– Learning Analytics
– 3D Printing
– Quantified Self

Climate vs Weather
• Climate is what you should expect -- weather is what you get.
• Climate for reproducibility and data management seems favorable… prepare for shifts in the weather.


What Was Accomplished?
The National Agenda for Digital Stewardship identifies high-impact opportunities to advance:
• the state of the art
• the state of practice
• the state of collaboration

Foundations of Content Stewardship
— Framework & Research


What is Content Stewardship?
• Stewardship involves taking broad responsibility for preservation and curation
• The goal of preservation is ensuring meaningful long-term access
• Example: Suppose you have 1000 files (bitstreams) and you’d like a 99.99% chance of accessing them in 20 years. How do you store them?
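
One way to read the example, as a minimal sketch: assume that "accessing them" means all 1000 files remain accessible, and that files are lost independently at a constant annual rate. Both are simplifying assumptions, not claims from the slides, and the function name is hypothetical. The per-file reliability that the target implies then follows directly:

```python
import math


def required_per_file_reliability(n_files: int, target: float, years: int):
    """Per-file reliability needed so that ALL files survive with probability `target`.

    Assumes independent, identically distributed losses at a constant annual
    rate (an illustrative simplification, not a realistic threat model).
    Returns (per-file survival over the whole period, allowed annual loss rate).
    """
    # target = per_file_survival ** n_files, so work in logs for accuracy.
    log_per_file = math.log(target) / n_files
    per_file_survival = math.exp(log_per_file)
    # 1 - annual survival, computed with expm1 to avoid cancellation error.
    annual_loss = -math.expm1(log_per_file / years)
    return per_file_survival, annual_loss


per_file_survival, annual_loss = required_per_file_reliability(1000, 0.9999, 20)
print(f"per-file survival needed over 20 years: {per_file_survival:.10f}")
print(f"allowed per-file annual loss rate:      {annual_loss:.3e}")
```

Under these assumptions each file may be lost with a probability of only about five in a billion per year, which is why the storage question is harder than it first looks.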

Why not store everything with Amazon?

• Why not put everything in Amazon?
• Amazon claims reliability of 99.999999999%
(Better odds than winning Powerball®, being struck by lightning, and finding alien life… combined)

What’s left out of the Eleven Nines?
• What are the units? Collection? Object? Bit?
• How was the failure rate calculated? (It’s theoretical)
– MTBF + independence * enough replicas = lots of nines
– But: no details for the estimate are provided; no historical reliability statistics are provided; no service reliability auditing is provided

• What is the empirical evidence for MTBF?
– Storage manufacturers’ hardware MTBF (mean time between failures) figures are inaccurate…
– Failures across hardware replicas are not independent

• What threats are assumed away?
– Software failure (e.g. a bug in the AWS software for its control backplane)
– Legal threats (leading to account lock-out, deletion, or content removal)
– Institutional threats (such as a change in Amazon’s business model)
– Process threats (someone hits the delete button by mistake, forgets to pay the bill, or AWS rejects the payment)

• Do SLAs or audits back up “design” reliability claims?
– No claim to reliability in SLAs (or uptime, availability, response time…)
– Can’t even prove AWS has the content without taking it out!
– Sole recovery for breach is limited to refund of fees for periods the service was unavailable
– No right to inspect Amazon logs, assistance with forensics, etc.
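
The "lots of nines" arithmetic, and how quickly it erodes once failures are not independent, can be sketched as follows. All numbers are hypothetical illustrations, not Amazon figures. The single common-mode probability `c` is a deliberately crude stand-in for the software, legal, institutional, and process threats that a design-reliability estimate assumes away.

```python
import math


def nines(loss_probability: float) -> float:
    """Express a loss probability as a count of 'nines' of reliability."""
    return -math.log10(loss_probability)


def annual_loss_independent(q: float, replicas: int) -> float:
    """Loss probability if each replica fails independently with probability q."""
    return q ** replicas


def annual_loss_with_common_mode(q: float, replicas: int, c: float) -> float:
    """Loss probability when one shared event (probability c) destroys every replica.

    With probability c all replicas are lost together; otherwise the
    independent-failure arithmetic applies.
    """
    return c + (1.0 - c) * q ** replicas


Q = 0.01        # hypothetical per-replica annual failure probability
REPLICAS = 3    # hypothetical replica count
C = 1e-4        # hypothetical annual probability of a common-mode event

independent = annual_loss_independent(Q, REPLICAS)
correlated = annual_loss_with_common_mode(Q, REPLICAS, C)
print(f"independent replicas: {nines(independent):.2f} nines")
print(f"with common mode:     {nines(correlated):.2f} nines")
```

In this sketch the common-mode term dominates: adding replicas raises the independent figure without limit, but the overall figure cannot exceed the nines implied by `c`.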

And How Much Does it Really Cost?
• Glacier storage is relatively cheap
• Getting your data back is not, if you want it fast
• Creates lock-in and gotchas

Observations
• Digital preservation does not equal “backup”
• Ensuring long-term access requires ongoing evaluation and management of a broad spectrum of risks & costs
• Without attention the digital evidence base will erode

The Problem - Restated
Keeping risk of object loss fixed, what choices minimize $?
“Dual problem”: keeping $ fixed, what choices minimize risk?
Extension: for specific cost functions for loss of an object, Loss(object_i), summed over all lost objects, what choices minimize:
Total cost = preservation cost + sum(E(Loss))
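
A toy instance of the total-cost formulation, as a sketch only: the one choice modeled here is the number of copies, an object is assumed lost only when every copy fails independently, and all costs and probabilities are hypothetical placeholders.

```python
def total_expected_cost(copies, cost_per_copy, loss_per_copy, loss_values):
    """Total cost = preservation cost + sum of expected losses over objects.

    loss_values[i] is the (hypothetical) cost Loss(object_i) of losing object i.
    An object is assumed lost only if all of its copies fail independently.
    """
    p_object_lost = loss_per_copy ** copies
    preservation_cost = copies * cost_per_copy * len(loss_values)
    expected_loss = sum(p_object_lost * value for value in loss_values)
    return preservation_cost + expected_loss


def cheapest_copy_count(max_copies, cost_per_copy, loss_per_copy, loss_values):
    """Return the copy count in 1..max_copies with the lowest total expected cost."""
    return min(
        range(1, max_copies + 1),
        key=lambda k: total_expected_cost(k, cost_per_copy, loss_per_copy, loss_values),
    )


LOSS_VALUES = [1000.0] * 10   # ten objects, each costing 1000 if lost (hypothetical)
best = cheapest_copy_count(6, cost_per_copy=1.0, loss_per_copy=0.05, loss_values=LOSS_VALUES)
for k in range(1, 7):
    print(k, round(total_expected_cost(k, 1.0, 0.05, LOSS_VALUES), 4))
print("cheapest number of copies:", best)
```

Past the minimum, each extra copy costs more than the expected loss it removes. A realistic model would add the other choices (audit frequency, diversification, formats) and drop the independence assumption.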

Are we there yet?

[Figure: plot with axes labeled risk and cost]

What are some threats?

• Physical & Hardware
• Insider & External Attacks
• Software
• Organizational Failure
• Curatorial Error

Threat Modeling
[Diagram: factors in a threat model for bit-level corruption]
• Corruption: bit corruption (media characteristics, threat characteristics, correlations); logical scope of corruption (format, file/encoding, and filesystem characteristics)
• Detection: auditing frequency, auditing algorithm
• Repair: probability of successful repair, repair algorithm, repair frequency, repair duration

Methods for Mitigating Bit-Level Risk
Local Storage
• Physical: media, hardware, environment
• Formats
• File transforms: compression, encoding, encryption
• File systems: transforms, deduplication, redundancy
Replication
• Diversification of copies
• Number of copies
Verification
• Fixity
• Audit
• Repair

Observations
• Blind replication is rarely a rational long-run strategy, even with lots of copies
• Without verification/audit and repair strategies, long-term risk often remains high
• There are multiple methods for mitigating threats to access; use these to guide diversification
• Threat / lifecycle modeling is needed in order to make a rational choice
• Better practices, models, and evidence are needed
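
Why blind replication falls short can be illustrated with a closed-form sketch. It assumes independent corruption of each copy with annual probability `q`, and an idealized yearly audit that restores every damaged copy from any surviving good copy. Both the model and the numbers are hypothetical.

```python
def loss_without_repair(q: float, copies: int, years: int) -> float:
    """P(every copy has been corrupted at some point within `years`).

    Blind replication: damaged copies are never detected or repaired.
    """
    p_copy_corrupted = 1.0 - (1.0 - q) ** years
    return p_copy_corrupted ** copies


def loss_with_annual_repair(q: float, copies: int, years: int) -> float:
    """P(all copies are corrupted within the same year, in at least one year).

    Assumes a perfect audit at the end of each year that repairs every
    damaged copy as long as at least one good copy survives.
    """
    p_all_fail_in_one_year = q ** copies
    return 1.0 - (1.0 - p_all_fail_in_one_year) ** years


Q, COPIES, YEARS = 0.05, 3, 20   # hypothetical illustration values
print(f"blind replication:      {loss_without_repair(Q, COPIES, YEARS):.4f}")
print(f"annual audit and repair: {loss_with_annual_repair(Q, COPIES, YEARS):.4f}")
```

With the same three copies, the audited configuration loses content far less often in this model. Real audits are imperfect and failures are correlated, so the gap in practice depends on the threat model.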

Research Priorities
• Applied Research for Cost Modeling and Audit Modeling
• Value of information
• Understanding Information Equivalence & Significance
• Policy Research on Trust Frameworks
• Preservation at Scale
• The Evidence Base for Digital Preservation

What Else do We Need To Know?
• What is the expected future value of a specified collection of digital content?
• What content is already being effectively stewarded by other organizations?
• How much is the expected future cost of preserving that content?
• How often do different threats to information manifest? How often is it that:
– storage hardware or media fail
– software errors cause information loss
– stored information becomes inaccessible because of obsolete formats, or loss of other contextual knowledge
– human error or maliciousness causes loss of content in an information system
• What is the reliability of current digital preservation networks and services?
• How successful are other proposed strategies for replication, monitoring, certification, and auditing at preventing loss due to these threats?

The Limits of Case Studies
Most current evidence for digital preservation practices and outcomes is based on local case studies and convenience samples
• Case studies are useful for:
– existence proofs
– raising awareness of problems
– process tracing
– hypothesis generation
• Case studies are not enough to:
– advance our scientific knowledge
– create robust predictive models
– test causal hypotheses
– strongly guide decision making
• Systematic evidence is needed to support both:
– general selection of digital preservation practices and methods
– applications of selected digital preservation methods in a specific operational context

How will we learn?
• Apply existing research methodologies from other fields, especially fields involving observational research on humans and human systems
• Some useful methodologies:
– probability-based surveys (e.g. of information management practice and outcomes)
– replicable simulation experiments tied to theoretically grounded models of information management and risk
– creation of testbeds and test corpora which can be used to systematically compare new practices, tools, and methods
– field experiments, in which randomized interventions are applied and evaluated in real operational environments

Observations
• Developing better practices will require going beyond case studies – to formal modeling, computer simulation, statistical analysis, experiments


National priorities for…

Digital Content

Selected Digital Content Areas that Challenge Curation
• Web and Social Media
• Electronic Records
• Moving Image and Recorded Sound
• Research Data
• “Big” Data

Goals of content curation
• Curation involves selection of content for retention, and management for use
• Selection requires predicting future value, in order to build an information portfolio that increases in value
• Management requires capturing and maintaining tacit information that ensures fitness for use

Content size, uncertain value, rapid change, unstable form, and external context are core challenges to curation

Observations:
• The tacit information needed to understand formats is lost over time. Format migration plans are needed to mitigate risk.
• Information objects are rarely self-documenting. Ensuring fitness for use requires metadata, provenance, “documentation”, rights, and authenticity information.
• To select content for long-term access, we need to develop theoretically grounded and empirically tested models of information valuation and portfolios.
• Cost-models for digital stewardship exist, but they are most accurate for collections of small, static, digital objects in stable formats.
• Generally, a few things are clear:
– Raw storage is rarely the limiting cost factor
– Management of objects is cheapest and most effective if tacit information is captured early in the lifecycle

National priorities for…

Technical Infrastructure

2014 Technical Infrastructure Priorities
• Interoperability and Portability in Storage Architectures
• Integration of Digital Forensics Tools
• Ensuring Content Integrity

Interoperability and Portability in Storage Architectures

As stewardship organizations manage increasingly large and complex data sets, the need for interoperability at various levels within the technical hardware and software stacks that make digital preservation possible becomes increasingly important. Interoperability of storage devices, hardware, data tape, and file system software would help alleviate bottlenecks in the interrelationship between distinct functions in workflows. There is a need to establish and promote technical means by which lower levels of the technology stack can directly integrate without requiring extensive computation and processing at higher levels.

Integration of Digital Forensics Tools
• Digital forensics tools are essential for working across the range of heterogeneous kinds of digital materials coming under stewardship
• Projects like BitCurator are pulling together the suite of tools to do this work and developing processes and workflows
• We are now at the point of implementation; it’s time for organizations to start implementing and sharing information about their work
• The result of this work will be large sets of heterogeneous digital files, which will then push for the development of tools to work with these kinds of data at scale

Ensuring Content Integrity

• Digital preservation is possible through a chain of migration of current hardware and software systems to yet-to-be-established future infrastructures. Maintaining file fixity is a minimum requirement.
• Beyond file fixity there is a need to ensure that the semantics of the data and the quality of representation remain unchanged when the object is represented in different forms. Identifying the significant semantic properties of the digital object, and algorithms to create semantic fingerprints, can ensure that meaning is preserved over time.
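
File-level fixity, the minimum requirement named here, is commonly checked by recording a cryptographic digest and recomputing it later. The sketch below uses SHA-256 from Python's standard library; the function names are hypothetical, and it covers only bit-level fixity, not the semantic-level checks described above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_digest: str) -> bool:
    """True if the file's current digest matches the digest recorded at ingest."""
    return sha256_of_file(path) == recorded_digest
```

A digest recorded at ingest and stored separately from the content lets a later audit detect any change to the bitstream. It cannot say whether a migrated copy in a different format still carries the same meaning.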


Observations:
• Interoperability and portability across local and cloud storage architectures remains a significant issue: beware economic and technical lock-in
• Curation of objects acquired later in the information lifecycle often requires digital forensics: invest in tools and expertise
• Ensuring integrity of content over time requires assessing fixity at both a file and semantic level

National priorities for…

Organizational Development

State of the curation practice: Trusted Digital Repositories
An organization with a mission to provide reliable, long-term access to managed digital resources to its designated community, coupled with sufficient evidence of practices to ensure the success of this mission.
• Formalized in:
– OAIS Reference Model (standardized in ISO 14721:2012)
– Trustworthy Repositories Audit & Certification (TRAC) (standardized in ISO 16363:2012)

National Priorities for Organizational Roles, Policies, and Practices
Identifies the need to increase cross-organizational cooperation to increase the impact of, and leverage, investments made by individual institutions.

Potential Nexuses for Preservation Failure
• Technical
– Media failure: storage conditions, media characteristics
– Format obsolescence
– Preservation infrastructure software failure
– Storage infrastructure software failure
– Storage infrastructure hardware failure

• External Threats to Institutions
– Third party attacks
– Institutional funding
– Change in legal regimes

• Quis custodiet ipsos custodes?
– Unintentional curatorial modification
– Loss of institutional knowledge & skills
– Intentional curatorial de-accessioning
– Change in institutional mission

Source: Reich & Rosenthal 2005

Priorities for Organizational Collaboration
1) Provision networked preservation services
– network of preservation service providers with specialized services, rather than every organization performing all aspects of digital preservation
– a number of core risks are institutional

2) Collaborate on shepherding and promotion of standards
– digital preservation community representation on the relevant standards bodies rather than each organization needing to participate in every body

3) Share digital preservation training and staffing resources

Observations
• Trustworthy repository standards provide good abstract models of a single institution’s curatorial responsibilities, and an inventory of accepted practices
• Many threats to content require multi-institutional stewardship
• Certification of trustworthiness and evaluation of impact of accepted practices are still in early stages
• Both intra- and inter-institutional collaboration is needed to provision preservation services, set standards, and establish and evaluate trustworthiness

What’s next?


A National Stewardship Agenda for 2015 and Beyond
• Drafts and update process starts this winter
• Community review process late spring
• An update will be presented in July at Digital Preservation 2014

Moving Digital Stewardship Forward
NDSA has a commitment to:
• Facilitating broad collaboration
• Promoting dissemination and engagement
• Regular updates and revisions of the National Agenda and core NDSA surveys

Want more information?
Contact NDSA for…
• Briefings, webinars, and consultations on the Agenda or other NDSA work
• Assistance in gathering comments on national policies and programs
• Assistance in recruiting experts for review and discussion panels; grant review
• Referrals to content stewards in specific areas

Observation: Principles
• The core of digital stewardship is taking broad responsibility for preservation and curation
• The goal of preservation is meaningful long-term access
• The principal activities of curation are selection and management for use

Observations: Planning
• Blind replication is rarely a rational long-run strategy, even with lots of copies
• Without verification and repair strategies, long-term risk often remains high
• There are multiple methods for mitigating threats to access; use these to guide diversification
• Threat / lifecycle modeling is needed in order to make a rational choice
• Developing better practices will require going beyond case studies, to formal modeling, computer simulation, statistical analysis, and experiments

Observations: Curation
• The tacit information needed to understand formats is lost over time. Format migration plans are needed to mitigate risk.
• Information objects are rarely self-documenting. Ensuring fitness for use requires metadata, provenance, “documentation”, rights, and authenticity information.
• To select content for long-term access, we need to develop theoretically grounded and empirically tested models of information valuation and portfolios.
• Cost-models for digital stewardship exist, but they are most accurate for collections of small, static, digital objects in stable formats.
• Generally, a few things are clear:
– Raw storage is rarely the limiting cost factor
– Management of objects is cheapest and most effective if tacit information is captured early in the lifecycle


Observations: Infrastructure
• Interoperability and portability across local and cloud storage architectures remains a significant issue: beware economic and technical lock-in
• Curation of objects acquired later in the information lifecycle often requires digital forensics: invest in tools and expertise
• Ensuring integrity of content over time requires assessing fixity at both a file and semantic level

Observations: Organizations
• Trustworthy repository standards provide good abstract models of a single institution’s curatorial responsibilities, and an inventory of accepted practices
• Many threats to content require multi-institutional stewardship
• Certification of trustworthiness and evaluation of impact of accepted practices are still in early stages
• Both intra- and inter-institutional collaboration is needed to provision preservation services, set standards, and establish and evaluate trustworthiness

Key Terms
• Audit: an independent evaluation of records and activities to assess a system of controls
• Authenticity: information used to verify the truthfulness of assertions about content or its provenance
• Curation: selection of content for retention, and management for fit use
• Content stewardship: broad responsibility for curation and preservation
• File fixity: information used to verify that a digital object has not been altered or corrupted
• Provenance: the chronology of the ownership, custody, operations on, and/or location of an information object
• Preservation: ensuring meaningful long-term access
• Trusted Digital Repository: an organization with a mission to provide reliable, long-term access to managed digital resources to its designated community, coupled with sufficient evidence of practices to ensure the success of this mission

Bibliography
• Bailey, Charles (2011). Digital Curation and Preservation Bibliography. <digitalscholarship.org/dcpb/>
• CCSDS (2012). Reference model for an open archival information system (OAIS). <public.ccsds.org/publications/archive/650x0m2.pdf>
• Digital Curation Centre (2010-4). How to Guides: <dcc.ac.uk/resources/how-guides>; Curation Reference Manual: <dcc.ac.uk/resources/curation-reference-manual>
• Giaretta, David (2011). Advanced Digital Preservation. <amazon.com/AdvancedDigital-Preservation-David-Giaretta>
• ISO (2012). ISO 16363:2012: Audit and certification of trustworthy digital repositories. <iso.org/iso/catalogue_detail.htm?csnumber=56510>
• Johnson, L., Adams Becker, S., Estrada, V., Freeman, A. (2014). NMC Horizon Report: 2014 Higher Education Edition. Austin, Texas: The New Media Consortium.
• NDSA (2013). National Agenda for Digital Stewardship. <digitalpreservation.gov/ndsa/nationalagenda/>
• Rosenthal, David SH, Thomas S. Robertson, Tom Lipkis, Vicky Reich, and Seth Morabito (2005). “Requirements for digital preservation systems: A bottom-up approach”. D-Lib Magazine 11(11). <dlib.org/dlib/november05/rosenthal/11rosenthal.html>

More Information

digitalpreservation.gov/ndsa/nationalagenda
ndsa@loc.gov

Questions?
E-mail: escience@mit.edu
Web: informatics.mit.edu
