LOCKSS for Government Documents:

A Needs Assessment for the Federal Depository Library Community

Supported by a National Science Foundation Grant

January 2004

TABLE OF CONTENTS

Introduction
Community Profile
Bibliographic Content Structure
  Dot Gov Domain Metrics
  Dot Gov Attributes
    Versions and Editions
    Official-ness
    Integrity
  Discovery and Tracking Issues
  Implications for a LOCKSS Documents Application
Social and Economic Aspects of Collection Development
  Preservation Architectures for Government Documents
  Roles for Depository Libraries
  Roles for the Government Printing Office
  Roles for Others
  Sustainability
Legal Aspects of Collection Development
Conclusion

Introduction

A group of government documents librarians began discussions in late 2001 at the
Depository Library Conference in Arlington, Virginia regarding the possible use of the
LOCKSS technology (http://www.lockss.org) to preserve and provide long-term access
to Federal web-based government information. With support of a National Science
Foundation small grant for exploratory research under the general digital government
information program, a 2-day workshop was held in April 2003 to assess the current
functionality of LOCKSS and to outline a needs assessment for a potential
government documents implementation of LOCKSS.1 This report represents a summary
of the discussions held by participants in that workshop.

Community Profile

Congress established the Federal Depository Library Program (FDLP) in 1860 to ensure
that the United States public has access to its government’s information. Authorized by
44 US Code Section 1902, the program involves the acquisition, format conversion, and
distribution of depository materials and the coordination of Federal depository libraries in
the 50 states, the District of Columbia and U.S. territories. The mission of the FDLP is to
disseminate information products from all three branches of the Government to
approximately 1200 libraries nationwide. Libraries that have been designated as
Federal depositories maintain these information products as part of their existing
collections and are responsible for assuring that the public has free access to the
material provided by the FDLP. Depository libraries represent a mix of library types,
including research libraries (both public and private), local public libraries, law libraries
and state library agencies. Of these 1200 libraries, fifty-two have “Regional” status:
these libraries automatically receive every document distributed under the program and
are expected to maintain access to the material in perpetuity.

The FDLP as described ensures long-term public access through a geographically
dispersed network of repositories to government information on print, microform and
tangible electronic media (e.g., CD-ROMs, VideoDiscs, magnetic tape). These
repositories serve as guardians and trusted repositories for the content and the public’s
right to unfettered and efficient access to the content. Increasingly, government
agencies are producing less paper and relying on digital versions of documents made
available solely on government servers. At present, there is no program for the
systematic distribution of these electronic documents through the depository program.
Long-term public access to this increasingly significant body of content is shifting from
the model of a distributed set of trusted repositories (including public, academic, and
government agency libraries) to a centralized model. For example, rather than
distributing materials to depository libraries, the Government Printing Office
provides access to a large electronic collection through its own server, GPO Access. No
longer is content distributed among a number of autonomous, trusted repositories of
content, but content is increasingly geographically concentrated, with a small number of
Federal servers located largely in the Washington, DC area serving everyone via the
Internet.

1 The summary of that meeting is available on the project website at http://lockss-docs.stanford.edu

FDLP institutions serve a diverse array of local community interests, ranging from
scholarly research communities to citizens as government information consumers (both
individually and collectively). They need to acquire and preserve this content for a
variety of reasons, including ensuring efficient and free access to important content
and preserving this content for future citizens and scholars.

Bibliographic Content Structure

In this portion of the report we consider the critical technical aspects of web-based
government information content that must be addressed in any potential LOCKSS-based
implementation.

Dot Gov Domain Metrics

Government information is a broad category encompassing a wide range of content.
Print publications and archival sources are two primary genres of government
information around which systematic access and preservation strategies have been
developed over the past two hundred years within the United States. Publications
include an array of relatively formal publications, ranging from reports by government
agencies or deliberative bodies on scientific-technical or public policy issues to formal
publications offered for sale by the Government Printing Office, originating agencies or
commercial re-publishers. Archival sources include paper correspondence, internal
reports, administrative records and other evidence of government activities stored in
agency office files or archives.

The development of electronic media, digital formats, and the internet has added
significantly to the variety of government information content and structure. Both of these
genres—publications and archival sources—have their equivalent in the electronic
arena. However, at least four new government information genres have emerged to
challenge access and preservation strategies: (1) portable digital media carrying an
array of diverse content (tape, diskette, CD-ROM, DVD, PDA, etc.); (2) databases
(spatial, numeric, etc.); (3) electronic administrative records (e-mail,
transaction records, etc.); and (4) software applications.

A study conducted by the California Digital Library with funding from the Andrew W.
Mellon Foundation explores the characteristics of the web-based Federal information
domain in great detail.2 In general, the study reveals that the domain of web-based
Federal information can be characterized as:

• highly volatile (rapid rate of content removal and addition)
• diverse in filetype, but with a preponderance of two filetypes (html and pdf)
• diverse in genre (publications, documents, databases of all types including numeric,
GIS, transaction records, etc.)
• opaque (a high percentage of the content is hidden within the deep web)
• complex in file structure (a single “publication” might be divided into several files within
or across web subdirectories)

2 Web-based Government Information: Evaluating Solutions for Capture, Curation, and Preservation.
Project Report – August 7, 2003. Unpublished.

A few metrics of the “dot gov” domain identified in this study can serve to illustrate these
themes.

• Federal websites occupy approximately one-half to one percent of the “surface web”
(available to the public)
• There has been a thirteen-fold increase in the size of the “dot gov” domain between
1992 and 2003
• The half-life of a Federal web resource is four months
• The vast majority of files on Federal websites are in one of two filetypes: html or pdf
• The dot gov domain occupies as much as 85 percent of the so-called “deep web”
(hidden behind firewalls and password-protection screens)
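
To make two of these metrics concrete, the short Python sketch below converts the four-month half-life into a survival fraction and the thirteen-fold growth into an average annual rate. Both conversions rest on assumptions that are not part of the study: that removal follows simple exponential decay and that growth was steady over the eleven years. The function names are illustrative only.

```python
# Inputs taken from the metrics listed above.
HALF_LIFE_MONTHS = 4.0        # half-life of a Federal web resource
GROWTH_FACTOR = 13.0          # thirteen-fold growth of the dot gov domain
GROWTH_YEARS = 2003 - 1992    # eleven years


def fraction_surviving(months, half_life=HALF_LIFE_MONTHS):
    """Share of resources still in place after `months`.

    ASSUMPTION: removal follows exponential decay, which the study
    does not state; it only reports the half-life.
    """
    return 0.5 ** (months / half_life)


def annual_growth_rate(factor=GROWTH_FACTOR, years=GROWTH_YEARS):
    """Compound annual growth rate implied by an overall growth factor.

    ASSUMPTION: growth was steady from year to year.
    """
    return factor ** (1.0 / years) - 1.0


if __name__ == "__main__":
    print(f"surviving after 12 months: {fraction_surviving(12):.1%}")
    print(f"implied annual growth:     {annual_growth_rate():.1%}")
```

Under these assumptions, only one resource in eight would remain at its original location after a year, and the domain would have grown by roughly a quarter each year.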

These findings are borne out by an unpublished analysis conducted by Jill
Vassilakos-Long. Her study focused on the realm of published content as cataloged through the
GPO’s Catalog of Government Publications and indicated that electronic records fell into
a small number of common formats: 88% were PDF files, with the remainder in HTML,
text or word-processed formats.

If viewed in terms of a web of interlocking, interdependent resources, the domain of
government content is self-referencing and highly volatile. It is characterized by broken
links and by links to commercial sites containing government-subsidized studies with
proprietary content.

Dot Gov Attributes

Government information is characterized by a distinctive combination of attributes
that challenge web preservation and access: (1) versions and editions; (2) official-ness;
and (3) integrity.

Versions and Editions


The first issue in this category relates to the accuracy and validity of information.
Agency missions depend on maintaining and disseminating current information.
Publications are routinely replaced by new and corrected versions or editions. The
intervals between new versions are not always consistent or predictable. In the print
environment the public issuance of new editions would be preceded by internal agency
effort at updating the existing paper edition. In the web environment this internal-
external publication relationship has become blurred: the existing “web edition” may be
directly replaced on an ongoing basis by new versions as new information is available.
At what point the cumulative effect of these updates merits identification as a new
“edition” is a question that is not easily resolved.

Official-ness
A second important issue for government information relates to its “official” character.
The legal and academic communities as well as the general public depend upon the
ability to identify the “official” version of a Federal government document. “Official-ness”
in the print environment is in general a settled question deriving from broadly accepted
customs in the areas of citation and imprint. In the web environment—characterized by
easily mutable and replicable content—the question of what constitutes the “official”
version of a government document is not settled. The first question is whether any
electronic or web-based version of a document can be described as “official”; this debate
continues within the legal community where the print version remains the “legal” version.
Assuming one agrees that electronic or web-based versions can qualify as official, what
are the constraints under which this designation can be made? Is the official version
that which resides on the originating agency web-site? Is it equivalent in official
character to versions that might be produced on tangible electronic media or republished
on another agency server such as GPO Access?

Integrity
A third critical issue relates to file and content integrity. According to Charles Cullen: “An
authentic object is one whose integrity is intact—one that is and can be proven or
accepted to be what its owners say it is. It matters little whether the object is
handwritten, printed, or in digital form.”3 The digital format provides a number of peculiar
challenges to affirming for the reader that its content has been corrupted neither through
bit-degradation nor content-manipulation. There are a number of technical approaches
to maintaining file-integrity. Digital preservation science proposes a number of
strategies involving routine file backups, refreshment, and migration. However, the issue
is not only technical but also social: establishment of authenticity will depend not only
upon the development of technical routines that guarantee chains of custody but also
upon the inclusion of “third parties” in the chain.4
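
One common technical routine for detecting both bit-degradation and content-manipulation is a fixity check: a cryptographic digest is recorded when a file is acquired and recomputed later for comparison. The Python sketch below illustrates the idea with SHA-256 from the standard library. It is an illustration only, not a description of how LOCKSS or any agency verifies content, and the sample document text is invented.

```python
import hashlib


def sha256_digest(data):
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data, recorded_digest):
    """True if the bytes still match the digest recorded at acquisition."""
    return sha256_digest(data) == recorded_digest


# Hypothetical sample content; any byte string would do.
original = b"Sample government report, edition 1\n"
recorded = sha256_digest(original)          # stored when the file is acquired

# A later copy in which a single word has been altered.
tampered = original.replace(b"edition 1", b"edition 2")

if __name__ == "__main__":
    print("original intact:", is_intact(original, recorded))
    print("tampered intact:", is_intact(tampered, recorded))
```

A digest shows only that the bytes are unchanged since it was recorded. Whether the recorded digest itself can be trusted is the social, chain-of-custody question raised above.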

Discovery and Tracking Issues

The metrics of the dot gov domain demonstrate the size, complexity and volatility of
web-based government information. The attribute issues discussed above point to the
need to maintain file integrity and authenticity. A third set of issues relates to the
identification, cataloging and monitoring of the content. The same document may
appear at several different URLs over time. Conversely, new content may be
substituted within the same URL path for existing content without notification or change
to the file-name.
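
Both tracking problems can be detected by recording a content digest for every capture, as the following Python sketch suggests. The class name, method names and URLs are hypothetical, and the approach only recognizes byte-identical copies: a document reissued with even a trivial change would appear as new content.

```python
import hashlib
from collections import defaultdict


class CaptureLog:
    """Records (URL, content digest) pairs from successive crawls."""

    def __init__(self):
        self.digests_by_url = defaultdict(list)   # url -> digests, in capture order
        self.urls_by_digest = defaultdict(set)    # digest -> urls where it was seen

    def record(self, url, content):
        digest = hashlib.sha256(content).hexdigest()
        history = self.digests_by_url[url]
        if not history or history[-1] != digest:
            history.append(digest)                # content at this URL has changed
        self.urls_by_digest[digest].add(url)

    def replaced_in_place(self):
        """URLs whose content changed without any change to the URL."""
        return sorted(u for u, h in self.digests_by_url.items() if len(h) > 1)

    def relocated(self):
        """Groups of URLs at which byte-identical content has appeared."""
        return sorted(sorted(urls) for urls in self.urls_by_digest.values()
                      if len(urls) > 1)


# Hypothetical captures: the same report appears at two URLs, then the
# first URL is silently given new content.
log = CaptureLog()
log.record("http://agency.example/report.pdf", b"edition 1")
log.record("http://agency.example/archive/report.pdf", b"edition 1")
log.record("http://agency.example/report.pdf", b"edition 2")

if __name__ == "__main__":
    print("replaced in place:", log.replaced_in_place())
    print("same content at:  ", log.relocated())
```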

Implications for a LOCKSS Documents Application

The bibliographic content structure analysis raises possibilities for a LOCKSS technical
solution as well as some challenges. The LOCKSS program deals well with a number
of the key aspects of the content structure. LOCKSS works well with static web filetypes
transmitted via http such as html and pdf, demonstrated to be the predominant filetypes
in the dot gov domain. In addition, LOCKSS holds the promise of assuring content
integrity and integrating third parties in the chain of custody in a way that could help to
address many of the attribute issues identified above.
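
The general idea of using several independent caches to vouch for one another can be sketched as a simple majority vote over content digests. The Python below is a deliberately simplified illustration and should not be read as the LOCKSS polling protocol; the cache names and the strict-majority rule are assumptions made for the example.

```python
import hashlib
from collections import Counter


def digest(content):
    return hashlib.sha256(content).hexdigest()


def poll(copies):
    """Compare copies of one file held by several caches.

    `copies` maps a cache name to the bytes that cache holds.
    Returns (winning_digest, dissenting_caches). If no digest is held by
    a strict majority, returns (None, all_cache_names): no copy can be
    declared good on the evidence available.
    """
    if not copies:
        raise ValueError("at least one copy is required")
    votes = {name: digest(content) for name, content in copies.items()}
    winner, count = Counter(votes.values()).most_common(1)[0]
    if count * 2 <= len(votes):
        return None, sorted(votes)
    return winner, sorted(n for n, d in votes.items() if d != winner)


# Hypothetical caches: one holds a damaged copy.
good = b"Annual report, final text"
copies = {"cache-a": good, "cache-b": b"Annual rep\x00rt, final text", "cache-c": good}

if __name__ == "__main__":
    winner, dissenters = poll(copies)
    print("agreed digest:", winner)
    print("caches needing repair:", dissenters)
```

The sketch also shows why the number of participating institutions matters: with only two caches in disagreement, there is no majority and neither copy can be preferred.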

However, the dot gov domain also poses some challenges to the existing LOCKSS
program in terms of versioning control and discovery. LOCKSS thus far has worked with
content that is published in new editions or issues at relatively regular
intervals, and the publishing location is clearly identified as part of the
LOCKSS publisher plug-in development effort. One clear possibility for resolving part of
this dilemma would be to collaborate with a central government agency such as the
Government Printing Office, in which responsibility for bibliographic control and
dissemination of government information is formally vested.

3 “Authentication of Digital Objects: Lessons from a Historian’s Research.” In Authenticity in a Digital
Environment. May 2000. Washington, DC: Council on Library and Information Resources.
4 See for example Duranti and MacNeil, “The Protection of the Integrity of Electronic Records: An
Overview of the UBC-MAS Research Project.” Archivaria 42:46-67.

Social and Economic Aspects of Collection Development

In this portion of the report we consider the social and economic issues governing
development of web-based government information collections specifically as they relate
to a potential LOCKSS documents implementation.

Preservation Architectures for Government Documents

The issue of long-term access to and preservation of digital Federal government
information has been the subject of much discussion within and among both the
government documents library and government publishing and archiving communities.
The partner group identified two general classes of ‘solutions’ to the issue, broadly
described as the “light” and “dark” archive approaches. The “light” archive approach
focuses on creating redundant copies of the digital content and placement of the content
at geographically and institutionally diverse locations. LOCKSS is perhaps the purest
example of a light archive approach to preserving long-term access to government
information. The “dark” archive approach focuses less on redundancy across
institutions, attempts to eliminate access to the digital master, and focuses attention on
managing the digital original through a variety of preservation routines (refreshment,
copying, integrity checks, migration, etc.). Various institutions, including libraries and
government agencies, are at present engaged in building dark archives that adhere to
these routines. The Trusted Digital Repository standards endorsed by RLG are perhaps
the best reflection of the dark-archive approach to preserving long-term access to digital
content.

These two approaches are not mutually exclusive. The Texas Electronic Depository
program involves a mix of both models, with copies of digital web-based state
government content being placed in electronic depository collections for public access
and certain institutions agreeing to preserve the files in their locally developed and
managed digital repositories. The group believes that it would be in the interest of the
Depository Library Community to rely upon and develop both approaches in order to
provide the best insurance policy for long-term access to Federal information content.

Roles for Depository Libraries

An implementation of LOCKSS for the Federal depository library community implies the
potential for supporting new roles for depository libraries beyond preservation and
access. Depository libraries are a diverse and representative group of organizations. The
institutions are often leaders and innovators across a range of library services, from
reference to resource discovery and collection building. They constitute a locus for the
application of technological innovation.

The Federal Depository Library Program (FDLP) is based upon a concept of resource
sharing, both in terms of the concept of shared ownership and management and the
encouragement of resource sharing arrangements at the local and regional levels. The
argument supporting this distributed approach in the print environment, cost savings
through cost sharing, is certainly at least as valid and appropriate in the distributed
digital library environment.

One value of the FDLP is the concept of heterogeneous access points sustaining
diverse community needs and interests. For example, in the print realm local libraries
select Federal content and agree to provide an array of public and technical
services sustaining access to and preservation of this content. Within the existing
guidelines, there is significant room for local variation. The program does not specify a
particular classification scheme or organization of reference services. The program
does not specify binding styles or stack access policies. In a networked environment,
LOCKSS might support the same ability of local libraries to acquire information resources
related to local interests and to organize them according to local need. This would
allow for the creation of new local modes of content access, new databases and
enhanced cataloging for digital content. The resulting resources would have the benefit
that anyone might access these specialized resources, not just the local depository
community.

Another key role for libraries has been as memory organizations that help through their
local stewardship of government content to ensure the authenticity of this content. The
very distributed nature of this content within local physical repositories poses
tremendous barriers to any attempt to tamper with the content (falsification or
destruction) in systematic fashion. In this respect, memory organizations serve as
trusted repositories for both the public and academic communities.

Just as a LOCKSS implementation for Federal documents has the potential to affirm,
reinforce and broaden the public benefits of existing roles for libraries, it might open a
range of new roles for depositories. As libraries manage the digital content, they are
positioned to create new knowledge in many forms. For example, one library might
extract or generate text files related to a particular set of documents of interest to their
local community. Searches within these files could be integrated with other text
searches in the library’s digital collection ensuring a broad cross-fertilization of
government and non-governmental content. Searchable interfaces to these text files
could become a valuable enhanced access point for this content, potentially available to
all members of the depository community.

In addition to creating new forms of access to richer digital collections, the availability of
files in the LOCKSS caches to citizens and scholars could empower new and innovative
interactions with the content beyond those that are possible with either print collections
or static page image files for that matter.

Roles for the Government Printing Office

As mentioned in the preceding section of the report, there are several potential roles for
GPO in a LOCKSS-based network of web-based government information. In large part
these derive directly from GPO’s existing mission as a key Federal agency responsible
for the dissemination and bibliographic control of government information. GPO is
uniquely positioned to facilitate a LOCKSS documents implementation. Specifically, the
adoption by GPO of the LOCKSS model, including establishment of a LOCKSS server,
might help solve several issues. GPO could:
• leverage its relationship with various Federal agencies and partnerships with
depository libraries to develop agency-specific LOCKSS plug-ins
• crawl agency web sites
• normalize formats across agencies
• authenticate captured content
• apply digital signatures as appropriate
• disseminate the content and associated metadata through the LOCKSS network

Roles for Others

Although GPO is uniquely positioned to facilitate a LOCKSS-based network of web-based
government information, participating libraries might also work with individual
government agencies, at the state level for example, in order to develop plug-ins and
redistribute content through the network.

Sustainability

Sustainability is a key economic aspect of collection development. Any implementation
of LOCKSS must address the issues of long-term viability of the LOCKSS network.
Ideally a LOCKSS documents implementation and the FDLP would develop as mutually-
sustaining programs. The three key sustainability questions are:

1. How to best support LOCKSS plug-in development for diverse government
agency web-publishing platforms?
2. How to ensure that a sufficient number of institutions develop and maintain
caches for a given array of government web publications to ensure file integrity?
3. How in general to make active rather than passive participation in a LOCKSS documents
program something in which institutions feel sufficiently vested that withdrawal
from participation is a very rare event?

Plug-in development requires a relatively deep technical facility with LOCKSS. It is
unclear whether the LOCKSS journal model – in which a single plug-in is associated with
an individual journal publisher platform – could directly be applied to Federal agency
content. As demonstrated in the first portion of this report, agency sites consist of a wide
variety of web content types and corresponding URL structures; journal publishers
generally adopt consistent file types and structures. If GPO were an active LOCKSS
server site, staff at GPO could be active partners with staff at agencies and depository
libraries in coordinating the development of appropriate plug-ins.
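
The contrast between journal and agency sites can be illustrated by treating a plug-in as a set of crawl rules: a starting URL plus patterns describing which URLs belong to the publication. The Python sketch below is a hypothetical model for discussion only. It does not reproduce the actual LOCKSS plug-in format, and every URL and pattern in it is invented.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrawlRules:
    """Hypothetical stand-in for a plug-in: where to start and what to keep."""
    start_url: str
    include: List[str] = field(default_factory=list)   # regular expressions

    def accepts(self, url):
        """True if the URL matches at least one include pattern in full."""
        return any(re.fullmatch(pattern, url) for pattern in self.include)


# A journal platform with one consistent URL structure needs one pattern.
journal = CrawlRules(
    start_url="http://journal.example/",
    include=[r"http://journal\.example/vol\d+/issue\d+/[\w-]+\.(?:html|pdf)"],
)

# An agency site mixing several structures needs a pattern for each one,
# and the list must be revised whenever the site is reorganized.
agency = CrawlRules(
    start_url="http://agency.example/",
    include=[
        r"http://agency\.example/pubs/[\w-]+\.pdf",
        r"http://agency\.example/news/\d{4}/[\w-]+\.html",
        r"http://agency\.example/data/reports\?id=\d+",
    ],
)

if __name__ == "__main__":
    for url in ("http://agency.example/pubs/annual-report.pdf",
                "http://agency.example/internal/memo.doc"):
        print(url, "->", agency.accepts(url))
```

Even in this toy form, the agency rules are longer and more fragile than the journal rules, which is the maintenance burden that the first sustainability question points to.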

Building a community of LOCKSS partners from the existing FDLP program raises
several issues. What are some of the key reasons that libraries might join a LOCKSS-
based network? These might include some of the following:

• Good citizenship: it is good to collect US public documents for current and
future access.
• Authenticity: participation helps to ensure for local readers that the content is
authentic (as assured by the imprint on paper in the past).
• Preservation: participation in the network ensures the integrity of the files.
• Access: redundancy of files makes it possible to develop direct local
access to cached content that is not subject to the bandwidth limitations that often
affect remote access.

Legal Aspects of Collection Development

The legal requirements governing a potential LOCKSS government documents
application are contained in Title 44 of the US Code. LOCKSS would need to comply with
two key sets of requirements, covering retention periods and withdrawals.

Assuming LOCKSS were broadly adopted within the parameters of the existing FDLP,
regional libraries would be required to retain the content in perpetuity, and selective
libraries would have the option of withdrawing content once 5 years had
elapsed since distribution of the content. In addition, GPO can, upon the request of an
agency, require return or destruction of FDLP content.
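
The retention rule as summarized in this paragraph can be expressed as a small check, sketched below in Python. The function and constant names are hypothetical, the sketch encodes only what is stated here rather than the full text of Title 44, and it does not model an agency-requested recall through GPO.

```python
from datetime import date

SELECTIVE_RETENTION_YEARS = 5   # period stated in the paragraph above


def add_years(day, years):
    """Same calendar day `years` later; 29 February falls back to 28 February."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def may_withdraw(library_status, distributed_on, today):
    """True if a library may withdraw content distributed on `distributed_on`.

    Regional libraries retain content in perpetuity; selective libraries
    may withdraw once five years have elapsed since distribution.
    """
    if library_status == "regional":
        return False
    if library_status == "selective":
        return today >= add_years(distributed_on, SELECTIVE_RETENTION_YEARS)
    raise ValueError(f"unknown library status: {library_status!r}")


if __name__ == "__main__":
    distributed = date(1999, 1, 15)
    print(may_withdraw("selective", distributed, date(2004, 1, 14)))
    print(may_withdraw("selective", distributed, date(2004, 1, 15)))
    print(may_withdraw("regional", distributed, date(2050, 1, 1)))
```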

LOCKSS is currently designed as a preservation system and maintenance of the content
in perpetuity is a critical aspect of its architecture. Although literal destruction or
withdrawal of the depository content might be a difficult design requirement, the content
could be rendered opaque to partners, users and the system in general. It appears that
this would satisfy the requirements of Title 44.
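
What “rendering content opaque” might mean in practice can be sketched as a flag on each cached item that hides it from every read path while leaving the stored bytes untouched. The Python below is a hypothetical design illustration, not a description of existing LOCKSS behavior, and it does not settle whether such a mechanism meets the legal requirement.

```python
from dataclasses import dataclass


@dataclass
class CachedItem:
    url: str
    content: bytes
    opaque: bool = False        # True once the item has been withdrawn


class Cache:
    """Hypothetical cache in which withdrawal hides content without deleting it."""

    def __init__(self):
        self._items = {}

    def store(self, url, content):
        self._items[url] = CachedItem(url, content)

    def withdraw(self, url):
        """Mark an item opaque; the stored bytes are left in place."""
        self._items[url].opaque = True

    def get(self, url):
        """Return the content, or None if the item is absent or opaque."""
        item = self._items.get(url)
        if item is None or item.opaque:
            return None
        return item.content

    def visible_urls(self):
        return sorted(u for u, item in self._items.items() if not item.opaque)


if __name__ == "__main__":
    cache = Cache()
    cache.store("http://agency.example/a.pdf", b"document A")
    cache.store("http://agency.example/b.pdf", b"document B")
    cache.withdraw("http://agency.example/b.pdf")
    print(cache.visible_urls())
    print(cache.get("http://agency.example/b.pdf"))
```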

Although depository collections are built around the needs of the Congressional district
community in which the institution is sited, depository collections should be available to
all members of the public. GPO has developed an inspection program that ensures that
all collections and ancillary services are available to all members of the public. The
inspection regime focuses on a number of areas, including reference services;
bibliographic control; promotion/outreach; continuing education; collection development;
and physical facilities.

Two key issues in the legal area for LOCKSS are:

• Does rendering cached content opaque satisfy FDLP withdrawal requirements?
• If a LOCKSS partner develops local public access to its LOCKSS cache, do
FDLP access requirements apply? (i.e., must these caches be fully available
over the internet?)

Conclusion

This report finds sufficient need and interest within the Federal Depository Library
community for a government documents implementation of the LOCKSS preservation
program. Such an implementation would complement existing planning in GPO for
digital preservation and extend the current distributed archiving model of the FDLP into
the digital realm. Such an implementation would require fuller technical development of
the LOCKSS model to accommodate the distinct nature of web-based government
content as well as the social, economic and legal aspects of depository documents
distribution.