DSpace:

Evaluating It As A RecordKeeping System

Albert C. Whittenberg

HIST 7994 (S685): Electronic Records Management

Dr. Philip C. Bantin

04 May 2009

Whittenberg

2

With a growing interest in preserving business and industry records, countless

organizations are looking for a solid recordkeeping system. Knowing this, many

software companies continue to grind out new products year after year. For the

budget-minded organization such as a public university or archives, the high cost of these

packages makes open source software solutions look attractive. One of these is DSpace,

a system created by MIT and Hewlett-Packard to manage digital assets. Drawing on

numerous examples of DSpace in use at institutions (including MIT), this paper evaluates the open

source program against the requirements of a true recordkeeping system as

set out in archivist and professor Philip C. Bantin’s book, Understanding Data and

Information Systems for Recordkeeping. Does DSpace meet all these requirements?

How does it handle metadata? Can it be an effective repository for business or

organizational records? What needs to be added? What needs to be changed?

According to the official DSpace Wiki, there are 334 organizations currently

using DSpace in 56 countries. 1 The Wiki further states that DSpace “captures, stores,

indexes, preserves and redistributes an organization's research material in digital formats”

and that “research institutions worldwide use DSpace for a variety of digital archiving

needs.” 2 The site also stresses repeatedly that the

software is completely open source and free to everyone. For those institutions that are

worried about technical support, the website also boasts a DSpace Community and a

DSpace Federation with mailing lists, conferences, user groups, workshops and a host of

1 DSpace Wiki, “DSpace Instances (as of 01/12/2009),” http://wiki.dspace.org/index.php/DSpaceInstances (accessed April 2009).

2 DSpace Wiki, “What is DSpace,” http://wiki.dspace.org/index.php/EndUserFaq#What_is_DSpace.3F (accessed April 2009).

other websites. DSpace has also been around for some time. According to the January

2003 online D-Lib Magazine, the history of the product is as follows:

In March 2000, Hewlett-Packard Company (HP) awarded $1.8 million to the MIT Libraries for an 18-month collaboration to build DSpace™, a dynamic repository for the intellectual output in digital formats of multi-disciplinary research organizations. HP Labs and MIT Libraries released the system worldwide on November 4, 2002, under the terms of the BSD open source license [1], one month after its introduction as a new service of the MIT Libraries. As an open source system, DSpace is now freely available to other institutions to run as-is, or to modify and extend as they require to meet local needs. From the outset, HP and MIT designed the system to be run by institutions other than MIT, and to support federation among its adopters, in both the technical and the social sense. 3

The reason for the project was a “need to collect, preserve, index and distribute” research

materials like those being generated by faculty at MIT. 4 It was to be free, easy to install

and easy to use, but is it a good recordkeeping system?

To answer this question, one must have both a definition of a recordkeeping

system and a set of requirements. As mentioned above, the book Understanding

Data and Information Systems for Recordkeeping provides these. One definition is “a

special kind of information system that manages and preserves the records that provide

evidence of business transactions or of personal activities.” 5 Another, from ISO Records

Management Standard 15489, defines a system “as an information system which captures,

manages and provides access to records through time,” and the three characteristics of records

managed by that system are authenticity, reliability and integrity. 6 Finally, the

requirements for a recordkeeping system (as detailed in Bantin’s book) are as

follows:

3 D-Lib Magazine (January 2003), “DSpace: An Open Source Dynamic Digital Repository,” http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

4 Ibid.

5 Philip C. Bantin, Understanding Data and Information Systems for Recordkeeping (New York: Neal-Schuman Publishers, 2008), 32.

6 Ibid.

1. Capture records,

2. Support classification scheme(s),

3. Capture record metadata,

4. Support audit control,

5. Ensure records are usable,

6. Manage security and control,

7. Schedule records for disposition, and

8. Preserve records 7

Each one of these requirements will be examined in terms of DSpace’s functionality.

Does DSpace capture records? DSpace’s main purpose or goal is to serve as a

“production digital repository service” for research organizations. 8 This means that

record capture would generally be manual rather than automatic. Researchers

and their assistants would be submitting their information in some sort of digital format

to the system as needed or after completion/publication. According to the definition of

the capture process in a recordkeeping system as listed in Bantin’s book, records can be

captured either automatically or manually (so DSpace does qualify in that respect). Other

characteristics mentioned are that the system must support capturing records from various

types of software and/or applications, must be able to maintain all components captured

as one record, must support versioning and, finally, must ensure reliability. 9

What types of records does DSpace support? Can it handle the countless word

processing and web applications available today? According to the DSpace Wiki, the

DSpace application can accommodate the following digital formats:

1. Documents, such as articles, preprints, working papers, technical reports, conference papers

2. Books

3. Theses

7 Bantin, 35-36.

8 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

9 Bantin, 38.

4. Data sets

5. Computer programs

6. Visualizations, simulations, and other models

7. Multimedia publications

8. Administrative records

9. Published books

10. Overlay journals

11. Bibliographic datasets

12. Images

13. Audio files

14. Video files

15. Reformatted digital library collections

16. Learning objects

17. Web pages 10

DSpace also gives institutions the capability to accommodate different workflows. This

means that different departments, groups, schools or teams can submit items and organize

them in different ways. Each workflow settles how items are grouped together,

who can submit and who can have access. DSpace then has the capability to maintain all

components as a single record or not, depending upon the department’s preference. While

this would seem to go against the definition of the record capture process in a true

recordkeeping system, the administrators of a DSpace instance could force the software

to group items.

Versioning was not available initially in DSpace. However, recent updates have

corrected this. According to the Wiki again, “a Google Summer of Code project in 2007

has implemented a versioning prototype, for DSpace Items, DSpace Items have two

identifiers, one permanent, the other is a version lineage id. The Lineage is comprised of

items, each with unique metadata and bundles, bitstreams within the items will be either

10 DSpace Wiki, “End User FAQ,” http://wiki.dspace.org/index.php/EndUserFaq#What_kind_of_content_does_DSpace_support.3F (accessed April 2009).

linked from the previous version or added anew.” 11 This is a prototype and probably still has

some bugs. The wiki article did not say whether any further updates had been made,

but hopefully two years have made a difference.
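
The two-identifier idea described in the wiki can be pictured with a small sketch. This is only an illustration of the concept, not the prototype's actual code: the class names, the counter used for permanent identifiers and the dictionary of bitstreams are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional

_ids = count(1)  # hypothetical source of permanent identifiers


@dataclass
class ItemVersion:
    permanent_id: int          # identifies exactly one version, never reused
    lineage_id: str            # shared by every version of the same item
    metadata: Dict[str, str]   # each version carries its own metadata
    bitstreams: Dict[str, bytes] = field(default_factory=dict)


class Lineage:
    """All versions of one item, oldest first."""

    def __init__(self, lineage_id: str) -> None:
        self.lineage_id = lineage_id
        self.versions: List[ItemVersion] = []

    def add_version(self, metadata: Dict[str, str],
                    new_bitstreams: Optional[Dict[str, bytes]] = None) -> ItemVersion:
        # Bitstreams are linked from the previous version unless added anew.
        carried = dict(self.versions[-1].bitstreams) if self.versions else {}
        carried.update(new_bitstreams or {})
        version = ItemVersion(next(_ids), self.lineage_id, dict(metadata), carried)
        self.versions.append(version)
        return version

    def latest(self) -> Optional[ItemVersion]:
        return self.versions[-1] if self.versions else None
```

In this sketch a revised working paper would receive a new permanent identifier while keeping the lineage identifier of the original, so both versions remain retrievable as parts of one record.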

Does DSpace support classification schemes? What is a classification scheme?

Bantin defines it as a “diagram, table, or other representation categorizing the

creator’s records, usually by hierarchical classes, and according to a coding system

expressed in alphabetical, numerical, or alphanumerical symbols.” 12 Again, this seems to

be resolved by the robust workflow system built into DSpace. Records can be classified

in a variety of ways or categories. One example is given in the already mentioned

article in D-Lib Magazine where “a department may choose to have two collections: one

for working papers and another for datasets. They may then decide that any member of

the faculty can deposit items to either collection directly, and that any member of the

general public can have access to these collections.” 13 This is a very simple classification

scheme, but more complex ones can be implemented. Record creators (frequently

called users in the articles and wiki) can be classified as well as records.
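
To make the link to Bantin's definition concrete, the department example can be sketched as a two-level hierarchy with a simple coding system. The class names and the numeric codes ("1", "1.1") are assumptions made for illustration; DSpace itself does not necessarily assign codes of this kind.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Collection:
    name: str
    items: List[str] = field(default_factory=list)  # item titles, for brevity


@dataclass
class Community:
    name: str
    collections: Dict[str, Collection] = field(default_factory=dict)

    def add_collection(self, name: str) -> Collection:
        # Return the existing collection if the name is already in use.
        return self.collections.setdefault(name, Collection(name))


def classification_codes(communities: List[Community]) -> Dict[str, str]:
    """Assign a hierarchical numeric code (for example '1.2') to every class."""
    codes: Dict[str, str] = {}
    for i, community in enumerate(communities, start=1):
        codes[community.name] = str(i)
        for j, collection in enumerate(community.collections.values(), start=1):
            codes[f"{community.name} / {collection.name}"] = f"{i}.{j}"
    return codes
```

A department with one collection for working papers and another for datasets would then appear as class 1 with subclasses 1.1 and 1.2, which is the kind of hierarchical, coded representation the definition describes.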

Does DSpace capture record metadata? It actually uses a well-recognized

standard in archives metadata: the qualified Dublin Core. This is composed of the 15

metadata elements of simple Dublin Core plus an additional three (Audience, Provenance

and RightsHolder):

11 DSpace Wiki, “DSpace 2.0/Comparing Existing Technologies,” (accessed April 2009).

12 Bantin, 39.

13 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

1. Title,

2. Creator,

3. Subject,

4. Description,

5. Publisher,

6. Contributor,

7. Date,

8. Type,

9. Format,

10. Identifier,

11. Source,

12. Language,

13. Relation,

14. Coverage and

15. Rights. 14

Only three of these fields are mandatory, with the others being optional. All the metadata

information is in the item record and is fully searchable. In its article regarding DSpace,

D-Lib Magazine authors acknowledge that the metadata “is indexed for browsing and

searching the system within a collection, across collections or across Communities.” 15

Since only three fields must be present by design, it is also assumed that organizations

could require further mandatory fields as needed.
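
How an organization might tighten the mandatory fields can be sketched as a small validation routine. This is an illustration only: the function name is hypothetical, and because the three mandatory fields are not named above, the default set below is a placeholder assumption that an institution would replace with its own list.

```python
from typing import Dict, Iterable, List

# The 15 simple Dublin Core elements listed above, lowercased.
DUBLIN_CORE = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Assumption: placeholder for the three mandatory fields, which this
# paper does not name; replace with the institution's own policy.
DEFAULT_REQUIRED = ("title", "date", "language")


def validate(record: Dict[str, str],
             required: Iterable[str] = DEFAULT_REQUIRED) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"unknown element: {k}" for k in record if k not in DUBLIN_CORE]
    problems += [f"missing required element: {k}"
                 for k in required if not record.get(k, "").strip()]
    return problems
```

Passing a longer `required` list, for example one that adds "creator" and "rights", is all it would take in this sketch to enforce a stricter local policy.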

Does DSpace provide and support some type of audit control? The requirements

set in Understanding Data and Information Systems for Recordkeeping regarding audit

control are fairly steep:

1. The system must maintain audit trails for all processes that create, update or modify, delete, access and use records.

2. At a minimum, the system must track the action that was implemented (what data or information was accessed, added, deleted or modified).

3. The system must automatically capture the audit trail.

4. The audit trail must be unalterable.

14 Dublin Core Metadata Initiative Website, “DCMI Metadata Terms,” http://dublincore.org/documents/dcmi-terms/ (accessed April 2009).

15 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

5. The audit trail must be kept at least until the records it refers to are destroyed or deleted.

6. The audit trail must be logically linked to the records it documents.

7. The audit data is not available for inspection or export by any user except those authorized (administrators of the system for example).

8. Documentation must be created when changes are made to the system or actions are taken on the records. 16

DSpace fails on several of these items. A design proposal for DSpace 2.0 from October

2004 gives one of the few discussions of auditing:

Another essential digital preservation process is basic auditing; i.e. periodically ensuring that the content in the archive is all present and correct, to ensure that storage systems are not failing, and content has not become corrupt. In DSpace 1.x, this is relatively simple in the case of bitstreams (sizes and checksums are stored for each), but the all-important data in the relational database is not easily auditable in this way. 17

While checksums can reveal whether records have been altered, no

information is given about any reporting or automatic documentation. This proposal

clearly stated that the records in the main relational database are not easily audited, nor

did the proposal mention enhancing this in later versions of the software. After

exploring the DSpace Wiki and the main DSpace.org website, I found little to no

information about auditing beyond a repetition of what was found in the 2.0 proposal. In fact,

there were several examples of people requesting third party auditing packages to use

with DSpace on the community listserv (with no clear answers given to solve the

problem).
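
To show what the missing capability might look like, the following is a minimal sketch of an audit trail that is captured automatically, linked to the record it documents and tamper-evident. It is not a DSpace feature or a third party product; the class and field names are hypothetical, and chaining each entry to the hash of the previous one makes alteration detectable rather than strictly impossible.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


class AuditTrail:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    @staticmethod
    def _digest(body: Dict[str, str]) -> str:
        text = json.dumps(body, sort_keys=True)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def record(self, user: str, action: str, record_id: str) -> Dict[str, str]:
        body = {
            "user": user,
            "action": action,        # e.g. create, update, delete, access
            "record_id": record_id,  # logical link to the record
            "time": datetime.now(timezone.utc).isoformat(),
            "prev": self._entries[-1]["hash"] if self._entries else "",
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return entry

    def entries_for(self, record_id: str) -> List[Dict[str, str]]:
        # Copies are returned so callers cannot edit the log itself.
        return [dict(e) for e in self._entries if e["record_id"] == record_id]

    def verify(self) -> bool:
        """Recompute every hash; an edited or removed entry breaks the chain."""
        prev = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

A production system would also need to restrict who may read the trail and keep it for as long as the records exist, which are policy matters outside this sketch.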

Does DSpace ensure that records are usable? Again, what are the requirements?

According to Understanding Data and Information Systems for Recordkeeping, records

must be “easily accessed and retrieved in a timely manner” with searching capabilities

16 Bantin, 41-42.

17 DSpace Wiki, “DSpace 2.0 Design Proposal,” http://wiki.dspace.org/static_files/1/16/Ds2arch.doc (accessed April 2009).

including full text searches or metadata across files and categories (entire classification

scheme hierarchy). 18 Using MIT’s DSpace instance as an example, users can browse

records based on collections, issue date, title, authors or subjects. 19 Their search engine

seems to be limited to Boolean-type searches like those found in Google or Yahoo. A

quick test of the word “Washington” produced 5,571 hits with examples of where the

term is part of the title, part of the author’s name or mentioned somewhere in the text.

Using the term “physics,” I also received departmental and theses links as well. Most of

the documents were available to review or print with the majority being in Adobe

Acrobat (PDF) format. As long as the website is up, access is presumably available,

so records have the potential of being available 24 hours a day, seven days a week.
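
The kind of Boolean searching observed above can be illustrated with a small inverted index over metadata fields. This is a sketch of the general technique only, under the assumption that every field is indexed word by word; it says nothing about how MIT's DSpace instance is actually implemented, and the class name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class MetadataIndex:
    def __init__(self) -> None:
        # Maps each lowercased word to the set of items that contain it.
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, item_id: str, fields: Dict[str, str]) -> None:
        for value in fields.values():
            for word in value.lower().split():
                self._postings[word.strip(".,;:()\"'")].add(item_id)

    def search_all(self, *terms: str) -> Set[str]:
        """Boolean AND: items containing every term."""
        sets = [self._postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def search_any(self, *terms: str) -> Set[str]:
        """Boolean OR: items containing at least one term."""
        return set().union(*(self._postings.get(t.lower(), set()) for t in terms))
```

Because titles and authors are indexed alike, a term such as “Washington” matches both a title and an author's name, which mirrors the mixed results of the quick test described above.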

What about security? How does DSpace handle security and also control access?

As with many systems based on a web interface, the developers gave this a great deal of

attention. DSpace was created for the UNIX operating system, and the primary code was

written in Java. All additional components are open source as well and common to the

web environment (an example is that DSpace uses Apache as its web server engine which

is one of the most common in the industry). Because DSpace avoids specialized packages and

relies on components that are readily available and robust, an organization has

countless tools that could be used to protect the DSpace servers. Antivirus, anti-spam,

firewall and other software is available in many flavors for a UNIX system.

Again according to Understanding Data and Information Systems for Recordkeeping, a

primary focus of security is allowing only authorized employees or researchers the ability

18 Bantin, 42.

19 MIT Libraries DSpace Website, “Search DSpace,” http://dspace.mit.edu/search (accessed April 2009).

to create, delete or update records. DSpace should be able to limit access in terms of

record manipulation and should also never present information that a user does not have

the necessary permission to receive. 20 DSpace responds to these requirements with its

unique workflow system. Departments within an organization have the capability of

setting restrictions based on how DSpace Communities are set:

In other words, different DSpace Communities, representing different schools, departments, research labs and centers, have very different ideas of how material should be submitted to DSpace, by whom, and with what restrictions. Who is allowed to deposit items? What type of items will they deposit? Who else needs to review, enhance, or approve the submission? To what collections can they deposit material? Who can see the items once deposited? All of these issues are addressed by the Community representatives, working together with the Libraries' DSpace user support staff, and are then modeled in a workflow for each collection to enforce their decisions. The system models "e-people" who have "roles" in the workflow of a particular Community in the context of a given collection. Individuals from the Community are registered with DSpace, then assigned to appropriate roles. 21

An example of this is Indiana University’s ScholarWorks repository. According to their

website, contributors are limited to IU departments and scholars (faculty, students and

other organizations on campus). To gain access, individuals or departments must submit

requests to create “communities” and must meet specific requirements:

To get started, departments should decide on:

Content they would like to distribute widely and preserve over the long-term,

A contact person to work with the IUScholarWorks Repository team to set up and run the Community,

The Community/Collection structure that is best for the department or units content,

Metadata (descriptive cataloging information) and

20 Bantin, 45.

21 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

Individuals who will be allowed to submit materials. 22

As a user who is not a member of any IU community, I was able to access ScholarWorks

and browse the collection based on community, collection, issue date, author, title or

subject. I was able to access a wealth of information, but it was only read access. I was

never given the chance to manipulate the records in any fashion. To do this, I would

have had to go through a formal process with the IU staff to get an IUScholarWorks

Repository Account.
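
The behavior I observed, read access for anyone but changes only for registered people, can be sketched as role-based access per collection. The role names, the actions attached to them and the class itself are assumptions for illustration; they are modeled loosely on the "e-people" and "roles" language of the D-Lib article and are not DSpace's actual authorization code.

```python
from typing import Dict, Set, Tuple

# Assumption: three illustrative roles and the actions each may perform.
ROLE_ACTIONS: Dict[str, Set[str]] = {
    "reader": {"read"},
    "submitter": {"read", "submit"},
    "reviewer": {"read", "approve", "edit_metadata"},
}


class AccessPolicy:
    def __init__(self) -> None:
        # (e-person, collection) -> roles held in that collection
        self._roles: Dict[Tuple[str, str], Set[str]] = {}
        self._public: Set[str] = set()  # collections open to anonymous readers

    def open_to_public(self, collection: str) -> None:
        self._public.add(collection)

    def assign(self, eperson: str, collection: str, role: str) -> None:
        if role not in ROLE_ACTIONS:
            raise ValueError(f"unknown role: {role}")
        self._roles.setdefault((eperson, collection), set()).add(role)

    def allowed(self, eperson: str, collection: str, action: str) -> bool:
        if action == "read" and collection in self._public:
            return True
        roles = self._roles.get((eperson, collection), set())
        return any(action in ROLE_ACTIONS[r] for r in roles)
```

Under this sketch an outside visitor to a public collection passes the check for "read" and fails it for "submit", which matches my experience with ScholarWorks.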

Does DSpace provide a means to retain and dispose of records? In terms of

preservation, the DSpace creators focused on two main types of digital preservation called “bit

preservation” and “functional preservation”. 23 The first means the record is preserved

exactly as it was submitted (down to the actual bit count). Functional preservation means that the

record will be changed to allow for changes in software and technology to ensure its

accessibility. DSpace currently captures the necessary metadata to support bit

preservation (although each repository should also have a solid program of backups and

disaster recovery plans in place). Functional preservation is limited by an organization’s policy. The

DSpace creators cannot see the future and predict the countless software updates that may

occur. An example is co-creator MIT itself. They plan to provide functional support for

well-known documented standards such as TIFF or XML, but not for rare or complicated

formats such as CAD drawings. 24 Another example is IU ScholarWorks which clearly

states they are “not equipped to support the archiving and/or accessibility of dynamic

22 Indiana University ScholarWorks Repository Website, “Getting Started with the IUScholarWorks Repository,” https://scholarworks.iu.edu/docs/repository/gettingstarted.shtml (accessed April 2009).

23 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

24 Ibid.

resources like open web sites, interactive applications, files with complex metadata

requirements, streaming audio or video, authoring tools, or dynamic learning objects.” 25
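
Bit preservation lends itself to a short worked example. The sketch below records a size and checksum for each file at submission and later compares the stored files against those values, in the spirit of the sizes and checksums the 2.0 proposal says DSpace 1.x keeps for each bitstream. The function names, the in-memory store and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, NamedTuple


class Fixity(NamedTuple):
    size: int
    checksum: str


def fingerprint(data: bytes) -> Fixity:
    # Assumption: SHA-256 is used here; the algorithm DSpace uses is not stated above.
    return Fixity(len(data), hashlib.sha256(data).hexdigest())


def audit(store: Dict[str, bytes], manifest: Dict[str, Fixity]) -> List[str]:
    """Compare stored bitstreams with the values recorded at submission."""
    problems = []
    for name, expected in manifest.items():
        if name not in store:
            problems.append(f"{name}: missing")
        elif fingerprint(store[name]) != expected:
            problems.append(f"{name}: size or checksum mismatch")
    return problems
```

Run periodically, a check of this kind confirms that every file is still present and unchanged down to the bit, although it cannot by itself say whether the file can still be opened by current software, which is the concern of functional preservation.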

Part of the process of retaining and disposing of records is the ability to make

backups, as well as the system’s ability to create reports when records are changed or

deleted. The DSpace Wiki lists in detail the means to restore a system using a full

backup as well as what must be done if you are using a different platform or operating

system version. Since DSpace uses PostgreSQL (an open source database management

system), an SQL dump created from a backup can be uploaded into a new PostgreSQL

instance to get the system back online. 26 Since DSpace uses a number of servers for the

overall system, it is also advisable for the organization to have sufficient hardware to

possibly do mirroring (at least a matching server for each one in the system that is

updated as the main ones are updated). Unfortunately, without access to any sort of

administrator account and with no relevant information found on the DSpace Wiki, there is

little to say about reporting. Several of the wiki documents mention reports or

statistics, so one can only assume that reporting exists. In any case, there are statistical packages

that can be used with an Apache web server to show when files are added, changed or

deleted. This is also true for PostgreSQL environments. If DSpace does not have it built

in, other products can be used to enhance the process.
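
The dump-and-restore step mentioned above can be sketched with PostgreSQL's standard pg_dump and psql command-line tools. The database name, user name and file name in the usage note are assumptions, and the helper functions are hypothetical; by default the sketch only builds the command line and does not run anything.

```python
import subprocess
from typing import List


def dump_command(database: str, user: str, outfile: str) -> List[str]:
    """Build a pg_dump call that writes a plain SQL dump to outfile."""
    return ["pg_dump", "--username", user, "--file", outfile, database]


def restore_command(database: str, user: str, dumpfile: str) -> List[str]:
    """Build a psql call that loads the SQL dump into an existing database."""
    return ["psql", "--username", user, "--dbname", database, "--file", dumpfile]


def run(command: List[str], dry_run: bool = True) -> str:
    """Return the command line; execute it only when dry_run is False."""
    line = " ".join(command)
    if not dry_run:
        subprocess.run(command, check=True)
    return line
```

For example, `run(dump_command("dspace", "dspace", "dspace.sql"))` returns the command an administrator would schedule for a nightly dump, and the matching restore command would be used on the mirror server.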

Does DSpace meet the requirements for a true recordkeeping system? It has the

capability to capture records in a variety of digital formats. Recent updates have also

25 Indiana University ScholarWorks Repository Website, “IUScholarWorks Repository FAQ for Submitters,” https://scholarworks.iu.edu/docs/repository/faq.shtml (accessed April 2009).

26 DSpace Wiki, “Backup and Restore,” http://wiki.dspace.org/index.php/BackupRestore (accessed April 2009).

added versioning. Its workflow system does provide a means for classification schemes

and ensures that only the right people have access to change records (by forming

communities and creating user accounts). DSpace is using an accepted metadata

standard, and the information is incorporated in the record (and is searchable). Security

is a strength, as is the use of accepted software standards like UNIX, Java and

Apache (for which several strong security packages exist). Reporting may or may not be

a problem, but additional packages again can be purchased to expand this capability. The

one glaring weakness seems to be audit control with the online documentation clearly

stating that records in the database are not audited easily. As mentioned before, there are

a number of organizations trying to find a third party software solution to resolve this

with no clear winners/suggestions being highlighted on the wiki.

How can DSpace be converted into a true recordkeeping system? What steps

must take place? What types of functionality must be added? For the institution willing

to take these steps, it would seem logical to investigate third party solutions (and

potentially open source solutions) for the problems with audit control, reporting and to a

lesser degree, security. For example, PostgreSQL has a report generator through its open

source graphical user interface pgaccess. 27 Free, and supported by several PostgreSQL

listservs, this could possibly be adapted to produce some of the needed reports for an

organization. The options for security with UNIX servers and Apache web applications

are so numerous that an organization should get its IT department involved to wade

through the many possibilities.

27 PostgreSQL Website, “User Client Questions,” http://www.postgresql.org/files/documentation/books/aw_pgsql/node194.html (accessed May 2009).

Another potential problem could be metadata. As mentioned before, metadata

could be altered to require more than the three mandatory fields of Dublin Core.

However, the Dublin Core has relatively few fields, most of which concern creation.

Solid recordkeeping metadata should include fields covering the entire life of the record. In

Bantin’s book, there are nine primary categories:

1. Identification or Registration Metadata,

2. Content Metadata,

3. Contextual Metadata,

4. Audit Trail Metadata,

5. Access and Use Metadata,

6. Disposition Metadata,

7. Preservation History Metadata,

8. Structural Metadata and

9. History of Use Metadata. 28

The Dublin Core version used by DSpace covers roughly the first and second categories

while leaving some significant gaps for the rest. Is it any wonder that most of the

standards listed in Understanding Data and Information Systems for Recordkeeping are

dramatically larger (such as the European model, listed with 109 elements, 79 of them

mandatory)? 29

DSpace is a remarkable product and can truly be a viable solution for an

institution needing an online repository. It is not a perfect recordkeeping solution and

may require additional software to expand its functionality depending on the institution’s

needs. However, price always seems to be a concern for most universities or other

organizations that might need a digital repository. Here, DSpace beats most of

its competitors and makes many an archive or library think about implementing it (as

28 Bantin, 48.

29 Ibid., 49.

seen by the 334 organizations using it already). Open source products are sometimes a

concern to install due to the lack of technical support. DSpace seems to have overcome

this as well through the online communities that have formed to help one another. Is it perfect?

It is not, but there is probably not a “perfect” system out there. DSpace should be

considered if your organization needs a digital repository.

Bibliography

Bantin, Philip C. Understanding Data and Information Systems for Recordkeeping. New York: Neal-Schuman Publishers, 2008.

D-Lib Magazine (January 2003), “DSpace: An Open Source Dynamic Digital Repository,” http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

DSpace Wiki, “Backup and Restore,” http://wiki.dspace.org/index.php/BackupRestore (accessed April 2009).

DSpace Wiki, “DSpace 2.0/Comparing Existing Technologies,” #Versioning_Content (accessed April 2009).

DSpace Wiki, “DSpace Instances (as of 01/12/2009),” http://wiki.dspace.org/index.php/DSpaceInstances (accessed April 2009).

DSpace Wiki, “What is DSpace,” http://wiki.dspace.org/index.php/EndUserFaq#What_is_DSpace.3F (accessed April 2009).

Dublin Core Metadata Initiative Website, “DCMI Metadata Terms,” http://dublincore.org/documents/dcmi-terms/ (accessed April 2009).

Indiana University ScholarWorks Repository Website, “Getting Started with the IUScholarWorks Repository,” https://scholarworks.iu.edu/docs/repository/gettingstarted.shtml (accessed April 2009).

Indiana University ScholarWorks Repository Website, “IUScholarWorks Repository FAQ for Submitters,” https://scholarworks.iu.edu/docs/repository/faq.shtml (accessed April 2009).

MIT Libraries DSpace Website, “Search DSpace,” http://dspace.mit.edu/search (accessed April 2009).

PostgreSQL Website, “User Client Questions,” http://www.postgresql.org/files/documentation/books/aw_pgsql/node194.html (accessed May 2009).