
DSpace: Evaluating It As A RecordKeeping System

Albert C. Whittenberg

HIST 7994 (S685): Electronic Records Management Dr. Philip C. Bantin 04 May 2009

With a growing interest in preserving business and industry records, countless organizations are looking for a solid recordkeeping system. Knowing this, many software companies continue to grind out new products year after year. For the budget-minded organization such as a public university or archives, the high cost of these packages makes open source software solutions look attractive. One of these is DSpace, a system created by MIT and Hewlett-Packard to manage digital assets. Drawing on numerous examples of DSpace in use at institutions (including MIT), this paper evaluates the open source program against the requirements of a true recordkeeping system as set out in archivist and professor Philip C. Bantin’s book, Understanding Data and Information Systems for Recordkeeping. Does DSpace meet all these requirements? How does it handle metadata? Can it be an effective repository for business or organizational records? What needs to be added? What needs to be changed?

According to the official DSpace Wiki, there are 334 organizations currently using DSpace in 56 countries.1 The Wiki further states that DSpace “captures, stores, indexes, preserves and redistributes an organization's research material in digital formats” and that “research institutions worldwide use DSpace for a variety of digital archiving needs.”2 The site also stresses repeatedly that the software is completely open source and free to everyone. For institutions that are worried about technical support, the website also boasts a DSpace Community and a DSpace Federation with mailing lists, conferences, user groups, workshops and a host of other websites.

1 DSpace Wiki, “DSpace Instances (as of 01/12/2009),” http://wiki.dspace.org/index.php/DSpaceInstances (accessed April 2009).
2 DSpace Wiki, “What is DSpace,” http://wiki.dspace.org/index.php/EndUserFaq#What_is_DSpace.3F (accessed April 2009).

DSpace has also been around for some time. According to the January 2003 online D-Lib Magazine, the history of the product is as follows:


In March 2000, Hewlett-Packard Company (HP) awarded $1.8 million to the MIT Libraries for an 18-month collaboration to build DSpace™, a dynamic repository for the intellectual output in digital formats of multi-disciplinary research organizations. HP Labs and MIT Libraries released the system worldwide on November 4, 2002, under the terms of the BSD open source license [1], one month after its introduction as a new service of the MIT Libraries. As an open source system, DSpace is now freely available to other institutions to run as-is, or to modify and extend as they require to meet local needs. From the outset, HP and MIT designed the system to be run by institutions other than MIT, and to support federation among its adopters, in both the technical and the social sense.3

The reason for the project was a “need to collect, preserve, index and distribute” research materials like those being generated by faculty at MIT.4 It was to be free, easy to install and easy to use, but is it a good recordkeeping system? To answer this question, one must have both a definition of a recordkeeping system and a set of requirements. As mentioned above, the book Understanding Data and Information Systems for Recordkeeping provides these. One definition is “a special kind of information system that manages and preserves the records that provide evidence of business transactions or of personal activities.”5 Another, from ISO Records Management Standard 15489, defines a recordkeeping system “as an information system which captures, manages and provides access to records through time”; the three characteristics of records managed by such a system are authenticity, reliability and integrity.6 Finally, the requirements for a recordkeeping system (as detailed in Bantin’s book) are as follows:

1. Capture records,
2. Support classification scheme(s),
3. Capture record metadata,
4. Support audit control,
5. Ensure records are usable,
6. Manage security and control,
7. Schedule records for disposition, and
8. Preserve records.7

3 D-Lib Magazine (January 2003), “DSpace: An Open Source Dynamic Digital Repository,” http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).
4 Ibid.
5 Philip C. Bantin, Understanding Data and Information Systems for Recordkeeping (New York: Neal-Schuman Publishers, 2008), 32.
6 Ibid.


Each one of these requirements will be examined in terms of DSpace’s functionality. Does DSpace capture records? DSpace’s main purpose or goal is to serve as a “production digital repository service” for research organizations.8 This means that record capture would generally be manual rather than automatic. Researchers and their assistants would submit their information to the system in some digital format as needed or after completion or publication. According to the definition of the capture process in a recordkeeping system in Bantin’s book, records can be captured either automatically or manually, so DSpace qualifies in that respect. The other characteristics mentioned are that the system must support capturing records from various types of software and applications, must be able to maintain all captured components as one record, must support versioning and, finally, must ensure reliability.9 What types of records does DSpace support? Can it handle the countless word processing and web applications available today? According to the DSpace Wiki, the DSpace application can accommodate the following digital formats:

1. Documents, such as articles, preprints, working papers, technical reports, conference papers
2. Books
3. Theses
4. Data sets
5. Computer programs
6. Visualizations, simulations, and other models
7. Multimedia publications
8. Administrative records
9. Published books
10. Overlay journals
11. Bibliographic datasets
12. Images
13. Audio files
14. Video files
15. Reformatted digital library collections
16. Learning objects
17. Web pages10

7 Bantin, 35-36.
8 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).
9 Bantin, 38.

DSpace also gives institutions the capability to accommodate different workflows. This means that different departments, groups, schools or teams can submit items and organize them in different ways, which answers the questions of how items are grouped together, who can submit and who can have access. DSpace can then maintain all components as a single record or not, depending upon the department’s preference. While this would seem to go against the definition of the record capture process in a true recordkeeping system, the administrators of a DSpace instance could force the software to group items.

10 DSpace Wiki, “End User FAQ,” http://wiki.dspace.org/index.php/EndUserFaq#What_kind_of_content_does_DSpace_support.3F (accessed April 2009).

Versioning was not available initially in DSpace. However, recent updates have corrected this. According to the Wiki again, “a Google Summer of Code project in 2007 has implemented a versioning prototype, for DSpace Items, DSpace Items have two identifiers, on[e] permanent, the other is a version lineage id. The Lineage is comprised of items, each with unique metadata and bundles, bitstreams within the items will be either linked from the previous version or added anew.”11 Because this is a prototype, it probably still has some bugs. The wiki article did not say whether any further updates had been made, but hopefully two years have made a difference.

Does DSpace support classification schemes? What is a classification scheme? Bantin defines it as a “diagram, table, or other representation categorizing the creator’s records, usually by hierarchical classes, and according to a coding system expressed in alphabetical, numerical, or alphanumerical symbols.”12 Again, this seems to be handled by the robust workflow system built into DSpace. Records can be classified in a variety of ways or categories. One example is given in the already mentioned D-Lib Magazine article, where “a department may choose to have two collections: one for working papers and another for datasets. They may then decide that any member of the faculty can deposit items to either collection directly, and that any member of the general public can have access to these collections.”13 This is a very simple classification scheme, but more complex ones can be implemented. Record creators (frequently called users in the articles and wiki) can be classified as well as records.

11 DSpace Wiki, “DSpace 2.0/Comparing Existing Technologies,” http://wiki.dspace.org/index.php/DSpace_2.0/Comparing_Existing_Technologies#Versioning_Content (accessed April 2009).
12 Bantin, 39.
13 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

Does DSpace capture record metadata? It does, using a well-recognized metadata standard in archives: qualified Dublin Core. This is composed of the 15 metadata elements of simple Dublin Core, listed below, plus an additional three (Audience, Provenance and RightsHolder):

1. Title,
2. Creator,
3. Subject,
4. Description,
5. Publisher,
6. Contributor,
7. Date,
8. Type,
9. Format,
10. Identifier,
11. Source,
12. Language,
13. Relation,
14. Coverage and
15. Rights.14

Only three of these fields are mandatory, with the others being optional. All the metadata is stored in the item record and is fully searchable. In their article on DSpace, the D-Lib Magazine authors note that the metadata “is indexed for browsing and searching the system within a collection, across collections or across Communities.”15 Since only three fields must be present by design, it is also assumed that organizations could make further fields mandatory as needed.
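As a minimal sketch of how a repository might enforce mandatory fields at submission time, the Python below checks a record against a configurable set of required Dublin Core elements. The element names come from the list above. Which three fields are actually mandatory is not stated in the sources consulted, so the default set used here (title, creator, date) is an assumption for illustration only, and `missing_fields` is a hypothetical helper rather than part of DSpace.

```python
# The 15 simple Dublin Core elements listed above.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# ASSUMPTION: the sources do not say which three fields are mandatory.
# This default is a placeholder that a local policy would replace.
DEFAULT_MANDATORY = frozenset({"title", "creator", "date"})


def missing_fields(record, mandatory=DEFAULT_MANDATORY):
    """Return the mandatory elements that are absent or blank, sorted by name."""
    unknown = set(mandatory) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    missing = []
    for name in mandatory:
        value = record.get(name)
        # Treat a missing key, None, or whitespace-only text as "not supplied".
        if value is None or not str(value).strip():
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    draft = {"title": "Working Paper 12", "creator": "", "format": "application/pdf"}
    print(missing_fields(draft))
    # An organization could tighten the policy by adding further fields.
    print(missing_fields(draft, DEFAULT_MANDATORY | {"rights"}))
```

Passing a larger `mandatory` set is the code-level equivalent of an organization deciding that more than three fields must be completed before an item is accepted.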


14 Dublin Core Metadata Initiative Website, “DCMI Metadata Terms,” http://dublincore.org/documents/dcmi-terms/ (accessed April 2009).
15 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

Does DSpace provide and support some type of audit control? The requirements set in Understanding Data and Information Systems for Recordkeeping regarding audit control are fairly steep:

1. The system must maintain audit trails for all processes that create, update or modify, delete, access and use records.
2. At a minimum, the system must track the action that was implemented (what data or information was accessed, added, deleted or modified).
3. The system must automatically capture the audit trail.
4. The audit trail must be unalterable.
5. The audit trail must be kept at least until the records it refers to are destroyed or deleted.
6. The audit trail must be logically linked to the records it documents.
7. The audit data is not available for inspection or export by any user except those authorized (administrators of the system for example).
8. Documentation must be created when changes are made to the system or actions taken to the records.16
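One way to approximate several of these requirements (automatic capture, a logical link to the record, and evidence of tampering) is an append-only log in which each entry stores a hash of the entry before it. The Python sketch below is illustrative only: the `AuditTrail` class, its methods and its field names are hypothetical and are not drawn from DSpace or from Bantin's book. A hash chain makes alteration detectable rather than impossible, so it only partly addresses the "unalterable" requirement.

```python
import hashlib
import json
from datetime import datetime, timezone


class AuditTrail:
    """Append-only, hash-chained log of actions taken on records (hypothetical)."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(body):
        # Serialize with sorted keys so the same entry always hashes the same way.
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def log(self, record_id, user, action, detail=""):
        """Record what was done, by whom, to which record, and when."""
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "record_id": record_id,  # logical link to the record
            "user": user,
            "action": action,        # e.g. create, update, delete, access
            "detail": detail,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev,
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return dict(entry)  # hand back a copy so callers cannot edit the log

    def entries_for(self, record_id):
        """Return copies of every entry that refers to the given record."""
        return [dict(e) for e in self._entries if e["record_id"] == record_id]

    def verify(self):
        """Return True only if no entry has been altered, removed or reordered."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

In a real system the calls to `log` would be made by the application itself on every create, update, delete or access, and access to the log would be restricted to authorized administrators.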


DSpace fails on several of these items. A design proposal for DSpace 2.0 from October 2004 gives one of the few descriptions of auditing:

Another essential digital preservation process is basic auditing; i.e. periodically ensuring that the content in the archive is all present and correct, to ensure that storage systems are not failing, and content has not become corrupt. In DSpace 1.x, this is relatively simple in the case of bitstreams (sizes and checksums are stored for each), but the all-important data in the relational database is not easily auditable in this way.17

While checksums can reveal whether records have been altered, no information is given about any reporting or automatic documentation. The proposal clearly stated that the records in the main relational database are not easily audited, nor did it mention enhancing this in later versions of the software. After exploring the DSpace Wiki and the main DSpace.org website, I found little to no information about auditing beyond what was in the 2.0 proposal. In fact, there were several examples on the community listserv of people requesting third-party auditing packages to use with DSpace, with no clear answers given to solve the problem.
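The bitstream check that the proposal describes, comparing a stored size and checksum against the file on disk, can be sketched as follows. This is a minimal illustration and not DSpace's own code: the manifest layout, the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so that large bitstreams need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_bitstreams(manifest):
    """Compare files on disk with their recorded size and checksum.

    `manifest` maps a file path to an (expected_size, expected_checksum) pair
    (a hypothetical layout). Returns a dict of path -> problem description,
    which is empty when every bitstream is present and unchanged.
    """
    problems = {}
    for name, (expected_size, expected_sum) in manifest.items():
        path = Path(name)
        if not path.is_file():
            problems[name] = "missing"
        elif path.stat().st_size != expected_size:
            problems[name] = "size mismatch"
        elif checksum(path) != expected_sum:
            problems[name] = "checksum mismatch"
    return problems
```

Run periodically, a check of this kind covers the files themselves; it does nothing for the descriptive data held in the relational database, which is the gap the proposal identifies.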

16 Bantin, 41-42.
17 DSpace Wiki, “DSpace 2.0 Design Proposal,” http://wiki.dspace.org/static_files/1/16/Ds2arch.doc (accessed April 2009).

Does DSpace ensure that records are usable? Again, what are the requirements? According to Understanding Data and Information Systems for Recordkeeping, records must be “easily accessed and retrieved in a timely manner” with searching capabilities including full-text or metadata searches across files and categories (the entire classification scheme hierarchy).18 Using MIT’s DSpace instance as an example, users can browse records by collection, issue date, title, author or subject.19 Its search engine seems to be limited to Boolean-type searches like those found in Google or Yahoo. A quick test of the word “Washington” produced 5,571 hits, with examples where the term is part of the title, part of the author's name or mentioned somewhere in the text. Using the term “physics,” I also received departmental and thesis links. Most of the documents were available to review or print, with the majority in Adobe Acrobat (PDF) format. As long as the website is up, access is presumably available, so records have the potential of being available 24 hours a day, seven days a week.

What about security? How does DSpace handle security and control access? As with many systems based on a web interface, the developers gave this a great deal of attention. DSpace was created for the UNIX operating system, and the primary code was written in Java. All additional components are open source as well and common to the web environment (for example, DSpace uses Apache as its web server, one of the most common in the industry). Because DSpace relies on readily available, robust components rather than specialized packages, an organization has countless tools that could be used to protect the DSpace servers. Antivirus, anti-spam, firewall and other software are available in many flavors for a UNIX system.

18 Bantin, 42.
19 MIT Libraries DSpace Website, “Search DSpace,” http://dspace.mit.edu/search (accessed April 2009).

Again, using Understanding Data and Information Systems for Recordkeeping, a primary focus of security is allowing only authorized employees or researchers the ability to create, delete or update records. DSpace should be able to limit access in terms of record manipulation and should never present information that a user does not have the necessary permission to receive.20 DSpace responds to these requirements with its unique workflow system. Departments within an organization can set restrictions based on how DSpace Communities are set up:

In other words, different DSpace Communities, representing different schools, departments, research labs and centers, have very different ideas of how material should be submitted to DSpace, by whom, and with what restrictions. Who is allowed to deposit items? What type of items will they deposit? Who else needs to review, enhance, or approve the submission? To what collections can they deposit material? Who can see the items once deposited? All of these issues are addressed by the Community representatives, working together with the Libraries' DSpace user support staff, and are then modeled in a workflow for each collection to enforce their decisions. The system models "e-people" who have "roles" in the workflow of a particular Community in the context of a given collection. Individuals from the Community are registered with DSpace, then assigned to appropriate roles.21

An example of this is Indiana University’s ScholarWorks repository. According to its website, contributors are limited to IU departments and scholars (faculty, students and other organizations on campus). To gain access, individuals or departments must submit requests to create “communities,” and the repository sets out specific requirements:

To get started, departments should decide on:

• Content they would like to distribute widely and preserve over the long-term,
• A contact person to work with the IUScholarWorks Repository team to set up and run the Community,
• The Community/Collection structure that is best for the department or units content,
• Metadata (descriptive cataloging information) and
• Individuals who will be allowed to submit materials.22

20 Bantin, 45.
21 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).

As a user who is not a member of any IU community, I was able to access ScholarWorks and browse the collection by community, collection, issue date, author, title or subject. I was able to access a wealth of information, but with read access only. I was never given the chance to manipulate the records in any fashion. To do so, I would have had to go through a formal process with the IU staff to get an IUScholarWorks Repository Account.
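The behavior observed here, open read access with deposit restricted to registered members, follows from the model of "e-people" holding "roles" in a collection that was quoted earlier. A minimal Python sketch of such a check is given below. The class name, the role names and the permissions attached to each role are hypothetical illustrations and do not reflect DSpace's actual authorization code.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A collection with per-person roles (hypothetical model)."""
    name: str
    public_read: bool = True
    # Maps an e-person's identifier to the set of roles held in this collection.
    roles: dict = field(default_factory=dict)

    def assign(self, eperson, role):
        self.roles.setdefault(eperson, set()).add(role)


# ASSUMPTION: these role names and permissions are invented for the example.
PERMISSIONS = {
    "submitter": {"read", "submit"},
    "reviewer": {"read", "review"},
    "administrator": {"read", "submit", "review", "delete"},
}


def is_allowed(collection, eperson, action):
    """Return True if the e-person (or an anonymous visitor, None) may act."""
    if action == "read" and collection.public_read:
        return True
    held = collection.roles.get(eperson, set())
    return any(action in PERMISSIONS.get(role, set()) for role in held)


if __name__ == "__main__":
    papers = Collection("Working Papers")
    papers.assign("faculty@example.edu", "submitter")
    print(is_allowed(papers, None, "read"))
    print(is_allowed(papers, None, "submit"))
    print(is_allowed(papers, "faculty@example.edu", "submit"))
```

An anonymous visitor passes the read check but fails every other one, which matches the experience described above of being able to browse but not to alter records.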

Does DSpace provide a means to retain and dispose of records? In terms of preservation, the DSpace creators focused on two main types of digital preservation, called “bit preservation” and “functional preservation”.23 The first means the record is preserved exactly as it was submitted (down to the actual bit count). The second means that the record will be changed as software and technology change in order to ensure its continued accessibility. DSpace currently captures the necessary metadata to support bit preservation (although each repository should also have a solid program of backups and disaster recovery plans in place). Functional preservation depends on each organization's policy. The DSpace creators cannot see the future and predict the countless software updates that may occur. An example is co-creator MIT itself, which plans to provide functional support for well-known, documented standards such as TIFF or XML, but not for rare or complicated formats such as CAD drawings.24 Another example is IU ScholarWorks, which clearly states that it is “not equipped to support the archiving and/or accessibility of dynamic resources like open web sites, interactive applications, files with complex metadata requirements, streaming audio or video, authoring tools, or dynamic learning objects.”25

22 Indiana University ScholarWorks Repository Website, “Getting Started with the IUScholarWorks Repository,” https://scholarworks.iu.edu/docs/repository/gettingstarted.shtml (accessed April 2009).
23 D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).
24 Ibid.

Part of the process of retaining and disposing of records is the ability to make backups, as well as the system's ability to create reports when records are changed or deleted. The DSpace Wiki lists in detail how to restore a system from a full backup, as well as what must be done when restoring to a different platform or operating system version. Since DSpace uses PostgreSQL (an open source database management system), an SQL dump created from a backup can be loaded into a new PostgreSQL instance to get the system back online.26 Since DSpace uses a number of servers for the overall system, it is also advisable for the organization to have sufficient hardware for mirroring (at least a matching server for each one in the system, updated as the main ones are updated). Unfortunately, without access to an administrator account, and with no relevant information found on the DSpace Wiki, there is little to say regarding reporting. Several of the wiki documents mention reports or statistics, so one can only assume the capability exists. In any case, there are statistical packages that can be used with an Apache web server to show when files are added, changed or deleted. This is also true for PostgreSQL environments. If DSpace does not have reporting built in, other products can be used to enhance the process.

25 Indiana University ScholarWorks Repository Website, “IUScholarWorks Repository FAQ for Submitters,” https://scholarworks.iu.edu/docs/repository/faq.shtml (accessed April 2009).
26 DSpace Wiki, “Backup and Restore,” http://wiki.dspace.org/index.php/BackupRestore (accessed April 2009).

Does DSpace meet the requirements for a true recordkeeping system? It has the capability to capture records in a variety of digital formats. Recent updates have also added versioning. Its workflow system provides a means for classification schemes and ensures that only the right people have access to change records (by forming communities and creating user accounts). DSpace uses an accepted metadata standard, and the information is incorporated in the record (and is searchable). Security is a strength, as is the use of accepted software standards like UNIX, Java and Apache (for which several strong security packages exist). Reporting may or may not be a problem, but additional packages can be purchased to expand this capability. The one glaring weakness seems to be audit control, with the online documentation clearly stating that records in the database are not easily audited. As mentioned before, a number of organizations are trying to find a third-party software solution to resolve this, with no clear winners or suggestions highlighted on the wiki.

How can DSpace be converted into a true recordkeeping system? What steps must take place? What types of functionality must be added? For the institution willing to take these steps, it would seem logical to investigate third-party solutions (potentially open source ones) for the problems with audit control, reporting and, to a lesser degree, security. For example, PostgreSQL has a report generator through its open source graphical user interface pgaccess.27 It is free, with support from several PostgreSQL listservs, and could possibly be adapted to produce some of the reports an organization needs. The options for security with UNIX servers and the Apache web server are so numerous that an organization should get its IT department involved to wade through the many possibilities.

27 PostgreSQL Website, “User Client Questions,” http://www.postgresql.org/files/documentation/books/aw_pgsql/node194.html (accessed May 2009).

Another potential problem could be metadata. As mentioned before, the metadata requirements could be altered to require more than the three mandatory fields of Dublin Core. However, Dublin Core has relatively few fields, most of them concerned with creation. Solid recordkeeping metadata should include fields covering the entire life of the record. In Bantin’s book, there are nine primary categories:

1. Identification or Registration Metadata,
2. Content Metadata,
3. Contextual Metadata,
4. Audit Trail Metadata,
5. Access and Use Metadata,
6. Disposition Metadata,
7. Preservation History Metadata,
8. Structural Metadata and
9. History of Use Metadata.28

The Dublin Core version used by DSpace covers roughly the first and second categories while leaving some significant gaps for the rest. Is it any wonder that most of the standards listed in Understanding Data and Information Systems for Recordkeeping are dramatically larger (such as the European model listed with 109 elements, 79 of them mandatory)?29

28 Bantin, 48.
29 Ibid., 49.

DSpace is a remarkable product and can truly be a viable solution for an institution needing an online repository. It is not a perfect recordkeeping solution and may require additional software to expand its functionality, depending on the institution's needs. However, price always seems to be a concern for universities and other organizations that might need a digital repository. On price, DSpace beats most of its competitors and makes many an archive or library think about implementing it (as seen by the 334 organizations using it already). Open source products are sometimes a concern to install due to the lack of technical support. DSpace seems to have overcome this as well through the online communities that have formed to help one another. Is it perfect? It is not, but there is probably not a “perfect” system out there. DSpace should be considered if your organization has this need.

Bibliography

Bantin, Philip C. Understanding Data and Information Systems for Recordkeeping. New York: Neal-Schuman Publishers, 2008.

D-Lib Magazine (January 2003), “DSpace: An Open Source Dynamic Digital Repository,” http://www.dlib.org/dlib/january03/smith/01smith.html, Accessed April 2009.

DSpace Wiki, “Backup and Restore,” http://wiki.dspace.org/index.php/BackupRestore, Accessed April 2009.

DSpace Wiki, “DSpace 2.0/Comparing Existing Technologies,” http://wiki.dspace.org/index.php/DSpace_2.0/Comparing_Existing_Technologies#Versioning_Content, Accessed April 2009.

DSpace Wiki, “DSpace Instances (as of 01/12/2009),” http://wiki.dspace.org/index.php/DSpaceInstances, Accessed April 2009.

DSpace Wiki, “End User FAQ,” http://wiki.dspace.org/index.php/EndUserFaq#What_kind_of_content_does_DSpace_support.3F, Accessed April 2009.

DSpace Wiki, “What is DSpace,” http://wiki.dspace.org/index.php/EndUserFaq#What_is_DSpace.3F, Accessed April 2009.

Dublin Core Metadata Initiative Website, “DCMI Metadata Terms,” http://dublincore.org/documents/dcmi-terms/, Accessed April 2009.

Indiana University ScholarWorks Repository Website, “Getting Started with the IUScholarWorks Repository,” https://scholarworks.iu.edu/docs/repository/gettingstarted.shtml, Accessed April 2009.

Indiana University ScholarWorks Repository Website, “IUScholarWorks Repository FAQ for Submitters,” https://scholarworks.iu.edu/docs/repository/faq.shtml, Accessed April 2009.

MIT Libraries DSpace Website, “Search DSpace,” http://dspace.mit.edu/search, Accessed April 2009.

PostgreSQL Website, “User Client Questions,” http://www.postgresql.org/files/documentation/books/aw_pgsql/node194.html, Accessed May 2009.
