
2.0 Best Practices for File Naming

A filename provides one form of unique identification for each digital asset that the Library
creates. A good file naming system ensures consistency, prevents file loss through accidental
overwriting, and can facilitate retrieval and processing of materials from creation onwards. File
naming conventions and practices should be determined for each digital project, or content set,
at the beginning of the project when other technical specifications (e.g., file format, resolution,
etc.) are being established. A file naming system for a specific project or content set should
employ a directory structure to help guard against filename collisions across projects.

Filenames, however, are only one form of digital asset identifier. Subsequent to instantiation,
additional identifiers are associated with digital resources. URLs, handles, PURLs, DOIs, and
CONTENTdm identifiers are examples of such additional identifiers. While file naming
conventions ensure unique identification of a digital file within the scope of a particular local
project or content set, other identifiers are needed to ensure unique identification in larger scopes,
such as on the World Wide Web or within a general archive. Non-filename identifiers also are
used to deal with issues of granularity (e.g., assigning an identifier for the entire digitized book
rather than just an individual digitized page in a book) and versioning (e.g., a corrected PDF of a
digitized book), and can be useful in expressing persistent relationships (e.g., the relationship
between a metadata record and a digitized book object, independent of updates made to the
digitized book's component files). These non-filename identifiers are addressed in a separate
document.

Table of contents

2.1 ISO Standard 9660:1999 (Level 2)
2.2 Root identifiers
• Registry of root identifiers
• Guidelines for constructing root identifiers
o Monographs
o Serials
o Newspapers
o Collections (non-CONTENTdm)
2.3 Subsequent directory levels
• Serials
• Monographs
• Collections (non-CONTENTdm)
2.4 Non-image files
2.5 CONTENTdm collections
• Root identifiers
• File name structure
• Collection registry
Appendix: Page Naming Conventions for Monographs and Serials

2.1 ISO Standard 9660:1999 (Level 2)

The Library follows ISO Standard 9660:1999 (Level 2) format, which defines a file system for
digital media. This standard stipulates certain restrictions on file names:

• Limit total path length to 207 characters.
• Characters used in file names are restricted to lowercase a-z, 0-9, underscore ( _ ), and
period (.).
• File names shall not include spaces; should not begin or end with a period (.); and should
contain no more than one period (.).
• Limit directory hierarchy to eight levels. Directory names should not use periods.

Names for directories, folders, and files will be no longer than 21 characters (not including 3
letter extensions) and will be unique within the context of the project.
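
The restrictions above can be checked mechanically before files are delivered. The following is a minimal sketch in Python, not an official Library tool: the function name check_path is hypothetical, paths are assumed to be relative and "/"-separated, and counting every directory in the path (including the root identifier) toward the eight-level limit is an assumption.

```python
import re

MAX_PATH_LENGTH = 207      # total path length limit
MAX_DIRECTORY_LEVELS = 8   # directory hierarchy limit
MAX_NAME_LENGTH = 21       # Library limit, not counting the extension

# Lowercase letters, digits, and underscores, with at most one period
# separating a non-empty base name from a non-empty extension.
FILE_NAME_RE = re.compile(r"[a-z0-9_]+(\.[a-z0-9]+)?")
# Directory names follow the same character rules but may not contain periods.
DIRECTORY_NAME_RE = re.compile(r"[a-z0-9_]+")


def check_path(path: str) -> list[str]:
    """Return a list of rule violations for a '/'-separated relative path."""
    problems = []
    if len(path) > MAX_PATH_LENGTH:
        problems.append("path longer than 207 characters")

    *directories, file_name = path.split("/")
    if len(directories) > MAX_DIRECTORY_LEVELS:
        problems.append("more than eight directory levels")

    for name in directories:
        if not DIRECTORY_NAME_RE.fullmatch(name):
            problems.append(f"bad directory name: {name!r}")
        elif len(name) > MAX_NAME_LENGTH:
            problems.append(f"directory name too long: {name!r}")

    if not FILE_NAME_RE.fullmatch(file_name):
        problems.append(f"bad file name: {file_name!r}")
    elif len(file_name.split(".")[0]) > MAX_NAME_LENGTH:
        problems.append(f"file name too long: {file_name!r}")
    return problems


if __name__ == "__main__":
    # A conforming path yields an empty list of problems.
    print(check_path("librarytrends/v00001/i00002/000002000001.jp2"))
    # Spaces and uppercase letters are reported.
    print(check_path("Library Trends/page 1.JPG"))
```

The sketch does not check uniqueness within a project, which requires comparing names against each other rather than inspecting one path at a time.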

2.2 Root identifiers (top level of directory structure)

Each content set, be it a full-text book, a collection of related documents, a group of
photographs, etc., should be assigned a root identifier that is unique to that particular set of
content; the top level file directory for the content set should be named with this unique
identifier. Root identifiers should be no longer than 16 characters and serve as the basis for
naming the image files created from the content set. Uniqueness of root identifiers should be
verified by checking the root identifier against a Library-wide Registry of Root Identifiers.

• The Registry of Root Identifiers should be a centrally managed resource containing the
following information about the identifier:
a. Date the root identifier was assigned
b. Name of person assigning the identifier
c. Where the content resides
d. Finding aid that describes the content
• Guidelines for constructing root identifiers:
a. Monographs: The root identifier is built from five parts, 16 characters in all: the first
four letters of the author’s last name; the first two letters of the author’s first name; a
four-digit, zero-padded incrementing number; the first three letters of the first word of the
title (articles omitted); and the first three letters of the second word of the title. For
example, the root identifier for the book “Collected Works of Abraham Lincoln” would be
“lincab0001colwor”.
b. Serials: The root identifier will consist of the first 16 letters of the journal name
(excluding initial articles). If the journal name has fewer than 16 letters, the root
identifier will be correspondingly shorter. An example of a root identifier for a journal is
“librarytrends”.
c. Newspapers: File naming conventions for newspapers are specific to this format and to the
presentation software used, e.g., Olive Active Paper. These conventions are outlined in the
best practices document on newspaper digitization.
d. Collections: Collections of letters, photographs, maps, and other non-book or non-serial
content may already have a naming and numbering scheme that could be used as the basis for
creating a unique identifier for the content set. An effort should be made to parallel the
file naming convention described above for monographs. For example, the RBML has collections
of letters from Carl Sandburg to Lillian Sandburg and to Vachel Lindsay. The root identifier
for the Carl and Lillian collection might be something like “sandca0001sanlil”, and the one
for the Carl and Vachel collection might be something like “sandca0002linvac”.
e. Stop words: Certain words occur with such frequency that they should be avoided when
constructing root identifiers. Examples might be “association,” “journal,” etc.
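
The construction rules for monographs and serials can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the list of articles is an assumption (the guidelines do not enumerate them), and the sketch neither handles stop words nor specifies what to do when a name has fewer letters than the rule calls for. The incrementing number would in practice come from the Registry of Root Identifiers.

```python
import re

# Assumed list of English articles; the guidelines do not enumerate them.
ARTICLES = {"a", "an", "the"}


def _letters(text: str) -> str:
    """Lowercase the text and keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def monograph_root_id(last_name: str, first_name: str,
                      sequence: int, title: str) -> str:
    """Build a root identifier: 4 + 2 letters of the author's name,
    a 4-digit number, then 3 + 3 letters from the first two title words."""
    words = [_letters(word) for word in title.split()]
    words = [word for word in words if word and word not in ARTICLES]
    first_word = words[0] if words else ""
    second_word = words[1] if len(words) > 1 else ""
    return (_letters(last_name)[:4]
            + _letters(first_name)[:2]
            + f"{sequence:04d}"
            + first_word[:3]
            + second_word[:3])


def serial_root_id(journal_name: str) -> str:
    """Build a root identifier from the first 16 letters of the journal
    name, skipping an initial article."""
    words = journal_name.split()
    if words and _letters(words[0]) in ARTICLES:
        words = words[1:]
    return _letters("".join(words))[:16]


if __name__ == "__main__":
    print(monograph_root_id("Lincoln", "Abraham", 1,
                            "Collected Works of Abraham Lincoln"))
    # lincab0001colwor
    print(serial_root_id("Library Trends"))
    # librarytrends
```

A generated identifier is only a candidate; it still has to be checked against the registry before it is assigned.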

2.3 Subsequent directory levels

Under the root identifier level, subsequent directory levels should follow consistent patterns as
described below:
• Serials: The logical order of directories for image files from serial publications will be
rootidentifier/volume/issue/page, as illustrated in the example below. Volume and issue
numbers should be preceded by the lowercase letters “v” and “i”, respectively, and
zero-padded to five digits (e.g., v00001, i00002). Page image file names will be divided
into two logical components. The first six characters will contain a zero-padded,
sequentially incremented image sequence number. The final six characters will contain a
representation of the page number as printed on the page. See the Appendix for more
specifics on dealing with unnumbered pages, prefatory matter numbered with Roman numerals,
and other deviations.

Root Identifier     Volume    Issue #    Page #s
librarytrends       v00001    i00002     00000100000a.jp2
                                         000002000001.jp2

• Monographs: The logical order of directories for image files for monographs will be
rootidentifier/volume/placeholder_for_issue/page. The two directory levels under the
root identifier will usually serve only as dummy directory levels so that the basic
directory structure for monographs and serials is the same. However, multi-volume
monographic sets will use the volume level directory (e.g., v00002). Issue numbers for
monographs will always be “i00000”.

Root Identifier     Volume    Issue #    Page #s
lincab0001colwor    v00000    i00000     00000100000a.jp2
                                         000002000001.jp2

• Collections: Generally, the directory levels below the root identifier level should make
sense in the context of the project. For instance, there may be just one additional level
containing the individual content files; or a collection of letters might be best described
by using the volume level for the year in which the letter(s) were written and the issue
level for the month. Subsequent directory levels should employ the zero-padding convention
described above. This will ensure that the documents sort as expected under the
character-by-character sort order of ASCII-based computer systems. Make sure you start
with enough zeros to accommodate the maximum number of items in the collection.
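
The directory patterns above, and the reason for zero padding, can be illustrated with a short Python sketch. The helper names level_name and image_path are hypothetical, and the five-digit width is taken from the examples in this section (v00001, i00002).

```python
def level_name(prefix: str, number: int, width: int = 5) -> str:
    """Return a zero-padded directory name such as 'v00001' or 'i00002'."""
    return f"{prefix}{number:0{width}d}"


def image_path(root_id: str, volume: int, issue: int, file_name: str) -> str:
    """Join root identifier, volume, issue, and page file name into a path."""
    return "/".join([root_id,
                     level_name("v", volume),
                     level_name("i", issue),
                     file_name])


if __name__ == "__main__":
    # Serial: volume 1, issue 2.
    print(image_path("librarytrends", 1, 2, "000002000001.jp2"))
    # librarytrends/v00001/i00002/000002000001.jp2

    # Monograph: volume and issue are placeholder levels.
    print(image_path("lincab0001colwor", 0, 0, "00000100000a.jp2"))
    # lincab0001colwor/v00000/i00000/00000100000a.jp2

    # Without padding, a plain string sort puts volume 10 before volume 2.
    print(sorted(["v1", "v2", "v10"]))
    # ['v1', 'v10', 'v2']
    # With padding, string order matches numeric order.
    print(sorted(level_name("v", n) for n in (10, 2, 1)))
    # ['v00001', 'v00002', 'v00010']
```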

2.4 File naming conventions for non-image files

In addition to image files, other files are often created in a digitization project. Among these
are OCR, PDF, XML, and encoding files. The following conventions should be followed for these
files:

• The file naming convention for OCR files will be rootidentifier_ocr.txt.
• The file naming convention for PDF files will be rootidentifier.pdf.
• The file naming convention for XML files will be rootidentifier.xml.
• The file naming convention for TEI-encoded files will be rootidentifier_tei.xml.
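
As a small illustration, the four conventions can be derived from a root identifier in one place. The function name non_image_file_names and the dictionary keys are hypothetical.

```python
def non_image_file_names(root_id: str) -> dict[str, str]:
    """Return the non-image file names that accompany a root identifier."""
    return {
        "ocr": f"{root_id}_ocr.txt",
        "pdf": f"{root_id}.pdf",
        "xml": f"{root_id}.xml",
        "tei": f"{root_id}_tei.xml",
    }


if __name__ == "__main__":
    for kind, name in non_image_file_names("lincab0001colwor").items():
        print(kind, name)
```

Note that a 16-character root identifier plus a four-character suffix such as _ocr gives a 20-character base name, which stays within the 21-character limit stated in section 2.1.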

2.5 CONTENTdm collections

For image collections going into CONTENTdm, a file name should be created for each digitized
image, both access and master. The image file name consists of a three-letter root identifier
followed by a seven-character combination of digits and, when the object is a compound object
such as a postcard or pamphlet, a trailing letter. The file name, with its proper file name
extension, should be included in the metadata of the item. (Please see the minimum metadata
element requirements for CONTENTdm collections.)

• Root identifiers

The root identifier works as a collection identifier and groups all associated items into
the collection to which they belong. Since most of the collections reside in CONTENTdm,
we recommend basing the root identifier on the collection alias. A unique alias is created
for each collection when it is added to CONTENTdm. The alias can be more than three letters
long; for the root identifier, please use the first three letters of the collection alias.
(For this collection, the root identifier should be ‘emb’.)

• File name structure

A seven-character combination of digits and, for compound objects, a trailing letter will
follow the root identifier. For a compound object (e.g., a postcard or pamphlet), every image
of the item shares the same number, and each image is given a different letter. This
structure can be seen in the following examples:

1. Simple object: abc1000000

2. Postcard:
o Front – abc200000a
o Back – abc200000b

3. Pamphlet 1:
o Cover – abc300000a
o Page 1 – abc300000b
o Page 2 – abc300000c

4. Pamphlet 2:
o Cover – abc400000a
o Page 1 – abc400000b
o Page 2 – abc400000c
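
The examples above can be reproduced with a short Python sketch. It assumes, based on those examples, that a simple object takes seven digits while each image of a compound object takes six digits plus one letter; the function name contentdm_file_name and its parameters are hypothetical.

```python
import string
from typing import Optional


def contentdm_file_name(root_id: str, number: int,
                        part: Optional[int] = None) -> str:
    """Build a file name (without extension) for a CONTENTdm image.

    part is None for a simple object; for a compound object it is the
    zero-based position of the image within the object (0 -> 'a').
    """
    if len(root_id) != 3:
        raise ValueError("root identifier must be three letters")
    if part is None:
        # Simple object: seven digits.
        return f"{root_id}{number:07d}"
    if not 0 <= part < 26:
        raise ValueError("only letters a-z are available for parts")
    # Compound object: six digits plus one letter.
    return f"{root_id}{number:06d}{string.ascii_lowercase[part]}"


if __name__ == "__main__":
    print(contentdm_file_name("abc", 1000000))    # abc1000000
    print(contentdm_file_name("abc", 200000, 0))  # abc200000a (front)
    print(contentdm_file_name("abc", 200000, 1))  # abc200000b (back)
```

How to name a compound object with more than 26 images is not addressed by the examples, so the sketch simply rejects that case.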

• Collection registry

To keep each root identifier unique, a formal registry of root identifiers and collections is
needed. The registry should include the following information for administrative purposes.
These elements are derived from the Dublin Core Collection Description Application Profile
(http://dublincore.org/groups/collections/collection-application-profile/2006-08-24/)
and the Illinois Harvest Collection Description Application Profile
(\\libgrtyr\harvests\IllinoisHarvest\projectManagement\Illinois Harvest\Collection-Level
Metadata).

The location and management of the registry should be discussed further.


Element                 Label                    Definition

dc:identifier           Root identifier          The unique root identifier of the collection.
dc:title                Collection title         Title of the collection.
dc:creator              Collection coordinator   Collection coordinator.
vcard:UID               Email                    Contact information of the collection
                                                 coordinator.
dc:description          Collection description   Collection description that could include
                                                 collection development policy, uniqueness,
                                                 and other relevant information about the
                                                 collection.
dc:source               Physical collection      Location of the physical collection.
dc:date                 Date                     Date when the collection was created.
dct:extent              Size                     Size of the collection, usually the number of
                                                 items in the collection.
dc:rights               Rights                   Rights statement of the collection.
dcterms:accrualMethod   Completeness             Indicates whether the collection is complete
                                                 or not.
dc:contributor          Contributor              People involved in the collection creation.

Add the CONTENTdm ID.

APPENDIX: Page Naming Conventions for Monographs and Serials

Page image file names will be divided into two logical components. The first six characters will
contain a zero-padded, sequentially incremented image sequence number. The final six
characters will contain a representation of the page number as printed on the page, formulated
according to the following rules:

a) Every image should have designated a logical page number or appropriate tag to
accompany it. This page number or tag will be used in the image file header and, if
necessary to accommodate ISO 9660 file name restrictions, in the image file name.

b) For the purposes of these instructions, the word “pagination” will refer to the logical
sequential pagination of a series of pages. For example, a page without a printed number
on it which is located between pages imprinted 2 and 4 can be assumed to be page 3.
Similarly, four pages without page numbers printed on them followed by pages 5, 6, 7,
etc., can be assumed to be pages 1, 2, 3 and 4.

c) All pages that are included within the logical pagination should be designated with their
actual page numbers.

d) The first page of every volume will be the production note; the second page will be the
outside front cover; and the third page will be the inside front cover (which may or may
not contain a bookplate). These will always be designated 00000a, 00000b, and 00000c
respectively. If there are more pages before the logical pagination begins, they will be
designated 00000d, 00000e, 00000f, etc.

e) Pagination that appears as Roman numerals in the original will be translated into Arabic
numbers and prefixed with a lowercase “r” for file names (e.g., page vii becomes page
r00007, etc.). In the absence of printed page numbers, it is to be assumed that Roman
numerals continue until logical Arabic pagination commences. In the situation where
sequential pagination continues through a change from Roman numerals to Arabic
numerals, the Arabic numerals will be assumed to start at the change in type of document
content.

f) When there are pages in the material which are not included in the sequential pagination
(commonly occurring with plates) the pages will be designated by the number of the
preceding paginated page appended with a trailing letter which will increase sequentially
for each page (e.g., 000031, 000032, 00032a, 00032b, 000033, 000034).

g) When pages are numbered incorrectly in the original material, the correct logical
pagination should be used in the image file header and file name, unless otherwise
specified in the pagination instructions which will accompany each volume.

h) When pagination restarts result in duplicate page numbers in the same volume, the
longest section will have its pages recorded unamended. Each shorter section will be
recorded with a letter preceding the number to differentiate its pages from similarly
numbered pages in the same volume (e.g., a00001, a00002, a00003, a00004; b00001, b00002,
b00003, b00004). Note: Letters that will not be used in this situation are i, l, o, and r. This
procedure does not need to be used for a Roman numeral section if it is the only one in
the volume. If there is more than one, the shorter section(s) will be differentiated in the
same manner as Arabic numbers (e.g., ra0001, ra0002, ra0003, ra0004; rb0001, rb0002,
rb0003, rb0004).

i) Page numbers which actually contain letter prefixes will be recorded according to the
same rules as standard Arabic numbered pages, except that punctuation between the
prefix and the number should be dropped. Thus a page from Appendix A which is labeled
A-9 should be recorded as 0000a9.

j) Page numbers containing characters that are not permitted in ISO 9660 file names should
be recorded with an underscore character in place of the illegal character. For example,
page 22.6 should be recorded as 22_6. In situations such as this, the unmodified page
number should be recorded in the image file header.

k) Adornments around page numbers, such as if the page number is both preceded and
followed by a dash, asterisk, parenthesis, square bracket, etc., should be ignored (not
entered).

l) Any pagination situation that falls outside those described above will be noted in the
pagination instructions that will accompany the volume. These instructions will include
how the pages in question should be designated in the image file header and file name. If
the vendor discovers a situation that is not described above and not commented on in the
worksheet, they will contact UIUC for instructions about how to proceed.
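
The most common cases in rules c) through f) can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and it covers plain Arabic pages, Roman-numeral pages, and unpaginated pages (front matter and plates), not the special cases in rules g) through l).

```python
import string

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50,
                "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'vii' or 'xiv' to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g., iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def arabic_label(page: int) -> str:
    """Rule c): a logically paginated page, e.g. 31 -> '000031'."""
    return f"{page:06d}"


def roman_label(numeral: str) -> str:
    """Rule e): a Roman-numeral page, e.g. 'vii' -> 'r00007'."""
    return f"r{roman_to_int(numeral):05d}"


def unpaginated_label(preceding_page: int, position: int) -> str:
    """Rules d) and f): a page outside the pagination, designated by the
    preceding page number plus a letter (position 0 -> 'a')."""
    return f"{preceding_page:05d}{string.ascii_lowercase[position]}"


def page_file_name(sequence: int, label: str, extension: str = "jp2") -> str:
    """Combine the image sequence number and the six-character page label."""
    if len(label) != 6:
        raise ValueError("page label must be six characters")
    return f"{sequence:06d}{label}.{extension}"


if __name__ == "__main__":
    # First image of a volume: the production note, designated 00000a.
    print(page_file_name(1, unpaginated_label(0, 0)))  # 00000100000a.jp2
    # Second image, carrying printed page number 1.
    print(page_file_name(2, arabic_label(1)))          # 000002000001.jp2
    print(roman_label("vii"))                          # r00007
    # First plate following page 32.
    print(unpaginated_label(32, 0))                    # 00032a
```

Treating front matter as unpaginated pages following "page 0" is a convenience of this sketch; it happens to produce the designations 00000a, 00000b, and so on required by rule d).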