
DIGITISATION PROCESS

Introduction
A digital library may contain materials that are born digital, such as
e-journals and e-books, or materials that were originally produced in
another form but subsequently digitized.

The process of digitising materials involves different steps depending
upon the material, technology and requirements.
What is Digitization?

It is the process of translating a piece of information, such as a
book, journal article, sound recording, picture, audio tape or video
recording, into bits. Bits are the fundamental units of information in
a computer system.
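As a minimal sketch of what "translating into bits" means for text, the Python snippet below encodes a short string as bytes and shows the bit pattern of each byte. The function name `to_bits` and the choice of UTF-8 encoding are assumptions made for illustration only.

```python
def to_bits(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))

# "H" is byte 72 and "i" is byte 105.
print(to_bits("Hi"))  # 01001000 01101001
```

Images, sound and video are reduced to bits in the same spirit, only with different encodings.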
ANALOG VS. DIGITAL
• Analog is continuous, digital is discrete in
nature
• Analog technology is around 125 years old
• Digital technology has been around for about 40
– 50 years
• Digital is preferred over analog because of its
efficiency and reliability; examples: digital audio
and video CDs
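The continuous versus discrete distinction can be sketched in Python by sampling a sine wave (standing in for an analog signal) at fixed intervals and rounding each sample to one of a fixed number of levels. The function name, the 1 Hz tone, the sample rate and the 8-bit level count are all illustrative assumptions.

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, n_samples, bits=8):
    """Sample a sine wave and quantize each sample to an integer level."""
    levels = 2 ** bits
    out = []
    for n in range(n_samples):
        t = n / sample_rate_hz                          # discrete points in time
        value = math.sin(2 * math.pi * freq_hz * t)     # continuous value in [-1, 1]
        level = round((value + 1) / 2 * (levels - 1))   # discrete level in [0, levels - 1]
        out.append(level)
    return out

samples = sample_and_quantize(freq_hz=1, sample_rate_hz=8, n_samples=8)
print(samples)
```

Each entry in `samples` is a whole number between 0 and 255, so the wave is now a finite list of discrete values rather than a continuous curve.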
DIGITAL VS. ANALOG
INFORMATION
• Analog: audio and video tapes
• Digital : CD–ROM, Digital video
• Digital information: multimedia in nature
• Text, graphics, images, sound, video, and computer
animation
DIGITISATION OF PRINT BASED
DOCUMENTS
• The first step is to capture the documents available in print or
analogue form for conversion into digital form
• Documents can be converted from analog to digital in two ways:
• Using a scanner
• Using a camera
Digitization
• The process of digitisation involves capturing the physical or analogue
object through devices like scanners, digital camera, recorder etc.,
converting them into numerical values in bits and bytes which enables
them to be read electronically.
• Digitisation of text is possible either through
• text transcription (keying in the text using a keyboard, or using
voice recognition software), or
• the optical character recognition (OCR) method
Capturing Print Based Document
• For converting hard copies into machine readable form there are
three options available for a library:
• Keying in the text
• Scanning and capturing them as image files
• OCR the files
• Fresh keying in costs ten times more than scanning and saving as
image files. However, if you are converting them with OCR, then
some costs will be involved in error correction and editing.
• Scanners come in three broad price ranges:
i) low cost flatbed scanners or hand held devices,
ii) low end sheet feeder type,
iii) high end professional or book scanner
DIGITISATION: PROCESS
• Scanning
• Storing
• Indexing
• Retrieving
Scanning: Steps in Flatbed Scanner
• Step 1 Place picture on the scanner’s glass
• Step 2 Start scanner software
• Step 3 Select the area to be scanned
• Step 4 Choose the image type
• Step 5 Sharpen the image
• Step 6 Set the image size
• Step 7 Save the scanned image in the desired format
The steps for scanning a document
• Step 1: Place the document on the scanner bed
(here, a Konica Minolta PS 7000 book scanner)
• Step 2: Open Adobe Acrobat
• Click on File >> Import >> Scan...
• Fill in the information for device, format and destination in the
dialogue box that appears
DIGITISATION: PROCESS

• To scan the documents, click on the Scan All option from the
Minolta PS7000 Scanner Setup Dialog Box that appears.
• Click on the Done option from the Minolta PS7000 Scanner Setup
Dialog Box, which then shows the scanned file.
Cont..
• Save the file as a PDF, giving it the .pdf extension
• To change the resolution, click on Scan Setting >> Resolution (DPI)
from the Minolta PS7000 Scanner Setup Dialog Box.
• To change the scan area, click on Scan Setting >> Scan Area.
• You can also change the Brightness and Contrast of the scanned file
• To change the Image Type, click on Scan Setting
• Scanned pages can be saved as individual files or as a complete
document by appending them to the current document while scanning.
Storing
• Hard Disks (Internal and External)
• Snap Server (a network attached storage
computer appliance)
• CDs
• DVDs
Indexing

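• Each index record holds the Author, Title and Key Words of a
document, together with a key to its image (e.g. Image22)
• A separate table maps each image key to the image location
(e.g. Image22 → /image/new/smith.pdf)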
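A minimal sketch of this indexing idea in Python: a descriptive record carries a key to the image, and a second table maps that key to the file location. The key `Image22` and the path `/image/new/smith.pdf` come from the slide; the author, title and keyword values, and the function name `find_locations`, are hypothetical.

```python
# Descriptive index: one record per document. Field values are hypothetical,
# except the image key, which is taken from the slide.
index_records = [
    {
        "author": "Smith",
        "title": "Sample title",
        "keywords": ["digitisation", "libraries"],
        "image_key": "Image22",
    },
]

# Location table: image key -> file location (pair taken from the slide).
image_locations = {"Image22": "/image/new/smith.pdf"}

def find_locations(keyword):
    """Return the file locations of all records indexed under `keyword`."""
    wanted = keyword.lower()
    return [
        image_locations[rec["image_key"]]
        for rec in index_records
        if wanted in [k.lower() for k in rec["keywords"]]
    ]

print(find_locations("libraries"))  # ['/image/new/smith.pdf']
```

Keeping the location in a separate table means files can be moved without touching the descriptive records.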
Retrieving
• Simple
• Advanced
• Keyword and Phrase Search
• Keyword and Subject Search
• Boolean Search
• Truncation Search
• Proximity Search
• Range Search
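Three of these search types can be sketched over a tiny in-memory collection. This is an illustrative Python sketch, not how any particular retrieval system works; the sample documents and all function names are assumptions.

```python
import re

# Hypothetical sample collection: document id -> text.
docs = {
    1: "digital library standards for image collections",
    2: "scanning printed books in the library",
    3: "digitisation of rare manuscripts",
}

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def boolean_and(a, b):
    """Boolean search: documents containing both terms."""
    return sorted(d for d, t in docs.items() if a in tokens(t) and b in tokens(t))

def truncation(stem):
    """Truncation search: documents with any word starting with `stem` (e.g. digit*)."""
    return sorted(d for d, t in docs.items() if any(w.startswith(stem) for w in tokens(t)))

def proximity(a, b, max_gap):
    """Proximity search: documents where `a` and `b` occur within `max_gap` words."""
    hits = []
    for d, t in docs.items():
        words = tokens(t)
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if any(abs(i - j) <= max_gap for i in pos_a for j in pos_b):
            hits.append(d)
    return sorted(hits)

print(boolean_and("digital", "library"))   # [1]
print(truncation("digit"))                 # [1, 3]
print(proximity("books", "library", 3))    # [2]
```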
Digitisation: Input and Output Options

• Scanned as Image Only
• OCR and Retaining Page Layout
• Retaining Page Layout using Acrobat Capture
• Re-keying the Data
Scanned as Image
OCR and Retaining Page Layout
• OCR is a tool/software used for converting scanned
text pages into computer readable text files
• The resulting text is searchable, editable, etc.
• The OCR software has options for either storing the
text and graphics in their original layout or converting
them into ASCII or word processing format.
• OmniPage Pro and ABBYY FineReader are two
commonly used OCR packages.
• After OCR, you can export the resulting text to a variety of
word-processing, page layout, and spreadsheet applications. The
software also provides the option to save it directly as a PDF file.
Functioning of OCR
Retaining Page Layout using Acrobat
Capture

• Image Only
• Image + Text
• PDF Normal
Technology of Digitisation
• Bit Depth or Dynamic Range
• Resolution
• Threshold
• Image Enhancement
Bit Depth
• A bit is the smallest unit of data in digital imaging.
• Each pixel in a digital image is represented by a number of bits.
• More bits translate into more tones, grayscale and color, represented
per pixel in a digital image.
• The number of pixels represents the two-dimensional height and
width of an image.
• The number of bits represents a third dimension describing how light,
dark or colorful each pixel is.
• This dimensional aspect results in the term Bit Depth.
Setting Bit Depth
Terminologies
• Digital images are produced in bitonal, grayscale or color formats.
• The difference between the formats is determined by the number and
the type of information each bit records per pixel.
• Every bit represents two options; 1 or 0, on or off.
• A bitonal image
• is represented by pixels composed of 1 bit each, in either the 1 or the 0 state
• is usually described as a foreground color and a background color (normally
black and white).
Terminologies
• A grayscale image
• is represented by multiple bits of tonal information, usually between 2 and 8 (or
more) bits per pixel.
• Most of the digital world works with 8 bit images. An 8 bit grayscale image has
256 tonal options (2 to the 8th power)
• 0 (black) to 255 (white)
• Color images
• are generally composed of bit depths ranging from 8 to 24 bits per pixel or higher.
• The widely used digital color standard, RGB (red, green, and blue), applies three
8 bit or three 16 bit grayscale channels.
• When photographers refer to an 8 bit color image, they usually mean a 24 bit
image because of RGB's three separate 8 bit channels.
Bits Used for Representing Shades in
Colour and Gray-scale Scanning

No. of Bits   No. of Bits/Shades   No. of Shades   No. of Shades/Pixel
2             2                    2^2 = 4         4^3 = 64
4             3                    2^3 = 8         8^3 = 512
8             4                    2^4 = 16        16^3 = 4096
16            5                    2^5 = 32        32^3 = 32768
32            6                    2^6 = 64        64^3 = 262144
64            7                    2^7 = 128       128^3 = 2097152
128           8                    2^8 = 256       256^3 = 16777216
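The last two columns of the table follow from two small formulas: a channel with n bits has 2^n shades, and a pixel built from three such channels (as in RGB) has (2^n)^3 possible colours. A minimal Python sketch, with function names chosen for illustration:

```python
def shades_per_channel(bits: int) -> int:
    """Number of distinct shades one channel can hold with `bits` bits."""
    return 2 ** bits

def shades_per_pixel(bits: int, channels: int = 3) -> int:
    """Number of distinct colours for a pixel made of `channels` channels."""
    return shades_per_channel(bits) ** channels

rows = [(b, shades_per_channel(b), shades_per_pixel(b)) for b in range(2, 9)]
for bits, shades, colours in rows:
    print(bits, shades, colours)
```

For 8 bits per channel this gives 256 shades and 16777216 colours, matching the bottom row of the table.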
Resolution
• Number of pixels (picture elements) in a given area
• Measured in terms of dots per inch (dpi)
• The higher the dpi set on the scanner, the better the
resolution and quality of image and larger the image
file
• Text images = 300 dpi; preservation projects = 600 dpi
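The effect of dpi on file size can be estimated with simple arithmetic. The sketch below assumes an 8.5 x 11 inch page and an uncompressed image; the page size and the function name `scan_size` are assumptions, not values from the slides.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, total_bits // 8

# Bitonal (1 bit per pixel) page at 300 dpi and at 600 dpi.
print(scan_size(8.5, 11, 300, 1))   # (2550, 3300, 1051875)
print(scan_size(8.5, 11, 600, 1))   # (5100, 6600, 4207500)
```

Doubling the dpi quadruples the pixel count, so the 600 dpi scan is four times the size of the 300 dpi scan.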
Setting-up Resolution Manually
Threshold (0-255)
• Thresholding is the simplest method of image segmentation. From a
grayscale image, thresholding can be used to create
binary images
• Individual pixels in an image are marked as “object”
pixels if their value is greater than some threshold
value (assuming an object to be brighter than the
background) and as “background” pixels otherwise.
Cont…

• In a binary image, pixels are either pure black or pure white, with no
gray values in between. Only two values (0 or 1) are possible for a
pixel, which is why such images are called binary images.
• When a grayscale image is converted into a binary image, a threshold
is used. If grayscale values range from 0 (pure black) to 255 (pure
white), values greater than the threshold are converted into 1 (white)
and all other values are converted into 0 (black).
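The conversion rule can be sketched in a few lines of Python. The image is represented here as a plain list of rows of grayscale values; the function name `to_binary` and the sample pixel values are illustrative assumptions.

```python
def to_binary(gray_rows, threshold=128):
    """Map each grayscale value (0-255) to 1 (white) if above the threshold, else 0 (black)."""
    return [[1 if px > threshold else 0 for px in row] for row in gray_rows]

gray = [
    [0, 100, 128],
    [129, 200, 255],
]

print(to_binary(gray, threshold=128))  # [[0, 0, 0], [1, 1, 1]]
print(to_binary(gray, threshold=85))   # [[0, 1, 1], [1, 1, 1]]
```

Lowering the threshold turns more mid-gray pixels white, which is why the setting matters for faint or stained pages.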
Threshold Setting in Bitonal
Scanning
(Figure: sample scans at threshold settings of 128 and 85)
Image Enhancement
• Used to improve scanned images, at a cost to image
authenticity
• Techniques include filters, tonal reproduction, curves and colour
management, touch-up, cropping, image sharpening,
contrast adjustment, transparent background, etc.
Sharpening Image
COMPRESSION

• Image compression is the process of reducing the size of an image by
abbreviating repetitive information, such as one or more rows of
white bits, to a single code
• Enables economical storage
• Eases processing and transmission over a network
• A page of text scanned at 300 dpi is about 1 MB in size, whereas the
same page as a text file is about 2 – 3 KB
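The idea of abbreviating a run of identical bits to a single code can be sketched with simple run-length encoding. This is only an illustration of the principle; real schemes such as Group 4 use more elaborate coding, and the function names here are assumptions.

```python
def rle_encode(bits):
    """Encode a sequence of 0/1 pixels as [value, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([b, 1])       # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original pixel sequence."""
    return [value for value, length in runs for _ in range(length)]

row = [1] * 12 + [0] * 3 + [1] * 5    # 20 pixels: mostly white (1) with a short black run
encoded = rle_encode(row)
print(encoded)                        # [[1, 12], [0, 3], [1, 5]]
```

Twenty pixel values shrink to three pairs, and decoding the pairs reproduces the row exactly.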
Compression: Types
• Lossless – No information is “lost” or “sacrificed”
in the process of compression
• Lossy – discards or minimises details that are least
significant or that have no appreciable
effect on the quality of the image.
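The difference can be demonstrated with Python's standard `zlib` module. The lossless path restores the data exactly; the "lossy" path here simply drops the four least significant bits of each value before compressing, which is a crude stand-in chosen for illustration and not how JPEG or any real lossy codec works. The sample pixel values are made up.

```python
import zlib

# Hypothetical grayscale pixel values (one byte each).
pixels = bytes([10, 12, 11, 200, 201, 199] * 50)

# Lossless: decompression gives back exactly the original data.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)

# Lossy (illustrative only): discard fine tonal detail, then compress.
coarse = bytes((p >> 4) << 4 for p in pixels)
packed_lossy = zlib.compress(coarse)

print(restored == pixels)                        # True
print(zlib.decompress(packed_lossy) == pixels)   # False
print(len(pixels), len(packed), len(packed_lossy))
```

Once the fine detail is discarded it cannot be recovered from the compressed file, which is the trade-off lossy schemes accept for smaller files.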
Compression Protocols
• TIFF G-4 (Tagged Image File Format) – Standard
compression scheme for black and white or bitonal
images
• An image created as a TIFF and compressed using the
ITU-T Group 4 (G4) compression technique is called a Group-4
TIFF or TIFF G4
• JBIG (Joint Bi-level Image Experts Group) and LZW
(Lempel-Ziv-Welch) are the other protocols
Compression Protocols
• JPEG (Joint Photographic Experts Group) –
compression protocol that works by finding areas of
the image that have the same tone, shade, colour or
other characteristics and representing each such area by a
code
• Compression is achieved at the loss of some data
OCR Technology: Types
• Matrix / Template Matching – Compares each
character with a template of the same character.
Such a system is usually limited to a specific number
of fonts, or must be “taught” to recognise a
particular font.
• Feature Extraction – Can recognise a character from
its structure and shape (angles, points, breaks, etc.)
based on a set of rules. The process claims to
recognise all fonts
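Matrix/template matching can be sketched with tiny 3 x 5 glyph bitmaps: the scanned glyph is compared pixel by pixel with each stored template, and the best-scoring template wins. The templates, the scoring rule and the function names below are illustrative assumptions, far simpler than any production OCR engine.

```python
# Hypothetical 3x5 templates for a single font ("1" = ink, "0" = paper).
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}

def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognise(glyph):
    """Return the character whose template agrees with the glyph on most pixels."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))

noisy_t = ["111", "010", "010", "011", "010"]   # a "T" with one stray pixel
print(recognise(noisy_t))  # T
```

Because the comparison relies on stored bitmaps, a glyph in an unfamiliar font would score poorly against every template, which is the limitation noted above.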
OCR Technology: Types

• Structural analysis: Determines characters on the basis of density
gradations or character darkness
• Neural Networking: A form of artificial intelligence that attempts
to mimic processes of the human mind; used to recognize hand-written
text as well as other traditionally difficult source material
Standards
• Resource Description Formats
• BIBFRAME (Bibliographic Framework Initiative)
Linked data model, vocabulary, and tools for expressing bibliographic data
• EAD (Encoded Archival Description)
XML markup designed for encoding archival finding aids
• Extended Date/Time Format (EDTF)
Comprehensive date/time definition for the bibliographic community
• MADS (Metadata Authority Description Schema)
XML markup for authority data from MARC 21 records and original authority data
• MARC 21 formats
Representation and communication of descriptive metadata about library items
• MARCXML
XML representation of MARC 21 data
• MODS (Metadata Object Description Schema)
XML markup for metadata from existing MARC 21 records and original resource description
• VRA Core
XML schema and data format description of visual culture and images that document them
Digital Library Standards

• ALTO
Technical metadata for Optical Character Recognition (OCR)
• AudioMD and VideoMD
XML schemas for technical metadata on audio- and video-based digital objects
• METS (Metadata Encoding & Transmission Standard)
Structure for encoding descriptive, administrative, and structural metadata
• MIX (NISO Metadata for Images in XML)
XML schema for encoding technical data elements required to manage digital image collections
• PREMIS (Preservation Metadata)
Data dictionary and supporting XML schemas for core preservation metadata needed to
support the long-term preservation of digital materials.
• TextMD (Technical Metadata for Text)
XML schema that details technical metadata for text-based digital objects
Information Resource Retrieval Protocols

• CQL (Contextual Query Language)
Formal, user-friendly query language for use between information
retrieval systems
• SRU/SRW (Search and Retrieve URL/Web Service)
Web services for search and retrieval based on Z39.50 semantics
• Z39.50
Supports information retrieval among different information systems
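As a rough sketch of how CQL and SRU fit together, an SRU request is an ordinary URL whose `query` parameter carries a CQL expression. The base URL below is a hypothetical placeholder, the helper name `build_sru_url` is an assumption, and the version value and Dublin Core index should be checked against what a given server actually supports.

```python
from urllib.parse import urlencode

def build_sru_url(base_url, cql_query, max_records=10):
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",              # assumed; servers may expect another version
        "query": cql_query,            # the CQL expression, URL-encoded below
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint; the CQL query searches a title index for a phrase.
url = build_sru_url("https://sru.example.org/catalog", 'dc.title = "digital library"')
print(url)
```

The server would answer with an XML response containing the matching records.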
Information Resource Retrieval Standards

• ISO 639-2
Codes for representing names of languages (Part 2: Alpha-3 code)
• ISO 639-5
Codes for representing names of languages (Part 5: Alpha-3 code for
language families and groups)
• ISO/DIS 25577
Information and documentation (MarcXchange)
• ISO 20775
Schema for holdings information
Image File Types
• JPEG (And JPG) — Joint Photographic Experts Group
• PNG — Portable Network Graphics
• GIF — Graphics Interchange Format
• WebP
• TIFF-Tagged Image File Format
• BMP — Bitmap
• HEIF — High Efficiency Image File Format
• SVG — Scalable Vector Graphics
• EPS — Encapsulated Postscript
• PDF — Portable Document Format
• PSD — Photoshop Document
• AI — Adobe Illustrator Artwork
• XCF — eXperimental Computing Facility
• INDD — Adobe InDesign Document

https://kinsta.com/blog/image-file-types/
Selection policies

• Materials should be carefully selected before an institution undertakes a
digitization project. The copyright status of the original
materials must be clear. The selection of the documents may be based on
• material in demand
• interest of users
• review and selection by subject experts
• good quality
• high resolution in the case of photographs and videos.
Selection for Digitization: Factors to Consider
(Figure: factors ranging from obvious to subtle)
• Copyright & Permission
• Cost
• Physical Condition
• Discovery & Access
• Purpose
• Audience
• Intrinsic Value
Copyright & Permission
Are the materials in the public domain?
If items are copyrighted, do you have permission to digitize?
What risk are you willing to accept if you digitize materials
for which you do not have (cannot obtain) permission?
Cost
Do you have sufficient resources in both money and personnel
to devote to digitization?
Do you have resources set aside for ongoing long-term storage
of digital objects (digital preservation)?
Do materials have special characteristics that require
special processing during digitization that would add to
the cost?
Physical Condition
Will digitization damage the item?
What level of damage is acceptable?
Will a digitized item substitute for continued physical handling
of the original item, thereby preventing further deterioration?
Discovery & Access
How will the digital objects be discovered?
Will you be able to digitize objects with sufficient
quality?
Will the digital objects have added value, e.g.,
keyword searchability for textual materials?
Purpose
Will digital objects fulfill a specific, articulated purpose
related to teaching, research or institutional mission?
Do you have a collection policy that will inform
digitization selection decisions?
Audience
Will digital objects reach new audiences? Will digitization help
previously known audiences access your items, when they
could not access them in the analog versions?
Do your objects have appeal to specific scholarly communities?
Do you have evidence that digital objects would be used
in teaching or as curricular materials?
Intrinsic Value

Are your objects unique?
Do your objects have representative value? That is, would
digital objects sufficiently represent a larger collection that you
would like to make known?
Is there sufficient context surrounding your digital objects to
make the collection usable? (For example, a collection of 19th
c. photographs will mean little without sufficient descriptive
information about the subjects and photographer.)
IPR
