Digital Data
Agenda
• Types of digital data
• Unstructured
• Origin
• Management
• Storage
• Storage of unstructured data in relational database
• Process of extracting information
• Key take-away and additional reads
• Semi-Structured
• Origin
• Management
• Storage
• Storage of semi-structured data in relational database
• Process of extracting information
• XML
• Key take-away and additional reads
Agenda
• Types of digital data - contd.
• Structured
• Origin
• Management
• Storage
• Process of extracting information
• Key take-away and additional reads
Digital Data
• Digital data can be
• Unstructured
• Semi-Structured
• Structured
• According to Merrill Lynch, 80-90% of business data is either
unstructured or semi-structured
• Data is usually in a format which makes it difficult to extract
information from it
Formats of Digital Data
[Chart: Formats of Data. Unstructured Data and Semi-structured Data: 80%]
Unstructured Data
What is unstructured data?
• Does not conform to any data model
• Cannot be stored in form of rows and columns as in a database
• Has no easily identifiable structure
• Not in any particular format or sequence
• Does not follow any rule or semantics
• Not easily usable by a program
Where does unstructured data come from?
• Web pages
• Memos
• Body of an e-mail
• Word document
• PowerPoint presentations
• Chats
• Reports
• Whitepapers
• Surveys
How to store unstructured data?
• Storage space: The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images etc. take up a huge amount of storage space.
• Scalability: Scalability becomes an issue as the volume of unstructured data increases.
• Update and delete: Updating, deleting etc. is not easy due to the unstructured form.
• Indexing and searching: Indexing becomes difficult as the data grows. Searching is difficult for non-text data.
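The indexing and searching point can be illustrated with a minimal sketch of an inverted index in Python. The sample documents, the function names `build_index` and `search`, and the simple whitespace tokenizer are assumptions made for illustration and do not come from the slides. The sketch only works because the input is text, which hints at why non-text data such as audio or images is harder to search.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip common punctuation so "report." and "report" match.
            index[word.strip(".,;:!?")].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of documents that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample documents (memo, chat, e-mail body).
docs = {
    "memo1": "Quarterly sales report attached.",
    "chat7": "Did you read the sales memo?",
    "mail3": "Survey results are ready.",
}
index = build_index(docs)
print(search(index, "sales"))
```

Every new document adds entries to the index, so the index grows along with the data it describes.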
How to store unstructured data?
• Change formats: Unstructured data may be converted to formats which are easily managed, stored and searched, e.g. IBM is working on a solution which converts audio, video etc. to text.
• CAS: Organize files based on their metadata.
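As a rough sketch of organizing files by metadata, the Python snippet below groups file names by a chosen metadata field. The file names, the metadata fields (`type`, `author`) and the helper `organize_by_metadata` are hypothetical and chosen only for illustration; a real storage system would keep far richer metadata.

```python
from collections import defaultdict

def organize_by_metadata(files, key):
    """Group file names by one metadata field, e.g. 'type' or 'author'."""
    groups = defaultdict(list)
    for name, meta in files.items():
        # Files that lack the field are collected under "unknown".
        groups[meta.get(key, "unknown")].append(name)
    return dict(groups)

# Hypothetical files with hand-written metadata.
files = {
    "intro.mp4": {"type": "video", "author": "asha"},
    "notes.docx": {"type": "document", "author": "ravi"},
    "demo.mp4": {"type": "video", "author": "ravi"},
    "scan.png": {"type": "image"},
}
by_type = organize_by_metadata(files, "type")
by_author = organize_by_metadata(files, "author")
print(by_type)
print(by_author)
```

The file contents are never inspected here; only the metadata is used to decide where each file belongs.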
How to extract information from unstructured data?
• Interpretation: Unstructured data is not easily interpreted by conventional search algorithms
• Not sufficient metadata
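One simple way to pull some information out of free text is pattern matching. The sketch below is an illustration only: the memo text, the two regular expressions and the `extract` helper are assumptions, not a method described in the slides, and the patterns are deliberately simplified.

```python
import re

# Simplified patterns, assumed for illustration; real e-mail and date
# formats are more varied than these expressions allow.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract(text):
    """Return the e-mail addresses and ISO-style dates found in the text."""
    return {"emails": EMAIL.findall(text), "dates": DATE.findall(text)}

# Hypothetical memo text.
memo = (
    "Please send the survey to priya@example.com by 2011-03-15. "
    "Copy raj.k@example.org."
)
info = extract(memo)
print(info)
```

Patterns like these find only what they were written to find, so anything phrased differently in the text is missed.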
Where does semi-structured data come from?
• E-mail
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources
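E-mail is a handy example of semi-structured data: the headers follow a known layout while the body is free text. The sketch below uses Python's standard `email` module to separate the two. The message itself is made up for illustration.

```python
from email import message_from_string

# Hypothetical raw message: header lines, a blank line, then the body.
raw = (
    "From: asha@example.com\n"
    "To: ravi@example.com\n"
    "Subject: Survey results\n"
    "\n"
    "Hi Ravi, the results look good. Let's talk tomorrow.\n"
)

msg = message_from_string(raw)

# The headers can be read by name, like fields in a record.
headers = {name: msg[name] for name in ("From", "To", "Subject")}

# The body comes back as one block of free text.
body = msg.get_payload()

print(headers)
print(body)
```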
How to manage semi-structured data?
Challenges Faced
• Implicit structure: In many cases the structure is implicit. Interpreting relationships and correlations is very difficult.
Possible Solutions
• Special purpose DBMS: Databases which are specifically designed to store semi-structured data
• Indexing
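To make the idea of implicit structure concrete, the sketch below parses a small XML fragment with Python's standard `xml.etree.ElementTree` module and flattens it into row-like dictionaries. The XML content, the tag names and the `to_rows` helper are assumptions for illustration. Note that the second record has no `dept` tag, which is the kind of irregularity a fixed row-and-column layout has to work around.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the second employee has no <dept> element.
xml_text = """
<employees>
  <employee id="1"><name>Asha</name><dept>Sales</dept></employee>
  <employee id="2"><name>Ravi</name></employee>
</employees>
"""

def to_rows(text):
    """Flatten <employee> elements into dictionaries with fixed keys."""
    root = ET.fromstring(text.strip())
    rows = []
    for emp in root.findall("employee"):
        rows.append({
            "id": emp.get("id"),
            "name": emp.findtext("name"),
            "dept": emp.findtext("dept"),  # None when the tag is absent
        })
    return rows

rows = to_rows(xml_text)
print(rows)
```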
• http://queue.acm.org/detail.cfm?id=1103832
• http://www.computerworld.com/s/article/93968/Taming_Text
• http://searchstorage.techtarget.com/generic/0,295582,sid5_gci1334684,00.html
• http://searchdatamanagement.techtarget.com/generic/0,295582,sid91_gci1264550,00.html
• http://searchdatamanagement.techtarget.com/news/article/0,289142,sid91_gci1252122,00.html
Structured Data
What is Structured data?
• Conforms to a data model
• Data is stored in form of rows and columns, e.g. relational database
• Similar entities are grouped
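A minimal sketch of data stored as rows and columns, using Python's built-in `sqlite3` module with an in-memory database. The `employee` table and its contents are made up for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table definition is the data model: every row has the same columns.
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO employee (id, name, dept) VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ravi", "Finance"), (3, "Meera", "Sales")],
)

# Because the structure is known in advance, a query can name columns directly.
sales = conn.execute(
    "SELECT name FROM employee WHERE dept = ? ORDER BY name", ("Sales",)
).fetchall()
print(sales)
conn.close()
```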
Structured data
• Spreadsheets
• SQL
• OLTP systems
Structured Data: Everything in its place
• http://www.govtrack.us/articles/20061209data.xpd
• http://www.sapdesignguild.org/editions/edition2/sui_content.asp
THANK YOU
www.infosys.com
The contents of this document are proprietary and confidential to Infosys
Limited and may not be disclosed in whole or in part at any time, to any third
party without the prior written consent of Infosys Limited.
© 2011 Infosys Limited. All rights reserved. Copyright in the whole and any
part of this document belongs to Infosys Limited. This work may not be
used, sold, transferred, adapted, abridged, copied or reproduced in whole or
in part, in any manner or form, or in any media, without the prior written
consent of Infosys Limited.