Business Intelligence

Digital Data
Agenda
• Types of digital data
  • Unstructured
    • Origin
    • Management
    • Storage
    • Storage of unstructured data in relational database
    • Process of extracting information
    • Key take-away and additional reads
  • Semi-Structured
    • Origin
    • Management
    • Storage
    • Storage of semi-structured data in relational database
    • Process of extracting information
    • XML
    • Key take-away and additional reads
Agenda
• Types of digital data - contd.
  • Structured
    • Origin
    • Management
    • Storage
    • Process of extracting information
    • Key take-away and additional reads
Digital Data
• Digital data can be
  • Unstructured
  • Semi-Structured
  • Structured
• According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured
• Data is usually in a format which makes it difficult to extract information from it
Formats of Digital Data
[Pie chart "Formats of Data": unstructured, semi-structured and structured data; slices labelled 80%, 10% and 10%]
Unstructured Data
What is unstructured data?
• Does not conform to any data model
• Cannot be stored in form of rows and columns as in a database
• Has no easily identifiable structure
• Not in any particular format or sequence
• Does not follow any rule or semantics
• Not easily usable by a program
Where does unstructured data come from?
• Web pages
• Memos
• Videos (MPEG etc.)
• Images (JPEG, GIF etc.)
• Body of an e-mail
• Word documents
• PowerPoint presentations
• Chats
• Reports
• Whitepapers
• Surveys
How to store unstructured data?
Challenges faced:
• Storage space: The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images etc. occupy a huge amount of storage space
• Scalability: Scalability becomes an issue as the amount of unstructured data increases
• Retrieve information: Retrieving and recovering unstructured data is cumbersome
• Security: Ensuring security is difficult due to the varied sources of data, e.g. e-mail, web pages
• Update and delete: Updating, deleting etc. is not easy due to the unstructured form
• Indexing and searching: Indexing becomes difficult as the data grows. Searching is difficult for non-text data
How to store unstructured data?
Possible solutions:
• Change formats: Unstructured data may be converted to formats which are easily managed, stored and searched. E.g. IBM is working on a solution which converts audio, video etc. to text
• New hardware: Create hardware which supports unstructured data, either to complement the existing storage devices or to stand alone for unstructured data
• RDBMS/BLOBs: Store in relational databases which support BLOBs (Binary Large Objects)
• XML: Store in XML, which tries to give some structure to unstructured data by using tags and elements
• CAS (content-addressable storage): Organize files based on their metadata
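The RDBMS/BLOB option can be illustrated with a minimal sketch. It assumes SQLite (via Python's standard sqlite3 module) as a stand-in for any relational database with BLOB support; the table name documents and the helpers store_blob and load_blob are hypothetical names chosen for this example, not part of the original material.

```python
import sqlite3


def store_blob(conn, name, payload):
    """Insert raw bytes into a BLOB column and return the new row id."""
    cur = conn.execute(
        "INSERT INTO documents (name, content) VALUES (?, ?)",
        (name, payload),
    )
    conn.commit()
    return cur.lastrowid


def load_blob(conn, doc_id):
    """Return the stored bytes for doc_id, or None if there is no such row."""
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, content BLOB NOT NULL)"
)

# Stand-in bytes; in practice this would be the content of an image, video etc.
image_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])
doc_id = store_blob(conn, "photo.jpg", image_bytes)
restored = load_blob(conn, doc_id)
```

The database only stores and returns the bytes unchanged; it does not interpret them, which is why searching inside non-text data remains a challenge.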
How to extract information from unstructured data?
Challenges faced:
• Interpretation: Unstructured data is not easily interpreted by conventional search algorithms
• Tags: As the data grows it is not possible to put tags manually
• Indexing: Designing algorithms to understand the meaning of the document and then tag or index them accordingly is difficult
• Deriving meaning: Computer programs cannot automatically derive meaning/structure from unstructured data
• File formats: The increasing number of file formats makes it difficult to interpret data
• Classification/Taxonomy: Different naming conventions followed across the organization make it difficult to classify data
How to extract information from unstructured data?
Possible solutions:
• Tags: Unstructured data can be stored in a virtual repository and be automatically tagged. E.g. Documentum provides this type of solution
• Text mining: Text mining tools help in grouping, classifying and analyzing unstructured data by considering grammar, context, synonyms etc.
• Application platforms: Application platforms like XOLAP help extract information from e-mail and XML based documents
• Classification/Taxonomy: Taxonomies within the organization can be managed automatically to organize data in hierarchical structures
• Naming conventions/standards: Following naming conventions or standards across an organization can greatly improve storage and retrieval
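As a rough illustration of automatic tagging, the sketch below suggests tags for a piece of text from its most frequent words. This is a deliberately simple stand-in and not how the tools named above work: real text mining tools also consider grammar, context and synonyms. The function suggest_tags, the stop-word list and the sample memo are all assumptions made for this example.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real tool would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}


def suggest_tags(text, max_tags=3):
    """Return up to max_tags of the most frequent non-trivial words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]


memo = (
    "The invoice for the server order is attached. "
    "Please approve the invoice so the order can ship."
)
tags = suggest_tags(memo, max_tags=2)
```

Here the two words that occur twice in the memo, "invoice" and "order", become the suggested tags.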
Additional reads
• http://www.information-management.com/issues/20030201/6287-1.html
• http://www.enterpriseitplanet.com/storage/features/article.php/11318_3407161_2
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html
• http://www.research.ibm.com/UIMA/UIMA%20Architecture%20Highlights.html
Semi-Structured Data
What is semi-structured data?
• Does not conform to a data model but contains tags & elements (metadata)
• Cannot be stored in form of rows and columns as in a database
• Similar entities are grouped
• The tags and elements describe how data is stored
• Attributes in a group may not be the same
• Not sufficient metadata
Where does semi-structured data come from?
• E-mail
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources
How to manage semi-structured data?
Some ways in which semi-structured data is managed and stored:

Schemas
• Describe the structure and content of data to some extent
• Assign meaning to data, hence allowing automatic search and indexing

Graph based data models
• Contain data on the leaves of the graph. Also known as 'schema less'
• Used for data exchange among heterogeneous sources

XML
• Models the data using tags and elements
• Schemas are not tightly coupled to data
How to store semi-structured data?
Challenges faced:
• Storage cost: Storing data with their schemas increases cost
• RDBMS: Semi-structured data cannot be stored in existing RDBMS as data cannot be mapped into tables directly
• Irregular and partial structure: Some data elements may have extra information while others none at all
• Implicit structure: In many cases the structure is implicit. Interpreting relationships and correlations is very difficult
• Evolving schemas: Schemas keep changing with requirements, making it difficult to capture them in a database
• Distinction between schema and data: A vague distinction between schema and data exists at times, making it difficult to capture data
How to store semi-structured data?
Possible solutions:
• XML: XML allows tags and attributes to be defined to store data. Data can be stored in a hierarchical/nested structure
• RDBMS: Semi-structured data can be stored in a relational database by mapping the data to a relational schema which is then mapped to a table
• Special purpose DBMS: Databases which are specifically designed to store semi-structured data
• OEM (Object Exchange Model): Data can be stored and exchanged in the form of a graph where entities are represented as objects which are the vertices in the graph
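One possible way to map irregular records onto a relational schema is a generic entity/attribute layout, sketched below. This is only one of several mappings and is an assumption of this example, not a method prescribed above; the table names entity and attribute, the helper store_records and the sample records are hypothetical. SQLite (Python's standard sqlite3 module) stands in for the RDBMS.

```python
import sqlite3


def store_records(conn, records):
    """Store each record as one entity row plus one row per attribute it has."""
    conn.execute("CREATE TABLE IF NOT EXISTS entity (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS attribute ("
        "entity_id INTEGER REFERENCES entity(id), name TEXT, value TEXT)"
    )
    for record in records:
        entity_id = conn.execute("INSERT INTO entity DEFAULT VALUES").lastrowid
        conn.executemany(
            "INSERT INTO attribute VALUES (?, ?, ?)",
            [(entity_id, key, str(value)) for key, value in record.items()],
        )
    conn.commit()


# Hypothetical records with irregular structure: the attributes differ per record.
records = [
    {"title": "Q1 report", "author": "A. Rao"},
    {"title": "Memo", "pages": 2, "reviewer": "K. Lee"},
]

conn = sqlite3.connect(":memory:")
store_records(conn, records)

# Find every entity that happens to carry a "reviewer" attribute.
reviewers = conn.execute(
    "SELECT entity_id, value FROM attribute WHERE name = ?", ("reviewer",)
).fetchall()
```

Because attributes are rows rather than columns, a record with extra or missing attributes needs no schema change, at the cost of storing every value as text and needing joins to rebuild a full record.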
How to extract information from semi-structured data?
Challenges faced:
• Flat files: Semi-structured data is usually stored in flat files which are difficult to index and search
• Heterogeneous sources: Data comes from varied sources, which makes it difficult to tag and search
• Incomplete/irregular structure: Extracting structure when there is none and interpreting the relations existing in the structure which is present is a difficult task
How to extract information from semi-structured data?
Possible solutions:
• Indexing: Indexing data in a graph based model enables quick search
• OEM: Allows data to be stored in a graph based data model which is easier to index and search
• XML: Allows data to be arranged in a hierarchical or tree like structure which enables indexing and searching
• Mining tools: Various mining tools are available which search data based on graphs, schemas, structure etc.
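The idea of indexing a graph based model can be sketched as follows. The dictionary layout is a simplified, hypothetical rendering of an OEM-style graph (objects with a label, and data held on the leaves); the names objects, build_label_index and label_index are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical OEM-style graph: each object has a label and either an atomic
# value (a leaf) or a list of child object ids. The second person has no
# e-mail, which illustrates the irregular structure of semi-structured data.
objects = {
    1: {"label": "person", "children": [2, 3]},
    2: {"label": "name", "value": "Mark Taylor"},
    3: {"label": "email", "value": "MarkT@dcs.ymail.ac.uk"},
    4: {"label": "person", "children": [5]},
    5: {"label": "name", "value": "Alex Bourdoo"},
}


def build_label_index(objs):
    """Map each label to the ids of the objects carrying it."""
    index = defaultdict(list)
    for obj_id, obj in objs.items():
        index[obj["label"]].append(obj_id)
    return index


label_index = build_label_index(objects)

# With the index, all "name" leaves are found without walking the whole graph.
names = [objects[obj_id]["value"] for obj_id in label_index["name"]]
```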
XML- A solution for semi-structured data management
• XML: Extensible Markup Language
• What is XML? An open-standard markup language written in plain text. It is hardware and software independent
• Does what? Designed to store and transport data over the internet
• How? It allows data to be stored in a hierarchical/nested structure. It allows the user to define tags to store the data
XML- A solution for semi-structured data management

XML has no predefined tags


<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you?
</body>
</message>

The words in the <> (angle brackets) are user-defined tags

XML is known as self-describing as data can exist without a schema and a schema can be added later
A schema can be described in a DTD or in XML Schema (XSD)
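To show how a program can read such user-defined tags, the sketch below parses the message example with Python's standard xml.etree.ElementTree module. The variable names MESSAGE_XML, root and fields are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

# The message example from the slide, reproduced as a string.
MESSAGE_XML = """<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you?
</body>
</message>"""

root = ET.fromstring(MESSAGE_XML)

# Each child element's tag names the field; strip the surrounding whitespace.
fields = {child.tag: child.text.strip() for child in root}
```

No schema was needed to parse the document: the tags alone tell the program which piece of text is the subject, the sender and so on.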
Additional reads
• http://queue.acm.org/detail.cfm?id=1103832
• http://www.computerworld.com/s/article/93968/Taming_Text
• http://searchstorage.techtarget.com/generic/0,295582,sid5_gci1334684,00.html
• http://searchdatamanagement.techtarget.com/generic/0,295582,sid91_gci1264550,00.html
• http://searchdatamanagement.techtarget.com/news/article/0,289142,sid91_gci1252122,00.html
Structured Data
What is structured data?
• Conforms to a data model
• Data is stored in form of rows and columns, e.g. relational database
• Similar entities are grouped
• Attributes in a group are the same
• Data resides in fixed fields within a record or file
• Definition, format & meaning of data is explicitly known
Where does structured data come from?
• Databases e.g. Access
• Spreadsheets
• SQL
• OLTP systems
Structured Data: Everything in its place
• Fully described datasets
• Clearly defined categories and sub-categories
• Data neatly placed in rows and columns
• Data that goes into the records is regulated by a well-defined structure
• Indexing can be easily done either by the DBMS itself or manually
Structured Data
Semi-structured

Name                                   E-mail
Patrick Wood                           ptw@dcs.abc.ac.uk, p.wood@ymail.uk
first name: Mark; last name: Taylor    MarkT@dcs.ymail.ac.uk
Alex Bourdoo                           AlexBourdoo@dcs.ymail.ac.uk

Structured

First Name   Last Name   E-mail Id                     Alternate E-mail Id
Patrick      Wood        ptw@dcs.abc.ac.uk             p.wood@ymail.uk
Mark         Taylor      MarkT@dcs.ymail.ac.uk
Alex         Bourdoo     AlexBourdoo@dcs.ymail.ac.uk
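The step from the semi-structured records to the structured table can be sketched in code. The dictionary form of the input, the function normalize and the output field names are assumptions made for this example; only the names and e-mail addresses come from the slide.

```python
# The semi-structured records from the slide, written as dictionaries.
raw_records = [
    {"name": "Patrick Wood", "email": "ptw@dcs.abc.ac.uk, p.wood@ymail.uk"},
    {"first name": "Mark", "last name": "Taylor", "email": "MarkT@dcs.ymail.ac.uk"},
    {"name": "Alex Bourdoo", "email": "AlexBourdoo@dcs.ymail.ac.uk"},
]


def normalize(record):
    """Map one irregular record onto the four fixed fields of the structured table."""
    if "name" in record:
        # A combined name is split at the first space.
        first, _, last = record["name"].partition(" ")
    else:
        first = record.get("first name", "")
        last = record.get("last name", "")
    emails = [e.strip() for e in record.get("email", "").split(",") if e.strip()]
    return {
        "first_name": first,
        "last_name": last,
        "email_id": emails[0] if emails else None,
        "alternate_email_id": emails[1] if len(emails) > 1 else None,
    }


rows = [normalize(record) for record in raw_records]
```

Every output row has the same four fields, with None where a record has no alternate e-mail id; this is the "attributes in a group are the same" property of structured data.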
Ease with structured data- storage
• Storage: Data types, both defined and user defined, help with the storage of structured data
• Scalability: Scalability is not generally an issue with increase in data
• Security: Ensuring security is easy
• Update and delete: Updating, deleting etc. is easy due to the structured form
Ease with structured data- retrieval
• Retrieve information: A well-defined structure helps in easy retrieval of data
• Indexing and searching: Data can be indexed based not only on a text string but on other attributes as well. This enables streamlined search
• Mining data: Structured data can be easily mined and knowledge can be extracted from it
• BI operations: BI works extremely well with structured data. Hence data mining, warehousing etc. can be easily undertaken
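Indexing on an attribute can be illustrated with a small sketch, again assuming SQLite through Python's standard sqlite3 module as the DBMS. The table contacts, the index name idx_contacts_last_name and the sample rows are hypothetical and only reuse the names from the earlier comparison table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (first_name TEXT, last_name TEXT, email_id TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Patrick", "Wood", "ptw@dcs.abc.ac.uk"),
        ("Mark", "Taylor", "MarkT@dcs.ymail.ac.uk"),
        ("Alex", "Bourdoo", "AlexBourdoo@dcs.ymail.ac.uk"),
    ],
)

# Index an attribute (last_name) so lookups on it need not scan every row.
conn.execute("CREATE INDEX idx_contacts_last_name ON contacts (last_name)")

matches = conn.execute(
    "SELECT first_name, email_id FROM contacts WHERE last_name = ?", ("Taylor",)
).fetchall()

# The query plan can be inspected to see whether the DBMS chose the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT first_name FROM contacts WHERE last_name = ?",
    ("Taylor",),
).fetchall()
```

Because every row has the same fixed fields, the DBMS can build and maintain such an index itself once it is declared.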
Additional reads

• http://www.govtrack.us/articles/20061209data.xpd
• http://www.sapdesignguild.org/editions/edition2/sui_content.asp
THANK YOU
www.infosys.com
The contents of this document are proprietary and confidential to Infosys
Limited and may not be disclosed in whole or in part at any time, to any third
party without the prior written consent of Infosys Limited.
© 2011 Infosys Limited. All rights reserved. Copyright in the whole and any
part of this document belongs to Infosys Limited. This work may not be
used, sold, transferred, adapted, abridged, copied or reproduced in whole or
in part, in any manner or form, or in any media, without the prior written
consent of Infosys Limited.
