Topic 2

Data in Data Mining

ISP642 – BUSINESS INTELLIGENCE
Learning Objectives
 Understand the concepts of data and the various sources
of data
 Understand unstructured, structured and semi-structured
data
 Understand a simple taxonomy of data in data mining
BI Revisited
 The goal of BI
 is to help decision-makers make more informed
and better decisions to guide the business

 Decisions are made based on previous and
current data
Data
 a collection of facts, usually resulting from
 Experiences
 Observations
 Experiments
 lowest level of abstraction (from which
information and knowledge are derived)
 Data may consist of
 Numbers
 Words
 Images
Sources of Data
 Abundant data available from:
 Business : Web, e-commerce, transactions, stock
 Science : Remote sensing, bioinformatics, scientific
simulation
 Society and everyone : news, digital cameras
Firms and DATA
 Firms have a vast amount of data available to them
from a huge range of sources.
 BI tools
 are designed to help firms sift useful information from
all the sources,
 but, as with any tool, they should be used intelligently
to achieve optimal results
Digital data
 Types of digital data
 Structured
 Unstructured
 Semi-structured
 80–90% of business data is either unstructured or
semi-structured (Merrill Lynch)
 Difficult to extract information
Formats of Data

Source: Prasad & Acharya (2011)


Taxonomy of Data in Data Mining
Unstructured Data

Source: Prasad & Acharya (2011)


Example of Unstructured Data

Source: Prasad & Acharya (2011)


How to Store Unstructured Data?

Source: Prasad & Acharya (2011)


How to Store Unstructured Data?

Source: Prasad & Acharya (2011)


How to Extract Information from Unstructured Data?

Source: Prasad & Acharya (2011)


How to Extract Information from Unstructured Data?
Semi-structured Data
Where does Semi-structured Data Come from?
How to Manage Semi-structured Data?
How to Store Semi-structured Data?
How to Store Semi-structured Data?
How to Extract Information from Semi-structured Data?
How to Extract Information from Semi-structured Data?
XML – A Solution for Semi-structured Data Management
XML – A Solution for Semi-structured Data Management
Structured Data
Where does Structured Data Come from?
Structured Data: Everything in its Place
Semi-structured to Structured
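One way to read the move from semi-structured to structured data is as flattening tagged records into rows with a fixed set of columns. The sketch below is only an illustration: the XML snippet, its tag names (`customer`, `name`, `city`) and the helper `xml_to_rows` are hypothetical and are not taken from Prasad & Acharya (2011).

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: tags describe the data,
# but records do not have to carry the same fields.
XML_DATA = """
<customers>
  <customer id="1"><name>Aina</name><city>Shah Alam</city></customer>
  <customer id="2"><name>Ben</name></customer>
</customers>
"""

COLUMNS = ["id", "name", "city"]  # assumed target schema


def xml_to_rows(xml_text):
    """Flatten each <customer> element into a fixed-column row."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for cust in root.findall("customer"):
        rows.append({
            "id": int(cust.get("id")),
            # findtext returns None when the tag is absent
            "name": cust.findtext("name"),
            "city": cust.findtext("city"),
        })
    return rows


rows = xml_to_rows(XML_DATA)
for row in rows:
    print([row[c] for c in COLUMNS])
```

A missing field becomes an explicit empty cell (`None`) in the structured form, which is what allows every row to share the same columns.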
Ease with Structured Data-Storage
Ease with Structured Data-Retrieval
Why Structured Data?
 Structured data is what data mining algorithms use.
Categorical and Numerical
[Diagram: Data divides into Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio)]

Categorical Data
 (Structured) → Categorical
 Categorical data (also known as discrete data)
 represent the labels of multiple classes used to divide a
variable into specific groups.
 Examples:
 Race, sex, age group, educational level
 Further divided into
 Nominal data
 Ordinal data


Nominal Data
 (Structured) → (Categorical) → Nominal
 contains simple codes assigned to objects as labels; the codes
themselves are not measurements.
 Examples:
 Marital status
 Single
 Married, or
 Divorced
 yes/no
 true/false
 good/bad
 Red/green/blue
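Because nominal labels carry no order, counting and the mode are the natural summaries. The sketch below also shows one-hot encoding, a common way (not named in the slides) to present nominal labels to an algorithm without implying any order. The sample values are hypothetical.

```python
from collections import Counter

# Hypothetical nominal variable: marital status labels.
statuses = ["Single", "Married", "Divorced", "Married", "Single", "Married"]

# Counting and the mode are meaningful for nominal data.
counts = Counter(statuses)
mode_label = counts.most_common(1)[0][0]

# One-hot encoding: one 0/1 indicator column per label, so no
# artificial order or distance between labels is introduced.
labels = sorted(set(statuses))
one_hot = [[1 if s == label else 0 for label in labels] for s in statuses]

print(counts)
print("mode:", mode_label)
print(labels)
print(one_hot[0])
```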
Ordinal Data
 (Structured) → (Categorical) → Ordinal
 contains codes assigned to objects or events as labels that also
represent the rank order among them.
 Examples:
 Credit score
 Low
 Medium
 High
 Age group
 Child
 Young
 Educational level
 High school
 College
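Ordinal labels can be mapped to rank codes so that sorting and the median respect their order. This is a minimal sketch; the `LEVELS` list, the `RANK` mapping and the sample scores are assumptions made for illustration.

```python
# Hypothetical ordinal variable: credit score bands, lowest to highest.
LEVELS = ["Low", "Medium", "High"]
RANK = {label: i for i, label in enumerate(LEVELS)}

scores = ["Medium", "Low", "High", "Medium", "High"]

# Sorting by rank respects the order; alphabetical sorting would not.
ranked = sorted(scores, key=RANK.__getitem__)

# The median is meaningful for ordinal data (odd-sized sample here).
median_label = ranked[len(ranked) // 2]

print(ranked)
print("median:", median_label)
```

The rank codes only record order. The gap between "Low" and "Medium" is not assumed to equal the gap between "Medium" and "High", so averaging the codes would not be justified.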
Numerical Data
 (Structured) → (Numerical)
 Represent numeric values of specific variables
 Examples:
 Age
 Number of children
 Total household income (in US Dollars)
 Travel distance (in miles)
 Temperature (in Fahrenheit degrees)
 Can be
 Integer
 Real number


Interval Data
 (Structured) → (Numerical) → Interval
 Variables that can be measured on an interval scale
 There is no absolute zero value
 Examples:
 Temperature (on the Celsius scale)
 Difference between melting temperature and boiling temperature
 My level of happiness, rated from 1 to 10
 Time (as a point on a clock or calendar)


Ratio Data
 (Structured) → (Numerical) → Ratio
 Measurement variables commonly found in
 physical sciences and
 engineering
 Examples:
 Mass
 Length
 Time
 Energy
 Electric charge
 Height
 Weight
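The practical difference between interval and ratio data is whether ratios make sense. A small worked example, assuming two hypothetical temperature readings: differences are meaningful in Celsius, but a statement like "twice as hot" only makes sense on a scale with an absolute zero, such as Kelvin.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15


a_c, b_c = 10.0, 20.0  # hypothetical readings in degrees Celsius

# Differences are meaningful on an interval scale.
diff_c = b_c - a_c

# A ratio of Celsius values is arithmetic only; it has no physical
# meaning because 0 degrees Celsius is not an absolute zero.
ratio_c = b_c / a_c

# Kelvin has an absolute zero, so this ratio is meaningful.
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print("difference:", diff_c)
print("Celsius ratio:", ratio_c)
print("Kelvin ratio:", round(ratio_k, 4))
```

The Celsius ratio is 2.0, while the Kelvin ratio is only about 1.035, so the second reading is roughly 3.5% "hotter" rather than twice as hot.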
Unstructured & Semi-structured Data
 Other data types (Qualitative)
 Textual
 Spatial
 Imagery
 Voice
 MUST BE CONVERTED into structured data
 Only then can they be processed by data mining
algorithms.
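As one illustration of such a conversion, free text can be turned into a table of word counts (a "bag of words"). This technique is not named in the slides; the documents and the `tokenize` helper below are hypothetical and the sketch is deliberately simplistic.

```python
import re
from collections import Counter

# Hypothetical unstructured input: short customer comments.
docs = ["Great product, great price", "Price too high", "Great service"]


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


# The vocabulary becomes the column headers of the structured table.
vocab = sorted({word for doc in docs for word in tokenize(doc)})

# Each document becomes one row of word counts.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each comment is now a fixed-length numeric row, which is the form data mining algorithms expect.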
Why is this Important?
 It is important to understand the type of data
 in order to apply the appropriate operations and statistical analysis
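The link between data type and permissible analysis can be sketched as a lookup from measurement scale to the measures of central tendency that are meaningful for it. The `ALLOWED` table reflects the usual textbook convention rather than anything stated in these slides, and `summarize` is a hypothetical helper.

```python
import statistics

# Assumed convention: which central-tendency measures suit each scale.
ALLOWED = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

FUNCS = {
    "mode": statistics.mode,
    # median_low always returns an actual data value, which keeps
    # the result valid for ordinal rank codes.
    "median": statistics.median_low,
    "mean": statistics.mean,
}


def summarize(values, scale):
    """Compute only the measures that are meaningful for the scale."""
    return {name: FUNCS[name](values) for name in ALLOWED[scale]}


print(summarize(["red", "blue", "red"], "nominal"))
print(summarize([1, 2, 2, 3], "ordinal"))  # rank codes
print(summarize([10.0, 20.0, 20.0, 30.0], "interval"))
```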
Summary of Data Operations
Statistical Analysis from Data
Statistical Methods
Descriptive vs. Inferential Statistics
Descriptive Statistics (example)
 Asking 35 people their favorite ice cream flavors
Descriptive Analysis
 Univariate Analysis
 the distribution
 Frequency distribution
 the central tendency
 Mean
 Median
 Mode
 the dispersion
 Range
 Standard deviation
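The univariate measures listed above can be computed with Python's standard library. The sample values are hypothetical and serve only to make the sketch concrete.

```python
import statistics
from collections import Counter

# Hypothetical samples.
flavors = ["vanilla", "chocolate", "vanilla", "mint", "chocolate", "vanilla"]
ages = [22, 25, 25, 30, 41]

# Distribution: frequency of each value.
frequency = Counter(flavors)

# Central tendency.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# Dispersion.
age_range = max(ages) - min(ages)
std_dev = statistics.stdev(ages)  # sample standard deviation

print(frequency)
print(mean_age, median_age, mode_age)
print(age_range, round(std_dev, 3))
```

`statistics.stdev` gives the sample standard deviation (dividing by n - 1); `statistics.pstdev` would be the choice if the values were the whole population.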
Inferential Statistics
 Confidence Interval:
 when you want to estimate a population parameter
 Mean difference
 Significance Testing
 to assess the evidence provided by the data in favor of a hypothesis
 T-Tests
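Both ideas can be sketched on one small sample. Everything below is an assumption made for illustration: the sample values, the hypothesized mean `mu0`, and the use of a normal critical value for the interval. The standard library has no t distribution, and for a sample this small a t critical value would normally be used, giving a slightly wider interval.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample of measurements.
sample = [52, 48, 55, 51, 49, 53, 50, 54]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # sample standard deviation
se = sd / math.sqrt(n)          # standard error of the mean

# Approximate 95% confidence interval for the population mean
# (normal approximation, see the caveat in the lead-in).
z = NormalDist().inv_cdf(0.975)
ci = (mean - z * se, mean + z * se)

# One-sample t statistic against a hypothesized mean mu0.
mu0 = 50.0
t_stat = (mean - mu0) / se

print("mean:", mean)
print("95% CI (approx.):", tuple(round(x, 2) for x in ci))
print("t statistic:", round(t_stat, 3))
```

The t statistic would then be compared with a critical value from the t distribution with n - 1 degrees of freedom to decide whether the evidence against `mu0` is significant.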
Inferential Analysis
Descriptive Statistical Analysis: Others
END OF TOPIC
 THANK YOU FOR YOUR ATTENTION


Next Topic
References
1. RN Prasad & Seema Acharya (2011), Fundamentals of Business Analytics, Wiley India Pvt. Ltd.
2. Ramesh Sharda, Dursun Delen & Efraim Turban (2014), Business Intelligence and Analytics, 10th ed., Pearson Education Ltd.
3. Nathan Yau (2013), Data Points: Visualization That Means Something, Wiley.
4. http://www.graphpad.com/support/faqid/1089/
5. https://statistics.laerd.com/statistical-guides/hypothesis-testing.php
6. http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/
