5/25/2020

Introduction to DM and Data Warehousing
Overview of the Course

What is Data Mining (DM)?

• Data: facts, measurements or text collected for reference or analysis (Oxford
dictionary).
Unstructured data: data that does not fit a certain data structure (text, a list of numeric
measurements)
Structured data: data that fits a certain data structure (table, tree, graph/network, etc.)

• “Data mining is the process of discovering meaningful new correlations,
patterns and trends by sifting through large amounts of data stored in
repositories, using pattern recognition technologies as well as statistical and
mathematical techniques.” (The Gartner Group, www.gartner.com)
• “Data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are
both understandable and useful to the data owner”. (David Hand, Heikki
Mannila, and Padhraic Smyth, Principles of Data Mining, MIT Press, Cambridge,
MA, 2001.)

• Process Mining is the task of converting event data into process models.
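
To make the process mining definition concrete, here is a minimal sketch that turns event data into a very simple process model, a directly-follows count between activities. The event log, the activity names, and the function name `directly_follows` are assumptions made up for illustration; real process mining tools build much richer models.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case_id, activity) pairs, already ordered by time.
event_log = [
    ("c1", "register"), ("c1", "check"), ("c1", "pay"),
    ("c2", "register"), ("c2", "pay"),
    ("c3", "register"), ("c3", "check"), ("c3", "pay"),
]

def directly_follows(log):
    """Count how often activity a is directly followed by b within a case."""
    traces = defaultdict(list)
    for case_id, activity in log:
        traces[case_id].append(activity)
    counts = Counter()
    for trace in traces.values():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

dfg = directly_follows(event_log)
for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

The resulting counts can be read as the edges of a process graph: the more often one activity follows another, the stronger that edge.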


Data Mining: Defined and Explained

• Data mining: the computational process of discovering patterns in
large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems
(Wikipedia).
• Data mining: the practice of examining large pre-existing databases
in order to generate new information (Oxford).
• Data mining: knowledge discovery from data (or information) in an
automated way (see the DIKW pyramid).
• Knowledge Discovery in Data is the non-trivial process of identifying
valid, novel, potentially useful and ultimately understandable
patterns in data.
• Process Mining is the task of converting event data into process
models.

The Data Mining Process

1. Understand the domain;
2. Create a dataset;
3. Select the interesting attributes;
4. Data cleaning and preprocessing;
5. Choose the data mining task and the specific algorithm;
6. Interpret the results, and possibly return to step 2.

• The DM process must address:
• Enormity of data
• High dimensionality of data
• Heterogeneous and distributed nature of data.

Discussion Q: What is DM, and what is it not?

How much data do we generate?


Business Management Issues

• “We have mountains of data in this company, but we can’t
access it.”
• “We need to slice and dice the data every which way.”
• “You’ve got to make it easy for business people to get at the
data directly.”
• “Just show me what is important.”
• “It drives me crazy to have two people present the same
business metrics at a meeting, but with different numbers.”
• “We want people to use information to support more fact-based
decision making.”

Why BDA?


Managing Organizations

Informed decision making as a prerequisite for success

• Vision
• Mission
• Strategic Givens: Values, Purpose, Structure, Politics, Environment, etc.
• Direction: Policies, Goals, and Objectives
• Decision Making: Analytics, Decision Making (What should be done?)
• Implementation: Project Management (When and how?)
• Action

Managerial Decision Making

Information Technology Solutions for Improving Effectiveness

[Figure: decision phases. INTELLIGENCE: Structuring Relationships. DESIGN: Problem
Representation, Variables (Measures and Estimates), Generation of Alternatives,
Probabilities and Estimates. CHOICE: Decision Analysis and Influence Diagrams for
Visualizing Models and Choices; Spreadsheet Models for managing complex
relationships and detail.]

Components of a DSS

Creating Information Under Conditions of Uncertainty and Complexity

[Figure: Information Technology for Enterprise Strategic Systems. DATA BASE and
MODEL BASE; DBMS and MBMS; Enterprise Application Data Models; DATA WAREHOUSING;
ON LINE ANALYTICAL PROCESSING; Business Reporting.]

Slide 25

KK1 Kula K, 5/20/2019



Enterprise Wide Decisions

Goals/Strategy

• Marketing (Pricing, Promotion, Loyalty): Demand, Consumers
• Production (Capacity, Labor, Materials): Quantity, Suppliers
• Finance (Cash flow, Debt/Equity, Investments): Revenues, Investors


Why DM?

• Data explosion
• "We are drowning in data, but starving for knowledge!"
• Descriptive data mining:
clustering, pattern mining, etc.
• Predictive data mining:
classification, prediction, etc.
• Big Data Analytics or Data Science

• Data → Information → Knowledge
• Knowledge Discovery
• Interpretation
• Machine Learning
• Understanding
• Learning
• Data Mining
• Acting


What is a Data Warehouse?


• Data warehouse: a copy of transaction data specially structured for query and
analysis (R. Kimball)
• Data warehouse: a system used for reporting and data analysis (Wikipedia)
• Data warehouse: a subject oriented, integrated, nonvolatile, timestamped
collection of data designed to support management’s decision support needs.


Data warehouse (DW): Definition

• Data warehouse (DW or DWH), also known as an enterprise data
warehouse (EDW), is a system used for reporting and data analysis,
and is considered a core component of business intelligence.
• A data warehouse is simply a single, complete, and consistent store
of data obtained from a variety of sources and made available to
end users in a way they can understand and use it in a business
context.
• A data warehouse is a subject-oriented, integrated, time-varying,
non-volatile collection of data that is used primarily in organizational
decision making.
• A data warehouse integrates data originating from multiple sources
and various timeframes.

Basic Elements of the Data Warehouse

Data Warehouse

• The data warehouse:
• must make an organization’s information easily accessible
• must present the organization’s information consistently
• must be adaptive and resilient to change
• must be a secure bastion that protects our information assets
• must serve as the foundation for improved decision making
• the business community must accept the data warehouse if it is to
be deemed successful


Benefits of a Data Mart (contd…)

Operational Source Systems and Data Staging Area

• Operational Source Systems
• capture the transactions of the business
• queries against source systems are narrow
• stovepipe application

• Data Staging Area
• a storage area and a set of ETL (extract-transform-load) processes
• it is off-limits to business users and does not provide query and presentation
services.

• Data Staging Area - ETL
• EXTRACTION
• reading and understanding the source data and copying the data needed for
the data warehouse into the staging area for further manipulation.
• TRANSFORMATION
• cleansing, combining data from multiple sources, deduplicating data, and
assigning warehouse keys
• LOADING
• loading the data into the data warehouse presentation area
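
The three ETL steps above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a production ETL job: the CSV content, the table name `sales`, and the rule that identical (customer, amount) rows are duplicates are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be an operational system.
SOURCE_CSV = """customer,amount
 Alice ,10.50
bob,3.00
Alice,10.50
,7.25
"""

def extract(text):
    """EXTRACTION: read source rows into the staging area (a list of dicts)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """TRANSFORMATION: cleanse, deduplicate, and assign warehouse keys."""
    seen, out = set(), []
    for row in rows:
        name = row["customer"].strip().title()
        if not name:
            continue  # cleansing: drop rows without a customer
        record = (name, float(row["amount"]))
        if record in seen:
            continue  # deduplication (assumes identical rows are duplicates)
        seen.add(record)
        out.append((len(out) + 1, *record))  # surrogate warehouse key
    return out

def load(records, conn):
    """LOADING: write the cleaned records to a presentation table."""
    conn.execute(
        "CREATE TABLE sales (sales_key INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT * FROM sales ORDER BY sales_key").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the staging-area idea: business users only ever query the loaded table, never the intermediate rows.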

Data Presentation Area

• where data is organized, stored and made available for direct
querying by users, report writers, and other analytical
applications
• it is all the business community sees and touches via data
access tools
• dimensional data modeling
user understandability
query performance
resilience to change
• detailed, atomic data

Data Access Tools

• tools that query the data in the data
warehouse’s presentation area.
• the variety of capabilities that can
be provided to business users to
leverage the presentation area for
analytic decision making.
prebuilt parameter-driven analytic
applications.
ad hoc query tools.
data mining, modeling, forecasting

Microsoft SQL Server

• SQL Server Integration Services
(SSIS)
• tool for the ETL process
• SQL Server Analysis Services
(SSAS)
• tool for multidimensional
modeling
• SQL Server Reporting Services
(SSRS)
• tool for reporting
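
The dimensional data modeling mentioned for the data presentation area can be illustrated with a tiny star schema: one fact table of detailed, atomic sales rows plus two dimension tables. The sketch below uses Python's built-in `sqlite3` module rather than SQL Server, and every table name, column name, and value is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
-- Fact table: one row per atomic sale, keyed by the dimensions.
CREATE TABLE fact_sales (product_key INTEGER, date_key INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date VALUES (1, 2019), (2, 2020);
INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 2, 5.0), (2, 2, 15.0);
""")

# "Slice and dice": total sales by product category and year.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(rows)
```

Because the facts stay atomic, the same tables can answer other groupings (for example by year only) without reloading any data.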



The KDD Process (Contd.)


Steps of the KDD Process (Contd.)


• Data cleaning to remove noise and inconsistent data.

• Data integration, where multiple data sources may be combined.

• Data selection, where data relevant to the analysis task are retrieved from the
database.

• Data transformation, where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations.

• Data mining, which is an essential process where intelligent methods are
applied to extract data patterns.

• Pattern evaluation to identify the truly interesting patterns representing
knowledge based on interestingness measures.

• Knowledge presentation, where visualization and knowledge representation
techniques are used to present mined knowledge to users.
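
As a rough illustration of how the KDD steps fit together, the sketch below walks a toy purchase dataset through each step in order. The data, the choice of item-pair counting as the mining method, and the 0.5 support threshold are assumptions chosen for brevity; they are not prescribed by the KDD process itself.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources of purchase records: (basket_id, item).
store_a = [(1, "bread"), (1, "milk"), (2, "bread"), (2, "Milk "), (2, "eggs")]
store_b = [(3, "bread"), (3, None), (3, "milk"), (4, "eggs")]

def clean(records):
    """Data cleaning: drop missing items, normalize inconsistent spellings."""
    return [(b, i.strip().lower()) for b, i in records if i]

# Data integration: combine the cleaned sources.
records = clean(store_a) + clean(store_b)

# Data selection: keep only the items relevant to the analysis task.
relevant = {"bread", "milk", "eggs"}
records = [(b, i) for b, i in records if i in relevant]

# Data transformation: aggregate records into one item set per basket.
baskets = {}
for basket_id, item in records:
    baskets.setdefault(basket_id, set()).add(item)

# Data mining: count how often each pair of items occurs together.
pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

# Pattern evaluation: keep pairs whose support meets the chosen threshold.
min_support = 0.5
patterns = {
    pair: n / len(baskets)
    for pair, n in pair_counts.items()
    if n / len(baskets) >= min_support
}

# Knowledge presentation: report the surviving patterns.
for pair, support in sorted(patterns.items()):
    print(f"{pair[0]} + {pair[1]}: support {support:.2f}")
```

Support here is the fraction of baskets containing both items, which is one simple interestingness measure among many.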

CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING

CRISP-DM
An industry- and tool-neutral
data mining process model.
• Business understanding phase
• Data understanding phase
• Data preparation phase
• Modeling phase
• Evaluation phase
• Deployment phase

DM in Businesses

• Process management
• Market basket analysis
• Marketing
• Customer loyalty
• Fraud detection
• Trend analysis

DM in practice

1. Learn about the problem domain
2. Data selection
3. Data cleaning, preprocessing and reduction
4. Data mining
5. Interpretation of information
6. Apply knowledge in domain

• Data preprocessing: Sampling; Normalization; Missing data;
Data conflicts; Duplicate data; Ambiguity in data
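
Several of the data preprocessing concerns listed above (duplicate data, missing data, normalization, sampling) can be shown on a toy table. The records and the specific choices below, mean imputation and min-max scaling, are assumptions for illustration; other strategies are equally valid depending on the data.

```python
import random
from statistics import mean

# Hypothetical records: (id, age, income); None marks a missing value.
rows = [(1, 20, 1000.0), (2, None, 3000.0), (2, None, 3000.0), (3, 40, 5000.0)]

# Duplicate data: keep the first occurrence of each record.
unique = list(dict.fromkeys(rows))

# Missing data: impute the mean of the observed ages (one option among several).
ages = [age for _, age, _ in unique if age is not None]
mean_age = mean(ages)
filled = [(i, age if age is not None else mean_age, inc) for i, age, inc in unique]

# Normalization: min-max scale income to [0, 1] (assumes incomes are not all equal).
incomes = [inc for _, _, inc in filled]
lo, hi = min(incomes), max(incomes)
normalized = [(i, age, (inc - lo) / (hi - lo)) for i, age, inc in filled]

# Sampling: draw a reproducible random sample of two records.
sample = random.Random(0).sample(normalized, k=2)
print(normalized)
```

Data conflicts and ambiguity usually need domain knowledge to resolve, so they are not automated here.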

Guidelines for Successful Data mining


• The data must be available, relevant, adequate and clean;
• There must be a well-defined problem;
• The problem should not be solvable by means of ordinary query or
OLAP tools;
• The results must be actionable.

• Successful data mining in businesses involves the following:
Use a small team with strong internal integration and a loose
management style;
Carry out a small pilot project before a major data mining project;
Identify a clear problem owner responsible for the project, e.g., from
sales or marketing;
Try to realize a positive return on investment within 6 to 12 months;
Have top management back the project up.

Data Attribute Types

• Data quality:
Accuracy
Completeness
Consistency (uniformity)
Validity
Timeliness
• Data cleaning, data cleansing, data scrubbing
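
Some of the data quality dimensions above can be measured directly. The sketch below scores completeness, validity, and consistency on a toy customer table; the records, the "contains @" email rule, and the allowed country codes are hypothetical. Accuracy and timeliness are left out because they need a reference source or timestamps that this toy data does not have.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "country": "NL"},
    {"id": 2, "email": None, "country": "nl"},
    {"id": 3, "email": "not-an-email", "country": "DE"},
    {"id": 4, "email": "d@example.com", "country": "NL"},
]

def completeness(rows, field):
    """Share of rows in which the field is present."""
    return sum(r[field] is not None for r in rows) / len(rows)

def validity(rows, field, is_valid):
    """Share of present values that pass a validity rule."""
    present = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in present) / len(present)

def consistency(rows, field, allowed):
    """Share of rows whose value uses the agreed representation."""
    return sum(r[field] in allowed for r in rows) / len(rows)

email_completeness = completeness(records, "email")
email_validity = validity(records, "email", lambda v: "@" in v)
country_consistency = consistency(records, "country", {"NL", "DE"})
print(email_completeness, email_validity, country_consistency)
```

Scores like these give data cleaning a target: fields with low scores are the first candidates for cleansing.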

Q & A: Comments and Suggestions
