Data Mining (DM) as a KM Tool
Dereje F 12/15/2022 1
What is data mining?
What is data mining…?
• DM is a technology that uses various techniques to discover
hidden knowledge in heterogeneous and distributed
historical data stored in large databases, data warehouses and
other massive information repositories, so as to find patterns in
data that are:
• valid: they not only represent the current state, but also hold on new data
with some certainty
• novel: non-obvious to the system and generated as new facts
• useful: it should be possible to act on the item or problem
• understandable: humans should be able to interpret the pattern
Lots of data
• …..
The four dimensions of Big Data
• Volume: Large volumes of data
• Velocity: Quickly moving data
• Variety: structured, unstructured, images, etc.
• Veracity: Trust in and integrity of the data are both a
challenge and a must for big data
Too much data & too little knowledge
• There is a need to extract knowledge from the massive data.
• Competitive pressure is strong, which creates a need for useful information
for prediction
• Facing enormous volumes of data, human analysts without
special tools can no longer make sense of it.
• DM can automate the process of finding patterns & relationships in raw
data, and the results can be used for decision support. That is why
data mining is used in science, health and business.
• If we know how to reveal valuable knowledge hidden in raw
data, data might be one of our most valuable assets.
• DM is a tool that uses retrospective analysis to extract
diamonds of knowledge from historical data & predict future
outcomes.
Why DM Now?
• Four main reasons for DM now:
1. Competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers' needs?
• How to manage the high turnover rate of professionals?
Why Data Mining…?
• Customer relationship management:
• Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor?
DM vs. Knowledge Discovery in Databases (KDD)
• KDD is often used as a synonym for Data Mining.
• Some authors define KDD as the whole process involving:
• data pre-processing: cleaning, data transformation, mining, result
evaluation, visualization
• KDD is the process model for finding useful information and patterns in
databases
• DM is the use of algorithms to extract hidden patterns & knowledge
from data sets
Data Preparation
• Data cleansing
• Data integration
• Data reduction
• Data transformation
Data Preparation…
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
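The seven steps above can be sketched end to end on toy data. This is only a minimal illustration: the records, the field names (id, age, spend) and the helper functions are hypothetical, and a simple summary statistic stands in for a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records from two made-up sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "spend": 120.0},
            {"id": 2, "age": None, "spend": 80.0}]
source_b = [{"id": 3, "age": 41, "spend": 300.0},
            {"id": 3, "age": 41, "spend": 300.0}]  # exact duplicate


def clean(records):
    """Step 1: drop records with missing values and exact duplicates."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if None in r.values() or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out


def integrate(*sources):
    """Step 2: combine multiple data sources into one list."""
    return [r for s in sources for r in s]


def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records, field):
    """Step 4: min-max scale one numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]


def mine(records):
    """Step 5 (placeholder): a summary statistic instead of a real algorithm."""
    return {"mean_age": mean(r["age"] for r in records)}


integrated = integrate(clean(source_a), clean(source_b))
prepared = transform(select(integrated, ["age", "spend"]), "spend")
pattern = mine(prepared)

# Steps 6-7: evaluation and presentation are reduced to printing here.
print(prepared, pattern)
```

In practice each step is far more involved; the point of the sketch is only the order in which the data flows through the steps.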
Data Collection for Mining
• Data mining requires collecting a great amount of data (available in
data warehouses or databases) to achieve the intended objective.
– Data mining starts by understanding the business or problem
domain in order to gain business knowledge
• Business knowledge guides the process towards useful
results & enables the recognition of those results that are
useful.
• Before feeding data to DM, we have to ensure the quality of
the data
Data Quality Measures
• Well-accepted multidimensional data quality measures are the
following:
• Accuracy (free from errors and outliers)
• Completeness (no missing attributes and values)
• Consistency (no inconsistent values and attributes)
• Timeliness (appropriateness of the data for the purpose it is required)
• Believability (acceptability)
• Interpretability (easy to understand)
• Most of the data in the real world are of poor quality; that is:
• Incomplete, Inconsistent, Noisy, Invalid, Redundant, …
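Some of these quality problems can be checked mechanically before mining. The sketch below measures completeness, counts redundant records and flags invalid values. The records, the field names and the valid age range of 0 to 120 are assumptions made for illustration only.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "age": 30, "city": "Addis Ababa"},
    {"name": "B", "age": None, "city": "Jimma"},
    {"name": "A", "age": 30, "city": "Addis Ababa"},  # redundant copy
    {"name": "C", "age": -5, "city": None},           # invalid age
]


def completeness(records):
    """Fraction of attribute values that are present (not None)."""
    total = sum(len(r) for r in records)
    present = sum(v is not None for r in records for v in r.values())
    return present / total


def redundant_count(records):
    """Number of records that exactly repeat an earlier record."""
    seen, duplicates = set(), 0
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates


def invalid_ages(records, lo=0, hi=120):
    """Records whose age is present but outside the assumed valid range."""
    return [r for r in records
            if r["age"] is not None and not lo <= r["age"] <= hi]


print(f"completeness: {completeness(records):.2f}")
print(f"redundant records: {redundant_count(records)}")
print(f"invalid ages: {invalid_ages(records)}")
```

Measures such as believability and interpretability are judgments about the data and cannot be computed this way.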
Data Mining Main Tasks
Basic Data Mining algorithms
• Classification, also called supervised learning, maps data
into predefined groups or classes to enhance the prediction process
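As one possible illustration of mapping data into predefined classes, the sketch below uses a 1-nearest-neighbour rule: a new point receives the class label of the closest labelled example. The slides do not name a specific classifier, so this choice, the training points and the labels "low" and "high" are all assumptions.

```python
import math

# Hypothetical labelled training data: (feature vector, predefined class).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]


def classify(point, training):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda example: math.dist(point, example[0]))
    return label


print(classify((2.0, 1.5), training))
print(classify((7.0, 9.0), training))
```

The learning is "supervised" because the class labels of the training examples are known in advance.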
Predictive Modeling - Classification
Classification
Predictive Modeling: Fraud Detection
[Figure: plot with axis label "Red Blood Cell Volume"]
Example: clustering ..
• There are three different kinds of padlocks, which can be grouped into three
different clusters.
• The padlocks of the same kind are clustered into a group as shown below:
Example: Clustering Application
• Text/Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach:
• Identify content-bearing terms in each document.
• Form a similarity measure based on the frequencies
of different terms and use it to cluster documents.
– Application:
• Information Retrieval can utilize the clusters to relate
a new document or search term to clustered documents.
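The approach above (content-bearing terms, then a frequency-based similarity measure) can be sketched with term-frequency vectors and cosine similarity. Cosine similarity is one common choice, not necessarily the measure the slides intend. The stop-word list and the three documents are made up for illustration.

```python
import math
from collections import Counter

# Assumed, deliberately tiny stop-word list; real lists are much longer.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def term_frequencies(text):
    """Count content-bearing terms, i.e. words that are not stop words."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns in large data sets",
    "d3": "the padlock is made of steel",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

print(cosine_similarity(tf["d1"], tf["d2"]))  # shares several terms
print(cosine_similarity(tf["d1"], tf["d3"]))  # shares no terms
```

A clustering method would then group documents whose pairwise similarity is high, so d1 and d2 would end up together and d3 apart.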
Quality: What Is Good Clustering?
• The quality of a clustering result depends
on both the similarity measure used by
the method and its implementation
• Key requirement of clustering: Need a good
measure of similarity between instances.
• The quality of a clustering method is also
measured by its ability to discover some or
all of the hidden patterns in the given
datasets
• A good clustering method will produce
high quality clusters with
• high intra-class similarity
• low inter-class similarity
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
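The two criteria can be made concrete by comparing the average distance between points in the same cluster with the average distance between points in different clusters. The sketch below assumes Euclidean distance as the (dis)similarity measure. The points and cluster names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical 2-D points already assigned to two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}


def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    dists = [math.dist(p, q)
             for points in clusters.values()
             for p, q in combinations(points, 2)]
    return sum(dists) / len(dists)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    dists = [math.dist(p, q)
             for (_, a), (_, b) in combinations(clusters.items(), 2)
             for p in a
             for q in b]
    return sum(dists) / len(dists)


intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"intra-cluster: {intra:.3f}, inter-cluster: {inter:.3f}")
```

A small intra-cluster distance corresponds to high intra-class similarity, and a large inter-cluster distance to low inter-class similarity.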
Cluster Evaluation: Hard Problem
• The quality of a clustering is very hard to evaluate because
• We do not know the correct clusters/classes
Cluster Evaluation: Ground Truth
Pattern (Association Rule) Discovery
• Goal is to discover interesting “local” patterns (sequential
patterns) in the data rather than to characterize the data globally
• Also called link analysis (uncovers relationships among data)
Example of Pattern Discovery
• Example in retail: Customer transactions to consumer behavior:
• People who bought “Da Vinci Code” also bought “The Five People You Meet in
Heaven” (www.amazon.com)
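A "bought X, also bought Y" pattern of this kind is usually quantified with support and confidence, the standard association-rule measures, which the slides do not introduce. The sketch below computes both on made-up transactions. "Book X" and "Book Y" are placeholder titles, and the purchase data is hypothetical.

```python
# Hypothetical customer transactions, one set of titles per purchase.
transactions = [
    {"Da Vinci Code", "The Five People You Meet in Heaven"},
    {"Da Vinci Code", "The Five People You Meet in Heaven", "Book X"},
    {"Da Vinci Code", "Book Y"},
    {"Book X", "Book Y"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the share that also
    contain consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


rule_from = {"Da Vinci Code"}
rule_to = {"The Five People You Meet in Heaven"}
print(support(rule_from | rule_to, transactions))
print(confidence(rule_from, rule_to, transactions))
```

A rule is normally reported only when both values exceed thresholds chosen by the analyst.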
Limitations of Data Mining…
• The successful application of data mining requires knowledge
of the domain area as well as of data mining methodology and
tools. Collectively, the data mining team should possess domain
knowledge, statistical and research expertise, and IT and data
mining knowledge and skills.
Future Directions
• Possible directions include the standardization of clinical vocabulary and the
sharing of data across organizations to enhance the benefits of healthcare data
mining applications.
• Because healthcare data are not limited to quantitative data but also include text such as
physicians’ notes and clinical records, it is necessary to also explore the use of text mining to
expand the scope and nature of what healthcare data mining can currently do. In particular, it is
useful to be able to integrate data and text mining.
• It is also useful to look into how digital diagnostic images can be brought into
healthcare data mining applications. Some progress has been made in these areas.
Future Directions
• Data mining and knowledge discovery techniques can be used to
“discover” or identify emergent patterns that could not have otherwise
been detected. Some of these techniques may provide valuable insights.
Class work
• 1. What does DM mean? (2 pts)
• 2. Write the four dimensions of big data. (1 pt)
• 3. What does predictive DM mean? (2 pts)
1. What does high intra-class similarity mean?
2. What does low inter-class similarity mean?
3. What do you mean by supervised learning?
4. Write the common forms of data mining techniques/tasks.
Confusion Matrix for Performance Evaluation
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} \times 100 = \frac{TP + TN}{TP + FN + FP + TN} \times 100$$
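A minimal sketch of the accuracy computation, using the standard cell names: TP (actual Yes, predicted Yes), FN (actual Yes, predicted No), FP (actual No, predicted Yes) and TN (actual No, predicted No). The two label lists are made-up data, and the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FN, FP and TN for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Percentage of all instances that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn) * 100


# Hypothetical actual and predicted class labels for eight instances.
actual = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "No", "No", "No", "No", "Yes", "No"]

counts = confusion_counts(actual, predicted)
print(counts)             # (2, 1, 1, 4)
print(accuracy(*counts))  # 75.0
```

Here six of the eight instances lie on the diagonal of the matrix (TP and TN), which gives 75% accuracy.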