
5. Data Mining (DM) as KM Tool

Dereje F 12/15/2022 1
What is data mining?

• A process of discovering patterns and relationships within large sets of data


• DM finds valuable knowledge hidden in large volumes of
data.
• Data processing using sophisticated data search capabilities
and statistical algorithms to discover patterns and
correlations in large preexisting databases
• DM is the analysis of data and the use of software
techniques for finding patterns and regularities in sets of
data.
• More specifically, one primary goal of DM is to
discover new patterns for the users.


What is data mining…?
• The discovery of new patterns can serve two purposes: description
and prediction.
• The former (description) focuses on finding patterns and presenting
them to users in an interpretable and understandable form.
• Prediction involves identifying variables or fields in the database
and using them to predict future values or behavior of some entities.
• Data mining is a key technology for knowledge management.
• In summary, data mining is key to better business intelligence, and
business intelligence is key to effective knowledge management.
What is data mining…?
• DM is a technology that uses various techniques to discover
hidden knowledge from heterogeneous and distributed
historical data stored in large databases, warehouses and
other massive information repositories, so as to find patterns in
data that are:
• valid: they not only represent the current state, but also hold on new data
with some certainty
• novel: they are non-obvious to the system and are generated as new facts
• useful: it should be possible to act on the item or problem
• understandable: humans should be able to interpret the pattern

Lots of data

• Data come from many quarters.


• Facility records/data
• Population based data
• Social media sites
• Sensors
• Digital images
• Business transactions
• Location-based data

• …..
The four dimensions of Big Data
• Volume: Large volumes of data
• Velocity: Quickly moving data
• Variety: structured, unstructured, images, etc.
• Veracity: Trust in and integrity of the data are a challenge and
a must for big data
Too much data & too little knowledge
• There is a need to extract knowledge from the massive data.
• Competitive pressures are strong, which creates a need for useful information
for prediction.
• Facing enormous volumes of data, human analysts with no
special tools can no longer make sense of them.
• DM can automate the process of finding patterns & relationships in raw
data and the results can be utilized for decision support. That is why
data mining is used in science, health and business areas.
• If we know how to reveal valuable knowledge hidden in raw
data, data might be one of our most valuable assets.
• DM is the tool that involves retrospective analysis to extract
diamonds of knowledge from historical data & predict future
outcomes.
Why DM Now?
• Four main reasons why DM is needed now:
1. The competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers’ needs?
• How to manage the high turnover rate of professionals?
2. Massive data collection: data are produced at an alarming rate & are being
warehoused
3. Computing power is available and affordable
4. Commercial DM products and machine learning algorithms are available
Why Data Mining…?
• Customer relationship management:
• Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor?

• Fraud detection / medical fraud detection:
• Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular customer?
• Health care: physicians identify effective treatments and best
practices
• E-commerce applications
Data mining helps extract such useful information.
Knowledge Discovery in Databases (KDD) process
DM vs. Knowledge Discovery in Databases (KDD)
• KDD is often used as a synonym for data mining.
• Some authors define KDD as the whole process involving:
• data pre-processing: cleaning → data transformation → mining → result
evaluation → visualization
• KDD is the process model for finding useful information and patterns in
databases
• DM is the use of algorithms to extract hidden patterns & knowledge
in data sets
Data Preparation
• Data cleansing
• Data integration
• Data reduction
• Data transformation
Data Preparation…
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
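
The first four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two record sources (`clinic`, `lab`), their field names and the age-group cut-off are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical raw records from two sources (illustrative data only).
clinic = [
    {"id": 1, "age": 34, "visits": 3},
    {"id": 2, "age": None, "visits": 5},  # incomplete record
    {"id": 3, "age": 51, "visits": 2},
]
lab = [{"id": 1, "test": "A"}, {"id": 3, "test": "B"}, {"id": 3, "test": "A"}]

# 1. Data cleaning: drop records that contain missing values.
cleaned = [r for r in clinic if all(v is not None for v in r.values())]

# 2. Data integration: combine the two sources on the shared "id" key.
tests_by_id = defaultdict(list)
for row in lab:
    tests_by_id[row["id"]].append(row["test"])
integrated = [{**r, "tests": tests_by_id[r["id"]]} for r in cleaned]

# 3. Data selection: keep only the attributes relevant to the analysis task.
selected = [{"age": r["age"], "tests": r["tests"]} for r in integrated]

# 4. Data transformation: consolidate into a summarized form for mining.
transformed = [
    {"age_group": "50+" if r["age"] >= 50 else "<50", "n_tests": len(r["tests"])}
    for r in selected
]
```

Steps 5 to 7 (mining, pattern evaluation and presentation) would then operate on `transformed` rather than on the raw records.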

Data Collection for Mining
• Data mining requires collecting a great amount of data (available in
data warehouses or databases) to achieve the intended objective.
– Data mining starts by understanding the business or problem
domain in order to gain the business knowledge
• Business knowledge guides the process towards useful
results & enables the recognition of those results that are
useful.
• Before feeding data to DM, we have to ensure the quality of the
data
Data Quality Measures
• Well-accepted multidimensional data quality measures are the
following:
• Accuracy (free from errors and outliers)
• Completeness (no missing attributes and values)
• Consistency (no inconsistent values and attributes)
• Timeliness (appropriateness of the data for the purpose it is required)
• Believability (acceptability)
• Interpretability (easy to understand)

• Most of the data in the real world are of poor quality; that is:
• Incomplete, Inconsistent, Noisy, Invalid, Redundant, …

Data Mining Main Tasks

Basic Data Mining algorithms
• Classification: also called supervised learning; maps data
into predefined groups or classes to enhance the prediction process.
• Clustering: also called unsupervised learning; groups
similar data together into clusters.
• It is used to find appropriate groupings of elements for a set of data.
• Unlike classification, clustering is a kind of undirected knowledge discovery or
unsupervised learning; that is, there is no target field, & the relationship among
the data is identified by a bottom-up approach.
• Association rule: also known as market basket analysis.
• It discovers interesting associations between attributes contained in a database.
• Based on the frequency of occurrence of items in events, an association
rule tells us, if item X is part of an event, how likely it is that item Y is
also part of the event.
Predictive Modeling - Classification

• Classification is a data mining (machine learning) technique
used to predict group membership for data instances.
• Given a collection of records (training set), each record
contains a set of attributes; one of the attributes is the class.
• Goal: previously unseen records should be assigned a class as
accurately as possible. A test set is used to determine the accuracy
of the model.
• Usually, the given data set is divided into training and test sets,
with the training set used to build the model and the test set used to
validate it.
• For example, one may use classification to predict whether the
weather on a particular day will be “sunny”, “rainy” or “cloudy”.
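
The train/test procedure described above can be sketched as follows. The slides do not name a specific classifier, so a 1-nearest-neighbour rule is used here purely as an assumption; the weather records and their two attributes (humidity and cloud cover) are hypothetical.

```python
import math

def nearest_neighbor_predict(train, features):
    """Return the class label of the training record closest to `features`."""
    closest = min(train, key=lambda rec: math.dist(rec[0], features))
    return closest[1]

# Hypothetical records: (humidity %, cloud cover %) -> weather class.
records = [
    ((20, 10), "sunny"), ((25, 15), "sunny"), ((30, 5), "sunny"),
    ((90, 95), "rainy"), ((85, 90), "rainy"), ((95, 85), "rainy"),
    ((55, 70), "cloudy"), ((60, 75), "cloudy"), ((50, 65), "cloudy"),
]

# Divide the data: every third record is held out as the test set,
# the remaining records form the training set.
test_set = records[2::3]
train_set = [r for i, r in enumerate(records) if i % 3 != 2]

# Validate on the test set: fraction of held-out records classified correctly.
correct = sum(nearest_neighbor_predict(train_set, x) == y for x, y in test_set)
accuracy = correct / len(test_set)
```

The test records are never shown to the classifier while it is "built", which is what makes the resulting accuracy an estimate of performance on unseen records.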
Predictive Modeling…
• Predictive modeling is a statistical process by which
historical data is analyzed in order to create an algorithm
that can be used to determine the likelihood of a future
event.
• Predictive modeling helps identify the risk of an outcome,
based on an in-depth understanding and analysis of what
has happened in the past.
• A predictive model makes a prediction (forecast) about
values of data using known results found from different
historical data.
• Prediction methods use existing variables to predict unknown or
future values of other variables.
DM Task: Predictive Modeling…
• Predict one variable Y given a set of other variables X. Here X could be
an n-dimensional vector
• In effect, this is function approximation through learning the relationship between Y and X
• There are many algorithms for predictive modeling in statistics and machine
learning, including classification, regression, decision trees, neural networks,
etc.
Classification

• Example: Credit scoring


• Differentiating between low-risk and high-risk customers from their
income and savings
Discriminant rule: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
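
A minimal sketch of this discriminant rule as a function. The threshold values passed in for θ1 and θ2 below are hypothetical; in practice they would be learned from training data.

```python
def credit_risk(income, savings, theta1, theta2):
    """Apply the rule: IF income > theta1 AND savings > theta2 THEN low-risk."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

# Hypothetical thresholds, for illustration only.
THETA1 = 30_000  # income threshold
THETA2 = 5_000   # savings threshold

label_a = credit_risk(45_000, 8_000, THETA1, THETA2)
label_b = credit_risk(45_000, 1_000, THETA1, THETA2)
```

Both conditions must hold for a customer to be rated low-risk, so a high income alone (as for `label_b`) is not sufficient.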
Models and Patterns
• Model = abstract representation of given training
data
e.g., a very simple linear model structure
Y = aX + b
• a & b are parameters determined from the data
• Y = aX + b is the model structure
• Pattern represents “local structure” in a dataset

Example data (the value of f(x) at x = 5 is unknown):
x    f(x)
1    1
2    4
3    9
4    16
5    ?
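
As a sketch of how a and b can be "determined from the data", the snippet below fits Y = aX + b by ordinary least squares to the four known (x, f(x)) pairs on this slide, (1, 1), (2, 4), (3, 9) and (4, 16), and then predicts f(5). Least squares is an assumed fitting method; the slide does not say which one is intended.

```python
# Known (x, f(x)) pairs from the slide.
xs = [1, 2, 3, 4]
ys = [1, 4, 9, 16]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares estimates of the slope a and intercept b.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - a * mean_x

# Prediction of the linear model at the unknown point x = 5.
prediction = a * 5 + b
```

The fitted line predicts 20 at x = 5, whereas the data follow f(x) = x², which would give 25. The example shows that the chosen model structure, not only the fitted parameters, limits how well a model can predict.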

Predictive Modeling: Fraud Detection

• Credit card fraud detection
• Credit card losses in the US are over $1 billion per
year
• Roughly 1 in 50 transactions are fraudulent
• Approach
• For each transaction, estimate p(fraudulent |
transaction)
• Model is built on historical data of known fraud/non-fraud
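
One very simple way to estimate p(fraudulent | transaction) from labeled history is a relative frequency conditioned on a single transaction attribute. The sketch below assumes a hypothetical attribute (`channel`) and hypothetical counts; a real model would condition on many attributes at once.

```python
from collections import Counter

# Hypothetical labeled history: (channel, known_fraud).
history = [
    ("online", True), ("online", False), ("online", False), ("online", True),
    ("in_store", False), ("in_store", False), ("in_store", False), ("in_store", True),
]

totals = Counter(channel for channel, _ in history)
frauds = Counter(channel for channel, is_fraud in history if is_fraud)

def p_fraud(channel):
    """Estimate p(fraudulent | channel) as a relative frequency in the history."""
    return frauds[channel] / totals[channel]

p_online = p_fraud("online")
p_in_store = p_fraud("in_store")
```

A new transaction would then be flagged when its estimated probability exceeds a chosen threshold.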
DM Task: Descriptive Modeling
• Goal is to build a “descriptive” model that models the underlying
observation
– e.g., a model that could simulate the data if needed
• A descriptive model identifies patterns or relationships in data
• Unlike the predictive model, a descriptive model serves as a way to explore the
properties of the data examined, not to predict new properties
[Figure: EM iteration 25; x-axis: Red Blood Cell Volume, y-axis: Red Blood Cell Hemoglobin Concentration]
• Description methods find human-interpretable patterns that describe and
find natural groupings of the data.
• Methods used in descriptive modeling are: clustering, summarization,
association rule discovery, etc.
Example of Descriptive Modeling
• Goal: learn directed relationships among variables
• Techniques: directed (causal) graphs
• Challenge: distinguishing between correlation and causation
• Example: Do yellow fingers cause lung cancer?

[Diagram: yellow fingers → cancer (?); hidden cause: smoking, linked to both yellow fingers and cancer]
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one
cluster
• Given a set of points, with a notion of
distance between points, group the
points into some number of clusters,
so that members of a cluster are in
some sense as close to each other as
possible.
• While data points in the same cluster
are similar, those in separate clusters
are dissimilar to one another.
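
The slides do not name a particular clustering algorithm. As one common choice, the sketch below uses k-means with Euclidean distance; the points and the two starting centroids are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```

The number of clusters is fixed by the number of starting centroids, which is one reason the result depends on how the algorithm is initialized.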
Example: clustering
• The example below demonstrates the clustering of padlocks of the
same kind. There are a total of 10 padlocks, which vary in
color, size, shape, etc.

• How many possible clusters of padlocks can be identified?

Example: clustering…
• There are three different kinds of padlocks, which can be grouped into three
different clusters.
• The padlocks of the same kind are clustered into a group as shown below:
Example: Clustering Application
• Text/Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach:
• Identify content-bearing terms in each document.
• Form a similarity measure based on the frequencies
of different terms and use it to cluster documents.
– Application:
• Information retrieval can utilize the clusters to relate
a new document or search term to clustered
documents.
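
A similarity measure based on term frequencies can be sketched with cosine similarity, which is an assumed choice here since the slides do not specify the measure. The three documents are hypothetical, and the sketch counts every word rather than only content-bearing terms, which a real system would select first (for example by removing stop words).

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased term occurs in the text."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
docs = [
    "data mining finds patterns in data",
    "mining data reveals hidden patterns",
    "football players score goals",
]
tfs = [term_frequencies(d) for d in docs]

sim_related = cosine_similarity(tfs[0], tfs[1])
sim_unrelated = cosine_similarity(tfs[0], tfs[2])
```

The first two documents share several terms and receive a higher similarity than the unrelated third one, so a clustering method using this measure would tend to group them together.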
Quality: What Is Good Clustering?
• The quality of a clustering result depends
on both the similarity measure used by
the method and its implementation
• Key requirement of clustering: need a good
measure of similarity between instances.
• The quality of a clustering method is also
measured by its ability to discover some or
all of the hidden patterns in the given
datasets
• A good clustering method will produce
high-quality clusters with
• high intra-class similarity
• low inter-class similarity
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
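
The two criteria can be checked numerically. The sketch below assumes Euclidean distance, measures intra-cluster distance as the mean pairwise distance within a cluster, and measures inter-cluster distance as the distance between cluster centroids. These are common conventions rather than definitions given in the slides, and the two clusters are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

# Hypothetical clusters.
cluster_a = [(1, 1), (1, 2), (2, 1)]
cluster_b = [(8, 8), (9, 8), (8, 9)]

intra_a = mean_intra_cluster_distance(cluster_a)
intra_b = mean_intra_cluster_distance(cluster_b)
inter_ab = math.dist(centroid(cluster_a), centroid(cluster_b))
```

Small intra-cluster distances correspond to high intra-class similarity, and a large inter-cluster distance corresponds to low inter-class similarity.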
Cluster Evaluation: Hard Problem
• The quality of a clustering is very hard to evaluate because
• we do not know the correct clusters/classes
• Some methods are used:
• User inspection
• Study the centroids of the clusters, and the spread of data items in each cluster
• For text documents, one can read some documents in clusters to
evaluate the quality of the clustering algorithms employed.
Cluster Evaluation: Ground Truth

• We use some labeled data (for classification)
• Assumption: Each class is a cluster.
• After clustering, a confusion matrix is constructed. From the
matrix, we compute various measurements: entropy, purity,
precision and recall.
• Let the classes in the data D be C = (c1, c2, …, ck). The clustering
method produces k clusters
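
Purity and entropy can be computed directly from such a confusion matrix. The sketch below uses the usual definitions (purity as the share of points in the majority class of their cluster, entropy as the size-weighted average of per-cluster class entropy); the counts are hypothetical.

```python
import math

# Hypothetical confusion matrix: rows = clusters, columns = true classes.
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
total = sum(sum(row) for row in confusion)

# Purity: fraction of points that belong to the majority class of their cluster.
purity = sum(max(row) for row in confusion) / total

def cluster_entropy(row):
    """Entropy (in bits) of the class distribution inside one cluster."""
    n = sum(row)
    return -sum((c / n) * math.log2(c / n) for c in row if c > 0)

# Total entropy: per-cluster entropies weighted by cluster size.
entropy = sum(sum(row) / total * cluster_entropy(row) for row in confusion)
```

Higher purity and lower entropy both indicate that the clusters line up more closely with the known classes.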

Pattern (Association Rule) Discovery
• Goal is to discover interesting “local” patterns (sequential
patterns) in the data rather than to characterize the data globally
• Also called link analysis (uncovers relationships among data)

• Given market basket data, we might discover that
• if customers buy wine and bread, then they buy cheese with probability
0.9
• Methods used in pattern discovery include:
• association rules, sequence discovery, etc.
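
The probability attached to a rule such as {wine, bread} → {cheese} is usually called its confidence and is computed from itemset frequencies (support). The baskets below are hypothetical and are not meant to reproduce the 0.9 figure quoted above.

```python
# Hypothetical market basket data: one set of items per transaction.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"wine", "bread", "cheese"})
rule_confidence = confidence({"wine", "bread"}, {"cheese"})
```

Rules are normally kept only if both values exceed user-chosen minimum thresholds, which filters out patterns that are rare or unreliable.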

Example of Pattern Discovery
• Example in retail: from customer transactions to consumer behavior:
• People who bought “Da Vinci Code” also bought “The Five People You Meet in
Heaven” (www.amazon.com)
• Example: football player behavior
• If player A is in the game, player B’s scoring rate increases from a 25% chance
per game to a 95% chance per game
• What about the following?

ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDA
BDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBA
DDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCA
CCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCA
BDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBA
ADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADB
CAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCB
BCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCD
ABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
Limitations of Data Mining
• Healthcare data mining can be limited by the accessibility of data, because the raw inputs
for data mining often exist in different settings and systems, such as administration, clinics,
laboratories and more
• Problem: a data warehouse is costly
• Missing, corrupted, inconsistent, or non-standardized data, such as pieces of information
recorded in different formats in different data sources.
• In particular, the lack of a standard clinical vocabulary is a serious hindrance to data mining.
• Data problems in healthcare are the result of the volume, complexity and heterogeneity of
medical data and their poor mathematical characterization and non-canonical form.
• Further, there may be ethical, legal and social issues, such as data ownership and privacy
issues, related to healthcare data.
• The quality of data mining results and applications depends on the quality of data

Limitations of Data Mining…
• The successful application of data mining requires knowledge
of the domain area as well as of data mining methodology and
tools. Collectively, the data mining team should possess domain
knowledge, statistical and research expertise, and IT and data
mining knowledge and skills.
Future Directions
• Possible directions include the standardization of clinical vocabulary and the
sharing of data across organizations to enhance the benefits of healthcare data
mining applications.
• As healthcare data are not limited to just quantitative data but also include text, such as physicians’ notes or
clinical records, it is necessary to also explore the use of text mining to expand the
scope and nature of what healthcare data mining can currently do. In particular, it is
useful to be able to integrate data and text mining.
• It is also useful to look into how digital diagnostic images can be brought into
healthcare data mining applications. Some progress has been made in these areas.
Future Directions…
• Data mining and knowledge discovery techniques can be used to
“discover” or identify emergent patterns that could not have otherwise
been detected. Some of these techniques may provide valuable insights.

Class work
• 1. What does DM mean? (2 pts)
• 2. Write the four dimensions of big data. (1 pt)
• 3. What does predictive DM mean? (2 pts)

1. What does high intra-class similarity mean?
2. What does low inter-class similarity mean?
3. What is meant by supervised learning?
4. What are the common forms of data mining techniques/tasks?

Confusion Matrix for Performance Evaluation
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL   Class=Yes    a (TP)       b (FN)
CLASS    Class=No     c (FP)       d (TN)

• The most widely used metric is the accuracy of the system:
• Accuracy rate is the percentage of test set samples that are correctly classified by the
model

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} \times 100 = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$
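
A short sketch of the accuracy calculation, taking a and d as the correctly classified counts (actual Yes predicted Yes, and actual No predicted No) and b and c as the misclassified ones. The counts used in the example call are hypothetical.

```python
def accuracy_percent(a, b, c, d):
    """Percentage of test samples classified correctly.

    a: actual Yes, predicted Yes    b: actual Yes, predicted No
    c: actual No,  predicted Yes    d: actual No,  predicted No
    """
    return 100 * (a + d) / (a + b + c + d)

# Hypothetical test-set counts.
result = accuracy_percent(a=40, b=10, c=5, d=45)
```

Accuracy treats both kinds of error alike, so it can be misleading when one class is much rarer than the other, as in fraud detection.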
