Data Mining and Its Application and Usage in Medicine
CSE 300
By Radhika
Data Mining and Medicine
History
Past 20 years: relational databases
More dimensions to database queries
Epidemiology was one of the earliest and most successful areas of data mining
Mid-1800s: London hit by an infectious disease (the cholera outbreaks studied by John Snow)
Two theories
– Miasma theory: bad air propagated the disease
– Germ theory: the disease was water-borne
Advantages
– Discover trends even when we don’t understand the reasons behind them
– Caveat: can also discover irrelevant patterns that confuse rather than enlighten
– Protection against unaided human inference of patterns: provide
quantifiable measures and aid human judgment
Data Mining
Finds patterns that are persistent and meaningful
Also known as Knowledge Discovery from Data (KDD)
The future of data mining
10 biggest killers in the US
Major Issues in Medical Data Mining
Heterogeneity of medical data
Volume and complexity
Physician’s interpretation
Poor mathematical categorization
Lack of a canonical form
Solution: standard vocabularies, interfaces
between different data sources, integration,
design of electronic patient records
Ethical, Legal and Social Issues
Data Ownership
Lawsuits
Privacy and Security of Human Data
Expected benefits
Administrative Issues
Why Data Preprocessing?
Patient records consist of clinical and lab parameters and
results of particular investigations, specific to tasks
Incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or
names
Temporal: parameters of chronic diseases change over time
No quality data, no quality mining results!
Data warehouse needs consistent integration of
quality data
In the medical domain, handling incomplete,
inconsistent or noisy data requires people with
domain knowledge
What is Data Mining? The KDD Process
(Figure: the KDD pipeline — Databases → Data Cleaning → Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation)
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data
model that views data in the form of a data cube
A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand,
type), or time(day, week, month, quarter, year)
Fact table contains measures (such as
dollars_sold) and keys to each of related dimension
tables
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory,
banking, manufacturing, payroll, registration,
accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical,
consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex
queries
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP
queries, multidimensional view, consolidation
Different functions and different data:
Missing data: Decision support requires historical
data which operational DBs do not typically
maintain
Data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
Data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
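Roll-up and slice can be sketched on a toy fact table (the item/quarter/city dimensions and dollar amounts below are made up for illustration, not from the slides):

```python
# A toy "cube": (item, quarter, city) -> dollars_sold facts.
facts = {
    ("TV", "Q1", "NY"): 100, ("TV", "Q1", "LA"): 80,
    ("TV", "Q2", "NY"): 120, ("PC", "Q1", "NY"): 200,
}

def roll_up(facts, keep):
    """Summarize by keeping only the dimensions at positions `keep`."""
    out = {}
    for dims, measure in facts.items():
        key = tuple(dims[i] for i in keep)
        out[key] = out.get(key, 0) + measure
    return out

def slice_(facts, dim, value):
    """Select the sub-cube where dimension `dim` equals `value`."""
    return {d: m for d, m in facts.items() if d[dim] == value}

by_item = roll_up(facts, keep=(0,))   # climb from (item, quarter, city) up to item
q1_only = slice_(facts, dim=1, value="Q1")
```

Rolling up drops (aggregates away) dimensions, while slicing selects along one of them; dicing would simply filter on several dimensions at once.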
Multi-Tiered Architecture
(Figure: bottom tier — operational DBs and other sources feed Extract/Transform/Load/Refresh into the Data Warehouse and Data Marts, with a Monitor and a Metadata & Integrator component; middle tier — OLAP server; top tier — front-end tools for analysis, queries, reports and data mining)
Steps of a KDD Process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining
summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns,
etc.
Use of discovered knowledge
Common Techniques in Data Mining
Predictive Data Mining
Most important
Classification: Relate one set of variables in data to
response variables
Regression: estimate some continuous value
Descriptive Data Mining
Clustering: Discovering groups of similar instances
Association rule extraction
Summarization of group descriptions over variables/observations
Leukemia
Different types of cells look very similar
Given a number of samples (patients)
can we diagnose the disease accurately?
Predict the outcome of treatment?
Recommend the best treatment based on previous
treatments?
Solution: Data mining on micro-array data
38 training patients, 34 testing patients, ~7000
attributes per patient
2 classes: Acute Lymphoblastic Leukemia(ALL) vs
Acute Myeloid Leukemia (AML)
Clustering/Instance Based Learning
Uses specific instances to perform classification rather than
general IF-THEN rules
Nearest Neighbor classifier
Most studied algorithms for medical purposes
Clustering– Partitioning a data set into several groups
(clusters) such that
Homogeneity: Objects belonging to the same cluster are
similar to each other
Separation: Objects belonging to different clusters are
dissimilar to each other.
Three elements
The set of objects
The set of attributes
Distance measure
Measure the Dissimilarity of Objects
Find best matching instance
Distance function
Measure the dissimilarity between a pair of
data objects
Things to consider
Usually very different for interval-scaled,
boolean, nominal, ordinal and ratio-scaled
variables
Weights should be associated with different
variables based on the application and data
semantics
Quality of a clustering result depends on both the
distance measure adopted and its implementation
Minkowski Distance
Minkowski distance: a generalization
d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q )^(1/q),  q > 0
If q = 2, d is Euclidean distance
If q = 1, d is Manhattan distance
Example: x_i = (1, 7), x_j = (7, 1)
q = 1: d = 6 + 6 = 12
q = 2: d = √(6² + 6²) ≈ 8.48
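The example distances can be checked with a small helper:

```python
def minkowski(x, y, q):
    """d(i, j) = (sum_k |x_k - y_k|^q)^(1/q), q > 0."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
d_manhattan = minkowski(xi, xj, q=1)   # 6 + 6 = 12
d_euclidean = minkowski(xi, xj, q=2)   # sqrt(36 + 36) ≈ 8.48
```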
Binary Variables
A contingency table for binary data
              Object j
              1      0      sum
Object i   1  a      b      a+b
           0  c      d      c+d
       sum    a+c    b+d    p
d(i, j) = (b + c) / (a + b + c + d)
Dissimilarity between Binary Variables
Example
          A1 A2 A3 A4 A5 A6 A7
Object 1   1  0  1  1  1  0  0
Object 2   1  1  1  0  0  0  1

              Object 2
              1    0    sum
Object 1   1  2    2    4
           0  2    1    3
       sum    4    3    7
d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7
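The contingency counts and the dissimilarity 4/7 can be reproduced directly from the two attribute vectors:

```python
def binary_dissimilarity(x, y):
    """Simple matching dissimilarity (b + c) / (a + b + c + d)
    for symmetric binary variables."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)  # both 1
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)  # both 0
    return (b + c) / (a + b + c + d)

o1 = [1, 0, 1, 1, 1, 0, 0]
o2 = [1, 1, 1, 0, 0, 0, 1]
d12 = binary_dissimilarity(o1, o2)   # (2 + 2) / 7 ≈ 0.571
```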
k-Means Algorithm
Initialization
Arbitrarily choose k objects as the initial cluster
centers (centroids)
Iteration until no change
For each object Oi, assign Oi to the cluster with the
nearest centroid, then recompute the centroids
k-Means Clustering Method
(Figure: four scatter plots of the objects on a 0–10 grid showing the k-means iterations — objects assigned to the current cluster means, the means relocated, and objects reassigned to new clusters until no change)
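The assign/relocate loop in the figure can be sketched in a few lines (toy 2-D points, not the medical data):

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Plain k-means: pick k objects as initial centroids, then alternate
    assignment and centroid-update steps until assignments stabilize."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: relocate each centroid to its cluster mean
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:   # no change -> converged
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(pts, k=2)
```

On these two well-separated blobs the loop converges to centroids near (1.33, 1.33) and (8.33, 8.33) regardless of which two points are sampled first.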
Dataset
Data set from UCI repository
http://kdd.ics.uci.edu/
768 female Pima Indians evaluated for diabetes
After data cleaning: 392 data entries
Hierarchical Clustering
Groups observations based on dissimilarity
Compacts database into “labels” that represent the
observations
Measure of similarity/Dissimilarity
Euclidean Distance
Manhattan Distance
Types of Clustering
Single Link
Average Link
Complete Link
Hierarchical Clustering: Comparison
Single-link: merges the two clusters with the smallest minimum pairwise distance
Complete-link: merges the two clusters with the smallest maximum pairwise distance
(Figure: the same six points grouped differently by single-link and complete-link clustering)
Compare Dendrograms
Single-link vs. Complete-link
(Figure: the two dendrograms over leaves 1 2 5 3 6 4, showing different merge heights and groupings)
Which Distance Measure is Better?
Each method has both advantages and disadvantages;
application-dependent
Single-link
Can find irregular-shaped clusters
Sensitive to outliers
Complete-link, Average-link, and Centroid distance
Robust to outliers
Tend to break large clusters
Prefer spherical clusters
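The merge loop behind both linkages can be sketched in pure Python; `linkage` is `min` for single-link or `max` for complete-link (toy 1-D points for brevity):

```python
def agglomerate(points, k, linkage=min):
    """Greedy agglomerative clustering: repeatedly merge the two closest
    clusters (under the chosen linkage) until only k clusters remain."""
    dist = lambda a, b: abs(a - b)   # 1-D points for simplicity
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)   # merge the closest pair
    return clusters

pts = [1, 2, 3, 10, 11, 12]
single = agglomerate(pts, k=2, linkage=min)     # {1,2,3} and {10,11,12}
complete = agglomerate(pts, k=2, linkage=max)
```

With chain-shaped data, single-link (`min`) happily follows the chain, which is what makes it both good at irregular clusters and sensitive to outliers.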
Dendrogram from dataset
(Figure: dendrogram computed from the dataset)
Classification and Prediction
Derive models that can use patient-specific
information to aid clinical decision making
A priori decision on predictors and variables to predict
No method to find predictors that are not present in the
data
Numeric Response
Least Squares Regression
Categorical Response
Classification trees
Neural Networks
Support Vector Machine
Decision models
Prognosis, Diagnosis and treatment planning
Embed in clinical information systems
Least Squares Regression
Find a linear function of the predictor variables that
minimizes the sum of squared differences with the response
Supervised learning technique
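For a single predictor the least-squares fit has a closed form, sketched below on made-up numbers (not the Pima data):

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing sum((y - (a*x + b))^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept passes through the means
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                     # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)   # → (2.0, 1.0)
```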
Decision Trees
Decision tree
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
ID3 algorithm
Based on training objects with known class labels to
classify testing objects
Rank attributes with information gain measure
Minimal height
least number of tests to classify an object
Used in commercial tools, e.g. Clementine
ASSISTANT
Deal with medical datasets
Incomplete data
Discretize continuous variables
Prune unreliable parts of tree
Classify data
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Attributes are categorical (if continuous-valued,
they are discretized in advance)
Tree is constructed in a top-down recursive
divide-and-conquer manner
At start, all training examples are at the root
Test attributes are selected on basis of a heuristic
or statistical measure (e.g., information gain)
Examples are partitioned recursively based on
selected attributes
Training Dataset
      Age     BMI     Hereditary  Vision     Risk of Condition X
P1    <=30    high    no          fair       no
P2    <=30    high    no          excellent  no
P3    >40     high    no          fair       yes
P4    31…40   medium  no          fair       yes
P5    31…40   low     yes         fair       yes
P6    31…40   low     yes         excellent  no
P7    >40     low     yes         excellent  yes
P8    <=30    medium  no          fair       no
P9    <=30    low     yes         fair       yes
P10   31…40   medium  yes         fair       yes
P11   <=30    medium  yes         excellent  yes
P12   >40     medium  no          excellent  yes
P13   >40     high    yes         fair       yes
P14   31…40   medium  no          excellent  no
Construction of A Decision Tree for “Condition X”
(Figure: decision tree for “Condition X”. Root: Age over [P1…P14], Yes: 9, No: 5.
  Age <=30 → [P1,P2,P8,P9,P11] (Yes: 2, No: 3): test Hereditary — no → NO, yes → YES
  Age 31…40 → [P4,P5,P6,P10,P14] (Yes: 3, No: 2): test Vision — excellent → NO, fair → YES
  Age >40 → [P3,P7,P12,P13] (Yes: 4, No: 0): leaf YES)
Entropy and Information Gain
S contains s_i tuples of class C_i for i = {1, …, m}
Information measure (info required to classify any arbitrary tuple):
I(s_1, s_2, …, s_m) = − Σ_{i=1}^{m} (s_i / s) log2(s_i / s)
Entropy of attribute A with values {a_1, a_2, …, a_v}:
E(A) = Σ_{j=1}^{v} ((s_1j + … + s_mj) / s) · I(s_1j, …, s_mj)
Information gained by branching on attribute A:
Gain(A) = I(s_1, s_2, …, s_m) − E(A)
Entropy and Information Gain
Select attribute with the highest information gain (or
greatest entropy reduction)
Such attribute minimizes information needed to
classify samples
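The ID3 attribute selection can be reproduced on the “Condition X” training set; Age indeed gives the highest gain:

```python
import math

# Rows follow the training table: (age, bmi, hereditary, vision, condition_x).
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    (">40", "high", "no", "fair", "yes"),
    ("31…40", "medium", "no", "fair", "yes"),
    ("31…40", "low", "yes", "fair", "yes"),
    ("31…40", "low", "yes", "excellent", "no"),
    (">40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    ("31…40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    (">40", "medium", "no", "excellent", "yes"),
    (">40", "high", "yes", "fair", "yes"),
    ("31…40", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    """I(s1, ..., sm) over the class label in the last column."""
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    """Gain(A) = I(...) - E(A) for splitting `rows` on column `attr`."""
    split = {}
    for r in rows:
        split.setdefault(r[attr], []).append(r)
    n = len(rows)
    remainder = sum(len(s) / n * entropy(s) for s in split.values())
    return entropy(rows) - remainder

g_age = gain(data, 0)   # Age is chosen as the root, as in the tree above
```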
Rule Induction
IF conditions THEN Conclusion
E.g.: CN2
Concept description:
Characterization: provides a concise and succinct summarization
of given collection of data
Comparison: provides descriptions comparing two or more
collections of data
Example used in a Clinic
Hip arthroplasty: trauma surgeon predicts patient’s long-
term clinical status after surgery
Outcome evaluated during follow-ups for 2 years
2 modeling techniques
Naïve Bayesian classifier
Decision trees
Bayesian classifier
P(outcome=good) = 0.55 (11/20 good)
Probability gets updated as more attributes are
considered
P(timing=good | outcome=good) = 9/11 ≈ 0.818
P(outcome=bad) = 9/20
P(timing=good | outcome=bad) = 5/9
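With those numbers, the Bayesian update for a single piece of evidence (timing=good) can be sketched as a normalized prior × likelihood:

```python
def naive_bayes_posterior(prior_good, p_evi_good, p_evi_bad):
    """P(outcome=good | evidence): posterior ∝ prior × P(evidence | class),
    normalized over the two possible outcomes."""
    num_good = prior_good * p_evi_good
    num_bad = (1 - prior_good) * p_evi_bad
    return num_good / (num_good + num_bad)

# Slide values: 11/20 good outcomes, P(timing=good|good) = 9/11,
# P(timing=good|bad) = 5/9.
p = naive_bayes_posterior(11/20, 9/11, 5/9)   # 9/14 ≈ 0.643
```

Observing good timing raises the probability of a good outcome from 0.55 to about 0.64; further attributes would multiply in the same way under the naïve independence assumption.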
Nomogram
(Figure: nomogram visualizing the naïve Bayesian model)
Bayesian Classification
Bayesian classifier vs. decision tree
Decision tree: predict the class label
Bayesian classifier: statistical classifier; predict
class membership probabilities
Based on Bayes theorem; estimate posterior
probability
Naïve Bayesian classifier:
Simple classifier that assumes attribute
independence
High speed when applied to large databases
Comparable in performance to decision trees
Bayes Theorem
Let X be a data sample whose class label is unknown
Let Hi be the hypothesis that X belongs to a particular
class Ci
P(Hi) is class prior probability that X belongs to a
particular class Ci
Can be estimated by ni/n from training data
samples
n is the total number of training data samples
ni is the number of training data samples of class Ci
P(Hi | X) = P(X | Hi) · P(Hi) / P(X)
More classification Techniques
Neural Networks
Similar to pattern recognition properties of biological
systems
Most frequently used
Multi-layer perceptrons
– Input with bias, connected by weights to hidden, output
Backpropagation neural networks
Support Vector Machines
Separate the database into mutually exclusive regions
Transform to another problem space
Kernel functions (dot product)
Output of new points predicted by position
Comparison with classification trees
Not possible to know which features or combination of
features most influence a prediction
Multilayer Perceptrons
Apply non-linear transfer functions to weighted sums of
inputs
Werbos algorithm
Random weights
Training set, Testing set
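A single forward pass through a tiny perceptron makes the "non-linear transfer of weighted sums" concrete; the layer sizes and weights below are invented for illustration (training them would be backpropagation's job):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """2-layer perceptron: sigmoid of a weighted sum at each unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

y = mlp_forward([0.5, -1.0],
                w_hidden=[[1.0, -0.5], [0.25, 0.75]],   # 2 hidden units
                b_hidden=[0.0, 0.1],
                w_out=[1.5, -2.0],
                b_out=0.5)
# y lies in (0, 1) and can be read as a class probability
```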
Support Vector Machines
3 steps
Support vector creation
Find the maximal distance (margin) between the classes
Perpendicular decision boundary
Allows some points to be misclassified
Pima Indian data with X1(glucose) X2(BMI)
What is Association Rule Mining?
Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction
databases, relational databases, and other information
repositories
PatientID  Conditions
1          High LDL, Low HDL, High BMI, Heart Failure
2          High LDL, Low HDL, Heart Failure, Diabetes
3          Diabetes
4          High LDL, Low HDL, Heart Failure
5          High BMI, High LDL, Low HDL, Heart Failure

Example association rule:
{High LDL, Low HDL} → {Heart Failure}
People who have high LDL (“bad” cholesterol) and low HDL
(“good” cholesterol) are at higher risk of heart failure.
Association Rule Mining
Market Basket Analysis
Items that are bought together are placed together
Healthcare
Association Rule Mining
Association Rule
An implication expression of the form X → Y, where X
and Y are itemsets and X ∩ Y = ∅
(Figure: Venn diagram over D — transactions containing X, transactions
containing Y, and their overlap containing both X and Y)
Rule Evaluation Metrics
Support (s): fraction of transactions that contain both X and Y
s = P(X ∪ Y) = #trans containing (X ∪ Y) / #trans in D
Confidence (c): measures how often items in Y appear in
transactions that contain X
c = P(Y | X) = #trans containing (X ∪ Y) / #trans containing X
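Both metrics can be computed for the {High LDL, Low HDL} → {Heart Failure} rule over the five-patient table from the earlier slide:

```python
transactions = [
    {"High LDL", "Low HDL", "High BMI", "Heart Failure"},
    {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},
    {"Diabetes"},
    {"High LDL", "Low HDL", "Heart Failure"},
    {"High BMI", "High LDL", "Low HDL", "Heart Failure"},
]

def support(itemset, trans):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in trans if itemset <= t) / len(trans)

def confidence(x, y, trans):
    """How often Y appears in transactions that contain X."""
    return support(x | y, trans) / support(x, trans)

x = {"High LDL", "Low HDL"}
y = {"Heart Failure"}
s = support(x | y, transactions)      # 4/5 = 0.8
c = confidence(x, y, transactions)    # 4/4 = 1.0
```

In this toy table every patient with high LDL and low HDL also has heart failure, so the rule holds with confidence 1.0 and support 0.8.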
The Apriori Algorithm
Starts with frequent 1-itemsets
Include only those “items” that pass the support threshold
Use 1-itemset to generate 2-itemsets
Stop when threshold not satisfied by any itemset
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do
    Candidate generation: Ck+1 = candidates generated from Lk;
    Candidate counting: for each transaction t in the database,
        increment the count of all candidates in Ck+1 contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_sup;
return ∪k Lk;
Apriori-based Mining
Database D (min_sup = 0.5):
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D → 1-candidate counts: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3 (d dropped)
2-candidates: ab, ac, ae, bc, be, ce
Scan D → counts: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D → bce:2
Frequent 3-itemsets: bce:2
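The trace above can be reproduced with a compact Apriori sketch (a simplified level-wise loop, not the optimized textbook candidate join):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining; returns {itemset: support count}."""
    n = len(transactions)
    min_count = min_sup * n
    frequent = {}
    k_sets = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    while k_sets:
        # candidate counting: one database scan per level
        level = {}
        for c in k_sets:
            count = sum(1 for t in transactions if c <= t)
            if count >= min_count:
                level[c] = count
        frequent.update(level)
        # candidate generation: join frequent k-itemsets, then prune by the
        # Apriori property (every k-subset of a candidate must be frequent)
        prev = list(level)
        cands = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
        k_sets = [c for c in cands
                  if all(frozenset(s) in level for s in combinations(c, len(c) - 1))]
    return frequent

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
freq = apriori(D, min_sup=0.5)   # e.g. {b, c, e} has support count 2
```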
Principal Component Analysis
Principal components
When there is a large number of variables, it is highly possible that
some subsets of the variables are very correlated with each
other. Reduce the variables but retain the variability in the dataset
Linear combinations of variables in the database
Variance of each PC maximized
– Display as much spread of the original data
PC orthogonal with each other
– Minimize the overlap in the variables
Each component is normalized: its sum of squares is unity
– Easier for mathematical analysis
Number of PC < Number of variables
Associations found
Small number of PC explain large amount of variance
Example: 768 female Pima Indians evaluated for diabetes
Number of times pregnant, two-hour oral glucose tolerance test
(OGTT) plasma glucose, Diastolic blood pressure, Triceps skin fold
thickness, Two-hour serum insulin, BMI, Diabetes pedigree
function, Age, Diabetes onset within last 5 years
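Extracting the first principal component can be sketched in 2-D by power iteration on the covariance matrix (toy data along y ≈ x, not the Pima variables):

```python
def first_pc(points, iters=200):
    """Unit vector along the direction of maximum variance (first PC)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # 2x2 covariance matrix entries
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # power iteration converges to the dominant eigenvector
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
pc1 = first_pc(pts)   # roughly (0.71, 0.71): the data vary along y ≈ x
```

Projecting onto `pc1` would give one variable that captures nearly all the spread, which is exactly the dimensionality-reduction payoff described above.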
PCA Example
(Figure: plot of the principal components for the example dataset)
National Cancer Institute
CancerNet http://www.nci.nih.gov
CancerNet for Patients and the Public
CancerNet for Health Professionals
CancerNet for Basic Researchers
CancerLit
Conclusion
About ¾ billion people’s medical records are
electronically available
Data mining in medicine distinct from other fields due
to nature of data: heterogeneous, with ethical, legal
and social constraints
Most commonly used technique is classification and
prediction with different techniques applied for
different cases
Association rules describe the data in the database
Medical data mining can be the most rewarding
despite the difficulty