CSE 300

Data Mining and Its Applications and Usage in Medicine

By Radhika
Data Mining and Medicine
 History
 Past 20 years: grew up alongside relational databases
 Added more dimensions to database queries
 One of the earliest and most successful areas of data mining
 Mid-1800s: London hit by an infectious disease (cholera)
 Two theories
– Miasma theory: bad air propagated the disease
– Germ theory: the disease was water-borne
 Advantages
– Discover trends even when we don’t understand the reasons behind them
– May also surface irrelevant patterns that confuse rather than enlighten
– Protection against unaided human inference of patterns: provides quantifiable measures that aid human judgment
 Data Mining
 Patterns that are persistent and meaningful
 Knowledge Discovery from Data (KDD)
The Future of Data Mining
 The 10 biggest killers in the US
 Data mining = the process of discovering interesting, meaningful and actionable patterns hidden in large amounts of data
Major Issues in Medical Data Mining
 Heterogeneity of medical data
 Volume and complexity
 Physician’s interpretation
 Poor mathematical categorization
 Canonical form
 Solution: standard vocabularies, interfaces between different data sources for integration, design of electronic patient records
 Ethical, Legal and Social Issues
 Data ownership
 Lawsuits
 Privacy and security of human data
 Expected benefits
 Administrative issues
Why Data Preprocessing?
 Patient records consist of clinical and lab parameters and results of particular investigations, specific to given tasks
 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in codes or names
 Temporal parameters for chronic diseases
 No quality data, no quality mining results!
 A data warehouse needs consistent integration of quality data
 In the medical domain, handling incomplete, inconsistent or noisy data requires people with domain knowledge
What is Data Mining? The KDD Process
 [Figure: the KDD process pipeline]
 Databases → data cleaning and data integration → data warehouse → data selection → task-relevant data → data mining → pattern evaluation
From Tables and Spreadsheets to Data Cubes
 A data warehouse is based on a multidimensional data model that views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables (a small sketch follows this slide)
 W. H. Inmon: “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
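To make the fact/dimension idea concrete, here is a small illustrative sketch in Python/pandas (not from the lecture): a tiny fact table with a dollars_sold measure is aggregated along hypothetical item and quarter dimensions.

    # Hypothetical mini fact table: dollars_sold measured against item and time dimensions.
    import pandas as pd

    fact = pd.DataFrame({
        "item_name":    ["stent", "stent", "syringe", "syringe"],
        "quarter":      ["Q1", "Q2", "Q1", "Q2"],
        "dollars_sold": [1200, 1500, 300, 450],
    })

    # A two-dimensional view of the cube: item x quarter, aggregated by sum.
    cube = pd.pivot_table(fact, values="dollars_sold",
                          index="item_name", columns="quarter", aggfunc="sum")
    print(cube)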
Data Warehouse vs. Heterogeneous DBMS
 Data warehouse: update-driven, high performance
 Information from heterogeneous sources is integrated in advance and stored in the warehouse for direct query and analysis
 Does not contain the most current information
 Query processing does not interfere with processing at local sources
 Stores and integrates historical information
 Supports complex multidimensional queries
Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
Why Separate Data Warehouse?
 High performance for both systems
 DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
 Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
 Different functions and different data:
 Missing data: decision support requires historical data which operational DBs do not typically maintain
 Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
 Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up a hierarchy or by dimension reduction (see the sketch after this slide)
 Drill down (roll down): reverse of roll-up
 from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
 Slice and dice:
 project and select
 Pivot (rotate):
 reorient the cube; visualization; 3D to a series of 2D planes
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
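As a rough illustration of roll-up versus drill-down (my own sketch, with made-up columns and values), aggregating away the day level of a time hierarchy in pandas corresponds to a roll-up:

    # Illustrative roll-up / drill-down on a tiny sales table (hypothetical data).
    import pandas as pd

    sales = pd.DataFrame({
        "month": ["Jan", "Jan", "Feb", "Feb"],
        "day":   [1, 2, 1, 2],
        "dollars_sold": [100, 150, 120, 80],
    })

    detail = sales.groupby(["month", "day"])["dollars_sold"].sum()  # lower-level summary
    rolled = sales.groupby("month")["dollars_sold"].sum()           # roll-up: drop the day dimension
    print(detail)
    print(rolled)  # drilling down goes the other way, from `rolled` back to `detail`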
Multi-Tiered Architecture
 [Figure: three-tier data warehouse architecture]
 Data sources (operational DBs and other sources) feed an extract/transform/load/refresh layer, with a monitor and integrator maintaining metadata
 Data storage tier: the data warehouse plus data marts
 OLAP engine tier: the OLAP server
 Front-end tools: query and reports, analysis, data mining
Steps of a KDD Process
 Learning the application domain:
 relevant prior knowledge and goals of the application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of the effort!)
 Data reduction and transformation:
 Find useful features, dimensionality/variable reduction, invariant representation
 Choosing functions of data mining
 summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
Common Techniques in Data Mining
 Predictive Data Mining
 Most important
 Classification: relate one set of variables in the data to response variables
 Regression: estimate some continuous value
 Descriptive Data Mining
 Clustering: discovering groups of similar instances
 Association rule extraction
 Variables/Observations
 Summarization of group descriptions
Leukemia
 Different types of cells look very similar
 Given a number of samples (patients):
 Can we diagnose the disease accurately?
 Predict the outcome of treatment?
 Recommend the best treatment based on previous treatments?
 Solution: data mining on micro-array data
 38 training patients, 34 testing patients, ~7,000 attributes per patient
 2 classes: Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)
Clustering/Instance-Based Learning
 Uses specific instances to perform classification rather than general IF-THEN rules
 Nearest-neighbor classifier
 Among the most studied algorithms for medical purposes
 Clustering: partitioning a data set into several groups (clusters) such that
 Homogeneity: objects belonging to the same cluster are similar to each other
 Separation: objects belonging to different clusters are dissimilar to each other
 Three elements
 The set of objects
 The set of attributes
 Distance measure
Measure the Dissimilarity of Objects
 Find the best matching instance
 Distance function
 Measures the dissimilarity between a pair of data objects
 Things to consider
 Usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables
 Weights should be associated with different variables based on the application and data semantics
 Quality of a clustering result depends on both the distance measure adopted and its implementation
Minkowski Distance
 Minkowski distance: a generalization

   d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}, \quad q > 0

 If q = 2, d is the Euclidean distance
 If q = 1, d is the Manhattan distance
 Example: xi = (1, 7), xj = (7, 1)
 q = 1 (Manhattan): d = 6 + 6 = 12
 q = 2 (Euclidean): d = sqrt(6^2 + 6^2) ≈ 8.49
 (A short Python check follows this slide.)
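A minimal Python check of the formula and the example above (my own sketch, not course code):

    # Minkowski distance between two points; q=1 gives Manhattan, q=2 gives Euclidean.
    def minkowski(x, y, q):
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

    xi, xj = (1, 7), (7, 1)
    print(minkowski(xi, xj, 1))   # 12.0  (Manhattan: 6 + 6)
    print(minkowski(xi, xj, 2))   # ~8.49 (Euclidean: sqrt(6^2 + 6^2))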
Binary Variables
 A contingency table for binary data (rows: object i, columns: object j)

              j = 1    j = 0    sum
   i = 1        a        b      a + b
   i = 0        c        d      c + d
   sum        a + c    b + d      p

 Simple matching coefficient:

   d(i, j) = \frac{b + c}{a + b + c + d}
Dissimilarity between Binary Variables
 Example

            A1  A2  A3  A4  A5  A6  A7
 Object 1    1   0   1   1   1   0   0
 Object 2    1   1   1   0   0   0   1

 Contingency table (rows: Object 1, columns: Object 2)

            1    0    sum
     1      2    2     4
     0      2    1     3
    sum     4    3     7

 d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7 ≈ 0.57  (computed in the sketch below)
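The same calculation in a few lines of Python (an illustrative sketch):

    # Simple matching dissimilarity for binary vectors: d = (b + c) / (a + b + c + d).
    def binary_dissimilarity(o1, o2):
        a = sum(1 for x, y in zip(o1, o2) if x == 1 and y == 1)
        b = sum(1 for x, y in zip(o1, o2) if x == 1 and y == 0)
        c = sum(1 for x, y in zip(o1, o2) if x == 0 and y == 1)
        d = sum(1 for x, y in zip(o1, o2) if x == 0 and y == 0)
        return (b + c) / (a + b + c + d)

    obj1 = [1, 0, 1, 1, 1, 0, 0]
    obj2 = [1, 1, 1, 0, 0, 0, 1]
    print(binary_dissimilarity(obj1, obj2))   # 4/7 ≈ 0.571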
The k-Means Algorithm
 Initialization
 Arbitrarily choose k objects as the initial cluster centers (centroids)
 Iterate until no change
 For each object Oi:
 Calculate the distances between Oi and the k centroids
 (Re)assign Oi to the cluster whose centroid is closest to Oi
 Update the cluster centroids based on the current assignment
 (A runnable sketch follows this slide.)
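Below is a minimal NumPy sketch of these steps. It assumes numeric data and ignores edge cases such as empty clusters; it is an illustration, not the lecture's implementation.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centers
        for _ in range(n_iter):
            # Assign each object to the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update each centroid from its current assignment
            # (no handling of empty clusters in this toy sketch)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):  # no change -> stop
                break
            centroids = new_centroids
        return labels, centroids

    X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9], [1, 0.5], [8.5, 9.5]])
    labels, centroids = kmeans(X, k=2)
    print(labels, centroids)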
k-Means Clustering Method
 [Figure: scatter plots of the same objects showing one k-means iteration: objects are assigned to the current cluster means, the means are relocated, and the clusters are updated until assignments stabilize]
Dataset
 Data set from the UCI repository
 http://kdd.ics.uci.edu/
 768 female Pima Indians evaluated for diabetes
 After data cleaning: 392 records
Hierarchical Clustering
 Groups observations based on dissimilarity
 Compacts the database into “labels” that represent the observations
 Measures of similarity/dissimilarity
 Euclidean distance
 Manhattan distance
 Types of clustering (linkage)
 Single link
 Average link
 Complete link
 (A SciPy sketch follows this slide.)
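A short SciPy sketch of agglomerative clustering under different linkages (the toy observations are invented; scipy.cluster.hierarchy.dendrogram could additionally be used to plot the merge tree):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])  # toy observations

    for method in ("single", "average", "complete"):
        Z = linkage(X, method=method, metric="euclidean")  # the merge tree (dendrogram data)
        labels = fcluster(Z, t=3, criterion="maxclust")    # cut into at most 3 clusters
        print(method, labels)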
Hierarchical Clustering: Comparison
 [Figure: the same six points (1–6) clustered with single-link, complete-link, average-link and centroid-distance linkage, showing how the resulting nested clusters differ between methods]
Compare Dendrograms
 [Figure: dendrograms over points 1, 2, 5, 3, 6, 4 for single-link, complete-link, average-link and centroid-distance clustering]
Which Distance Measure is Better?
 Each method has both advantages and disadvantages; the choice is application-dependent
 Single-link
 Can find irregular-shaped clusters
 Sensitive to outliers
 Complete-link, average-link, and centroid distance
 Robust to outliers
 Tend to break large clusters
 Prefer spherical clusters
Dendrogram from the Dataset (Single Link)
 [Figure: single-link dendrogram of the cleaned records]
 Minimum spanning tree through the observations
 The single observation that is last to join a cluster is a patient whose blood pressure is in the bottom quartile, skin thickness is in the bottom quartile and BMI is in the bottom half
 Her insulin, however, was the largest in the dataset, and she is a 59-year-old diabetic
Dendrogram from the Dataset (Complete Link)
 [Figure: complete-link dendrogram]
 Maximum dissimilarity between observations in one cluster when compared to another
Dendrogram from the Dataset (Average Link)
 [Figure: average-link dendrogram]
 Average dissimilarity between observations in one cluster when compared to another
Supervised versus Unsupervised Learning
 Supervised learning (classification)
 Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 Class labels of the training data are unknown
 Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data
Classification and Prediction
 Derive models that can use patient-specific information to aid clinical decision making
 A priori decision on the predictors and the variables to predict
 No method can find predictors that are not present in the data
 Numeric response
 Least squares regression
 Categorical response
 Classification trees
 Neural networks
 Support vector machines
 Decision models
 Prognosis, diagnosis and treatment planning
 Embedded in clinical information systems
Least Squares Regression
 Find a linear function of the predictor variables that minimizes the sum of squared differences with the response
 Supervised learning technique
 Predict insulin in our dataset from glucose and BMI (a sketch follows this slide)
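A hedged scikit-learn sketch of this regression step: it assumes a cleaned Pima-style table with columns named glucose, bmi and insulin, which are placeholder names and values rather than the course's actual data.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical cleaned records; in the lecture this would be the 392-row Pima subset.
    df = pd.DataFrame({
        "glucose": [89, 137, 78, 197, 166],
        "bmi":     [28.1, 43.1, 31.0, 30.5, 25.8],
        "insulin": [94, 168, 88, 543, 175],
    })

    model = LinearRegression().fit(df[["glucose", "bmi"]], df["insulin"])
    print(model.intercept_, model.coef_)                                  # fitted linear function
    print(model.predict(pd.DataFrame({"glucose": [120], "bmi": [32.0]}))) # insulin for a new patient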
Decision Trees
 Decision tree
 Each internal node tests an attribute
 Each branch corresponds to an attribute value
 Each leaf node assigns a classification
 ID3 algorithm
 Uses training objects with known class labels to classify testing objects
 Ranks attributes with the information gain measure
 Minimal height
 least number of tests to classify an object
 Used in commercial tools, e.g., Clementine
 ASSISTANT
 Deals with medical datasets
 Incomplete data
 Discretizes continuous variables
 Prunes unreliable parts of the tree
 Classifies data
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Attributes are categorical (if continuous-valued, they are discretized in advance)
 Tree is constructed in a top-down recursive divide-and-conquer manner
 At start, all training examples are at the root
 Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
 Examples are partitioned recursively based on the selected attributes
Training Dataset

        Age     BMI     Hereditary  Vision     Risk of Condition X
 P1     <=30    high    no          fair       no
 P2     <=30    high    no          excellent  no
 P3     >40     high    no          fair       yes
 P4     31…40   medium  no          fair       yes
 P5     31…40   low     yes         fair       yes
 P6     31…40   low     yes         excellent  no
 P7     >40     low     yes         excellent  yes
 P8     <=30    medium  no          fair       no
 P9     <=30    low     yes         fair       yes
 P10    31…40   medium  yes         fair       yes
 P11    <=30    medium  yes         excellent  yes
 P12    >40     medium  no          excellent  yes
 P13    >40     high    yes         fair       yes
 P14    31…40   medium  no          excellent  no
Construction of a Decision Tree for “Condition X”
 Root: all patients [P1, …, P14] (Yes: 9, No: 5); split on Age
 Age <= 30: [P1, P2, P8, P9, P11] (Yes: 2, No: 3); split on Hereditary
– Hereditary = no: [P1, P2, P8] (Yes: 0, No: 3) → NO
– Hereditary = yes: [P9, P11] (Yes: 2, No: 0) → YES
 Age 31…40: [P4, P5, P6, P10, P14] (Yes: 3, No: 2); split on Vision
– Vision = excellent: [P6, P14] (Yes: 0, No: 2) → NO
– Vision = fair: [P4, P5, P10] (Yes: 3, No: 0) → YES
 Age > 40: [P3, P7, P12, P13] (Yes: 4, No: 0) → YES
Entropy and Information Gain
 S contains s_i tuples of class C_i for i = 1, …, m
 Information required to classify any arbitrary tuple:

   I(s_1, s_2, \ldots, s_m) = - \sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

 Entropy of attribute A with values {a_1, a_2, …, a_v}:

   E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})

 Information gained by branching on attribute A:

   Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

 (A worked computation for the Age attribute follows this slide.)
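The following sketch (mine, not from the slides) evaluates these formulas for the Age attribute of the "Condition X" training set shown earlier; Gain(Age) works out to roughly 0.25 bits, which is why Age ends up at the root of the tree.

    from math import log2
    from collections import Counter

    def info(labels):
        """I(s1,...,sm): expected bits needed to classify a tuple drawn from `labels`."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(values, labels):
        """Gain(A) = I(S) - E(A), with E(A) the weighted entropy of the partitions on A."""
        n = len(labels)
        partitions = {}
        for v, label in zip(values, labels):
            partitions.setdefault(v, []).append(label)
        e_a = sum(len(part) / n * info(part) for part in partitions.values())
        return info(labels) - e_a

    # Age values and "Risk of Condition X" labels for P1..P14 from the training table.
    age  = ["<=30", "<=30", ">40", "31-40", "31-40", "31-40", ">40", "<=30", "<=30",
            "31-40", "<=30", ">40", ">40", "31-40"]
    risk = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes",
            "yes", "yes", "yes", "yes", "no"]
    print(gain(age, risk))   # ≈ 0.25: the highest gain, so Age is selected first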
Entropy and Information Gain
 Select the attribute with the highest information gain (or greatest entropy reduction)
 Such an attribute minimizes the information needed to classify the samples
Rule Induction
 IF conditions THEN conclusion
 e.g., CN2
 Concept description:
 Characterization: provides a concise and succinct summarization of a given collection of data
 Comparison: provides descriptions comparing two or more collections of data
 Training set, testing set
 Induced rules can be imprecise
 Predictive accuracy
 P / (P + N)
Example Used in a Clinic
 Hip arthroplasty: the trauma surgeon predicts the patient’s long-term clinical status after surgery
 Outcome evaluated during follow-ups for 2 years
 2 modeling techniques
 Naïve Bayesian classifier
 Decision trees
 Bayesian classifier
 P(outcome = good) = 0.55 (11/20 good)
 The probability gets updated as more attributes are considered
 P(timing = good | outcome = good) = 9/11 (≈ 0.82)
 P(outcome = bad) = 9/20;  P(timing = good | outcome = bad) = 5/9
 (The posterior update is worked through in the sketch below.)
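Using the figures quoted above, a few lines of Python show how the Bayesian classifier updates the outcome probability once "timing = good" is observed (my own sketch of the arithmetic, not the clinic's tool):

    # Prior and class-conditional estimates quoted on the slide.
    p_good, p_bad = 11 / 20, 9 / 20
    p_timing_given_good = 9 / 11
    p_timing_given_bad  = 5 / 9

    # Unnormalized posteriors after observing timing = good, then normalize.
    post_good = p_good * p_timing_given_good   # 0.45
    post_bad  = p_bad  * p_timing_given_bad    # 0.25
    z = post_good + post_bad
    print(post_good / z, post_bad / z)         # ≈ 0.64 vs 0.36: belief in a good outcome rises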
Nomogram
 [Figure: nomogram visualizing the naïve Bayesian classifier from the previous example]
Bayesian Classification
 Bayesian classifier vs. decision tree
 Decision tree: predicts the class label
 Bayesian classifier: a statistical classifier; predicts class membership probabilities
 Based on Bayes’ theorem; estimates the posterior probability
 Naïve Bayesian classifier:
 Simple classifier that assumes attribute independence
 High speed when applied to large databases
 Comparable in performance to decision trees
Bayes Theorem
 Let X be a data sample whose class label is unknown
 Let H_i be the hypothesis that X belongs to a particular class C_i
 P(H_i) is the prior probability that X belongs to class C_i
 Can be estimated as n_i / n from the training data samples
 n is the total number of training data samples
 n_i is the number of training data samples of class C_i
 Bayes theorem:

   P(H_i \mid X) = \frac{P(X \mid H_i) \, P(H_i)}{P(X)}
More Classification Techniques
 Neural networks
 Similar to the pattern-recognition properties of biological systems
 Most frequently used
 Multi-layer perceptrons
– Inputs (with a bias term) connected by weights to hidden and output layers
 Backpropagation neural networks
 Support vector machines
 Separate the database into mutually exclusive regions
 Transform to another problem space
 Kernel functions (dot product)
 Output for new points predicted by their position
 Comparison with classification trees
 Not possible to know which features or combinations of features most influence a prediction
Multilayer Perceptrons
 Apply non-linear transfer functions to weighted sums of inputs
 Werbos (backpropagation) algorithm
 Weights initialized randomly
 Training set, testing set
 (A scikit-learn sketch follows this slide.)
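A compact scikit-learn sketch of a multilayer perceptron trained by backpropagation on invented two-feature data; the hidden-layer size and other settings are my own choices, not the lecture's.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                 # two input features
    y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)  # a non-linear target

    # One hidden layer of 8 units; weights start random and are fitted by backpropagation.
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
    print(clf.score(X, y))                        # training accuracy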
Support Vector Machines
 3 steps
 Support vector creation
 Maximal distance (margin) between the points is found
 Perpendicular decision boundary
 Allows some points to be misclassified
 Pima Indian data with X1 (glucose) and X2 (BMI); a sketch follows this slide
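A hedged scikit-learn sketch of this setup, with invented (glucose, BMI) values standing in for X1 and X2; the RBF kernel here plays the role of the kernel-function transformation into another problem space.

    import numpy as np
    from sklearn.svm import SVC

    # Invented (glucose, BMI) pairs and a diabetes label; a real run would use the Pima data.
    X = np.array([[85, 26.6], [183, 23.3], [89, 28.1], [197, 30.5], [137, 43.1], [78, 31.0]])
    y = np.array([0, 1, 0, 1, 1, 0])

    # Soft-margin SVM (C controls how many points may be misclassified) with an RBF kernel.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    print(clf.support_vectors_)        # the points that define the decision boundary
    print(clf.predict([[120, 32.0]]))  # predicted class for a new patient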
What is Association Rule Mining?
 Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

 PatientID  Conditions
 1          High LDL, Low HDL, High BMI, Heart Failure
 2          High LDL, Low HDL, Heart Failure, Diabetes
 3          Diabetes
 4          High LDL, Low HDL, Heart Failure
 5          High BMI, High LDL, Low HDL, Heart Failure

 Example association rule:
 {High LDL, Low HDL} ⇒ {Heart Failure}
 People who have high LDL (“bad” cholesterol) and low HDL (“good” cholesterol) are at higher risk of heart failure.
Association Rule Mining
 Market basket analysis
 Groups of items that are bought together are placed together
 Healthcare
 Understanding associations among patients with demands for similar treatments and services
 Goal: find items for which the joint probability of occurrence is high
 Basket of binary-valued variables
 Results take the form of association rules, augmented with support and confidence
Association Rule Mining
 Association rule
 An implication of the form X ⇒ Y, where X and Y are itemsets and X ∩ Y = ∅
 Rule evaluation metrics
 Support (s): fraction of transactions that contain both X and Y

   support(X ⇒ Y) = P(X ∪ Y) = (# transactions containing X ∪ Y) / (# transactions in D)

 Confidence (c): measures how often items in Y appear in transactions that contain X

   confidence(X ⇒ Y) = P(Y | X) = (# transactions containing X ∪ Y) / (# transactions containing X)

 [Figure: Venn diagram of the transactions in D containing X, containing Y, and containing both]
 (These metrics are computed for the patient example in the sketch below.)
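Applied to the five-patient table from the earlier slide, this short sketch (mine) computes the two metrics for {High LDL, Low HDL} ⇒ {Heart Failure}; support comes out to 4/5 and confidence to 4/4.

    # Transactions from the example patient table.
    D = [
        {"High LDL", "Low HDL", "High BMI", "Heart Failure"},
        {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},
        {"Diabetes"},
        {"High LDL", "Low HDL", "Heart Failure"},
        {"High BMI", "High LDL", "Low HDL", "Heart Failure"},
    ]
    X = {"High LDL", "Low HDL"}
    Y = {"Heart Failure"}

    both   = sum(1 for t in D if X | Y <= t)   # transactions containing X and Y
    only_x = sum(1 for t in D if X <= t)       # transactions containing X
    print("support =", both / len(D))          # 0.8
    print("confidence =", both / only_x)       # 1.0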
The Apriori Algorithm
 Starts with the most frequent 1-itemsets
 Include only those items that pass the support threshold
 Use the frequent 1-itemsets to generate candidate 2-itemsets
 Stop when no itemset satisfies the threshold

 L1 = {frequent items};
 for (k = 1; Lk != ∅; k++) do
   Candidate generation: Ck+1 = candidates generated from Lk;
   Candidate counting: for each transaction t in the database,
     increment the count of all candidates in Ck+1 that are contained in t;
   Lk+1 = candidates in Ck+1 with at least min_sup support;
 return ∪k Lk;
 (A runnable Python sketch follows this slide.)
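Here is a minimal, illustrative Apriori implementation in Python (a sketch under the assumption that transactions are sets of items and min_sup is an absolute count; it is not the lecture's code). Run on the toy database of the next slide, it finds exactly the frequent itemsets shown there.

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return a dict mapping each frequent itemset (frozenset) to its support count."""
        # Frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        all_frequent = dict(frequent)
        k = 1
        while frequent:
            # Candidate generation: (k+1)-itemsets whose k-subsets are all frequent
            items = sorted({i for s in frequent for i in s})
            candidates = {frozenset(c) for c in combinations(items, k + 1)
                          if all(frozenset(sub) in frequent for sub in combinations(c, k))}
            # Candidate counting over the database
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            frequent = {s: c for s, c in counts.items() if c >= min_sup}
            all_frequent.update(frequent)
            k += 1
        return all_frequent

    # Toy database from the next slide; min_sup = 0.5 of 4 transactions = 2.
    D = [set("acd"), set("bce"), set("abce"), set("be")]
    print(apriori(D, min_sup=2))   # a, b, c, e, ac, bc, be, ce and bce with their counts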
Apriori-Based Mining
 Example with min_sup = 0.5 (i.e., 2 of the 4 transactions)

 Database D:  TID 10: a, c, d   TID 20: b, c, e   TID 30: a, b, c, e   TID 40: b, e

 Scan D for 1-candidates:  a:2, b:3, c:3, d:1, e:3
 Frequent 1-itemsets:      a:2, b:3, c:3, e:3
 2-candidates:             ab, ac, ae, bc, be, ce
 Scan D and count:         ab:1, ac:2, ae:1, bc:2, be:3, ce:2
 Frequent 2-itemsets:      ac:2, bc:2, be:3, ce:2
 3-candidates:             bce
 Scan D and count:         bce:2
 Frequent 3-itemsets:      bce:2
Principal Component Analysis
 Principal components
 With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other; reduce the variables but retain the variability in the dataset
 Linear combinations of the variables in the database
 Variance of each PC is maximized
– Displays as much of the spread of the original data as possible
 PCs are orthogonal to each other
– Minimizes the overlap between the variables
 Each component is normalized so its sum of squares is unity
– Easier for mathematical analysis
 Number of PCs < number of variables
 Associations found
 A small number of PCs explains a large amount of the variance
 Example: 768 female Pima Indians evaluated for diabetes (a sketch follows this slide)
 Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, BMI, diabetes pedigree function, age, diabetes onset within the last 5 years
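A short scikit-learn sketch of PCA (my own illustration): the data here are random stand-ins for the eight numeric Pima predictor variables, so only the mechanics, not the actual loadings, carry over.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # Stand-in for the 8 numeric Pima predictors (pregnancies, OGTT glucose, blood
    # pressure, skin fold, insulin, BMI, pedigree function, age).
    X = rng.normal(size=(392, 8))

    X_std = StandardScaler().fit_transform(X)   # put variables on a common scale
    pca = PCA(n_components=3).fit(X_std)
    print(pca.explained_variance_ratio_)        # variance captured by each component
    print(pca.transform(X_std)[:5])             # the first five records in PC space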
PCA Example
 [Figure]
National Cancer Institute
 CancerNet http://www.nci.nih.gov
 CancerNet for Patients and the Public
 CancerNet for Health Professionals
 CancerNet for Basic Researchers
 CancerLit
Conclusion
 About three quarters of a billion people’s medical records are electronically available
 Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal and social constraints
 The most commonly used technique is classification and prediction, with different techniques applied for different cases
 Association rules describe the data in the database
 Medical data mining can be the most rewarding, despite the difficulty
Thank you!!!
