
Data Analytics

(BE-2015 Pattern)
Unit II
Basic Data Analytic Methods
Syllabus

 Statistical Methods for Evaluation: hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

 Advanced Analytical Theory and Methods: Clustering - overview; k-means - use cases, overview of methods, determining the number of clusters, diagnostics, reasons to choose and cautions.
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
What is a Hypothesis?
• A hypothesis is an educated guess about
something in the world around you. It should
be testable, either by experiment or
observation. For example:
• A new medicine you think might work.
• A way of teaching you think might be better.
What is a Hypothesis Statement?
• A hypothesis statement will look like this:
• “If I…(do this to an independent variable)….then (this will happen to the
dependent variable).”

• For example:
• If I (decrease the amount of water given to herbs) then (the herbs will
increase in size).
• If I (give patients counseling in addition to medication) then (their overall
depression scale will decrease).
What is Hypothesis Testing?
• Hypothesis testing refers to:
1. Making an assumption, called hypothesis,
about a population parameter.
2. Collecting sample data.
3. Calculating a sample statistic.
4. Using the sample statistic to evaluate the
hypothesis
Hypothesis Testing: Population & Sample
Hypothesis Testing
Null hypothesis, H0
 States the hypothesized value of the parameter before sampling.
 The assumption we wish to test (or are trying to reject).
 E.g., µ = 20
 E.g., there is no difference between Coke and Diet Coke.

Alternative hypothesis, HA
 All possible alternatives other than the null hypothesis.
 E.g., µ ≠ 20, µ > 20, µ < 20
 E.g., there is a difference between Coke and Diet Coke.
Hypothesis Testing
 Basic concept is to form an assertion and test it with data.
 Common assumption is that there is no difference between samples (the default assumption).
 Statisticians refer to this as the null hypothesis (H0).
 The alternative hypothesis (HA) is that there is a difference between samples.


What is the Null & Alternative Hypothesis?

• The null hypothesis is the accepted fact, i.e., a statement assumed to be true. Examples:
• DNA is shaped like a double helix.
• There are 8 planets in the solar system (excluding Pluto).

• Given a population, the initial (assumed) hypothesis to be tested, H0, is called the null hypothesis.
• Rejection of the null hypothesis leads to another hypothesis, HA (also written H1), called the alternative hypothesis.
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Mean, Variance, Standard Deviation

 Mean (or average) is denoted by
   ● μ if working with a population
   ● x̄ if working with a sample

 Variance is denoted by
   ● σ² for a population
   ● s² for a sample

 Standard deviation is denoted by
   ● σ (or σX) for a population
   ● s (or sX) for a sample
Mean is a simple average of the given data values

• Example
• Data: 4, 5, 9, 3, 15, 6
• Mean x̄ = (4 + 5 + 9 + 3 + 15 + 6) / 6
          = 42 / 6
          = 7
Variance: a measure of how data points differ from the mean

• Data Set 1: 3, 5, 7, 10, 10
• Data Set 2: 7, 7, 7, 7, 7

• What is the mean of each data set?
• Data Set 1: mean = 7
• Data Set 2: mean = 7

• But we know that the two data sets are not identical! The variance shows how they are different.
• We want a way to represent this difference numerically.
How to Calculate Variance?
• If we conceptualize the spread of a distribution as the extent to which the values in the distribution differ from the mean and from each other, then a reasonable measure of spread might be the average deviation, or difference, of the values from the mean:

  Σ(x − x̄) / N
How to Calculate Variance?
• The average of the squared deviations about the mean is called the variance.

 Population variance:
   σ² = Σ(x − μ)² / N

 Sample variance:
   s² = Σ(x − x̄)² / (n − 1)
Example 1 – Variance

 Score (X) | X − x̄ | (X − x̄)²
 3         |        |
 5         |        |
 7         |        |
 10        |        |
 10        |        |
 Total: 35

 The mean is 35/5 = 7.
Example 1 – Variance

 Score (X) | X − x̄       | (X − x̄)²
 3         | 3 − 7 = −4   |
 5         | 5 − 7 = −2   |
 7         | 7 − 7 = 0    |
 10        | 10 − 7 = 3   |
 10        | 10 − 7 = 3   |
 Total: 35
Example 1 – Variance

 Score (X) | X − x̄       | (X − x̄)²
 3         | 3 − 7 = −4   | 16
 5         | 5 − 7 = −2   | 4
 7         | 7 − 7 = 0    | 0
 10        | 10 − 7 = 3   | 9
 10        | 10 − 7 = 3   | 9
 Totals: 35               | 38
Example 1 – Variance

 Variance (treating the five scores as a complete population):
   σ² = Σ(x − x̄)² / N = 38 / 5 = 7.6
Example 1 – Variance (Data Set 2)

 Score (X) | X − x̄     | (X − x̄)²
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 Totals: 35             | 0

 Variance: σ² = Σ(x − x̄)² / N = 0 / 5 = 0
Example 2 – Variance

 Dive | Mark | Myrna
 1    | 28   | 27
 2    | 22   | 27
 3    | 21   | 28
 4    | 26   | 6
 5    | 18   | 27

 Which diver was more consistent?
Example 2 – Variance

 Dive   | Mark's Score (X) | X − x̄ | (X − x̄)²
 1      | 28               | 5      | 25
 2      | 22               | −1     | 1
 3      | 21               | −2     | 4
 4      | 26               | 3      | 9
 5      | 18               | −5     | 25
 Totals | 115              | 0      | 64

 Mark's mean = 115 / 5 = 23
 Mark's variance = 64 / 5 = 12.8
 Myrna's variance = 362 / 5 = 72.4

 Conclusion: Mark has the lower variance, therefore he is the more consistent diver.


Standard deviation – a measure of variation of scores about the mean

• Can think of the standard deviation as the average distance to the mean.
• A higher standard deviation indicates higher spread, less consistency, and less clustering.

 Sample standard deviation:
   s = √[ Σ(x − x̄)² / (n − 1) ]

 Population standard deviation:
   σ = √[ Σ(x − μ)² / N ]
Example – Standard Deviation

 Dive   | Mark's Score (X) | X − x̄ | (X − x̄)²
 1      | 28               | 5      | 25
 2      | 22               | −1     | 1
 3      | 21               | −2     | 4
 4      | 26               | 3      | 9
 5      | 18               | −5     | 25
 Totals | 115              | 0      | 64

 Mark's variance = 64 / 5 = 12.8
 Mark's standard deviation (population) = √(64 / 5) = √12.8 ≈ 3.58
 Mark's standard deviation (sample) = √(64 / 4) = √16 = 4
Example- Variance & Standard Deviation
• You have just measured the heights of your dogs (in mm)
• The heights (at the shoulders) are: 600mm, 470mm, 170mm,
430mm and 300mm.
• Find out the Mean, the Variance, and the Standard Deviation.
Example- Variance & Standard Deviation
• Your first step is to find the Mean:
• Mean = (600 + 470 + 170 + 430 + 300) / 5
• Mean = 1970/5
• Mean = 394
Example – Variance & Standard Deviation
• Now we calculate each dog's difference from the Mean:
• 600 − 394 = 206;  470 − 394 = 76;  170 − 394 = −224;  430 − 394 = 36;  300 − 394 = −94
Example – Variance & Standard Deviation
• To calculate the Variance, take each difference, square it, and then average the results:
• Variance:
  σ² = [206² + 76² + (−224)² + 36² + (−94)²] / 5
     = (42436 + 5776 + 50176 + 1296 + 8836) / 5
     = 108520 / 5
     = 21704

• So the Variance σ² is 21,704.


Example- Variance & Standard Deviation
• And the Standard Deviation is just the square root of Variance,
so:
• Standard Deviation

σ = √21704
  = 147.32...
  = 147 (to the nearest mm)
Example- Variance & Standard Deviation
• And the good thing about the Standard Deviation is that it is
useful. Now we can show which heights are within one
Standard Deviation (147mm) of the Mean:

• So, using the Standard Deviation we have a "standard" way of


knowing what is normal, and what is extra large or
extra small.
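The same numbers can be checked quickly in R. This is a minimal sketch using only base functions; note that R's built-in var() and sd() use the sample formulas (divide by n − 1), so the population values used in the slides are computed explicitly.

```r
# A minimal sketch in R reproducing the dog-height example above.
heights <- c(600, 470, 170, 430, 300)

mean(heights)                                        # 394

pop_var <- sum((heights - mean(heights))^2) / length(heights)
pop_var                                              # 21704 (population variance)
sqrt(pop_var)                                        # ~147.32 (population SD)

var(heights)                                         # 27130  (sample variance, n - 1)
sd(heights)                                          # ~164.7 (sample SD)
```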
Difference of Means

 1. State the hypotheses.
 2. Formulate an analysis plan.
 3. Analyze the sample data using a hypothesis test.
 4. Interpret the results.
Hypothesis Testing Procedures

 Parametric: Z test, t test, one-way ANOVA
 Nonparametric: Wilcoxon rank sum test, Kruskal-Wallis H-test

Many more tests exist!
Parametric Test Procedures

 1. Involve population parameters (e.g., the mean)
 2. Have stringent (strict) assumptions (e.g., normality)
 3. Examples: Z test, t test, χ² test, F test
Nonparametric Test Procedures

 1. Do not involve population parameters (e.g., tests about probability distributions or independence)
 2. Data can be measured on any scale (ratio or interval, ordinal or nominal)
 3. Example: Wilcoxon rank sum test
Parametric Test Procedures

 A t test allows us to compare the means of two groups.

 The calculation of a t test requires three pieces of information:
 - the difference between the means (the mean difference)
 - the standard deviation of each group
 - the number of subjects (samples) in each group
 [Figure: histograms of spelling test scores (12–25) for two groups]
The size of the standard deviation also influences the outcome of a t test.
Given the same difference in means, groups with smaller standard deviations are more likely to report a significant difference than groups with larger standard deviations.

 [Figure: histograms of spelling test scores (12–25) for two groups with small vs. large standard deviations]
From a practical standpoint, we can see that smaller standard deviations produce less overlap between the groups than larger standard deviations. Less overlap would indicate that the groups are more different from each other.

 [Figure: histograms of spelling test scores (12–25) showing the overlap between the two groups]
Difference of Means
Two populations – same or different? How do we determine which t test to use?

 Are the scores for the two means from the same subject (or related subjects)?
   - Yes → Paired t test (dependent t-test; correlated t-test)
   - No → Are there the same number of people in the two groups?
       - Yes → Equal-variance independent t test (pooled-variance independent t-test)
       - No → Are the variances of the two groups the same?
           - Yes (significance level for Levene (or F-max) is p >= .05) → Equal-variance independent t test (pooled-variance independent t test)
           - No (significance level for Levene (or F-max) is p < .05) → Unequal-variance independent t-test (separate-variance independent t test)
Difference of Means
Two Parametric Methods

 Student's t-test
   - Assumes two normally distributed populations with equal variance

 Welch's t-test
   - Assumes two normally distributed populations that do not necessarily have equal variance
Student's t-test
 Student's t-test assumes that the distributions of the two populations have equal but unknown variances.

 Suppose n1 and n2 samples are randomly and independently selected from two populations, pop1 and pop2, respectively.

 If each population is normally distributed with the same mean (µ1 = µ2) and the same variance, then T (the t-statistic) follows a t-distribution with df = n1 + n2 − 2 degrees of freedom.
Student's t-test

 T = (x̄1 − x̄2) / ( sp · √(1/n1 + 1/n2) )

 where the pooled variance is
   sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)

• α is the significance level
• degrees of freedom: df = n1 + n2 − 2
• T* is the critical value found using df (from the t table)
• If |T| >= T*, the null hypothesis is rejected
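In R, the pooled-variance form of the test is available through the base t.test() function by setting var.equal = TRUE. A minimal sketch on simulated data (the sample values and means here are arbitrary illustrations):

```r
# Student's (pooled-variance) t-test in base R; the data below are simulated
# purely for illustration.
set.seed(1)
x <- rnorm(10, mean = 100, sd = 5)   # sample from population 1
y <- rnorm(12, mean = 105, sd = 5)   # sample from population 2

t.test(x, y, var.equal = TRUE)       # var.equal = TRUE gives Student's t-test
                                     # (df = n1 + n2 - 2 = 20)
```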
Welch's t-test

 When the equal population variance assumption is not justified in performing Student's t-test for the difference of means, Welch's t-test can be used instead.

 Also known as the unequal variances t-test.
Welch's t-test

 Twelch = (x̄1 − x̄2) / √( s1²/n1 + s2²/n2 )

 where x̄, s², and n correspond to the sample mean, sample variance, and sample size of each group.

 Notice that Welch's t-test uses the sample variance (s²) of each population instead of the pooled sample variance.
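Welch's test is in fact R's default: t.test() assumes unequal variances unless told otherwise. A short sketch on simulated data:

```r
# Welch's t-test is the default in base R (var.equal = FALSE); simulated data
# are used only to illustrate the call.
set.seed(2)
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(15, mean = 105, sd = 15)   # deliberately different spread

t.test(x, y)                          # Welch's t-test with fractional df
```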
Example
t-test: independent samples

 Some brown hairs were found on the clothing of a victim at a crime scene.
 Five of the hairs were measured: 46, 57, 54, 51, 38 μm.
 A suspect is the owner of a shop with similar brown hairs. A sample of those hairs was taken and their widths measured: 31, 35, 50, 35, 36 μm.
 Is it possible that the hairs found on the victim were left by the suspect? Test at the 5% level.
 [From D. Lucy, Introduction to Statistics for Forensic Scientists. Chichester: Wiley, 2005, p. 44.]
t-test: independent samples

1. Calculate the mean and standard deviation for each data set.

   Sample               | A (victim) | B (suspect)
                        | 46         | 31
                        | 57         | 35
                        | 54         | 50
                        | 51         | 35
                        | 38         | 36
   Total                | 246        | 187
   Mean                 | 49.2       | 37.4
   Standard deviation   | 7.463      | 7.301
2. Calculate the magnitude of the difference between the two means:

   |49.2 − 37.4| = 11.8
3. Calculate the standard error of the difference:

   SE = sp · √(1/n1 + 1/n2), where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
      = 4.669 ≈ 4.67 (3 s.f.)
4. Calculate the value of T:

   T = difference between the means ÷ standard error of the difference
     = 11.8 / 4.669 = 2.527 ≈ 2.53 (3 s.f.)
5. Calculate the degrees of freedom: df = n1 + n2 − 2 = 5 + 5 − 2 = 8
6. Find the critical value T* for the significance level you are working to from the t table.

   At the 0.05 level with 8 degrees of freedom, T* (tcrit) = 2.306.
7. Compare T with the critical value:

   If T < T* (critical value), there is no significant difference between the two sets of data, i.e., the null hypothesis is not rejected (accepted).
   If T >= T* (critical value), there is a significant difference between the two sets of data, i.e., the null hypothesis is rejected.

   Here T = 2.53 > 2.306, so the null hypothesis is rejected: the two sets of hair widths differ significantly at the 5% level.
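The same calculation can be reproduced in R with t.test(); this sketch simply re-enters the hair-width values from the example.

```r
# Reproducing the hair-width example with a pooled-variance t-test.
victim  <- c(46, 57, 54, 51, 38)
suspect <- c(31, 35, 50, 35, 36)

mean(victim);  sd(victim)     # 49.2, ~7.463
mean(suspect); sd(suspect)    # 37.4, ~7.301

t.test(victim, suspect, var.equal = TRUE)
# t ~ 2.53 on 8 df, p ~ 0.035 < 0.05, matching the hand calculation:
# reject H0 at the 5% level.
```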
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Advantages of Nonparametric Tests

 1. Can be used with all scales
 2. Easier to compute
 3. Make fewer assumptions
 4. Need not involve population parameters
 5. Results may be as exact as parametric procedures
Disadvantages of Nonparametric Tests

 1. May waste information (a parametric model is more efficient if the data permit it)
 2. Difficult to compute by hand for large samples
 3. Tables not widely available
Popular Nonparametric Tests

 1. Sign test
 2. Wilcoxon rank sum test
 3. Wilcoxon signed rank test
Wilcoxon Rank-Sum Test

 A nonparametric method
 • Makes no assumptions about the underlying probability distributions
Wilcoxon Rank Sum Test

 1. Tests two independent population probability distributions
 2. Corresponds to the t-test for two independent means
 3. Assumptions
    - Independent, random samples
    - Populations are continuous
 4. Can use a normal approximation if ni ≥ 10
Wilcoxon Rank Sum Test Procedure

 1. Assign ranks, Ri, to the n1 + n2 sample observations
    - If the sample sizes are unequal, let n1 refer to the smaller-sized sample
    - Smallest value = rank 1
 2. Sum the ranks, Ti, for each sample
    - The test statistic is TA (the rank sum of the smallest sample)

 Null hypothesis: both samples come from the same underlying distribution.
 The distribution of T is not quite as simple as the binomial, but it can be computed.
Wilcoxon Rank Sum Test Example

 • You're a production planner.
 • You want to see if the operating rates for two factories are the same.
 • For factory 1, the rates are 71, 82, 77, 92, 88.
 • For factory 2, the rates are 85, 82, 94, 97.
 • Do the factory rates have the same probability distributions at the .05 level?
Wilcoxon Rank Sum Test Solution (setup)

 • H0: identical distributions
 • Ha: shifted left or right
 • α = .05
 • n1 = 4, n2 = 5
Wilcoxon Rank Sum Table (Rosner, Table 12, portion), α = .05 two-tailed

            n1 = 4      n1 = 5      n1 = 6
            TL   TU     TL   TU     TL   TU    ..
 n2 = 4     10   26     16   34     23   43    ..
 n2 = 5     11   29     17   38     24   48    ..
 n2 = 6     12   32     18   42     26   52    ..
 :          :    :      :    :      :    :     :
Wilcoxon Rank Sum Test Solution (critical values)

 • From the table, for n1 = 4 and n2 = 5 at α = .05 (two-tailed), the critical values are TL = 11 and TU = 29.
 • Rejection region: reject H0 if the rank sum of the smaller sample is ≤ 11 or ≥ 29; otherwise do not reject.
Wilcoxon Rank Sum Test Computation Table

 Factory 1        |  Factory 2
 Rate   Rank      |  Rate   Rank
 71     1         |  85     5
 82     3.5       |  82     3.5
 77     2         |  94     8
 92     7         |  97     9
 88     6         |  ...    ...
 Rank sum: 19.5   |  Rank sum: 25.5

 (The two rates of 82 are tied for ranks 3 and 4, so each receives the average rank 3.5.)
Wilcoxon Rank Sum Test Solution

 • H0: identical distributions
 • Ha: shifted left or right
 • α = .05
 • n1 = 4, n2 = 5
 • Test statistic: T2 = 5 + 3.5 + 8 + 9 = 25.5 (rank sum of the smallest sample)
 • Critical values (from the table): TL = 11, TU = 29
 • Decision: since 11 < 25.5 < 29, do not reject H0 at α = .05
 • Conclusion: there is no evidence that the distributions are unequal
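Base R's wilcox.test() performs the same rank-sum test; this sketch re-enters the factory rates from the example.

```r
# Wilcoxon rank-sum test on the factory operating rates from the slides.
factory1 <- c(71, 82, 77, 92, 88)
factory2 <- c(85, 82, 94, 97)

wilcox.test(factory1, factory2)
# Because of the tie (82 appears in both samples) R warns that the p-value
# is approximate; either way p > .05, so H0 of identical distributions
# is not rejected, matching the table-based decision above.
```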
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Type I and Type II Errors

 A Type I error occurs when we reject the null hypothesis when it is in fact true (H0 is wrongly rejected). Its probability is denoted by α.

 A Type II error occurs when we accept the null hypothesis when it is in fact false (H0 is wrongly accepted). Its probability is denoted by β.
Type I and Type II Errors

 Which one is more dangerous, a Type I or a Type II error? Justify your answer.
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Power and Sample Size
• The power of a test is the probability of correctly rejecting the null hypothesis when it is false.
• It is denoted by 1 − β, where β is the probability of a Type II error.
• The power of a test improves as the sample size increases.
• Power is used to determine the necessary sample size.
• The power of a hypothesis test depends on the true difference between the population means.
• A larger sample size is required to detect a smaller difference in the means.
• In general, the effect size d is the difference between the means (often expressed relative to the standard deviation).
• It is important to consider an appropriate effect size for the problem at hand.
Power and Sample Size
A larger sample size better identifies a fixed effect size
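Base R's power.t.test() makes the relationship between effect size and sample size concrete; the difference and standard deviation below are hypothetical values chosen only for illustration.

```r
# Required sample size per group to detect a hypothetical difference of
# 5 units when sd = 10, at alpha = 0.05 with 80% power.
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)

# Halving the detectable difference roughly quadruples the required n.
power.t.test(delta = 2.5, sd = 10, sig.level = 0.05, power = 0.80)
```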
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
ANOVA (Analysis of Variance)

 A generalization of hypothesis testing for the difference of two population means
 Good for analyzing more than two populations
 ANOVA tests whether any of the population means differ from the other population means
ANOVA (Analysis of Variance)
 1. Find the mean of each group.
 2. Find the overall mean (the mean of all the groups combined).
 3. Find the Within-Group Variation: the total deviation of each member's score from its Group Mean.
 4. Find the Between-Group Variation: the deviation of each Group Mean from the Overall Mean.
 5. Find the F statistic (the ratio of Between-Group Variation to Within-Group Variation) and the F critical value.
 6. If the F statistic < F critical, accept (fail to reject) H0; otherwise reject H0 and accept Ha.
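In R, a one-way ANOVA can be run with the base aov() function; the three groups below are simulated purely to illustrate the call.

```r
# One-way ANOVA sketch on simulated data (three groups).
set.seed(3)
scores <- c(rnorm(10, mean = 60, sd = 5),
            rnorm(10, mean = 65, sd = 5),
            rnorm(10, mean = 66, sd = 5))
group  <- factor(rep(c("A", "B", "C"), each = 10))

fit <- aov(scores ~ group)
summary(fit)   # reports the F statistic and p-value for H0: all group means equal
```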


Syllabus

 Statistical Methods for Evaluation: hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

 Advanced Analytical Theory and Methods: Clustering - overview; k-means - use cases, overview of methods, determining the number of clusters, diagnostics, reasons to choose and cautions.
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Overview of Clustering

 Clustering is the use of unsupervised techniques for grouping similar objects
   - Supervised methods use labeled objects
   - Unsupervised methods use unlabeled objects
 Clustering looks for hidden structure in the data: similarities based on attributes
 Often used for exploratory analysis
   - No predictions are made
General Applications of Clustering
• Pattern recognition
• Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters and explain them in spatial data mining
• Image processing
• Economic science (especially market research)
• WWW
  - Document classification
  - Cluster weblog data to discover groups of similar access patterns
Examples of Clustering Applications

 Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Land use: identification of areas of similar land use in an earth observation database
 Insurance: identifying groups of motor insurance policy holders with a high average claim cost
 City planning: identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: observed earthquake epicenters should be clustered along continent faults
CLUSTERING

• Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
• The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it
• Data can be clustered on different attributes
• Clustering differs from classification
  - Unsupervised learning
  - No predefined classes (no a priori knowledge)
 Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Clustering Methods

• Given a cluster Km of N points {tm1, tm2, ..., tmN}, the centroid (middle) of the cluster is computed as
  Centroid Cm = (Σ tmi) / N
  and is considered the representative of the cluster (there may not be any corresponding actual object).
• Some algorithms instead use a centrally located object, called the medoid, as the representative.

 [Figure: a cluster's centroid vs. its medoid]
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


K-means Algorithm

 Given a collection of objects, each with n measurable attributes, and a chosen value k (the number of clusters), the algorithm identifies the k clusters of objects based on the objects' proximity to the centers of the k groups.

 The algorithm is iterative, with the centers adjusted to the mean of each cluster's n-dimensional vector of attributes.
Use Cases
• Clustering is often used as a lead-in to
classification, where labels are applied to the
identified clusters
• Some applications
• Image processing
• With security images, successive frames are examined for change
• Medical
• Patients can be grouped to identify naturally occurring clusters
• Customer segmentation
• Marketing and sales groups identify customers having similar
behaviors and spending patterns
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


K-Means Example

 Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2

 Randomly assign means: m1 = 3, m2 = 4
 K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25};  m1 = 2.5, m2 = 16
 K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25};  m1 = 3, m2 = 18
 K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25};  m1 = 4.75, m2 = 19.6
 K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25};  m1 = 7, m2 = 25
 Reassigning points with these means leaves the clusters unchanged, so the algorithm stops.
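The same one-dimensional example can be checked with base R's kmeans(); treating the values as a one-column matrix here is an illustrative sketch, not part of the original slide.

```r
# Checking the one-dimensional k-means example with base R.
x  <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
km <- kmeans(matrix(x, ncol = 1), centers = 2, nstart = 25)

km$centers    # approximately 7 and 25, matching the final means in the trace
km$cluster    # cluster assignment for each point
```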
K-means Method
Four Steps
Choose the value of k and the initial guesses for the centroids

Compute the distance from each data point to each centroid, and assign each point
to the closest centroid

Compute the centroid of each newly defined cluster from step 2

Repeat steps 2 and 3 until the algorithm converges (no changes occur)
K-means Method (two dimensions)
Example – Step 1
• Choose the value of k and the k initial guesses for the centroids.
• In this example, k = 3, and the initial centroids are indicated by the points
shaded in red, green, and blue
K-means Method (two dimensions)
Example – Step 2
• Points are assigned to the closest centroid.
• In two dimensions, the distance d between any two points (x1, y1) and (x2, y2) is expressed by the Euclidean distance measure:
  d = √[ (x1 − x2)² + (y1 − y2)² ]
K-means Method (two dimensions)
Example – Step 3
• Compute the centroids of the new clusters. In two dimensions, the centroid (xc, yc) of m points is calculated as:
  (xc, yc) = ( Σ xi / m , Σ yi / m )
K-means Method (two dimensions)
Example – Step 4
• Repeat steps 2 and 3 until convergence
• Convergence occurs when the centroids do not change or when
the centroids oscillate back and forth
• This can occur when one or more points have equal distances from
the centroid centers
• Videos
• http://www.youtube.com/watch?v=aiJ8II94qck
• https://class.coursera.org/ml-003/lecture/78
K-means (n dimensions)
• To generalize the prior algorithm to n dimensions, suppose there are M objects, where each object is described by n attributes or property values (p1, p2, ..., pn). Then object i is described by (pi1, pi2, ..., pin) for i = 1, 2, ..., M.

• For a given point pi at (pi1, pi2, ..., pin) and a centroid q located at (q1, q2, ..., qn), the distance d between pi and q is

  d(pi, q) = √[ (pi1 − q1)² + (pi2 − q2)² + ... + (pin − qn)² ]

• The centroid q of a cluster of m points is calculated as

  (q1, q2, ..., qn) = ( Σ pi1 / m , Σ pi2 / m , ..., Σ pin / m )
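A small R sketch of these two formulas, using a made-up three-dimensional point set:

```r
# Euclidean distance between a point p and a centroid q, and the centroid of
# a set of points stored as the rows of a matrix (illustrative data only).
euclid <- function(p, q) sqrt(sum((p - q)^2))

pts <- rbind(c(1, 2, 3),
             c(4, 0, 3),
             c(2, 2, 6))          # three points in 3 dimensions
centroid <- colMeans(pts)         # component-wise mean (q1, ..., qn)

euclid(pts[1, ], centroid)        # distance from the first point to the centroid
```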
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Determining Number of Clusters

• k clusters can be identified in a given dataset, but what value of k should be selected?
• The value of k can be chosen based on a reasonable guess or some predefined requirement.
• How do we know whether having k clusters is better or worse than having k − 1 or k + 1 clusters?
• Solution:
  - Use a heuristic, e.g., the Within Sum of Squares (WSS)
  - The WSS metric is the sum of the squares of the distances between each data point and its closest centroid
  - The process of identifying the appropriate value of k is referred to as finding the "elbow" of the WSS curve
Determining Number of Clusters (WSS Method)
1. Compute the clustering algorithm (e.g., k-means) for different values of k, for instance varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS):

   WSS = Σk Σ(xi in Ck) (xi − μk)²

   where xi is a data point belonging to cluster Ck and μk is the mean value of the points assigned to cluster Ck.
3. Plot the curve of WSS against the number of clusters k.
4. The location of a bend ("knee" or "elbow") in the plot is generally considered an indicator of the appropriate number of clusters.
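A sketch of the elbow heuristic in R on synthetic data (three simulated blobs; tot.withinss is kmeans()'s total WSS):

```r
# Elbow heuristic: total within-cluster sum of squares for k = 1..10
# on synthetic two-dimensional data with three underlying groups.
set.seed(4)
data <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2),
              matrix(rnorm(100, mean = 8), ncol = 2))

wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within Sum of Squares (WSS)")   # look for the bend ("elbow")
```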
Determining Number of Clusters
Example of WSS vs #Clusters curve

The elbow of the curve appears to occur at k = 3.


Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Diagnostics
 When the number of clusters is small, plotting the data helps refine the choice of k.

 The following questions should be considered:
 • Are the clusters well separated from each other?
 • Do any of the clusters have only a few points?
 • Do any of the centroids appear to be too close to each other?
Diagnostics
Example of distinct clusters
Diagnostics
Example of less obvious clusters
Diagnostics
Six clusters from points of previous figure
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Reasons to Choose and Cautions
• Decisions the practitioner must make
• What object attributes should be included
in the analysis?
• What unit of measure should be used for
each attribute?
• Do the attributes need to be rescaled?
• What other considerations might apply?
Reasons to Choose and Cautions
Object Attributes

• It is important to understand what attributes will be known at the time a new object is assigned to a cluster.
  - E.g., information on existing customers' satisfaction or purchase frequency may be available, but such information may not be available for potential customers.
  - E.g., attributes such as age and income may be available for existing customers but not for new customers.
• It is best to reduce the number of attributes when possible.
  - Too many attributes minimize the impact of key variables.
  - Identify highly correlated attributes for reduction.
  - Combine several attributes into one, e.g., a debt/asset ratio.
Reasons to Choose and Cautions
Object attributes: scatterplot matrix for seven attributes
Reasons to Choose and Cautions
Units of Measure

• K-means algorithm will identify different clusters depending on the


units of measure

k=2
Reasons to Choose and Cautions
Units of Measure

Age
dominates
k=2
Reasons to Choose and Cautions
Rescaling

• Rescaling can reduce domination effect


• E.g., divide each variable by the appropriate standard
deviation

Rescaled
attributes

k=2
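A short R sketch of the rescaling idea: running kmeans() on raw versus scale()d columns of a hypothetical customer data frame (the column names and values are invented for illustration).

```r
# Rescaling before k-means so that a large-valued attribute (income) does not
# dominate the Euclidean distance; 'customers' is hypothetical example data.
customers <- data.frame(age    = c(23, 45, 31, 52, 38, 27),
                        income = c(28000, 91000, 45000, 120000, 60000, 33000))

km_raw    <- kmeans(customers,        centers = 2, nstart = 25)  # income dominates
km_scaled <- kmeans(scale(customers), centers = 2, nstart = 25)  # each column: mean 0, sd 1

table(raw = km_raw$cluster, scaled = km_scaled$cluster)   # compare the partitions
```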
Reasons to Choose and Cautions
Additional Considerations

K-means sensitive to starting seeds


• Important to rerun with several seeds – R has the nstart option

Could explore distance metrics other than Euclidean


• E.g., Manhattan, Mahalanobis, etc.

K-means is easily applied to numeric data and does


not work well with nominal attributes
• E.g., color
Additional Algorithms

K-modes clustering
• kmod()

Partitioning around Medoids (PAM)


• pam()

Hierarchical agglomerative clustering


• hclust()
Summary
• Clustering analysis groups similar objects based on the
objects’ attributes
• To use k-means properly, it is important to
• Properly scale the attribute values to avoid
domination
• Assure the concept of distance between the
assigned values of an attribute is meaningful
• Carefully choose the number of clusters, k
• Once the clusters are identified, it is often useful to label
them in a descriptive way