
Data Analytics

(BE-2015 Pattern)
Unit II
Basic Data Analytic Methods
Syllabus

 Statistical Methods for Evaluation: hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

 Advanced Analytical Theory and Methods: Clustering - overview; k-means - use cases, overview of methods, determining the number of clusters, diagnostics, reasons to choose and cautions.
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
What is a Hypothesis?
• A hypothesis is an educated guess about
something in the world around you. It should
be testable, either by experiment or
observation. For example:
• A new medicine you think might work.
• A way of teaching you think might be better.
What is a Hypothesis Statement?
• A hypothesis statement will look like this:
• “If I…(do this to an independent variable)….then (this will happen to the
dependent variable).”

• For example:
• If I (decrease the amount of water given to herbs) then (the herbs will
increase in size).
• If I (give patients counseling in addition to medication) then (their overall
depression scale will decrease).
What is Hypothesis Testing?
• Hypothesis testing refers to:
1. Making an assumption, called hypothesis,
about a population parameter.
2. Collecting sample data.
3. Calculating a sample statistic.
4. Using the sample statistic to evaluate the
hypothesis
Hypothesis Testing: Population & Sample
Hypothesis Testing
Null hypothesis, H0
 States the hypothesized value of the parameter before sampling.
 The assumption we wish to test (or are trying to reject).
 E.g., µ = 20
 E.g., there is no difference between Coke and Diet Coke.

Alternative hypothesis, HA
 All possible alternatives other than the null hypothesis.
 E.g., µ ≠ 20, µ > 20, µ < 20
 E.g., there is a difference between Coke and Diet Coke.
Hypothesis Testing
 Basic concept is to form an assertion and test it with data.
 Common assumption is that there is no difference between samples (the default assumption).
 Statisticians refer to this as the null hypothesis (H0).
 The alternative hypothesis (HA) is that there is a difference between samples.


What is the Null & Alternative Hypothesis?

• The null hypothesis is the accepted fact, i.e., a statement assumed to be true. Examples:
• DNA is shaped like a double helix.
• There are 8 planets in the solar system (excluding Pluto).

• Given a population, the initial (assumed) hypothesis to be tested, H0, is called the null hypothesis.
• Rejection of the null hypothesis leads to another hypothesis, HA (also written H1), called the alternative hypothesis.
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Mean, Variance, Standard Deviation

 Mean (or average) is denoted by
   ● μ if working with a population
   ● x̄ if working with a sample

 Variance is denoted by
   ● σ² for a population
   ● s² for a sample

 Standard deviation is denoted by
   ● σ (or σX) for a population
   ● s (or sX) for a sample
Mean is a simple average of the given data values

• Example
• Data: 4, 5, 9, 3, 15, 6
• Mean x̄ = (4 + 5 + 9 + 3 + 15 + 6) / 6
          = 42 / 6
          = 7
Variance: a measure of how data points differ from the mean

• Data Set 1: 3, 5, 7, 10, 10
• Data Set 2: 7, 7, 7, 7, 7

• What is the mean of each data set?
• Data Set 1: mean = 7
• Data Set 2: mean = 7

• But we know that the two data sets are not identical! The variance shows how they are different.
• We want a way to represent this difference numerically.
How to Calculate Variance?
• If we conceptualize the spread of a distribution as the extent to which the values in the distribution differ from the mean and from each other, then a reasonable measure of spread might be the average deviation, or difference, of the values from the mean:

  Σ(x − x̄) / N
How to Calculate Variance?
• The average of the squared deviations about the mean is called the variance.

 Population variance:
   σ² = Σ(x − μ)² / N

 Sample variance:
   s² = Σ(x − x̄)² / (n − 1)
Example 1 – Variance

 Score (X) | X − x̄ | (X − x̄)²
 3         |        |
 5         |        |
 7         |        |
 10        |        |
 10        |        |
 Total: 35

 The mean is 35/5 = 7.
Example 1 – Variance

 Score (X) | X − x̄       | (X − x̄)²
 3         | 3 − 7 = −4   |
 5         | 5 − 7 = −2   |
 7         | 7 − 7 = 0    |
 10        | 10 − 7 = 3   |
 10        | 10 − 7 = 3   |
 Total: 35
Example 1 – Variance

 Score (X) | X − x̄       | (X − x̄)²
 3         | 3 − 7 = −4   | 16
 5         | 5 − 7 = −2   | 4
 7         | 7 − 7 = 0    | 0
 10        | 10 − 7 = 3   | 9
 10        | 10 − 7 = 3   | 9
 Totals: 35               | 38
Example 1 – Variance

 Variance (treating the five scores as a complete population):
   σ² = Σ(x − x̄)² / N = 38 / 5 = 7.6
Example 1 – Variance (Data Set 2)

 Score (X) | X − x̄     | (X − x̄)²
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 7         | 7 − 7 = 0  | 0
 Totals: 35             | 0

 Variance: σ² = Σ(x − x̄)² / N = 0 / 5 = 0
Example 2 – Variance

 Dive | Mark | Myrna
 1    | 28   | 27
 2    | 22   | 27
 3    | 21   | 28
 4    | 26   | 6
 5    | 18   | 27

 Which diver was more consistent?
Example 2 – Variance

 Dive   | Mark's Score (X) | X − x̄ | (X − x̄)²
 1      | 28               | 5      | 25
 2      | 22               | −1     | 1
 3      | 21               | −2     | 4
 4      | 26               | 3      | 9
 5      | 18               | −5     | 25
 Totals | 115              | 0      | 64

 Mark's mean = 115 / 5 = 23
 Mark's variance = 64 / 5 = 12.8
 Myrna's variance = 362 / 5 = 72.4

 Conclusion: Mark has the lower variance, therefore he is the more consistent diver.


Standard deviation – a measure of variation of scores about the mean

• Can think of the standard deviation as the average distance to the mean.
• A higher standard deviation indicates higher spread, less consistency, and less clustering.

 Sample standard deviation:
   s = √[ Σ(x − x̄)² / (n − 1) ]

 Population standard deviation:
   σ = √[ Σ(x − μ)² / N ]
Example – Standard Deviation

 Dive   | Mark's Score (X) | X − x̄ | (X − x̄)²
 1      | 28               | 5      | 25
 2      | 22               | −1     | 1
 3      | 21               | −2     | 4
 4      | 26               | 3      | 9
 5      | 18               | −5     | 25
 Totals | 115              | 0      | 64

 Mark's variance = 64 / 5 = 12.8
 Mark's standard deviation (population) = √(64 / 5) = √12.8 ≈ 3.58
 Mark's standard deviation (sample) = √(64 / 4) = √16 = 4
Example- Variance & Standard Deviation
• You have just measured the heights of your dogs (in mm)
• The heights (at the shoulders) are: 600mm, 470mm, 170mm,
430mm and 300mm.
• Find out the Mean, the Variance, and the Standard Deviation.
Example- Variance & Standard Deviation
• Your first step is to find the Mean:
• Mean = (600 + 470 + 170 + 430 + 300) / 5
• Mean = 1970/5
• Mean = 394
Example – Variance & Standard Deviation
• Now we calculate each dog's difference from the Mean:
• 600 − 394 = 206;  470 − 394 = 76;  170 − 394 = −224;  430 − 394 = 36;  300 − 394 = −94
Example – Variance & Standard Deviation
• To calculate the Variance, take each difference, square it, and then average the results:
• Variance:
  σ² = [206² + 76² + (−224)² + 36² + (−94)²] / 5
     = (42436 + 5776 + 50176 + 1296 + 8836) / 5
     = 108520 / 5
     = 21704

• So the Variance σ² is 21,704.


Example- Variance & Standard Deviation
• And the Standard Deviation is just the square root of Variance,
so:
• Standard Deviation

σ = √21704
  = 147.32...
  = 147 (to the nearest mm)
Example- Variance & Standard Deviation
• And the good thing about the Standard Deviation is that it is
useful. Now we can show which heights are within one
Standard Deviation (147mm) of the Mean:

• So, using the Standard Deviation we have a "standard" way of


knowing what is normal, and what is extra large or
extra small.
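The same numbers can be checked quickly in R. This is a minimal sketch using only base functions; note that R's built-in var() and sd() use the sample formulas (divide by n − 1), so the population values used in the slides are computed explicitly.

```r
# A minimal sketch in R reproducing the dog-height example above.
heights <- c(600, 470, 170, 430, 300)

mean(heights)                                        # 394

pop_var <- sum((heights - mean(heights))^2) / length(heights)
pop_var                                              # 21704 (population variance)
sqrt(pop_var)                                        # ~147.32 (population SD)

var(heights)                                         # 27130  (sample variance, n - 1)
sd(heights)                                          # ~164.7 (sample SD)
```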
Difference of Means

 1. State the hypotheses.
 2. Formulate an analysis plan.
 3. Analyze the sample data using a hypothesis test.
 4. Interpret the results.
Hypothesis Testing Procedures

 Parametric: Z test, t test, one-way ANOVA
 Nonparametric: Wilcoxon rank sum test, Kruskal-Wallis H-test

Many more tests exist!
Parametric Test Procedures

 1. Involve population parameters (e.g., the mean)
 2. Have stringent (strict) assumptions (e.g., normality)
 3. Examples: Z test, t test, χ² test, F test
Nonparametric Test Procedures

 1. Do not involve population parameters (e.g., tests about probability distributions or independence)
 2. Data can be measured on any scale (ratio or interval, ordinal or nominal)
 3. Example: Wilcoxon rank sum test
Parametric Test Procedures

 A t test allows us to compare the means of two groups.

 The calculation of a t test requires three pieces of information:
 - the difference between the means (the mean difference)
 - the standard deviation of each group
 - the number of subjects (samples) in each group
 [Figure: histograms of spelling test scores (12–25) for two groups]
The size of the standard deviation also influences the outcome of a t test.
Given the same difference in means, groups with smaller standard deviations are more likely to report a significant difference than groups with larger standard deviations.

 [Figure: histograms of spelling test scores (12–25) for two groups with small vs. large standard deviations]
From a practical standpoint, we can see that smaller standard deviations produce less overlap between the groups than larger standard deviations. Less overlap would indicate that the groups are more different from each other.

 [Figure: histograms of spelling test scores (12–25) showing the overlap between the two groups]
Difference of Means
Two populations – same or different? How do we determine which t test to use?

 Are the scores for the two means from the same subject (or related subjects)?
   - Yes → Paired t test (dependent t-test; correlated t-test)
   - No → Are there the same number of people in the two groups?
       - Yes → Equal-variance independent t test (pooled-variance independent t-test)
       - No → Are the variances of the two groups the same?
           - Yes (significance level for Levene (or F-max) is p >= .05) → Equal-variance independent t test (pooled-variance independent t test)
           - No (significance level for Levene (or F-max) is p < .05) → Unequal-variance independent t-test (separate-variance independent t test)
Difference of Means
Two Parametric Methods

 Student's t-test
   - Assumes two normally distributed populations with equal variance

 Welch's t-test
   - Assumes two normally distributed populations that do not necessarily have equal variance
Student's t-test
 Student's t-test assumes that the distributions of the two populations have equal but unknown variances.

 Suppose n1 and n2 samples are randomly and independently selected from two populations, pop1 and pop2, respectively.

 If each population is normally distributed with the same mean (µ1 = µ2) and the same variance, then T (the t-statistic) follows a t-distribution with df = n1 + n2 − 2 degrees of freedom.
Student's t-test

 T = (x̄1 − x̄2) / ( sp · √(1/n1 + 1/n2) )

 where the pooled variance is
   sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)

• α is the significance level
• degrees of freedom: df = n1 + n2 − 2
• T* is the critical value found using df (from the t table)
• If |T| >= T*, the null hypothesis is rejected
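In R, the pooled-variance form of the test is available through the base t.test() function by setting var.equal = TRUE. A minimal sketch on simulated data (the sample values and means here are arbitrary illustrations):

```r
# Student's (pooled-variance) t-test in base R; the data below are simulated
# purely for illustration.
set.seed(1)
x <- rnorm(10, mean = 100, sd = 5)   # sample from population 1
y <- rnorm(12, mean = 105, sd = 5)   # sample from population 2

t.test(x, y, var.equal = TRUE)       # var.equal = TRUE gives Student's t-test
                                     # (df = n1 + n2 - 2 = 20)
```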
Welch's t-test

 When the equal population variance assumption is not justified in performing Student's t-test for the difference of means, Welch's t-test can be used instead.

 Also known as the unequal variances t-test.
Welch's t-test

 Twelch = (x̄1 − x̄2) / √( s1²/n1 + s2²/n2 )

 where x̄, s², and n correspond to the sample mean, sample variance, and sample size of each group.

 Notice that Welch's t-test uses the sample variance (s²) of each population instead of the pooled sample variance.
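Welch's test is in fact R's default: t.test() assumes unequal variances unless told otherwise. A short sketch on simulated data:

```r
# Welch's t-test is the default in base R (var.equal = FALSE); simulated data
# are used only to illustrate the call.
set.seed(2)
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(15, mean = 105, sd = 15)   # deliberately different spread

t.test(x, y)                          # Welch's t-test with fractional df
```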
Example
t-test: independent samples

 Some brown hairs were found on the clothing of a victim at a crime scene.
 Five of the hairs were measured: 46, 57, 54, 51, 38 μm.
 A suspect is the owner of a shop with similar brown hairs. A sample of those hairs was taken and their widths measured: 31, 35, 50, 35, 36 μm.
 Is it possible that the hairs found on the victim were left by the suspect? Test at the 5% level.
 [From D. Lucy, Introduction to Statistics for Forensic Scientists. Chichester: Wiley, 2005, p. 44.]
t-test: independent samples

1. Calculate the mean and standard deviation for each data set.

   Sample               | A (victim) | B (suspect)
                        | 46         | 31
                        | 57         | 35
                        | 54         | 50
                        | 51         | 35
                        | 38         | 36
   Total                | 246        | 187
   Mean                 | 49.2       | 37.4
   Standard deviation   | 7.463      | 7.301
2. Calculate the magnitude of the difference between the two means:

   |49.2 − 37.4| = 11.8
3. Calculate the standard error of the difference:

   SE = sp · √(1/n1 + 1/n2), where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
      = 4.669 ≈ 4.67 (3 s.f.)
4. Calculate the value of T:

   T = difference between the means ÷ standard error of the difference
     = 11.8 / 4.669 = 2.527 ≈ 2.53 (3 s.f.)
5. Calculate the degrees of freedom: df = n1 + n2 − 2 = 5 + 5 − 2 = 8
6. Find the critical value T* for the significance level you are working to from the t table.

   At the 0.05 level with 8 degrees of freedom, T* (tcrit) = 2.306.
7. Compare T with the critical value:

   If T < T* (critical value), there is no significant difference between the two sets of data, i.e., the null hypothesis is not rejected (accepted).
   If T >= T* (critical value), there is a significant difference between the two sets of data, i.e., the null hypothesis is rejected.

   Here T = 2.53 > 2.306, so the null hypothesis is rejected: the two sets of hair widths differ significantly at the 5% level.
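The same calculation can be reproduced in R with t.test(); this sketch simply re-enters the hair-width values from the example.

```r
# Reproducing the hair-width example with a pooled-variance t-test.
victim  <- c(46, 57, 54, 51, 38)
suspect <- c(31, 35, 50, 35, 36)

mean(victim);  sd(victim)     # 49.2, ~7.463
mean(suspect); sd(suspect)    # 37.4, ~7.301

t.test(victim, suspect, var.equal = TRUE)
# t ~ 2.53 on 8 df, p ~ 0.035 < 0.05, matching the hand calculation:
# reject H0 at the 5% level.
```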
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Advantages of Nonparametric Tests

 1. Can be used with all scales
 2. Easier to compute
 3. Make fewer assumptions
 4. Need not involve population parameters
 5. Results may be as exact as parametric procedures
Disadvantages of Nonparametric Tests

 1. May waste information (a parametric model is more efficient if the data permit it)
 2. Difficult to compute by hand for large samples
 3. Tables not widely available
Popular Nonparametric Tests

 1. Sign test
 2. Wilcoxon rank sum test
 3. Wilcoxon signed rank test
Wilcoxon Rank-Sum Test

 A nonparametric method
 • Makes no assumptions about the underlying probability distributions
Wilcoxon Rank Sum Test

 1. Tests two independent population probability distributions
 2. Corresponds to the t-test for two independent means
 3. Assumptions
    - Independent, random samples
    - Populations are continuous
 4. Can use a normal approximation if ni ≥ 10
Wilcoxon Rank Sum Test Procedure

 1. Assign ranks, Ri, to the n1 + n2 sample observations
    - If the sample sizes are unequal, let n1 refer to the smaller-sized sample
    - Smallest value = rank 1
 2. Sum the ranks, Ti, for each sample
    - The test statistic is TA (the rank sum of the smallest sample)

 Null hypothesis: both samples come from the same underlying distribution.
 The distribution of T is not quite as simple as the binomial, but it can be computed.
Wilcoxon Rank Sum Test Example

 • You're a production planner.
 • You want to see if the operating rates for two factories are the same.
 • For factory 1, the rates are 71, 82, 77, 92, 88.
 • For factory 2, the rates are 85, 82, 94, 97.
 • Do the factory rates have the same probability distributions at the .05 level?
Wilcoxon Rank Sum Test Solution (setup)

 • H0: identical distributions
 • Ha: shifted left or right
 • α = .05
 • n1 = 4, n2 = 5
Wilcoxon Rank Sum Table (Rosner, Table 12, portion), α = .05 two-tailed

            n1 = 4      n1 = 5      n1 = 6
            TL   TU     TL   TU     TL   TU    ..
 n2 = 4     10   26     16   34     23   43    ..
 n2 = 5     11   29     17   38     24   48    ..
 n2 = 6     12   32     18   42     26   52    ..
 :          :    :      :    :      :    :     :
Wilcoxon Rank Sum Test Solution (critical values)

 • From the table, for n1 = 4 and n2 = 5 at α = .05 (two-tailed), the critical values are TL = 11 and TU = 29.
 • Rejection region: reject H0 if the rank sum of the smaller sample is ≤ 11 or ≥ 29; otherwise do not reject.
Wilcoxon Rank Sum Test Computation Table

 Factory 1        |  Factory 2
 Rate   Rank      |  Rate   Rank
 71     1         |  85     5
 82     3.5       |  82     3.5
 77     2         |  94     8
 92     7         |  97     9
 88     6         |  ...    ...
 Rank sum: 19.5   |  Rank sum: 25.5

 (The two rates of 82 are tied for ranks 3 and 4, so each receives the average rank 3.5.)
Wilcoxon Rank Sum Test Solution

 • H0: identical distributions
 • Ha: shifted left or right
 • α = .05
 • n1 = 4, n2 = 5
 • Test statistic: T2 = 5 + 3.5 + 8 + 9 = 25.5 (rank sum of the smallest sample)
 • Critical values (from the table): TL = 11, TU = 29
 • Decision: since 11 < 25.5 < 29, do not reject H0 at α = .05
 • Conclusion: there is no evidence that the distributions are unequal
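Base R's wilcox.test() performs the same rank-sum test; this sketch re-enters the factory rates from the example.

```r
# Wilcoxon rank-sum test on the factory operating rates from the slides.
factory1 <- c(71, 82, 77, 92, 88)
factory2 <- c(85, 82, 94, 97)

wilcox.test(factory1, factory2)
# Because of the tie (82 appears in both samples) R warns that the p-value
# is approximate; either way p > .05, so H0 of identical distributions
# is not rejected, matching the table-based decision above.
```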
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Type I and Type II Errors

 A Type I error occurs when we reject the null hypothesis when it is in fact true (H0 is wrongly rejected). Its probability is denoted by α.

 A Type II error occurs when we accept the null hypothesis when it is in fact false (H0 is wrongly accepted). Its probability is denoted by β.
Type I and Type II Errors

 Which one is more dangerous, a Type I or a Type II error? Justify your answer.
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
Power and Sample Size
• The power of a test is the probability of correctly rejecting the null hypothesis when it is false.
• It is denoted by 1 − β, where β is the probability of a Type II error.
• The power of a test improves as the sample size increases.
• Power is used to determine the necessary sample size.
• The power of a hypothesis test depends on the true difference between the population means.
• A larger sample size is required to detect a smaller difference in the means.
• In general, the effect size d is the difference between the means (often expressed relative to the standard deviation).
• It is important to consider an appropriate effect size for the problem at hand.
Power and Sample Size
A larger sample size better identifies a fixed effect size
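Base R's power.t.test() makes the relationship between effect size and sample size concrete; the difference and standard deviation below are hypothetical values chosen only for illustration.

```r
# Required sample size per group to detect a hypothetical difference of
# 5 units when sd = 10, at alpha = 0.05 with 80% power.
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)

# Halving the detectable difference roughly quadruples the required n.
power.t.test(delta = 2.5, sd = 10, sig.level = 0.05, power = 0.80)
```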
Statistical Methods for Evaluation
 Hypothesis testing
 Difference of means
 Wilcoxon rank-sum test
 Type I and Type II errors
 Power and sample size
 ANOVA
ANOVA (Analysis of Variance)

 A generalization of hypothesis testing for the difference of two population means
 Good for analyzing more than two populations
 ANOVA tests whether any of the population means differ from the other population means
ANOVA (Analysis of Variance)
 1. Find the mean of each group.
 2. Find the overall mean (the mean of all the groups combined).
 3. Find the Within-Group Variation: the total deviation of each member's score from its Group Mean.
 4. Find the Between-Group Variation: the deviation of each Group Mean from the Overall Mean.
 5. Find the F statistic (the ratio of Between-Group Variation to Within-Group Variation) and the F critical value.
 6. If the F statistic < F critical, accept (fail to reject) H0; otherwise reject H0 and accept Ha.
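In R, a one-way ANOVA can be run with the base aov() function; the three groups below are simulated purely to illustrate the call.

```r
# One-way ANOVA sketch on simulated data (three groups).
set.seed(3)
scores <- c(rnorm(10, mean = 60, sd = 5),
            rnorm(10, mean = 65, sd = 5),
            rnorm(10, mean = 66, sd = 5))
group  <- factor(rep(c("A", "B", "C"), each = 10))

fit <- aov(scores ~ group)
summary(fit)   # reports the F statistic and p-value for H0: all group means equal
```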


Syllabus

 Statistical Methods for Evaluation: hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

 Advanced Analytical Theory and Methods: Clustering - overview; k-means - use cases, overview of methods, determining the number of clusters, diagnostics, reasons to choose and cautions.
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Overview of Clustering

 Clustering is the use of unsupervised techniques for grouping similar objects
   - Supervised methods use labeled objects
   - Unsupervised methods use unlabeled objects
 Clustering looks for hidden structure in the data: similarities based on attributes
 Often used for exploratory analysis
   - No predictions are made
General Applications of Clustering
• Pattern recognition
• Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters and explain them in spatial data mining
• Image processing
• Economic science (especially market research)
• WWW
  - Document classification
  - Cluster weblog data to discover groups of similar access patterns
Examples of Clustering Applications

 Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Land use: identification of areas of similar land use in an earth observation database
 Insurance: identifying groups of motor insurance policy holders with a high average claim cost
 City planning: identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: observed earthquake epicenters should be clustered along continent faults
CLUSTERING

• Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
• The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it
• Data can be clustered on different attributes
• Clustering differs from classification
  - Unsupervised learning
  - No predefined classes (no a priori knowledge)
 Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Clustering Methods

• Given a cluster Km of N points {tm1, tm2, ..., tmN}, the centroid (middle) of the cluster is computed as
  Centroid Cm = (Σ tmi) / N
  and is considered the representative of the cluster (there may not be any corresponding actual object).
• Some algorithms instead use a centrally located object, called the medoid, as the representative.

 [Figure: a cluster's centroid vs. its medoid]
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


K-means Algorithm

 Given a collection of objects, each with n measurable attributes, and a chosen value k (the number of clusters), the algorithm identifies the k clusters of objects based on the objects' proximity to the centers of the k groups.

 The algorithm is iterative, with the centers adjusted to the mean of each cluster's n-dimensional vector of attributes.
Use Cases
• Clustering is often used as a lead-in to
classification, where labels are applied to the
identified clusters
• Some applications
• Image processing
• With security images, successive frames are examined for change
• Medical
• Patients can be grouped to identify naturally occurring clusters
• Customer segmentation
• Marketing and sales groups identify customers having similar
behaviors and spending patterns
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


K-Means Example

 Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2

 Randomly assign means: m1 = 3, m2 = 4
 K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25};  m1 = 2.5, m2 = 16
 K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25};  m1 = 3, m2 = 18
 K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25};  m1 = 4.75, m2 = 19.6
 K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25};  m1 = 7, m2 = 25
 Reassigning points with these means leaves the clusters unchanged, so the algorithm stops.
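The same one-dimensional example can be checked with base R's kmeans(); treating the values as a one-column matrix here is an illustrative sketch, not part of the original slide.

```r
# Checking the one-dimensional k-means example with base R.
x  <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
km <- kmeans(matrix(x, ncol = 1), centers = 2, nstart = 25)

km$centers    # approximately 7 and 25, matching the final means in the trace
km$cluster    # cluster assignment for each point
```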
K-means Method
Four Steps
Choose the value of k and the initial guesses for the centroids

Compute the distance from each data point to each centroid, and assign each point
to the closest centroid

Compute the centroid of each newly defined cluster from step 2

Repeat steps 2 and 3 until the algorithm converges (no changes occur)
K-means Method (two dimensions)
Example – Step 1
• Choose the value of k and the k initial guesses for the centroids.
• In this example, k = 3, and the initial centroids are indicated by the points
shaded in red, green, and blue
K-means Method (two dimensions)
Example – Step 2
• Points are assigned to the closest centroid.
• In two dimensions, the distance d between any two points (x1, y1) and (x2, y2) is expressed by the Euclidean distance measure:
  d = √[ (x1 − x2)² + (y1 − y2)² ]
K-means Method (two dimensions)
Example – Step 3
• Compute the centroids of the new clusters. In two dimensions, the centroid (xc, yc) of m points is calculated as:
  (xc, yc) = ( Σ xi / m , Σ yi / m )
K-means Method (two dimensions)
Example – Step 4
• Repeat steps 2 and 3 until convergence
• Convergence occurs when the centroids do not change or when
the centroids oscillate back and forth
• This can occur when one or more points have equal distances from
the centroid centers
• Videos
• http://www.youtube.com/watch?v=aiJ8II94qck
• https://class.coursera.org/ml-003/lecture/78
K-means (n dimensions)
• To generalize the prior algorithm to n dimensions, suppose there are M objects, where each object is described by n attributes or property values (p1, p2, ..., pn). Then object i is described by (pi1, pi2, ..., pin) for i = 1, 2, ..., M.

• For a given point pi at (pi1, pi2, ..., pin) and a centroid q located at (q1, q2, ..., qn), the distance d between pi and q is

  d(pi, q) = √[ (pi1 − q1)² + (pi2 − q2)² + ... + (pin − qn)² ]

• The centroid q of a cluster of m points is calculated as

  (q1, q2, ..., qn) = ( Σ pi1 / m , Σ pi2 / m , ..., Σ pin / m )
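A small R sketch of these two formulas, using a made-up three-dimensional point set:

```r
# Euclidean distance between a point p and a centroid q, and the centroid of
# a set of points stored as the rows of a matrix (illustrative data only).
euclid <- function(p, q) sqrt(sum((p - q)^2))

pts <- rbind(c(1, 2, 3),
             c(4, 0, 3),
             c(2, 2, 6))          # three points in 3 dimensions
centroid <- colMeans(pts)         # component-wise mean (q1, ..., qn)

euclid(pts[1, ], centroid)        # distance from the first point to the centroid
```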
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Determining Number of Clusters

• k clusters can be identified in a given dataset, but what value of k should be selected?
• The value of k can be chosen based on a reasonable guess or some predefined requirement.
• How do we know whether having k clusters is better or worse than having k − 1 or k + 1 clusters?
• Solution:
  - Use a heuristic, e.g., the Within Sum of Squares (WSS)
  - The WSS metric is the sum of the squares of the distances between each data point and its closest centroid
  - The process of identifying the appropriate value of k is referred to as finding the "elbow" of the WSS curve
Determining Number of Clusters (WSS Method)
1. Compute the clustering algorithm (e.g., k-means) for different values of k, for instance varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS):

   WSS = Σk Σ(xi in Ck) (xi − μk)²

   where xi is a data point belonging to cluster Ck and μk is the mean value of the points assigned to cluster Ck.
3. Plot the curve of WSS against the number of clusters k.
4. The location of a bend ("knee" or "elbow") in the plot is generally considered an indicator of the appropriate number of clusters.
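A sketch of the elbow heuristic in R on synthetic data (three simulated blobs; tot.withinss is kmeans()'s total WSS):

```r
# Elbow heuristic: total within-cluster sum of squares for k = 1..10
# on synthetic two-dimensional data with three underlying groups.
set.seed(4)
data <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2),
              matrix(rnorm(100, mean = 8), ncol = 2))

wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within Sum of Squares (WSS)")   # look for the bend ("elbow")
```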
Determining Number of Clusters
Example of WSS vs #Clusters curve

The elbow of the curve appears to occur at k = 3.


Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Diagnostics
 When the number of clusters is small, plotting the data helps refine the choice of k.

 The following questions should be considered:
 • Are the clusters well separated from each other?
 • Do any of the clusters have only a few points?
 • Do any of the centroids appear to be too close to each other?
Diagnostics
Example of distinct clusters
Diagnostics
Example of less obvious clusters
Diagnostics
Six clusters from points of previous figure
Advanced Analytical Theory
and Methods
Clustering- Overview,

K means- Use cases,

Overview of methods,

determining number of clusters,

diagnostics,

reasons to choose and cautions.


Reasons to Choose and Cautions
• Decisions the practitioner must make
• What object attributes should be included
in the analysis?
• What unit of measure should be used for
each attribute?
• Do the attributes need to be rescaled?
• What other considerations might apply?
Reasons to Choose and Cautions
Object Attributes

• It is important to understand what attributes will be known at the time a new object is assigned to a cluster.
  - E.g., information on existing customers' satisfaction or purchase frequency may be available, but such information may not be available for potential customers.
  - E.g., attributes such as age and income may be available for existing customers but not for new customers.
• It is best to reduce the number of attributes when possible.
  - Too many attributes minimize the impact of key variables.
  - Identify highly correlated attributes for reduction.
  - Combine several attributes into one, e.g., a debt/asset ratio.
Reasons to Choose and Cautions
Object attributes: scatterplot matrix for seven attributes
Reasons to Choose and Cautions
Units of Measure

• K-means algorithm will identify different clusters depending on the


units of measure

k=2
Reasons to Choose and Cautions
Units of Measure

Age
dominates
k=2
Reasons to Choose and Cautions
Rescaling

• Rescaling can reduce domination effect


• E.g., divide each variable by the appropriate standard
deviation

Rescaled
attributes

k=2
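A short R sketch of the rescaling idea: running kmeans() on raw versus scale()d columns of a hypothetical customer data frame (the column names and values are invented for illustration).

```r
# Rescaling before k-means so that a large-valued attribute (income) does not
# dominate the Euclidean distance; 'customers' is hypothetical example data.
customers <- data.frame(age    = c(23, 45, 31, 52, 38, 27),
                        income = c(28000, 91000, 45000, 120000, 60000, 33000))

km_raw    <- kmeans(customers,        centers = 2, nstart = 25)  # income dominates
km_scaled <- kmeans(scale(customers), centers = 2, nstart = 25)  # each column: mean 0, sd 1

table(raw = km_raw$cluster, scaled = km_scaled$cluster)   # compare the partitions
```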
Reasons to Choose and Cautions
Additional Considerations

K-means sensitive to starting seeds


• Important to rerun with several seeds – R has the nstart option

Could explore distance metrics other than Euclidean


• E.g., Manhattan, Mahalanobis, etc.

K-means is easily applied to numeric data and does


not work well with nominal attributes
• E.g., color
Additional Algorithms

K-modes clustering
• kmod()

Partitioning around Medoids (PAM)


• pam()

Hierarchical agglomerative clustering


• hclust()
Summary
• Clustering analysis groups similar objects based on the
objects’ attributes
• To use k-means properly, it is important to
• Properly scale the attribute values to avoid
domination
• Assure the concept of distance between the
assigned values of an attribute is meaningful
• Carefully choose the number of clusters, k
• Once the clusters are identified, it is often useful to label
them in a descriptive way