(BE-2015 Pattern)
Unit II
Basic Data Analytic Methods
Syllabus
difference of means,
ANOVA
What is a Hypothesis?
• A hypothesis is an educated guess about
something in the world around you. It should
be testable, either by experiment or
observation. For example:
• A new medicine you think might work.
• A way of teaching you think might be better.
What is a Hypothesis Statement?
• A hypothesis statement will look like this:
• “If I…(do this to an independent variable)….then (this will happen to the
dependent variable).”
• For example:
• If I (decrease the amount of water given to herbs) then (the herbs will
increase in size).
• If I (give patients counseling in addition to medication) then (their overall
depression scale will decrease).
What is Hypothesis Testing?
• Hypothesis testing refers to:
1. Making an assumption, called a hypothesis, about a population parameter.
2. Collecting sample data.
3. Calculating a sample statistic.
4. Using the sample statistic to evaluate the hypothesis.
Hypothesis Testing: Population & Sample
Hypothesis Testing
Null hypothesis, H0
• State the hypothesized value of the parameter before sampling.
• The assumption we wish to test (or trying to reject)
• E.g. µ = 20
• There is no difference between coke and diet coke
Alternative hypothesis, HA
• All possible alternatives other than the null hypothesis.
• E.g. µ ≠ 20, µ > 20, µ < 20
• There is a difference between coke and diet coke
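As a rough illustration of the four steps and of H0: µ = 20 against HA: µ ≠ 20, the sketch below runs a one-sample t test. The sample values are invented for illustration, the choice of a one-sample t statistic is an assumption (the slides do not name a test at this point), and the critical value is read from a standard t table.

```python
import math
import statistics

# Hypothetical sample, invented for illustration only.
sample = [21, 19, 23, 22, 20, 24, 18, 21]
mu0 = 20  # hypothesized value of the parameter under H0: mu = 20

n = len(sample)
x_bar = statistics.mean(sample)   # sample statistic
s = statistics.stdev(sample)      # sample standard deviation (divides by n - 1)

# One-sample t statistic: distance of the sample mean from mu0 in standard errors.
t_stat = (x_bar - mu0) / (s / math.sqrt(n))

# Two-tailed critical value for alpha = 0.05 and df = n - 1 = 7, from a t table.
t_crit = 2.365
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

With this made-up sample the statistic stays inside the critical region bounds, so H0 is not rejected.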
mean, variance, standard deviation
Mean (or Average) denoted by
● μ if working with population
● X̄ if working with samples
Standard deviation denoted by
● σX or σ (for population)
● sX or s (for sample)
Mean – is a simple average of given data values
• Example
• 4,5,9,2,14,6: mean = 40 / 6 ≈ 6.67
• Two data sets can have the same mean without being identical. The variance shows how they are different.
$\frac{\sum (x - \bar{X})}{N}$
How to Calculate variance?
• The average of the squared deviations about the mean is called
the variance.
$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N}$
For sample variance
$s^2 = \frac{\sum (x - \bar{X})^2}{n - 1}$
Example 1- Variance
        Score X    X − X̄      (X − X̄)²
1       3          3-7=-4     16
2       5          5-7=-2     4
3       7          7-7=0      0
4       10         10-7=3     9
5       10         10-7=3     9
Totals  35                    38
Treating the five scores as the whole population:
$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N} = \frac{38}{5} = 7.6$
Example 1- Variance (second data set, same mean)
        Score X    X − X̄      (X − X̄)²
1       7          7-7=0      0
2       7          7-7=0      0
3       7          7-7=0      0
4       7          7-7=0      0
5       7          7-7=0      0
Totals  35                    0
$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N} = \frac{0}{5} = 0$
Example 2- Variance
Dive    Mark's Score X    X − X̄    (X − X̄)²
1       28                5        25
2       22                -1       1
3       21                -2       4
4       26                3        9
5       18                -5       25
Totals  115               0        64
$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N}$
Example – Standard Deviation
Mark's Variance = 64 / 5 = 12.8
Mark's Standard Deviation = √12.8 ≈ 3.58
For a separate example of heights measured in mm, with a variance of 21704:
σ = √21704
= 147.32...
= 147 (to the nearest mm)
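The dive-score arithmetic can be checked with Python's standard `statistics` module. This is only a sketch: `pvariance` and `pstdev` divide by N (the population form used in the example), while `variance` divides by n − 1 (the sample form).

```python
import statistics

scores = [28, 22, 21, 26, 18]  # Mark's five dive scores from the table

mean = statistics.mean(scores)
pop_var = statistics.pvariance(scores)     # sum of squared deviations / N
pop_sd = statistics.pstdev(scores)         # square root of the population variance
sample_var = statistics.variance(scores)   # sum of squared deviations / (n - 1)

print(mean, pop_var, round(pop_sd, 2), sample_var)
```

The two variance functions differ only in the divisor, which is why the sample variance comes out larger for the same five scores.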
Example- Variance & Standard Deviation
• And the good thing about the Standard Deviation is that it is
useful. Now we can show which heights are within one
Standard Deviation (147mm) of the Mean:
Interpret results.
Hypothesis Testing Procedures
Example: Probability Distributions, Independence
[Figure: Spelling Test Scores (horizontal axis from 12 to 25)]
The size of the standard deviation also influences the outcome of a t test.
Given the same difference in means, groups with smaller standard deviations are more likely to report a significant difference than groups with larger standard deviations.
[Figures: two plots of Spelling Test Scores (horizontal axis from 12 to 25)]
From a practical standpoint, we can see that smaller standard deviations produce less overlap between the groups than larger standard deviations. Less overlap would indicate that the groups are more different from each other.
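A small numeric sketch of this effect: holding the difference in means fixed, the t statistic shrinks as the standard deviations grow. The means, standard deviations and group sizes are invented for illustration, and the pooled-variance form of the t statistic is assumed.

```python
import math

def pooled_t(mean1, mean2, sd1, sd2, n1, n2):
    """Pooled-variance t statistic computed from summary statistics."""
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Same difference in means (20 vs 17), different spread (hypothetical values).
t_small_sd = pooled_t(20, 17, 1.0, 1.0, 10, 10)
t_large_sd = pooled_t(20, 17, 4.0, 4.0, 10, 10)

print(round(t_small_sd, 2), round(t_large_sd, 2))
```

Quadrupling both standard deviations divides the t statistic by four, so the same mean difference is much harder to call significant.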
[Figures: two plots of Spelling Test Scores (horizontal axis from 12 to 25)]
Difference of Means
Two populations – same or different?
How do we determine which t test to use…
• Are the scores for the two means from the same subject (or related subjects)?
  • Yes: Paired t test (Dependent t-test; Correlated t-test)
  • No: Are there the same number of people in the two groups?
    • Yes: Equal Variance Independent t test (Pooled Variance Independent t-test)
    • No: Are the variances of the two groups same?
      • Yes (Significance Level for Levene (or F-Max) is p >= .05): Equal Variance Independent t test (Pooled Variance Independent t test)
      • No (Significance Level for Levene (or F-Max) is p < .05): Unequal Variance Independent t-test (Separate Variance Independent t test)
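The decision tree above can be written as a small function. This is a sketch only: the function name `choose_t_test` and its parameters are hypothetical, and the Levene (or F-Max) p-value is assumed to have been computed elsewhere.

```python
def choose_t_test(related, same_group_size, levene_p=None, alpha=0.05):
    """Return the name of the t test suggested by the decision tree."""
    if related:
        # Scores come from the same (or related) subjects.
        return "Paired t test"
    if same_group_size:
        return "Equal Variance Independent t test"
    if levene_p is None:
        raise ValueError("a Levene (or F-Max) p-value is needed for unequal group sizes")
    if levene_p >= alpha:
        # Variances are treated as the same.
        return "Equal Variance Independent t test"
    return "Unequal Variance Independent t test"

print(choose_t_test(related=False, same_group_size=False, levene_p=0.01))
```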
Difference of Means
Two Parametric Methods
● Student’s t-test: assumes two normally distributed populations, and that they have equal variance
● Welch’s t-test: assumes two normally distributed populations, and they don’t necessarily have equal variance
Student’s t-test
Student’s t-test assumes that distributions of the two populations have equal but unknown variances.
• test statistic: $T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
• degree of freedom df = n1 + n2 − 2
• $S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$ is pooled variance
• α is the significance level
t-test independent samples
1. Calculate the mean and standard deviation for the data sets
                      Dog A    Dog B
                      46       31
                      57       35
                      54       50
                      51       35
                      38       36
Total                 246      187
Mean                  49.2     37.4
Standard deviation    7.463    7.301
t-test independent samples
1. Calculate the mean and standard deviation for the data sets
2. Calculate the magnitude of the difference between the two means
3. Calculate the standard error in the difference
4. Calculate the value of T
5. Calculate the degrees of freedom: 5 + 5 − 2 = 8
6. Find the critical value for the particular significance you are working to from the table
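The six steps can be traced on the Dog A and Dog B data. The sketch below assumes the pooled-variance (Student's) form of the standard error, which the slides do not show explicitly; with equal group sizes the unpooled form gives the same value. The critical value 2.306 is the two-tailed t table entry for α = 0.05 and 8 degrees of freedom.

```python
import math
import statistics

dog_a = [46, 57, 54, 51, 38]
dog_b = [31, 35, 50, 35, 36]

# Step 1: mean and (sample) standard deviation of each data set.
n1, n2 = len(dog_a), len(dog_b)
mean_a, mean_b = statistics.mean(dog_a), statistics.mean(dog_b)
sd_a, sd_b = statistics.stdev(dog_a), statistics.stdev(dog_b)

# Step 2: magnitude of the difference between the two means.
diff = abs(mean_a - mean_b)

# Step 5 (needed early for pooling): degrees of freedom.
df = n1 + n2 - 2

# Step 3: standard error of the difference, using the pooled variance.
pooled_var = ((n1 - 1) * sd_a ** 2 + (n2 - 1) * sd_b ** 2) / df
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Step 4: the t value.
t_value = diff / std_err

# Step 6: compare with the critical value from a t table (alpha = 0.05, two-tailed).
t_crit = 2.306
print(f"t = {t_value:.3f}, df = {df}, significant: {t_value > t_crit}")
```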
Advantages of Nonparametric Tests
1. Used With All Scales
2. Easier to Compute
3. Make Fewer Assumptions
1.Sign Test
3.Assumptions
EPI 809 / Spring 2008
Wilcoxon Rank Sum
Table 12 (Rosner) (Portion)
α = .05 two-tailed
[Table: critical values indexed by n1 = 4, 5, 6, ..; the values used in this example are 12 and 28]
Wilcoxon Rank Sum Test
Computation Table
Factory 1 Factory 2
Rate Rank Rate Rank
Wilcoxon Rank Sum Test
Solution
• H0: Identical Distrib.
• Ha: Shifted Left or Right
• α = .05
• n1 = 4, n2 = 5
• Critical Value(s): 12 and 28
• Test Statistic: T2 = 5 + 3.5 + 8 + 9 = 25.5 (Smallest Sample)
• Decision: Do not reject at α = .05 (25.5 lies between 12 and 28)
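The rank-sum computation can be sketched as follows. The original computation table did not survive, so the two rate lists are illustrative values chosen only because their ranks reproduce the sum 5 + 3.5 + 8 + 9; the critical values 12 and 28 are taken from the slide residue, and the convention that values at or beyond them reject H0 is an assumption.

```python
def average_ranks(values):
    """Map each distinct value to its rank, averaging the ranks of tied values."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 0-based positions i..j-1 hold ranks i+1..j; their average is (i + 1 + j) / 2.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks

# Illustrative rates (assumed, not recovered from the slides).
factory_1 = [71, 82, 77, 92, 88]
factory_2 = [85, 82, 94, 97]   # the smaller sample

ranks = average_ranks(factory_1 + factory_2)
t2 = sum(ranks[v] for v in factory_2)   # rank sum of the smaller sample

lower, upper = 12, 28                   # critical values, alpha = .05 two-tailed
reject_h0 = t2 <= lower or t2 >= upper  # assumed rejection convention

print(t2, reject_h0)
```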
Type I and Type II errors
Type I error refers to the situation when we reject the null hypothesis when it is true (H0 is wrongly rejected). Denoted by α.
Type II error refers to the situation when we fail to reject (accept) the null hypothesis when it is false (H0 is wrongly accepted). Denoted by β.
Power and Sample Size
• The power of a test is the probability of correctly rejecting the null hypothesis
• It is denoted by 1 − β, where β is the probability of a type II error.
• The power of a test improves as the sample size increases
• power is used to determine the necessary sample size.
• power of a hypothesis test depends on the true difference of
the population means.
• A larger sample size is required to detect a smaller difference
in the means.
• In general, Effect size d= difference between the means
• It is important to consider an appropriate effect size for the
problem at hand
Power and Sample Size
A larger sample size better identifies a fixed effect size
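To make the link between sample size and power concrete, the sketch below computes the power of a two-sided, two-sample z test with a known common standard deviation and n observations per group. The z-test setting, the function name `power_two_sample_z` and the numeric values are assumptions chosen for illustration; the slides do not specify a formula.

```python
from statistics import NormalDist

def power_two_sample_z(delta, sigma, n, alpha=0.05):
    """Power of a two-sided two-sample z test with n observations per group.

    delta is the true difference between the population means,
    sigma the common (known) standard deviation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # True difference expressed in standard errors of the difference in means.
    shift = delta / (sigma * (2 / n) ** 0.5)
    # Probability that the test statistic falls in either rejection tail.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (10, 50, 100):
    print(n, round(power_two_sample_z(delta=1.0, sigma=2.0, n=n), 3))
```

For a fixed effect size the printed power rises with n; when the true difference is zero the function returns the significance level itself.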
Statistical Methods for
Evaluation-
Hypothesis testing,
difference of means,
ANOVA
ANOVA (Analysis of Variance)
ANOVA tests if any of the population means differ from the other population
means
Find the mean for each of the groups.
Find the Within Group Variation; the total deviation of each member’s score
from the Group Mean.
Find the Between Group Variation: the deviation of each Group Mean from
the Overall Mean.
Find the F statistic: the ratio of Between Group Variation to Within Group Variation, and compare it with the F critical value.
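The steps above can be sketched as a one-way ANOVA computation. The three groups are invented for illustration, and the variations are divided by their degrees of freedom (k − 1 between groups, N − k within groups) before taking the ratio, which is the usual form of the F statistic.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a list of groups of numeric scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between Group Variation: deviation of each group mean from the overall mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within Group Variation: deviation of each score from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical scores
f_stat = one_way_anova_f(groups)
print(round(f_stat, 2))
```

The resulting value would then be compared with the F critical value for (k − 1, N − k) degrees of freedom at the chosen significance level.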
Overview of methods,
diagnostics,
● Supervised methods use labeled objects
● Unsupervised methods use unlabeled objects
Clustering looks for hidden structure in the data, similarities based on attributes
● Often used for exploratory analysis
● No predictions are made
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
• create thematic maps in GIS by clustering feature spaces
• detect spatial clusters and explain them in spatial data mining
• WWW
• Document classification
• Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
develop targeted marketing programs
Land use: Identification of areas of similar land use in an
[Figure: Centroid and Medoid]
Advanced Analytical Theory
and Methods
Clustering- Overview,
Overview of methods,
diagnostics,
The algorithm is iterative with the centers adjusted to the mean of each
cluster’s n-dimensional vector of attributes
Use Cases
• Clustering is often used as a lead-in to
classification, where labels are applied to the
identified clusters
• Some applications
• Image processing
• With security images, successive frames are examined for change
• Medical
• Patients can be grouped to identify naturally occurring clusters
• Customer segmentation
• Marketing and sales groups identify customers having similar
behaviors and spending patterns
Randomly assign means: m1=3, m2=4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
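The one-dimensional trace above can be reproduced with a short loop. This is a minimal sketch: the function name `kmeans_1d` is hypothetical, and the input order of the points is an assumption (the clusters do not depend on it here).

```python
def kmeans_1d(points, means, max_iter=100):
    """k-means on plain numbers; returns (clusters, means) at convergence."""
    clusters = []
    for _ in range(max_iter):
        # Assignment: each point goes to the nearest current mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda j: abs(p - means[j]))
            clusters[nearest].append(p)
        # Update: each mean becomes the average of its cluster (kept if empty).
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:
            break  # no change, so the algorithm has converged
        means = new_means
    return clusters, means

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(points, [3, 4])
print(sorted(clusters[0]), sorted(clusters[1]), means)
```

Starting from m1 = 3 and m2 = 4, the loop passes through the same intermediate means as the trace and stops at m1 = 7, m2 = 25.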
K-means Method
Four Steps
1. Choose the value of k and the initial guesses for the centroids
2. Compute the distance from each data point to each centroid, and assign each point to the closest centroid
3. Compute the centroid of each newly defined cluster from step 2
4. Repeat steps 2 and 3 until the algorithm converges (no changes occur)
K-means Method- for two dimension
Example – Step 1
• Choose the value of k and the k initial guesses for the centroids.
• In this example, k = 3, and the initial centroids are indicated by the points
shaded in red, green, and blue
K-means Method- for two dimension
Example – Step 2
• Points are assigned to the closest centroid.
In two dimensions, the distance, d, between any two points, (x1,y1) and (x2,y2), is expressed by the Euclidean distance measure:
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
K-means Method- for two dimension
Example – Step 3
• Compute centroids of the new clusters. In two dimensions, the centroid (Xc,Yc) of the m points is calculated as follows
$(X_c, Y_c) = \left( \frac{\sum_{i=1}^{m} x_i}{m}, \frac{\sum_{i=1}^{m} y_i}{m} \right)$
K-means Method- for two dimension
Example – Step 4
• Repeat steps 2 and 3 until convergence
• Convergence occurs when the centroids do not change or when
the centroids oscillate back and forth
• This can occur when one or more points have equal distances from
the centroid centers
• Videos
• http://www.youtube.com/watch?v=aiJ8II94qck
• https://class.coursera.org/ml-003/lecture/78
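The four steps in two dimensions can be sketched as below. The points and initial centroids are invented for illustration, and k = 2 is used instead of the k = 3 of the slide example to keep the output short; `max_iter` is an added safeguard for the oscillating case mentioned above.

```python
import math

def kmeans_2d(points, centroids, max_iter=100):
    """k-means for (x, y) tuples; returns (clusters, centroids)."""
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid (Euclidean distance).
        clusters = [[] for _ in centroids]
        for (x, y) in points:
            d = [math.sqrt((x - cx) ** 2 + (y - cy) ** 2) for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        # Step 3: recompute each centroid as the mean of its m points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                m = len(cluster)
                new_centroids.append((sum(p[0] for p in cluster) / m,
                                      sum(p[1] for p in cluster) / m))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

# Step 1: choose k and the initial guesses (hypothetical data, k = 2).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = kmeans_2d(points, [(1, 1), (9, 8)])
print(centroids)
```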
K-means - for n dimension
• To generalize the prior algorithm to n dimensions, suppose there are M objects, where each object is described by n attributes or property values (P1,P2,….,Pn). Then object i is described by (Pi1,Pi2,….,Pin) for i = 1,2,..., M.
• For a given point, Pi, at (Pi1,Pi2,….,Pin) and a centroid, q, located at (q1,q2,….,qn), the distance, d, between Pi and q is expressed as
$d(P_i, q) = \sqrt{\sum_{j=1}^{n} (P_{ij} - q_j)^2}$
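In n dimensions the same two building blocks apply unchanged. The sketch below uses `math.dist` (available from Python 3.8) for the Euclidean distance; the helper names `nearest_centroid` and `centroid` are hypothetical.

```python
import math

def nearest_centroid(point, centroids):
    """Index of the centroid closest to an n-dimensional point."""
    distances = [math.dist(point, q) for q in centroids]
    return distances.index(min(distances))

def centroid(points):
    """Attribute-wise mean of a list of n-dimensional points."""
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))

# Hypothetical three-attribute objects.
print(math.dist((0, 0, 0), (1, 2, 2)))
print(nearest_centroid((1, 1, 1), [(0, 0, 0), (5, 5, 5)]))
print(centroid([(0, 0, 0), (2, 4, 6)]))
```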
Reasons to Choose and Cautions
Units of Measure
[Figure: clusters for k=2; Age dominates]
Reasons to Choose and Cautions
Rescaling
[Figure: clusters for k=2 on rescaled attributes]
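One common way to keep a large-valued attribute such as age from dominating the Euclidean distance is to divide each attribute by its standard deviation before clustering. The sketch below assumes that approach; the records (age in years, height in metres) are invented for illustration and the function name `rescale` is hypothetical.

```python
import math
import statistics

# Hypothetical (age in years, height in metres) records.
people = [(25, 1.60), (30, 1.90), (45, 1.65), (50, 1.85)]

def rescale(rows):
    """Divide every attribute by its (population) standard deviation."""
    columns = list(zip(*rows))
    sds = [statistics.pstdev(col) for col in columns]
    return [tuple(v / sd for v, sd in zip(row, sds)) for row in rows]

scaled = rescale(people)

# Distance between the first two people before and after rescaling.
raw_d = math.dist(people[0], people[1])      # driven almost entirely by age
scaled_d = math.dist(scaled[0], scaled[1])   # both attributes now contribute
print(round(raw_d, 3), round(scaled_d, 3))
```

After rescaling, every attribute has unit standard deviation, so no single unit of measure decides the cluster assignment on its own.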
Reasons to Choose and Cautions
Additional Considerations
K-modes clustering
• kmod()