Statistics Using Excel
Tsagris Michael
BSc in Statistics
Email: mtsagris@yahoo.gr
Contents
1.1 Introduction
2.1 Data Analysis
2.2 Descriptive Statistics
2.3 Z-test for two samples
2.4 t-test for two samples assuming unequal variances
2.5 t-test for two samples assuming equal variances
2.6 F-test for the equality of variances
2.7 Paired t-test for two samples
2.8 Ranks, Percentiles, Sampling, Random Numbers Generation
2.9 Covariance, Correlation, Linear Regression
2.10 One-way Analysis of Variance
2.11 Two-way Analysis of Variance with replication
2.12 Two-way Analysis of Variance without replication
3.1 Statistical Functions
3.2 Spearman's (non-parametric) correlation coefficient
3.3 Wilcoxon Signed Rank Test for a Median
3.4 Wilcoxon Signed Rank Test with Paired Data
1.1 Introduction
One of the reasons these notes were written was to help students, and not only students, perform some statistical analyses without having to use statistical software such as S-Plus, SPSS, Minitab, etc. It is reasonable not to expect Excel to offer as many analysis options as the statistical packages do, but its capabilities are at a good level nonetheless. The areas covered by these notes are: descriptive statistics, the z-test for two samples, the t-test for two samples assuming (un)equal variances, the paired t-test for two samples, the F-test for the equality of variances of two samples, ranks and percentiles, sampling (random and periodic, or systematic), random number generation, Pearson's correlation coefficient, covariance, linear regression, one-way ANOVA, two-way ANOVA with and without replication, and the moving average. We will also demonstrate the use of non-parametric statistics in Excel for some of the previously mentioned techniques. Furthermore, informal comparisons between the results provided by Excel and those provided by SPSS and some other packages will be carried out, to check for any discrepancies. One thing worth mentioning before going through these notes is that they do not contain the theory underlying the techniques used. These notes show how to cope with statistics using Excel.
Picture 1
Picture 2
Picture 3
Picture 4
Column1
Mean                       194.0418719
Standard Error             5.221297644
Median                     148.5
Mode                       97
Standard Deviation         105.2062324
Sample Variance            11068.35133
Kurtosis                   -0.79094723
Skewness                   0.692125308
Range                      451
Minimum                    4
Maximum                    455
Sum                        78781
Count                      406
Confidence Level(95.0%)    10.26422853
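As a cross-check of what the Descriptive Statistics tool reports, here is a minimal Python sketch reproducing a few of the table's quantities on a small hypothetical sample (the 406 original values are not reproduced here; the data below are illustrative only):

```python
import math
import statistics

# Hypothetical sample; the table above was computed on 406 values.
data = [4.0, 97.0, 97.0, 148.0, 149.0, 230.0, 310.0, 455.0]
n = len(data)

mean = statistics.mean(data)
sd = statistics.stdev(data)              # sample standard deviation (n - 1), like Excel's STDEV
std_error = sd / math.sqrt(n)            # Excel's "Standard Error" of the mean
median = statistics.median(data)
sample_var = statistics.variance(data)   # like Excel's VAR
rng = max(data) - min(data)              # Range = Maximum - Minimum

# Excel's SKEW formula: n / ((n-1)(n-2)) * sum(((x - mean) / sd)^3)
skew = n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in data)

print(mean, std_error, median, sd, rng, skew)
```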
Picture 5
                                Variable 1     Variable 2
Mean                            3.76501977     5.810701181
Known Variance                  1              9
Observations                    100            80
Hypothesized Mean Difference    0
z                               -5.84480403
P(Z<=z) one-tail                2.53582E-09
z Critical one-tail             1.644853627
P(Z<=z) two-tail                5.07165E-09
z Critical two-tail             1.959963985

Table 2: Z-test
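The z value and the two-tailed p-value in Table 2 can be reproduced directly from the reported means, known variances and sample sizes; a Python sketch:

```python
import math

# Two-sample z-test, using the summary values reported in Table 2
mean1, var1, n1 = 3.76501977, 1.0, 100
mean2, var2, n2 = 5.810701181, 9.0, 80
hypothesized_diff = 0.0

# z = ((xbar1 - xbar2) - d0) / sqrt(var1/n1 + var2/n2)
z = (mean1 - mean2 - hypothesized_diff) / math.sqrt(var1 / n1 + var2 / n2)

def phi(x):
    """Standard normal CDF via the stdlib error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_two_tail = 2.0 * (1.0 - phi(abs(z)))
print(z, p_two_tail)
```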
Picture 6
Picture 7
The results are the same as those provided by SPSS. It is worth noting that the degrees of freedom (df) in this case are equal to 178, whereas in the previous case they were equal to 96. The t-statistic is also slightly different. The reason is that a different formula is used in each case.
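To make the difference concrete, here is a sketch of both formulae on hypothetical summary data (the variances below are illustrative; only the sample sizes 100 and 80 are taken from the example): the equal-variance test uses df = n1 + n2 - 2 = 178, while the Welch formula gives a smaller, generally non-integer df.

```python
import math

# Hypothetical summary data; sample sizes match the example above
mean1, var1, n1 = 3.765, 3.76, 100
mean2, var2, n2 = 5.811, 8.07, 80

# Equal-variance (pooled) t-test: df = n1 + n2 - 2
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_pooled = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Unequal-variance (Welch) t-test with the Welch-Satterthwaite df
se2 = var1 / n1 + var2 / n2
t_welch = (mean1 - mean2) / math.sqrt(se2)
df_welch = se2 ** 2 / ((var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1))

print(t_pooled, df_pooled, t_welch, df_welch)
```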
Picture 8
Picture 9
                Variable 2
Mean            5.81070118
Variance        8.07373334
Observations    80
Picture 8
Picture 9
If you are interested in a random sample from a known distribution, then Random Number Generation is the option you want to use. Unfortunately, not many distributions are offered. The window of this option is shown in picture 10. In the number of variables box you can select how many samples you want drawn from the specific distribution. The white box below it is used to define the sample size. The distributions offered are Uniform, Normal, Bernoulli, Binomial and Poisson. Two more options are also available. Different distributions require different parameters to be defined. The random seed is an option that gives the sampling algorithm a starting value, but it can also be left blank.
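Outside Excel, the same kind of reproducible sampling can be sketched with Python's standard library (the distributions, parameters and seed below are illustrative):

```python
import random

# The fixed seed plays the role of Excel's "random seed" box:
# it makes the draws reproducible. 42 is an arbitrary choice.
rng = random.Random(42)

normal_sample  = [rng.gauss(0.0, 1.0) for _ in range(100)]            # Normal(0, 1)
uniform_sample = [rng.uniform(0.0, 1.0) for _ in range(100)]          # Uniform on [0, 1]
bernoulli      = [1 if rng.random() < 0.3 else 0 for _ in range(100)] # Bernoulli(p = 0.3)

print(len(normal_sample), len(uniform_sample), sum(bernoulli))
```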
Picture 10
Picture 11
            Column 1    Column 2
Column 1    1.113367
Column 2    0.531949    7.972812

Table 7: Covariance
The above table is called the variance-covariance table, since it contains both of these measures. The first cell (1.113367) is the variance of the first column and the last cell is the variance of the second column. The remaining cell (0.531949) is the covariance of the two columns. The upper-right cell is left blank because the matrix is symmetric: its value would be the same covariance (the diagonal elements are the variances and the off-diagonal elements are the covariances). The window of the linear regression option is presented in picture 12 (different normal data are used in the regression analysis). We fill the white boxes with the columns that represent the Y and X values. The X values can contain more than one column (i.e. variable). We select the confidence interval option. We also select the Line Fit Plots and Normal Probability Plots. Then, by pressing OK, the result appears in table 8.
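Before turning to regression, the variance-covariance table can be reproduced by hand; a Python sketch on two short hypothetical columns (note that Excel's Covariance tool divides by n, the population form):

```python
# Two short hypothetical columns
col1 = [1.0, 2.5, 3.0, 4.5, 2.0]
col2 = [2.0, 4.0, 5.5, 8.0, 3.5]

n = len(col1)
m1 = sum(col1) / n
m2 = sum(col2) / n

# Population forms (divide by n), matching Excel's Covariance tool
var1 = sum((x - m1) ** 2 for x in col1) / n
var2 = sum((y - m2) ** 2 for y in col2) / n
cov = sum((x - m1) * (y - m2) for x, y in zip(col1, col2)) / n

# Variance-covariance matrix: variances on the diagonal,
# the (symmetric) covariance off the diagonal.
matrix = [[var1, cov], [cov, var2]]
print(matrix)
```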
Picture 12
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.875372
R Square             0.766276
Adjusted R Square    0.76328
Standard Error       23.06123
Observations         80

ANOVA
              df    SS          MS          F           Significance F
Regression    1     136001      136001      255.7274    2.46E-26
Residual      78    41481.97    531.8202
Total         79    177483

                Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept       -10.6715        8.963642          -1.19053    0.237449    -28.5167     7.173767
X Variable 1    0.043651        0.00273           15.99148    2.46E-26    0.038217     0.049085

Table 8: Linear regression
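The main quantities of table 8 (slope, intercept, the ANOVA decomposition and R Square) follow from the standard simple-regression formulae; a sketch on hypothetical data:

```python
import math

# Hypothetical data for a simple regression of y on x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)

slope = sxy / sxx                       # "X Variable 1" coefficient
intercept = my - slope * mx             # "Intercept"

ss_regression = slope * sxy             # "Regression" row of the SS column
ss_total = syy                          # "Total" row
ss_residual = ss_total - ss_regression  # "Residual" row
r_square = ss_regression / ss_total     # "R Square"

ms_residual = ss_residual / (n - 2)
f_stat = ss_regression / ms_residual    # F with (1, n - 2) degrees of freedom
print(slope, intercept, r_square, f_stat)
```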
Tsagris Michael
[Line Fit Plot: Y and Predicted Y plotted against X Variable 1]

[Normal Probability Plot: Y plotted against Sample Percentile]
The normal probability plot provides a graphical check of the normality hypothesis of the residuals. Excel also produced the residuals and the predicted values in the same sheet. We shall construct a scatter plot of these two quantities, in order to check (graphically) the assumption of homoscedasticity (i.e. constant variance of the residuals). If the assumption of homoscedasticity of the residuals holds true, then we should see all the values fall within a bandwidth. We see that almost all values fall between -40 and 40, except for two values that are over 70 and 100. These values are the so-called outliers. We can assume that the residuals exhibit constant variance. If we are not certain about the validity of the assumption, we can transform the Y values using a log transformation and run the regression using the transformed Y values.
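A sketch of this check on hypothetical fitted values: compute the residuals, flag those outside a band of plus or minus 40, and apply the suggested log transformation:

```python
import math

# Hypothetical predicted and observed values
predicted = [50.0, 80.0, 120.0, 160.0, 200.0]
observed  = [55.0, 70.0, 190.0, 150.0, 210.0]

residuals = [obs - pred for obs, pred in zip(observed, predicted)]
outliers = [r for r in residuals if abs(r) > 40]   # values outside the band

# Log transformation of Y, often used to stabilise the variance
log_y = [math.log(obs) for obs in observed]
print(residuals, outliers)
```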
[Scatter plot of the Residuals against the Predicted Values]
Picture 13
Anova: Single Factor

SUMMARY
Groups      Count    Sum      Average     Variance
Column 1    253      62672    247.7154    9756.887
Column 2    73       7991     109.4658    500.5023
Column 3    79       8114     102.7089    535.4654

ANOVA
Source of Variation    SS           df     MS          F           P-value    F crit
Between Groups         1909939.2    2      954969.6    151.3471    1E-49      3.018168
Within Groups          2536538      402    6309.796
Total                  4446477.2    404
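The Between/Within decomposition in the table can be computed by hand; a Python sketch on three short hypothetical groups:

```python
# One-way ANOVA on hypothetical groups
groups = [
    [240.0, 250.0, 255.0, 246.0],
    [105.0, 110.0, 112.0],
    [100.0, 103.0, 105.0],
]

all_values = [v for g in groups for v in g]
n_total = len(all_values)
grand_mean = sum(all_values) / n_total

# Between-groups SS: group sizes times squared deviation of each
# group mean from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-groups SS: deviations from each group's own mean.
ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(ss_between, ss_within, f_stat)
```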
In other words, the first combination of the two factors consists of the cells from B2 to B26. This means that each combination of factors has 24 measurements.
Picture 14
From the window of picture 3, we select Anova: Two-Factor With Replication, and the window that appears is shown in picture 15.
Picture 15
We fill the two white boxes with the input range and the rows per sample. The alpha is at its usual value, 0.05. By pressing OK, the results presented overleaf are generated. The results generated by SPSS are the same. At the bottom of table 10 there are three p-values: two for the two factors and one for the interaction. The row factor is denoted as Sample in Excel.
            C1          C2          C3          Total
S1
Count       24          24          24          72
Sum         8229        2537        2378        13144
Average     342.875     105.7083    99.08333    182.5556
Variance    6668.288    237.5199    508.4275    15441.38

S2
Count       24          24          24          72
Sum         6003        2531        2461        10995
Average     250.125     105.4583    102.5417    152.7083
Variance    10582.46    416.433     515.7373    8543.364

S3
Count       24          24          24          72
Sum         7629        2826        2523        12978
Average     317.875     117.75      105.125     180.25
Variance    7763.679    802.9783    664.8967    12621.15

Total
Count       72          72          72
Sum         21861       7894        7362
Average     303.625     109.6389    102.25
Variance    9660.181    505.3326    553.3732

ANOVA
Source of Variation    SS          df     MS          F           P-value     F crit
Sample (=Rows)         39713.18    2      19856.59    6.346116    0.002114    3.039508
Columns                1877690     2      938845.2    300.0526    6.85E-62    3.039508
Interaction            73638.1     4      18409.53    5.883638    0.000167    2.415267
Within                 647689.7    207    3128.936
Total                  2638731     215

Table 10: Two-factor ANOVA with replication
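The four sums of squares in the table (Sample, Columns, Interaction, Within) can be sketched by hand on a tiny hypothetical balanced layout (2 x 2 cells with 3 replicates, instead of the 3 x 3 cells with 24 replicates above):

```python
# Hypothetical balanced two-factor layout with replication
data = {
    ("S1", "C1"): [340.0, 345.0, 344.0],
    ("S1", "C2"): [104.0, 106.0, 107.0],
    ("S2", "C1"): [250.0, 252.0, 249.0],
    ("S2", "C2"): [105.0, 104.0, 107.0],
}
rows = ["S1", "S2"]
cols = ["C1", "C2"]
r = 3  # replicates per cell

all_values = [v for cell in data.values() for v in cell]
n = len(all_values)
grand = sum(all_values) / n

def mean(vals):
    return sum(vals) / len(vals)

row_means = {s: mean([v for (s2, c) in data for v in data[(s2, c)] if s2 == s]) for s in rows}
col_means = {c: mean([v for (s, c2) in data for v in data[(s, c2)] if c2 == c]) for c in cols}
cell_means = {key: mean(vals) for key, vals in data.items()}

# Sums of squares for the Sample (rows), Columns, Interaction and Within lines
ss_rows = len(cols) * r * sum((row_means[s] - grand) ** 2 for s in rows)
ss_cols = len(rows) * r * sum((col_means[c] - grand) ** 2 for c in cols)
ss_inter = r * sum(
    (cell_means[(s, c)] - row_means[s] - col_means[c] + grand) ** 2
    for s in rows for c in cols
)
ss_within = sum((v - cell_means[key]) ** 2 for key, vals in data.items() for v in vals)
ss_total = sum((v - grand) ** 2 for v in all_values)
print(ss_rows, ss_cols, ss_inter, ss_within)
```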
Picture 16
Anova: Two-Factor Without Replication

SUMMARY
            Count    Sum    Average     Variance
Row 1       3        553    184.3333    11385.33
Row 2       3        544    181.3333    21336.33
Row 3       3        525    175         15379
Column 1    3        975    325         499
Column 2    3        340    113.3333    332.3333
Column 3    3        307    102.3333    85.33333

ANOVA
Source of Variation    SS          df    MS          F           P-value     F crit
Rows                   136.2222    2     68.11111    0.160534    0.856915    6.944272
Columns                94504.22    2     47252.11    111.3707    0.000311    6.944272
Error                  1697.111    4     424.2778
Total                  96337.56    8
AVEDEV calculates the average of the absolute deviations of the data from
their mean.
CHITEST returns the result of the test for independence: the probability from the chi-squared distribution for the statistic and the appropriate degrees of freedom.
DEVSQ calculates the sum of squares of deviations of data points from their sample mean. The variance then follows straightforwardly, by dividing by the sample size, or by the sample size minus one for the unbiased estimator; the standard deviation is its square root.
FREQUENCY calculates how often values occur within a range of values and then returns a vertical array of numbers having one more element than Bins_array.
FTEST returns the result of the two-tailed test that the variances of two data sets are not significantly different.
INTERCEPT calculates the point at which a line will intersect the y-axis.
LINEST generates a line that best fits a data set by generating a two-dimensional array of values to describe the line.
LOGEST generates a curve that best fits a data set by generating a two-dimensional array of values to describe the curve.
MAX returns the largest value in a data set (ignores logical values and text).
MAXA returns the largest value in a data set (does not ignore logical values and text).
MIN returns the smallest value in a data set (ignores logical values and text).
MINA returns the smallest value in a data set (does not ignore logical values and text).
PEARSON returns a value that reflects the strength of the linear relationship
between two data sets.
PROB calculates the probability that values in a range are between two limits
or equal to a lower limit.
RANK calculates the rank of a number in a list of numbers: its size relative to
other values in the list.
RSQ calculates the square of the Pearson correlation coefficient (also known as the coefficient of determination in the case of linear regression).
STDEVA estimates the standard deviation of a data set (which can include
text and true/false values) based on a sample of the data.
STDEVPA calculates the standard deviation of a data set (which can include
text and true/false values).
STEYX returns the standard error of the predicted y-value for each x in the regression.
VARA estimates the variance of a data set (which can include text and true/
false values) based on a sample of the data.
VARPA calculates the variance of a data population, which can include text
and true/false values.
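A few of the functions above can be mirrored with short standard-library sketches (the data and bin boundaries below are hypothetical):

```python
# Stdlib sketches of AVEDEV, DEVSQ and FREQUENCY on hypothetical data
data = [4.0, 97.0, 148.0, 149.0, 230.0, 310.0, 455.0]
n = len(data)
mean = sum(data) / n

# DEVSQ: sum of squared deviations from the sample mean
devsq = sum((x - mean) ** 2 for x in data)
var_biased = devsq / n          # divide by n
var_unbiased = devsq / (n - 1)  # divide by n - 1 (unbiased estimator)

# FREQUENCY: counts per bin; like Excel, one extra element is returned
# for the values above the last bin boundary.
bins = [100.0, 200.0, 300.0]
freq = [0] * (len(bins) + 1)
for x in data:
    for i, upper in enumerate(bins):
        if x <= upper:
            freq[i] += 1
            break
    else:
        freq[-1] += 1

# AVEDEV: average absolute deviation from the mean
avedev = sum(abs(x - mean) for x in data) / n
print(var_unbiased, freq, avedev)
```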
Column1    Rank    Percent
183        1       100.00%
168        2       98.50%
163        3       95.70%
163        3       95.70%
146        5       94.30%
145        6       92.90%
141        7       90.10%
141        7       90.10%
133        9       88.70%
131        10      87.30%
130        11      85.90%
122        12      84.50%
121        13      69.00%
121        13      69.00%
121        13      69.00%
121        13      69.00%
121        13      69.00%
121        13      69.00%
121        13      69.00%
121        13      69.00%
Column 1 contains the values, Rank contains the ranks of the values, Percent contains the cumulative percentage of the values (the size of each value relative to the others) and the first column (Points) indicates the row of each value. In the above table, Excel has sorted the values according to their ranks. The first column indicates the exact position of each value. We have to sort the data with respect to this first column, so that the format is restored to its original form. We will repeat these actions for the second set of data and then calculate the correlation coefficient of the ranks of the values. Attention must be paid to the sequence of the actions described: the ranks of the values must be calculated separately for each data set, and the sorting needs to be done before calculating the correlation coefficient. For the data used in this example, Spearman's correlation coefficient was calculated to be 0.020483, whereas the correlation calculated by SPSS is 0.009. The reason for this difference is that SPSS has a way of dealing with values that share the same rank: it assigns to all of them the average of their ranks. That is, if three values are equal (so their ranks are the same), SPSS assigns to each of these three values the average of their ranks (Excel does not do this).
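A sketch of the SPSS-style computation, with tied values receiving the average of their ranks before Pearson's formula is applied to the ranks (the four data points are hypothetical):

```python
import math

# Small hypothetical data with one tie in x
x = [1.0, 2.0, 2.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]

def average_ranks(values):
    """Rank ascending; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(
        sum((ai - ma) ** 2 for ai in a) * sum((bi - mb) ** 2 for bi in b)
    )

# Spearman's coefficient = Pearson's coefficient of the (average) ranks
rho = pearson(average_ranks(x), average_ranks(y))
print(rho)
```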
The test statistic differs due to the different handling of the tied ranks and the use of a different test statistic. There is also another way to calculate a test statistic: by taking the sum of the positive ranks. Both Minitab and SPSS calculate another type of test statistic, which is based on either the positive or the negative ranks. It is worth mentioning that this second formula is best used when there are no tied ranks. Irrespective of the test statistic used, the result regarding the rejection of the null hypothesis will be the same. Using the second formula the result is 1401, whereas Minitab provides a result of 1231.5. For the outcome of the test (reject the null hypothesis or not), one must look at the tables for the one-sample Wilcoxon signed rank test. The fact that Excel does not offer options for calculating the probabilities used in the non-parametric tests, in conjunction with the tedious work required, makes it less popular for such use.
Values (Xi)    m-Xi    |m-Xi|    Rank of |m-Xi|    Sign    Ranks Ri    Squared Ranks Ri^2
307            13      13        64                1       64          4096
350            -30     30        47                -1      -47         2209
318            2       2         67                1       67          4489
304            16      16        60                1       60          3600
302            18      18        56                1       56          3136
429            -109    109       19                -1      -19         361
454            -134    134       13                -1      -13         169
440            -120    120       17                -1      -17         289
455            -135    135       11                -1      -11         121
390            -70     70        32                -1      -32         1024
350            -30     30        47                -1      -47         2209
351            -31     31        43                -1      -43         1849
383            -63     63        37                -1      -37         1369
360            -40     40        41                -1      -41         1681
383            -63     63        37                -1      -37         1369
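As an illustration, the fifteen values shown above can be treated as a standalone sample against m = 320 (the median implied by the m-Xi column), using ascending average ranks for ties, which is the SPSS/Minitab convention; the ranks printed in the table itself are Excel's descending RANK values and come from a larger dataset:

```python
# One-sample Wilcoxon signed-rank sketch on the fifteen values above,
# treated as a standalone sample, with average ranks for ties.
values = [307, 350, 318, 304, 302, 429, 454, 440, 455, 390,
          350, 351, 383, 360, 383]
m = 320  # hypothesized median

diffs = [x - m for x in values]
abs_diffs = [abs(d) for d in diffs]
sorted_abs = sorted(abs_diffs)

def avg_rank(a, sorted_vals):
    """1-based ascending rank; ties get the average of their positions."""
    first = sorted_vals.index(a)                          # 0-based first position
    last = len(sorted_vals) - 1 - sorted_vals[::-1].index(a)
    return (first + last) / 2 + 1

ranks = [avg_rank(a, sorted_abs) for a in abs_diffs]

# W+ = sum of ranks of positive differences, W- = sum for negative ones
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
print(w_plus, w_minus)
```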
Values Y    Values X    Y-X     |Y-X|    Rank of |Y-X|    Sign    Ranks Ri    Squared Ranks Ri^2
307         225         82      82       8                1       8           64
350         250         100     100      6                1       6           36
318         250         68      68       11               1       11          121
304         232         72      72       9                1       9           81
302         350         -48     48       13               -1      -13         169
429         400         29      29       14               1       14          196
454         351         103     103      5                1       5           25
440         318         122     122      3                1       3           9
455         383         72      72       9                1       9           81
390         400         -10     10       15               -1      -15         225
350         400         -50     50       12               -1      -12         144
351         258         93      93       7                1       7           49
383         140         243     243      1                1       1           1
360         250         110     110      4                1       4           16
383         250         133     133      2                1       2           4
Table 14: Procedure of the Wilcoxon Signed Rank Test with Paired Data
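The signed-rank sums for the paired data of table 14 can be computed the same way; this sketch uses ascending average ranks for ties (the SPSS/Minitab convention), whereas the table shows Excel's descending RANK values:

```python
# Paired Wilcoxon signed-rank sketch on the Y and X columns of Table 14
y = [307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 350, 351, 383, 360, 383]
x = [225, 250, 250, 232, 350, 400, 351, 318, 383, 400, 400, 258, 140, 250, 250]

diffs = [yi - xi for yi, xi in zip(y, x)]
abs_diffs = [abs(d) for d in diffs]
sorted_abs = sorted(abs_diffs)

def avg_rank(a, sorted_vals):
    """1-based ascending rank; ties get the average of their positions."""
    first = sorted_vals.index(a)
    last = len(sorted_vals) - 1 - sorted_vals[::-1].index(a)
    return (first + last) / 2 + 1

ranks = [avg_rank(a, sorted_abs) for a in abs_diffs]
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # sum of positive ranks
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)  # sum of negative ranks
print(w_plus, w_minus)
```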