Excel Data Analysis Tools
This section of the notes is meant to introduce you to many of the tools that
are provided by Excel under the Tools/Data Analysis menu item. If your
computer does not have that tool loaded, you need to go to Tools/Add-Ins
and then check the box labeled "Analysis ToolPak." When you do so, you may be
prompted to insert your original CD to load the tools.
Tools for Summarizing Data
There are two principal analysis tools for summarizing data. They are
Histogram and Descriptive Statistics.
Histograms
2. Select the tool entitled "Histogram." The dialog box below will then
appear. All of the analysis tools in Excel provide a similar dialog box.
Analysis Tools - 1
Stat 5969
3. In the dialog box, specify where the data you want to analyze are located
and where you want the output to go. Specify the location of the data
either by typing the cell range, or by dragging the mouse over the cells
containing the data. For now, skip the box asking for the bin range
(see below for how to use the bin range input). If you have indicated
the row that has the variable name or heading, click in the labels box.
In the box asking for the output range, type or click on the cell
reference where you want the output to begin. Do not mark the box
next to "Pareto." If you want Excel to draw the histogram, click in the
appropriate box. The Cumulative Percentage box will give you the
ogive. Then click OK.
Example Output:
Bin      Frequency   Cumulative %
16.9     1           .42%
18.84    2           1.25%
20.78    13          6.67%
22.72    38          22.50%
24.66    64          49.17%
26.6     56          72.50%
28.54    26          83.33%
30.48    18          90.83%
32.42    8           94.17%
34.36    8           97.50%
36.3     1           97.92%
38.24    2           98.75%
40.18    1           99.17%
42.12    1           99.58%
44.06    0           99.58%
More     1           100.00%
[Chart: "Histogram" showing Frequency (bars) and Cumulative % (line) plotted against Bin]
There are two things about Excel's histogram output that I don't like. The
first is the way it handles the first bin. It always sets the first bin value equal
to the smallest number in the data set. Hence its frequency is almost always
equal to 1. In almost every case, I choose to combine this bin with the next
one. To do so, I add the frequency of this first bin to the frequency of the
second bin, and then delete the first row of the output given by Excel. For
the example above, the first two rows of my modified frequency distribution
would look like this.
Bin      Frequency   Cumulative %
18.84    3           1.25%
20.78    13          6.67%
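The adjustment just described can be checked with a short Python sketch. It is only an illustration built on the frequencies from the example output; the function names `merge_first_bin` and `cumulative_percent` are hypothetical and are not part of Excel.

```python
# Bins and frequencies copied from the example histogram output.
bins = [16.9, 18.84, 20.78, 22.72, 24.66, 26.6, 28.54, 30.48,
        32.42, 34.36, 36.3, 38.24, 40.18, 42.12, 44.06, "More"]
freqs = [1, 2, 13, 38, 64, 56, 26, 18, 8, 8, 1, 2, 1, 1, 0, 1]


def merge_first_bin(bins, freqs):
    """Fold the first bin's count into the second bin and drop the first row."""
    merged_bins = bins[1:]
    merged_freqs = [freqs[0] + freqs[1]] + freqs[2:]
    return merged_bins, merged_freqs


def cumulative_percent(freqs):
    """Running total of the frequencies as a percentage of all observations."""
    total = sum(freqs)
    running = 0
    result = []
    for f in freqs:
        running += f
        result.append(100.0 * running / total)
    return result


new_bins, new_freqs = merge_first_bin(bins, freqs)
new_cum = cumulative_percent(new_freqs)

# Print the first two rows of the modified frequency distribution.
for b, f, c in list(zip(new_bins, new_freqs, new_cum))[:2]:
    print(b, f, f"{c:.2f}%")
```

Because only the first two rows are combined, the total count and every later cumulative percentage stay the same.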
The second thing that I don't like is that the chart that Excel automatically
constructs is actually a bar graph. To make it look more like a histogram,
we need to have no space between the bars. To remove the space, double
click on the bars of the chart, then select the Options tab, and change the
Gap width to 0, then select OK.
If you don't like the bin values that Excel uses, you can create your own.
Below I describe the process that I would follow to do it. As you can see, it
is quite a bit longer, and my preference is to let Excel choose the bin
values.
1. First determine the number of bins. Say that the number of observations
you have is n. Then a rule for the number of bins is (2n)^(1/3) (i.e., the
cube root of 2n). You will usually have to round this number to an
integer. The usual suggestion is to round up. For the example above,
there were 240 data points. Then (2*240)^(1/3) = 7.83. We round up to 8
to get 8 bins.
2. To find the bin width, take the range of the data (largest minus smallest),
and divide by the number of bins found in step 1 above. Again you will
want to round up to determine the actual bin width, but it is quite
subjective as to how to round (you can go to the nearest integer, tenth,
hundredth, etc.). For the example, the smallest and largest of the 240
values were 16.9 and 46. To find the interval, we use (46-16.9)/8 = 3.64.
The original data had two decimal places, so it is convenient to use two
decimal places for the bin width. To make it an even number, I
decided to use 3.65 as the bin width.
3. When creating the bin boundaries, I take the smallest number and add the bin
width to it to obtain the starting bin value. If you don't like fractions or
uneven numbers, you can round to a neighbor that fits your criteria for
a good starting value. Excel will take the first number that you put in the
bin range, and then find how many numbers in the data set are less than
or equal to that number. Then it will take the 2nd number in the bin
range, and find how many are greater than the first bin number, but less
than or equal to the second bin number.
For the example, say my original data are in cells A2:A241 and cell A1
contains a label. In cells C2:C8 I can enter the numbers 20.5 (which is
close to 16.9 + 3.65), 24.15, 27.8, 31.45, 35.1, 38.75, 42.4 (notice I only
entered 7 numbers, even though there are 8 bins; the 8th bin will be
created by Excel and called "More"). In cell C1 I should enter some
label for the bins. The most obvious choice is to just type "Bin" in C1.
(If you check the Labels in First Row box, you must add a label to the
bins as well.) Now use Data Analysis from the Tools menu. Input
A1:A241 in the data input range. In the bin input range, enter C1:C8.
Choose the other options as normal. Then hit OK.
Below is the resulting output, including the chart (after adjusting the gap
width to 0).
Bin      Frequency   Cumulative %
20.5     13          5.42%
24.15    88          42.08%
27.8     93          80.83%
31.45    28          92.50%
35.1     13          97.92%
38.75    2           98.75%
42.4     2           99.58%
More     1           100.00%
[Chart: "Histogram" showing Frequency (bars) and Cumulative % (line) for bins 20.5, 24.15, 27.8, 31.45, 35.1, 38.75, 42.4 and More]
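The bin-selection steps above can be sketched in Python as a rough illustration. The function names are hypothetical, the starting value 20.5 and the width 3.65 are the rounded choices made in the example, and the counting rule follows the description of how Excel treats the bin range (less than or equal to each boundary, with a final "More" bin). The short list of values at the end is made up purely to exercise the counting rule.

```python
import math
from bisect import bisect_left


def suggested_bin_count(n):
    """Cube root of 2n, rounded up to an integer."""
    return math.ceil((2 * n) ** (1 / 3))


def bin_boundaries(start, width, count):
    """Upper boundaries for every bin except the last ("More") bin."""
    return [round(start + i * width, 2) for i in range(count - 1)]


def count_into_bins(values, boundaries):
    """Bin i holds values greater than the previous boundary and less than
    or equal to boundary i; the final count is the "More" bin."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts


n, smallest, largest = 240, 16.9, 46.0
k = suggested_bin_count(n)                 # number of bins
raw_width = (largest - smallest) / k       # before rounding to 3.65
boundaries = bin_boundaries(20.5, 3.65, k)

# Made-up values, used only to show how boundary values are assigned.
demo_counts = count_into_bins([16.9, 20.5, 20.6, 24.15, 30.0, 46.0], boundaries)

print(k, round(raw_width, 2), boundaries, demo_counts)
```

Note that a value exactly equal to a boundary (20.5 or 24.15 in the made-up list) is counted in the bin that the boundary closes, not in the next one.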
Descriptive Statistics
You can do descriptive statistics on several variables at once. You just need
to be sure that the variables are next to each other in the spreadsheet, and
then refer to all the columns in the input portion of the dialog box.
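                           Height
Mean                       1.2
Standard Error             0.01206
Median                     1.18
Mode                       1.23
Standard Deviation         0.04
Sample Variance            0.0016
Kurtosis                   -1.11302
Skewness                   0.50417
Range                      0.12
Minimum                    1.15
Maximum                    1.27
Sum                        13.2
Count                      11
Confidence Level(99.0%)    0.03822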
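For readers who want to see how a few of these summary measures relate to one another, here is a minimal Python sketch. The raw height data are not listed in these notes, so the sample below is hypothetical. The last lines assume that the 0.04 in the output is the standard deviation and that 11 is the count, and check that together they give the reported 0.01206.

```python
import math
import statistics

# Hypothetical sample; the raw data behind the output are not given.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
mean = statistics.mean(sample)
variance = statistics.variance(sample)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(sample)       # square root of the sample variance
std_error = std_dev / math.sqrt(n)       # standard error of the mean
data_range = max(sample) - min(sample)

summary = {
    "Mean": mean,
    "Standard Error": std_error,
    "Median": statistics.median(sample),
    "Standard Deviation": std_dev,
    "Sample Variance": variance,
    "Range": data_range,
    "Sum": sum(sample),
    "Count": n,
}
for name, value in summary.items():
    print(f"{name:20s}{value}")

# Check against the output above, assuming 0.04 is the standard
# deviation and 11 is the count.
reported_std_error = 0.04 / math.sqrt(11)
print(round(reported_std_error, 5))
```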
Box plots
[Figure: box plots for Automobile and Public, on an axis running from 10 to 50]
Covariance
n/(n - 1)
              Day        Hour       Prep Time   Wait Time   Travel Time   Distance
Day           3.906276
Hour          0          5.271967
Prep Time     0.212155   -0.19626   1.110149
Wait Time     1.123469   -0.83906   0.310482    10.60447
Travel Time   0.193243   0.221318   0.033146    -0.29553    3.59392
Distance      0.166276   0.143933   -0.02226    -0.06857    1.799046      1.02825
The numbers on the diagonals are variances (except they are divided by n),
and all other numbers are covariances. The matrix is symmetric, so only
numbers on one side of the diagonal are shown.
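Since the tool divides by n, a value from the matrix can be converted to the usual sample covariance (which divides by n - 1) by multiplying by n/(n - 1). The following Python sketch illustrates the conversion on made-up data; the function names are hypothetical and the numbers are not from the example.

```python
def population_covariance(x, y):
    """Covariance with divisor n, as in the matrix above."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def to_sample_covariance(pop_cov, n):
    """Rescale a divisor-n covariance to the divisor-(n - 1) version."""
    return pop_cov * n / (n - 1)


# Made-up data for illustration only.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

pop_cov = population_covariance(x, y)
sample_cov = to_sample_covariance(pop_cov, len(x))
pop_var_x = population_covariance(x, x)   # a diagonal entry is a variance

print(pop_cov, sample_cov, pop_var_x)
```

For large n the two versions are nearly identical, so the adjustment matters most for small data sets.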
Correlation
We can also use the spreadsheet to find the sample correlation matrix, and
the procedure is identical to that of finding the covariance, except that we
choose the Correlation Analysis Tool.
              Day      Hour      Prep Time   Wait Time   Travel Time   Distance
Day           1.0000
Hour          0.0000   1.0000
Prep Time     0.1019   -0.0811   1.0000
Wait Time     0.1746   -0.1122   0.0905      1.0000
Travel Time   0.0516   0.0508    0.0166      -0.0479     1.0000
Distance      0.0830   0.0618    -0.0208     -0.0208     0.9359        1.0000
The off-diagonal terms are the sample correlation coefficients between pairs
of variables. Excel does these computations correctly and no adjustments
are necessary.
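One way to see why no adjustment is needed is that a correlation is a covariance divided by the square root of the product of the two variances, so the divisor (n or n - 1) cancels. The Python sketch below, with a hypothetical function name, recomputes two of the correlations from entries of the covariance matrix given earlier.

```python
import math


def correlation_from_covariance(cov_xy, var_x, var_y):
    """r = cov(x, y) / sqrt(var(x) * var(y)); the divisor cancels out."""
    return cov_xy / math.sqrt(var_x * var_y)


# Entries taken from the covariance matrix in the Covariance section.
r_travel_distance = correlation_from_covariance(1.799046, 3.59392, 1.02825)
r_day_wait = correlation_from_covariance(1.123469, 3.906276, 10.60447)

print(round(r_travel_distance, 4), round(r_day_wait, 4))
```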
Excel has a utility called a Pivot Table that allows us to create and analyze
tabular summaries (contingency tables) of qualitative data. It can also be
used with quantitative data or combinations of quantitative and qualitative
data.
To use the pivot table feature, data must be entered in columns and each
column must have a title or header. Before invoking the procedure, be sure
that the cursor is in one of the cells containing a header or data.
In step 3, click on the button called "Layout." You will be presented with
the following dialog box (except the buttons on the right will change
according to the data set you are using).
At this point, click on and drag the button corresponding to the variable that
you want to be on the rows of your output table to the area labeled "Row"
and the variable you want in columns to the area that says "Column." Then
drag either of the two buttons that you just used to the "Data" area. I
recommend always dragging one of the qualitative variables' buttons. The
button should change to say "Count of VARIABLE" where VARIABLE is the
name of the variable that you dragged to the middle. Then click OK.
To complete the procedure there are a few other options you can change if
you desire, but I usually just click on Finish at this point and change options
later if the output is not what I desire. If you have used a quantitative
variable, you will likely want to group it. To do so, right click on the
variable name in the table. One item in the pop-up menu should say Group.
Choose it, and then specify how you want the variable to be grouped.
The pivot table can display several different types of summary measures.
The default or normal state is to display total counts. There may be times
that you want to display the numbers in the table as overall percentages, as
row percentages, etc. To change the display, click anywhere in the table
and go again to the Data/PivotTable and PivotChart Report menu item. You
should be at step 3 again. Click on "Layout" and then double click what is in
the middle of the table (it should say "Count of VARIABLE"). Then select "Options."
A drop down menu that says "Show Data As" will be in the middle of the
dialog box. Use the drop down menu to say how you want to display the
data. Then exit out of all of the boxes.
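To make the idea of a contingency table of counts and its row-percentage view concrete, here is a small Python sketch. The records and the variable names (class standing and travel mode) are hypothetical, and the code only mimics what the pivot table displays; it is not how Excel computes it.

```python
from collections import Counter

# Hypothetical records: (class standing, travel mode).
records = [
    ("Junior", "Automobile"),
    ("Junior", "Public"),
    ("Senior", "Automobile"),
    ("Senior", "Automobile"),
    ("Freshman", "Public"),
    ("Junior", "Automobile"),
]

counts = Counter(records)
rows = sorted({r for r, _ in records})   # alphabetical, like Excel's default
cols = sorted({c for _, c in records})

# Contingency table of counts: table[row][column].
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}


def row_percentages(table):
    """Each cell as a percentage of its row total."""
    result = {}
    for r, cells in table.items():
        total = sum(cells.values())
        result[r] = {c: 100.0 * v / total for c, v in cells.items()}
    return result


row_pct = row_percentages(table)
for r in rows:
    print(r, table[r], {c: round(p, 1) for c, p in row_pct[r].items()})
```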
The default way that Excel lists the categories in qualitative variables is
alphabetically. You may want them listed in some kind of logical ascending
order (for example, you may want to list class standing as Freshman,
Sophomore, Junior and Senior). To tell Excel how you want the labels to
be ordered, go to the Tools menu, select options, and then click on the tab
called Custom Lists. Then you can type in the list items in the order you
want them (separate them with a comma or return) in the List Entries section.
Or you can import the list in the order that you want by identifying the cells
where they are listed.
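The difference between alphabetical order and a custom list can be illustrated with a few lines of Python. The labels are the class-standing example from the text; the variable names are hypothetical.

```python
# The desired logical order, playing the role of a custom list.
custom_order = ["Freshman", "Sophomore", "Junior", "Senior"]
rank = {label: position for position, label in enumerate(custom_order)}

labels = ["Senior", "Freshman", "Junior", "Sophomore", "Junior"]

alphabetical = sorted(labels)                      # default ordering
logical = sorted(labels, key=rank.__getitem__)     # custom-list ordering

print(alphabetical)
print(logical)
```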
Random Sampling
We can obtain a random sample from a set of data using the analysis tools.
The tool is called "Sampling." Before using the tool, I suggest including a
column in the data file that is a numbered label. After selecting Data
Analysis, choose the Sampling tool. Next indicate the location of the
numbers to be sampled from (which would be the location of the data
labels), input the first cell of the output block, choose random (rather than
periodic), then indicate how many samples you want to draw (i.e., the
sample size). Then hit OK.
The best way I know to look for a duplicate is to sort the data. The sort
routine is under the DATA menu or can be found on the tool bar.
To find the actual data associated with the label, we can use the function
=VLOOKUP. Suppose that my labels are in cells A2:A301 and the data
from which I want the random sample is in cells B2:B301. Suppose also
that I started the output from the Sampling tool in C2 and drew a sample of
25 (so the sampled labels are in cells C2:C26). I will also assume that I
don't have any duplicates. Then in cell D2 I would enter the function
=VLOOKUP(C2,$A$2:$B$301,2). This function says to look for what is in
cell C2 in the first column of A2:B301. When it finds the number, it reports
back the corresponding number in the second column of A2:B301 (the
2 is what tells it to report back what is in the second column). Then I would
copy cell D2's contents down through cell D26.
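The whole sampling workflow (draw labels, check for duplicates by sorting, then look up the data for each label) can be sketched in Python. This is an illustration under stated assumptions: the data values are made up, the draw is assumed to be with replacement (which is why duplicates can occur), and a dictionary lookup stands in for VLOOKUP with an exact match on the label.

```python
import random

# Hypothetical stand-ins for column A (labels 1..300) and column B (data).
labels = list(range(1, 301))
data = {label: 10.0 + 0.1 * label for label in labels}

rng = random.Random(5969)                     # fixed seed for repeatability
sampled_labels = rng.choices(labels, k=25)    # draws with replacement

# Sorting puts repeated labels next to each other, so duplicates are
# easy to spot by comparing neighbors.
ordered = sorted(sampled_labels)
duplicates = sorted({a for a, b in zip(ordered, ordered[1:]) if a == b})

# Look up the data value that goes with each sampled label.
sampled_values = [data[label] for label in sampled_labels]

print(ordered)
print("duplicates:", duplicates)
```

If the list of duplicates is not empty, one option is simply to draw again until it is.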
Inference Tools
The majority of the tools in Excel are for statistical inference. I will discuss
how to use the tools for confidence intervals on one mean, hypothesis
tests on one and two means, analysis of variance, and regression.
Confidence Intervals
                           Height
Mean                       1.2
Standard Error             0.01206
Count                      11
Confidence Level(99.0%)    0.03822
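The value Excel labels "Confidence Level(99.0%)" is the half-width of the interval, so the interval itself is the mean plus or minus that number. A minimal Python sketch using the figures above (the variable names are my own):

```python
# Values copied from the output above.
mean = 1.2
std_error = 0.01206
half_width = 0.03822      # "Confidence Level(99.0%)"

lower = mean - half_width
upper = mean + half_width

# The half-width is a t multiplier times the standard error, so the
# multiplier Excel used can be backed out by division.
implied_t = half_width / std_error

print(f"99% confidence interval: ({lower:.5f}, {upper:.5f})")
print(f"implied t multiplier: {implied_t:.3f}")
```

The implied multiplier is what one would look up in a t table with Count - 1 = 10 degrees of freedom.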
12.08667
0.041238
15
#DIV/0!
0
0
14
1.652907
0.060295
2.624492
0.120591
2.976849
Conclusions:
Hypothesized
Mean
12
0
15
From the data, can we conclude that the two shifts have the same
productivity level? It looks like the second shift completes the task in less
time, but is the difference due to sampling, or because the mean times are
really different? The output from the two procedures is given below.
                               Shift 1     Shift 2
Mean                           61.325      49.7875
Variance                       207.3564    89.50125
Observations                   8           8
Pooled Variance                148.4288
Hypothesized Mean Difference   0
df                             14
t Stat                         1.894011
P(T<=t) one-tail               0.039536
t Critical one-tail            1.761309
P(T<=t) two-tail               0.079072
t Critical two-tail            2.144789

                               Shift 1     Shift 2
Mean                           61.325      49.7875
Variance                       207.3564    89.50125
Observations                   8           8
Hypothesized Mean Difference   0
df                             12
t Stat                         1.894011
P(T<=t) one-tail               0.041288
t Critical one-tail            1.782287
P(T<=t) two-tail               0.082575
t Critical two-tail            2.178813
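The pooled variance, the t statistic, and the degrees of freedom in the two outputs can be reproduced from the means, variances, and sample sizes alone. The Python sketch below uses the standard pooled-variance formula and the usual approximation for the degrees of freedom when the variances are not pooled; the variable names are my own, and the p-values and critical values are not recomputed because they need a t distribution table.

```python
import math

# Summary statistics copied from the output above.
mean1, var1, n1 = 61.325, 207.3564, 8
mean2, var2, n2 = 49.7875, 89.50125, 8

# Output with a pooled variance (equal variances assumed).
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_pooled = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Output without a pooled variance (unequal variances).
v1, v2 = var1 / n1, var2 / n2
t_unpooled = (mean1 - mean2) / math.sqrt(v1 + v2)
df_unpooled = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(round(pooled_var, 4), round(t_pooled, 6), df_pooled)
print(round(t_unpooled, 6), round(df_unpooled))
```

With equal sample sizes the two t statistics coincide, which is why both outputs show 1.894011; only the degrees of freedom, and therefore the p-values and critical values, differ.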
For this procedure, Excel only calculates one-sided values. If the test is
two-sided (as it usually is) you have two options. First, you can divide the
given value of α by 2, and input the result as the level of significance. The
second option is to always use the p-value criterion and for a two-sided test,
multiply the one-sided p-value by 2.
                      Shift 1    Shift 2
Mean                  61.325     49.7875
Variance              207.356    89.501
Observations          8          8
df                    7          7
F                     2.317
P(F<=f) one-tail      0.145
F Critical one-tail   3.787
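The two options for a two-sided test can be illustrated with a few lines of Python using the numbers in the output above. The significance level of 0.05 is an assumption made for the illustration, and the variable names are my own.

```python
# Values copied from the output above.
var1, var2 = 207.356, 89.501
p_one_tail = 0.145

# The F statistic is the ratio of the two sample variances.
f_stat = var1 / var2

# Option 1: halve the significance level before entering it in the tool.
alpha = 0.05                      # assumed two-sided significance level
alpha_for_tool = alpha / 2

# Option 2: double the one-sided p-value and compare it with alpha.
p_two_tail = 2 * p_one_tail
reject_equal_variances = p_two_tail < alpha

print(round(f_stat, 3), alpha_for_tool, p_two_tail, reject_equal_variances)
```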
ANOVA:
Excel can do one- and two-way analysis of variance. I only describe the
single factor case below. If you are interested in two-way ANOVA, Excel's
help should guide you through it. It should also be very similar to what is
described below.
After selecting Data Analysis, choose the option called "Anova: Single
Factor" in Excel. Next specify the input block, which will contain the data
from all groups. Each group should be in its own column or row. If the
groups have differing numbers of samples, be sure to highlight enough cells
to include all samples. Excel will handle the blank spaces without a problem.
Indicate where to send the output, and then input a value of α. Check the box
indicating whether the groups are entered in columns or rows, and check the
label box if you have included labels in your input block. Then start the
procedure.
Example
Three different automatic milling machines at Castmetal, Inc. were set up to
mill the same type of part. Observations were taken at random times to find
out how many parts were being produced per hour by each machine. Only
four observations were taken on machine 3 since the inspector became ill
and had to go home before he could complete his work. These data were
entered into Excel in cells A1:C5. Can we conclude that the mean hourly
output for the three machines is different?
Machine 1   Machine 2   Machine 3
105         91          104
105         99          106
110         89          99
107         95          109
102         103
Anova: Single-Factor

Summary
Groups      Count   Sum   Average   Variance
Machine 1   5       529   105.8     8.7
Machine 2   5       477   95.4      32.8
Machine 3   4       418   104.5     17.6667

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        313.8571   2    156.9286   7.882257   0.007518   7.205699
Within Groups         219        11   19.90909
Total                 532.8571   13
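The sums of squares, degrees of freedom, and F statistic in the ANOVA table can be recomputed directly from the milling machine data. The Python sketch below uses the standard single-factor formulas; the variable names are my own, and the p-value and critical value are left out because they require an F distribution table.

```python
# Parts per hour for each machine, from the example data.
groups = {
    "Machine 1": [105, 105, 110, 107, 102],
    "Machine 2": [91, 99, 89, 95, 103],
    "Machine 3": [104, 106, 99, 109],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-groups: group size times squared distance of the group mean
# from the grand mean, summed over groups.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-groups: squared distance of each observation from its group mean.
ss_within = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(round(ss_between, 4), round(ss_within, 4), round(f_stat, 6))
```

Unequal group sizes need no special handling here, because each group contributes according to its own count.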
Conclusions:
Regression:
Doing regression in Excel is very similar to using the other analysis tools.
With regression, however, having the data in the right form is more
important. First, all data should be entered in columns. Second, all
independent variables should be next to each other (i.e., in a contiguous set
of cells).
Once the data are entered correctly, select "Regression" from the Tools/
Data Analysis menu item in Excel. You will be presented with the dialog
box shown below.
In the Input Y Range, enter the cell range referring to the column containing
the dependent variable. In the Input X Range, enter the range of cells
containing all independent variables. This is why the X variables need to be
next to each other. If your range of cells included a row of labels, click the
label box.
I never click the "Constant is Zero" box. In some physical systems it only
makes sense for the intercept to be 0, so we can force it to be so. In our
examples that will never be the case. If you want confidence intervals for
the coefficients at a level other than 95%, click in the Confidence
Level box and enter a different confidence level.
Next, indicate where you want the output to go. Finally, click on the box
next to "Residuals." I leave all other boxes blank, because I don't like the
way that Excel does the rest of the residual analysis or the normal probability
plot. Then hit enter.
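Several of the figures that Excel reports in its regression output are tied together by simple identities, which can help when reading it. The Python sketch below checks those identities using the sums of squares, degrees of freedom, and number of observations reported in the output for this example; the variable names are my own.

```python
import math

# Values copied from the Analysis of Variance part of the example output.
ss_regression, ss_residual = 752.919, 106.028
df_regression, df_residual = 3, 236
n_observations = 240

ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n_observations - 1) / df_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)   # standard error of the regression

print(round(multiple_r, 6), round(r_square, 5), round(adjusted_r_square, 6))
print(round(standard_error, 6), round(f_stat, 2))
```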
Regression Statistics
Multiple R          0.936248
R Square            0.87656
Adjusted R Square   0.874991
Standard Error      0.670277
Observations        240

Analysis of Variance
             df    Sum of Squares   Mean Square   F          Significance F
Regression   3     752.919          250.973       558.6225   7.1E-107
Residual     236   106.028          0.449271
Total        239   858.947

             Coefficients   Standard Error   t Statistic   P-value   Lower 95%   Upper 95%
Intercept    1.156832       0.229887
Day          -0.02521       0.022013
Hour         -0.00592       0.018919
Distance     1.754525       0.042988