Stat 5969 Statistical Software Packages
Data Analysis Tools
This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool loaded, you need to go to Tools/Add-Ins and then check the box Analysis ToolPak. When you do so, you may be prompted to insert your original CD to load the tools.
Tools for Summarizing Data
There are two principal analysis tools for summarizing data. They are “Histogram” and “Descriptive Statistics.”
We can use a spreadsheet to obtain a histogram. In the process it finds the frequency distribution and then draws the plot. It also has the option of finding an ogive. Below is the procedure.
1. To get to the Analysis Tools, select Tools/Data Analysis. This will bring up the list of statistical methods.
2. Select the tool entitled "Histogram." The dialog box below will then appear. All of the analysis tools in Excel provide a similar dialog box.
3. In the dialog box, specify where the data you want to analyze are located and where you want the output to go. Specify the location of the data either by typing the cell range, or by dragging the mouse over the cells containing the data. For now, skip the box asking for the bin range (see below for how to use the bin range input). If the range you indicated includes the row that has the variable name or heading, click in the labels box. In the box asking for the output range, type or click on the cell reference where you want the output to begin. Do not mark the box next to "Pareto." If you want Excel to draw the histogram, click in the appropriate box. The “Cumulative Percentage” box will give you the ogive. Then click OK.
The result of this procedure will be a frequency distribution. The first column will show the value which defines the right (or maximum) value of the class interval, which Excel refers to as a “bin.” The second column will show the number of observations in the bin, and the third column will contain the cumulative percentage of observations falling in or below the bin.
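The counting rule behind this output can be sketched in a few lines of Python. This is only an illustration of the rule (each bin value is an inclusive upper limit, and a final "More" bin catches everything larger), not Excel's own code; the function name `frequency_table` and the small data set are made up for the example.

```python
from bisect import bisect_left

def frequency_table(data, bin_edges):
    """Count values per bin; each edge is the inclusive upper limit of its bin."""
    counts = [0] * (len(bin_edges) + 1)      # the extra slot is the "More" bin
    for x in data:
        # bisect_left finds the first edge that is >= x, so x <= that edge
        counts[bisect_left(bin_edges, x)] += 1
    total = len(data)
    rows, running = [], 0
    for label, count in zip(list(bin_edges) + ["More"], counts):
        running += count
        rows.append((label, count, round(100 * running / total, 2)))
    return rows

sample = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 7.0]   # hypothetical data
for row in frequency_table(sample, [2, 4, 6]):
    print(row)   # (bin, frequency, cumulative %)
```

The value 2.0 lands in the bin labeled 2 and the value 7.0 lands in "More", which matches the "less than or equal to" reading of the bin column.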
• Example Output:

Bin      Frequency   Cumulative %
16.9         1           0.42%
18.84        2           1.25%
20.78       13           6.67%
22.72       38          22.50%
24.66       64          49.17%
26.6        56          72.50%
28.54       26          83.33%
30.48       18          90.83%
32.42        8          94.17%
34.36        8          97.50%
36.3         1          97.92%
38.24        2          98.75%
40.18        1          99.17%
42.12        1          99.58%
44.06        0          99.58%
More         1         100.00%

[Figure: histogram of Frequency by Bin with the Cumulative % line]
• The way to interpret the frequency distribution is as follows. The first frequency number is the number of data points that have values less than or equal to the first bin number. The next frequency number is the number of data points less than or equal to the second bin number, but greater than the first bin number. The other numbers are interpreted similarly. For example, in what is above, there is one number in the data set that is less than or equal to 16.9. There are 2 numbers in the data set larger than 16.9 and less than or equal to 18.84. The last bin always says “More.” The corresponding frequency number tells us how many numbers in the data set are larger than the second to last bin number. In the example above, 1 number in the data set is larger than 44.06.
Minor Fixes to Excel’s Output
• There are two things about Excel’s histogram output that I don’t like. The first is the way it handles the first bin. It always sets the first bin value equal to the smallest number in the data set. Hence its frequency is almost always equal to 1. In almost every case, I choose to combine this bin with the next one. To do so, I add the frequency of this first bin to the frequency of the second bin, and then delete the first row of the output given by Excel. For the example above, the first two rows of my modified frequency distribution would look like this.

Bin      Frequency   Cumulative %
18.84        3           1.25%
20.78       13           6.67%
• The second thing that I don’t like is that the chart that Excel automatically constructs is actually a bar graph. To make it look more like a histogram, we need to have no space between the bars. To remove the space, double click on the bars of the chart, then select the Options tab, and change the Gap width to 0, then select OK.
Selecting Your Own Bin Values
• If you don’t like the bin values that Excel uses, you can create your own. Below I describe the process that I would follow to do it. As you can see, it is quite a bit longer, and my preference is to let Excel choose the bin values.
1. First determine the number of bins. Say that the number of observations you have is n. Then a rule for the number of bins is (2*n)^(1/3) (i.e., the cube root of 2n). You will usually have to round this number to an integer. The usual suggestion is to round up. For the example above, there were 240 data points. Then (2*240)^(1/3) = 7.83. We round up to 8 to get 8 bins.
2. To find the bin width, take the range of the data (largest minus smallest), and divide by the number of bins found in step 1 above. Again you will want to round up to determine the actual bin width, but it is quite subjective as to how to round (you can go to the nearest integer, tenth, hundredth, etc.). For the example, the smallest and largest of the 240 values were 16.9 and 46. To find the interval, we use (46-16.9)/8 = 3.64. The original data had two decimal places, so it is convenient to use two decimal places for the bin width. To make it an “even” number, I decided to use 3.65 as the bin width.
3. When creating the bin boundaries, I take the smallest number and add the bin width to it to obtain the starting bin value. If you don’t like fractions or “uneven” numbers, you can round to a neighbor that fits your criteria for a good starting value. Excel will take the first number that you put in the bin range, and then find how many numbers in the data set are less than or equal to that number. Then it will take the 2nd number in the bin range, and find how many are greater than the first bin number, but less than or equal to the second bin number. For the example, say my original data are in cells A2:A241 and cell A1 contains a label. In cell C1 I should enter some label for the bins. The most obvious choice is to just type “Bin” in C1. In cells C2:C8 I can enter the numbers 20.5 (which is close to 16.9 + 3.65), 24.15, 27.8, 31.45, 35.1, 38.75, 42.4 (notice I only entered 7 numbers, even though there are 8 bins; the 8th bin will be created by Excel and called “More”). Now use Data Analysis from the Tools menu. Input A1:A241 in the data input range. In the bin input range, enter C1:C8. (If you check the “Labels in First Row” box, you must add a label to the bins as well.) Choose the other options as normal. Then hit OK.
• Below is the resulting output, including the chart (after adjusting the gap width to 0).

Bin      Frequency   Cumulative %
20.5        13           5.42%
24.15       88          42.08%
27.8        93          80.83%
31.45       28          92.50%
35.1        13          97.92%
38.75        2          98.75%
42.4         2          99.58%
More         1         100.00%
[Figure: histogram of Frequency by Bin (20.5 through 42.4, then More) with the Cumulative % line]
• The interpretation of the frequency distribution is exactly the same as before.
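The bin arithmetic from the steps above can be checked with a short Python sketch. The function names are hypothetical, and the choices of 3.65 as the width and 20.5 as the starting value are the subjective rounding decisions described in the notes, so they are passed in by hand rather than computed.

```python
import math

def suggest_bins(n, smallest, largest, decimals=2):
    """Cube-root rule for the number of bins, then a rounded-up bin width."""
    num_bins = math.ceil((2 * n) ** (1 / 3))          # round up to an integer
    raw_width = (largest - smallest) / num_bins
    scale = 10 ** decimals
    return num_bins, math.ceil(raw_width * scale) / scale

def bin_edges(start, width, num_bins):
    """Upper limits for all bins except the last, which becomes 'More'."""
    return [round(start + i * width, 2) for i in range(num_bins - 1)]

num_bins, width = suggest_bins(240, 16.9, 46)
print(num_bins, width)                      # rule-based values for the example
edges = bin_edges(20.5, 3.65, num_bins)     # hand-picked start and width
print(edges)
```

Only seven edges are produced for eight bins, for the same reason only seven numbers are typed into the bin range.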
Descriptive Statistics
• To use Excel to obtain a listing of descriptive statistics, we again use the Analysis Tools. This time, instead of selecting "Histogram," select "Descriptive Statistics." Indicate where the data are located, and select whether they are in rows or columns. If you want the data set to have a descriptive title, you can include the label in the first entry above the data, and then click the box next to "Labels." Specify where you want the output to go. I recommend always clicking on the “Summary Statistics” box. I also recommend checking the Confidence Level for Mean box (and filling in the confidence level) if you are interested in confidence intervals for the mean. I rarely use the other boxes.
• You can do descriptive statistics on several variables at once. You just need to be sure that the variables are next to each other in the spreadsheet, and then refer to all the columns in the input portion of the dialog box.
• Here is some example output.

Height
Mean                       1.2
Standard Error             0.01206
Median                     1.18
Mode                       1.23
Standard Deviation         0.04
Sample Variance            0.0016
Kurtosis                   -1.…
Skewness                   0.…
Range                      0.12
Minimum                    1.15
Maximum                    1.27
Sum                        13.2
Count                      11
Confidence Level(99.0%)    0.03822
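Several entries in the Height output are tied together by simple formulas, which the following Python sketch checks. The t value 3.16927 is an assumption taken from a standard t table (two-sided 99%, 10 degrees of freedom); the other inputs are read from the output above.

```python
import math

n = 11                  # Count
mean = 1.2              # Mean
std_dev = 0.04          # Standard Deviation
t_critical = 3.16927    # assumed: t table value for 99% confidence, df = n - 1 = 10

standard_error = std_dev / math.sqrt(n)     # Standard Error = s / sqrt(n)
margin = t_critical * standard_error        # what Excel labels Confidence Level(99.0%)
interval = (mean - margin, mean + margin)   # 99% confidence interval for the mean

print(round(standard_error, 5), round(margin, 5))
print(tuple(round(v, 5) for v in interval))
```

The two rounded values agree with the Standard Error and Confidence Level(99.0%) rows of the output.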
Box plots
• There is nothing built in to Excel to do box plots. I have created a template that will do up to 4 simultaneous box plots. The file is called Multiple Boxplots.xls. It has some faults, but it is not bad. It is also limited to data sets of no more than 500 observations. Below is a sample of what it produces.
[Figure: box plots for Automobile and Public on a common axis from 0 to 50]
Covariance
• A covariance matrix can be obtained from the spreadsheet by using the Covariance Analysis Tool. Select Data Analysis, then Covariance. Identify the input area, and where you would like the output to go. Indicate whether the data are grouped by column or row, and whether labels are being used, and then select OK. Example output is shown at the top of the next page.
**WARNING** This analysis tool divides the cross products by n rather than by n-1. If you want true sample variances and covariances, you should multiply all of the numbers by n/(n-1).
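The n versus n-1 adjustment in the warning can be illustrated with a small Python sketch. The function names and the four data pairs are hypothetical; the first function mimics a covariance that divides by n, and the second applies the n/(n-1) correction.

```python
def covariance_divided_by_n(x, y):
    """Covariance with the cross products divided by n (the tool's convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

def to_sample_covariance(value, n):
    """Rescale a divide-by-n covariance to the usual divide-by-(n-1) version."""
    return value * n / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]   # hypothetical data
y = [2.0, 4.0, 5.0, 9.0]

tool_value = covariance_divided_by_n(x, y)
sample_value = to_sample_covariance(tool_value, len(x))
print(tool_value, sample_value)
```

The same multiplier applies to the diagonal entries, since a variance is just the covariance of a variable with itself.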
[Table: covariance matrix for Day, Hour, Prep Time, Wait Time, Travel Time, and Distance]
• The numbers on the diagonals are variances (except they are divided by n), and all other numbers are covariances. The matrix is symmetric, so only numbers on one side of the diagonal are shown.
Correlation
• We can also use the spreadsheet to find the sample correlation matrix, and the procedure is identical to that of finding the covariance, except that we choose the Correlation Analysis Tool. Excel does these computations correctly and no adjustments are necessary.
• Here is the correlation matrix for the pizza example.
[Table: correlation matrix for Day, Hour, Prep Time, Wait Time, Travel Time, and Distance, with 1.0000 on the diagonal]
• The off-diagonal terms are the sample correlation coefficients between pairs of variables.
Summarizing Qualitative Data in Tables
• Excel has a utility called a Pivot Table that allows us to create and analyze tabular summaries (contingency tables) of qualitative data. It can also be used with quantitative data or combinations of quantitative and qualitative data.
• To use the pivot table feature, data must be entered in columns and each column must have a title or header. Before invoking the procedure, be sure that the cursor is in one of the cells containing a header or data.
• To start the “wizard,” go to Data/PivotTable and PivotChart Report. In the first step, just click on Next (the default values are what we want). In the second step, verify that the data range shown contains all of the data that you want to analyze, then click on Next again. In step 3, click on the button called “Layout.” You will be presented with the following dialog box (except the buttons on the right will change according to the data set you are using).
• At this point, click on and drag the button corresponding to the variable that you want to be on the rows of your output table to the area labeled “Row” and the variable you want in columns to the area that says “Column.” Then drag either of the two buttons that you just used to the “Data” area. I recommend always dragging one of the qualitative variables’ buttons. The button should change to say “Count of VARIABLE” where VARIABLE is the name of the variable that you dragged to the middle. Then say OK. You should be at step 3 again. To complete the procedure there are a few other options you can change if you desire, but I usually just click on Finish at this point and change options later if the output is not what I desire.
• If you have used a quantitative variable, you will likely want to group it. To do so, right click on the variable name in the table. One item in the pop-up menu should say Group. Choose it, and then specify how you want the variable to be grouped.
• The pivot table can display several different types of summary measures. The default or “normal” state is to display total counts. There may be times that you want to display the numbers in the table as overall percentages, as row percentages, etc. To change the display, click anywhere in the table and go again to the Data/PivotTable and PivotChart Report menu item. Click on Layout and then double click what is in the middle of the table (it should say “Count of…”). Then select options. A drop down menu that says “Show Data As” will be in the middle of the dialog box. Use the drop down menu to say how you want to display the data. Then exit out of all of the boxes.
• The default way that Excel lists the categories in qualitative variables is alphabetically. You may want them listed in some kind of logical ascending order (for example, you may want to list class standing as Freshman, Sophomore, Junior and Senior). To tell Excel how you want the labels to be ordered, go to the Tools menu, select options, and then click on the tab called “Custom Lists.” Then you can type in the list items in the order you want them (separate them with a comma or return) in the List Entries section. Or you can import the list in the order that you want by identifying the cells where they are listed.
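For readers who want to see what a pivot table of counts amounts to, here is a minimal Python sketch. The eight records, the variable names, and the `table` helper are all hypothetical; the two ordered lists play the role of custom lists, and the `show_as` argument imitates the "Show Data As" choice between counts and overall percentages.

```python
from collections import Counter

# Hypothetical records: (class standing, rating) for eight students.
records = [
    ("Freshman", "Good"), ("Freshman", "Good"), ("Sophomore", "Excellent"),
    ("Junior", "Very Good"), ("Senior", "Good"), ("Senior", "Excellent"),
    ("Junior", "Very Good"), ("Sophomore", "Good"),
]
row_order = ["Freshman", "Sophomore", "Junior", "Senior"]   # custom list for rows
col_order = ["Good", "Very Good", "Excellent"]              # custom list for columns

counts = Counter(records)        # one count per (row, column) combination
total = sum(counts.values())

def table(show_as="count"):
    """Build the contingency table as a list of rows in the custom order."""
    rows = []
    for r in row_order:
        row = []
        for c in col_order:
            value = counts[(r, c)]           # Counter gives 0 for unseen pairs
            if show_as == "percent_of_total":
                value = round(100 * value / total, 1)
            row.append(value)
        rows.append(row)
    return rows

print(table())
print(table("percent_of_total"))
```

Without the two ordered lists, the categories would have to be sorted some other way, which is the alphabetical default the notes describe.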
• Below is a portion of an Excel worksheet with both qualitative and quantitative variables. It shows both a portion of the original data and the resulting pivot table. I created a custom list in Excel as “Good, Very Good, Excellent.”
Random Sampling
• We can obtain a random sample from a set of data using the analysis tools. The tool is called "Sampling." Before using the tool, I suggest including a column in the data file that is a numbered label. After selecting Data Analysis, choose the Sampling tool. Next indicate the location of the numbers to be sampled from (which would be the location of the data labels), choose random (rather than periodic), then indicate how many samples you want to draw (i.e., the sample size), and input the first cell of the output block. Then hit OK.
• With the above procedure, it is possible to obtain repeated items in the sample (e.g., the same item could be drawn twice). That is why I use the label column rather than the original data column to create the sample. That way I can tell if I have duplicates. If I do obtain a duplicate, I simply continue to draw more samples until I have a sufficient number of distinct items for the desired sample size.
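The "keep drawing until there are enough distinct labels" idea can be sketched in Python as follows. The function name, the seed, and the made-up data values are assumptions for illustration; the sketch draws labels with replacement, as the Sampling tool does, and simply skips any label that has already been drawn.

```python
import random

def draw_distinct_labels(labels, sample_size, rng):
    """Draw labels with replacement, keeping only the first copy of each."""
    if sample_size > len(set(labels)):
        raise ValueError("sample size exceeds the number of distinct labels")
    chosen, seen = [], set()
    while len(chosen) < sample_size:
        label = rng.choice(labels)          # a repeat is possible on any draw
        if label not in seen:               # ignore duplicates and draw again
            seen.add(label)
            chosen.append(label)
    return chosen

rng = random.Random(0)                      # fixed seed so the run is repeatable
labels = list(range(1, 301))                # numbered labels 1..300
data = {label: 10.0 + label for label in labels}   # hypothetical data values

sample_labels = draw_distinct_labels(labels, 25, rng)
sample_values = [data[label] for label in sample_labels]   # data for each label
print(sample_labels)
```

Sampling the labels rather than the data values is what makes the duplicate check reliable, since two different items can share the same data value.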
• The best way I know to look for a duplicate is to sort the data. The sort routine is under the DATA menu or can be found on the tool bar.
• To find the actual data associated with the label, we can use the function =VLOOKUP. Suppose that my labels are in cells A2:A301 and the data from which I want the random sample is in cells B2:B301. Suppose also that I started the output from the Sampling tool in C2 and drew a sample of 25 (so the sampled labels are in cells C2:C26). I will also assume that I don’t have any duplicates. Then in cell D2 I would enter the function =VLOOKUP(C2,$A$2:$B$301,2). This function says look for what is in cell C2 in the first column of A2:B301. When you find the number, report back the corresponding number in the second column of A2:B301 (the 2 is what tells it to report back what is in the second column). Then I would copy cell D2’s contents down through cell D26.
Inference Tools
• The majority of the tools in Excel are for statistical inference. I will discuss how to use the tools for confidence intervals on one mean, hypothesis tests on one and two means, analysis of variance, and regression.
Confidence Intervals
• I have already described the Descriptive Statistics Tool, which is what we use to do confidence intervals. The tool is useful for cases where we have the data and we do not know the population standard deviation. Then we use the first and last two numbers in the Descriptive Statistics output to create the confidence interval. Below I have repeated part of the printout from above. The first number is the sample average. The last number, which Excel calls Confidence Level(xx%) (which I consider to be a very poor name), is the margin of error.
Height
Mean                       1.2
Standard Error             0.01206
...
Count                      11
Confidence Level(99.0%)    0.03822

Hypothesis Test on One Mean: This procedure is used when you do not know the population standard deviation and you have all of the data given. Before going to the Tools menu you need to add another column which consists only of the hypothesized value µ0, next to each value of the original data. The easiest way to do this is to enter µ0 once, and then use the fill down command to put it in the rest of the cells. Then from the Data Analysis Tools select "t-Test: Paired Two-Sample for Means" in Excel. Variable 1 input will be the column where the original data are located. Variable 2 input will be the column where the hypothesized value is located. The (hypothesized) difference should always be 0 or can be left blank. Finally, if you labeled your columns and included them in the Variable 1 and Variable 2 input portions, then click the labels box, and give a level of significance (α) value. Indicate where you want the output to go.
Example: Pineapple Corporation (PC) maintains that their cans have always contained an average of 12 ounces of fruit. The production group believes that the mean weight has changed. The drained weights in ounces for a sample of 15 cans of fruit from PC had a mean value of 12.09 and a standard deviation of .20. Use an appropriate hypothesis test to determine if the data show evidence of a change in mean weight. Use a significance level of .01. The output is presented on the next page.
t-Test: Paired Two-Sample for Means

                                Weight      Hypothesized Mean
Mean                            12.08667    12
Variance                        0.041238    0
Observations                    15          15
Pearson Correlation             #DIV/0!
Pooled Variance                 0
Hypothesized Mean Difference    0
df                              14
t                               1.652907
P(T<=t) one-tail                0.060295
t Critical one-tail             2.624492
P(T<=t) two-tail                0.120591
t Critical two-tail             2.976849

• Conclusions:
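The t value in this output is the ordinary one-sample t statistic, which the Python sketch below recomputes from the summary numbers. The function name is hypothetical, and the critical value is copied from the output rather than computed, since the standard library has no t distribution.

```python
import math

def one_sample_t(sample_mean, sample_variance, n, mu0):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    return (sample_mean - mu0) / math.sqrt(sample_variance / n)

t_stat = one_sample_t(12.08667, 0.041238, 15, 12)
t_critical_two_tail = 2.976849      # copied from the output (alpha = .01, df = 14)

reject = abs(t_stat) > t_critical_two_tail
print(round(t_stat, 4), reject)
```

Because the inputs are the rounded values printed by Excel, the recomputed statistic matches the output only to about four decimal places.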
Testing Two Means (with unpaired or unmatched samples)
If we want to test the relationship between two means, we have two choices: "t-Test: Two-Sample Assuming Equal Variance" or "t-Test: Two-Sample Assuming Unequal Variance." The choice obviously depends on what we believe the relationship is between the population variances of the two groups. Whatever we decide, the procedure in Excel is identical once we have made our choice. Variable 1 input will be the column (or row) where the first set of data is located. Variable 2 input will be the column (or row) where the second set of data is located. The hypothesized difference will usually be 0, but not always. Finally, if you labeled your columns (or rows) and included them in the Variable 1 and Variable 2 input portions, then click the labels box, and give a level of significance (α) value. Indicate where you want the output to go.
• Consider the following example. A manager is interested in determining whether the productivity of workers that work during two different shifts is the same. To test her hypothesis, the manager randomly samples 8 workers from each shift and records the average time (in minutes) needed to complete a given assembly-line task, with the results given below.
[Table: completion times in minutes for the 8 sampled workers on Shift 1 and the 8 sampled workers on Shift 2]
From the data, can we conclude that the two shifts have the same productivity level? It looks like the second shift completes the task in less time, but is the difference due to sampling, or because the mean times are really different? The output from the two procedures is given on the next page.
7875 207.178813 Mean Variance Observations Hypothesized Mean Difference df t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Analysis Tools .3564 89.Stat 5969 Statistical Software Packages t-Test: Two-Sample Assuming Equal Variances Shift 1 Shift 2 61.144789 Mean Variance Observations Pooled Variance Hypothesized Mean Difference df t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail t-Test: Two-Sample Assuming Unequal Variances Shift 1 Shift 2 61.761309 0.082575 2.325 49.18 .039536 1.3564 89.325 49.7875 207.50125 8 8 148.782287 0.079072 2.041288 1.4288 0 14 1.50125 8 8 0 12 1.894011 0.894011 0.
Testing Two Variances
• To do this type of problem on the computer, go to the Data Analysis Tools, and select "F-test Two-Sample for Variances." Variable 1 input will be the column (or row) where the first set of data is located. Variable 2 input will be the column (or row) where the second set of data is located. Try to use the variable with the largest sample variance as variable 1. Finally, if you labeled your columns (or rows) and included them in the Variable 1 and Variable 2 input portions, then click the labels box. Give the value of α and indicate where you want the output to go.
• For this procedure, Excel only calculates one-sided values. If the test is two-sided (as it usually is) you have two options. First, you can divide the given value of α by 2, and input the result as the level of significance. The second option is to always use the p-value criterion and, for a two-sided test, multiply the one-sided p-value by 2.
• For the example:

F-Test: Two-Sample for Variances

                       Shift 1    Shift 2
Mean                   61.325     49.7875
Variance               207.356    89.501
Observations           8          8
df                     7          7
F                      2.317
P(F<=f) one-tail       0.145
F Critical one-tail    3.787
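The two-sided adjustments described above are easy to check numerically. In this Python sketch the variances and the one-tail p-value are copied from the example output, and α = 0.05 is an assumed significance level chosen only for illustration.

```python
variance_1 = 207.356        # larger sample variance goes in as variable 1
variance_2 = 89.501
one_tail_p = 0.145          # P(F<=f) one-tail, copied from the output
alpha = 0.05                # assumed two-sided significance level

f_stat = variance_1 / variance_2             # F = s1^2 / s2^2

# Option 1: enter alpha / 2 in the dialog so the one-sided critical value fits
alpha_to_enter = alpha / 2

# Option 2: double the one-sided p-value and compare it with alpha
two_sided_p = min(1.0, 2 * one_tail_p)
reject = two_sided_p < alpha

print(round(f_stat, 3), alpha_to_enter, round(two_sided_p, 2), reject)
```

Putting the larger variance first keeps F above 1, so the reported one-tail probability is the upper-tail area that gets doubled.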
ANOVA:
• Excel can do one and two-way analysis of variance. I only describe the single factor case below. If you are interested in two-way ANOVA, Excel’s help should guide you through it. It should also be very similar to what is described below.
• Each group should be in its own column or row. After selecting Data Analysis, choose the option called "Anova: Single Factor" in Excel. Next specify the input block, which will contain the data from all groups. If the groups have differing numbers of samples, be sure to highlight enough cells to include all samples. Excel will handle the blank spaces without a problem. Check the box indicating whether the groups are entered in columns or rows, and check the label box if you have included labels in your input block. Indicate where to send the output, and then input a value of α. Then start the procedure.
• Example
Three different automatic milling machines at Castmetal, Inc. were set up to mill the same type of part. Observations were taken at random times to find out how many parts were being produced per hour by each machine. Only four observations were taken on machine 3 since the inspector became ill and had to go home before he could complete his work. Can we conclude that the mean hourly output for the three machines is different?

Machine 1    Machine 2    Machine 3
105          91           104
105          99           106
110          89           99
107          95           109
102          103

These data were entered into Excel in cells A1:C5.
• Below is the dialog box and output for the example.

Anova: Single-Factor

Summary
Groups       Count    Sum    Average    Variance
Machine 1    5        529    105.8      8.7
Machine 2    5        477    95.4       32.8
Machine 3    4        418    104.5      17.6667

ANOVA
Source of Variation    SS          df    MS          F           P-value     F crit
Between Groups         313.8571    2     156.9286    7.882257    0.007518    7.205699
Within Groups          219         11    19.90909
Total                  532.8571    13

• Conclusions:
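The sums of squares and the F ratio in the ANOVA table can be reproduced directly from the milling machine data. The Python sketch below uses the standard single-factor formulas; the function name and the dictionary layout are choices made for this illustration.

```python
def one_way_anova(groups):
    """Return SS, df, MS and F for a single-factor ANOVA on a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f": ms_between / ms_within,
    }

machine_1 = [105, 105, 110, 107, 102]
machine_2 = [91, 99, 89, 95, 103]
machine_3 = [104, 106, 99, 109]        # only four observations

result = one_way_anova([machine_1, machine_2, machine_3])
for name, value in result.items():
    print(name, round(value, 6))
```

Unequal group sizes need no special handling here, which mirrors the way the tool accepts the blank cell under Machine 3.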
Regression:
• Doing regression in Excel is very similar to using the other analysis tools. With regression, however, having the data in the right form is more important. First, all data should be entered in columns. Second, all independent variables should be next to each other (i.e., in a contiguous set of cells).
• Once the data are entered correctly, select "Regression" from the Tools/Data Analysis menu item in Excel. You will be presented with the dialogue box shown below. In the Input Y Range, enter the cell range referring to the column containing the dependent variable. In the Input X Range, enter the range of cells containing all independent variables. This is why the X variables need to be next to each other. If your range of cells included a row of labels, click the label box.
I never click the Constant is Zero box. In some physical systems it only makes sense for the intercept to be 0, so we can force it to be so. In our examples that will never be the case. If you want a confidence interval for the β values other than a 95% confidence interval, click in the Confidence Level box and enter a different confidence level. Next, indicate where you want the output to go. Finally, click on the box next to “Residuals.” I leave all other boxes blank, because I don’t like the way that Excel does the rest of the residual analysis or the normal probability plot. Then hit enter.
• Below is some sample output.

Regression Statistics
Multiple R           0.936248
R Square             0.87656
Adjusted R Square    0.874991
Standard Error       0.670277
Observations         240

Analysis of Variance
              df     Sum of Squares    Mean Square    F           Significance F
Regression    3      752.919           250.973        558.6225    7.1E-107
Residual      236    106.028           0.449271
Total         239    858.947

[Table: Coefficients, Standard Error, t Statistic, P-value, Lower 95%, and Upper 95% for Intercept, Day, Hour, and Distance]
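The summary statistics at the top of the regression output all follow from the two sums of squares in the Analysis of Variance table. The Python sketch below recomputes them; the variable names are hypothetical, and the inputs are the rounded values from the sample output, so agreement is only to the printed precision.

```python
import math

ss_regression = 752.919     # from the Analysis of Variance table
ss_residual = 106.028
n = 240                     # Observations
k = 3                       # independent variables: Day, Hour, Distance

ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)
f_stat = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)     # standard error of the regression

print(round(multiple_r, 6), round(r_square, 5), round(adjusted_r_square, 6))
print(round(standard_error, 6), round(f_stat, 4))
```

Seeing the statistics derived this way makes it clear that R Square, the Standard Error, and F are three views of the same split of the total sum of squares.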