
Excel Reference Manual: A gentle overview

Table of Contents

1. Introduction to MS Excel
   What is Excel?
   Importing Data

2. Data Analysis and Statistical Concepts
   Concept 1: Measurements of Central Tendency
   Concept 2: Measurements of Dispersion
   Concept 3: Visualization of Univariate Data
   Concept 4: Visualization of Multivariate Data
   Concept 5: Random Number Generation and Simple Sampling
   Concept 6: Confidence Intervals

Developed and maintained by the Center for Statistics and Analytical Services of Kennesaw State University

These reference manuals have been developed to assist students in the basics of statistical computing, a sort of "Statistical Computing for Dummies." It is not our intention to use this manual to teach statistical concepts [1], but rather to demonstrate how to apply previously taught statistical and data analysis concepts the way that professionals and practitioners apply them: through the able assistance of computing. Proficiency in software allows students to focus more on the interpretation of the output and on the application of results rather than on the mathematical computations. We should pause here and strongly make the point that computers should serve as a medium of expediency of calculation, not as a substitute for the ability to execute a calculation. In the Basic Concepts manual, we present statistical concepts, context for their use, and formulas where appropriate. We provide exercises to execute these concepts by hand. Then, in each subsequent manual, the concepts are applied in a consistent manner using each of the five major statistical computing packages: Excel, SPSS, Minitab, R and SAS.

[1] Readers of this manual are assumed to have completed some introductory statistics course. For individuals wishing to review statistical concepts, we recommend Introduction to Stats by DeVeaux, Velleman and Bock.

Microsoft Excel 2010

What is Excel?

This spreadsheet software package is ubiquitous. It represents a very basic and efficient way to organize, analyze and present data. Employers today expect that, at a minimum, new hires with college degrees will have a working knowledge of Excel. Excel is used anywhere that data is available, which is everywhere. Excel is found in offices, libraries, schools, universities, home offices and everywhere in between. In addition to its role as a data analysis package, Excel is often used as a starting point to capture and organize data, which is then imported into more sophisticated analysis packages such as SPSS, Minitab or SAS. After the analysis is complete, datasets can be exported back to Excel and shared with others who may not have access to (or the ability to use) other analysis packages (we gently refer to this group as the great statistical unwashed). For product information regarding Excel, please visit: http://office.microsoft.com/en-us

When you open Excel, the interface includes rows and columns, with cells at the intersections. You can input data or formulas into the individual cells. Here is a screenshot of a blank Excel 2010 page:

You can move easily through much of the functionality in Excel 2010 by clicking on the tab headers at the top. Most of the time, you will be on the Home tab.


Getting Data into Excel


At this point, we need to access the WidgeOne.xls dataset in Excel 2010. To access the dataset, click on the File tab at the top left of the sheet and select Open. You should see the following screen:

Use the explorer window to browse to where the WidgeOne.xls file is saved.

If your data set is not an Excel file, but a .txt, .dat or even a .csv file, Excel can still import it with the Text Import Wizard. This function is very useful, as data retrieved from the internet could be in any number of different formats. Let's pretend that the WidgeOne data is a .txt file that has been copied from somewhere and pasted into Notepad. It would look something like this (yuck, the variable names don't even match the columns!):

To properly import this text file, select the Data tab on the ribbon, and then click on the From Text icon:

This will open up a browser where you can navigate to where the text file is located.


Select your file, click Open, and the Text Import Wizard box will pop up.

Notice that there are two choices for file type: delimited or fixed width. Delimited means that the column values are separated by a character, such as a comma, a tab, an asterisk, or anything, really. Fixed width indicates that the values are lined up in columns with a uniform number of spaces between them. The columns in the WidgeOne.txt file are not perfectly aligned with the column headings, so fixed width would not work. Everything is separated by a tab's worth of space, however, so the delimited file type will work here. Select Delimited and click Next.

Excel is clever enough to guess that tabs delimit this text file and has organized the data accordingly. Of course, this step has the option to indicate a different delimiter if needed. Select the appropriate delimiter and click Next to go to step 3.

In this step, the data format can be changed. Generally, General will generate (OK, we'll stop) the correct format, unless there is a need for something special, like a date. Click Finish, and be sure to save the result as an Excel Workbook so that you have a permanent Excel copy of the data with an .xlsx file extension.
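The idea of a delimited file can also be illustrated outside of Excel. The following is a minimal Python sketch, not part of the Excel workflow, that splits tab-delimited text into a header row and data rows. The sample rows and the helper name read_delimited are hypothetical and do not come from the real WidgeOne file.

```python
import csv
import io

# Hypothetical tab-delimited text standing in for a file such as WidgeOne.txt.
# These rows are invented for illustration; they are not the real data.
raw_text = "PLANT\tGENDER\tYRONJOB\nDallas\tF\t4.5\nNorcross\tM\t12.0\n"


def read_delimited(text, delimiter="\t"):
    """Split delimited text into a header row and a list of data rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return rows[0], rows[1:]


header, records = read_delimited(raw_text)
print(header)   # column names from the first line
print(records)  # one list of strings per data line
```

Passing a different delimiter (for example ",") plays the same role as choosing another delimiter in step 2 of the wizard.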


Once you have opened the WidgeOne file in Excel, you should see this:


Data can be read in directly from a website much the same way.

First, click on the Data tab in the ribbon.

Then click on the From Web icon.

A box will pop up. This is where you insert the web address of the data you want to import (we used http://quickfacts.census.gov/qfd/states/13000.html which is the 2010 Census data for Georgia). Once Excel navigates to the website, simply click the yellow arrows to select the tables you want imported, then click Import.


Type the web address and click Go

Choose a table by clicking the yellow arrow

Click Import

After that, you simply specify which Excel spreadsheet you want the data stored in.


Concept 1: Using Excel for Measurements of Central Tendency


The three measurements of central tendency can be executed in Excel using pre-programmed formulas and the fx button. Prior to executing the Mean, Median and Mode, let's insert an additional column on the left-hand side. To do this, first place your cursor on the A at the top of the first column and click, so that the entire column is highlighted. Now, under the Home tab, in the Cells tools, select Insert:

At this point, the entire dataset should have shifted to the right, and the new column A is blank. Go ahead and name it Measurements by double clicking on cell A1 and typing Measurements. Now, go to the bottom of the dataset to cell A43. In cells A43, A44 and A45, type Mean, Median and Mode, respectively:


Not all variables will lend themselves to these calculations; remember that we only execute mean and median calculations on quantitative variables. So, it would be helpful if we could see the column headers to remind us what is in each column. This can be done using a split screen. To do this, scroll back to the top row, click on the View tab and then, within the Window tools, select Freeze Panes > Freeze Top Row:

For which columns should we report the measurements of central tendency? The quantitative variables include JOBGRADE, SOCREL (social relations score [2]), YRONJOB (number of years on the job), PRDCTY (productivity) and JOBSAT (job satisfaction). The calculation of the mode for the qualitative variables (PLANT, GENDER and POSITION) will be addressed below. Move your cursor to cell F43. This is where we will place the mean for the JOBGRADE variable. With your cursor in this cell, click on the fx button. From the dialogue box, select Statistical. From the list of function names, click on the second entry, AVERAGE. You will see this:

[2] Psychology, Sociology and Marketing majors will recognize that this is Likert data. For the purposes of this manual, Likert data will be treated as quantitative. However, it should be noted that pure mathematicians treat Likert data as qualitative.

Once you select AVERAGE, the next dialogue box will request an array of numbers. Excel is pretty clever. You may already have the array populated in the first field (Number1). For the JOBGRADE variable, this will be cell F2 through cell F41. If it is not already populated for you, simply click on the little spreadsheet button and highlight cells F2 through F41. Note that cell F42 is empty. If it is included, it will be ignored. However, if there were a 0 in cell F42, it would be included, and a different mean would be calculated. It is always best to include only the relevant cells in your calculations.

After you have selected cells F2 through F41 as the array for the mean calculation, click OK. You should now see 6.6.
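The difference between an empty cell and a cell containing 0 can be checked with a small Python sketch. The values below are hypothetical (they are not the JOBGRADE data), and the helper average_like_excel is an illustrative name that only mimics the blank-skipping behavior described above.

```python
from statistics import mean

# Hypothetical column values; None stands in for an empty cell.
column_with_blank = [6, 7, 5, 8, None]
column_with_zero = [6, 7, 5, 8, 0]


def average_like_excel(cells):
    """Average the numeric cells, skipping empty ones (as AVERAGE skips blanks)."""
    numbers = [c for c in cells if c is not None]
    return mean(numbers)


print(average_like_excel(column_with_blank))  # 6.5 (the blank is ignored)
print(average_like_excel(column_with_zero))   # 5.2 (the 0 counts as a value)
```

The same four grades give two different means, which is why only the relevant cells should be included in the range.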

Now, let's copy this function across to column J. With your cursor in cell F43, go to the Home tab and select Copy from the Clipboard tools. Highlight cells G43 through J43. Then select Paste from the Clipboard tools.

To populate the Median cells, we will use the same process. Place your cursor in cell F44 and click on the function button. From the Statistical functions, select MEDIAN and select the same array, F2:F41. Click OK. Copy and paste the function in cell F44 across to cell J44.

Although it is not typically the best measurement of central tendency for quantitative data, you can provide the mode for these variables using the same process. Both MODE.MULT and MODE.SNGL are available in the Statistical category. MODE.MULT returns all of the most frequent values, while MODE.SNGL returns a single most frequent value. In this instance, we are looking for the classical single-value mode, so MODE.SNGL is the function to use.
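As a cross-check on what these three functions compute, here is a minimal Python sketch using the standard statistics module. The grades list is hypothetical, and the Python functions are only rough analogues of AVERAGE, MEDIAN, MODE.SNGL and MODE.MULT, not the Excel implementations.

```python
from statistics import mean, median, mode, multimode

# Hypothetical job grades, chosen so that two values tie for most frequent.
grades = [5, 6, 6, 7, 7, 8]

print(mean(grades))       # 6.5, the arithmetic average
print(median(grades))     # 6.5, the midpoint of the two middle values (6 and 7)
print(mode(grades))       # 6, a single most frequent value
print(multimode(grades))  # [6, 7], every value tied for most frequent
```

When a dataset has ties like this one, a single-value mode hides part of the picture, which is one reason the mode is rarely the preferred summary for quantitative data.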

Your screen should now look like this:

While these are all mathematically correct, which is the best measurement of central tendency? For this, we need to better understand the dispersion.


Concept 2: Using Excel for Measurements of Dispersion


Recall from the Basic Concepts Manual that the most common measurements used to describe the dispersion of a variable include the standard deviation and the frequency table. The standard deviation will be calculated in Excel using the function button. Returning to the WidgeOne.xls dataset, enter a label for the Standard Deviation below the measurements of central tendency.

You probably noticed that the words Standard Deviation do not fit neatly into cell A47; they spill over into B47 and C47. Remember that what you see in Excel is not necessarily what Excel sees. In reality, cells B47 and C47 are still empty from Excel's perspective. But this looks a little untidy, and there are several ways to tidy it. We can expand column A until the words are visually contained within the column. This is accomplished by aligning the cursor between the A and the B at the top of the spreadsheet until the cursor changes to a double-headed arrow, and then double-clicking. Column A will widen enough to accommodate the longest string of characters in the column, in this case Standard Deviation. A second method of accommodating the text is to wrap the text within the cell. This is accomplished by going to the Home tab, and the Alignment tools, and selecting Wrap Text. After the text has been wrapped, you can then slightly widen the columns or narrow the rows (using the same process as for the columns), as needed.

Once the label has been established, select the function button. Within the Statistical category, there are several standard deviation options to choose from. STDEV.P is the standard deviation of the population, STDEV.S is the standard deviation of a sample of the population, STDEVA includes logical values, such as TRUE or FALSE, in the calculation of the standard deviation for a sample, and STDEVPA includes logical values in the calculation of the standard deviation for the population. We don't have any logical values, and we want the standard deviation for all of the observations, so select the STDEV.P option and the same range as before (F2:F41) and click OK. The default is 2 decimal places, so you should see 1.55. This is the standard deviation of the JOBGRADE variable. As before, copy this formula across to column J.
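The distinction between the population and sample standard deviations can be seen in a short Python sketch. The values list is hypothetical, and pstdev and stdev from the standard statistics module are used here as analogues of STDEV.P and STDEV.S (dividing by n and by n - 1, respectively).

```python
from statistics import pstdev, stdev

# Hypothetical observations (not the JOBGRADE column).
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = pstdev(values)  # divides the squared deviations by n
sample_sd = stdev(values)       # divides the squared deviations by n - 1

print(round(population_sd, 2))  # 2.0
print(round(sample_sd, 2))      # 2.14
```

The sample version is always slightly larger for the same data, and the gap shrinks as the number of observations grows.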

We now have the basic descriptive statistics for the quantitative variables. One day, you may have a different number of decimal places for other values. As statisticians, we like things to be tidy, so the entire data set would need to be adjusted. To format all of the data to have a consistent number of decimal places, click on the cell in the far upper left corner: the cell to the LEFT of the A column and ABOVE the first row. This will highlight the entire spreadsheet. Then, from the Home tab, select the comma button from the Number tools. This will make all of the numbers in the spreadsheet display two decimal places.

Now, your very tidy spreadsheet should look like this:

Note that if you need to add or remove decimal places, you can easily do so by selecting the cells of interest and then clicking on the Increase Decimal or Decrease Decimal button, as circled above.

In practice, if you need to provide multiple descriptive statistics on a variable, this is not the process that you would go through. For multiple descriptive statistics, you would go to the Data tab and, from the Analysis Tools, select Data Analysis [3]. This path will bring up the following:

Select the Descriptive Statistics option.

You will then see the following dialogue box:

[3] In the event that you do not see the Analysis Tools under the Data tab, click on the File tab in the upper left corner. Select Options at the bottom, then Add-Ins, and then Go. Ensure that the Analysis ToolPak is ticked and click OK. It should be there now. Note that if you have an unauthorized copy of Excel, you probably won't have access to this very important functionality.

Highlight the quantitative variable(s) of interest (F1:J41)

Identify that you have labels in the first row

Identify that you want to produce summary statistics

Now click OK.


You should now see this:

Again, pretty untidy. Format the spreadsheet to have two decimal places for all values and expand the columns to accommodate the labels. Your tidy version should look like this:

Notice that we have reproduced all of the measurements from before as well as several more [4]. This is a more efficient way to produce the descriptive statistics of one or more variables.

[4] For detailed information on the additional statistics produced, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.

In the Basic Concepts Manual, we presented the concept of a frequency table as another method of displaying the spread of a dataset. As discussed, frequency tables are one of the most commonly used methods to display data; understanding how to create a frequency table from a quantitative variable is a critical skill. The table in the Basic Concepts Manual was created in Excel. We will reproduce it here.

The first step in creating a frequency table from a quantitative variable is to determine the categories that need to be developed for the quantitative variable (this process will effectively transform a quantitative ratio-scale variable into a qualitative ordinal variable). Previously, we determined that the job tenure variable (YRONJOB) should be categorized into three levels: less than 5 years, 5-10 years and more than 10 years. Recall that the categories must be mutually exclusive and collectively exhaustive. To accommodate these categories in Excel, we will create bins, where the TOP of each category identifies each bin.

In our WidgeOne.xls dataset, let's create a bin range for YRONJOB in column L:

These are the bins for the Histogram for Job Tenure. Category 1 is 0-4.99, Category 2 is 5-10.00 and Category 3 (which does not need to be entered) is everything above 10.00. Notice that these are the tops of the bins; each one is the largest value that still falls within its bin.
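The way bin tops divide the data can be sketched in Python. This sketch assumes that a value is counted in a bin when it is less than or equal to that bin's top and greater than the previous top, with one extra bin for everything above the last top. The tenures list and the helper bin_counts are hypothetical; only the bin tops 4.99 and 10.00 come from the text.

```python
from bisect import bisect_left

# Bin tops from the text; the final "more than 10" bin needs no entry.
bin_tops = [4.99, 10.00]
labels = ["Less than 5 years", "5-10 years", "More than 10 years"]

# Hypothetical years-on-job values, including values exactly on a bin top.
tenures = [0.5, 4.99, 5.0, 10.0, 10.5, 22.0]


def bin_counts(values, tops):
    """Count values per bin, treating each top as an inclusive upper bound."""
    counts = [0] * (len(tops) + 1)  # one extra bin for values above the last top
    for v in values:
        # bisect_left returns the first index whose top is >= v
        counts[bisect_left(tops, v)] += 1
    return counts


for label, count in zip(labels, bin_counts(tenures, bin_tops)):
    print(label, count)
```

Because every value lands in exactly one bin, the categories are mutually exclusive and collectively exhaustive, as required.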


Once these bins have been created, select the Data Tab, and then from the Analysis Tools, select Data Analysis and then Histogram:

Click OK.


This will bring up a dialogue box, asking for information regarding the quantitative variable to be analyzed, and the associated Bin Range:

Highlight the range of the YRONJOB variable (including the label)

Highlight the Bin Range (including the label)

Ensure that the Labels option is checked


Now you should see this:

Again, a little untidy, but this is the base of what we need for the frequency table. Let's clean this up and add some columns to reproduce the table from the Basic Concepts Manual. First, replace the bin titles with the real category labels of Less than 5 years, 5-10 years and More than 10 years. Second, expand the columns as needed. Third, total the bottom of the frequency column using the SUM function: in cell B5, type =SUM(B2:B4) (the SUM function can be found in the Math & Trig category of functions). Next, create two additional column headers: Relative Frequency and Cumulative Frequency. Your sheet should look like this:

The Relative Frequency column will display the percentage of observations in each category, an important piece of information, particularly when comparing populations of different sizes. This is done by simply taking each frequency and dividing it by the total. For example, in cell C2, we would type =B2/B5. This would result in .2750 (11/40). Rather than typing this same formula again and again to capture the relative frequencies of the next two categories, we would like to copy this formula into cells C3 and C4. Do this now.

Did you get #DIV/0!? The problem is that when the formula =B2/B5 is copied down one cell, it becomes =B3/B6. There is nothing in cell B6. Since any number divided by 0 is undefined, we receive this error message. If we want to copy the formula into the cells below, we need to nail down the reference to the Total cell and prevent the reference from changing. To do this, we place a $ in front of the B and another $ in front of the 5: $B$5 instead of B5. This can also be accomplished by placing the cursor in between the B and the 5 and pressing the F4 key. Once you have nailed down the Total cell as a reference cell, you can copy the formula into cells C3 and C4.

The Cumulative Frequency column will display the cumulative percentage of observations from 0 to the top of the category in question. This is accomplished by adding the relative frequency of a category to all of the relative frequencies before it. In Excel, we would type =C2 in cell D2; the first entry in the Cumulative column will always equal the first entry in the Relative Frequency column. In cell D3, we would enter =D2+C3. This will add the cumulative value (D2) to the Relative Frequency for the category (C3). We can now copy this formula into cell D4.

Clearly, this is a lot of manual work in Excel for a relatively small table. However, our focus is on helping to build the Excel skills necessary to execute this kind of analysis for any size table or dataset.
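The arithmetic behind the Relative Frequency and Cumulative Frequency columns can be sketched in Python. Only the first count (11 of 40) is taken from the text; the other two counts are illustrative assumptions chosen to total 40. Dividing every frequency by the same total mirrors the fixed $B$5 reference, and the running sum mirrors =D2+C3.

```python
from itertools import accumulate

# First count (11 of 40) is from the text; the other two are illustrative.
frequencies = [11, 15, 14]

total = sum(frequencies)  # plays the role of the fixed Total cell

# Every frequency is divided by the SAME total, like =B2/$B$5 copied down.
relative = [f / total for f in frequencies]

# Each cumulative entry adds the current relative frequency to the running sum.
cumulative = list(accumulate(relative))

for f, r, c in zip(frequencies, relative, cumulative):
    print(f, f"{r:.4f}", f"{c:.4f}")
```

The last cumulative entry should always come out to 1 (100%), which is a quick check that no category was dropped.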


You should now have this:


You can probably guess what is next: let's make it a bit more tidy and presentable. First, let's convert the decimals to percentages, since that is the way most people would expect to see the data. Highlight cells C2 through D5. Then go to the Home tab and click on the % sign in the Number tools. This should convert all of the numbers to percentages with no decimals. If you would like to see decimals, you can add them by selecting the Increase Decimal button in the Number tools on the Home tab.

Second, let's format the text to ensure that it is all the same (right now some text is italicized and may not be in the same font). Highlight the entire table of data (cells A1:D5). From the Font tools, select a common font (we prefer Palatino Linotype). Also, you can take off the italics by clicking on the I in the Font tools (you may have to click it twice).

Finally, if you want to standardize the appearance of the gridlines, select the Border Box from the Font tools. From the pull-down menu, identify that you want no borders. Then, go back and identify that you want a Thick Box Border.

At this point, your table should look something like this:


For nascent users of Excel, we understand that this seems like a lot of work. To this mild protest, we have two points. First, most recipients of your analysis will ONLY see your tables and/or graphics (next section). So you need to spend as much time making your analysis look clean and professional as you do ensuring that it is mathematically and logically correct. Second, as you will see in the subsequent Manuals, some of these executions, which appear awkward in Excel, are quite easy in other software applications.

Concept 3: Using Excel for Visualization/Organization of Univariate Data


In this section, we will provide the steps needed to create Histograms, Pie Charts and Bar Charts. The Stem and Leaf Plot and the Box Plot, as outlined in the Basic Concepts Manual, while important, are not easily executed in Excel. These visualization tools are, however, easily executed in the other software applications and will be addressed in subsequent Manuals.

To reproduce the histogram from the Basic Concepts Manual, we will follow most of the same process that was used to create the frequency table in the previous section. Starting with the Plant_Survey sheet open, go to the Bins that were developed in the previous exercise. Let's create five categories instead of just three:

Remember that when creating bins in Excel, we identify the TOP of each category, and the highest category does not need to be identified.

Once these bins have been created, select the Data Tab, and then from the Analysis Tools, select Data Analysis and then Histogram:

Click OK. Following the same process as was used to create the frequency table, identify the Input Range and the Bin Range, and ensure that the Labels box is checked. This time, also check the Cumulative Percentage and Chart Output options:


Selecting the Chart Output option will convert the frequency table into a histogram.

Now click OK.


You should see this:

You guessed it: a little untidy. Let's format a few things on our histogram. First, as before, let's change the Bin names to what we really want: Less than 3, 3-6, 7-10, 11-14 and 15+. These changes can be made in the frequency table; the histogram will be automatically updated because the graphic is dynamically linked to the table.

Second, highlight the legend and delete it (it does not really communicate any meaningful information). Third, double click on the x-axis and format the font as needed (we prefer Palatino Linotype). Do the same for the other axis. Finally, double click on one of the bins and you'll see this:

Change the Gap Width to No Gap, and then change the border color to a solid black line.


You should get something like this:

[Figure: histogram titled "Job Tenure of Widge One Employees". X-axis: Years on Job (Less than 3, 3-6 Years, 7-10 Years, 11-14 Years, 15+ Years). Y-axis: Frequency, 0 to 14.]

Well Done!


To produce a pie chart, begin by bringing up the sheet which contains the frequency chart that you created in the previous section. Go to the Insert tab:


Highlight cells A1:B4. Do not include the total. Then, select Pie Chart from the Chart tools:


Select the first option, the basic 2-D chart. You should now see this:

Now, the primary issue with this pie chart is that we have no information regarding the percentages that comprise each slice, which is the whole reason to use a pie chart. To insert percentage values, go to the Layout tab, select Data Labels, and then go to the bottom of the drop-down and select More Data Label Options. You should see this:

Deselect Value and select Percentage.

After you have identified that you want a percentage value, click on Close. You can click on the Frequency title and change it to something more meaningful like Years on Job. You should now see this:


[Figure: pie chart titled "Pie Chart of Job Tenure, N=40 Employees". Legend: Less than 5 years, 5-10 years, More than 10 years. Slice labels: 27%, 38%, 35%.]

Well done! Keep in mind that some recipients of your data may be colorblind. Although Excel typically does not place colors such as green, red and brown together, should you need to override the default colors provided in Excel (or include patterns to accommodate printing in black and white), simply go to the Design tab and make an alternative selection.

Bar charts are created using a very similar process. We will create a bar chart of the same information (bar charts are not histograms on their sides).

To reproduce the bar chart in the Basic Concepts Manual, begin by bringing up the sheet that contains the frequency chart that you used to create the pie chart. Go to the Insert tab. Highlight the same data as before, A1:B4 (be sure not to include the totals). From the Charts tools, select Bar and then select the first option. You should see the following chart:

Where Pie Charts are used to explain relative proportions (percentages), Bar Charts are used to communicate counts. So the units in this chart are fine. You may want to double click on the Frequency title and give it a more meaningful name. Also, you may want to delete the frequency legend, since it does not communicate any meaningful information.

Concept 4: Using Excel for Visualization/Organization of Multivariate Data


In this section, we will provide the steps necessary to create Contingency Tables, Stacked Bar Charts, 100% Stacked Bar Charts and Scatterplots in Excel 2010. Contingency Tables are one of the most common and useful methods of communicating the relationships between and among variables in a dataset. In Excel 2010, the pivot table tool used to create a contingency table is particularly useful and very flexible (this is one of the few examples where Excel may rival or outperform the more sophisticated applications). To create a contingency table using a pivot table, return to the Plant_Survey page of the WidgeOne.xls dataset. From the Insert tab, select Pivot Table from the Tables tools. You should see this:


You can leave everything as its default, but for the Table/Range box, click on the little spreadsheet button as circled and highlight the entire dataset. It is particularly important to make sure that you include the titles from the first row. Select OK. At this point you should see something that looks like this:

In the event that your sheet does not look like this, do not let your heart be troubled. We can fix this. Place your cursor inside the table and right click. Go to Pivot Table Options. Go to the Display tab, tick the box for Classic Pivot Table Layout and click OK. You should now have the screen as shown above.

You can think of this as an empty template with rows and columns, waiting to be filled with data. Let's begin by placing the Plant variable in the column and the Gender variable in the row:
You can click and drag the variables into the right position either in the listing or in the table itself.

Now, if we are simply trying to ascertain counts for a basic contingency table, should we place the Plant or the Gender variable in the center of the table (where it says "Drop Value Fields Here")?


The answer is: it does not matter. The counts will be the same. We placed the Gender variable in the center and generated the following table:

Try placing the Plant variable in that position; you should generate the same numbers. Cool. From this table it is easy to see that the Mode of the Plant variable is Dallas and that there is no single mode for the Gender variable, since we have the same number of Males and Females.

As we did before, let's look at this data in a few different ways. First, change the data to be a percentage of row. This can be done by clicking on the Count of Gender entry as circled above. Select Value Field Settings. You should see the following:


Click on the Show Values As tab.


Select Show values as % of row as indicated below:

Select OK.


You should now see this:

This now tells us that, of all of the females, 65% work in Dallas. You could change this display to be the percent of columns or the percent of totals. They all communicate subtly different messages.
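For readers who want to check the logic behind "% of row" outside of Excel, here is a minimal Python sketch. It assumes a handful of made-up (gender, plant) records, which are NOT the WidgeOne data, and simply counts each intersection and divides by the row total, which is what the Pivot Table setting does.

```python
from collections import Counter

# Hypothetical (gender, plant) records for illustration only;
# these are NOT the WidgeOne.xls observations.
records = [
    ("F", "Dallas"), ("F", "Dallas"), ("F", "Norcross"),
    ("M", "Dallas"), ("M", "Norcross"), ("M", "Norcross"),
    ("F", "Dallas"), ("M", "Dallas"),
]

# Count each (row, column) intersection, as the Pivot Table does.
counts = Counter(records)
genders = sorted({gender for gender, _ in records})
plants = sorted({plant for _, plant in records})

# Row totals are the denominators for "% of row".
row_totals = {g: sum(counts[(g, p)] for p in plants) for g in genders}
row_pct = {
    (g, p): 100 * counts[(g, p)] / row_totals[g]
    for g in genders
    for p in plants
}

for g in genders:
    cells = ", ".join(
        f"{p}: {counts[(g, p)]} ({row_pct[(g, p)]:.0f}%)" for p in plants
    )
    print(f"{g} -> {cells}")
```

Dividing by column totals or by the grand total instead would give the percent of columns and the percent of totals.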


If we want to incorporate an additional piece of information, like the average job tenure by plant and by gender, we can do this by substituting the YRONJOB variable in the Data position. Do this by dragging the Gender (or Plant) variable out of the Values box and placing the YRONJOB variable in the same place:

The problem with our table at this point is that we really wanted the average Years on Job, but this is the summation of the total years on job for each intersection (summation is the default statistic for quantitative values). To change the summation to the mean, click on the Sum of YRONJOB in the Values box, and select Value Field Settings. From the Summarize By tab, change the default from Sum to Average and click OK.
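The difference between the Sum and Average settings can be sketched in a few lines of Python. The rows below are hypothetical (they are not the WidgeOne data); the sketch groups the years-on-job values by plant and gender and then summarizes each group both ways.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (plant, gender, years_on_job) rows for illustration only;
# these are NOT the WidgeOne.xls observations.
rows = [
    ("Dallas", "F", 8.0), ("Dallas", "F", 10.0),
    ("Dallas", "M", 4.0), ("Norcross", "F", 6.0),
    ("Norcross", "M", 3.0), ("Norcross", "M", 9.0),
]

# Collect the values that fall into each plant/gender intersection.
groups = defaultdict(list)
for plant, gender, years in rows:
    groups[(plant, gender)].append(years)

# "Sum of YRONJOB" is the default; "Average of YRONJOB" is what we want.
sums = {key: sum(values) for key, values in groups.items()}
averages = {key: round(mean(values), 2) for key, values in groups.items()}

for key in sorted(groups):
    print(key, "sum =", sums[key], "average =", averages[key])
```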


You should now have the following screen:

As before, this is a little untidy. You can format the cells to show two decimal places consistently.


Much better! This table now provides information such as "In the Dallas plant, women have an average of 8.85 years on the job." Now you can copy and paste this table into other documents or into another Excel sheet. As you can see, Pivot Tables are very useful and very flexible. However, because they are so flexible, they do require a bit of manipulation. Mastering Pivot Tables in Excel is a great differentiating skill, but will require practice (and patience).

Stacked bar charts are easy to create and manipulate. To reproduce the stacked bar chart from the Basic Concepts Manual, we will use the first Pivot Table created above, which indicated the frequency counts by gender and by plant. Go to the Pivot Table, convert the data back to counts, copy these cells and paste them into another part of the spreadsheet:

Note: DO NOT COPY THE PORTION OF THE PIVOT TABLE WITH THE DROP DOWN ARROWS.

This will disengage the data from the Pivot Table. You will see why this is helpful soon.


Now, highlight all of the data EXCEPT for the totals. With the data highlighted, go to the Insert tab and the Bar option in the Charts tools. Select the second option. You should see this:

[Figure: stacked bar chart of gender counts (F, M) by plant, with default labels]

This chart is fine, but a little untidy. Because the chart is dynamically linked to the table, you can update the N to read Norcross and D to read Dallas and the same for the genders. We should also apply a title. Go to the Layout tab and the Labels tools and select Chart Title. Add the title.


You should now have something like this:

[Figure: stacked bar chart titled "Men and Women at the Two Plants", with bars for Norcross and Dallas split into Female and Male]

Well Done! Whenever you have different population sizes, as is the case with the Dallas and Norcross Plants, it is sometimes helpful to scale both populations to 100% to compare the two more easily. This is the purpose of a 100% Stacked Bar Chart. To create this chart, you start the same way: highlight the data, go to the Insert tab and click on the Bar option, but this time select the third option (all the bars are the same length).


You should see something like this:

[Figure: 100% stacked bar chart for Norcross and Dallas, split into Female and Male, with the axis running from 0% to 100%]

You can think of this visualization as side by side pie charts. This graphic communicates the proportion of males and females within each plant. It is easy to see from this graphic that there are proportionately fewer women in Norcross than in Dallas. Remember that stacked bar charts and 100% stacked bar charts are both generated from qualitative data. If the variables are quantitative, a scatter plot would be more appropriate.


The final visualization in this section is the scatter plot. Scatter plots are typically used to determine if there is a meaningful relationship between two quantitative variables. They can also be a great tool to look at the data to spot any anomalies, such as outliers. To reproduce the scatter plot in the Basic Concepts Manual, let's return to the Plant_Survey sheet. To see if there is a relationship between Job Productivity and Job Tenure, we will plot these two variables in a scatter plot. It is important to note that in a scatter plot, we are NOT trying to establish any causation, only correlation.

First, it is helpful to have the two variables next to each other, which they are not. To move the PRDCTY variable next to the YRONJOB variable, click on the G at the top of the column where the PRDCTY variable is located. This will select the entire column. Now, press Ctrl and X at the same time. There should be chasing lights around the variable column. Click on the J at the top of the column to the right of the YRONJOB variable, so that column is highlighted. Now press Ctrl, Shift and + at the same time. Cool. The variables should now be next to each other (this is actually a Lagniappe).

Once the variables of interest are side-by-side, highlight both. Go to the Insert tab and select the Scatterplot option from the Chart tools. Delete the legend. You should see something like this (can you spot the outlier?):


[Figure: scatter plot with the default title PRDCTY; the vertical axis runs from 20.00 to 120.00 and the horizontal axis from 5.00 to 20.00]

Again, pretty untidy. We will do three things to clean up the appearance of this graph: rescale the y-axis, add titles and take away the decimals.


If the y-axis needs to be rescaled, select the Layout tab and select Axes from the Axes tools. Then select Primary Vertical Axis > More Primary Vertical Axis Options. You should see this:

Here is where you can change the minimum and maximum

Now, recall that graphics are typically dynamically linked to data in Excel. So, if we change the data, we change the graphic. In the Plant_Survey sheet, highlight both variables and decrease the decimals. This will change the appearance of the scatterplot as well.

Finally, to add titles to the axes, select the Layout tab. Choose the Axis Title option from the Labels tools. Begin by assigning a Primary Vertical Axis title, and then a Primary Horizontal Axis title. Name them appropriately. Then, rename your chart title. Your scatter plot should look something like this:

[Figure: scatter plot titled "Productivity versus Years on Job", with Productivity on the vertical axis and Years on Job on the horizontal axis]

We can derive additional information from this graphic by adding a trendline to the data. To add a trendline, select the Layout Tab, click on Trendline, and then click on More Trendline Options.


You should see this:

Identify that the trend is linear.

Identify that you want the Equation and the R-squared values on the chart.


Select Close.

You should see this:

[Figure: scatter plot titled "Productivity versus Years on Job" with a linear trendline; y = -0.1395x + 84.135, R-squared = 0.0018]

Again, that nasty little outlier is messing things up.


For comparison's sake, if it were not there, the scatter chart would look like this:

[Figure: scatter plot titled "Productivity versus Years on Job" with the outlier removed; y = -0.571x + 89.31, R-squared = 0.112]

See how the outlier is influencing the trend line, as well as the R-squared value? It went from .0018 to .1124. As explained in the Basic Concepts Manual, the trendline gives us the best-fitting linear equation for the relationship between Productivity (y) and Job Tenure (x). The R-squared value of .1124 indicates that this is not a particularly strong relationship: Job Tenure explains only 11.24% of the variation in Productivity. These concepts form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling and Outlier Treatment, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
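To see why a single point can move both the trendline and the R-squared value, here is a small Python sketch of an ordinary least-squares fit. The function name linear_fit and the data are our own assumptions for illustration (the numbers are NOT the WidgeOne observations, and this is not a description of Excel's internal routine); the point is only to compare the fit with and without one extreme value.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R-squared is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical data for illustration only (not the WidgeOne values).
years = [1, 2, 3, 4, 5]
productivity = [90, 88, 86, 84, 82]

clean_slope, clean_intercept, clean_r2 = linear_fit(years, productivity)

# Add one extreme observation: a new employee with a very low score.
noisy_slope, noisy_intercept, noisy_r2 = linear_fit(years + [1], productivity + [20])

print(f"without outlier: y = {clean_slope:.2f}x + {clean_intercept:.2f}, R-squared = {clean_r2:.4f}")
print(f"with outlier:    y = {noisy_slope:.2f}x + {noisy_intercept:.2f}, R-squared = {noisy_r2:.4f}")
```

In this made-up example one observation is enough to flip the sign of the slope and pull the R-squared value far below its original level, which is the same kind of distortion seen in the charts above.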


Concept 5: Using Excel for Random Number Generation and Simple Random Sampling
Our WidgeOne.xls dataset is fairly small, with only 40 observations. As a result, it would be unusual to want to extract a sample from such a small dataset. However, for the purposes of demonstrating random number generation in Excel, let's assume that we want to randomly select ten individuals with whom we want to conduct in-depth interviews.

Let's begin by assigning random numbers to each individual. Go back to the Plant_Survey sheet and create a new column labeled RANDOM. Place your cursor in the first cell under the column label (row 2). Click on the formula button. Ensure that ALL is selected as the Function Category. Scroll down through the Function Names until you see RAND. Select RAND and click OK. This will generate the following:

There are three pieces of information you need to understand from this box:
1. The function takes no arguments, which means that we do not need to provide any information;
2. The function will return an evenly distributed (uniform distribution) random number between 0 and 1;
3. The function is volatile, which means that the value returned will change EVERY time the spreadsheet is manipulated.

Click OK. You should see some number between 0 and 1 in this cell (your result will be different each time since the random number is generated using your computer's internal clock). Remember that Excel reads this cell as =RAND(), not as the number that you see. Now copy the formula in this cell down to the bottom of the dataset. Did you notice that your original number in row 2 changed? This is because it is volatile. Sometimes we need to have volatile arguments in (not with) Excel. Most of the time we do not. To convert the numbers you see from volatile to stable (unchanging), highlight the entire column and, from the Home tab, select Copy and then Paste > Paste Values.


Now, you should have a column of unchanging random numbers (your numbers will be different from ours):

Now, sort the entire dataset on the random numbers just created. Highlight the entire dataset. Then go to the Data tab. From the Sort and Filter tools, select Sort (be sure to check "My data has headers", or else the headers will be sorted as well).

When the dialogue box appears, click on the down arrow in the Sort by option. You will get a drop down of all of the variables. Select Random. It does not matter if you sort smallest to largest or largest to smallest, because the order is random either way.

Click OK. Then, select the first 10 individuals for the interviews. This is a fairly simple, but very useful process.
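The same procedure (attach a uniform random number to each row, freeze the values, sort on them, take the first ten) can be sketched in Python. The helper name simple_random_sample, the seed value and the use of employee numbers 1 through 40 as stand-ins for the dataset rows are all assumptions made for this illustration.

```python
import random


def simple_random_sample(rows, k, seed=None):
    """Pick k rows by sorting on a freshly generated uniform random key."""
    rng = random.Random(seed)
    # One random number in [0, 1) per row, like filling a RANDOM column.
    # Stored in a list, the values no longer change (like Paste Values).
    keyed = [(rng.random(), row) for row in rows]
    # Sort on the random key; ascending or descending makes no difference.
    keyed.sort(key=lambda pair: pair[0])
    # The first k rows after sorting are the sample.
    return [row for _, row in keyed[:k]]


# Hypothetical stand-in for the 40 rows of the dataset.
employee_ids = list(range(1, 41))
interviewees = simple_random_sample(employee_ids, 10, seed=2010)
print(interviewees)
```

Passing a seed is optional; it only makes the draw repeatable, which Excel's RAND does not offer. Leave it out to get a different sample on every run.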


Concept 6: Using Excel for Confidence Intervals


The penultimate section in this chapter will aid in the calculation of Confidence Intervals, one of the most commonly used techniques in Inferential Statistics. From the Basic Concepts Manual, let's assume that the WidgeOne.xls dataset is a representative sample of a larger manufacturing firm with hundreds of employees in Norcross, GA and Dallas, TX (if we had access to the entire organization's data, we would not calculate confidence intervals for any population parameter; we would report the descriptive statistics). Let's also assume that the HR department at WidgeOne has been charged with understanding the level of job satisfaction among employees. For cost reasons, they were unable to survey the entire organization, so they surveyed the 40 employees in our dataset.

Report the job satisfaction for all WidgeOne employees, using the sample of 40. Use a 95% level of confidence.

Go back to the Plant_Survey sheet. We previously calculated the mean job satisfaction to be 6.85 and the standard deviation to be 1.02. Using this information, we can use Excel to compute the confidence interval. To execute this computation, go into a blank portion of the spreadsheet and click on the function button. Ensure that the Statistical function category is selected and then scroll through the function names until you get to the CONFIDENCE.NORM function (which uses the normal distribution). Click OK.


You should see the following:

Alpha is 1-(the Confidence Level)

The STD would have been previously calculated

Size is the sample size, in this case 40

If we are computing a 95% confidence interval, we would enter .05 for the alpha value (you can think of alpha as the probability you are willing to accept of being wrong). The standard deviation, which was computed previously for job satisfaction, was 1.025. The (sample) size is 40.

An important note in spreadsheet development: you could enter 1.02 in this box or enter the cell reference J47. You would generate the same answer. However, you are almost ALWAYS better off entering the cell reference rather than hard coding a number. This makes the formula more portable.


Once this information is entered, the resulting computation should be .32. This is the margin of error for job satisfaction at a confidence level of 95%. You would then add this to and subtract it from the mean (6.85) to create the full interval. The full interval would then be reported as: "Based on a representative sample of 40 employees, we are 95% confident that job satisfaction among all employees is between 6.53 and 7.17."
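As a cross-check on the arithmetic, the margin of error from a normal-based interval is the critical z value times the standard deviation divided by the square root of the sample size. The Python sketch below reproduces that calculation with the values quoted in this section (alpha of .05, standard deviation of 1.025, sample size of 40, mean of 6.85). The function name normal_margin_of_error is our own; it is meant to mirror what CONFIDENCE.NORM returns, not to describe Excel's internals.

```python
from math import sqrt
from statistics import NormalDist


def normal_margin_of_error(alpha, std_dev, size):
    """Margin of error for a mean, using the standard normal distribution."""
    # Two-sided interval: put alpha/2 in each tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * std_dev / sqrt(size)


mean_satisfaction = 6.85  # sample mean quoted in this section
margin = normal_margin_of_error(alpha=0.05, std_dev=1.025, size=40)
lower = mean_satisfaction - margin
upper = mean_satisfaction + margin

print(f"margin = {margin:.2f}, interval = ({lower:.2f}, {upper:.2f})")
# prints: margin = 0.32, interval = (6.53, 7.17)
```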

Excel 2010 Lagniappe

What is a Lagniappe? This word derives from New World Spanish la ñapa, "the gift". The word came into the Creole dialect of New Orleans and there acquired a French spelling. It is still used in the Gulf States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a purchase. Our lagniappe for our readers includes the extra and interesting things that we have learned to do with these software packages that might not be easily found or well known. A little extra information at no extra cost!

One concept that is valuable to understand in Excel is the If statement and the Nested If statement. Let's discuss If statements in the context of converting a quantitative variable into a qualitative variable. Let's say that we want to categorize employees into high productivity (defined as a productivity score of 90 or greater) and low productivity (defined as a productivity score less than 90). One way to do this is through the application of an If statement.


Find a new, clean column on the right of the dataset. Title the column Productivity Category. You should see this:


The quantitative productivity score is in column G. In cell K2, enter the formula =IF(G2<90,"LOW","HIGH"). Copy this formula to the bottom of the dataset. You should see the following:


Cool. Now, let's assume that we want to add a third category: medium. Let's define medium productivity as a productivity score of at least 80 but less than 90, and low productivity as a score less than 80 (high is any score of 90 or above). Replace the formula in cell K2 with this: =IF(G2<80,"LOW",IF(G2<90,"MEDIUM","HIGH")). You should see the following:
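If it helps to see the branching logic written out, the two formulas can be mirrored in Python as below. The function names and the sample scores are hypothetical; note in particular how a score of exactly 80 or exactly 90 is handled, since the formulas test "less than" at each cut-off.

```python
def two_level(score):
    """Mirror of =IF(G2<90,"LOW","HIGH")."""
    return "LOW" if score < 90 else "HIGH"


def three_level(score):
    """Mirror of =IF(G2<80,"LOW",IF(G2<90,"MEDIUM","HIGH"))."""
    if score < 80:
        return "LOW"
    # Only reached when score >= 80, just like the inner IF.
    if score < 90:
        return "MEDIUM"
    return "HIGH"


# Hypothetical productivity scores, chosen to hit each boundary.
scores = [72.5, 80, 89.9, 90, 97]
for score in scores:
    print(score, two_level(score), three_level(score))
```

The order of the tests matters in the nested version: checking for less than 90 first would label every low score as MEDIUM.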


Congratulations on your new mastery of Excel. We hope you enjoyed your descent into Excel Geekdom. You should be proud.
