
Excel: Statistical Features

Windows ©2009

Excel: Statistical Features v6.1.1 60 Pages


About IT Training & Education
The University Information Technology Services (UITS) IT Training & Education program at Indiana University
offers instructor-led computing workshops and self-study training resources to the Indiana University community and
beyond. We deliver training to more than 30,000 participants annually across all Indiana University campuses. Our
staff is comprised of enthusiastic professionals who enjoy developing and teaching computing workshops. We
appreciate your feedback and use it to improve our workshops and expand our offerings. We have received several
international awards for our materials and they are being used at universities across the country. Please keep your
questions, comments and suggestions coming!

In Bloomington, contact us at ittraining@indiana.edu or call us at (812) 855-7383.


In Indianapolis, contact us at ittraining@iupui.edu or call us at (317) 274-7383.

For the most up-to-date information about workshops and schedules, visit us at:

http://ittraining.iu.edu/

Copyright 2008 - The Trustees of Indiana University

These materials are for personal use only and may not be copied or distributed. If you would like to use our materials
for self-study or to teach others, please contact us at: IT Training & Education, 2711 East 10th Street, Bloomington,
IN 47408-2671, phone: (812) 855-7383. All rights reserved.

The names of software products referred to in these materials are claimed as trademarks of their respective companies
or trademark holders.
Contents

Welcome and Introduction
What You Should Already Know
What You Will Learn
What You Will Need to Use These Materials
Getting Started
Today’s Project
Opening an Existing Worksheet
What the Data Represents
Calculating CANX and MANX
Correcting the Error Settings
Using an IF Function
Pasting the Formulas Down the Column
Using Descriptive Statistics Functions
Using the Insert Function Command
Finding the Average
Using Functions Without the Insert Function Command
Finding a Standard Deviation
Using the Data Analysis Tool
Obtaining Descriptive Statistics
Loading the Analysis ToolPak
Using Descriptive Statistics
Creating Histograms and Frequency Distributions
Generating the Histogram and Frequencies
Modifying a Histogram
Generating a Second Histogram
Exploring Multiple Variables
Creating Pivot Tables
Performing a t-Test
Sorting the Data
Running an F Test for Unequal Variances
Interpreting the Results
Running the T-Test
Interpreting the Results
Resorting the Data
Finding Correlations
Running the Correlation Procedure
Interpreting the Results
Simple Linear Regression
Interpreting the Results
Creating a Scatterplot Chart
Further Interpreting the Results
Exiting Excel
Wrapping Up
Contributions to These Materials

Welcome and Introduction
Welcome to Excel: Statistical Features.

What You Should Already Know

You should have already attended Excel: The Basics or have the equivalent
skills. Specifically, you should be able to:

• understand what a spreadsheet is
• insert data into Excel
• work with formulas and functions
• switch from one worksheet to another

You should also have a basic understanding of statistical concepts.

What You Will Learn

This workshop introduces Excel users to a variety of available statistical
procedures using the Insert Function command and add-ins, and provides
hands-on practice in how to:

• find the mean for a range of data
• find the standard deviation of a variable
• generate a pivot table
• find the correlation among several variables
• perform simple regressions and t-tests
• interpret the results of these procedures

What You Will Need to Use These Materials

To complete this workshop successfully, you will be provided with:

• the use of Microsoft Excel 2007
• the exercise file Stats.xlsx



Getting Started
These materials presume you will begin work from the desktop, and have any
required exercise files located in an epclass folder there. For instructions on
obtaining the exercise files, see below.

If you need assistance logging on or starting an application, please consult
your instructor.

Finding Help
If you have computer-related questions not answered in these materials, you
can look for the answers in the UITS Knowledge Base, located at:

http://kb.iu.edu/

Self-Study Training
Want to learn more on your own?

IT Training Online makes self-study computer-based courses available on a
wide range of IT topics. You may also purchase STEPS workshop materials
to use in learning on your own. To find out more, go to:

http://ittraining.iu.edu/online/

Getting the Exercise Files


Most of our workshops use exercise files, listed at the bottom of page 1 of the
materials. In our computer-equipped classrooms, these files are located in the
epclass folder, which should already be on the computer desktop. If you are
using our materials in a different location, you may obtain the exercise files
from our Web site at:

http://ittraining.iu.edu/workshops/files/

Once you are logged on and have the needed files in an epclass folder on your
desktop, you are ready to proceed with the rest of the workshop.



Today’s Project
There may be times when you have a set of data and wish to compute some
rather basic statistics, but you do not have the time to learn a statistical
package or no such package is readily available. Excel can perform many of
the basic statistical procedures and some of the more advanced ones.

This course is designed to introduce the user to some of the basic statistical
procedures available in Excel 2007 as well as provide instruction on how to
use them. If you are already very familiar with Excel, SPSS, SAS or other sta-
tistical packages, this course may be a bit basic.

NOTE: Throughout these materials, many of the listed results have been
rounded to two decimal places, while Excel may display more in the
actual spreadsheet.

Let’s start by opening the data we’ll be working with today and then see what
the data represents.

Opening an Existing Worksheet


Since all of us are familiar with the basics of Excel, we’ll open the file
Stats.xlsx, which will be used to demonstrate some of the statistical proce-
dures available in Excel.

1. Launch Excel using the Start menu or a desktop shortcut.

Excel loads and you see the opening screen.

2. To open an existing file,

Click the Office Button, click Open

You see the Open dialog box.

We need to specify the name and location of the file to open.

Setting the Location for Opening Your File


When the dialog box opens, it lists a default location from where the file will
be opened. All of our exercise files are contained in the epclass folder, located
on the desktop. We’ll want to change our location to this folder.



We will start at the desktop, since our exercise file folder, epclass, is located
there.

1. To move to the desktop, in the Open dialog box,

Click Desktop

The current location is now set to the desktop. All of our exercise files are
contained in the epclass folder, located on the desktop.

2. To open the epclass folder,

Double-click the epclass folder

The epclass folder contents are now visible.

3. To select the file Stats.xlsx,

Double-click Stats.xlsx

The workbook opens.

What the Data Represents


The data you see are from a study that investigated computer anxiety in
middle-school children. The data were collected from forty ninth-graders from
three different school systems. The first row of the data set contains the labels,
or names, of the variables.

The information collected on each student includes:

• ID number (ID)
• Gender of the student (SEX)
• Amount of previous computer experience (EXP)
• School system (SCHOOL)
• Responses from two ten-item questionnaires that show computer and math
anxiety, respectively (C1-C10, M1-M10)
• Test scores from a computer course and a mathematics course for a given test-
ing period (COMPSCOR, MATHSCOR)
• Two variables that are measures of computer and math anxiety, based on the
questionnaire responses (CANX, MANX)



Each row of data is a single subject; that is, each row of data tells us about an
individual student. Each column is a single variable, with the name for each
variable appearing as the column label.

Excel is often used for data entry or manipulating data files for use in other
statistical applications, such as SPSS or SAS. Knowledge of particular limita-
tions in the other statistical applications can help avoid some problems with
data importing, and can determine some aspects of the Excel file. In SPSS, for
instance, we are required to give each variable a name, and there are other
restrictions about characters that can be used in names and the length of
names. To avoid conflicts in SPSS, variable names should be eight
characters or fewer and should start with a letter.
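
The naming rule above can be checked before a file is handed to another package. The sketch below is a minimal illustration in Python, not part of the workshop exercise. The function name `is_safe_name` is hypothetical, and limiting the remaining characters to letters, digits, and underscores is an extra assumption that goes beyond the rule stated in these materials.

```python
import re

# Rule from the text: at most eight characters, starting with a letter.
# Allowing only letters, digits, and underscores after the first character
# is an assumption made for this sketch.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,7}")

def is_safe_name(name):
    """Return True if the column label satisfies the naming rule sketched above."""
    return NAME_PATTERN.fullmatch(name) is not None

# Column labels that appear in the Stats.xlsx description.
labels = ["ID", "SEX", "EXP", "SCHOOL", "C1", "M10",
          "COMPSCOR", "MATHSCOR", "CANX", "MANX"]
unsafe = [label for label in labels if not is_safe_name(label)]
print(unsafe)  # prints [] because every label listed here passes the check
```

A label such as COMPUTERSCORE (too long) or 1STSCORE (starts with a digit) would fail this check.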

The first column of data, ID, is the student ID number, a number that the
researcher has assigned to each student. The second column, SEX, is the stu-
dent’s gender. The third is the variable EXP, which tells us roughly the
number of years of previous computer experience. The numbers correspond to
the following time periods:

• 1: one year or less
• 2: about two years
• 3: three years or more

The next column, SCHOOL, contains numbers that identify which of three
school systems the student attends. The numbers correspond to the following
school systems:

• 1: rural school system
• 2: suburban school system
• 3: city school system

The following twenty columns, C1 to C10 and M1 to M10, are the responses
to questionnaires designed to test computer and math anxiety, with a 1
indicating highest anxiety and 5 indicating lowest anxiety. The next two
columns, MATHSCOR and COMPSCOR, are scores from tests given in a math
course and a computer course, respectively, for a given time period.

Calculating CANX and MANX


The final two data columns, CANX and MANX, are not yet calculated.
Both will contain summary data from the anxiety questionnaires given to the
students, whose responses are recorded in columns E through X. The
questionnaires were designed so that responses close to 1 indicate higher
anxiety about a given subject, and responses close to 5 indicate lower
anxiety. By summing all of the responses for, say, computer anxiety, we get a
number that summarizes that student’s overall level of computer anxiety (the
lower the sum, the higher the anxiety).



To illustrate different methods of data summary and description, we will be
recording the values in columns AA and AB slightly differently. CANX (in
column AA) will simply be the sum of the responses to the computer anxiety
questions. So a larger value of CANX for a particular student means a greater
level of comfort with computers.

MANX (in column AB) will be recorded as a dichotomous variable. High sums
of the math anxiety responses, greater than or equal to 30, will be set to 0.
Low sums, less than 30, will be set to 1. We will use an IF formula to
determine this variable.
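
For readers who think more easily in code than in worksheet formulas, the two definitions above can be sketched in Python. This is only an illustration of the logic: the function names `canx` and `manx` and the response values are hypothetical and are not taken from Stats.xlsx.

```python
def canx(computer_responses):
    """CANX: the plain sum of the ten computer anxiety responses (C1-C10)."""
    return sum(computer_responses)

def manx(math_responses, cutoff=30):
    """MANX: 0 when the summed math responses reach the cutoff, otherwise 1."""
    return 0 if sum(math_responses) >= cutoff else 1

# Hypothetical responses for one student (made up for this sketch).
c_items = [2, 3, 2, 1, 3, 2, 3, 2, 3, 2]   # sums to 23
m_items = [3, 2, 3, 3, 2, 3, 3, 2, 3, 3]   # sums to 27

print(canx(c_items))  # 23
print(manx(m_items))  # 1, because 27 is below the cutoff of 30
```

The worksheet steps that follow do the same thing with SUM for CANX and with SUM nested inside IF for MANX.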

Let’s first calculate the value of CANX for this data set.

1. To set the location for this function,

Click cell AA2

The results of the survey questions relating to computer anxiety are in col-
umns E through N.

2. To begin the sum function, in the Editing group at the right of the Ribbon,

Click the AutoSum (Σ) button

Excel inserts the function, and attempts to guess which cells you would like
to sum. However, Excel’s guess is incorrect.

3. To select the correct cells,

Drag from cell N2 to cell E2

The formula in the formula bar should read: =SUM(E2:N2).

4. To complete the formula, press:

Enter

The result, 23, appears in cell AA2.



Correcting the Error Settings
The formula in cell AA2 sums only part of a block of adjacent numeric cells:
it includes E2:N2 but leaves out the neighboring cells O2:X2. According to
Excel’s default error-checking settings, this is a cause for alarm. Therefore,
Excel alerts the user by placing a green error marker in the upper-left-hand
corner of the cell.

However, we know that this formula is correct. For the purposes of this
worksheet, we can tell Excel to ignore this error.

1. To select the appropriate cell,

Click cell AA2

When the cell is selected, the Error Checking button pops up to the left of
the cell. All of the error-checking settings may be accessed via the button’s
drop-down menu.

2. To view the error options,

Click the Error Checking button, click Error Checking Options...

The Excel Options dialog box opens. The “Error checking rules” are locat-
ed at the bottom of the dialog box.

3. To turn off the appropriate error checking setting, in the “Error checking
rules” section,

Click the “Formulas which omit cells in a region” checkbox

The checkbox is now empty. This setting will be effective for all of the cells
in this spreadsheet.

NOTE: To access the Excel Options without using the Error Checking button,
click the Office Button, click Excel Options, and then click Formulas.



4. To close the dialog box,

Click OK

The green error marker disappears from the cell.

This CANX formula could now be copied to the rest of the cells in column
AA. However, it will be faster to enter the function for the MANX variable
into cell AB2, then copy both the CANX formula and the MANX function
to all relevant cells at the same time.

Let’s calculate the value of MANX for this data set.

Using an IF Function
MANX also uses a SUM function, but we will go one step further by
placing this SUM function within an IF function to create a dichotomous
variable.

1. To place the cursor in the appropriate place,

Click cell AB2

2. To begin the process of inserting an IF function, in the formula bar,

Click the Insert Function (fx) button

The Insert Function dialog box appears. We want one of the more common-
ly used functions, IF.

NOTE: If IF is not apparent in the initial function list, select “Logical” from
the category drop-down list; the IF function will appear as a member
of this sub-group.

3. To select the IF function, in the “Select a function” list,

Click IF, click OK

The Function Arguments dialog box opens.

The IF function requires a user to first define the logical test. In this case,
the test will be whether or not the sum of a given set of cells (O2:X2) is
greater than or equal to 30. The next two elements of the function specify
what value to return when the logical test is true and when it is false. In
this case, we want sums greater than or equal to 30 to be given the value 0,
and sums less than 30 the value 1.

4. To indicate the logical test, in the first field type:

SUM(o2:x2)>=30

Then press Tab.

5. To indicate the true return value, type:

0

6. To indicate the false return value, type:

1

The completed formula should look like this:

=IF(SUM(O2:X2)>=30,0,1)

7. To complete the function,

Click OK

The value 1 appears in the cell AB2, indicating that this student’s MANX
sum was less than 30.

Let’s copy the CANX formula and the MANX function into the remaining
cells in column AA and AB.



Pasting the Formulas Down the Column
Now that we have the correct formula and function, we need to paste them
down the column for every student. We could select the cells holding the data
to be copied, and use the AutoFill handle; however, this option can be difficult
for large sets of data. Instead, we will use a keyboard shortcut to select the
appropriate cells, and then fill the formula into those cells via the Ribbon.

To begin this process, the cell containing the CANX formula must be the
active cell.

1. To indicate the appropriate active cell,

Click in cell AA2

2. To select the appropriate cells, press:

Ctrl+Shift+End

The range AA2:AB41 is selected.

3. To fill the formula into the selected cells, on the Ribbon, in the Editing
group,

Click Fill, click Down

4. Deselect the cells.

The cells in column AA now contain the sums of the students’ responses to
questions concerning computer anxiety. The cells in column AB now
contain the correct value indicating either high (1) or low (0) math anxiety.
We will now save our work.

5. To save the workbook, on the Quick Access Toolbar,

Click the Save button

We can now move on to generating some descriptive statistics for our data.



Using Descriptive Statistics Functions
Descriptive statistics organize, summarize, and describe the data. Prior to per-
forming any of the more complex statistical procedures, it is necessary to have
a good idea of the characteristics of the data. For example, it would be impor-
tant to know the central tendency and distribution of each variable. These
characteristics are measured by statistics such as the mean, median, and stan-
dard deviation. It is important to have this information because the character-
istics of the data play a large part in determining which subsequent statistical
procedures are appropriate.

Using the Insert Function Command


The Insert Function command is an Excel feature that guides the user
through various functions. These functions are organized into categories that
include statistical, engineering, mathematical, and financial functions.

Once you become familiar with functions and how they are generated, you
will learn how to enter functions directly. We’ll use the Insert Function feature
in the first exercise.

Finding the Average


Suppose an individual wants to find the average of math test scores (i.e., the
average of the MATHSCOR). Before determining the average, he or she must
tell Excel where to place the output. When generating output, Excel will over-
write any data that is in the active cell.

We’ll generate the average and place it in an empty cell at the bottom of the
MATHSCOR column.

1. To select the location to place the results of the following exercise,

Click cell Y42

NOTE: When performing a statistical function based on a column of data, it is
best to paste your function in the same column, below the data.



2. To open the Insert Function dialog box, on the formula bar,

Click the Insert Function (fx) button

You see the Insert Function dialog box:

3. To select the statistical function we want, in the “Select a function” field,

Click AVERAGE

4. To confirm this choice,

Click OK

You see the Function Arguments dialog box:



The Function Arguments dialog box displays with the computed range
Y2:Y41 highlighted in the Number1 field. We chose to place our average
function in cell Y42. Excel takes the average of the cells above Y42 but ig-
nores any labels in the first row of the column.

5. To confirm that this is the correct range,

Click OK

The result, 48.725, is displayed in cell Y42.
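
The way AVERAGE treats a column that mixes a text label with numbers can be mimicked in a few lines of Python. This is a rough sketch of the behavior described above, not Excel's actual implementation; the function name `excel_like_average` and the scores are hypothetical.

```python
def excel_like_average(cells):
    """Average the numeric cells in a range, skipping text and empty cells."""
    numbers = [c for c in cells
               if isinstance(c, (int, float)) and not isinstance(c, bool)]
    if not numbers:
        # Excel would show #DIV/0! when a range holds no numbers at all.
        raise ZeroDivisionError("no numeric cells to average")
    return sum(numbers) / len(numbers)

# Hypothetical column: a text label, five made-up scores, and an empty cell.
column = ["MATHSCOR", 40, 55, 62, 38, 50, None]
print(excel_like_average(column))  # 49.0
```

Because the label and the empty cell are skipped, the divisor is the number of numeric cells (five here), not the number of cells in the range.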

Using Functions Without the Insert Function Command

Using the Insert Function feature is only one way of generating statistics with
Excel. Another method involves simply typing in the proper commands for
the desired statistical measure.

Let’s see how to find a univariate statistic by typing in the needed function.

Finding a Standard Deviation


The standard deviation is a measure of the spread of the values of a variable
about its mean. The mean provides information about one characteristic of a
variable. Using only the mean to describe a variable could be misleading.

For example, suppose there are two towns each with ten residents and each
with a mean annual income of $10,000. With no further information, you can
say nothing about how the income is distributed. Further investigation reveals
that in the first town, each resident had an annual income of $10,000 while in
the second town nine residents had no income while one had an income of
$100,000. While both towns have a mean annual income of $10,000, the dis-
tribution of the income within each is vastly different.

In the first town each resident has the same annual income, so there is no
deviation around the mean of $10,000. The standard deviation is 0. In the
second town the incomes are not so evenly distributed. Nine residents have no
income and one has an income of $100,000. For the second town the standard
deviation, calculated with the sample formula that Excel’s STDEV function
uses, is 31,622.78.

Using statistical measures like the standard deviation would uncover such
uneven distribution and allow for a more descriptive and accurate analysis to
be made. We will use the same variable, MATHSCOR, for this exercise.



1. To select the result cell for the standard deviation,

Click cell Y43

2. To begin the command to compute the standard deviation, type:

=STDEV(

Now we must tell Excel the range of cells on which to calculate the standard
deviation. We could simply type in the correct range of cells, Y2:Y41, but
it is often easier to select the range with the mouse.

3. To select the range of cells,

Drag from cell Y41 to cell Y2

4. To compute the standard deviation, press:

Enter

The standard deviation of MATHSCOR, 16.08, is now displayed in cell
Y43. If the scores are roughly normally distributed, this means that about
two-thirds of the students have scores that fall within 16 points above or
below the mean.

NOTE: The STDEV function is appropriate for this case because our data are
a sample from a larger population. Whenever the data are the entire
population, one would use the function STDEVP. Before taking the
square root, the STDEV function divides the sum of squared deviations
by n-1, whereas the STDEVP function divides it by n.
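
To make the n-1 versus n distinction concrete, the sketch below recomputes the two-town income example from earlier in this section in Python. The helper `std_dev` is a hypothetical name written for this illustration; it is checked against Python's standard `statistics` module rather than against Excel itself.

```python
import math
import statistics

town_one = [10_000] * 10         # every resident earns $10,000
town_two = [0] * 9 + [100_000]   # nine residents earn nothing, one earns $100,000

def std_dev(values, sample=True):
    """Standard deviation with an n-1 divisor (sample) or an n divisor (population)."""
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(squared_deviations / divisor)

print(std_dev(town_one))                          # 0.0
print(round(std_dev(town_two), 2))                # 31622.78 (n-1 divisor, like STDEV)
print(round(std_dev(town_two, sample=False), 2))  # 30000.0 (n divisor, like STDEVP)

# The standard library agrees with the hand-written version.
assert math.isclose(std_dev(town_two), statistics.stdev(town_two))
assert math.isclose(std_dev(town_two, sample=False), statistics.pstdev(town_two))
```

The 31,622.78 quoted for the second town is therefore the n-1 figure; treating the ten residents as the entire population would give 30,000 instead.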

Using the Data Analysis Tool


In addition to the Insert Function feature, Excel provides a pre-packaged
set of statistical procedures. Data Analysis tools allow users to select a
statistical procedure and get results without having to type in a long,
cumbersome formula. The results generated by the Data Analysis tools are
much more comprehensive than those obtained with the Insert Function
dialog box. Whereas the Insert Function dialog box provides you with one
statistical measure at a time, the Data Analysis tools can provide many
measures at once.



Obtaining Descriptive Statistics
Having more detailed information about our variables of interest is often nec-
essary prior to doing any more complex or in-depth analysis. Additional infor-
mation gives us a better sense of our data as well as allowing us to choose the
proper statistical procedure for the type of analysis we have in mind.

At this point, we want to generate some basic summary measures of center,
dispersion, skewness and kurtosis using the Descriptive Statistics option in the
Analysis ToolPak.

Loading the Analysis ToolPak


The Excel Analysis ToolPak allows users to generate a wide variety of statisti-
cal measures and tests on data. By default, it is not loaded when Excel 2007 is
first installed or opened, so our first task is to load the ToolPak.

1. To begin loading the Analysis ToolPak,

Click the Office Button, click Excel Options

The Excel Options dialog box opens, listing the most commonly accessed
options. Since the Analysis ToolPak is an add-in, we must access the list of
Add-In options.

2. To access the Excel Add-Ins, in the Excel Options dialog box,

Click Add-Ins

The main viewing pane now lists the available Excel add-ins.

3. To manage the Excel Add-Ins, at the bottom of the dialog box,

Click Go...

The Add-Ins dialog box opens. We want to activate the Analysis ToolPak.

4. To load the Analysis ToolPak,

Click the Analysis ToolPak checkbox, click OK

The Add-Ins dialog box closes and we are returned to our worksheet.



The Analysis Toolpak will now be visible on the Data tab. We will access
it in order to generate some basic descriptive statistics.

Using Descriptive Statistics


We will begin with a basic exploration of some of the characteristics of our
data using the Data Analysis tools and the descriptive statistics option. The
descriptive statistics procedure generates a group of basic statistical measures
simultaneously, saving us the time it would take to generate these measures
one at a time.

The Data Analysis tool is located on the Data command tab.

1. To switch to the Data command tab, on the Ribbon,

Click the Data tab

2. To open the Data Analysis Tool, in the Analysis group,

Click Data Analysis

You see the Data Analysis dialog box:

This dialog box lists the various statistical procedures available in the Data
Analysis ToolPak.

3. To select Descriptive Statistics, in the Analysis Tools list,

Click Descriptive Statistics



4. To confirm this choice,

Click OK

You see the Descriptive Statistics dialog box:

This box contains sections in which we specify the input (the variable we
wish to examine) and the output (the types of statistics to generate and
where we wish to have them displayed).

The top section of the window contains the input options. To indicate the
range of data we are interested in, we type its cell address in the Input
Range field.

We are interested in getting a set of descriptive statistics for the variable
CANX. It is located in column AA, in range AA1:AA41.

5. To set the input range of the CANX variable, type:

aa1:aa41

We can indicate whether our data are arranged in rows or columns by
selecting the appropriate radio button. By default, Excel assumes the data
are arranged by columns, with the variables in the columns and the
observations (i.e., the participants) in the rows.



6. To indicate that our variables have labels in the first row,

Click the “Labels in First Row” checkbox

The bottom section of the window contains the options for the output dis-
play. By default, Excel sends the results to a new worksheet. We will keep
the default selection. We need only designate a name for the worksheet in
the New Worksheet Ply field.

7. To name the new worksheet,

Click in the “New Worksheet Ply” field, type: Sum

This name will remind us that this worksheet contains summary statistics.

8. To generate summary statistics,

Click the Summary statistics checkbox

The completed Descriptive Statistics dialog box should look like this:

9. To complete the procedure,

Click OK

Excel computes the requested statistics and sends the output to a new work-
sheet.

Notice that the columns are not wide enough to display all of the data in
them. Let’s widen the columns before continuing.



10. To increase the column widths,

Double-click the boundary between the column headings A and B

11. To save the workbook, on the Quick Access Toolbar,

Click the Save button

Analysis of Output
Excel generates statistics that provide us with detailed descriptions of our
variable. We are given the mean of 29.92 and the standard deviation of 3.33.
We are also supplied with the minimum value, 23, and the maximum value, 40.
Additional statistics include the mode, sample variance, kurtosis, skewness,
the sum of the values, and the total number of observations for the variable.
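
As a cross-check on how such a summary table is built, the sketch below computes a similar set of measures in Python with the standard `statistics` module. The `describe` function and the eight sample values are hypothetical. The skewness and kurtosis lines use the small-sample formulas that Excel documents for its SKEW and KURT functions; that the Descriptive Statistics tool uses those same formulas is an assumption of this sketch.

```python
import math
import statistics

def describe(values):
    """Return summary measures similar to the Descriptive Statistics output."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n-1)
    z = [(x - mean) / s for x in values]  # standardized values
    # Sample skewness and excess kurtosis with small-sample corrections.
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {
        "Mean": mean,
        "Standard Error": s / math.sqrt(n),
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        "Standard Deviation": s,
        "Sample Variance": statistics.variance(values),
        "Kurtosis": kurt,
        "Skewness": skew,
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }

# Hypothetical CANX-style sums for eight students (not the workshop data).
sample = [23, 27, 29, 30, 30, 31, 33, 37]
for name, value in describe(sample).items():
    print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

The sketch needs at least four values, because the kurtosis correction divides by n-3.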

Creating Histograms and Frequency Distributions


A histogram is a table or chart that provides information on how the values for
a variable are distributed between the highest and lowest values. It indicates
whether the values are evenly spread out or whether they are bunched up
around just a few values. The histogram procedure in Excel generates both a
table giving the frequency distribution of the values of a variable as well as a
chart displaying in a graphical format the data shown in the table.

Constructing a histogram in Excel is a two-step process, so a bit of preparation
is necessary. First, the intervals of distribution, called bins, must be chosen for
the presentation of the data. The intervals can be either unique values or a
range of values.

For example, suppose you had a variable with possible values of 1, 2, 3, or 4.
You could have Excel calculate the frequency distribution for each value by
creating four bins. Alternatively, if the variable you choose has integer values
ranging from 11 to 84, one option would be bins that are 10 units wide. The
result would show how many values fall between 11 and 20, how many
between 21 and 30, and so on. After you have determined your bin intervals,
the histogram is generated.
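
The bin-counting idea can be sketched in Python as follows. The sketch assumes that each bin number acts as an inclusive upper bound and that anything above the largest bin lands in a catch-all "More" row, which is how the Histogram tool is generally described. The function name `frequency_table` and the data are hypothetical.

```python
def frequency_table(values, bins):
    """Count values per bin; each bin number is an inclusive upper bound."""
    upper_bounds = sorted(bins)
    counts = {bound: 0 for bound in upper_bounds}
    counts["More"] = 0  # catch-all for values above the largest bin
    for value in values:
        for bound in upper_bounds:
            if value <= bound:
                counts[bound] += 1
                break
        else:
            # No bin was large enough for this value.
            counts["More"] += 1
    return counts

# Hypothetical integer data between 11 and 84 (not from Stats.xlsx).
data = [11, 18, 20, 21, 35, 40, 47, 52, 60, 61, 79, 84]
print(frequency_table(data, bins=[20, 30, 40, 50, 60, 70, 80]))
# {20: 3, 30: 1, 40: 2, 50: 1, 60: 2, 70: 1, 80: 1, 'More': 1}
```

Adding 90 as a final bin would move the value 84 out of "More" and into its own 81 to 90 interval.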

We will use the variable SCHOOL for this exercise. SCHOOL indicates the
type of school system the student attended. A value of 1 indicates a rural
school system, 2 indicates a suburban school system, and 3 indicates an urban
school system.

To begin construction of a histogram, we must first switch back to the data
worksheet in Stats.xlsx by clicking on the tab labeled stat_data at the bottom
of the sheet.



1. To switch back to the stat_data worksheet,

Click the stat_data worksheet tab

To select a cell for the construction of our histogram, we begin by moving
to an empty cell address. Here we will use AC1.

2. To select the result cell,

Click cell AC1

3. To give a name to the new variable, type:

BIN

Then press Enter.

We must designate the appropriate number of categories of our variable and
indicate them in the BIN variable. Pressing Enter moves the active cell
down to AC2. We already know that there are 3 values for our variable
SCHOOL.

4. To identify the first bin, in cell AC2, type:

1

Then press Enter. The cursor moves to the cell below.

5. To indicate the next bin, type:

2

Then press Enter.

6. To designate the final category, in cell AC4, type:

3

Then press Enter.

Generating the Histogram and Frequencies


We have now completed the first step in histogram construction and are ready
to move on to its generation.



1. To begin generating the histogram, on the Data tab, in the Analysis group,

Click Data Analysis

You see the Data Analysis dialog box:

2. To select the Histogram tool,

Click Histogram

3. To confirm the selection and begin creating the histogram,

Click OK

The Histogram dialog box opens:

The top section of the window has the input options while the bottom part
contains the options for the output.

4. To designate the input range for the variable SCHOOL, in the Input Range
field, type:

d1:d41

Then press Tab.



5. To designate the bin range, in the Bin Range field, type:

ac1:ac4

The vertical ranges just entered for the input and bin ranges each had a data
label as the first cell of the column. To indicate that these cells contain
labels rather than data to be counted in the histogram, we must select
the Labels checkbox.

6. To indicate that labels are located in the first row,

Click the Labels checkbox

7. To name the new worksheet,

Click in the “New Worksheet Ply:” field, type: Hist

Cumulative percentages provide a running tally of our variable’s values.

8. To generate a running tally of percentages in our output,

Click the Cumulative Percentage checkbox

9. To generate a chart of our output,

Click the Chart Output checkbox

The completed Histogram dialog box should look like this:



10. To generate the output,

Excel generates the output and sends it to a new worksheet.

Modifying a Histogram
The histogram that we now see is in Excel’s generic format. However, in
statistical circles, the format is slightly different. First, histograms by
definition are composed of columns that have no gap between them, so we will
remove the space between the columns in our histogram. Second, histograms
typically do not have a “More” column. Excel generates this extra column
automatically to capture any observations that are larger than the highest
bin value. We will remove this column from the output table and chart.
Finally, we will change the title of the histogram and change the axis labels
to make the chart easier to understand.

Resizing the Chart


The chart is too small to really understand. We can enlarge the chart to
make it more readable.

1. To prepare to enlarge the chart,

Click on a blank area of the chart

You see a set of resize handles appear on the border of the chart.

2. To enlarge the chart,

Drag the lower-right corner handle down vertically

When the mouse is released, the columns lengthen and become easier to
read.

Changing Gap Width


First, we will remove the gap between the histogram columns by changing the
gap width to 0.



1. To begin modifying the chart columns,

Click one of the histogram columns

Bounding boxes appear around all the columns, indicating that they have
been selected. Also, the Chart Tools contextual tab appears on the Ribbon,
with three sub-tabs: Design, Layout and Format. We need to activate the
Format tab.

2. To activate the Format tab, under Chart Tools,

Click the Format tab

The Ribbon changes to display the commands available in the Format tab.
We now want to begin formatting the selected columns.

3. To begin formatting the selected columns, in the Current Selection group,

The Format Data Series dialog box opens:

We want to change the Gap Width to zero. This will remove the space
between the columns.



4. To remove the gap between the histogram columns, in the Gap Width
section,

Drag the slider left to 0%, then close the dialog box

The columns now touch, but it’s difficult to distinguish them from each
other. We will change the outline color of the columns to clearly define the
borders of the columns.

5. To change the outline color of the columns, in the Shape Styles group,

Click Shape Outline, click a contrasting color

6. Deselect the columns.

The columns are now easily distinguishable. Next we will remove the
“More” column.

Removing the “More” Column


In our diagram, Excel has automatically included a bin called “More.” This
bin captures any observations that are larger than the highest bin value we
specified. In our case, every observation fits into one of our three bins, so
the More bin has zero observations and shows up as a blank spot to the right
of our histogram columns. We want to remove this column from the Bins list as
well as from the chart.

1. To select the More bin and its connected data, in the table,

Drag across cells A5 through C5

These cells are now highlighted. At this point, it may seem that simply
pressing the Delete key will complete this task, but this is not so. Pressing
the Delete key will only remove the information from the table, but the
blank space will still be visible in the chart.

So we must use a different method here.



2. To remove all traces of the More column,

Right-click the selected cells, click Delete...,

Click the “Entire row” radio button, then confirm the dialog box

The bin disappears, and the blank space also disappears from the chart.
Now, we will change the title and axis labels.

Changing Chart Labels


We are familiar with the content and coverage of our chart, but our title and
axis labels render this chart unreadable to anyone who is not familiar with the
data. Hence, we will change the title and X-axis (category axis) label to
increase the clarity of the chart.

1. To select the histogram title, at the top of the chart,

Click the title, double-click the title text

The title text is now selected and ready for editing.

2. To change the histogram title, type:

No. of Students by School System

The title changes. We are now ready to change the X-axis label.

3. To select the X-axis label, at the bottom of the chart,

Click the BIN label, double-click BIN

The X-axis label is selected and ready for editing.

4. To change the X-axis label, type:

School System Type

The X-axis label is now easier to understand and the histogram is now ready
to be placed directly into a report.



The chart should look similar to the one below:

The frequency distribution table indicates that the number of students is
fairly evenly distributed among the rural, suburban, and urban school
districts. The first two districts, rural and suburban, are each represented
by 13 students and there are 14 urban students in the study. The chart
displays the same results.
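
The counting behind this table can be sketched in a few lines of Python. This is only an illustration, not a description of how Excel is implemented: the function name histogram_table is hypothetical, and the data list is reconstructed from the counts reported above (13, 13, and 14) rather than read from the workbook. The sketch assumes that each value is counted in the first bin whose value is greater than or equal to it, and that anything larger than the last bin goes to More.

```python
from bisect import bisect_left


def histogram_table(values, bins):
    """Return (bin label, frequency, cumulative %) rows for the given bins.

    A value is counted in the first bin whose boundary is >= the value.
    Values above the highest boundary are counted in a final "More" bin.
    """
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)           # the last slot is "More"
    for value in values:
        counts[bisect_left(bins, value)] += 1

    labels = [str(b) for b in bins] + ["More"]
    rows, running = [], 0
    for label, count in zip(labels, counts):
        running += count
        rows.append((label, count, round(100 * running / len(values), 2)))
    return rows


# Assumption: 13 rural (1), 13 suburban (2), and 14 urban (3) students.
school = [1] * 13 + [2] * 13 + [3] * 14
table = histogram_table(school, [1, 2, 3])
for label, count, cumulative in table:
    print(f"{label:>4}  {count:>3}  {cumulative:6.2f}%")
```

The More row comes out with a frequency of zero here, which matches the empty column that we removed from the chart.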

NOTE: You can change the values in range A2:A4 to their category labels
(Rural, Suburban, and City, respectively) by simply typing those
labels into those cells. The chart will update to display those labels
instead of their numeric codes.

We will now save the workbook.

5. To save the workbook, on the Quick Access Toolbar,

Generating a Second Histogram


In the last example, we created a histogram where each possible value of the
variable in question had its own column in the histogram. The variable
SCHOOL only takes on three different possible values and we created one
column for each value. Suppose, however, that we would like to create a histo-
gram for a variable that takes on many different possible values, and we want
each column of the histogram to represent a range of values.

Consider the MATHSCOR variable. It can take on values from 0 to 100, and
we would not want a histogram to have a separate column for each of those
values. We might, however, like to create a histogram for MATHSCOR where each
column holds all of the values in a range of ten. In other words, the first
column would account for all MATHSCOR values that fall between 0 and 10, the
next column from 11 to 20, and so on.
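
As a rough sketch of this grouping rule, the Python function below returns the upper boundary of the range that a score falls into. The name bin_upper_bound and the sample scores are hypothetical, and the sketch assumes whole-number or decimal scores between 0 and 100.

```python
import math


def bin_upper_bound(score, width=10):
    """Return the upper boundary of the range that contains the score."""
    if score <= width:
        return width                      # scores 0 through 10 share the first column
    return math.ceil(score / width) * width


for score in (0, 10, 11, 20, 57, 100):
    print(score, "->", bin_upper_bound(score))
```

A score that sits exactly on a boundary, such as 20, stays in the lower range (11 to 20) rather than moving up to the next one.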



To begin construction of such a histogram we must change to the Hist2 work-
sheet of our workbook, where the MATHSCOR data and the set of proposed
classes have been recorded.

1. To access the Hist2 worksheet, at the bottom of the sheet,

Click the Hist2 worksheet tab

Now we will need to make new bins for each multiple of ten.

2. To select the heading cell of the bin range,

Click cell C1

3. To name this bin range, type:

BIN (press Enter)

4. To label the first bin, in cell C2, type:

10 (press Enter)

5. To label the next bin, in cell C3, type:

20 (press Enter)

Next we will use the AutoFill feature to finish the cells.

6. To select the two numbers we have typed,

Drag across the range C2:C3

We will use the fill handle to fill in the rest of the range to the value of 100.

7. To perform the AutoFill operation,

Point to the fill handle, drag to cell C11

The selected cells are filled with multiples of 10, ending with 100.

8. To deselect the cells,

Click any other cell
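
When two numbers are selected, AutoFill continues them as a linear series, using the difference between the two cells as the step. A minimal Python sketch of that idea follows; the function name autofill_linear is hypothetical.

```python
def autofill_linear(first, second, cell_count):
    """Extend a two-value series to cell_count cells using a constant step."""
    step = second - first
    return [first + i * step for i in range(cell_count)]


# C2 holds 10, C3 holds 20, and the range C2:C11 contains ten cells.
bins = autofill_linear(10, 20, 10)
print(bins)
```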



Generating the Histogram
Now that we have made the bins, we are ready to generate the histogram. This
option is located on the Data tab.

1. To access the Data tab, on the Ribbon,

Click the Data tab

2. To see the list of Data Analysis options, in the Analysis group,

The Data Analysis dialog box opens.

3. To select the Histogram tool, if necessary,

Click Histogram

4. To confirm the selection and begin generating the histogram,

The Histogram dialog box opens. The input range and the bin range contain
the data used in the previous exercise. We will replace these ranges. We
could simply type in the new values, but this time we will select the cells
with the mouse.

5. To designate the input range for the variable MATHSCOR, in the Input
Range field,

The Histogram dialog box is minimized to a simple title bar and data entry
field, as shown below:

This minimized dialog box contains previous input range data, but this will
not prevent us from selecting the new range.



6. To select the input range, in column A,

Drag across the range A1:A41

You see the selected range in the entry field:

7. To confirm this range and return to the Histogram dialog box, press:

The Histogram dialog box returns. We now select the bin range.

8. To designate the bin range, in the Bin Range field,

9. To select the bin range with the mouse,

Drag across the range C1:C11

10. To confirm the selection and return to the Histogram dialog box, press:

All of the options we selected in our previous histogram (Labels, Cumulative
Percentage, and Chart Output) should remain selected. We will now
specify an Output Range starting at cell D1, since we want to generate the
histogram in this worksheet.

11. To create the histogram in this worksheet, in the Output options section,

Click the Output Range radio button

The Output Range field becomes active. Now, we will specify that we want
the histogram to be generated starting at cell D1.

12. To specify the first cell of the output range,

Click in the Output Range field, type: d1



We see the completed Histogram dialog box:

13. To continue,

Excel generates the output and places it in the range starting with cell D1.

Let’s enlarge the chart.

14. To prepare to enlarge the chart,

Click on a blank area of the chart

You see a set of resize handles appear on the border of the chart.

15. To enlarge the chart,

Drag the lower-right corner handle down vertically

When the mouse is released, the columns lengthen and become easier to
read.



16. Resize as needed until the chart is large enough to display all the numbers.

The chart should look something like:

Looking at the results in the table we see that the values of our data are
concentrated in bins 40, 50 and 60 (i.e., in the range 31-60). Re-
member the bins in the table represent the scores of 0-10, 11-20, 21-30, and
so on. The data show that 87.5% of students scored 69 or below. The chart
displays the same results.

Exploring Multiple Variables


Thus far we have examined variables in isolation from one another. We have
determined the means and standard deviations of variables as well as their
frequency distributions. Determining these measures is a first step toward
getting a good understanding of the characteristics of the data as well as a
start to more elaborate measures and procedures.

The next stage of data analysis might be examining how variables interact
with one another. Often the most interesting questions and the focus of most
scholarly research seek to define the relationship between two or more
variables as well as to determine the strength of the relationship.



Creating Pivot Tables
Pivot tables are one of Excel’s tools for summarizing two or more variables.
In a simple two-variable pivot table, the values of one variable are
cross-tabulated against the corresponding values of the second variable. For
example, the pivot table below shows how Democrats, Republicans, and
Independents cast their votes for candidates from three different parties.

Count of Party Vote

Party          Democrat cand.   Republican cand.   Independent cand.   Total
Democrats          66.67%            22.22%              11.11%        100.00%
Republicans         0.00%            80.00%              20.00%        100.00%
Independents       30.00%            40.00%              30.00%        100.00%
Grand Total        31.03%            48.28%              20.69%        100.00%

The Democrats, Republicans, and Independents are listed under the Party
heading. Reading the Democrats row, note that about 67% of the Democrats
voted for the Democratic candidate, 22% cast their vote for the Republican
candidate, and 11% voted for the Independent candidate. The last row in the
table, Grand Total, shows each candidate’s share of all votes cast. The
Democratic candidate received approximately 31% of the total vote, the
Republican candidate received 48%, and the Independent received 21%.
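
To see where such percentages come from, here is a short Python sketch that turns vote counts into row percentages. The counts are an assumption: they are one set of whole numbers (29 voters in all) that reproduces the percentages described above, not figures taken from the workshop data. The helper name percentages is hypothetical.

```python
def percentages(counts):
    """Express each count as a percentage of the total of the counts."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


# Assumed counts per party: votes for the Democratic, Republican,
# and Independent candidates, in that order.
votes = {
    "Democrats":    [6, 2, 1],
    "Republicans":  [0, 8, 2],
    "Independents": [3, 4, 3],
}

row_pct = {party: percentages(counts) for party, counts in votes.items()}
column_totals = [sum(column) for column in zip(*votes.values())]
grand_total_pct = percentages(column_totals)

for party, pct in row_pct.items():
    print(party, pct)
print("Grand Total", grand_total_pct)
```

Each row is divided by its own total, which is why every row adds up to 100%. The Grand Total row is computed from the column totals, not by averaging the rows.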

The example we will now explore looks at the relationship between the type of
school system and the amount of computer experience of its students. We
wish to know the percentage of students in each school system that is at each
level of computer experience. In this example, we will use the variables EXP
and SCHOOL. Recall that EXP is a measure of computer experience where 1
indicates one year or less experience, 2 indicates about two years, and 3 indi-
cates three or more years of experience.

We will create a pivot table that examines the relationship between school
system and computer experience. The rows will represent the levels of the
SCHOOL variable, and the columns will be the levels of EXP.

Accessing the Pivot Table Options


To begin generating the pivot table, we must return to our stat_data work-
sheet.



1. To view the stat_data worksheet, at the bottom of the sheet,

Click the stat_data worksheet tab

We will now access the pivot table options via the Insert tab.

2. To activate the Insert tab, on the Ribbon,

Click the Insert tab

3. To begin creating the pivot table, in the Tables group,

The Create Pivot Table dialog box opens:

By default, Excel has selected all the data as the table range. We want to
change this to select only SCHOOL and EXP.

4. To indicate the range in which our data are located, in this case SCHOOL
and EXP, in the Table/Range field, type:

c1:d41

5. To place the pivot table in a new worksheet,

Click the New Worksheet radio button



6. To proceed to the next step,

The dialog box closes and the pivot table layout is opened in a new work-
sheet.

To the far right, you see the PivotTable Field List:

The Field List is divided into two sections. The top section lists the fields
that we selected for the pivot table - EXP and SCHOOL. The bottom sec-
tion lists the areas to which we can apply the fields.

NOTE: It may be necessary to widen the Field List to view all of the options in
the bottom section.

We are now ready to specify the layout of the pivot table.

Creating the Pivot Table


Our table will list the different computer experience categories (EXP) as
column headings and the different school systems (SCHOOL) as row head-
ings. The body of the table will list, for each school system, the percentage of
students who fall into the three EXP categories.
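
The layout just described is a cross-tabulation. As a minimal sketch of the idea, the Python code below counts (SCHOOL, EXP) pairs and converts each row of counts to percentages. The ten records are made up for illustration and are not the workshop data.

```python
from collections import Counter

# Hypothetical (SCHOOL, EXP) pairs, one per student.
records = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 2),
           (2, 3), (3, 3), (3, 3), (3, 1), (3, 3)]

counts = Counter(records)                      # how many students per pair
schools = sorted({school for school, _ in records})
levels = sorted({exp for _, exp in records})

row_percent = {}
for school in schools:
    row_total = sum(counts[(school, exp)] for exp in levels)
    row_percent[school] = [
        round(100 * counts[(school, exp)] / row_total, 1) for exp in levels
    ]

print("EXP levels:", levels)
for school, pct in row_percent.items():
    print("SCHOOL", school, pct)
```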

1. To specify EXP as a column label in the pivot table, from the top of the
Field List,

Drag EXP down to the Column Labels area

Now we are ready to list SCHOOL as a row label.



2. To specify SCHOOL as a row label in the pivot table, from the top of the
Field List,

Drag SCHOOL down to the Row Labels area

Now we can turn our attention to the Values area. In this area, we place the
fields which we want to summarize. The purpose of our pivot table is to
show us the percentage of students who fall into the three EXP categories.
Therefore, the EXP field is our value field. We will place the EXP field in
the Values area, and then we will modify the field to suit our purpose.

3. To place EXP in the Values area,

Drag EXP to the Values area

The lower portion of the field list should look like:

When we placed EXP in the Values area, it defaulted to a “Sum of EXP.”
However, we want a count of the number of students at each experience
level. Once we have the count, we can easily calculate the percentage. To
do this, we will modify the Sum of EXP field.

4. To access the settings for the Sum of EXP field, in the Values area,

Click the Sum of EXP field, click Value Field Settings...

The Value Field Settings dialog box opens. First, we will change the func-
tion being performed on EXP to a count. This will count the students. Then,
we will show the count values as row percentages.



5. To change the selected field into a count, in the “Summarize value field
by” area,

Click Count

The Value Field Settings dialog box should look like:

The current table generates a count of all the students in an experience
category. What we want to know, however, is the percentage of students in
each school system who are in each of the computer experience categories.
We must therefore convert these EXP counts to percentages of each row
(i.e., of each school system). To do so, we must activate the “Show values
as” tab.

6. To view the appropriate option for percentages,

Click the “Show values as” tab



7. To generate the row percentages, in the “Show values as” field,

Click the drop-down arrow, click % of row

The Value Field Settings dialog box now should look like:

8. To accept these settings and generate the pivot table,

We see the pivot table appear:

The pivot table provides much information. However, the generic labels
may make it difficult to understand. We can easily rename any of the labels.

9. To rename the Column Labels,

Click B3, type: Experience

10. To rename the row labels,

Click A4, type: Type of School

The labels are renamed and the pivot table is easier to understand.



To make this table even easier to understand, we can replace the numeric
codes for the category labels by typing the desired values in the correct
cells. The Row Labels would be Rural, Suburban, and City, respectively. The
Column Labels would be “One year or less”, “About two years”, and “Three
years or more”.

11. To name the new worksheet,

Double-click the Sheet4 worksheet tab, type: Pivot (press Enter)

The results of this exercise are summarized below.

Experience is the column variable and type of school system is the row
variable. We see that 46% of those students in rural school systems
(SCHOOL=1) have had one year or less experience with computers (EXP=1).
Almost 54% of students from suburban school systems (SCHOOL=2) have about
2 years of computer experience, while approximately 57% of students from
urban school systems (SCHOOL=3) have had 3 or more years of computer
experience.

The Grand Total row at the bottom of the table indicates the percentage of
students in each experience category. Almost 38% of all the students have
had 1 year or less computer experience, 35% have had about 2 years of
experience, and almost 28% have had 3 or more years of experience.

NOTE: To run a simple check of statistical independence based on probability
theory, we can generate a table that computes overall percentages (%
of total) instead of row percentages. Once this table is generated,
the two variables are independent in the sample if each joint
probability (located in the body of the table) is equal to the product
of its two marginal probabilities (located at the edge of the table).
With real data the two will rarely match exactly, so look for values
that are approximately equal.
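
The check described in this note can be sketched in Python as follows. Both small tables of counts are hypothetical, as is the function name independence_gaps. For each cell, the function returns the joint proportion minus the product of the two marginal proportions, so values near zero are consistent with independence.

```python
def independence_gaps(counts):
    """Return joint proportion minus product of marginals for every cell."""
    total = sum(sum(row) for row in counts)
    row_marginals = [sum(row) / total for row in counts]
    col_marginals = [sum(col) / total for col in zip(*counts)]
    return [
        [cell / total - row_marginals[i] * col_marginals[j]
         for j, cell in enumerate(row)]
        for i, row in enumerate(counts)
    ]


# Hypothetical counts: rows and columns are unrelated in the first table
# and perfectly related in the second.
independent = [[10, 20], [20, 40]]
dependent = [[30, 0], [0, 30]]

print(independence_gaps(independent))
print(independence_gaps(dependent))
```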

12. To save the workbook, on the Quick Access Toolbar,

Performing a t-Test
A t-test is an inferential statistical analysis used to decide whether two
population means are the same or different. The theory of t-tests is rather
involved, and it is not within the scope of this workshop to discuss in depth
how the t-test works. However, a brief explanation will help us understand
how we can use Excel to perform a t-test.



Suppose that we have two samples labeled “Group A” and “Group B” with
distributions shown as follows:
[Figure: distribution curves for Group A and Group B, plotted with X from
-2 to 2 and Y from 0.0 to 0.4]

By looking at the mean and the variance of each sample, we can make an
inference about the likelihood that these two samples are from two populations
with the same mean. The null hypothesis is that the two samples come from
populations with the same mean. The alternative hypothesis is usually that the
two samples come from populations with different means.

With a t-test, we calculate the probability of obtaining two sample means as
far apart or farther apart than the observed means if the null hypothesis is
true. The typical interpretation is that if this probability is 5% or less,
then the results are so unlikely under the null hypothesis that we reject it.
That is, we reject the null hypothesis that the samples were drawn from
populations with the same mean, and conclude that the samples come from
populations with different means. It is this sort of inference that leads to
the term “inferential statistics.”
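
The phrase “as far apart or farther apart than the observed means” can be made concrete with a small Python sketch. Note that this is an exact permutation test, a different technique from the t-test that Excel runs, chosen here only because it shows the idea directly: it tries every possible way of splitting the pooled values into two groups and counts how often the group means are at least as far apart as observed. The two tiny samples and the function name are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Share of all regroupings whose mean gap is >= the observed gap."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen_set]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance so that ties with the observed gap are counted.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical samples that do not overlap at all.
p = permutation_p_value([1, 2, 3, 4], [5, 6, 7, 8])
print(round(p, 4))
```

Only 2 of the 70 possible regroupings are as extreme as the observed one, so the probability is below 5% and we would reject the null hypothesis for these made-up samples.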

Sorting the Data


We will test to see whether the mean of the variable MATHSCOR is signifi-
cantly different for two different groups. One group will be those students who
have a MANX of 0, the other will be those students who have a MANX of 1.
So, our dependent variable will be MATHSCOR, and our independent vari-
able will be MANX.

The first thing we will have to do is to sort our data according to the MANX
variable so that we can separate the two relevant groups of students in the
spreadsheet.
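
The effect of this sort can be sketched in Python. The five records are hypothetical; each one holds an ID, a MANX value, and a MATHSCOR value. Sorting by MANX places each group in one contiguous block, and sorting by ID afterwards restores the original order.

```python
# Hypothetical records: (ID, MANX, MATHSCOR), originally in ID order.
records = [(101, 1, 40), (102, 0, 62), (103, 0, 55), (104, 1, 35), (105, 0, 48)]

# Sort by MANX so that the MANX=0 rows come first, then the MANX=1 rows.
by_manx = sorted(records, key=lambda record: record[1])

low_anxiety = [score for _, manx, score in by_manx if manx == 0]
high_anxiety = [score for _, manx, score in by_manx if manx == 1]

# Sorting by ID again returns the rows to their original order.
restored = sorted(by_manx, key=lambda record: record[0])

print(low_anxiety, high_anxiety)
```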

1. To return to our data,

Click the stat_data worksheet tab



2. To return to the Home tab, on the Ribbon, if necessary,

Click the Home tab

3. To begin the process of sorting the data,

Click cell AB1

4. To prepare to sort the data, in the Editing group,

Click Sort & Filter, click Sort A to Z

Excel rearranges the data in ascending MANX order.

NOTE: Excel will not perform the t-test correctly unless you sort the data
prior to running the test, because each group must occupy its own
contiguous range of cells. Be aware that sorting your worksheet by
MANX rearranges the order of your entire worksheet. We will return
the worksheet to its original order after we have performed the t-test.

The type of t-test we run will depend on whether the MATHSCOR variances
of the two MANX groups are equal or not. Therefore, we have two options.
The first is to simply run a t-test assuming unequal variances, since this
test is the more robust of the two. Another option is to first run a test for
unequal variances, and then choose the type of t-test depending on the
result. We will do the latter today.

Running an F-Test for Unequal Variances


Excel provides an F-test that we will use to test whether the variances of the
MATHSCOR observations for the two MANX groups are significantly differ-
ent from each other. We will run this test now, using the data analysis feature.

1. To switch to the Data command tab, on the Ribbon,

Click the Data tab

2. To see the list of data analysis options, in the Analysis group,



3. To select the F-test,

Click F-Test Two-Sample for Variances

4. To continue,

We see the F-test Two-Sample for Variances dialog box.

First we need to specify the data range for the two MANX groups. The data
with a MANX score of 0 will be specified as Variable 1, and the data with
a MANX score of 1 will be specified as Variable 2.

5. To specify the Variable 1 Range, type:

y2:y29 (press Tab)

6. To specify the Variable 2 Range, type:

y30:y41

7. To name the new worksheet,

Click in the “New Worksheet Ply:” field, type: Ftest

The F-Test Two-Sample for Variances dialog box should look like:



8. To run the F-test,

We see a summary table of our F-test results.

Notice that the columns are not wide enough. Let’s widen them before
continuing.

9. To increase the column widths,

Double-click between the column headings A and B, B and C, C and D

10. Deselect the cells.

We now have the results of our F-test.

Interpreting the Results


Cells B5 and C5 show the variances for the two samples: 191.68 and 231.45
respectively. The sample variances are clearly different, but we have to ask
another question before deciding whether the population variances are
statistically different. What we want to know is the following: Assuming that
the null hypothesis is true (i.e., that the population variances are the
same), what is the probability that we would find this much of a difference
in the sample variances?

The answer to this question is the “p-value” in cell B9; it is approximately
0.329 for a one-tailed test. Since we are dealing with a two-tailed test in this
case (variances equal vs. variances unequal), the p-value is approximately
0.329*2 = 0.658. By convention, we reject the null hypothesis at 5%
significance if this value is less than 0.05. Thus, we fail to reject the null
hypothesis in this case and conclude that the MATHSCOR variances of the two
MANX populations (0 and 1) are not statistically significantly different.
Hence, we will run our t-test assuming equal variances.
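
The arithmetic in this interpretation can be laid out as a short Python sketch. The two variances and the one-tailed p-value are the figures quoted above; the variable names are ours, and the code does not compute the p-value itself, it only doubles the one that Excel reports and applies the 5% rule.

```python
variance_1 = 191.68        # MANX = 0 group, from cell B5
variance_2 = 231.45        # MANX = 1 group, from cell C5
one_tailed_p = 0.329       # approximate value reported in cell B9

# The F statistic is the ratio of the two sample variances.
f_statistic = variance_1 / variance_2

# For a two-tailed test, double the one-tailed p-value (capped at 1).
two_tailed_p = min(1.0, 2 * one_tailed_p)

reject_null = two_tailed_p < 0.05
print(round(f_statistic, 3), round(two_tailed_p, 3), reject_null)
```

Because reject_null is False, the sketch reaches the same decision as the text: treat the variances as equal.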

Running the T-Test


Now we can run the t-test. First, we must go back to the stat_data worksheet.

1. To access the stat_data worksheet,

Click the stat_data worksheet tab



2. To see the list of data analysis options, in the Analysis group,

3. To select the t-test,

Click t-Test: Two-Sample Assuming Equal Variances

4. To continue,

We see the t-Test: Two-Sample Assuming Equal Variances dialog box.

Again, we need to specify the data range for the two MANX groups.

5. To specify the Variable 1 (MANX=0) Range, type:

y2:y29 (press Tab)

6. To specify the Variable 2 (MANX = 1) Range, type:

y30:y41

7. To name the new worksheet,

Click in the “New Worksheet Ply:” field, type: Ttest

We see the t-Test: Two-Sample Assuming Equal Variances dialog box
with our specifications:



8. To run the t-test,

We see a summary table of our t-test results.

Let’s widen our columns before continuing.

9. To increase the column widths,

Double-click between the column headings A and B, B and C, C and D

10. Deselect the cells.

Your t-test worksheet now looks like the one shown below:

Interpreting the Results


Cells B4 and C4 show the sample means for the two groups: 53.75 and 37,
respectively. Intuitively, this looks like a fairly large difference. But
what we would really like to know is the following: Assuming that the null
hypothesis is true (i.e., that the population means are the same), what is the
probability that we would find this much of a difference in the sample means?
The answer to this question is the “P-value” in cell B13; it is right around
0.002.

By convention, we reject the null hypothesis at 5% significance if this value is
less than 0.05. Thus, we reject the null hypothesis in this case and conclude
that there is a statistically significant difference between the two populations
for the values of MATHSCOR. In other words, the students with low Math
anxiety have an average Math test score that is significantly different from the
average Math test score of students with high Math anxiety.
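
For readers who want to see how an equal-variance t statistic is built from summary figures, here is a hedged Python sketch. It assumes the group sizes implied by the ranges used above (28 values in Y2:Y29 and 12 values in Y30:Y41), the rounded means quoted in this section, and the sample variances quoted in the F-test section. Because those inputs are rounded, the result may differ slightly from the value in Excel’s output.

```python
import math

n1, mean1, var1 = 28, 53.75, 191.68    # MANX = 0 group
n2, mean2, var2 = 12, 37.0, 231.45     # MANX = 1 group

degrees_of_freedom = n1 + n2 - 2

# Pooled variance: a weighted average of the two sample variances.
pooled_variance = ((n1 - 1) * var1 + (n2 - 1) * var2) / degrees_of_freedom

# Standard error of the difference between the two sample means.
standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))

t_statistic = (mean1 - mean2) / standard_error
print(degrees_of_freedom, round(t_statistic, 2))
```

The t statistic comes out at roughly 3.4 with 38 degrees of freedom. A value that far from zero is what produces the small P-value discussed above.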



Re-sorting the Data
Now that we are done, we will put the data back in its original order, arranged
by student ID.

1. To return to the stat_data worksheet,

Click the stat_data worksheet tab

2. To begin the process of sorting the data, if necessary,

Click cell AB1

3. To prepare to sort the data, on the Data tab, in the Sort and Filter group,

The Sort dialog box opens.

The “Sort by” field contains the value last used, sort by MANX. That value
needs to be changed to ID. It should also be set for ascending order to re-
store the data to its original sort order.

4. To change the sort field to ID, in the Sort by drop-down list,

Click the drop-down arrow, click ID

The Sort dialog box should look like this:



5. To accept these settings,

We see the data in its original arrangement.

6. To save the workbook, on the Quick Access Toolbar,

Finding Correlations
Another means of testing if any relationship exists between several variables
is determining whether they are correlated with one another. Correlation mea-
sures whether the magnitude of one variable predicts the magnitude of another
variable. A positive correlation means that large values in one variable are
associated with large values in another variable. An example of a positive cor-
relation would be height and weight. Tall people tend to weigh more than
short people. A negative correlation is when small values of one variable are
associated with large values in another. An example of a negative correlation
would be temperature and heating oil consumption: as the temperature outside
decreases, consumption of heating oil increases.

The strength of relationships between the variables is given by the value of the
correlation coefficient. Correlation coefficients can range from -1 to +1. A
value of -1 indicates a perfect negative correlation while a value of +1 indi-
cates a perfect positive correlation. In other words, the closer the relationship
between variables, the closer the correlation coefficient will be to 1 or -1. The
overwhelming majority of correlation coefficients will, however, be some-
where between these two values.
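
As an illustration of how a correlation coefficient is calculated, here is a small Python sketch of the Pearson formula. The function name pearson_r and all four data lists are hypothetical; they simply echo the height and weight and the temperature and heating oil examples mentioned above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: oil use falls by exactly 10 for every 10-degree rise.
temperature = [30, 40, 50, 60, 70]
oil_used = [50, 40, 30, 20, 10]

# Made-up data: taller people tend to weigh more, but not perfectly.
height = [150, 160, 170, 180, 190]
weight = [52, 60, 63, 75, 80]

r_oil = pearson_r(temperature, oil_used)
r_weight = pearson_r(height, weight)
print(round(r_oil, 3), round(r_weight, 3))
```

The first pair gives a perfect negative correlation because the points fall exactly on a downward-sloping line. The second pair gives a strong positive value that is still less than 1.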

The variables we will use for the correlation exercise are MATHSCOR,
COMPSCOR, CANX, and MANX.

Running the Correlation Procedure


By examining the results of this correlation procedure, we will be able to tell
how well the magnitude of any of the variables MATHSCOR, COMPSCOR,
CANX, and MANX is able to predict the magnitude of the others.

1. To perform the correlation procedure, in the Analysis Group,



2. To begin running a correlation,

Click Correlation

3. To continue,

The Correlation dialog box opens containing the options we must specify
for our correlation procedure.

The Input Range is the range of cells that contains all of the data. This
means we will include the addresses of all variables we wish to correlate.
For this exercise, the column for the first variable, MATHSCOR, begins with
its label in cell Y1. The last data point for the last variable, MANX, is in
cell AB41.

4. To specify the input range, type:

y1:ab41

NOTE: Because our data is grouped by columns, be sure the columns option is
selected. It should be the default option.

5. To indicate that the labels for our variables are in the first row,

Click the “Labels in First Row” checkbox

6. To name the new worksheet,

Click in the “New Worksheet Ply” field, type: Cor

The Correlation dialog box with the correct settings looks like:



7. To generate the Correlation results,

8. Deselect the table.

9. Adjust the column widths so that all of the cells’ contents are visible.

You should see the following results in the Cor worksheet:

Interpreting the Results


Several observations can be made from our results. First, the correlation
between scores on tests in a computer class (COMPSCOR) and the level of
computer anxiety (CANX) gives a moderate and positive coefficient of 0.657.
This means that high test scores in a computer class are associated with high
scores on the computer anxiety test, or low levels of anxiety about using
computers. (Remember, CANX is coded so that high scores reflect low levels of
computer anxiety.) We also see that there is very little relationship between
different levels of computer anxiety and scores on math tests (MATHSCOR).
This is seen in the low coefficient of 0.068. Finally, we see there is also a
very weak relationship between scores on math tests and scores on computer
tests, 0.149. In other words, performance in a computer class has little
relationship to performance in a math class.

Simple Linear Regression


We may suspect that two or more variables are related and wish to test this
relation using a statistical technique somewhat more powerful than a simple
correlation. Moreover, we want to know with some precision how changes in
one variable may be reflected in changes in another variable. For example, we
have seen in our earlier example that a relationship exists between levels of
computer anxiety and the scores students receive on computer tests. A
regression procedure will estimate how much a student’s level of computer
anxiety changes, on average, for each one-point change in the student’s test
score. In regression terminology, the level of computer anxiety would be the
predicted variable, and its value would be predicted by the students’ test
scores. The change per point estimated by the regression procedure is called
the coefficient. We can describe the model we will be testing as follows:
CANX = a + b(COMPSCOR).

CANX is our dependent variable, a and b are constants, and COMPSCOR is
our independent variable.
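To make the constants a and b concrete, here is a minimal Python sketch of an ordinary least-squares fit for one independent variable. The function name fit_line and the sample points are assumptions made for illustration; the workshop itself obtains a and b from Excel's Regression tool.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope: change in y per one-unit change in x
    a = mean_y - b * mean_x    # intercept: predicted y when x is 0
    return a, b

# Hypothetical points that lie exactly on y = 2 + 0.5x
a, b = fit_line([0, 2, 4, 6], [2, 3, 4, 5])
print(a, b)  # prints: 2.0 0.5
```

In the workshop's model, xs would hold the COMPSCOR values and ys the CANX values, so b plays the role of the COMPSCOR coefficient and a the intercept.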

We must first switch back to the sheet containing our data.

1. To return to the main worksheet,

™ the stat_data worksheet tab

2. To perform the regression procedure, on the Data tab, in the Analysis group,

3. To select the Regression procedure,

™ Regression

4. To continue,

We see the Regression dialog box with options similar to those we have
seen in previous exercises.

To input the Y range, we type in the cell address of our dependent variable
CANX. In this case the address is AA1:AA41. Note that in regression
terminology, Y typically denotes the dependent variable.

5. To enter the Y range, in the Input Y Range box, type:

aa1:aa41 ø

To input the X range, we type in the cell address of our independent
variable COMPSCOR. In this case the address is Z1:Z41.

6. To enter the X range, in the Input X Range box, type:

z1:z41

7. To indicate the variable labels in the first row,

™ the Labels checkbox

8. To name the new worksheet,

™ in the “New Worksheet Ply” field, type: Reg

Our completed Regression dialog box looks like the one below:

9. To complete the procedure,

Excel runs the regression and sends the output to a new worksheet.

10. Deselect the summary table.

11. Adjust the column widths of the table.

Interpreting the Results


Our output contains quite a bit of information about the relationship between
CANX and COMPSCOR, much more than contained in a simple correlation.
The output is divided into three sections. We will limit our analysis to the
more frequently interpreted results, those in the first and third sections.

The first section, “Regression Statistics,” gives us an overview of how well
our model performed. It tells us how well we were able to account for the
anxiety students feel about computers using only the information from their
test scores. This performance indicator is the R Square statistic. The 0.43
tells us that 43% of the variation in student computer anxiety can be accounted
for by variation in how well they did on computer tests.
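As a rough check on this figure, R Square can be written out under the standard definition for a regression that includes an intercept. The symbols below (SS for sums of squares, y-hat for predicted values, y-bar for the mean) are our own notation, not labels from Excel's output. With one independent variable, R Square also equals the square of the correlation coefficient reported earlier.

```latex
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}},
\qquad
SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2,
\qquad
SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2

% Simple linear regression: R^2 equals the squared correlation coefficient
R^2 = r^2 = (0.657)^2 \approx 0.43
```

The second line agrees with the 0.43 reported in the Regression Statistics section.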

The third section gives us more specific information on the relationship
between test scores and anxiety. In particular, the coefficient for COMPSCOR,
0.133, indicates that for each additional point of a student’s score on a
computer test, their predicted CANX score increases by 0.133. Remember
that the higher the CANX value, the more comfortable the student feels
with computers. This makes sense; students who tend to do well on tests in
their computer class also tend to have lower levels of anxiety about computers.

The intercept coefficient is 23.19. This is the predicted CANX score for a
student whose score on computer tests was zero. A student who scored a zero
would have a higher predicted level of computer anxiety (a lower CANX score)
than one who scored some number of points on the test.

Creating a Scatterplot Chart


When working with correlations and regressions, it is often beneficial to chart
the data in a scatterplot. In the intermediate stage of data analysis, a
scatterplot makes it easy to check for anomalies or data entry mistakes.
Because a scatterplot shows all of the data at once, it is easier to check for
trends and outliers. Additionally, the visual representation of data on a chart
can serve as a quick summary of the data trends in presentations or reports.

We will first create a scatterplot of COMPSCOR against CANX, then add a
regression line (Excel refers to them as trendlines). The regression line is a
plot of the regression equation determined using the regression coefficient and
the regression constant; it is the line that provides the “best fit” for the
raw data pairs. This line is plotted on the same chart as the scatterplot to
help show the linear relationship between the independent and dependent
variables.

NOTE: This line is also identical to the line that we generated when we ran the
regression earlier in this workshop. This is true in all cases of simple
linear regression (one y and one x variable).

Let’s first select the data we wish to chart, and then generate a scatterplot.

1. To return to the main worksheet,

™ the stat_data worksheet tab

2. To select the first data set, COMPSCOR, in the column selection area,

The COMPSCOR column is now selected.

The second column we need to select is the CANX data, in column AA.
This is a contiguous selection; therefore we will use a selection technique
that allows for this type of selection.

3. To select the CANX data, in the column selection area,

press and hold º and ™

Both the COMPSCOR and the CANX columns are selected.

We will create a scatterplot of this data.

4. To switch to the Insert tab, on the Ribbon,

™ the Insert tab

5. To generate the scatterplot, in the Charts group,

™ ,™

The scatterplot opens in the current worksheet. Though the data was
charted accurately, Excel’s default choices for labeling and formatting leave a
lot to be desired.

Creating a Title and Axis Labels


Let’s adjust the chart title and labels. We will adjust the formatting after
placing the chart into a separate worksheet.

1. To create an appropriate chart title,

™ the title, £ the title text,

type: Correlation - COMPSCOR vs. CANX

2. To begin to provide a label for the X (independent variable) axis, under
Chart Tools on the Ribbon,

™ the Layout tab

3. To provide a horizontal axis title, in the Labels group,

™ , § Primary Horizontal Axis Title,

™ Title Below Axis

A title placeholder appears below the X-axis. We will name the X-axis
COMPSCOR.

4. To name the X-axis, type:

COMPSCOR ©

5. To provide a label for the Y (dependent variable) axis, in the Labels group,

™ , § Primary Vertical Axis Title,

™ Rotated Title

A title placeholder appears to the left of the Y-axis. We will name the
Y-axis CANX.

6. To name the Y-axis, type:

CANX ©

The chart and the axes now have appropriate labels.

We need to make one more small adjustment to this chart. Let’s remove the
Legend box since the data it contains is meaningless.

7. To access the legend controls, in the Labels group,

™ , ™ None

The legend is removed from the chart.

Let’s place the chart in a new sheet.

8. To switch to the Design tab, on the Ribbon,

™ the Design tab

9. To change the location of the chart, in the Location group,

™ , ™ the New Sheet radio button

10. To name the sheet, in the New Sheet field, type:

scatterplot

11. To complete the process,

The chart appears as a separate sheet in the workbook.

Formatting the Chart


While the chart is technically correct, there are some formatting elements that
make the chart less efficient to read and understand.

Let’s adjust the axes so our data fills more of the available chart area.

1. To switch to the Layout tab, on the Ribbon,

™ the Layout tab

2. To begin to adjust the X-axis, in the Axes group,

™ , § Primary Horizontal Axis,

™ More Primary Horizontal Axis Options...

The Format Axis dialog box opens.

We want to adjust the scale of the axis; namely, we want the leftmost point
to be 20 rather than 0, since we have no data points between 0 and 20.

3. To set the minimum value for the X-axis, in the Minimum area,

™ the Fixed radio button, ¢ the value, type: 20

4. To return to the chart,

Our data is now more centered on the X-axis.

Let’s do the same to the Y-axis; again, since there are no data points
between 0 and 20, we will simply adjust the minimum scale on the Y-axis.

5. To begin to adjust the Y-axis, in the Axes group,

™ , § Primary Vertical Axis,

™ More Primary Vertical Axis Options...

The Format Axis dialog box opens.

6. To set the minimum value for the Y-axis, in the Minimum area,

™ the Fixed radio button, ¢ the value, type: 20

7. To return to the chart,

Our data is now more centered on the Y-axis.

Our final task is to add a regression line to this scatterplot graph.

Adding a Regression Line


Excel can produce a regression line for any given data set; as stated
previously, this line is simply the “best fit” line that summarizes the
relationship of the raw data pairs. A regression line can be useful for
predicting probable values for new observations; specifically, the line’s
equation gives a predicted y value for any x value, which also makes it a way
to extrapolate beyond the observed data. Additionally, it enables one to see
how closely specific data points (upon which the regression line is based)
adhere to the data set’s aggregate description (the regression line itself).
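One way to put a number on how closely a data point adheres to the regression line is its residual: the observed value minus the value the line predicts. The short Python sketch below uses the intercept (23.19) and slope (0.133) reported in this workshop's regression output. The function name residuals and the three sample points are hypothetical and are not taken from stat_data.

```python
# Coefficients reported in this workshop's regression output
INTERCEPT = 23.19
SLOPE = 0.133

def residuals(points):
    """Observed CANX minus the CANX predicted by the regression line."""
    return [y - (INTERCEPT + SLOPE * x) for x, y in points]

# Hypothetical (COMPSCOR, CANX) pairs for illustration only
sample = [(40, 30), (60, 29), (80, 35)]

for (x, y), res in zip(sample, residuals(sample)):
    # A positive residual means the point sits above the line
    print(f"COMPSCOR={x} CANX={y} residual={res:+.2f}")
```

Points with residuals near zero lie close to the trendline, while large positive or negative residuals flag observations the line describes poorly.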

1. To add a regression line, in the chart,

µ any data point, ™ Add Trendline...

The Format Trendline dialog box appears:

The default Trend/Regression type, Linear, is appropriate for our data.

Let’s make the regression equation visible as a part of the chart.

2. To make the equation visible, near the bottom of the dialog box,

™ the “Display Equation on chart” checkbox

Regression can also help us to determine how much variance the
independent variables explain in the dependent variable. This information is
summarized by the R-square statistic. R-square ranges from 0 to 1, where 0
means the independent variables explain no variance and 1 means the
independent variables explain all the variance in the dependent variable. In
simple linear regression, with only one independent variable, the square root
of R-square is equal to the absolute value of the correlation coefficient
between the independent variable and the dependent variable.

The R-square statistic shows this regression model explains about 43% of
the variability in the dependent variable, CANX.

3. To make the R-square value appear on the chart,

™ the “Display R-squared value on chart” checkbox

4. To return to the chart,

The trendline and the equation, y = 0.133x + 23.19, appear on the chart:

5. To adjust the placement of the equation, if necessary,

¢ the equation to a blank portion of the plot area

6. To save the workbook, on the Quick Access Toolbar,

Further Interpreting the Results


Both the simple regression analysis and the scatterplot trendline yield the
same equation; when we put the coefficients into our model, it yields:

CANX = 23.19 + 0.133(COMPSCOR)

With the information from our output we can predict a level of computer
anxiety if we have a test score.

For example, a student with a test score of 70 would have a predicted anxiety
score of:

CANX = 23.19 + 0.133 * 70 = 32.5

A student with a test score of 30 would have a predicted anxiety score of:

CANX = 23.19 + 0.133 * 30 = 27.18

The second student, with the lower test score, is predicted to have higher
anxiety about computers (a lower CANX score) than the student who had a higher
test score.
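The two predictions above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name predict_canx is our own, and the default coefficients are the rounded values from the regression output.

```python
def predict_canx(compscor, intercept=23.19, slope=0.133):
    """Predicted CANX score for a given COMPSCOR under the fitted model."""
    return intercept + slope * compscor

for score in (70, 30):
    # Round to two decimals to match the precision used in the text
    print(score, round(predict_canx(score), 2))
```

Because the coefficients are rounded, predictions computed this way may differ slightly from values calculated inside Excel with the full-precision coefficients.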

Exiting Excel
We are now finished with today’s exercises, so let’s exit Excel.

1. To quit Excel,

™ ,™ Exit Excel

Wrapping Up
We’ve reached the end of today’s workshop. Please follow your workshop
instructor’s guidance and take a few moments to fill out the workshop
evaluation form.

Also, before leaving, please log off your computer.

Thank you for participating in


Excel: Statistical Features

Contributions to These Materials


Project Leader      Greg Hanek

Development Team    Rachel Anderson
                    Carol Cobine
                    Angela Henry
                    April Law
                    Rita Pavolka
                    Chris Payne

Where to Go From Here
You can use the resources listed below to further build your computing skills.

Taking Other IT Training & Education Workshops


UITS IT Training & Education offers hands-on instructor-led computing
workshops aimed at a variety of skill levels, covering a broad range of topics.
We teach hundreds of workshops on more than 80 topics every year! For more
information, to see a detailed workshop schedule, or to register for a
workshop, contact IT Training & Education:

Web: http://ittraining.iu.edu/
Email: (IUB) ittraining@indiana.edu; (IUPUI) ittraining@iupui.edu
Phone: (IUB) 812/855-7383; (IUPUI) 317/274-7383

Getting Help from Online Resources


University Information Technology Services – IU technology resources,
services and support:
http://uits.iu.edu/

IT Training Online – Self-paced IT courses you can take on your computer:


http://ittraining.iu.edu/online/

UITS Knowledge Base – Searchable database of computing questions:


http://kb.iu.edu/

Getting Help from Support Staff


Walk-in Support

(All IU Campuses) Walk-in Support Center. Locations and schedules at:


http://kb.iu.edu/data/abxl.html

(IUB & IUPUI) Consultants in the UITS Student Technology Centers

24 Hour Phone Support

(IUB) 812/855-6789
(IUPUI) 317/274-4357

E-mail Support

(All IU campuses) ithelp@iu.edu
