
Lecture 05 Measures of Central Tendency

There are three main measures of central tendency: the mean, median, and mode. The purpose of measures of central tendency is to identify the location of the center of a distribution. For example, let's consider the data below, which represents the number of miles per gallon that 30 selected four-wheel-drive sport utility vehicles obtained in city driving.

12 16 15 10 19

17 18 16 14 13

16 17 16 15 16

14 16 15 11 18

16 17 16 15 16

18 15 19 15 20

However, in its current form it is difficult to determine where the center of the above data set lies. One way to get a better idea of where the center of a distribution is located is to graph the data. Because the data is numerical, the most appropriate graph is a histogram. After inspecting the data, a bin size of 1 seems reasonable, with a starting point of 10 mpg and an ending point of 20 mpg. The histogram for the gas mileage data is given below.
[Figure: histogram "SUV Gas Mileage Data". Horizontal axis: Miles per Gallon (10 to 20); vertical axis: Frequency (0 to 10).]
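The bar heights of a histogram with a bin size of 1 are simply the counts of each value. The following is a minimal Python sketch of that tally, assuming the 30 readings are typed into a list; the names `mpg` and `freq` are illustrative and not part of the lecture.

```python
from collections import Counter

# The 30 city-driving mpg readings from the table above.
mpg = [12, 16, 15, 10, 19,
       17, 18, 16, 14, 13,
       16, 17, 16, 15, 16,
       14, 16, 15, 11, 18,
       16, 17, 16, 15, 16,
       18, 15, 19, 15, 20]

# With a bin size of 1, each bin holds exactly one integer value,
# so counting occurrences gives the height of each bar.
freq = Counter(mpg)

# Print a text histogram from 10 mpg to 20 mpg, one star per observation.
for value in range(10, 21):
    print(f"{value:>2} | {'*' * freq[value]}")
```

The tallest row of stars marks the value that occurs most often, which is a quick visual check on where the data piles up.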

If we rely on sight alone, it seems that the middle of the distribution lies at around 15 to 16 miles per gallon; however, because our senses can sometimes deceive us, we want to be a little more scientific in our methodology.

The Mode

The first measure of central tendency we will discuss is the mode. The mode is the observation that occurs most frequently. Thus, to find the mode for the above data set we simply locate the observation that occurs most frequently. In this case, the number 16 occurs 9 times, which is more than any other observation. Therefore, the mode of the data is 16.

The Median

The median is the middle observation in the data. This means that roughly 50% of the data is at or below the median and roughly 50% of the data is at or above the median. To find the median, we must first organize the data in order from the smallest to the largest observation. For example, the above gas mileage data would take on the following form:

10 11 12 13 14 14 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 17 17 17 18 18 18 19 19 20

To help us find the middle, or halfway point, probably the most intuitively appealing action would be to divide the number of observations (n or N) by 2. However, a better method is to divide n + 1 by 2, because this gives the position of the middle observation directly. In this case we have 30 observations, so our halfway point is $\frac{30 + 1}{2} = 15.5$. Next, to find the center, we count in 15.5 spaces, or observations, from the starting or ending point of the data. This puts us directly between the two bracketed 16s (the 15th and 16th observations).

10 11 12 13 14 14 15 15 15 15 15 15 16 16 [16 16] 16 16 16 16 16 17 17 17 18 18 18 19 19 20

We want the number between these two 16s. To find it, we add the two middle observations and divide by 2 (essentially taking the average of the two middle observations). Thus, the median for the data set is $\frac{16 + 16}{2} = 16$.

To see how we would determine the median for an odd number of data points, let's remove one of the points (the 10) so that instead of 30 observations we now have only 29.

11 12 13 14 14 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 17 17 17 18 18 18 19 19 20

We still need to find the location of the middle observation. In this case, the center of the data is $\frac{29 + 1}{2} = 15$ numbers in from either end of the data set.

When we count in 15 spaces from the starting and ending points of the data, we land on the same number, 16. Thus, our median, or the middle point of the data, is 16, as shown here (the middle observation is in brackets).

11 12 13 14 14 15 15 15 15 15 15 16 16 16 [16] 16 16 16 16 16 17 17 17 18 18 18 19 19 20
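The counting rule for the median can be sketched in Python as follows. This is only an illustration of the (n + 1) / 2 position rule under the assumption that the data fits in a list; the function name `median_by_position` is hypothetical and not part of the lecture.

```python
def median_by_position(data):
    """Find the median by counting in (n + 1) / 2 positions."""
    ordered = sorted(data)           # smallest to largest
    n = len(ordered)
    position = (n + 1) / 2           # 1-based position of the middle
    if position.is_integer():
        # Odd n: the position lands on a single observation.
        return ordered[int(position) - 1]
    # Even n: the position falls between two observations,
    # so average the one just below and the one just above.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


mpg = [12, 16, 15, 10, 19,
       17, 18, 16, 14, 13,
       16, 17, 16, 15, 16,
       14, 16, 15, 11, 18,
       16, 17, 16, 15, 16,
       18, 15, 19, 15, 20]

# Even n (30 values): averages the 15th and 16th ordered values.
print(median_by_position(mpg))

# Odd n (29 values): drop the single 10 and take the 15th ordered value.
without_10 = [x for x in mpg if x != 10]
print(median_by_position(without_10))
```

Both calls should agree with the hand calculation, giving a median of 16 in each case.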

The Mean

The mean is the arithmetic average of all the observations in the data. It is also the fulcrum, or balancing point, of the data. For instance, if you were to place the histogram of the gas mileage data onto a seesaw, the mean would be the point that would allow the histogram to be perfectly balanced. As discussed earlier, the mean is found by adding up all of the observations and dividing by the total number of observations, either N or n depending upon whether you are dealing with the population or a sample. The formulas for the population mean and the sample mean are $\mu = \frac{\sum x_i}{N}$ and $\bar{x} = \frac{\sum x_i}{n}$, respectively.

To find the mean of our gas mileage data we should first ask ourselves: is the data based upon a sample, or does the data consist of all the observations in the population? Next, to calculate the average, we sum all of the observations and then divide this sum by the total number of observations. For the gas mileage data the average is $\bar{x} = \frac{12 + 16 + \cdots + 20}{30} = \frac{471}{30} = 15.7$.
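The same arithmetic can be checked with a few lines of Python. This is a minimal sketch, with the variable names `mpg`, `total`, `n`, and `mean` chosen only for illustration.

```python
# The 30 city-driving mpg readings, in the order given in the table.
mpg = [12, 16, 15, 10, 19,
       17, 18, 16, 14, 13,
       16, 17, 16, 15, 16,
       14, 16, 15, 11, 18,
       16, 17, 16, 15, 16,
       18, 15, 19, 15, 20]

total = sum(mpg)      # numerator: the sum of all observations
n = len(mpg)          # denominator: the number of observations
mean = total / n      # the arithmetic average

print(f"sum = {total}, n = {n}, mean = {mean:.1f}")
```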

For additional practice, let's find the mean, median, and mode for the data set below. A random sample of 25 elk was taken at a wildlife refuge near Estes Park, Colorado, and the body temperature of each elk was recorded (in degrees F).

98 97 102 104 97

97 96 98 103 98

98 99 101 97 100

105 99 95 99 97

102 101 100 96 96

$\bar{x}$ = _________
Median = __________ Mode = _________

Answers: mean = 99, median = 98, mode = 97
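After working the exercise by hand, the answers can be double-checked with Python's standard `statistics` module. This is just one way to verify the results, not the method expected on tests and quizzes; the list name `elk` is illustrative.

```python
import statistics

# Body temperatures (degrees F) of the 25 sampled elk, as listed above.
elk = [98, 97, 102, 104, 97,
       97, 96, 98, 103, 98,
       98, 99, 101, 97, 100,
       105, 99, 95, 99, 97,
       102, 101, 100, 96, 96]

elk_mean = statistics.mean(elk)      # sum of the values divided by 25
elk_median = statistics.median(elk)  # the 13th value once sorted
elk_mode = statistics.mode(elk)      # the most frequent value

print("mean =", elk_mean)
print("median =", elk_median)
print("mode =", elk_mode)
```

The printed values should match the answers given above.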

Comparing the Mean, Median, and Mode

Now that we have had a little practice calculating the mean, median, and mode, let's take a closer look at how these measures of central tendency are influenced by different distributions. To start, it may be informative to look at the distribution of the gas mileage data and compare it to the distribution of elk temperatures (notice that we are not making comparisons between the studies, just the distributions).
[Figure: "Histogram" of the gas mileage data. Horizontal axis: Miles per Gallon (10 to 20); vertical axis: Frequency (0 to 10).]

[Figure: histogram "Elk Temps." Horizontal axis: Bin (96 to 105); vertical axis: Frequency (0 to 6).]

Notice that for the gas mileage data the measures of central tendency were all very close to the center of the distribution and very similar in value (mean = 15.7, median = 16, and mode = 16). In contrast, the measures of central tendency for the elk data are more spread out (mean = 99, median = 98, mode = 97). Why do you suppose this is the case?

The shape of a distribution and whether any outliers are present have an effect on the closeness of the mean, median, and mode. When a distribution is symmetric with no outliers, the mean, median, and mode will generally have values close to each other. When a distribution is skewed to the right, the relationship between the mean, median, and mode is usually described by mode < median < mean (the mode is the smallest and the mean is the largest). When a distribution is skewed to the left, the opposite is generally true: mean < median < mode (the mean is the smallest and the mode is the largest).

Before the advent of computers, the mode was often used as a measure of central tendency because it was easily found and quickly calculated. However, in general, the mode is a poor measure of central tendency, mainly because the mode can be found anywhere in the distribution and there can often be more than one mode. We will see later that the mode is a poor measure of central tendency for other reasons as well. However, the mode is the only measure of central tendency that can be used for categorical data.

The best measure of central tendency for skewed data is the median. This is because the median is resistant to the more extreme values in a data set. Extreme values are data points that numerically stray from the majority of the data points and are thus found in the tails of the distribution. In addition, these extreme data points are often classified as what statisticians call outliers. We will provide a more official definition of an outlier in the next lecture.

An example of how the median is more resistant to extreme data points can be seen by revisiting our gas mileage data.

10 11 12 13 14 14 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 17 17 17 18 18 18 19 19 200

The last data point in our gas mileage data has been changed from 20 to 200. Even though 200 is an extreme data point, notice that the alteration of this point does not affect the value of the median; it is still 16.
However, when we calculate the mean, we get a much different result. That is, when we replace 20 with 200, the value of the mean changes from 15.7 to 21.7. Extreme data points tend to influence the mean by pulling the value of the mean towards them. When this shift occurs, it is often the case that the mean is no longer an adequate measure of central tendency.
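The effect of the single extreme value can be reproduced with a short Python sketch. It uses the standard `statistics` module and assumes the data is held in a list; the names `mpg` and `altered` are illustrative.

```python
import statistics

# Original gas mileage data (30 readings).
mpg = [12, 16, 15, 10, 19,
       17, 18, 16, 14, 13,
       16, 17, 16, 15, 16,
       14, 16, 15, 11, 18,
       16, 17, 16, 15, 16,
       18, 15, 19, 15, 20]

# Replace the single reading of 20 with the extreme value 200.
altered = [200 if x == 20 else x for x in mpg]

# Compare each measure before and after the change.
print("mean:  ", statistics.mean(mpg), "->", statistics.mean(altered))
print("median:", statistics.median(mpg), "->", statistics.median(altered))
```

The median stays at 16 because only the size of the largest value changed, not its position in the ordering, while the mean rises because the sum grows by 180.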

When the data is symmetrical, the mean is often the preferred measure of central tendency over the median and mode. Based on statistical theory, the sample mean is the estimator that does the best job of estimating the population mean over the long run, even if the data is not symmetrical.

To Recap:

The Mode is appropriate to use when
- The observation that is most frequently observed is desired.
- A quick estimate of central tendency is desired.
- The data is categorical.

Do not use when
- The data is multi-modal, highly skewed, or uniform, because in these situations the mode may provide an extremely poor estimate of the center of the distribution.
- A more accurate measure of central tendency, such as the mean or median, is available.

The Median is appropriate to use when
- The center or the middle value of the data set is desired.
- One needs to determine whether additional data points fall either above or below the midpoint.
- The data is highly skewed.
- Outliers exist that will affect the mean.

Do not use when
- The distribution of the data is symmetrical, because the mean is preferred.

The Mean is appropriate to use when
- The data is symmetrical or at least not heavily skewed. When the data is roughly symmetrical, the mean, median, and mode are all somewhat decent measures of central tendency. However, when the data is symmetrical, the mean will provide the best estimate because over the long run it does the best job of estimating the center of the distribution.

Do not use when
- The distribution of the data is extremely skewed.
- Outliers exist which will affect the mean more than an acceptable amount.

OPTIONAL: Using Excel to find the mean, median, and mode

Using Excel to find the mean, median, and mode will make our lives easy, but remember that for tests and quizzes you will need to find each of these by hand.

Finding the Mode

Step 1: Click on an empty cell on the spreadsheet.
Step 2: From the toolbar, click on the fx icon. Once the fx icon has been selected, the Paste Function dialog box should appear.
Step 3: From the Function category window, select Statistical.
Step 4: Once Statistical is selected, choose Mode from Function name.
Step 5: Click on the Array box (it might say Numbers), highlight your data, and click OK. Your result should be in the empty cell from Step 1.

These same steps can be repeated to find the mean or median by replacing Mode in step 4 with Average or Median.