
Examining Distributions Series

Topic #9: Using R to Calculate Central Tendency Statistics

Two central tendency statistics are the median and the mean. We call these
central tendency statistics because their purpose is to identify the center of the
distribution. Let’s learn how to use R to calculate these statistics.

I have started an R script and I titled it “descriptive statistics of passenger age.”


That title makes it obvious that I am going to use this script to calculate
descriptive statistics of the ages of passengers on the Titanic. Currently my script
just includes annotation and the line for reading in the Titanic data. I’m going to
run that script so that we can calculate the median and mean.

The function for calculating the median is, strangely enough, called “median.” I’m
going to type in the function and provide the age variable as an argument. As
always, we put arguments, that is inputs, for the function in parentheses. Now I
will run this line. I don’t get the median. Instead, R gave me the letters “NA.”
Why? In R, NA means “not available.” There is a very common reason why a
statistic may not be available. It is because there is missing data for the variable.
In this case, that means that we don’t have the ages of all the passengers. R
doesn’t know what we want to do with the missing data until we tell it what to
do.
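
As a minimal sketch of this behavior, the lines below use a small made-up vector named ages (an assumption for illustration, not the actual Titanic data or its variable name), with NA marking the passengers whose age is missing.

```r
# Hypothetical ages standing in for the Titanic age variable.
# NA marks a passenger whose age was not recorded.
ages <- c(22, 38, 26, NA, 35, 54, NA, 2, 27, 14)

# Because some values are missing and we have not told R how to
# handle them, median() returns NA instead of a number.
result <- median(ages)
print(result)
```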

Let me take a little side trip here and show you a function in R for determining if
there is missing data. The function is “is.na.” If we run that function and use the
age variable as our argument, we will get a series of “TRUE” and “FALSE” values,
one for each case. The is.na function is asking R whether each case for our
variable has missing data. R will return TRUE when it finds a case with missing
data. If we get a series that is all “FALSE,” then we have no missing data. For this
variable I see both “TRUE” and “FALSE,” so there is missing data.
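
Here is a rough sketch of that check, again using a made-up ages vector as a stand-in for the real age variable. The calls to anyNA and sum are not part of the walkthrough above; they are shown only as a convenient way to summarize a long run of TRUE and FALSE values.

```r
# Hypothetical ages standing in for the Titanic age variable.
ages <- c(22, 38, 26, NA, 35, 54, NA, 2, 27, 14)

# is.na() returns TRUE for each case with a missing age, FALSE otherwise.
missing_flags <- is.na(ages)
print(missing_flags)

# anyNA() answers the yes/no question "is anything missing?" in one value.
any_missing <- anyNA(ages)

# TRUE counts as 1 when summed, so this is the number of missing ages.
n_missing <- sum(missing_flags)
print(n_missing)
```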

Now back to calculating the median. There’s a simple solution for missing data,
and that is to exclude these cases from the calculation. To do this, we add an
argument to the median function. The argument is na.rm equals TRUE. That tells
R that if there are cases with missing data, they should be removed for this
calculation. The “rm” in this argument is short for “remove.” We set na.rm equal to
TRUE to say, “Yes, please remove cases with missing data for this calculation.”

Now let’s run the median function again. Ah ha! It worked. We get a median of
28. An age of 28 is right in the middle of the distribution of passenger ages.
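
The same step can be sketched with a small made-up ages vector (an assumption for illustration; the result here is not the Titanic median of 28).

```r
# Hypothetical ages standing in for the Titanic age variable.
ages <- c(22, 38, 26, NA, 35, 54, NA, 2, 27, 14)

# na.rm = TRUE removes the missing cases before the median is calculated.
age_median <- median(ages, na.rm = TRUE)
print(age_median)
```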

Let’s calculate the mean. For that, we use the “mean” function. You might
anticipate that we need to remove missing data for the mean as well. That is
indeed the case, so I’m going to put that argument in the function input. It never
hurts to include this argument if you want to remove missing data. If there is no
missing data for your variable, this argument will have no effect. If there are
missing data, R will remove those cases before calculating the mean.

Let’s run this function. We get 29.88, or about 30. The mean age of passengers on
the Titanic was about 30 years.
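
A small sketch of both points, using made-up vectors (ages and complete_ages are hypothetical names and values, not the Titanic data): the argument changes the result only when something is actually missing.

```r
# Hypothetical ages with two missing values.
ages <- c(22, 38, 26, NA, 35, 54, NA, 2, 27, 14)

# The missing cases are removed before the mean is calculated.
age_mean <- mean(ages, na.rm = TRUE)
print(age_mean)

# Hypothetical variable with no missing data: na.rm = TRUE has no effect.
complete_ages <- c(22, 38, 26)
same_result <- identical(mean(complete_ages), mean(complete_ages, na.rm = TRUE))
print(same_result)
```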

The median and mean are close in value. The fact that the mean is slightly greater
than the median tells me that the distribution has a slight positive skew. It isn’t
much skew, so I feel comfortable using both or either of these statistics as an
indication of central tendency. The median age is 28 years and the mean age is 30
years. Notice that when I give the final report of my statistics, I include the
unit of measurement, which in this case is “years.”
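
The comparison and the final report can be sketched as follows, again with a made-up ages vector rather than the real data. Treat the sign of the difference as a rough hint about skew, not a formal test; the wording of the report string is just one possible format.

```r
# Hypothetical ages standing in for the Titanic age variable.
ages <- c(22, 38, 26, NA, 35, 54, NA, 2, 27, 14)

age_median <- median(ages, na.rm = TRUE)
age_mean <- mean(ages, na.rm = TRUE)

# A mean above the median suggests a positive skew; a small gap
# suggests the skew is slight.
gap <- age_mean - age_median
print(gap)

# Report both statistics with their unit of measurement.
report <- sprintf("Median age: %.2f years; mean age: %.2f years",
                  age_median, age_mean)
print(report)
```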

When we provide our readers graphical displays that show the distribution of a
variable, we can also report central tendency statistics to provide more precise
information about the center of the distribution. The picture is the overview and
the statistics drill down to yield finer-grained information.
