Lecture Notes for Introductory Statistics¹

a data set. As it turns out, simply knowing the mean and standard deviation of a

data set can reveal much about the nature of a set of data. A well-known result

attributed to Chebyshev makes this statement precise.

1. Chebyshev’s Theorem

First, we begin by stating Chebyshev’s Theorem. We will not prove the result.

Chebyshev’s Theorem. Given any set of data, and any real number k ≥ 1, at

least 1 − 1/k² of the points in that set of data must fall within k standard deviations

of the mean. That is, at least 1 − 1/k² of the data must lie in the interval between

µ − kσ and µ + kσ.

This is a pretty powerful result, since it allows us to make a statement about

any set of data. Additionally, there are some special cases of this theorem that are

worth knowing, so let’s take a look at a couple of these.

If we set k = 2, then 1 − 1/k² = 1 − 1/4 = 75%, and Chebyshev’s Theorem

tells us that in any data set, at least 75 percent of the data must lie within k = 2

standard deviations of the mean.

If we set k = 3, then 1 − 1/k² = 1 − 1/9 ≈ 88.9%, and Chebyshev’s Theorem

tells us that in any data set, at least 88.9 percent of the data must lie within k = 3

standard deviations of the mean.

If we set k = 4, then 1 − 1/k² = 1 − 1/16 = 93.75%, and Chebyshev’s Theorem

tells us that in any data set, at least 93.75 percent of the data must lie within k = 4

standard deviations of the mean.

We could of course do this all day, but the takeaway is that in any set of data we

can expect the vast majority of the data to fall within just a few standard deviations

of the mean.
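
The special cases above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name chebyshev_bound is our own choice for illustration and does not come from the textbook.

```python
def chebyshev_bound(k: float) -> float:
    """Minimum fraction of any data set lying within k standard
    deviations of the mean, according to Chebyshev's Theorem."""
    if k < 1:
        # The theorem as stated in these notes assumes k >= 1.
        raise ValueError("k must be at least 1")
    return 1 - 1 / k**2


# Tabulate the bound for the three special cases discussed above.
for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_bound(k):.2%} of the data")
```

Note that k = 1 gives a bound of 0, so the theorem only says something useful when k is larger than 1.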

Example 1. A group of 20 students were asked for the amount they spent on

textbooks during the last academic year. The amounts in dollars reported were

700, 600, 550, 550, 550, 500, 500, 500, 450, 450,

450, 400, 400, 400, 400, 350, 350, 300, 300, 200

Check that this population has mean 445 dollars with σ = 113.91 dollars.

Chebyshev’s Theorem therefore predicts that at least 75 percent of the costs will

fall in the interval 445 − 2 × 113.91 = 217.18 to 445 + 2 × 113.91 = 672.82. By

examination of the data set, we see that this is certainly true, as 18 of the 20 (or 90

percent) of the reported costs do fall in this interval. Thus, the operative phrase is

at least.

¹ These lecture notes are intended to be used with the open source textbook “Introductory

Statistics” by Barbara Illowsky and Susan Dean (OpenStax College, 2013).

Supplemental topic: Chebyshev's Theorem N. Smith

Also observe that Chebyshev’s Theorem predicts that at least 88.9 percent of

the costs fall within three standard deviations of the mean, but it is easy to see that

all of the data in this particular data set is in fact within three standard

deviations of the mean.
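
The claims in Example 1 can be checked numerically. The sketch below uses Python's standard statistics module and treats the 20 amounts as a population, so it uses pstdev rather than the sample standard deviation. The helper name fraction_within is our own and is only illustrative.

```python
from statistics import fmean, pstdev

# Textbook costs in dollars reported by the 20 students in Example 1.
costs = [700, 600, 550, 550, 550, 500, 500, 500, 450, 450,
         450, 400, 400, 400, 400, 350, 350, 300, 300, 200]

mu = fmean(costs)      # population mean
sigma = pstdev(costs)  # population standard deviation


def fraction_within(data, mean, sd, k):
    """Fraction of the data lying in [mean - k*sd, mean + k*sd]."""
    low, high = mean - k * sd, mean + k * sd
    return sum(low <= x <= high for x in data) / len(data)


print(f"mean = {mu:.2f}, sigma = {sigma:.2f}")
for k in (2, 3):
    observed = fraction_within(costs, mu, sigma, k)
    guaranteed = 1 - 1 / k**2
    print(f"k = {k}: observed {observed:.1%}, guaranteed at least {guaranteed:.1%}")
```

In both cases the observed fraction is larger than the guaranteed one, which is exactly what "at least" allows.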

Example 2. A professor tells a class that the mean on a recent exam was 80

with a standard deviation of 6 points. Suppose you want to find an interval

where at least 75 percent of the students must have scored. Since 75 percent

corresponds to k = 2 in Chebyshev’s Theorem, we need only look 2 standard

deviations from the mean to conclude that at least 75 percent of the students

scored between 80 − 2 × 6 = 68 and 80 + 2 × 6 = 92.
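
The step "75 percent corresponds to k = 2" comes from solving 1 − 1/k² = p for k, which gives k = 1/√(1 − p). The sketch below carries out that step and then builds the interval for Example 2. The function names k_for_fraction and chebyshev_interval are our own, chosen only for illustration.

```python
import math


def k_for_fraction(p: float) -> float:
    """Smallest k for which Chebyshev's Theorem guarantees a fraction p,
    found by solving 1 - 1/k**2 = p for k."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1 / math.sqrt(1 - p)


def chebyshev_interval(mean: float, sd: float, k: float):
    """Interval from k standard deviations below to k above the mean."""
    return mean - k * sd, mean + k * sd


k = k_for_fraction(0.75)                    # 2.0
low, high = chebyshev_interval(80, 6, k)    # exam mean 80, standard deviation 6
print(f"k = {k}, interval = {low} to {high}")
```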

Depending on the data, Chebyshev’s Theorem may tell you a lot or not so much.

Let’s look at an example where Chebyshev’s Theorem is not too enlightening.

Example 3. A professor tells a class that the mean on a recent (100 point) exam

was 62 and the standard deviation was a whopping 33 points. Again, if we wanted

to get a handle on at least 75 percent of the exam scores, we would let k = 2 and

conclude that at least 75 percent of the students scored between 62 − 2 × 33 = −4

and 62 + 2 × 33 = 128. Since this was a 100 point exam, and presumably negative

scores were not possible, this tells us that at least 75 percent of the students scored

between 0 and 100 on the exam. While this statement is of course true, it is not

terribly enlightening! Hopefully, you can see that since the standard deviation was

so large, this is an indication of high variability in the data set, and there is simply

too much potential variation in the data to be able to draw fantastic conclusions

knowing only the mean and the standard deviation!
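
One way to see why Example 3 is uninformative is to intersect the Chebyshev interval with the range of scores that are actually possible. The sketch below does this. The function name clipped_interval is our own, and the bounds 0 and 100 are the assumed limits of the exam.

```python
def clipped_interval(mean, sd, k, min_value, max_value):
    """Chebyshev interval mean ± k*sd, cut down to the feasible range
    [min_value, max_value]."""
    low = max(mean - k * sd, min_value)
    high = min(mean + k * sd, max_value)
    return low, high


raw = (62 - 2 * 33, 62 + 2 * 33)                # interval before clipping
clipped = clipped_interval(62, 33, 2, 0, 100)   # interval of possible scores

print("raw interval:    ", raw)
print("clipped interval:", clipped)
if clipped == (0, 100):
    print("The interval covers every possible score, so it tells us nothing new.")
```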

Let’s do one final example.

Example 4. A new college graduate has done their homework and is searching

for their first job. Based on their major, their educational level, the type of job

they are looking for, their experience, and the geographic location where they want

to live, a salary aggregator tells them that the mean salary of new employees is

approximately 45000 dollars with a standard deviation of 2600 dollars. This person

is subsequently offered a salary of 52000 dollars. How good is this offer?

Solution. Well, it’s certainly not terrible, being above the mean, but fortunately

we can quantify this somewhat better. First, since Chebyshev’s Theorem can tell

us what is happening a certain number of standard deviations away from the mean,

it would be nice to know a z-score for this 52000 dollar salary.

z_52000 = (x − µ)/σ = (52000 − 45000)/2600 ≈ 2.7

Since this salary is about 2.7 standard deviations above the mean, using Chebyshev’s

Theorem with k = 2.7 tells us that at least 1 − 1/2.7² ≈ 86.3 percent of the

salaries must lie within 2.7 standard deviations of the mean; that is, between roughly 38000

and 52000 dollars. Thus, we can safely conclude that this 52000 dollar offer

is greater than or equal to at least 86.3 percent of the salaries out there.
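
The arithmetic in Example 4 can be sketched in Python as follows. The variable names are our own, and rounding the z-score to one decimal place simply mirrors the rounding used in the notes.

```python
mean_salary = 45000   # mean salary reported by the aggregator, in dollars
sd_salary = 2600      # standard deviation, in dollars
offer = 52000         # salary offered, in dollars

# z-score: how many standard deviations the offer lies above the mean.
z = (offer - mean_salary) / sd_salary

# Use the rounded z-score as k, as the notes do.
k = round(z, 1)
bound = 1 - 1 / k**2

low = mean_salary - k * sd_salary
high = mean_salary + k * sd_salary

print(f"z = {z:.4f}, rounded to k = {k}")
print(f"at least {bound:.1%} of salaries lie between {low:.0f} and {high:.0f} dollars")
```

With the rounded value k = 2.7 the endpoints come out as 37980 and 52020 dollars, which the notes round to 38000 and 52000.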


