
M1.

Understanding a Data Set

Presented by
Aung Kay Tu, MBBS, DTM&H, MCTM, PhD
Variables: Discrete and Continuous

• A variable is a symbol that can assume any of a prescribed set of
values.
• If the variable can assume only one value, it is called a constant.
• The number n of children in a family, which can assume any of the
values 0, 1, 2, 3, . . . but cannot be 2.5 or 3.842, is a discrete
variable.
• The height of an individual, which can be 162 cm or 163.8 cm or
165.83 cm, depending on the accuracy of measurement, is a
continuous variable.
Discrete and continuous data
• The number of children in each of 1000 families is an example of
discrete data.
• The heights of 100 university students are an example of continuous
data.
In general:
• measurement → continuous data
• counting → discrete data
Logarithm

• The logarithm is the inverse function to exponentiation.
• A logarithm counts the number of occurrences of the same factor in repeated multiplication.
• Since 1000 = 10 × 10 × 10 = 10^3, the "logarithm base 10" of 1000 is 3:
• log10(1000) = 3.
• log_b(x), for any two positive real numbers b and x where b is not equal to 1, is always a
unique real number.
• The "logarithm base 2" of 64 is 6, because 2^6 = 64:
• log2(64) = 6.
• The two most commonly used bases are 10 and e = 2.71828182 . . .
• Logarithms with base 10 are common logarithms and are written as
log(x).
• Logarithms with base e are natural logarithms and are written
as ln(x).
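The definitions above can be checked numerically. The following is a minimal sketch in Python, which is an assumption made here for illustration (the slides themselves use Excel). It relies only on the standard `math` module, and `log_base` is a hypothetical helper name.

```python
import math

# Common logarithm (base 10): 10 ** 3 == 1000, so log10(1000) is 3.
common = math.log10(1000)

# Base-2 logarithm: 2 ** 6 == 64, so log2(64) is 6.
binary = math.log2(64)

# Natural logarithm (base e), written ln(x) in the slides.
natural = math.log(math.e)


def log_base(x, b):
    """Logarithm of x to base b (x > 0, b > 0, b != 1) via change of base."""
    return math.log(x) / math.log(b)


print(common, binary, natural, log_base(64, 2))
```

Note that `math.log` with one argument is the natural logarithm, whereas the slides write log(x) for base 10.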
Look at these numbers
• 100
• 1,000
• 10,000
• 100,000
• 1,000,000
These values span several orders of magnitude, so it is not appropriate to plot them directly on a linear scale.
Change these numbers to log values!
Common Logarithms
• log(10) = 1            log(15) = 1.1761
• log(100) = 2           log(267) = 2.4265
• log(1,000) = 3
• log(10,000) = 4
• log(100,000) = 5
• log(1,000,000) = 6
Try these numbers in Excel!
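As an alternative to a spreadsheet, the same log transformation can be sketched in Python (an assumption for illustration, not part of the original slides). The list `values` simply repeats the numbers shown on the slide.

```python
import math

values = [100, 1_000, 10_000, 100_000, 1_000_000]

# The raw values differ by factors of ten; their base-10 logs are evenly spaced.
log_values = [math.log10(v) for v in values]

for v, lv in zip(values, log_values):
    print(f"{v:>9,} -> log10 = {lv:.4f}")

# Values that are not powers of ten, rounded to four decimal places.
log_15 = round(math.log10(15), 4)
log_267 = round(math.log10(267), 4)
print(log_15, log_267)
```

After the transformation the five values sit one unit apart, which is what makes them easy to place on a single axis.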
Natural logarithm
• Use Excel to compute the natural logarithm of the integers 1
through 5.
• Enter the numbers 1 through 5 into B1:F1, enter the expression
=LN(B1) into B2, and then click and drag from B2 to F2.

x        1    2          3          4          5
LN(x)    0    0.693147   1.098612   1.386294   1.609438
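The same table can be reproduced outside Excel. This is a small sketch in Python (chosen here as an assumption), where `math.log` with a single argument is the natural logarithm; the name `ln_values` is illustrative.

```python
import math

# Natural logarithm of the integers 1 through 5, rounded to six decimals
# to match the precision shown in the Excel table.
ln_values = {x: round(math.log(x), 6) for x in range(1, 6)}

for x, ln_x in ln_values.items():
    print(x, ln_x)
```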
e: Euler's number, a mathematical constant
• The number e is called "Euler's number" after Leonhard Euler.
• The number e is a mathematical constant approximately equal to
2.71828 and is the base of the natural logarithm.
• It is the unique number whose natural logarithm is equal to 1.
e = 2.71828183...

Try it in Excel:
=ln(2.71828183)
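The same check can be sketched in Python (an assumption for illustration). `math.e` is the standard library's double-precision value of e, and the literal 2.71828183 is the rounded value from the slide.

```python
import math

# math.e holds Euler's number to double precision.
print(math.e)

# The natural logarithm of e is 1 (up to floating-point rounding).
ln_e = math.log(math.e)

# The rounded value from the slide gives a result very close to, but not exactly, 1.
ln_approx = math.log(2.71828183)

print(ln_e, ln_approx)
```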
Order the numbers from least to greatest.

A. 7, 4, 15, 9, 5, 2 → 2, 4, 5, 7, 9, 15

B. 70, 21, 36, 54, 22 → 21, 22, 36, 54, 70
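Ordering is the first step for the median and percentile calculations later on. As a brief sketch, assuming Python is used instead of ordering by hand, the built-in `sorted` function returns a new list in ascending order.

```python
set_a = [7, 4, 15, 9, 5, 2]
set_b = [70, 21, 36, 54, 22]

# sorted() leaves the original list unchanged and returns an ordered copy.
ordered_a = sorted(set_a)
ordered_b = sorted(set_b)

print(ordered_a)
print(ordered_b)
```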


• mean
• median
• mode
• range
• outlier
The mean is the sum of the data values divided by the number of data
items. The mean is also called the average.

The median is the middle value of an odd number of data items arranged
in order. For an even number of data items, the median is the average
of the two middle values.

The mode is the value or values that occur most often. When all the
data values occur the same number of times, there is no mode.

The range of a set of data is the difference between the greatest and
least values. It is used to show the spread of the data in a data set.
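The four definitions translate almost directly into code. The following is a minimal sketch in Python (an assumption for illustration). The function names are hypothetical, and `modes` follows the rule stated above that a set has no mode when every value occurs equally often.

```python
from collections import Counter


def mean(data):
    # Sum of the values divided by the number of items.
    return sum(data) / len(data)


def median(data):
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


def modes(data):
    counts = Counter(data)
    # If every value occurs the same number of times, there is no mode.
    if len(set(counts.values())) == 1:
        return []
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


def value_range(data):
    # Difference between the greatest and least values.
    return max(data) - min(data)


data = [4, 7, 8, 2, 1, 2, 4, 2]
print(mean(data), median(data), modes(data), value_range(data))
```

Returning a list from `modes` is a design choice made here so that data sets with several modes, or with none, can be represented in the same way.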
Finding the Mean, Median, Mode, and Range of Data

Find the mean, median, mode, and range of the data set.
4, 7, 8, 2, 1, 2, 4, 2

mean:
4 + 7 + 8 + 2 + 1 + 2 + 4 + 2 = 30     Add the values.

30 ÷ 8 = 3.75     Divide the sum (30) by the number of items (8).

The mean is 3.75.


Find the mean, median, mode, and range of the data set.
4, 7, 8, 2, 1, 2, 4, 2

median:
1, 2, 2, 2, 4, 4, 7, 8 Arrange the values in order.

2 + 4 = 6     There are two middle values, so find the mean of these two values.
6 ÷ 2 = 3

The median is 3.
Find the mean, median, mode, and range of the data set.
4, 7, 8, 2, 1, 2, 4, 2

mode:
1, 2, 2, 2, 4, 4, 7, 8 The value 2 occurs three times.

The mode is 2.
Find the mean, median, mode, and range of the data set.
4, 7, 8, 2, 1, 2, 4, 2

range:
1, 2, 2, 2, 4, 4, 7, 8     Subtract the least value from the greatest value.

8 - 1 = 7

The range is 7.
Find the mean, median, mode, and range of the data set.
6, 4, 3, 5, 2, 5, 1, 8

mean:
6 + 4 + 3 + 5 + 2 + 5 + 1 + 8 = 34     Add the values.

34 ÷ 8 = 4.25     Divide the sum (34) by the number of items (8).
The mean is 4.25.
Find the mean, median, mode, and range of the data set.
6, 4, 3, 5, 2, 5, 1, 8

median:
1, 2, 3, 4, 5, 5, 6, 8 Arrange the values in order.

4 + 5 = 9     There are two middle values, so find the mean of these two values.
9 ÷ 2 = 4.5

The median is 4.5.


Find the mean, median, mode, and range of the data set.
6, 4, 3, 5, 2, 5, 1, 8

mode:
1, 2, 3, 4, 5, 5, 6, 8 The value 5 occurs two times.

The mode is 5.
In the data set below, the value 12 is much less than
the other values in the set. An extreme value such as
this is called an outlier.

35, 38, 27, 12, 30, 41, 31, 35

[Figure: line plot of the data set on a number line from 10 to 42]
The data show the scores for the last 5 math
tests: 88, 90, 55, 94, and 89. Identify the
outlier in the data set.

55, 88, 89, 90, 94

Outlier: 55
With the Outlier
55, 88, 89, 90, 94
Outlier: 55

mean:
55 + 88 + 89 + 90 + 94 = 416
416 ÷ 5 = 83.2
The mean is 83.2.

median:
55, 88, 89, 90, 94
The median is 89.

mode:
There is no mode.
Without the Outlier

88, 89, 90, 94 (55 removed)

mean:
88 + 89 + 90 + 94 = 361
361 ÷ 4 = 90.25
The mean is 90.25.

median:
88, 89, 90, 94
(89 + 90) ÷ 2 = 89.5
The median is 89.5.

mode:
There is no mode.
          Without the Outlier   With the Outlier
mean      90.25                 83.2
median    89.5                  89
mode      no mode               no mode

Adding the outlier decreased the mean by 7.05 and the median by 0.5.
The mode did not change.
The median best describes the data with the outlier.
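The comparison can be reproduced with a short script. This is a sketch in Python using the standard `statistics` module (an assumption for illustration). The outlier is supplied by hand, as on the slide, rather than detected automatically.

```python
from statistics import mean, median

scores = [88, 90, 55, 94, 89]
outlier = 55  # identified by inspection, as on the slide

# Drop the outlier to get the comparison data set.
without = [s for s in scores if s != outlier]

# How much the outlier pulls each measure down.
mean_shift = round(mean(without) - mean(scores), 2)
median_shift = median(without) - median(scores)

print("mean:  ", mean(scores), "->", mean(without), "shift", mean_shift)
print("median:", median(scores), "->", median(without), "shift", median_shift)
```

The mean moves far more than the median, which is why the median is the better summary when an outlier is present.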
Find the mean, median, mode, and range of the data set.
6, 4, 3, 5, 2, 5, 1, 8

range:
1, 2, 3, 4, 5, 5, 6, 8     Subtract the least value from the greatest value.

8 - 1 = 7

The range is 7.
Identify the outlier in the data set. Then
determine how the outlier affects the mean,
median, and mode of the data. Then tell
which measure of central tendency best
describes the data with the outlier.
63, 58, 57, 61, 42

42, 57, 58, 61, 63

Outlier: 42
With the Outlier
42, 57, 58, 61, 63
Outlier: 42

mean:
42 + 57 + 58 + 61 + 63 = 281
281 ÷ 5 = 56.2
The mean is 56.2.

median:
42, 57, 58, 61, 63
The median is 58.

mode:
There is no mode.
Without the Outlier

57, 58, 61, 63 (42 removed)

mean:
57 + 58 + 61 + 63 = 239
239 ÷ 4 = 59.75
The mean is 59.75.

median:
57, 58, 61, 63
(58 + 61) ÷ 2 = 59.5
The median is 59.5.

mode:
There is no mode.
          Without the Outlier   With the Outlier
mean      59.75                 56.2
median    59.5                  58
mode      no mode               no mode

Adding the outlier decreased the mean by 3.55 and decreased the median by 1.5.
The mode did not change.
The median best describes the data with the outlier.
Measure   Most Useful When
mean      The data are spread fairly evenly.
median    The data set has an outlier.
mode      The data involve a subject in which many data points of one value are important.

Caution!
Since all the data values occur the same
number of times, the set has no mode.
PERCENTILE

• The pth percentile is a value such that at least p percent of the
observations are less than or equal to this value and at least (100 - p)
percent of the observations are greater than or equal to this value.
Example
• 3310 3355 3450 3480 3480 3490 3520 3540 3550 3650 3730 3925
• n = 12
• Let us determine the 85th percentile for the data set.
• With the data arranged in ascending order, the index is i = (p/100)n = (85/100)(12) = 10.2.
• Because i is not an integer, round up. The position of the 85th percentile is the
next integer greater than 10.2, the 11th position.
• The 85th percentile is the data value in the 11th position, which is 3730.
• Let us consider the calculation of the 50th percentile for the data set.
• The index is i = (p/100)n = (50/100)(12) = 6.
• Because i is an integer, step 3(b) of the procedure applies: the 50th percentile is the
average of the sixth and seventh data values (positions i and i + 1).

• 3310 3355 3450 3480 3480 3490 3520 3540 3550 3650 3730 3925
• The 50th percentile is (3490 + 3520)/2 = 3505.
• The 50th percentile is also the median.
• 3310 3355 3450 3480 3480 3490 3520 3540 3550 3650 3730 3925
• The first quartile, or 25th percentile, has index i = (25/100)(12) = 3, so it is the
average of the third and fourth data values; thus, Q1 = (3450 + 3480)/2 = 3465.
• The third quartile, or 75th percentile, has index i = (75/100)(12) = 9, so it is the
average of the ninth and tenth data values; thus, Q3 = (3550 + 3650)/2 = 3600.
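The index rule used in these examples can be written as a short function. This is a minimal sketch in Python (an assumption for illustration), and the function name `percentile` is hypothetical. It assumes 0 < p < 100 and a non-empty data set, and it uses exact `divmod` arithmetic on p × n to decide whether the index i = (p/100)n is an integer.

```python
def percentile(data, p):
    """p-th percentile using the index rule i = (p / 100) * n."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    # whole is the integer part of i; remainder is nonzero when i is not an integer.
    whole, remainder = divmod(p * len(ordered), 100)
    whole = int(whole)
    if remainder:
        # Round i up: 1-based position whole + 1, which is 0-based index whole.
        return ordered[whole]
    # i is an integer: average the values in 1-based positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [3310, 3355, 3450, 3480, 3480, 3490, 3520, 3540, 3550, 3650, 3730, 3925]
for p in (25, 50, 75, 85):
    print(p, percentile(data, p))
```

Library routines such as `statistics.quantiles` or `numpy.percentile` use interpolation rules by default, so their results may differ from this hand-calculation method.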
Interquartile Range (IQR)

• A measure of variability that overcomes the dependency on extreme
values is the interquartile range (IQR).
• This measure of variability is the difference between the third quartile,
Q3, and the first quartile, Q1.
• The interquartile range is the range for the middle 50% of the data.
• IQR = Q3 - Q1
• For the data set, the quartiles are Q3 = 3600 and Q1 = 3465.
• The interquartile range is 3600 - 3465 = 135.
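To illustrate why the IQR does not depend on extreme values, the sketch below (Python, an assumption for illustration) compares the range and the IQR before and after the largest value is replaced. The helper names and the replacement value 9999 are hypothetical and are not part of the original data.

```python
def percentile(data, p):
    # Index rule i = (p / 100) * n, assuming 0 < p < 100 and non-empty data.
    ordered = sorted(data)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:
        return ordered[whole]  # non-integer index: round up to the next position
    return (ordered[whole - 1] + ordered[whole]) / 2  # integer index: average two values


def iqr(data):
    # IQR = Q3 - Q1
    return percentile(data, 75) - percentile(data, 25)


data = [3310, 3355, 3450, 3480, 3480, 3490, 3520, 3540, 3550, 3650, 3730, 3925]

# Hypothetical variant: the largest value is replaced by a far more extreme one.
extreme = data[:-1] + [9999]

print("range:", max(data) - min(data), "->", max(extreme) - min(extreme))
print("IQR:  ", iqr(data), "->", iqr(extreme))
```

The range changes sharply while the IQR stays the same, because the quartiles are taken from the middle of the ordered data.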
EXERCISE

1. Find the mean, median, mode, and range of the data set: 8, 10, 46, 37, 20, 8, and 11.
2. Identify the outlier in the data set, and determine how the outlier affects the mean,
median, and mode of the data. Then tell which measure of central tendency best
describes the data with and without the outlier. Justify your answer. 85, 91, 83, 78,
79, 64, 81, 97
3. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25. Compute
the 20th, 25th, 65th, and 75th percentiles.
