
Lecture 2 – Data Summaries

Dr. Salman Saeed


Assistant Professor,
National Institute of Urban Infrastructure Planning (NIUIP)
University of Engineering and Technology, Peshawar

salmansaeed@uetpeshawar.edu.pk
Summarizing Data
• As discussed in previous slides:
– Data matrix is often huge
– For presentation purposes we have to summarize data

Summarize Data
– Center
– Spread
– Variation
– Graphs: Pie Charts, Dot Plots, Bar Graphs, Histograms
– Tables

23-Jul-2020 Lecture # 02 GEOL 703 – Applied Geo-statistics Dr. Salman Saeed, NIUIP, UET Peshawar
Summarizing Data – Central Tendency
• First, we discuss how to measure the central tendency of data

Summarize Data: Center, Spread, Variation, Graphs, Tables

Measures of Central Tendency:
– Mode
– Median
– Mean

Central Tendency - Mode
• The value that occurs most frequently – the most
common outcome
• Commonly used with Nominal and Ordinal
measurements
• On a pie chart, the category with the largest slice is the mode
• On bar graphs and histograms, the highest bar represents the mode
• On frequency distributions, the peak occurs at the mode value
• That’s why a distribution with two peaks is called a bimodal distribution
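As a rough illustration of finding the mode, the sketch below counts category frequencies with Python's standard library. The bowling-style observations are hypothetical values made up for this example (only the category names come from the bar graph in the slides), and `multimode` is used so that ties, as in a bimodal data set, are reported too.

```python
from collections import Counter
from statistics import multimode

# Hypothetical nominal data: bowling style recorded for seven players
styles = ["Right Arm Fast", "Left Arm Fast", "Right Arm Medium",
          "Right Arm Fast", "Other", "Right Arm Fast", "Left Arm Medium"]

counts = Counter(styles)        # frequency of each category
modes = multimode(styles)       # every value that shares the highest frequency

print(counts.most_common(1))    # the most frequent category and its count
print(modes)

# Two values tie for the highest frequency, so this list is bimodal
bimodal = multimode([1, 2, 2, 3, 4, 4, 5])
print(bimodal)
```

Note that the mode is the category itself, while the count returned by `Counter` is only its frequency.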
Examples of modes
[Figures: a pie chart (slices of 140, 119, 66, 50 and 25), a bar graph of bowling styles (Left Arm Fast, Left Arm Medium, Right Arm Fast, Right Arm Medium, Other), and a dot plot on an axis from 155 to 200, with the mode marked]

What is NOT the mode
[Figures: the same pie chart, bar graph and dot plot, shown with their data labels]

Central Tendency - Median
• The central value in an ordered list
Consider the following list, for example:
8 9 3 6 7 8 1 5 2
What is the mode?
8 – because it occurs most frequently
To find the median, we have to order the list:
1 2 3 5 6 7 8 8 9
Median = 6 (the middle value of the ordered list)

50% of the values are BELOW the median, 50% of the values are ABOVE the median
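The steps for an odd number of values can be sketched in Python as follows. This is only a minimal illustration using the list from this slide; the variable names are chosen for the example.

```python
from statistics import median, mode

values = [8, 9, 3, 6, 7, 8, 1, 5, 2]

ordered = sorted(values)                 # the median needs an ordered list
middle = ordered[len(ordered) // 2]      # odd count: take the single middle value

print(ordered)
print("median:", middle, "library median:", median(values))
print("mode:", mode(values))             # most frequent value
```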

What happens when we have an even number of values?


Central Tendency - Median
• When we have an even number of values
Consider the following list, for example:
8 9 3 6 7 8 1 5 2 4
To find the median, we have to order the list:
1 2 3 4 5 6 7 8 8 9

Median = (5 + 6) / 2 = 5.5

Median cannot be used with Nominal measurements, since they cannot be ordered
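For an even number of values, the same idea can be sketched by averaging the two middle values. The snippet below is an illustration on the list from this slide, with the standard library's `median` shown only as a cross-check.

```python
from statistics import median

values = [8, 9, 3, 6, 7, 8, 1, 5, 2, 4]

ordered = sorted(values)
n = len(ordered)

# Even count: average the two values on either side of the middle
lower_middle = ordered[n // 2 - 1]
upper_middle = ordered[n // 2]
med = (lower_middle + upper_middle) / 2

print(ordered)
print(lower_middle, upper_middle, med, median(values))
```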
Central Tendency - Mean
• Sum of all values divided by the number of observations

Consider the following list of values


156 163 168 177 185 187 191 194 199

# of obs. = 9
Σ of all obs. = 1620
Mean = 1620 / 9 = 180

• Engineers can easily relate to the mean as:
– The first moment of the data with respect to the origin, OR
– The point about which the moments of all observations sum to zero
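A short sketch of this calculation, using the nine values listed on this slide. The check that the deviations about the mean sum to zero mirrors the "moment about the mean" interpretation; variable names are illustrative only.

```python
heights = [156, 163, 168, 177, 185, 187, 191, 194, 199]

n = len(heights)             # number of observations
total = sum(heights)         # sum of all observations
mean = total / n

# Deviation ("moment arm") of each observation from the mean
deviations = [h - mean for h in heights]

print(n, total, mean)
print(deviations, sum(deviations))
```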
Central Tendency - Mean

[Figure: dot plot of the observations on an axis from 155 to 200]

• Mean is the resultant of all observations
• Mean is the point at which all observations are balanced
Difference of each value from the mean is:
-24 -17 -12 -3 5 7 11 14 19
The sum of the above numbers is zero

Central Tendency - Mean
Consider the following list of numbers:
6 7 7 8 8 9 => mean = 7.5
[Figure: the values placed as weights on a scale from 5 to 10]

Now add another weight at 9:
6 7 7 8 8 9 9 => mean ≈ 7.7
[Figure: the same scale with the extra weight at 9]
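The shift of the balance point can be reproduced with a few lines of Python. This is a minimal sketch on the numbers from this slide; it simply recomputes the mean before and after the extra value is added.

```python
from statistics import mean

weights = [6, 7, 7, 8, 8, 9]

before = mean(weights)            # balance point of the original six values
after = mean(weights + [9])       # balance point after adding one more weight at 9

print(before, round(after, 1))
```

The new value lies above the old mean, so the mean is pulled upward towards it.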

Which measure of central tendency should be used?
Central Tendency for Nominal measurements

– Mean: Nominal measurements have no value, so we cannot sum them, nor divide them. Mean cannot be calculated
– Median: Nominal measurements cannot be ordered. Median cannot be calculated
– Mode: Mode is the only measure available for central tendency of Nominal data

Which measure of central tendency should be used?
Central Tendency for Ordinal measurements

– Mean: Ordinal measurements are just categories with no meaningful value so, like Nominal, their mean also cannot be calculated
– Median: Ordinal values can be ordered, so theoretically we can find the median (for odd # of obs.), but it usually has no meaning
– Mode: Mode is a better measure for central tendency of Ordinal data

Which measure of central tendency should be used?
Central Tendency for Quantitative [Interval & Ratio] measurements

– Mean
– Median
– Mode

All three measures can be easily calculated for quantitative measurements. However, Mean and
Median are better measures of central tendency as compared to Mode.

Choice between Mean and Median depends on the underlying data.

Which measure of central tendency should be used?
Consider the yearly income of individuals
Person     Yearly Income
Person 1   $46,000
Person 2   $41,000
Person 3   $39,000
Person 4   $38,000
Person 5   $41,000
Person 6   $45,000
Person 7   $46,000

Mean = $42,285.71
Median = $41,000

Mean and Median are approximately the same

Which measure of central tendency should be used?
Let’s add another individual to the list
Person     Yearly Income
Person 1   $46,000
Person 2   $41,000
Person 3   $39,000
Person 4   $38,000
Person 5   $41,000
Person 6   $45,000
Person 7   $46,000
Person 8   $70,000,000 (Outlier)

Mean = $42,285.71, New Mean = $8,787,000
Median = $41,000, New Median = $43,000

There was a slight change in the median but the mean has
changed significantly. The Mean is sensitive to outliers
while the Median remains relatively unchanged.
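The effect of the outlier can be checked with a small sketch. It uses the income figures from the two slides above and the standard library's `mean` and `median`; nothing beyond those values is assumed.

```python
from statistics import mean, median

incomes = [46_000, 41_000, 39_000, 38_000, 41_000, 45_000, 46_000]
with_outlier = incomes + [70_000_000]     # Person 8 added

old_mean, old_median = mean(incomes), median(incomes)
new_mean, new_median = mean(with_outlier), median(with_outlier)

print(round(old_mean, 2), old_median)
print(new_mean, new_median)
```

A single extreme value multiplies the mean by roughly 200, while the median moves by only $2,000.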
Which measure of central tendency should be used?

– Categorical data: Mode
– Quantitative data: are there outliers?
  – No: Mean
  – Yes: Median
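The decision flow on this slide can be written as a small helper function. This is only a sketch: the function name `choose_measure` and the string labels are hypothetical, chosen for illustration, and whether outliers are present is passed in rather than detected.

```python
def choose_measure(data_type: str, has_outliers: bool = False) -> str:
    """Return the measure of central tendency suggested by the decision flow."""
    if data_type == "categorical":        # nominal or ordinal data
        return "mode"
    if data_type == "quantitative":       # interval or ratio data
        return "median" if has_outliers else "mean"
    raise ValueError(f"unknown data type: {data_type!r}")


print(choose_measure("categorical"))
print(choose_measure("quantitative", has_outliers=False))
print(choose_measure("quantitative", has_outliers=True))
```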

Summarizing Data
• Consider the following two data sets, presented here as dot plots
• Both data sets have the same mean, same median and same mode
• The first data set is more “spread out” as compared to the second one

[Figure: two dot plots of Physical Height (cm) on an axis from 155 to 200. Both have Mean=178, Median=178, Mode=177, i.e. the same central tendency]

• Clearly, none of the central tendency measures can fully describe the data
• We need another measure that can describe the spread of the data
Summarizing Data – Spread

Summarize Data: Center, Spread, Variation, Graphs, Tables

Measures of Spread:
– Range
– Interquartile Range
– Box and Whiskers Plot

Range
• Range is the difference between the highest and lowest value in a data set

First data set: Highest Value = 199, Lowest Value = 156, Range = 199 - 156 = 43
Second data set: Highest Value = 185, Lowest Value = 170, Range = 185 - 170 = 15

[Figure: two dot plots of Physical Height (cm) on an axis from 155 to 200, marked Max=199, Min=156, Range=43 and Max=185, Min=170, Range=15]
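Computing the range takes one line of code. The data behind the dot plots is not listed in the slides, so the sketch below reuses the nine heights from the mean example, which happen to share the same minimum (156) and maximum (199) as the first data set; treat that substitution as an assumption.

```python
# Assumed stand-in data: the nine heights from the mean example
heights = [156, 163, 168, 177, 185, 187, 191, 194, 199]

highest = max(heights)
lowest = min(heights)
data_range = highest - lowest     # range uses only the two extreme values

print(highest, lowest, data_range)
```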
Range
• Range is the difference between the highest and lowest value in a data set

The Range is:
• Easy to understand
• Simple to compute

However:
• It doesn’t give a good idea about the dispersion of the data
• It only takes into account the extreme values
• Presence of outliers can greatly influence the range

[Figure: the two dot plots of Physical Height (cm), marked Max=199, Min=156, Range=43 and Max=185, Min=170, Range=15]
Range
• Consider another two data sets
• We can see two different data sets, one in which the data is more dispersed compared to the other
• But neither the central tendency measures nor the range can tell them apart

[Figure: two dot plots of Physical Height (cm) on an axis from 155 to 195. Both have Mean 178, Median 178, Mode 177 and Range 43, i.e. the same central tendency and spread]

• This brings us to another measure of spread, which is the Interquartile range
Interquartile Range
• Interquartile range is the difference between the first and the third quartile
• To understand what quartile means, consider the following distribution
• We are familiar with the median

[Figure: a distribution split at the Median into two halves, each holding 50% of the data, and at Q1, Q2 and Q3 into four parts, each holding 25% of the data]

• The median divides the data such that 50% of the values are above it, and 50% are below it
• Now we divide the data such that 25% of the values are below it – This is the first quartile
• Now add another division such that 25% of the values are above it – This will be the third quartile
• While the median is the second quartile
Interquartile Range
• Interquartile range is the difference between the first and the third quartile
• The three quartiles divide the data into four parts or quarters

[Figure: the same distribution, split at Q1, Q2 (Median) and Q3 into four parts, each holding 25% of the data]

IQR = Q3 – Q1

• Interquartile Range (IQR) is simply the third quartile (Q3) minus the first quartile (Q1)
• It is a better measure of dispersion because it leaves out the extreme values and takes into account the middle 50% of the data
Interquartile Range
• Consider the weights of the playing eleven of team Lahore Qalandars

Player      Weight (Kg)
Player 1    115.3
Player 2    90.8
Player 3    110.7
Player 4    133.8
Player 5    58.2
Player 6    98.2
Player 7    199.6
Player 8    117.1
Player 9    183.7
Player 10   101.8
Player 11   86.7

Ordered list: 58.2 86.7 90.8 98.2 101.8 110.7 115.3 117.1 133.8 183.7 199.6

Q1 = 90.8, Median/Q2 = 110.7, Q3 = 133.8

IQR = Q3 – Q1
= 133.8 – 90.8
= 43

Outliers do not affect the calculation; hence, the IQR
gives a more realistic description of dispersion in the data

It is also useful in detecting the outliers
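The quartiles in this example can be reproduced with Python's `statistics.quantiles`. This is a sketch under one assumption: the `"exclusive"` method is chosen because it matches the values on the slide for these 11 observations; other quartile conventions (and other software defaults) can give slightly different Q1 and Q3.

```python
from statistics import quantiles

weights = [115.3, 90.8, 110.7, 133.8, 58.2, 98.2,
           199.6, 117.1, 183.7, 101.8, 86.7]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(weights, n=4, method="exclusive")
iqr = q3 - q1

print(sorted(weights))
print(q1, q2, q3, round(iqr, 1))
```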

Interquartile Range – Detecting outliers
• Values that are more than 1.5 x IQR below the first quartile or above the third quartile are generally considered to be outliers in the data
• In our example, IQR is 43, so 1.5 x 43 is 64.5
• Any values that are more than 64.5 kg lower than Q1 (90.8) and values that are more than 64.5 kg higher than Q3 (133.8) are outliers
• 90.8 – 64.5 = 26.3, any value lower than this is an outlier
• 133.8 + 64.5 = 198.3, any value higher than this is an outlier
• 58.2 86.7 90.8 98.2 101.8 110.7 115.3 117.1 133.8 183.7 199.6
• Here 199.6 is higher than 198.3, so it is an outlier

Q1 = 90.8, Median/Q2 = 110.7, Q3 = 133.8
IQR = Q3 – Q1 = 133.8 – 90.8 = 43
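The 1.5 x IQR rule can be sketched as below, using the ordered weights and the quartiles stated on the slide. The names `lower_limit` and `upper_limit` are illustrative only.

```python
weights = [58.2, 86.7, 90.8, 98.2, 101.8, 110.7,
           115.3, 117.1, 133.8, 183.7, 199.6]
q1, q3 = 90.8, 133.8              # quartiles taken from the slide

iqr = q3 - q1
step = 1.5 * iqr                  # 1.5 x IQR
lower_limit = q1 - step
upper_limit = q3 + step

# Anything outside the two limits is flagged as an outlier
outliers = [w for w in weights if w < lower_limit or w > upper_limit]

print(round(step, 1), round(lower_limit, 1), round(upper_limit, 1))
print(outliers)
```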
Box and Whiskers Plot
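[Figure: anatomy of a box and whiskers plot, labelled from top to bottom]
– Outlier
– Upper whisker: ends at the highest value that is not an outlier
– Q3: top of the box
– Median/Q2: line inside the box
– Q1: bottom of the box
– Lower whisker: ends at the lowest value that is not an outlier
– Outlier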
Making a Box and Whisker Plot
• Consider the data from our previous example about weights of the players.
• Here is the vertical dot plot of that data
• The median, Q2, is at 110.7
• The first quartile, Q1, is at 90.8
• The third quartile, Q3, is at 133.8

[Figure: Box and Whiskers Plot under construction. Vertical dot plot of the weights on an axis from 50 to 200, with Q2 marked at 110.7]
Making a Box and Whisker Plot
• Now join the Q3 and Q1 lines to make a box
• The IQR is Q3 - Q1 = 43
• The outlier range is 1.5 x IQR = 64.5
• The upper limit is Q3 + 64.5 = 133.8 + 64.5 = 198.3
• Within this limit, the highest data point is 183.7
• So the upper limit moves down to 183.7
• This point marks the end of the upper whisker

[Figure: Box and Whiskers Plot under construction on an axis from 50 to 200. Box from Q1 = 90.8 to Q3 = 133.8 with the median at 110.7, upper whisker up to 183.7, and data points at 58.2 and 199.6]
Making a Box and Whisker Plot
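• We adopt the same procedure to make the lower whisker: the lower limit is Q1 - 64.5 = 90.8 - 64.5 = 26.3, and the lowest data point within this limit is 58.2
• Next we mark the outliers, that is, the data points outside the whiskers (here, 199.6)

[Figure: completed Box and Whiskers Plot on an axis from 50 to 200. Box from Q1 = 90.8 to Q3 = 133.8 with the median at 110.7, whiskers ending at 58.2 and 183.7, and an outlier at 199.6]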
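The whisker construction walked through on these slides can be summarized in a short sketch. The helper name `whiskers_and_outliers` is hypothetical, and the quartiles are passed in as given on the slides rather than recomputed; the function only applies the 1.5 x IQR limits and then pulls each whisker in to the nearest data point inside them.

```python
def whiskers_and_outliers(data, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers) for a box plot."""
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Whiskers end at the most extreme data points that lie within the limits
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)
    return min(inside), max(inside), outliers


weights = [58.2, 86.7, 90.8, 98.2, 101.8, 110.7,
           115.3, 117.1, 133.8, 183.7, 199.6]

lower_whisker, upper_whisker, outliers = whiskers_and_outliers(
    weights, q1=90.8, q3=133.8)

print(lower_whisker, upper_whisker, outliers)
```

Plotting libraries usually apply the same rule internally, although their default quartile method may differ from the one used in the slides.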
Exercise - 1
1. Summarize the two tables using the following descriptors and compare them:
• Median
• Mean
• Range
• IQR
2. Plot and compare the Box and Whiskers plots for the two tables
3. Find all the parameters in question 1 above, after removing the outliers. Comment on how much they changed

Player      Table 1 Weight (Kg)   Table 2 Weight (Kg)
Player 1    103.8                 128.5
Player 2    113.2                 113.9
Player 3    94.9                  87.2
Player 4    105.6                 116
Player 5    111.6                 101.9
Player 6    112.2                 122.4
Player 7    119.6                 78.7
Player 8    85.7                  91.2
Player 9    109                   92.8
Player 10   90.3                  52.2
Player 11   106.2                 147.7
Player 12   114.6                 96.6
Player 13   114.4                 132.6
Player 14   110.7                 50.8
Player 15   97.1                  137.1
Player 16   65.3                  107.5
Player 17   98.1                  117.2
Player 18   117.5                 141.4
Player 19   99.9                  57.4
Player 20   102.6                 127
Player 21   122.5                 65.5
Player 22   110.3                 137.1
Player 23   102                   73.9
Trimmed Mean
• Like the Median, the trimmed mean is another measure that reduces the effect of outliers.
• A certain percentage of the sorted data array is taken away before calculating the mean of the remaining data
• If p% of the data are trimmed from each end of the sorted array, the mean is called the p% trimmed mean

Example: 11 values, 20% of 11 = 2.2, rounded to 2
Take out 2 values from each side, and calculate the mean of the rest

58.2 86.7 90.8 98.2 101.8 110.7 115.3 117.1 133.8 183.7 199.6
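The trimming procedure can be sketched as a small function. The name `trimmed_mean` and its signature are chosen for this illustration, and the number of values to cut is rounded to the nearest whole number as in the example above (other references truncate instead, which is an alternative convention).

```python
def trimmed_mean(data, percent):
    """Mean after dropping `percent` % of the sorted values from each end."""
    ordered = sorted(data)
    k = round(len(ordered) * percent / 100)   # values removed from each end
    if 2 * k >= len(ordered):
        raise ValueError("trimming would remove every value")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


weights = [58.2, 86.7, 90.8, 98.2, 101.8, 110.7,
           115.3, 117.1, 133.8, 183.7, 199.6]

plain = sum(weights) / len(weights)
trimmed = trimmed_mean(weights, 20)           # 20% of 11 = 2.2, so 2 per side

print(round(plain, 2), round(trimmed, 2))
```

For these weights the 20% trimmed mean comes out lower than the ordinary mean, because the two heaviest values (183.7 and 199.6) no longer pull it upward.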

Mean and Median
Two samples of 10 seedlings were planted in a greenhouse. One
sample was treated with nitrogen and the other was not treated.
All other environmental conditions were held constant. The
weights of the stems of the plants growing out of these seedlings
were recorded after 140 days and are presented in the following
table:

Mean and Median
S.No.    No Nitrogen             Nitrogen
1        0.28                    0.26
2        0.32                    0.43
3        0.36                    0.46
4        0.37                    0.47
5        0.38                    0.49
6        0.42                    0.52
7        0.43                    0.62
8        0.43                    0.75
9        0.47                    0.79
10       0.53                    0.86
Sum      3.99                    5.65
Mean     3.99/10 = 0.399         5.65/10 = 0.565
Median   (0.38+0.42)/2 = 0.4     (0.49+0.52)/2 = 0.505
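The table's summary rows can be verified with a few lines of Python. This sketch only recomputes the mean and median of the two columns as listed; no units are assumed because the slide does not state them.

```python
from statistics import mean, median

no_nitrogen = [0.28, 0.32, 0.36, 0.37, 0.38, 0.42, 0.43, 0.43, 0.47, 0.53]
nitrogen = [0.26, 0.43, 0.46, 0.47, 0.49, 0.52, 0.62, 0.75, 0.79, 0.86]

for label, sample in (("No Nitrogen", no_nitrogen), ("Nitrogen", nitrogen)):
    # Round only for display; floating-point sums are not exact decimals
    print(label,
          "sum:", round(sum(sample), 2),
          "mean:", round(mean(sample), 3),
          "median:", round(median(sample), 3))
```

In the nitrogen sample the mean (0.565) sits noticeably above the median (0.505), which hints that a few large values are pulling the mean upward.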

Trimmed Mean

Trimmed Mean - Example
