Session 1&2 - Descriptive Statistics (GbA) PDF

Introduction to Statistical Methods
BITS Pilani Prof.Gangaboraiah PhD

Bangalore Campus
Descriptive Statistics
About me
• Dr. Gangaboraiah, PhD (Stats)
• Former Professor of Statistics, KIMS, Bangalore
• Work Experience
• Kempegowda Institute of Medical Sciences, Bangalore (34 years)
• Govt. Homeopathy Medical College, Bangalore (4 years)
• SJC Institute of Technology, Chickballapur (13 years, Visiting Professor)
• Manipal University, Bangalore Centre (Since 2008, Visiting Professor)
• MS (Computer Science), MS (Computer Network)
• Data Science
• BITS (Since 2013, Visiting Professor)
• MTech (Data Science)
• WIPRO and Aricent (2019)
Agenda
Here’s what you will learn in the entire Session:
1 Data Visualization: Why? What? How?
2 Measures of Central Tendency
3 Measures of Dispersion/ Variation
Data Visualization
Data is generated everywhere …everyday
…and is increasing exponentially
Source: http://3dsbiovia.com/blog/
Data Has Become ‘Big’
The 5 Vs of data are ever increasing
🡺 Need an effective way to understand them quickly
[Figure: the 5 Vs of Big Data]
Source: http://hedureka.com
Data Visualization
• Visual representation of data
• For exploration, discovery , insight ….
• Interactive component provides more insight as
compared to static images
Image credits: www.researchgate.net
How many 3’s ?
• How much time did you take ?
Image credits: www.researchgate.net
How many 3’s ?
• How much time did you take now ?

Pre-attentive Processing:
• Requires attention despite the name
• Very fast: < 200-250 ms
• What matters most is the contrast between features
Highest sales figure for each product?
Annual Sales (in US $)
US States Product A Product B Product C Product D Product E
Pennsylvania 30,934.00 30,112.00 26,376.00 34,459.00 44,105.00
Illinois 49,743.00 40,996.00 33,527.00 45,298.00 31,961.00
Ohio 42,439.00 49,024.00 43,670.00 48,584.00 47,183.00
Georgia 29,344.00 41,095.00 37,954.00 38,337.00 34,406.00
North Carolina 30,443.00 31,941.00 29,231.00 49,090.00 41,067.00
Michigan 41,241.00 36,036.00 43,715.00 34,026.00 32,050.00
New Jersey 40,437.00 49,737.00 40,207.00 46,347.00 31,150.00
Virginia 38,816.00 44,372.00 35,359.00 31,226.00 26,563.00
Washington 30,773.00 31,220.00 47,950.00 36,900.00 40,732.00
Georgia 35,738.00 28,715.00 29,418.00 29,913.00 29,859.00
• Is it easy to find out ?
“The purpose of visualization is insight, not just picture.”
Data visualization pioneer, Ben Shneiderman
Need for Data Visualization
A tool that enables a user to gain insight into data
Broadly three types of goals:
• To explore:
o Nothing is known
o Required to get an insight
• To analyze :
o There are hypotheses
o Used for verification or falsification
• To present:
o We have the required information
o Used for communication of results
Source: Google images
What experts say ?
“Data visualization is the use of visual representations to
explore, make sense of, and communicate data.”
Data visualization expert, Stephen Few
Visualization transforms data into images that effectively
and accurately represent information about the data.
– Schroeder et al. The Visualization Toolkit, 2/e 1998
Turning the invisible into something visible that people can understand
intuitively
History of Visualization
Visualization is very old: L. da Vinci (1452 – 1519)
Often an intuitive step: graphical illustration
Image source: http://www.leonardo-da-vinci-biography.com/leonardo-davinci-anatomy.html
Visualization of Napoleon's Army
Impact of Visualization
• John Snow’s Cholera Map (1854)
• Snow used a spot map to illustrate how cases of cholera clustered around the
pump
Truth about Crime – BBC
http://www.bbc.co.uk/truthaboutcrime/crimemap/
Good data representation principles
Exercise
Break out into groups of two and identify five good
data visualization principles
– 5 minutes
1. Use Colors Wisely – 1/5
What is wrong with this color scale?
1. Use Colors Wisely -2/5
Not a bad choice of color scale, but the Dynamic Range
needs some work
Do Not Attempt to Fight Pre-Established Color Meanings
Red                      Green                   Blue
• Stop                   • Go                    • Cool
• Off                    • On                    • Safe
• Dangerous              • Plants                • Deep
• Hot                    • Carbon                • Nitrogen
• High stress            • Moving                • Job completed
• Oxygen                 • Money
• Shallow                • All OK
• Money loss             • SLA Met
• Project running late   • Project on schedule
Which one is easier to read ?
Use good contrast as human eye is good at difference
NB : Could you also spot a grammatical error in above ?

Just because there are million of colors to chose
from does NOT mean that you have to use them all
Avoid Color Pollution
Use a different color for extrapolation into the future
2. Reduce clutter
Three graphs from the same data
• Why are they all different ?

• What is good/ bad about each one of them ?
2. Reduce Clutter
Can you draw inferences from these charts ?
Make your data stand out, by reducing the clutter

3. Improving the Vision
Use visually prominent graphical elements to show the data
• Connecting lines should never obscure points and points should not obscure
each other.
• If multiple samples overlap, a representation should be chosen for the
elements that emphasizes the overlap.
• If multiple data sets are represented in the same plot (superposed data), they must be visually
separable.
• If this is not possible due to the data itself, the data can be separated into
adjacent plots that share an axis.
4. Use proper scale
Choosing a proper scale makes the trend easy to decipher
• Horizontal and vertical axes must be labeled, or data points labeled.
• Add margins for data
• Tick marks should point outward, with 3-10 per axis
5. Reference lines, labels, notes
Use reference lines, labels, notes, keys, etc. only when
necessary and don’t let them obscure data.
6. Align juxtaposed plots
Make sure that scales match and graphs are aligned, to
improve understanding
Gestalt principles
• The brain creates a perception that is more than the sum of available visual
inputs.
• Gestalt principles are used to identify which elements in our visualization are
signal (the information we want to communicate) and which are noise (clutter).
• Six Gestalt principles are as follows:

1. Proximity
2. Similarity
3. Closure
4. Enclosure
5. Continuity
6. Connection
Law of Proximity
We perceive objects close to each other as belonging to a group.
Law of Similarity
We seek similarities and differences in objects and link similar
objects as belonging to a group.
Law of Closure
Our minds tend to see complete figures even if a picture is
incomplete.
Law of Enclosure
We perceive objects as belonging to a group when they are enclosed in
a way that creates a boundary or border around them.
Law of Continuity
Our tendency is to see shapes as continuous to the greatest
degree possible. The human eye follows lines, curves or a
sequence of shapes to create pathways.
Law of Connection
We perceive objects connected to each other as a single group as
opposed to objects that are not linked in the same manner.
Are you ready to create a dashboard ?
Dashboard
Elements of dashboard design
Chart type
Steps for a great dashboard design
• Content
• Know your target audience

o Who is the audience? Their need?
o Relationship with presenter? 1st time or established trust?
o Face to face or connected remotely?
• Identify the content they will be looking for

o What the audience is expected to learn?
o Key takeaways and next steps
• Tools
Steps for a great dashboard design
Avoid excessive details
Chart Types
Line charts are great when it comes to displaying patterns of
change across a continuum.
Chart Types
• Choose bar charts if you want to compare items in the same
category.
• The objective is not just to compare but also show how much one is
better or worse than the rest.
Chart Types
Sparklines usually don’t have a scale which means that users
will not be able to notice individual values. They work well
when you have a lot of metrics and you want to show only the
trends.
Charts To Avoid
Avoid scatterplots. They lack precision and clarity as the
relationships between two quantitative measures don’t
change very frequently.
Charts To Avoid
Avoid Pie charts. They rank low in precision because users
find it difficult to accurately compare the sizes of the pie slices.
Charts To Avoid
Avoid bubble charts. They require too much mental effort from their
users even when it comes to reading simple information in a context.
Visualization and layout design
• Place the most important information at the top left of the dashboard.
• Reason – Humans follow F shaped visual scanning path.
Visualization and layout design
• Avoid highly saturated colors; instead, choose a few
colors and stick to them.
• Use the same color for the same item on all charts.
• Use colors (or encircle it) to highlight the target data
Some tools that aid in data
visualization include:
- Tableau
- PowerPoint
- QlikView
- Python visualization library
(Matplotlib)
- Google Charts
and so on
Summarisation of Data
Data Summarization
Measures of Central Tendency and Dispersion/ Variation
Measures of Central Tendency
• A measure of central tendency provides a very convenient way
of describing a set of scores with a single number that
describes the PERFORMANCE of the group.
• Also defined as a single value that is used to describe the
“center” of the data.
• Three commonly used measures of central tendency:

1. Mean
2. Median
3. Mode
Ungrouped Distribution
Prepare a report showing the number of hours per week
students spend studying, from a random sample of 30
students. Determine the number of hours each student
studied last week.
15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
17.8, 33.8, 23.2, 12.9, 27.1, 16.6.
Grouped Frequency Distribution
Table No. Title. Head note
Caption
Stub heading   Column heading   Column heading   Total
Row heading    Body of the table                 r1
Row heading                                      r2
Total          c1               c2               n
Footnote
Source
Marks obtained   No. of persons
45               10
46               15
47               30
48               25
49               15
50               5
Total            100

Age (yrs)   No. of persons
<1          15
1-5         15
6-12        30
13-19       35
20-29       45
30-39       65
40-49       44
50-60       32
> 60        19
Total       300
Prepare a report showing the number of hours per week
students spend studying, from a random sample of 30
students. Determine the number of hours each student
studied last week.
15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
17.8, 33.8, 23.2, 12.9, 27.1, 16.6.
Organize the data into a frequency distribution by
considering the class intervals (a) 7.5 – 12.5, 12.5 – 17.5,
etc. and (b) 10 – 15, 15 – 20, etc.
(a)
Hours of studying   No. of persons (f)
07.5 – 12.5          1
12.5 – 17.5         12
17.5 – 22.5         10
22.5 – 27.5          5
27.5 – 32.5          1
32.5 – 37.5          1
Total               30
(b)
Hours of studying   No. of persons (f)
10 - 15              7
15 - 20             12
20 - 25              7
25 - 30              3
30 - 35              1
Total               30
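The two frequency tables can be checked with a short Python sketch. The helper name `frequency_table` is hypothetical, and the sketch assumes each class includes its lower limit and excludes its upper limit (so 15.0 falls in 15 - 20), which is the convention that reproduces the counts shown above.

```python
# Study hours of the 30 sampled students (data from this section).
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
    13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
    18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
    17.8, 33.8, 23.2, 12.9, 27.1, 16.6,
]

def frequency_table(values, start, width, n_classes):
    """Count values per class [lower, upper); the upper limit is excluded (assumption)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)  # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

freq_a = frequency_table(hours, start=7.5, width=5, n_classes=6)   # 7.5-12.5, 12.5-17.5, ...
freq_b = frequency_table(hours, start=10.0, width=5, n_classes=5)  # 10-15, 15-20, ...
print(freq_a, freq_b)
```

The same data gives different-looking distributions depending on where the class intervals start, which is worth keeping in mind when reading a histogram.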
Class midpoint: to find the midpoint of each interval, use the following formula:
Midpoint = (Upper limit + Lower limit) / 2
(a)
Hours of studying   Mid point (x)          No. of persons (f)
07.5 – 12.5         (12.5+07.5)/2 = 10      1
12.5 – 17.5         (17.5+12.5)/2 = 15     12
17.5 – 22.5         (22.5+17.5)/2 = 20     10
22.5 – 27.5         (27.5+22.5)/2 = 25      5
27.5 – 32.5         (32.5+27.5)/2 = 30      1
32.5 – 37.5         (37.5+32.5)/2 = 35      1
Total                                      30
(b)
Hours of studying   Mid point (x)          No. of persons (f)
10 - 15             (10+15)/2 = 12.5        7
15 - 20             (15+20)/2 = 17.5       12
20 - 25             (20+25)/2 = 22.5        7
25 - 30             (25+30)/2 = 27.5        3
30 - 35             (30+35)/2 = 32.5        1
Total                                      30
Relative Frequency Distribution: shows the proportion of observations in each class
(a)
Hours of studying   No. of persons (f)   Relative frequency
07.5 – 12.5          1                   1/30 = 0.03
12.5 – 17.5         12                   12/30 = 0.40
17.5 – 22.5         10                   10/30 = 0.33
22.5 – 27.5          5                   5/30 = 0.17
27.5 – 32.5          1                   1/30 = 0.03
32.5 – 37.5          1                   1/30 = 0.03
Total               30                   1
(b)
Hours of studying   No. of persons (f)   Relative frequency
10 - 15              7                   7/30 = 0.23
15 - 20             12                   12/30 = 0.40
20 - 25              7                   7/30 = 0.23
25 - 30              3                   3/30 = 0.10
30 - 35              1                   1/30 = 0.03
Total               30                   1
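As a minimal sketch, the relative frequencies for class intervals (a) can be computed by dividing each class frequency by the total. The variable names are illustrative only. Note that the rounded values need not add up to exactly 1.

```python
# Class frequencies for the intervals 7.5-12.5, ..., 32.5-37.5 (table (a)).
freq = [1, 12, 10, 5, 1, 1]
n = sum(freq)  # total number of observations

# Relative frequency = class frequency / total frequency, rounded to 2 decimals.
rel_freq = [round(f / n, 2) for f in freq]
print(rel_freq)

# The unrounded proportions always sum to 1 (up to floating-point error).
exact_total = sum(f / n for f in freq)
```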
The Mean
• Also referred to as the “arithmetic average”
• The most commonly used measure of the center of
data
• Computation of the sample mean for ungrouped data:
the sum of the observations divided by the number of
observations. If X denotes the variable and x1, x2,
…, xn are the values of X, then
$$\bar{X} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• If the mean is to be calculated for population-based
data, then it is given by
$$\mu = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Computation of mean
Population data:
$$\mu = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Sample data:
$$\bar{X} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
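To illustrate the formula, here is a small Python sketch that applies it to the 30 study-hour observations used earlier in this deck. It computes the mean by hand and compares it with `statistics.fmean` from the standard library; the variable names are illustrative.

```python
import statistics

# Study hours of the 30 sampled students (ungrouped data from this deck).
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
    13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
    18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
    17.8, 33.8, 23.2, 12.9, 27.1, 16.6,
]

# Sample mean: sum of the observations divided by the number of observations.
mean_by_hand = sum(hours) / len(hours)

# The standard library gives the same value.
mean_stdlib = statistics.fmean(hours)
print(round(mean_by_hand, 2))
```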
Mean for ungrouped data
0 1 2 3 4 5 6 7 8 9 10
Mean = 5
0 1 2 3 4 5 6 7 8 9 10 12 14
Mean = 6
Mean for ungrouped data
0 1 2 3 4 5 6 7 8 9 10 12 14 . . . 24
Mean = 8
0 1 2 3 4 5 6 7 8 9 10 12 14 . . . 44
Mean = 12
• Computation of the mean for grouped frequency data:
the sum of the products of frequency (f) and midpoint (x)
of each class interval, divided by the total frequency. If x1, x2,
…, xn are the midpoints of the class intervals and f1, f2, …,
fn are their corresponding frequencies, then the mean
is calculated by
$$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{N} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
Mean for grouped data
Using Direct and Step-deviation methods
Hours of studying   fi   Mid point (xi)   fixi   di = (xi − A)/h   fidi
07.5 – 12.5          1   10                10    −2                 −2
12.5 – 17.5         12   15               180    −1                −12
17.5 – 22.5         10   A = 20           200     0                  0
22.5 – 27.5          5   25               125    +1                  5
27.5 – 32.5          1   30                30    +2                  2
32.5 – 37.5          1   35                35    +3                  3
Total               30   ∑fixi = 580             ∑fidi = −4
Direct method:
$$\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{N} = \frac{580}{30} = 19.33$$
Step-deviation method (A = 20, h = 5):
$$\bar{X} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} \times h = 20 + \frac{-4}{30} \times 5 = 19.33$$
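Both methods can be reproduced in a few lines of Python. This is a sketch using the midpoints and frequencies from the table, with the assumed mean A = 20 and class width h = 5; the variable names are illustrative.

```python
# Class midpoints and frequencies from the grouped table of study hours.
midpoints = [10, 15, 20, 25, 30, 35]
freqs = [1, 12, 10, 5, 1, 1]
N = sum(freqs)  # total frequency

# Direct method: sum of f*x divided by N.
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
direct_mean = sum_fx / N

# Step-deviation method: code each midpoint as d = (x - A) / h,
# average the coded values, then convert back to the original scale.
A, h = 20, 5
d = [(x - A) / h for x in midpoints]
sum_fd = sum(f * di for f, di in zip(freqs, d))
step_mean = A + (sum_fd / N) * h

print(round(direct_mean, 2), round(step_mean, 2))
```

The step-deviation method exists mainly to simplify hand calculation; both methods always give the same mean.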
Mean for grouped data
Find the mean of the wages of the employees of a company, which are as follows:
Wages of employees (Rs)   No. of persons
4001 - 4500                25
4501 - 5000                36
5001 - 5500                45
5501 - 6000                62
6001 - 6500                39
6501 - 7000                55
7001 - 7500                44
7501 - 8000                29
8001 - 8500                15
Total                     350
Mean: Grouped Scores
Data of Children watching TV in Bangalore
Hours of TV watching (X)   No. of children (f)   fX    Percentage   Cumulative %
1                          104                   104   31.3         31.3
2                          130                   260   39.2         70.5
3                           98                   294   29.5         100
Total                      332                   658   100.0
Mean = ∑fX / ∑f = 658 / 332 ≈ 1.98 hours
Mean - Properties
• It measures stability. The mean is the most stable of the
measures of central tendency because every score contributes to
the value of the mean.
• It is rigidly defined and therefore suitable for further mathematical
analysis.
• It is easily affected by extreme scores (outliers).
• The sum of each score’s deviation from the mean is zero:
Σ(X − Mean) = 0
• It can be applied to interval and ratio levels of measurement.
• It may not be an actual score in the distribution.
• It is very easy to compute.
Mean
When to use the Mean
• Sampling stability is desired.
• Other measures are to be computed, such as standard
deviation, coefficient of variation and skewness.
The Standard Deviation
Sl. No.   X1   X2
1          2    3
2          8   10
3          5    5
4          3    3
5          7    7
6          8    3
7          5    5
8          2    6
9          5    3
Total     45   45

Statistical measures   Group 1   Group 2
Mean                   5         5
Median                 5         5
Mode                   5         3
??? - Observe the following data
[Figure: dot plots of Group A, Group B and Group C on a common axis from 11 to 21]
Compute Mean for all three groups
Group A: Mean = 15.5
Group B: Mean = 15.5
Group C: Mean = 15.5
??? - Observe the following data
Since the mean of all three groups is the same, do you need
anything else besides the mean to describe the data?
• Measures of dispersion or variation
o Standard deviation
o Variance
o Range
Standard Deviation
G1 (xi)   xi − x̄   (xi − x̄)²
12        −3.5      12.25
13        −2.5       6.25
15        −0.5       0.25
15        −0.5       0.25
15        −0.5       0.25
16         0.5       0.25
17         1.5       2.25
21         5.5      30.25
Total      0        52.0
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{52.0}{7}} = 2.726$$
Standard Deviation
G1   G2   G3
12   14   11
13   15   11
15   15   11
15   15   12
15   16   19
16   16   20
17   16   20
21   17   20
Calculate SD of groups G2 and G3
Standard Deviation
[Figure: dot plots of Group A, Group B and Group C on a common axis from 11 to 21]
Group A: Mean = 15.5, SD = 2.726
Group B: Mean = 15.5, SD = 0.926
Group C: Mean = 15.5, SD = 4.567
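The three standard deviations can be verified with Python's standard library. This sketch assumes that the columns G1, G2 and G3 in the earlier table correspond to Groups A, B and C; `statistics.stdev` uses the sample formula with n − 1 in the denominator.

```python
import statistics

# Scores of the three groups (columns G1, G2, G3; assumed to be Groups A, B, C).
groups = {
    "A": [12, 13, 15, 15, 15, 16, 17, 21],
    "B": [14, 15, 15, 15, 16, 16, 16, 17],
    "C": [11, 11, 11, 12, 19, 20, 20, 20],
}

# Same mean in every group, but very different spread.
summary = {
    name: (statistics.mean(scores), round(statistics.stdev(scores), 3))
    for name, scores in groups.items()
}
for name, (mean, sd) in summary.items():
    print(f"Group {name}: mean = {mean}, SD = {sd}")
```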
Sl. No.   X1   X2
1          2    3
2          8   10
3          5    5
4          3    3
5          7    7
6          8    3
7          5    5
8          2    6
9          5    3
Total     45   45

Statistical measures           Group 1   Group 2
Mean                           5         5
Median                         5         5
Mode                           5         3
Range                          2 - 8     3 - 10
Mean Deviation                 2         2
Variance                       5.50      5.75
Standard Deviation             2.34      2.40
Coefficient of Variation (%)   46.90     48.00
Standard Deviation
The Standard Deviation is a measure of dispersion
or variation: a descriptive statistic that
describes how similar a set of scores are to each other.
The more similar the scores are to each other, the
lower the standard deviation will be, and the less
similar the scores are to each other, the higher the
standard deviation will be.
In general, the more spread out a distribution is, the
larger the measure of dispersion will be.
Measure of Variability/ Dispersion
Which of the distributions of scores has the larger dispersion?
[Figure: two histograms of scores 1 to 10, frequency axis 0 to 125, one above the other]
The upper distribution has more dispersion because the scores are more spread out.
That is, they are less similar to each other.
Measure of Variability/ Dispersion
Variability can be defined several ways:
• A quantitative distance measure based on the
differences between scores
• Describes distance of the spread of scores or
distance of a score from the mean
Purposes of Measure of Variability:
• Describe the variability in the distribution
• Measure how well an individual score represents the
distribution
New Strategy:
1. Find the deviation of each score, i.e., (X − Mean)
2. Square the deviation of each score, i.e., (X − Mean)²
3. Sum the squared deviations, i.e., Σ(X − Mean)²
4. Average the squared deviations: Σ(X − Mean)²/n
• The mean squared deviation is known as the “Variance”
• Variability is now measured in squared units
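The four steps map directly onto code. The sketch below applies them to the X1 column from the earlier table (2, 8, 5, 3, 7, 8, 5, 2, 5). Dividing by n gives the population-style variance described in the steps; dividing by n − 1 instead gives the sample variance, which is the 5.50 reported for Group 1. Variable names are illustrative.

```python
scores = [2, 8, 5, 3, 7, 8, 5, 2, 5]  # column X1 from the earlier table
n = len(scores)
mean = sum(scores) / n

# Step 1: deviation of each score from the mean.
deviations = [x - mean for x in scores]

# Step 2: square each deviation.
squared = [d ** 2 for d in deviations]

# Step 3: sum the squared deviations (often called SS).
ss = sum(squared)

# Step 4: average the squared deviations.
population_variance = ss / n       # divide by n
sample_variance = ss / (n - 1)     # divide by n - 1 for sample data

print(ss, round(population_variance, 2), sample_variance)
```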
The Population Variance
• Population variance equals the mean (average) squared
deviation (distance) of the scores from the population
mean
• Variance is the average of squared deviations, so we
identify population variance with a lowercase Greek
letter sigma squared: σ²
• Standard deviation is the square root of the variance,
so we identify it with a lowercase Greek letter sigma: σ
Computation of Standard Deviation
Population data:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
Sample data:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Formula for grouped data
◊ Standard deviation for grouped data
• Most common and most important measure of variability is
the standard deviation
o A measure of the standard, or average, distance from
the mean
o Describes whether the scores are clustered closely
around the mean or are widely scattered
• Calculation differs for population and samples
• Variance is a necessary companion concept to standard
deviation but not the same concept
Exercise : Find out the deviations of all the data points with
the mean….and then find the ‘mean deviation’.
Interpretation
Important note:
◊ Mean and Standard Deviation
◊ Mean and Variance
Which pair should be used for comparison? Why?
Interpretation
Important note:
◊ The mean alone is not enough to describe the data.
The standard deviation is also needed to give a
complete description, and the two can be reported
together because both have the same unit of measurement.
Coefficient of Variation
Important note:
◊ If two variables are measured in different
units of measurement, which is the best
measure to compare their variability?
Ans: Coefficient of Variation
$$CV = \frac{S}{\bar{X}} \times 100$$
Variables      Mean    SD     CV (%)
Height (cms)   169.6   27.5   16.21
Weight (kgs)   88.7    21.8   24.58
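A minimal Python sketch of the calculation, using the height and weight figures from the table. The function name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = standard deviation divided by mean, times 100."""
    return sd / mean * 100

# Mean and SD from the table above.
cv_height = round(coefficient_of_variation(sd=27.5, mean=169.6), 2)
cv_weight = round(coefficient_of_variation(sd=21.8, mean=88.7), 2)
print(cv_height, cv_weight)
```

Although height has the larger SD in absolute terms, weight is relatively more variable, which is what the CV is designed to reveal.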
The Median
• The score that divides the distribution into two
equal parts, so that half the cases are above it and
half below it.
• The median is the middle score, or average of
middle scores in a distribution.
o Fifty percent (50%) lies below the median value
and 50% lies above the median value.
o It is also known as the middle score or the 50th
percentile.
The Median
Median of Ungrouped Data
• Arrange the scores (from lowest to highest
or highest to lowest).
• Determine the middle most score in the
distribution if n is an odd number (and if n
is an even number, get the average of the
two middle most scores)
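The procedure above can be sketched as a small Python function. The name `median_ungrouped` is hypothetical; the standard library's `statistics.median` does the same job and is used here only as a cross-check. The two inputs are the small data sets used as examples in this section.

```python
import statistics

def median_ungrouped(scores):
    """Median of ungrouped data: sort, then take the middle score(s)."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle score, at position (n + 1) / 2.
        return ordered[mid]
    # Even n: average of the scores at positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median_ungrouped([151, 168, 174])
even_case = median_ungrouped([166, 169, 172, 185])
print(odd_case, even_case)
```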
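Median
151, 168, 174: Median = 168
166, 169, 172, 185: Median = (169 + 172)/2 = 170.5
Median
$$\text{Median} = \begin{cases} \text{value in position } \dfrac{n+1}{2}, & \text{if } n \text{ is odd} \\ \text{average of the values in positions } \dfrac{n}{2} \text{ and } \dfrac{n}{2} + 1, & \text{if } n \text{ is even} \end{cases}$$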
Median in Grouped Data
$$\text{Median} = L + \frac{\frac{N}{2} - Cf}{f} \times h$$
Where:
• L = Lower boundary of the class containing the N/2-th score (the median class, MC)
• Cf = Cumulative frequency before the median class when
the scores are arranged from the lowest to the highest value
• h = Size of the class interval
• f = Frequency of the median class
Median in Grouped Data
Steps to solve median for grouped data
1. Complete the table for cumulative frequency.
2. Get N/2 of the scores in the distribution so that
you can identify the median class (MC)
3. Determine L, h, f and the cumulative frequency Cf
4. Solve for the median using the given formula
Median
Example: The scores of 40 students on a 60-item science test
are tabulated below. The highest score is 54 and the lowest
score is 10.
Median
Solution:
• N/2 = 40/2 = 20
• The category containing N/2 is (35 – 39)
• Lower Limit of MC = 35
• L = 34.5
• Cf (or Cfp) = 17
• f (or fm) = 9
• h = 5
• Median = L + (N/2 – Cf)/f * h
= 34.5 + (20 - 17)/9 * 5
= 34.5 + 15/9
= 36.17
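The same substitution can be written as a short Python sketch. The function name `grouped_median` is hypothetical; the arguments are the values identified in the solution above.

```python
def grouped_median(L, N, Cf, f, h):
    """Median for grouped data: L + ((N/2 - Cf) / f) * h."""
    return L + ((N / 2 - Cf) / f) * h

# Values from the worked example: median class 35-39.
median = grouped_median(L=34.5, N=40, Cf=17, f=9, h=5)
print(round(median, 2))
```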
Median - Properties
• It may not be an actual observation in the data set.
• Not affected by extreme values because median is a
positional measure.
• Can be applied in ordinal level.
When to Use the Median

o The exact midpoint of the score distribution is
desired.
o There are extreme scores in the distribution.
Analyze and interpret this data
Sl. No.   Intelligence Quotient   Hours of TV watching per week   Duration of hospital stay (days)
1         106                      7                                8
2          86                      2                               40
3         200                     27                               10
4         101                     70                               80
5         199                      8                              180
6         103                      9                                5
7         197                     20                               80
8         113                     12                               10
9         112                      6                                5
10         65                     17                                8
Shape of the Distribution
Shape of the Distribution
• Symmetrical: mean is about equal to median
• Normality: mean = median = mode
• Skewed: deviation from normality
  • Negatively: mean < median < mode
  • Positively: mean > median > mode
• Bimodal: has two distinct modes
• Multi-modal: has more than 2 distinct modes
Quartiles
• Not a Measure of Central Tendency
• Split Ordered Data into 4 Quarters
25% | 25% | 25% | 25%
    Q1    Q2    Q3
• Position of the i-th quartile in the ordered array: position of Qi = i(n + 1)/4
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Position of Q1 = 1•(9 + 1)/4 = 2.50, Q1 = 12.5
Interquartile range (IQR)
• Measure of Variation
• Also Known as Midspread: Spread in the Middle 50%
• Difference Between Third & First Quartiles
• Not Affected by Extreme Values
Interquartile Range = Q3 – Q1
Data in Ordered Array: 11 12 13 16 16 17 17 18 21
Position of Q1 = 1•(9 + 1)/4 = 2.50, Q1 = 12.5
Position of Q3 = 3•(9 + 1)/4 = 7.50, Q3 = 17.5
Interquartile Range = Q3 – Q1 = 17.5 - 12.5 = 5
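The position rule i(n + 1)/4 can be sketched in Python as follows. The helper `quartile` is hypothetical; it interpolates linearly between neighbouring values when the position is fractional. Python's `statistics.quantiles` with `method="exclusive"` uses the same (n + 1) convention and serves as a cross-check. Other software may use a different convention and give slightly different quartiles.

```python
import statistics

def quartile(ordered, i):
    """i-th quartile using the 1-based position i * (n + 1) / 4."""
    n = len(ordered)
    pos = i * (n + 1) / 4
    lower = int(pos)       # position of the value just below
    frac = pos - lower     # fractional part used for interpolation
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)

data = [11, 12, 13, 16, 16, 17, 17, 18, 21]  # already in ascending order
q1, q3 = quartile(data, 1), quartile(data, 3)
iqr = q3 - q1
print(q1, q3, iqr)

# Cross-check with the standard library (same n + 1 convention).
q1_lib, q2_lib, q3_lib = statistics.quantiles(data, n=4, method="exclusive")
```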
Box and Whisker plot
[Figure: box-and-whisker plots of Pain (VAS) for females (N = 74) and males (N = 27), marking the 2.5th centile, 25th centile (Q1), median (Q2, 50th centile), 75th centile (Q3), 97.5th centile and the inter-quartile range]
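Box and Whisker plot
IQR = Q3 – Q1
[Figure: box-and-whisker plot marking Min, Q1, Q2, Q3 and Max; points below Q1 − 1.5 IQR or above Q3 + 1.5 IQR are outliers]
Box-and-Whisker plot
IQR = Q3 – Q1
[Figure: box-and-whisker plot marking Min, Q1, Q2, Q3 and Max; points below Q1 − 3 IQR or above Q3 + 3 IQR are major outliers]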
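The fences can be computed directly. The sketch below uses the "Hours of TV watching per week" column from the data table in this deck, and assumes the (n + 1) quartile convention via `statistics.quantiles` (whose default method is `"exclusive"`); a different quartile convention would shift the fences slightly.

```python
import statistics

# Hours of TV watching per week for the 10 subjects in the data table.
tv_hours = [7, 2, 27, 70, 8, 9, 20, 12, 6, 17]

q1, q2, q3 = statistics.quantiles(tv_hours, n=4)  # default method="exclusive"
iqr = q3 - q1

# Inner fences (1.5 IQR) flag outliers; outer fences (3 IQR) flag major outliers.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

outliers = [x for x in tv_hours if x < inner_low or x > inner_high]
major_outliers = [x for x in tv_hours if x < outer_low or x > outer_high]
print(outliers, major_outliers)
```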
Percentiles
A score below which a specific percentage of the distribution
falls.
Finding percentiles in ungrouped data (position of the i-th percentile in the ordered array):
$$\text{Position of } P_i = \frac{i(n + 1)}{100}, \quad i = 1, 2, 3, \dots, 99$$
Finding percentiles in grouped data:
$$P_i = L + \frac{0.01\,iN - Cf}{f} \times h, \quad i = 1, 2, 3, \dots, 99$$
The Mode
• The category or score with the largest frequency (or
percentage) in the distribution.
• The mode can be calculated for variables with levels of
measurement that are: nominal, ordinal, or interval-ratio.
Example:
• Number of Votes for Candidates for Lok Sabha MP. The mode, in this
case, gives you the “central” response of the voters: the most popular
candidate.
• Candidate A – 11,769 votes
• Candidate B – 39,443 votes
• Candidate C – 78,331 votes
The Mode: “Candidate C”
Mode
Properties
• It can be used when the data are qualitative as well as
quantitative.
• It may not be unique (uni-, bi-, tri-, poly-modal)
• It is not affected by extreme values (outliers).
• It may not exist (ill-defined; it can then be estimated as Mode = 3 Median – 2 Mean).
When to Use the Mode
o When the “typical” value is desired.
o When the data set is measured on a nominal scale

The Range
• The distance covered by the scores in a distribution, from the smallest
value to the highest value
• For continuous data, real limits are used
Range = XMax − XMin
Range = URL for XMax − LRL for XMin
• Based on two scores, not all the data: an imprecise, unreliable
measure of variability
Example: For a set of scores: 7, 2, 7, 6, 5, 6, 2
Range = Highest score minus lowest score = 7 − 2 = 5 (the scores run from 2 to 7)
Learning Check
a) If all the scores in a data set are the same, the Standard
Deviation is equal to 1.00. True / False?
Select the correct option
b) The standard deviation measures …
(1) Sum of squared deviation scores
(2) Standard distance of a score from the mean
(3) Average deviation of a score from the mean
(4) Average squared distance of a score from the mean
Solution
a) False. If all the scores in a data set are the same, they are
all equal to the mean, so every deviation from the mean
is 0 and the Standard Deviation is equal to zero.
b) The standard deviation measures …
(1) Sum of squared deviation scores
(2) Standard distance of a score from the mean
(3) Average deviation of a score from the mean
(4) Average squared distance of a score from the mean
Learning Check
a) A sample of four scores has SS = 24. What is the variance?
(1) The variance is 6
b) A sample systematically has less variability than a population. True / False?
c) The standard deviation is the distance from the mean to the farthest point
on the distribution curve. True / False?
Solution
a) A sample of four scores has SS = 24. What is the variance?
For a sample, variance = SS/(n − 1) = 24/3 = 8.
b) True (???). Extreme scores affect variability, but are less
likely to be included in a sample.
c) False. The standard deviation extends from the mean
approximately halfway to the most extreme score.
Exercise 3
Compute quartiles using Python or R and construct a Box-and-
Whisker plot for the following data on wages to represent IQR
Exercise 1
The following data is the wages of 350 employees of an organisation.
Compute (a) Mean, (b) Median (c) Mode and (d) Standard deviation
Wages (Rs.)    No. of persons
4001 - 4500    25
4501 - 5000    36
5001 - 5500    45
5501 - 6000    62
6001 - 6500    39
6501 - 7000    55
7001 - 7500    44
7501 - 8000    29
8001 - 8500    15
Hint for Mode: L = Lower limit of Modal class; f0 = Frequency preceding modal
class; f1 = Frequency of modal class; f2 = Frequency succeeding
modal class; h = Width of class interval
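The hint defines the symbols but the formula itself did not survive in this copy. The standard textbook interpolation formula for the mode of grouped data, written with the symbols of the hint, is assumed to be the one intended:

```latex
\text{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h
```

Here the modal class is the class with the highest frequency, so f1 is the largest entry in the frequency column.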
Exercise 2
For the following three sets of data, calculate suitable measures of
central tendency and dispersion. Represent the data graphically using
an appropriate graphical method.
Sl. No.                            1     2     3     4     5     6     7     8     9     10
Intelligence Quotient              106   86    200   101   199   103   197   113   112   65
Hours of TV watching per week      7     2     27    70    8     9     20    12    6     17
Duration of hospital stay (days)   8     40    10    80    180   5     80    10    5     8
