
Introduction to Statistical Methods

Prof. Gangaboraiah, PhD
BITS Pilani, Bangalore Campus
Descriptive Statistics
About me
• Dr. Gangaboraiah, PhD (Stats)
• Former Professor of Statistics, KIMS, Bangalore
• Work Experience
• Kempegowda Institute of Medical Sciences, Bangalore (34 years)
• Govt. Homeopathy Medical College, Bangalore (4 years)
• SJC Institute of Technology, Chickballapur (13 years, Visiting Professor)
• Manipal University, Bangalore Centre (since 2008, Visiting Professor)
• MS (Computer Science), MS (Computer Network)
• Data Science
• BITS (since 2013, Visiting Professor)
• MTech (Data Science)
• WIPRO and Aricent (2019)
Prof.Gangaboraiah PhD (Stats) | Slide 3 of 125 Former Professor of Statistics | KIMS, B’lore
Agenda
Here’s what you will learn in the entire Session:
1 Data Visualization: Why? What? How?

2 Measures of Central Tendency


3 Measures of Dispersion/Variation

Data Visualization
Data is generated everywhere, every day, and is increasing exponentially

Source: http://3dsbiovia.com/blog/

Data Has Become ‘Big’

The 5 Vs of Big Data are ever increasing
🡺 Need an effective way to understand them quickly
Source: http://hedureka.com
Data Visualization
• Visual representation of data
• For exploration, discovery, insight, ...
• Interactive components provide more insight than static images

Image credits: www.researchgate.net

How many 3’s?

• How much time did you take?

Image credits: www.researchgate.net

How many 3’s?

• How much time did you take now?

Pre-attentive processing:
• Requires attention, despite the name
• Very fast: under 200-250 ms
• What matters most is the contrast between features
Highest sales figure for each product?
Annual Sales (in US $)
US States Product A Product B Product C Product D Product E
Pennsylvania 30,934.00 30,112.00 26,376.00 34,459.00 44,105.00
Illinois 49,743.00 40,996.00 33,527.00 45,298.00 31,961.00
Ohio 42,439.00 49,024.00 43,670.00 48,584.00 47,183.00
Georgia 29,344.00 41,095.00 37,954.00 38,337.00 34,406.00
North Carolina 30,443.00 31,941.00 29,231.00 49,090.00 41,067.00
Michigan 41,241.00 36,036.00 43,715.00 34,026.00 32,050.00
New Jersey 40,437.00 49,737.00 40,207.00 46,347.00 31,150.00
Virginia 38,816.00 44,372.00 35,359.00 31,226.00 26,563.00
Washington 30,773.00 31,220.00 47,950.00 36,900.00 40,732.00
Georgia 35,738.00 28,715.00 29,418.00 29,913.00 29,859.00
• Is it easy to find out ?
“The purpose of visualization is insight, not pictures.”
– Ben Shneiderman, data visualization pioneer

Need for Data Visualization
A tool that enables a user to gain insight into data
Broadly three types of goals:
• To explore:
o Nothing is known
o Used to gain insight

• To analyze:
o There are hypotheses
o Used for verification or falsification

• To present:
o We have the required information
o Used for communication of results
Source: Google images
What do experts say?
“Data visualization is the use of visual representations to explore, make sense of, and communicate data.”
– Stephen Few, data visualization expert

“Visualization transforms data into images that effectively and accurately represent information about the data.”
– Schroeder et al., The Visualization Toolkit, 2/e, 1998

Turning the invisible into something visible that people can understand intuitively

History of Visualization

Visualization is very old
Often an intuitive step: graphical illustration
L. da Vinci (1452–1519)


Image source: http://www.leonardo-da-vinci-biography.com/leonardo-davinci-anatomy.html

Visualization of Napoleon's Army

Impact of Visualization
• John Snow’s Cholera Map (1854)
• Snow used a spot map to illustrate how cases of cholera clustered around the
pump

Truth about Crime – BBC

http://www.bbc.co.uk/truthaboutcrime/crimemap/
Good data representation principles

Exercise

Break out into groups of two and identify five good data visualization principles
– 5 minutes

1. Use Colors Wisely – 1/5

What is wrong with this color scale?

1. Use Colors Wisely – 2/5
Not a bad choice of color scale, but the dynamic range needs some work

1. Use Colors Wisely – 3/5
Do not attempt to fight pre-established color meanings
• Red: Stop, Off, Dangerous, Hot, High stress, Oxygen, Shallow, Money loss, Project running late
• Green: Go, On, Plants, Carbon, Moving, Money, All OK, SLA met, Project on schedule
• Blue: Cool, Safe, Deep, Nitrogen, Job completed

1. Use Colors Wisely – 4/5
Which one is easier to read ?

Use good contrast as human eye is good at difference

NB : Could you also spot a grammatical error in above ?


1. Use Colors Wisely – 5/5
Just because there are millions of colors to choose from does NOT mean that you have to use them all
Avoid color pollution

Use a different color for extrapolation into the future

2. Reduce clutter
Three graphs from the same data

• Why are they all different?
• What is good or bad about each one of them?
2. Reduce Clutter
Can you draw inferences from these charts ?

Make your data stand out, by reducing the clutter


3. Improving the Vision

Use visually prominent graphical elements to show the data

• Connecting lines should never obscure points, and points should not obscure each other.
• If multiple samples overlap, a representation should be chosen for the elements that emphasizes the overlap.
• If multiple data sets are represented in the same plot (superposed data), they must be visually separable.
• If this is not possible due to the data itself, the data can be separated into adjacent plots that share an axis.
4. Use proper scale

Choosing a proper scale makes the trend easy to decipher

• Horizontal and vertical axes must be labeled, or data points labeled.
• Add margins for data
• Tick marks should point outward, with 3 to 10 on each axis
5. Reference lines, labels, notes
Use reference lines, labels, notes, keys, etc. only when necessary, and don’t let them obscure the data.

6. Align juxtaposed plots
Make sure that scales match and graphs are aligned, to improve understanding

Gestalt principles
• The brain creates a perception that is more than the sum of available visual inputs.
• Gestalt principles are used to identify which elements in our visualization are signal (the information we want to communicate) and which are noise (clutter).
• The six Gestalt principles are as follows:
1. Proximity
2. Similarity
3. Closure
4. Enclosure
5. Continuity
6. Connection
Law of Proximity
We perceive objects close to each other as belonging to a group.

Law of Similarity
We seek similarities and differences in objects and link similar
objects as belonging to a group.

Law of Closure
Our minds tend to see complete figures even if a picture is
incomplete.

Law of Enclosure
We perceive objects as belonging to a group when they are enclosed in
a way that creates a boundary or border around them.

Law of Continuity
Our tendency is to see shapes as continuous to the greatest
degree possible. The human eye follows lines, curves or a
sequence of shapes to create pathways.

Law of Connection
We perceive objects connected to each other as a single group as
opposed to objects that are not linked in the same manner.

Are you ready to create a dashboard ?

Dashboard

Elements of dashboard design
Chart type

Steps for a great dashboard design
• Content

• Know your target audience
o Who is the audience? What do they need?
o Relationship with the presenter: first time, or established trust?
o Face to face or connected remotely?

• Identify the content they will be looking for
o What is the audience expected to learn?
o Key takeaways and next steps

• Tools

Steps for a great dashboard design
Avoid excessive details

Chart Types
Line charts are great when it comes to displaying patterns of
change across a continuum.

Chart Types
• Choose bar charts if you want to compare items in the same
category.
• The objective is not just to compare but also show how much one is
better or worse than the rest.

Chart Types
Sparklines usually don’t have a scale which means that users
will not be able to notice individual values. They work well
when you have a lot of metrics and you want to show only the
trends.

Charts To Avoid
Avoid scatterplots. They lack precision and clarity as the
relationships between two quantitative measures don’t
change very frequently.

Charts To Avoid
Avoid Pie charts. They rank low in precision because users
find it difficult to accurately compare the sizes of the pie slices.

Charts To Avoid
Avoid bubble charts. They require too much mental effort from their
users even when it comes to reading simple information in a context.

Visualization and layout design
• Place the most important information at the top left of the dashboard.

• Reason: humans follow an F-shaped visual scanning path.

Visualization and layout design
• Avoid highly saturated colors; instead, choose a few colors and stick to them.

• Use the same color for the same item on all charts.

• Use color (or encircle the item) to highlight the target data.

Some tools that aid in data visualization include:

- Tableau
- PowerPoint
- QlikView
- Python visualization library (Matplotlib)
- Google Charts

and so on

Summarisation of Data

Measures of Central Tendency and Dispersion/Variation

Measures of Central Tendency
• A measure of central tendency provides a convenient way of describing a set of scores with a single number that describes the PERFORMANCE of the group.

• Also defined as a single value that is used to describe the “center” of the data.

• Three commonly used measures of central tendency:
1. Mean
2. Median
3. Mode

Ungrouped Distribution
Prepare a report showing the number of hours per week
students spend studying. From a random sample of 30
students, determine the number of hours each student
studied last week.
15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
17.8, 33.8, 23.2, 12.9, 27.1, 16.6.

Grouped Frequency Distribution
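[Figure: parts of a statistical table]
• Table No., Title, Head note
• Caption: column headings, including a Total column
• Stub heading and row headings, with row totals r1, r2
• Body of the table
• Total row: column totals c1, c2 and grand total n
• Footnote
• Source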
Grouped Frequency Distribution
Marks obtained   No. of persons
45   10
46   15
47   30
48   25
49   15
50   5
Total   100

Age (yrs)   No. of persons
<1   15
1-5   15
6-12   30
13-19   35
20-29   45
30-39   65
40-49   44
50-60   32
>60   19
Total   300
Ungrouped Distribution
Prepare a report showing the number of hours per week
students spend studying. From a random sample of 30
students, determine the number of hours each student
studied last week.
15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
17.8, 33.8, 23.2, 12.9, 27.1, 16.6.
Organize the data into a frequency distribution using
the class intervals (a) 7.5 – 12.5, 12.5 – 17.5, etc.
and (b) 10 – 15, 15 – 20, etc.
Grouped Frequency Distribution
(a)
Hours of studying   No. of persons (f)
07.5 – 12.5   1
12.5 – 17.5   12
17.5 – 22.5   10
22.5 – 27.5   5
27.5 – 32.5   1
32.5 – 37.5   1
Total   30

(b)
Hours of studying   No. of persons (f)
10 – 15   7
15 – 20   12
20 – 25   7
25 – 30   3
30 – 35   1
Total   30
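The tallying behind these two tables can be sketched in Python. This is a minimal illustration, not part of the original slides; the helper name `frequency_table` is hypothetical, and it assumes each class includes its lower limit and excludes its upper limit (so 15.0 falls in 15 – 20).

```python
# Weekly study hours of the 30 sampled students (values from the slide).
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
    13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
    18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
    17.8, 33.8, 23.2, 12.9, 27.1, 16.6,
]


def frequency_table(data, start, width, n_classes):
    """Count observations per class; lower limit inclusive, upper exclusive (assumed)."""
    counts = [0] * n_classes
    for x in data:
        for k in range(n_classes):
            lower = start + k * width
            if lower <= x < lower + width:
                counts[k] += 1
                break
    return counts


counts_a = frequency_table(hours, 7.5, 5, 6)  # classes 7.5-12.5, ..., 32.5-37.5
counts_b = frequency_table(hours, 10, 5, 5)   # classes 10-15, ..., 30-35
print(counts_a)
print(counts_b)
```

Changing the class boundaries changes the counts, which is why the two tables differ even though the raw data are identical.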

Grouped Frequency Distribution
Class midpoint: to find the midpoint of each interval, use the following formula:
Midpoint = (Upper limit + Lower limit) / 2

(a)
Hours of studying   Midpoint (x)   No. of persons (f)
07.5 – 12.5   (12.5 + 07.5)/2 = 10   1
12.5 – 17.5   (17.5 + 12.5)/2 = 15   12
17.5 – 22.5   (22.5 + 17.5)/2 = 20   10
22.5 – 27.5   (27.5 + 22.5)/2 = 25   5
27.5 – 32.5   (32.5 + 27.5)/2 = 30   1
32.5 – 37.5   (37.5 + 32.5)/2 = 35   1
Total      30

(b)
Hours of studying   Midpoint (x)   No. of persons (f)
10 – 15   (10 + 15)/2 = 12.5   7
15 – 20   (15 + 20)/2 = 17.5   12
20 – 25   (20 + 25)/2 = 22.5   7
25 – 30   (25 + 30)/2 = 27.5   3
30 – 35   (30 + 35)/2 = 32.5   1
Total      30
Grouped Frequency Distribution
Relative frequency distribution: shows the proportion of observations in each class

(a)
Hours of studying   No. of persons (f)   Relative frequency
07.5 – 12.5   1   1/30 = 0.03
12.5 – 17.5   12   12/30 = 0.40
17.5 – 22.5   10   10/30 = 0.33
22.5 – 27.5   5   5/30 = 0.17
27.5 – 32.5   1   1/30 = 0.03
32.5 – 37.5   1   1/30 = 0.03
Total   30   1

(b)
Hours of studying   No. of persons (f)   Relative frequency
10 – 15   7   7/30 = 0.23
15 – 20   12   12/30 = 0.40
20 – 25   7   7/30 = 0.23
25 – 30   3   3/30 = 0.10
30 – 35   1   1/30 = 0.03
Total   30   1

Note: the exact relative frequencies sum to 1; the values rounded to two decimals sum to 0.99.
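As a quick check of the relative frequencies, the following Python sketch divides each class frequency by the total. The function name `relative_frequencies` is illustrative only; the counts are those of the two grouped tables.

```python
# Class frequencies from the two grouped tables (a) and (b).
counts_a = [1, 12, 10, 5, 1, 1]
counts_b = [7, 12, 7, 3, 1]


def relative_frequencies(counts):
    """Return each class frequency as a proportion of the total, rounded to 2 decimals."""
    total = sum(counts)
    return [round(f / total, 2) for f in counts]


rel_a = relative_frequencies(counts_a)
rel_b = relative_frequencies(counts_b)
print(rel_a)
print(rel_b)

# The unrounded proportions always sum to 1 (up to floating-point error).
exact_total_a = sum(f / sum(counts_a) for f in counts_a)
print(exact_total_a)
```

Because each proportion is rounded separately, the rounded column need not add up to exactly 1.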
Ungrouped Distribution
• Also referred to as the “arithmetic average”
• The most commonly used measure of the center of data
• Computation of the sample mean for ungrouped data: the sum of the observations divided by the number of observations. If X denotes the variable and x1, x2, …, xn are the values of X, then

\[ \bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
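Applied to the 30 study-hour observations from the earlier slide, the sample mean formula can be sketched in Python as follows (variable names are illustrative):

```python
# Weekly study hours of the 30 sampled students (values from the earlier slide).
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8,
    13.5, 20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4,
    18.3, 29.8, 17.1, 18.9, 10.3, 26.1, 15.7, 14.0,
    17.8, 33.8, 23.2, 12.9, 27.1, 16.6,
]

n = len(hours)                 # number of observations
sample_mean = sum(hours) / n   # sum of observations divided by n
print(n, round(sample_mean, 2))
```

The raw-data mean (about 19.01 hours) differs slightly from a mean computed from grouped data, because grouping replaces every observation by its class midpoint.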
Ungrouped Distribution
• If the mean is to be calculated for population-based data, then it is given by

\[ \mu = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Ungrouped Distribution

Computation of mean

Population data:
\[ \mu = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Sample data:
\[ \bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Mean for ungrouped data

[Figure: data on a number line from 0 to 10; Mean = 5]

[Figure: number line extended to 14; Mean = 6]
Mean for ungrouped data

[Figure: number line extended to 24; Mean = 8]

[Figure: number line extended to 44; Mean = 12]
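The plotted data points are not recoverable from the slides, so the Python sketch below uses a purely hypothetical data set (3, 4, 5, 6 plus one extreme value) chosen only because it reproduces the quoted means of 5, 6, 8 and 12. It illustrates how a single extreme value pulls the mean.

```python
from statistics import mean

# Hypothetical data, NOT taken from the slides: four fixed values plus one
# value that is moved further and further out.
base = [3, 4, 5, 6]
extremes = [7, 12, 22, 42]

means = [mean(base + [e]) for e in extremes]
for e, m in zip(extremes, means):
    print(f"largest value = {e:>2}, mean = {m}")
```

Only one of the five values changes, yet the mean more than doubles.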

Grouped Frequency Distribution
• Computation of the mean for grouped frequency data: the sum of the products of each frequency (f) with the midpoint (x) of its class interval, divided by the total frequency. If x1, x2, …, xn are the midpoints of the class intervals and f1, f2, …, fn are their corresponding frequencies, then the mean is calculated by

\[ \bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{N} = \frac{\sum_{i=1}^{n} f_i x_i}{N}, \qquad N = \sum_{i=1}^{n} f_i \]

Mean for grouped data
Using Direct and Step-deviation methods

Hours of studying   fi   Midpoint (xi)   fixi   di = (xi − A)/h   fidi
07.5 – 12.5   1   10   10   -2   -2
12.5 – 17.5   12   15   180   -1   -12
17.5 – 22.5   10   A = 20   200   0   0
22.5 – 27.5   5   25   125   +1   5
27.5 – 32.5   1   30   30   +2   2
32.5 – 37.5   1   35   35   +3   3
Total   30      ∑fixi = 580      ∑fidi = -4

Direct method:
\[ \bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{N} = \frac{580}{30} = 19.33 \]

Step-deviation method (A = 20, h = 5):
\[ \bar{X} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} \times h = 20 + \frac{-4}{30} \times 5 = 19.33 \]
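Both methods can be checked with a short Python sketch. The variable names are illustrative; A = 20 and h = 5 are the assumed mean and class width used on the slide.

```python
# Class midpoints and frequencies from the table above.
midpoints = [10, 15, 20, 25, 30, 35]
freqs = [1, 12, 10, 5, 1, 1]
N = sum(freqs)

# Direct method: sum(f * x) / N
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
direct_mean = sum_fx / N

# Step-deviation method: A + (sum(f * d) / N) * h, with d = (x - A) / h
A, h = 20, 5
d = [(x - A) / h for x in midpoints]
sum_fd = sum(f * di for f, di in zip(freqs, d))
step_mean = A + (sum_fd / N) * h

print(sum_fx, sum_fd)
print(round(direct_mean, 2), round(step_mean, 2))
```

The step-deviation method only re-centres and re-scales the midpoints, so it must return the same mean as the direct method; its advantage is smaller numbers when working by hand.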

Mean for grouped data
Find the mean of the wages of the employees of a company, given as follows:

Wages of employees (Rs)   No. of persons
4001 – 4500   25
4501 – 5000   36
5001 – 5500   45
5501 – 6000   62
6001 – 6500   39
6501 – 7000   55
7001 – 7500   44
7501 – 8000   29
8001 – 8500   15
Total   350
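One way to work this exercise is sketched below in Python. It assumes the class midpoint is the average of the stated lower and upper limits (for example (4001 + 4500)/2 = 4250.5); the slides do not show a worked solution, so treat the result as a check rather than the official answer.

```python
# Wage classes (Rs) and number of persons, from the table above.
classes = [
    (4001, 4500), (4501, 5000), (5001, 5500), (5501, 6000), (6001, 6500),
    (6501, 7000), (7001, 7500), (7501, 8000), (8001, 8500),
]
freqs = [25, 36, 45, 62, 39, 55, 44, 29, 15]

# Assumed midpoint: average of the stated class limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

N = sum(freqs)
wage_mean = sum(f * x for f, x in zip(freqs, midpoints)) / N
print(N, wage_mean)
```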
Mean: Grouped Scores
Data of Children watching TV in Bangalore

Hours of TV watching (X)   No. of children (f)   fX   Percentage   Cumulative %
1   104   104   31.3   31.3
2   130   260   39.2   70.5
3   98   294   29.5   100.0
Total   332   658   100.0

Mean = ΣfX / Σf = 658 / 332 = 1.98 hours

Mean - Properties
• It measures stability. The mean is the most stable of the measures of central tendency because every score contributes to its value.
• It is rigidly defined and therefore suitable for further mathematical analysis.
• It is easily affected by extreme scores (outliers).
• The sum of the deviations of the scores from the mean is zero: Σ(X − Mean) = 0
• It can be applied to interval and ratio levels of measurement.
• It may not be an actual score in the distribution.
• It is very easy to compute.
Mean
When to use the mean:
• Sampling stability is desired.
• Other measures are to be computed, such as standard deviation, coefficient of variation and skewness.

The Standard Deviation
Sl. No.   X1   X2
1   2   3
2   8   10
3   5   5
4   3   3
5   7   7
6   8   3
7   5   5
8   2   6
9   5   3
Total   45   45
The Standard Deviation
Statistical measures Group 1 Group 2
Mean 5 5
Median 5 5
Mode 5 3

Observe the following data
[Figure: dot plots of Group A, Group B and Group C, each on an axis from 11 to 21]
Compute Mean for all three groups
Group A: Mean = 15.5
Group B: Mean = 15.5
Group C: Mean = 15.5
[Figure: dot plots of Group A, Group B and Group C, each on an axis from 11 to 21]
Observe the following data
Since the means of all three groups are the same, do you
need anything besides the mean to describe the data?
• Measures of dispersion or variation
o Standard deviation
o Variance
o Range

Standard Deviation

G1 (xi)   xi − x̄   (xi − x̄)²
12   -3.5   12.25
13   -2.5   6.25
15   -0.5   0.25
15   -0.5   0.25
15   -0.5   0.25
16   0.5   0.25
17   1.5   2.25
21   5.5   30.25
Total   0   52.0

\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{52.0}{7}} = 2.726 \]
Standard Deviation
G1   G2   G3
12   14   11
13   15   11
15   15   11
15   15   12
15   16   19
16   16   20
17   16   20
21   17   20

Calculate the SD of groups G2 and G3
Standard Deviation
Group A: Mean = 15.5, SD = 2.726
Group B: Mean = 15.5, SD = 0.926
Group C: Mean = 15.5, SD = 4.567
[Figure: dot plots of Group A, Group B and Group C, each on an axis from 11 to 21]
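These values can be reproduced with a short Python sketch. It assumes Groups A, B and C correspond to the columns G1, G2 and G3 listed earlier (the SD of 2.726 matches the G1 calculation), and it uses the sample formula with n − 1 in the denominator. The function name `sample_sd` is illustrative.

```python
from math import sqrt

# The three groups of eight scores from the G1 / G2 / G3 table.
groups = {
    "G1": [12, 13, 15, 15, 15, 16, 17, 21],
    "G2": [14, 15, 15, 15, 16, 16, 16, 17],
    "G3": [11, 11, 11, 12, 19, 20, 20, 20],
}


def sample_sd(values):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((x - m) ** 2 for x in values) / (n - 1))


results = {
    name: (sum(v) / len(v), round(sample_sd(v), 3)) for name, v in groups.items()
}
for name, (m, sd) in results.items():
    print(name, "mean =", m, "SD =", sd)
```

All three groups share the same mean, so only the standard deviation distinguishes the tightly clustered group from the widely spread one.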
The Standard Deviation
Statistical measures   Group 1   Group 2
Mean   5   5
Median   5   5
Mode   5   3
Range   2 to 8   3 to 10
Mean deviation   1.78   1.78
Variance   5.50   5.75
Standard deviation   2.35   2.40
Coefficient of variation (%)   46.90   47.96
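The summary measures for the X1 and X2 columns can be recomputed with Python's standard `statistics` module. This is a sketch under two assumptions: variance and standard deviation use the sample (n − 1) formulas, and the mean deviation is the average absolute deviation from the mean. The helper `describe` is hypothetical.

```python
import statistics as st

# X1 and X2 columns from the table of nine observations.
x1 = [2, 8, 5, 3, 7, 8, 5, 2, 5]
x2 = [3, 10, 5, 3, 7, 3, 5, 6, 3]


def describe(values):
    """Return the summary measures discussed on this slide as a dict."""
    m = st.mean(values)
    sd = st.stdev(values)  # sample SD (n - 1 denominator)
    return {
        "mean": m,
        "median": st.median(values),
        "mode": st.mode(values),
        "range": (min(values), max(values)),
        "mean_deviation": sum(abs(x - m) for x in values) / len(values),
        "variance": st.variance(values),  # sample variance
        "sd": sd,
        "cv_percent": 100 * sd / m,
    }


d1 = describe(x1)
d2 = describe(x2)
for key in d1:
    print(f"{key:15} {d1[key]!s:22} {d2[key]!s}")
```

In X2 the value 3 occurs four times, so the computed mode of that column is 3, while the mean and median of both columns are 5.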
Standard Deviation
The standard deviation is a measure of dispersion or
variation: a descriptive statistic that describes how
similar a set of scores are to each other.
The more similar the scores are to each other, the
lower the standard deviation will be; the less
similar the scores are to each other, the higher the
standard deviation will be.
In general, the more spread out a distribution is, the
larger the measure of dispersion will be.
Measure of Variability/ Dispersion
Which of the distributions of scores has the larger dispersion?
The upper distribution has more dispersion because the scores are more spread out.
That is, they are less similar to each other.
[Figure: two bar charts, x-axis from 1 to 10, y-axis from 0 to 125]

Measure of Variability/ Dispersion
Variability can be defined in several ways:
• A quantitative distance measure based on the differences between scores
• Describes the distance of the spread of scores or the distance of a score from the mean

Purposes of a measure of variability:
• Describe the variability in the distribution
• Measure how well an individual score represents the distribution

The Standard Deviation
New strategy:
Find the deviation of each score, i.e., (X − Mean)
Square the deviation of each score, i.e., (X − Mean)²
Sum the squared deviations, i.e., Σ(X − Mean)²
Average the squared deviations: Σ(X − Mean)²/n
• The mean squared deviation is known as the “variance”
• Variability is now measured in squared units

The Population Variance
• Population variance equals the mean (average) squared deviation (distance) of the scores from the population mean

• Variance is the average of squared deviations, so we identify population variance with a lowercase Greek letter sigma squared: σ²

• Standard deviation is the square root of the variance, so we identify it with a lowercase Greek letter sigma: σ
Ungrouped Distribution

Computation of Standard Deviation

Population data:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \]

Sample data:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
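The difference between the two denominators can be seen with Python's standard `statistics` module, where `pstdev` divides by n and `stdev` divides by n − 1. The sketch below applies both to the G1 scores used earlier; treating those eight scores as a whole population is an assumption made only for illustration.

```python
import statistics as st

# G1 scores from the earlier worked example (sum of squared deviations = 52.0).
g1 = [12, 13, 15, 15, 15, 16, 17, 21]

population_sd = st.pstdev(g1)  # sqrt(52.0 / 8), denominator n
sample_sd = st.stdev(g1)       # sqrt(52.0 / 7), denominator n - 1

print(round(population_sd, 3), round(sample_sd, 3))
```

The sample formula always gives the larger value, and the gap shrinks as n grows.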
Formula for grouped data
◊ Standard deviation for grouped data
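A commonly used form of the grouped-data formula, written to match the ungrouped sample and population formulas, is sketched below. It assumes x_i are the class midpoints, f_i the class frequencies, k the number of classes and N = Σ f_i; the notation on the original slide may differ.

```latex
% Sample standard deviation for grouped data (assumed notation)
\[ S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{X})^2}{N - 1}}, \qquad N = \sum_{i=1}^{k} f_i \]

% Population standard deviation for grouped data (assumed notation)
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{N}} \]
```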

The Standard Deviation
• The most common and most important measure of variability is the standard deviation
o A measure of the standard, or average, distance from the mean
o Describes whether the scores are clustered closely around the mean or are widely scattered

• Calculation differs for populations and samples

• Variance is a necessary companion concept to standard deviation but not the same concept
The Standard Deviation

Exercise: Find the deviations of all the data points from
the mean, and then find the ‘mean deviation’.
Interpretation
Important note:
◊ Mean and Standard Deviation
◊ Mean and Variance

Which pair should be used for comparison? Why?

Interpretation
Important note:
◊ Computing only the mean is not enough to
describe the data. The standard deviation is
also needed to give a complete description,
and the two are reported together because they
have the same unit of measurement.
Coefficient of Variation
Important note:
◊ If two variables are measured in different
units of measurement, which is the best
measure to compare their variability?
Ans: Coefficient of Variation

\[ CV = \frac{S}{\bar{X}} \times 100 \]
Coefficient of Variation

Variables   Mean   SD
Height (cm)   169.6   27.5
Weight (kg)   88.7   21.8

Coefficient of Variation

Variables   Mean   SD   CV (%)
Height (cm)   169.6   27.5   16.21
Weight (kg)   88.7   21.8   24.58
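The two CV values follow directly from CV = (SD / Mean) × 100, as this small Python sketch shows (the function name `coefficient_of_variation` is illustrative):

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation in percent: (SD / mean) * 100."""
    return sd / mean * 100


cv_height = round(coefficient_of_variation(27.5, 169.6), 2)  # height in cm
cv_weight = round(coefficient_of_variation(21.8, 88.7), 2)   # weight in kg
print(cv_height, cv_weight)
```

Although height has the larger standard deviation in absolute terms, weight is relatively more variable, which is what a unit-free measure such as the CV reveals.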

The Median
• The score that divides the distribution into two equal parts, so that half the cases are above it and half below it.

• The median is the middle score, or the average of the middle scores, in a distribution.
o Fifty percent (50%) of the scores lie below the median value and 50% lie above it.
o It is also known as the middle score or the 50th percentile.
The Median
Median of Ungrouped Data
• Arrange the scores (from lowest to highest or highest to lowest).
• If n is an odd number, take the middle score of the distribution; if n is an even number, take the average of the two middle scores.

Median
151, 168, 174
Median = 168

166, 169, 172, 185
Median = 170.5
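The odd/even rule behind these two examples can be sketched in Python as follows. The function `median` here is a hand-written illustration; the standard library's `statistics.median` applies the same rule.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the value in position (n + 1) / 2 (1-based)
        return ordered[mid]
    # n even: average of the values in positions n/2 and n/2 + 1 (1-based)
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([151, 168, 174]))
print(median([166, 169, 172, 185]))
```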
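Median

\[ \text{Median} = \begin{cases} \text{value in position } \dfrac{n+1}{2}, & \text{if } n \text{ is odd} \\[2ex] \text{average of the values in positions } \dfrac{n}{2} \text{ and } \dfrac{n}{2} + 1, & \text{if } n \text{ is even} \end{cases} \]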

Median in Grouped Data

Median = L + ((N/2 − Cf) / f) × h

Where:
• L = lower boundary of the class containing the (N/2)th score (the median class)
• Cf = cumulative frequency before the median class, when the scores are arranged from lowest to highest
• h = size of the class interval
• f = frequency of the median class

Median in Grouped Data
Steps to solve median for grouped data

1. Complete the table for cumulative frequency.
2. Compute N/2 so that you can identify the median class (MC).
3. Determine L, h, f and the cumulative frequency Cf.
4. Solve for the median using the given formula.

Median
Example: The scores of 40 students on a 60-item science test
are tabulated below. The highest score is 54 and the lowest
score is 10.

Median
Solution:
• N/2 = 40/2 = 20
• The class containing N/2 is (35 – 39); this is the median class (MC)
• Lower limit of MC = 35
• L = 34.5
• Cf (or Cfp) = 17
• f (or fm) = 9
• h = 5

• Median = L + ((N/2 − Cf) / f) × h
= 34.5 + ((20 − 17) / 9) × 5
= 34.5 + 15/9
= 36.17
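The same substitution can be sketched in Python. The function name `grouped_median` and its parameter names are illustrative and simply mirror the symbols L, N, Cf, f and h used in the solution.

```python
def grouped_median(L, N, Cf, f, h):
    """Median for grouped data: L + ((N/2 - Cf) / f) * h."""
    return L + ((N / 2 - Cf) / f) * h


# Values read off in the solution: median class 35-39, so L = 34.5.
median_score = grouped_median(L=34.5, N=40, Cf=17, f=9, h=5)
print(round(median_score, 2))
```

The formula interpolates inside the median class: 3 of its 9 scores are needed to reach the 20th score, so the median lies one third of the way through the class.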
Median - Properties
• It may not be an actual observation in the data set.
• It is not affected by extreme values because the median is a positional measure.
• It can be applied at the ordinal level.

When to use the median:
o The exact midpoint of the score distribution is desired.
o There are extreme scores in the distribution.
Analyze and interpret this data
Sl. No.   Intelligence Quotient   Hours of TV watching per week   Duration of hospital stay (days)
1         106                     7                               8
2         86                      2                               40
3         200                     27                              10
4         101                     70                              80
5         199                     8                               180
6         103                     9                               5
7         197                     20                              80
8         113                     12                              10
9         112                     6                               5
10        65                      17                              8
Shape of the Distribution
• Symmetrical : mean is about equal to median
• Normality: mean = median = mode

• Skewed: deviation from normality
• Negatively: mean < median < mode
• Positively: mean > median > mode
• Bimodal: has two distinct modes
• Multi-modal: has more than two distinct modes
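As a rough check of these relations, the mean and median of the three variables in the "Analyze and interpret this data" table can be compared in Python. This is a sketch only: the helper `skew_hint` is hypothetical, and comparing mean with median is a rule of thumb, not a formal test of skewness.

```python
from statistics import mean, median

data = {
    "Intelligence Quotient": [106, 86, 200, 101, 199, 103, 197, 113, 112, 65],
    "Hours of TV watching per week": [7, 2, 27, 70, 8, 9, 20, 12, 6, 17],
    "Duration of hospital stay (days)": [8, 40, 10, 80, 180, 5, 80, 10, 5, 8],
}


def skew_hint(values):
    """Rule of thumb: mean > median suggests positive skew, mean < median negative."""
    m, md = mean(values), median(values)
    if m > md:
        return "positive skew suggested"
    if m < md:
        return "negative skew suggested"
    return "roughly symmetrical"


for name, values in data.items():
    print(f"{name}: mean={mean(values):.1f}, median={median(values):.1f}, {skew_hint(values)}")
```

In all three columns the mean is pulled above the median by a few large values, which is why the median is the more suitable summary here.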
Quartiles
• Not a Measure of Central Tendency
• Split Ordered Data into 4 Quarters

25% 25% 25% 25%


Q1 Q2 Q3
Position of i-th quartile: position of Qi = i•(n + 1)/4

Data in ordered array: 11 12 13 16 16 17 18 21 22

Position of Q1 = 1•(9 + 1)/4 = 2.50, so Q1 lies midway between the 2nd and 3rd values: Q1 = (12 + 13)/2 = 12.5
Interquartile range (IQR)
• Measure of variation
• Also known as midspread: spread in the middle 50%
• Difference between third and first quartiles: Interquartile Range = Q3 – Q1
• Not affected by extreme values
Data in ordered array: 11 12 13 16 16 17 17 18 21
Position of Q1 = 1•(9 + 1)/4 = 2.50, Q1 = 12.5
Position of Q3 = 3•(9 + 1)/4 = 7.50, Q3 = 17.5
Interquartile Range = Q3 – Q1 = 17.5 – 12.5 = 5
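The quartile positions and the IQR for this ordered array can be reproduced with a short Python sketch. The function name `quartile` is hypothetical, and linear interpolation between neighbouring values is assumed for fractional positions, which matches the 12.5 and 17.5 obtained above.

```python
def quartile(ordered, i):
    """i-th quartile (i = 1, 2, 3) at 1-based position i * (n + 1) / 4."""
    n = len(ordered)
    pos = i * (n + 1) / 4
    whole = int(pos)       # whole part of the position
    frac = pos - whole     # fractional part, used for interpolation
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + frac * (above - below)


data = [11, 12, 13, 16, 16, 17, 17, 18, 21]  # already in ascending order
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3, q3 - q1)  # 12.5 17.5 5.0
```

Software packages use several different quartile conventions, so library results may differ slightly from this (n + 1) rule. Python's `statistics.quantiles(data, n=4)` with its default "exclusive" method should agree with it for this array.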
Box and Whisker plot
[Figure: box-and-whisker plots of Pain (VAS) for Female (N = 74) and Male (N = 27), marking the 2.5th centile, the 25th centile (Q1), the median (Q2, 50th centile), the 75th centile (Q3), the 97.5th centile and the inter-quartile range.]

Box and Whisker plot

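IQR = Q3 – Q1
[Figure: box from Q1 to Q3 with the median Q2 inside; whiskers run out to Min and Max; fences are drawn at Q1 – 1.5 IQR and Q3 + 1.5 IQR, and points beyond a fence are outliers.]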

Box-and-Whisker plot

IQR = Q3 – Q1
[Figure: the same box plot with inner fences at Q1 – 1.5 IQR and Q3 + 1.5 IQR and outer fences at Q1 – 3 IQR and Q3 + 3 IQR; points beyond an outer fence are major outliers.]
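The fence rules in the two box-plot diagrams can be written as a small Python sketch. The function name `classify` and the test values 27 and 40 are hypothetical; Q1 = 12.5 and Q3 = 17.5 are taken from the interquartile range example.

```python
def classify(x, q1, q3):
    """Label a value using the 1.5 * IQR (inner) and 3 * IQR (outer) fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "major outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "outlier"
    return "not an outlier"


q1, q3 = 12.5, 17.5   # inner fences at 5 and 25, outer fences at -2.5 and 32.5
for x in (21, 27, 40):  # 27 and 40 are made-up values for illustration
    print(x, classify(x, q1, q3))
```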
Percentiles
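A score below which a specific percentage of the distribution falls.

Finding percentiles in grouped data (L, Cf, f and h refer to the class containing the i-th percentile):

$$P_i = L + \frac{\frac{iN}{100} - Cf}{f} \times h, \quad i = 1, 2, 3, \ldots, 99$$

Finding percentiles in ungrouped data (position of the i-th percentile in the ordered data):

$$\text{Position of } P_i = \frac{i(n+1)}{100}, \quad i = 1, 2, 3, \ldots, 99$$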
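As a consistency check, the grouped-data percentile formula with i = 50 should reproduce the grouped median found earlier (36.17). The sketch below assumes the formula has the same form as the grouped median with N/2 replaced by iN/100; the function and parameter names are illustrative, and the class values are those of the median example.

```python
def grouped_percentile(i, lower_boundary, n_total, cum_freq_before, class_freq, class_width):
    """P_i = L + ((i * N / 100 - Cf) / f) * h for the class containing P_i."""
    return lower_boundary + (i * n_total / 100 - cum_freq_before) / class_freq * class_width


# The 50th percentile falls in the median class of the earlier example
p50 = grouped_percentile(50, lower_boundary=34.5, n_total=40,
                         cum_freq_before=17, class_freq=9, class_width=5)
print(round(p50, 2))  # 36.17
```

For any other i, the class containing the (iN/100)-th score has to be identified first, and L, Cf and f taken from that class.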
The Mode
• The category or score with the largest frequency (or
percentage) in the distribution.
• The mode can be calculated for variables with levels of
measurement that are: nominal, ordinal, or interval-ratio.
Example:
• Number of votes for candidates for Lok Sabha MP. The mode, in this case, gives you the “central” response of the voters: the most popular candidate.
• Candidate A – 11,769 votes
• Candidate B – 39,443 votes
• Candidate C – 78,331 votes
The Mode: “Candidate C”
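For a nominal variable like this one, finding the mode is just a matter of picking the category with the largest frequency. A minimal Python sketch using the vote counts above (the dictionary name `votes` is an arbitrary choice):

```python
votes = {
    "Candidate A": 11769,
    "Candidate B": 39443,
    "Candidate C": 78331,
}

# The mode is the category whose frequency is largest
mode = max(votes, key=votes.get)
print(mode)  # Candidate C
```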

Mode
Properties
• It can be used when the data are qualitative as well as quantitative.
• It may not be unique (uni-, bi-, tri-, poly-modal).
• It is not affected by extreme values (outliers).
• It may not exist or may be ill-defined; in that case the empirical relation Mode = 3 Median – 2 Mean can be used.
When to Use the Mode
o When the “typical” value is desired.
o When the data set is measured on a nominal scale.

The Ranges
• The distance covered by the scores in a distribution, from the smallest value to the highest value
• For continuous data, real limits are used
Range = Xmax – Xmin
Range = URL for Xmax – LRL for Xmin (URL = upper real limit, LRL = lower real limit)
• Based on two scores, not all the data: an imprecise, unreliable measure of variability
Example: For a set of scores: 7, 2, 7, 6, 5, 6, 2

Range = Highest score minus lowest score = 7 – 2 = 5
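Both versions of the range can be computed for the example scores in Python. The real-limit version below assumes the scores are measured to the nearest whole number, so each real limit lies half a unit beyond the observed value; that unit is an assumption, not something stated in the example.

```python
scores = [7, 2, 7, 6, 5, 6, 2]

# Simple range: highest score minus lowest score
simple_range = max(scores) - min(scores)
print(simple_range)  # 5

# Range with real limits, assuming a measurement unit of 1
unit = 1
upper_real_limit = max(scores) + unit / 2   # 7.5
lower_real_limit = min(scores) - unit / 2   # 1.5
real_limit_range = upper_real_limit - lower_real_limit
print(real_limit_range)  # 6.0
```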

Learning Check
a) If all the scores in a data set are the same, the standard deviation is equal to 1.00. True / False?

Select the correct option
b) The standard deviation measures …
(1) Sum of squared deviation scores
(2) Standard distance of a score from the mean
(3) Average deviation of a score from the mean
(4) Average squared distance of a score from the mean

Solution
a) False. If all the scores in a data set are the same, they are all equal to the mean, so every deviation from the mean is 0 and the standard deviation is equal to zero.
b) The standard deviation measures …
(1) Sum of squared deviation scores
(2) Standard distance of a score from the mean
(3) Average deviation of a score from the mean
(4) Average squared distance of a score from the mean
Answer: (2)
Learning Check
Select the correct option
a) A sample of four scores has SS = 24. What is the variance?
(1) The variance is 6
(2) The variance is 7
(3) The variance is 8
(4) The variance is 12

b) A sample systematically has less variability than a population. True / False?
c) The standard deviation is the distance from the mean to the farthest point on the distribution curve. True / False?
Solution
Select the correct option
a) A sample of four scores has SS = 24. What is the variance?
(1) The variance is 6
(2) The variance is 7
(3) The variance is 8
(4) The variance is 12
Answer: (3). For a sample, variance = SS/(n – 1) = 24/3 = 8.
b) True. Extreme scores affect variability, but are less likely to be included in a sample.
c) False. The standard deviation extends from the mean approximately halfway to the most extreme score.
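Two of the learning-check answers can be verified numerically with Python's `statistics` module. Both data sets below are hypothetical: `same` is any set of identical scores, and `scores` was chosen only because its sum of squared deviations (SS) is 24 with n = 4.

```python
from statistics import pstdev, pvariance, variance

# Identical scores: every deviation from the mean is 0, so the SD is 0
same = [5, 5, 5, 5]          # hypothetical data
sd_same = pstdev(same)

# A sample of four scores with SS = 24
scores = [2, 2, 4, 8]        # hypothetical data, mean = 4
m = sum(scores) / len(scores)
ss = sum((x - m) ** 2 for x in scores)   # 4 + 4 + 0 + 16 = 24.0

sample_var = ss / (len(scores) - 1)      # SS / (n - 1) = 8.0
population_var = ss / len(scores)        # SS / n = 6.0, the distractor in option (1)

print(sd_same, ss, sample_var, population_var)
print(variance(scores), pvariance(scores))  # library equivalents of the two variances
```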
Exercise 3
Compute quartiles using Python or R and construct a Box-and-
Whisker plot for the following data on wages to represent IQR

Exercise 1
The following data are the wages of 350 employees of an organisation.
Compute (a) Mean, (b) Median, (c) Mode and (d) Standard deviation.

Wages (Rs.)    No. of persons
4001-4500      25
4501-5000      36
5001-5500      45
5501-6000      62
6001-6500      39
6501-7000      55
7001-7500      44
7501-8000      29
8001-8500      15

Hint for Mode: Mode = L + ((f1 – f0) / (2f1 – f0 – f2)) × h, where L = Lower limit of modal class; f0 = Frequency of the class preceding the modal class; f1 = Frequency of the modal class; f2 = Frequency of the class succeeding the modal class; h = Width of class interval

Exercise 2
For each of the following three sets of data, calculate a suitable measure of central tendency and of dispersion. Represent the data graphically using an appropriate graphical method.

Sl. No.                            1    2    3    4    5    6    7    8    9    10
Intelligence Quotient              106  86   200  101  199  103  197  113  112  65
Hours of TV watching per week      7    2    27   70   8    9    20   12   6    17
Duration of hospital stay (days)   8    40   10   80   180  5    80   10   5    8
