
QUANTITATIVE TECHNIQUES FOR MANAGERIAL DECISION - 1 (QTMD1G21-1)

Dr. Pritha Guha
TEACHING AND GRADING

Pritha Guha (email: pritha@xlri.ac.in)
TEXTBOOK: A First Course in Probability by Sheldon Ross (9th edition), Pearson
GRADING: Quiz-1 (30%), Quiz-2 (30%), End-term (40%)
PROBABILITY
SUMMARISING DATA
WHAT IS DATA
Data: facts and figures collected, analysed and summarised for presentation and interpretation.

Data set: all the data collected in a particular study are referred to as the data set for the study.
DATA SOURCES
Existing Sources
• Data Repositories:
• Kaggle (https://www.kaggle.com/),
• UCI (https://archive.ics.uci.edu/ml/index.php)

Experimental and observational studies
• A beverage company investigates consumer reaction to a new bottle design for one of its popular soft drinks

Transactional data, data warehousing and big data
• Click data in websites
• Search data in Google
• Viewing history in Netflix
HOW TO DO A STATISTICAL ANALYSIS?

Find: find the right data.
Use: use the appropriate statistical tools.
Communicate: clearly translate the numerical information into written language.
TWO BRANCHES OF STATISTICS
Population: consists of all items of interest.

Sample: a subset of the population.

Descriptive Statistics: collecting, organising, and presenting the data.

Inferential Statistics: drawing conclusions about a population based on sample data from that population.

Why do we need a sample?
WHERE DOES PROBABILITY FIT IN?
Probability helps in understanding
• how inferential procedures are developed and used,
• how statistical conclusions can be translated into everyday language and interpreted,
• when and where pitfalls can occur in applying the methods.

Probability and statistics both deal with questions involving populations and samples but do so in an “inverse manner” to each other.

[Figure: diagram relating Probability and Inferential Statistics]
VARIABLE
A variable is the general characteristic being observed on an object of interest.

Types of variable:
• Qualitative: Categorical, Ordinal
• Quantitative: Discrete, Continuous

How do we know whether a variable is Categorical, Ordinal, Discrete or Continuous?

The statistical analysis that is appropriate depends on whether the data for the variable are qualitative or quantitative.
AN EXAMPLE
What are the variables here? What type of variables are those?

ID | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score
1 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74
2 | female | group C | some college | standard | completed | 69 | 90 | 88
3 | female | group B | master's degree | standard | none | 90 | 95 | 93
4 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44
5 | male | group C | some college | standard | none | 76 | 78 | 75
6 | female | group B | associate's degree | standard | none | 71 | 83 | 78
7 | female | group B | some college | standard | completed | 88 | 95 | 92
8 | male | group B | some college | free/reduced | none | 40 | 43 | 39
9 | male | group D | high school | free/reduced | completed | 64 | 64 | 67
10 | female | group B | high school | free/reduced | none | 38 | 60 | 50
11 | male | group C | associate's degree | standard | none | 58 | 54 | 52
12 | male | group D | associate's degree | standard | none | 40 | 52 | 43
13 | female | group B | high school | standard | none | 65 | 81 | 73
14 | male | group A | some college | standard | completed | 78 | 72 | 70
15 | female | group A | master's degree | standard | none | 50 | 53 | 58
16 | female | group C | some high school | standard | none | 69 | 75 | 78
17 | male | group C | high school | standard | none | 88 | 89 | 86
CHALLENGES WITH THE DATA SET?
How can we extract the most prominent features of the data?

Use numerical and graphical summaries
• Graphical summaries: frequency tables, histograms, pie charts, bar charts, scatterplots
• Numerical summaries: mean, median, quartiles, standard deviation
GRAPHICAL SUMMARIES: Frequency table, Histogram

FREQUENCY TABLE
Groups data into intervals called classes and records the number of observations that fall into each class.

How to construct?
• Decide the number of non-overlapping classes.
• The classes are exhaustive.
• Determine the width of each class: take equal width for classes.
• approx. class width = (Max - Min) / no. of classes
• Determine the class limits: each data point should be in exactly one class; no more, no less.

Relative Frequency: A relative frequency distribution identifies the proportion or fraction of values that fall into each class.

CLASS | FREQUENCY | RELATIVE FREQUENCY
10-20 | 1 | 0.001
20-30 | 7 | 0.007
30-40 | 19 | 0.019
40-50 | 70 | 0.070
50-60 | 178 | 0.178
60-70 | 238 | 0.238
70-80 | 252 | 0.252
80-90 | 173 | 0.173
90-100 | 62 | 0.062
Total | 1000 | 1.0
HISTOGRAM
\[ \text{Density of the class} = \frac{\text{Relative frequency of the class}}{\text{Class width of the class}} \]
• Along the x-axis, plot the classes.
• Along the y-axis, plot the density of the data in that class.
• The vertical scale on such a histogram is called a density scale.
• Interesting property of a density histogram:
• The area of each rectangle is the relative frequency of the corresponding class.
• As the sum of relative frequencies must be 1.0 (except for roundoff), the total area of all rectangles in a density histogram is 1.
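
The area property can be checked numerically. The following is a minimal sketch that takes the class boundaries and counts from the frequency table in these slides; the variable names are only illustrative.

```r
# Class boundaries and counts copied from the frequency table in the slides.
breaks <- seq(10, 100, by = 10)
freq   <- c(1, 7, 19, 70, 178, 238, 252, 173, 62)

rel_freq <- freq / sum(freq)   # relative frequency of each class
width    <- diff(breaks)       # class width (10 for every class here)
density  <- rel_freq / width   # rectangle height on the density scale

# Area of each rectangle = density * width = relative frequency,
# so the areas should add up to 1 (up to roundoff).
area <- density * width
print(data.frame(class = paste(head(breaks, -1), tail(breaks, -1), sep = "-"),
                 freq, rel_freq, density))
print(sum(area))
```

With equal class widths the density is just the relative frequency rescaled by a constant, so the shape of the histogram is the same on either scale.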
NUMERICAL SUMMARY: Measures of central tendency, Measures of dispersion

MEASURES OF CENTRAL TENDENCY
A measure of central tendency represents the centre or middle of the data
• Mean
• Median
• Quartiles
• Mode
MEAN
• The sample mean of observations x1, x2, x3, …, xn:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
• Simplest of all the central tendency measures
• Sensitive to outliers

a) Find mean of 90, 70, 80, 90, 120, 140, 110, 100, 130
b) Find mean of 90, 70, 80, 90, 120, 140, 110, 100, 130, 1200
MEDIAN
• The sample median is the middle value when the observations are ordered from smallest to largest.
• Suppose observations x1, x2, x3, …, xn are ordered from smallest to largest, then the sample median, \(\tilde{x}\), is
\[ \tilde{x} = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even} \end{cases} \]
• Resistant to extreme values, easy to describe
• Not as mathematically tractable as the mean; the data must be sorted to calculate it

a) Find median of 90, 70, 80, 90, 120, 140, 110, 100, 130
b) Find median of 90, 70, 80, 90, 120, 140, 110, 100, 130, 1200
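
As a rough illustration, the median can be computed straight from its definition. This sketch assumes the usual convention (the middle value for odd n, the average of the two middle values for even n); `sample_median` is a hypothetical helper written for this example, not a built-in R function.

```r
# Hypothetical helper: sample median computed directly from the definition.
sample_median <- function(obs) {
  s <- sort(obs)                     # order from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even n: average of the two middle values
  }
}

a <- c(90, 70, 80, 90, 120, 140, 110, 100, 130)
b <- c(a, 1200)

sample_median(a)   # 100
sample_median(b)   # 105
```

In practice the built-in `median()` follows the same convention and returns the same values for these two data sets.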
ROBUSTNESS IDEA
• Sample median is resistant to extreme observations
• It depends only on those data values in the middle, and is unaffected by the actual values of the outer observations in the ordered list
• This property of being unaffected by extreme values is known as “robustness”
• Sample mean is not robust
• Any significant change in the magnitude of one observation results in a corresponding change in the value of the mean
• Consequently, the sample mean is said to be sensitive to extreme observations
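
A small comparison may make the contrast concrete. The sketch below reuses the two data sets from the mean and median exercises; the variable names are illustrative only.

```r
clean   <- c(90, 70, 80, 90, 120, 140, 110, 100, 130)
extreme <- c(clean, 1200)   # same data plus one extreme observation

# Side-by-side view of how each measure reacts to the extra value.
summary_tab <- data.frame(
  data   = c("without 1200", "with 1200"),
  mean   = c(mean(clean), mean(extreme)),
  median = c(median(clean), median(extreme))
)
print(summary_tab)
```

The mean roughly doubles (from about 103.3 to 213) while the median only moves from 100 to 105.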
PERCENTILES/QUARTILES/QUANTILES
The 100pth percentile of a population/sample is a value such that at least 100p% of the values lie at or below it and at least 100(1-p)% of the values lie at or above it.

For the median (the second quartile, Q2), p = 0.5. For the first quartile (Q1), p = 0.25, and for the third quartile (Q3), p = 0.75.

Percentiles are also referred to as “quantiles”, with the 100pth percentile being the pth quantile; for example, the median is the 0.5th quantile.

For the following dataset:
127, 132, 138, 141, 146, 152, 154, 162, 171, 177, 192, 241,
compute the 0.5th, 0.25th, 0.75th quantiles!
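
The percentile definition can be turned into a direct check. This is a minimal sketch; `is_percentile` is a hypothetical helper that only tests whether a candidate value satisfies the two "at least" conditions, and it is not how R's `quantile()` works internally.

```r
# Hypothetical helper: TRUE if v meets the definition of the 100p-th percentile.
is_percentile <- function(obs, v, p) {
  at_or_below <- mean(obs <= v)   # proportion of values at or below v
  at_or_above <- mean(obs >= v)   # proportion of values at or above v
  at_or_below >= p && at_or_above >= 1 - p
}

d <- c(127, 132, 138, 141, 146, 152, 154, 162, 171, 177, 192, 241)

is_percentile(d, 153, 0.50)   # TRUE
is_percentile(d, 127, 0.50)   # FALSE
is_percentile(d, 138, 0.25)   # TRUE
```

With an even number of observations more than one value can satisfy the definition (here anything from 152 to 154 works for p = 0.5), which is why software such as `quantile()` applies an interpolation rule to report a single number.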
MODE
• The mode is another measure of central location
• The most frequently occurring value in a data set
• More useful to summarise qualitative data
• A data set can have no mode, one mode (unimodal), or many modes (multimodal)
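
Base R has no built-in function for the statistical mode, so a frequency table is one simple way to find it. The sketch below uses the race/ethnicity values of the 17 students in the example table; `sample_modes` is a hypothetical helper, and it returns the modes as character strings.

```r
# Hypothetical helper: returns every value tied for the highest frequency.
sample_modes <- function(values) {
  counts <- table(values)
  names(counts)[counts == max(counts)]
}

# race/ethnicity column of the 17 students in the example table
race <- c("group B", "group C", "group B", "group A", "group C", "group B",
          "group B", "group B", "group D", "group B", "group C", "group D",
          "group B", "group A", "group A", "group C", "group C")

table(race)
sample_modes(race)               # "group B"
sample_modes(c(1, 1, 2, 2, 3))   # two modes: "1" "2"
```

When every value occurs exactly once the helper returns all of them, which corresponds to the "no mode" case.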
SOME MEASURES OF VARIABILITY/SPREAD/DISPERSION
Besides knowing the central point of a data set, we would also like to describe the data’s spread, or how far from the centre the data tend to range.
Measures of dispersion gauge the variability of a data set:
• Range
• Interquartile Range (IQR)
• Variance
• Standard Deviation
RANGE
• Range = Maximum Value – Minimum Value
• Simplest measure of dispersion
• Focuses on the extreme values
• Very sensitive to the smallest and largest data values
INTERQUARTILE RANGE (IQR)
IQR = Q3 - Q1
• IQR measures the spread of the middle half of the data
• Any observation farther than 1.5 × IQR from the closest quartile is an outlier.
• The interquartile range is robust against outliers
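
The outlier rule can be applied by hand in a few lines. This sketch uses the 13-value data set that appears in the R part of these slides and relies on R's default `quantile()` rule for the quartiles; the fence names are illustrative.

```r
data1 <- c(127, 132, 138, 141, 144, 146, 152, 154, 162, 171, 177, 192, 241)

q1  <- unname(quantile(data1, 0.25))   # first quartile
q3  <- unname(quantile(data1, 0.75))   # third quartile
iqr <- q3 - q1                         # same value as IQR(data1)

# Anything beyond these fences is farther than 1.5 * IQR from the closest quartile.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- data1[data1 < lower_fence | data1 > upper_fence]
c(q1 = q1, q3 = q3, iqr = iqr, lower = lower_fence, upper = upper_fence)
outliers   # 241
```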
BOXPLOT
• Relies on the quartiles, IQR, and aforementioned outlier rule
• Shows several of a data set’s most prominent features, including centre, spread, the extent and nature of any departure from symmetry, and outliers.
• Steps for Constructing a Boxplot
1. Draw a measurement scale (horizontal or vertical).
2. Draw a rectangle adjacent to this axis beginning at Q1 and ending at Q3 (so rectangle length = IQR).
3. Place a line segment at the location of the median. (The position of the median symbol relative to the two edges conveys information about the skewness of the middle 50% of the data.)
4. Determine which data values, if any, are outliers. Mark each outlier individually.
5. Finally, draw “whiskers” out from either end of the rectangle to the smallest and largest observations that are not outliers.
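
The numbers needed for these steps can be obtained without drawing anything. The sketch below uses `boxplot.stats()` from R's grDevices package on the 13-value data set from the R part of these slides; note that it works with hinges, which can differ slightly from `quantile()` quartiles for other data sets.

```r
data1 <- c(127, 132, 138, 141, 144, 146, 152, 154, 162, 171, 177, 192, 241)

bp <- boxplot.stats(data1)   # default coef = 1.5 corresponds to the 1.5 * IQR rule

bp$stats   # lower whisker end, lower hinge (Q1), median, upper hinge (Q3), upper whisker end
bp$out     # observations that would be marked individually as outliers
```

For this data set the upper whisker stops at 192, the largest observation that is not an outlier, and 241 is marked separately.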
FIVE NUMBER SUMMARY
1. Smallest value
2. First quartile (Q1)
3. Second quartile/median (Q2)
4. Third quartile (Q3)
5. Largest value
This summary tells us where the data are centred and how spread out they are.
STANDARD DEVIATION AND VARIANCE
• The sample standard deviation of observations x1, x2, x3, …, xn, denoted by s, is given by
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
where \(\bar{x}\) is the sample mean
• The quantity s² is known as the sample variance
• The variance is a measure of variability that utilizes all the data
• The variance is useful in comparing the variability of two or more variables
• Standard deviation is measured in the same units as the data, making it more easily interpreted than the variance.
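
As a check on the definition, the variance and standard deviation can be computed step by step. This sketch assumes the usual sample formula with divisor n - 1 (the one R's `var()` and `sd()` use) and applies it to the 13-value data set from the R part of these slides.

```r
data1 <- c(127, 132, 138, 141, 144, 146, 152, 154, 162, 171, 177, 192, 241)

n     <- length(data1)
x_bar <- mean(data1)

s2 <- sum((data1 - x_bar)^2) / (n - 1)   # sample variance
s  <- sqrt(s2)                           # sample standard deviation

c(variance = s2, sd = s)
all.equal(s2, var(data1))   # agrees with the built-in var()
all.equal(s, sd(data1))     # agrees with the built-in sd()
```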
A LITTLE BIT OF R…

WHAT IS R?
R is a language and environment for statistical computing and graphics.
R is available as Free Software.
R provides a wide variety of statistical and graphical techniques and is highly extensible (active community of developers).
DOWNLOADING R AND RSTUDIO
R download: https://cran.rstudio.com/
RStudio download: https://www.rstudio.com/products/rstudio/download/#download
Download the free desktop version of RStudio.
After installing R and RStudio, launch RStudio from your computer's applications folder.
[Screenshot of the RStudio window with four panes: Code Editor; Workspace and History; Plots, Files and Packages; R-Console]
ENTERING DATA IN R
x = c(90, 70, 80, 90, 120, 140, 110, 100, 130)
x
[1] 90 70 80 90 120 140 110 100 130

y = c(90, 70, 80, 90, 120, 140, 110, 100, 130, 1200)
y
[1] 90 70 80 90 120 140 110 100 130 1200

How will you enter the data 127, 132, 138, 141, 144, 146, 152, 154, 162, 171, 177, 192, 241 in R?
COMPUTATIONS IN R
x = c(90, 70, 80, 90, 120, 140, 110, 100, 130)
mean(x)
[1] 103.3333

y = c(90, 70, 80, 90, 120, 140, 110, 100, 130, 1200)
mean(y)
[1] 213

sort(x)
[1] 70 80 90 90 100 110 120 130 140

median(x)
[1] 100
COMPUTATIONS IN R
data1 = c(127, 132, 138, 141, 144, 146, 152, 154, 162, 171, 177, 192, 241)

quantile(data1)
0% 25% 50% 75% 100%
127 141 152 171 241

quantile(x, p) gives the 100pth percentile, or the pth quantile.

quantile(data1, 0.9)
90%
189
COMPUTATIONS IN R
range(data1)
[1] 127 241

diff(range(data1))
[1] 114

IQR(data1)
[1] 30

var(data1)
[1] 939.0256

sd(data1)
[1] 30.64353
BOXPLOT IN R

boxplot(data1)
FIVE NUMBER SUMMARY IN R
summary(data1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
127.0 141.0 152.0 159.8 171.0 241.0

fivenum(data1)
[1] 127 141 152 171 241
READING A DATA FILE IN R
Read the StudentsPerformance.csv file
SPData = read.csv(file.choose(), header = TRUE)

See the file in R
View(SPData)
names(SPData)

How will R identify the values in the data set?
SPData$reading.score[1:5]
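
The column name `reading.score` comes from how `read.csv()` cleans up headers: with its default `check.names = TRUE`, characters such as spaces and slashes are replaced by dots. The sketch below shows this on a tiny temporary file; the file itself is made up for the example, with three rows copied from the example table.

```r
# Write a small CSV with headers in the same style as the example table.
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"gender","race/ethnicity","reading score"',
             '"female","group B",72',
             '"female","group C",90',
             '"male","group A",57'),
           csv_path)

demo <- read.csv(csv_path, header = TRUE)   # check.names = TRUE by default

names(demo)          # "gender" "race.ethnicity" "reading.score"
demo$reading.score   # 72 90 57
```

Passing a file path like this instead of `file.choose()` also makes a script repeatable, since no dialog box is needed.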
COMPUTATIONS IN R
What is the mean reading score?
mean(SPData$reading.score)
[1] 69.169

What is the median reading score?
median(SPData$reading.score)
[1] 70

What is the standard deviation of the reading score?
sd(SPData$reading.score)
[1] 14.60019

What is the IQR of the reading score?
IQR(SPData$reading.score)
[1] 20
What would the boxplot for the reading score look like? Are there any outliers?
boxplot(SPData$reading.score)

boxplot(SPData$reading.score, main = "Boxplot of reading score", ylab = "Reading score (out of total marks 100)", col = "gold")
break.SPData = c(10,20,30,40,50,60,70,80,90,100)

hist(SPData$reading.score, breaks=break.SPData, right=TRUE, freq=FALSE, ylab = "Density", xlab="Reading Score", main="Histogram of reading score", col = "darkred")
COMPUTATIONS IN R
Frequency Table
break.SPData = c(10,20,30,40,50,60,70,80,90,100)
SPData.cut = cut(SPData$reading.score, breaks = break.SPData, right = TRUE)
FreqTab.ReadingScore = table(SPData.cut)

Relative Frequency Table
PropFreqTab.ReadingScore = prop.table(FreqTab.ReadingScore)
HOME WORK
Toss a coin 1, 2, 3, 4, …, 20 times and note down the number of heads.
Compute the relative frequency.
Plot the relative frequency and send it to me!
Coin Tossed | Frequency of Heads | Relative Frequency
1 | 1 | 1
2 | 1 | 0.5
3 | 3 | 1
4 | 1 | 0.25
5 | 3 | 0.6
6 | 1 | 0.166666667
7 | 3 | 0.428571429
8 | 5 | 0.625
9 | 5 | 0.555555556
10 | 6 | 0.6
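
One possible way to compute and plot the relative frequencies is sketched below. It uses only the ten rows recorded in the table above; the dashed line at 0.5 is an added reference that assumes a fair coin.

```r
tosses <- 1:10                              # number of times the coin was tossed
heads  <- c(1, 1, 3, 1, 3, 1, 3, 5, 5, 6)   # recorded number of heads

rel_freq <- heads / tosses                  # relative frequency of heads
round(rel_freq, 3)

plot(tosses, rel_freq, type = "b", ylim = c(0, 1),
     xlab = "Number of tosses", ylab = "Relative frequency of heads",
     main = "Relative frequency of heads")
abline(h = 0.5, lty = 2)   # reference line, assuming a fair coin
```

Extending `tosses` and `heads` to all 20 experiments gives the full plot asked for in the homework.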
