
Session 3, Lecture 4, BIMTECH, 01 Feb 2/2/2022

2022

Statistics for Decision Making in Python


Session 3, Lecture 4
Business Vertical – DA, Trimester III, Batch ‘21-’23

V Shekhar Avasthy, 1st Feb, 2022

The idea behind this module is to show you that behind every concept there are multiple finer aspects that you must
research on your own. It is not possible to cover all of them within the scope of the course, but you should inculcate
the habit of exploring every concept and doing R&D on your own to become a GREAT Data Scientist! We shall not go into
such detail for most other concepts in your syllabus.

Such fine aspects of quartiles are NOT part of the graded course content. Box plot basics, though, could be part of
the grading system.

Privileged and Confidential. All Rights Reserved © Facts n Data 2022 1

Quartile Considerations –
Quartile computation is all about dividing the number line with data values placed at various numbers!
1. CONSIDERATION 1: Since quartiles divide sorted data into 4 parts, for a given N (or n), four scenarios emerge:
• N/4 gives ZERO as remainder (i.e., N is perfectly divisible by 4), such as N=8
• N/4 gives 1 as remainder, such as N=9
• N/4 gives 2 as remainder, such as N=10
• N/4 gives 3 as remainder, such as N=11
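
The four scenarios above can be checked quickly in Python. This is only a minimal sketch: the helper name `split_into_quarters` is made up for illustration, and it simply reports how many whole data points fit in each quarter and how many are left over.

```python
def split_into_quarters(n):
    """Return (whole points per quarter, leftover points) for n data points."""
    return divmod(n, 4)

# The four example sizes from the list above.
for n in (8, 9, 10, 11):
    per_quarter, remainder = split_into_quarters(n)
    print(f"N={n}: {per_quarter} points per quarter, remainder {remainder}")
```

A non-zero remainder is what forces a decision about where a quartile falls between two data points.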

2. CONSIDERATION 2: What do you take as the start and end points?


• Dividing data into 4 parts is like placing objects (data points) on a number line (as above), but should you place the first data point at 0 or at 1?
• If placing first data point at 0, 8th data point shall be at 7 as shown
• Then, values of Q1, Q3 shall be as shown by RED / GREEN arrows
N-1 basis of number line
• Used by the QUARTILE.INC function in Excel

• If placing first data point at 0, 8th data point shall be at 7 as shown
• Then, values of Q1, Q3 shall be as shown by RED / GREEN arrows
N basis of number line

• If placing first data point at 0, 8th data point shall be at 7 as shown
• Then, values of Q1, Q3 shall be as shown by RED / GREEN arrows
N+1 basis of number line
• Used by SAS / Minitab / QUARTILE.EXC (in Excel)
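
To compare the three bases side by side, the sketch below computes where Q1, Q2 and Q3 fall on a number line of length N-1, N or N+1. The function name `quartile_positions` is illustrative. The results are positions on the number line only; which data value a position corresponds to still depends on where the first data point is placed and on how you move between neighbouring points.

```python
def quartile_positions(n, basis):
    """Positions of Q1, Q2, Q3 on a number line whose length is set by the basis."""
    length = {"N-1": n - 1, "N": n, "N+1": n + 1}[basis]
    # Each quartile sits at 1/4, 2/4 and 3/4 of the way along the line.
    return [length * k / 4 for k in (1, 2, 3)]

for basis in ("N-1", "N", "N+1"):
    print(basis, quartile_positions(8, basis))
# N-1 [1.75, 3.5, 5.25]
# N [2.0, 4.0, 6.0]
# N+1 [2.25, 4.5, 6.75]
```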

3. CONSIDERATION 3: Where should you place data? At the start point, in the middle, or at the end points? See above figures.

All rights reserved, Facts n Data, 2022 1



…Quartile Calculation Considerations

4. CONSIDERATION 4: How should you interpolate between data points? Nearest point? Mid point? Linear
interpolation?

• In first figure, should Q1 be 3.25? OR 3?

• In second figure, should it be 2.5? OR 2? OR 3?

[Figure 1, Figure 2, Figure 3: number-line diagrams]
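
The choice between nearest point, mid point and linear movement between neighbours can be sketched as follows. The function `value_at`, its rule names and the sample list are assumptions made for illustration, not part of any library. Positions are 0-based, so position 1.25 lies a quarter of the way from the 2nd to the 3rd value.

```python
import math

def value_at(sorted_data, pos, rule="linear"):
    """Value at a fractional 0-based position under a given rule (illustrative names)."""
    lower, upper = math.floor(pos), math.ceil(pos)
    a, b = sorted_data[lower], sorted_data[upper]
    frac = pos - lower                      # how far past the lower neighbour we are
    if rule == "nearest":
        return a if frac < 0.5 else b       # ties go to the upper neighbour (arbitrary choice)
    if rule == "midpoint":
        return (a + b) / 2
    if rule == "linear":
        return a + frac * (b - a)
    raise ValueError(f"unknown rule: {rule}")

values = [10, 20, 30, 40]                   # hypothetical sorted data
for rule in ("nearest", "midpoint", "linear"):
    print(rule, value_at(values, 1.25, rule))
# nearest 20
# midpoint 25.0
# linear 22.5
```

The same position gives three different answers, which is why different tools can report different quartiles for the same data.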


Understanding by example

• Consider a data set {2,3,5,8,11,12,14,17}. For N=8 using the N-1 basis, the quartiles sit at positions (N-1)/4 = 1.75, 2(N-1)/4 = 3.5 and 3(N-1)/4 = 5.25 on the number line.

• This shows that on the (N-1) scale, which starts at 0, the 1.75th position is ACTUALLY the 2.75th value of the data, and so on. The calculated
indices are not the values of the quartiles; they indicate which value (by position) is to be used as the quartile.

• Thus, for the above data, our quartiles are the 2.75th, 4.5th, and 6.25th values. For the first quartile, 2.75
means the value 0.75 of the way from the 2nd to the 3rd value, or 0.75 of the way from 3 to 5,
or 4.5. For the median, 4.5 means halfway between the 4th and 5th values, or halfway between 8 and
11, or 9.5. For the third quartile, 6.25 means the value 0.25 of the way from the 6th to the 7th value, or
0.25 of the way from 12 to 14, or 12.5.
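
The worked example above can be reproduced in a few lines of Python. This is a sketch under the stated N-1 basis with linear movement between neighbouring values; `quartile_n_minus_1` is an illustrative helper, not a library function. As a cross-check it also calls the standard library's `statistics.quantiles` with `method="inclusive"`, which gives the same three numbers for this data.

```python
import statistics

data = [2, 3, 5, 8, 11, 12, 14, 17]           # already sorted

def quartile_n_minus_1(sorted_data, k):
    """k-th quartile (k = 1, 2, 3) on the N-1 basis with linear interpolation."""
    pos = (len(sorted_data) - 1) * k / 4      # 0-based position, e.g. 1.75 for Q1
    lower = int(pos)                          # index of the value just below the position
    frac = pos - lower                        # fraction of the way to the next value
    if frac == 0:
        return float(sorted_data[lower])
    return sorted_data[lower] + frac * (sorted_data[lower + 1] - sorted_data[lower])

print([quartile_n_minus_1(data, k) for k in (1, 2, 3)])     # [4.5, 9.5, 12.5]
print(statistics.quantiles(data, n=4, method="inclusive"))  # [4.5, 9.5, 12.5]
```

Note that the code uses 0-based positions (1.75, 3.5, 5.25), which correspond to the 2.75th, 4.5th and 6.25th values when counting from 1.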

• MUST READ: https://peltiertech.com/quartiles/


So, what does “Area Under the Curve” indicate?

[Histogram of sample means: RED bar over 72.8 to 73.8 with height 2,943; GREEN bar over 82.8 to 83.8 with height 4,267]

For sample data, take the case of n=9. The above histogram shows the number of times the sample mean was found to have a particular value. E.g., the
RED bar shows that “when all possible samples of n=9 were drawn from this population, the mean of these samples fell between 72.8 and 73.8 in
2,943 cases”, whereas the GREEN bar shows that “when all possible samples of n=9 were drawn from this population, the mean of these samples
fell between 82.8 and 83.8 in 4,267 cases”. Since the height of every rectangular bar of the histogram indicates the number of cases and the width
indicates the interval, every bar represents an area of height*width, OR area under the curve!
Incidentally, this also indicates that out of a total of nCr (20C9 = 1,67,960) combinations, 2,943 combinations belong to the RED bar. In other words, the
area of the RED bar as a proportion of the total area (2,943/1,67,960) represents the probability of the sample mean lying in that interval. That’s why this is called
a “Probability Distribution Curve”.
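
The idea that each bar's share of the total area is a probability can be tried out by brute force. The sketch below assumes a made-up population of 20 values (it is NOT the data behind the histogram above), draws every possible sample of n=9, and builds the histogram of sample means with bars of width 1. All variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical population of 20 values; NOT the data behind the histogram above.
population = [52, 58, 61, 64, 66, 69, 71, 73, 75, 77,
              78, 80, 82, 84, 86, 88, 90, 93, 96, 99]
n = 9            # sample size used in the text
bin_width = 1.0  # width of every histogram bar

# Height of each bar: how many of the 20C9 samples have a mean in that interval.
heights = Counter()
for sample in combinations(population, n):
    bar_start = math.floor(sum(sample) / n / bin_width) * bin_width
    heights[bar_start] += 1

# Area of each bar = height * width; its share of the total area is a probability.
areas = {start: count * bin_width for start, count in heights.items()}
total_area = sum(areas.values())
probabilities = {start: area / total_area for start, area in sorted(areas.items())}

# Every sample is counted exactly once, and the shares add up to 1.
print(sum(heights.values()) == math.comb(len(population), n))
print(round(sum(probabilities.values()), 9))
```

Because every bar has the same width, the share of area reduces to the bar's count divided by the total number of samples.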

Area Under the Curve…

[Histogram of sample means: RED bars from the 51.8 to 52.8 bar (height 318) up to the 72.8 to 73.8 bar (height 2,943)]

The RED bars above indicate the probability that the mean of samples is between 51.8 and 73.8 (the bars whose heights
run from 318 to 2,943). Similarly, any probability can be computed for the mean being less than a given value, greater
than a given value, or between given values.
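
Adding up bars to get "less than", "greater than" or "between" probabilities can be sketched as below. The bar heights are invented for illustration (they are not the counts from the histogram above), each bar is keyed by its lower edge, and `prob_between` is a hypothetical helper.

```python
# Hypothetical bar heights keyed by each bar's lower edge (illustrative only).
bar_counts = {70: 10, 71: 20, 72: 30, 73: 25, 74: 15}
bin_width = 1
total = sum(bar_counts.values())

def prob_between(low, high):
    """Share of the total area taken by bars lying wholly inside [low, high]."""
    inside = sum(count for edge, count in bar_counts.items()
                 if edge >= low and edge + bin_width <= high)
    return inside / total

print(prob_between(71, 73))              # mean between 71 and 73 -> 0.5
print(prob_between(float("-inf"), 72))   # mean less than 72      -> 0.3
print(prob_between(73, float("inf")))    # mean greater than 73   -> 0.4
```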


Thank You!

Comments/ Clarifications: shekhar@factsNdata.com / +91-9810228402
