Statistics For Decision Making in Python: Session 3, Lecture 4 V Shekhar Avasthy, 1 Feb, 2022
The idea behind this module is to help you see that every concept has multiple finer aspects that you must research
on your own. The scope of the course does not allow us to cover all of them, but you should build the habit of
exploring every concept and doing R&D on your own to become a GREAT Data Scientist! We shall not go into this level
of detail for most other concepts in your syllabus.
These finer aspects of Quartiles are NOT part of the graded course content. Box Plot basics, however, could be part
of the grading.
Quartile Considerations –
Quartile computation is all about dividing the number line with data values placed at various numbers!
1. CONSIDERATION 1: Since quartiles divide the sorted data into 4 parts, four scenarios emerge for a given N (or n):
• N/4 gives ZERO as remainder (i.e., N is perfectly divisible by 4), such as N=8
• N/4 gives 1 as remainder, such as N=9
• N/4 gives 2 as remainder, such as N=10
• N/4 gives 3 as remainder, such as N=11
3. CONSIDERATION 3: Where should you place the data: at the start point, in the middle, or at the end point of each interval? See the figures above.
Privileged and Confidential. All Rights Reserved © Facts n Data 2022 2
4. CONSIDERATION 4: How should you interpolate between data points? Nearest point? Midpoint? Linear
interpolation?
Figure 3
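The considerations above can be made concrete with a minimal sketch. It first lists the remainder of N/4 for N = 8 to 11, then shows how three possible rules turn a fractional position into a value. The helper `value_at_position`, the sample list, the position 2.75, and the tie-breaking choice for "nearest" are all assumptions chosen for illustration, not the course's prescribed method.

```python
import math

def value_at_position(sorted_data, pos, rule="linear"):
    """Return the value at a 1-based, possibly fractional, position.

    Hypothetical helper: 'rule' selects how to handle the fractional part.
    """
    lower = math.floor(pos)          # rank of the data point just below pos
    frac = pos - lower               # how far pos lies towards the next point
    lo = sorted_data[lower - 1]
    hi = sorted_data[min(lower, len(sorted_data) - 1)]
    if rule == "nearest":
        # Assumption: an exact tie (frac == 0.5) goes to the lower neighbour.
        return hi if frac > 0.5 else lo
    if rule == "midpoint":
        return lo if frac == 0 else (lo + hi) / 2
    return lo + frac * (hi - lo)     # linear rule

# Consideration 1: the four remainder scenarios
remainders = {n: n % 4 for n in (8, 9, 10, 11)}
print(remainders)  # {8: 0, 9: 1, 10: 2, 11: 3}

# Consideration 4: the same position gives different answers under each rule
data = [2, 3, 5, 8, 11, 12, 14, 17]   # small illustrative data set
for rule in ("nearest", "midpoint", "linear"):
    print(rule, value_at_position(data, 2.75, rule))
# nearest 5
# midpoint 4.0
# linear 4.5
```

The point of the sketch is only that the choice of rule changes the reported quartile, which is why different tools can disagree on the same data.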
Understanding by example
• Consider a data set {2,3,5,8,11,12,14,17}. For N=8 using the N-1 basis, the quartile positions are 0.25 × 7 = 1.75, 0.50 × 7 = 3.5, and 0.75 × 7 = 5.25.
• On the (N-1) scale, positions are measured from the first value, so position 1.75 is ACTUALLY the 2.75th value of the data, and so on. The calculated
positions are not the values of the quartiles. They indicate which value (by position) is to be used as the quartile.
• Thus, for the above data, our quartiles are the 2.75th, 4.5th, and 6.25th values. For the first quartile, 2.75
means the value 0.75 of the way from the 2nd to the 3rd value, i.e., 0.75 of the way from 3 to 5,
or 4.5. For the median, 4.5 means halfway between the 4th and 5th values, i.e., halfway between 8 and
11, or 9.5. For the third quartile, 6.25 means the value 0.25 of the way from the 6th to the 7th value, i.e.,
0.25 of the way from 12 to 14, or 12.5.
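The worked example can be checked with a short sketch. The function `quartile_n_minus_1` is a hypothetical helper written for this illustration; it computes the 1-based position p × (N-1) + 1 and interpolates linearly. As a cross-check, the standard library's `statistics.quantiles` with `method="inclusive"` uses the same (N-1) basis for this data.

```python
from statistics import quantiles

data = sorted([2, 3, 5, 8, 11, 12, 14, 17])
n = len(data)

def quartile_n_minus_1(sorted_data, p):
    """Quantile on the (N-1) basis with linear interpolation (illustrative helper)."""
    pos = p * (len(sorted_data) - 1) + 1   # 1-based, possibly fractional position
    k = int(pos)                           # rank of the value just below pos
    frac = pos - k
    if k >= len(sorted_data):              # pos points at the last value
        return sorted_data[-1]
    return sorted_data[k - 1] + frac * (sorted_data[k] - sorted_data[k - 1])

fractions = (0.25, 0.5, 0.75)
positions = [p * (n - 1) + 1 for p in fractions]
quartile_values = [quartile_n_minus_1(data, p) for p in fractions]

print(positions)        # [2.75, 4.5, 6.25]
print(quartile_values)  # [4.5, 9.5, 12.5]
print(quantiles(data, n=4, method="inclusive"))  # [4.5, 9.5, 12.5]
```

Switching to `method="exclusive"` in `statistics.quantiles` gives different first and third quartiles for the same data, which illustrates why the basis has to be stated.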
Figure: histogram of sample means for n=9, with the GREEN bar labelled 4,267 and the RED bar labelled 2,943
For sample data, take the case of n=9. The above histogram shows the number of times the sample mean was found to lie in a particular interval. E.g., the
RED bar shows that “when all possible samples of n=9 were drawn from this population, the sample mean fell between 72.8 and 73.8 in
2,943 cases”, whereas the GREEN bar shows that “when all possible samples of n=9 were drawn from this population, the sample mean fell
between 82.8 and 83.8 in 4,267 cases”. Since the height of every rectangular bar of the histogram indicates the number of cases and its width
indicates the interval, every bar represents an area of height × width, OR area under the curve!
Incidentally, this also indicates that out of a total of nCr (20C9 = 1,25,970) combinations, 2,943 combinations belong to the RED bar. In other words, the
area of the RED bar as a proportion of the total area (2,943/1,25,970) represents the probability of the sample mean lying in that interval. That’s why this is called a
“Probability Distribution Curve”.
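The idea of "count of samples in a bar divided by total combinations" can be sketched in a few lines. The population below is hypothetical (the lecture's population is not reproduced in the text), and a smaller case of 10 values with n = 4 is assumed so that every combination can be listed quickly. The interval 72 to 75 is likewise an arbitrary choice. The last lines reuse the totals quoted above; note that `math.comb(20, 9)` evaluates to 167960, while 125970 equals `math.comb(20, 8)`, so the quoted total should be checked against the original figure.

```python
from itertools import combinations
from math import comb
from statistics import mean

# Hypothetical population; NOT the data behind the lecture's histogram.
population = [62, 65, 68, 70, 73, 75, 78, 80, 84, 90]
n = 4  # assumed sample size for this small illustration

# Mean of every possible sample of size n (drawn without replacement)
sample_means = [mean(s) for s in combinations(population, n)]
total = len(sample_means)
print(total == comb(len(population), n))  # True: 10C4 = 210 samples

# One histogram "bar": samples whose mean falls in [low, high)
low, high = 72.0, 75.0
in_bar = sum(1 for m in sample_means if low <= m < high)
probability = in_bar / total
print(in_bar, total, round(probability, 4))

# Numbers quoted in the text
print(comb(20, 9), comb(20, 8))   # 167960 125970
print(round(2943 / 125970, 4))    # 0.0234
```

Because each bar's share of the total count is a probability, the shares of all bars add up to 1.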
Figure: histogram of sample means with several bars highlighted in RED (bar labels 2,943 and 318)
The RED bars above together indicate the probability that the sample mean lies in the range they cover, i.e., from the bar
with 318 cases to the bar with 2,943 cases. Similarly, a probability can be computed for the mean being less than a given
value, greater than a given value, or between two given values.
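A minimal sketch of these three kinds of probability follows. It again assumes a hypothetical population of 10 values and samples of size 4; the helper `prob` and the cut-off values 70 and 80 are illustrative choices, not figures from the lecture.

```python
from itertools import combinations

# Hypothetical population and sample size, chosen only for illustration.
population = [62, 65, 68, 70, 73, 75, 78, 80, 84, 90]
n = 4

# Sampling distribution: the mean of every possible sample of size n
sample_means = [sum(s) / n for s in combinations(population, n)]

def prob(condition):
    """Share of all samples whose mean satisfies the given condition."""
    hits = sum(1 for m in sample_means if condition(m))
    return hits / len(sample_means)

p_less = prob(lambda m: m < 70)            # P(mean < 70)
p_greater = prob(lambda m: m > 80)         # P(mean > 80)
p_between = prob(lambda m: 70 <= m <= 80)  # P(70 <= mean <= 80)

print(round(p_less, 4), round(p_between, 4), round(p_greater, 4))
# The three events cover every sample exactly once, so they sum to 1
print(abs(p_less + p_between + p_greater - 1) < 1e-12)  # True
```

In histogram terms, each probability is the combined area of the bars that satisfy the condition, divided by the total area.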
Thank You!