
Vishal Mishra (IBS, Hyderabad)

BASICS – Descriptive Analytics

• Statistics: Data Collection, Organization, Analysis, Interpretation and Presentation

• Descriptive Statistics

• Inferential Statistics
BASICS
Descriptive Statistics

Graphical, Tabular, Numeric representation of Data

- Qualitative

- Quantitative

BASICS
Graphical, Tabular, Numeric representation of Data

Qualitative Data (Graphical and Tabular)

Bar Graph, Pie Chart

Frequency Distribution
Relative Frequency Distribution
Cross-tabulation

BASICS
Graphical, Tabular, Numeric representation of Data

Quantitative Data

Dot Plot, Histogram, Ogive, Scatter Diagram

Frequency Distribution
Relative Frequency Distribution
Cumulative Frequency Distribution
Stem and Leaf Display
Cross-tabulation

Measures of location, Variability



BASICS
Quantitative Data: Numeric Representation

Measures of location: Mean, Median, Mode

Measures of Variability:
Range
Inter-quartile Range
Standard Deviation
Variance
Coefficient of Variation

BASICS
Qualitative Data: Example

S. No.  Colour    S. No.  Colour
1       Blue      16      Blue
2       Red       17      Blue
3       Blue      18      Green
4       Green     19      Red
5       Red       20      Blue
6       Blue      21      Red
7       Red       22      Red
8       Red       23      Green
9       Red       24      Green
10      Blue      25      Red
11      Red       26      Red
12      Red       27      Blue
13      Green     28      Blue
14      Green     29      Red
15      Blue      30      Red

BASICS
Qualitative Data: Bar Graph

[Figure: bar graph of Colour Preference, with one bar each for Blue, Red and Green and frequency on the vertical axis]



BASICS
Qualitative Data: Pie Chart

Total angle in a pie : 360 degrees

Angle covered by Blue Colour preference: (10/30) * 360 = 120 degrees

Angle covered by Red Colour preference: (14/30) * 360 = 168 degrees

Angle covered by Green Colour preference: (6/30) * 360 = 72 degrees
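
The angle calculation above can be sketched in a few lines of Python. This is only an illustration, assuming the frequencies 10, 14 and 6 from the colour data; the names `frequencies` and `angles` are illustrative and not part of the slides.

```python
# Frequencies of each colour preference (from the example data).
frequencies = {"Blue": 10, "Red": 14, "Green": 6}
total = sum(frequencies.values())

# Each slice takes a share of the 360 degrees proportional to its frequency.
angles = {colour: count / total * 360 for colour, count in frequencies.items()}

for colour, angle in angles.items():
    print(f"{colour}: {angle:.0f} degrees")
```

The three angles add up to 360 degrees, which is a quick check that no category was missed.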



BASICS
Qualitative Data: Pie Chart

[Figure: pie chart of Colour Preference]

BASICS
Qualitative Data: Frequency Distribution

Colour Frequency

Blue 10

Red 14

Green 6

BASICS
Qualitative Data: Relative Frequency Distribution

Relative Frequency: Frequency/Total

Colour Frequency Relative Frequency

Blue 10 0.3333

Red 14 0.4667

Green 6 0.2

Total 30 1
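
As a rough check of the table above, the frequency and relative frequency distributions can be computed from the raw colour data. The sketch below uses Python's `collections.Counter`; the variable names are illustrative assumptions.

```python
from collections import Counter

# Colour preferences of the 30 respondents, in S. No. order.
colours = (
    "Blue Red Blue Green Red Blue Red Red Red Blue Red Red Green Green Blue "
    "Blue Blue Green Red Blue Red Red Green Green Red Red Blue Blue Red Red"
).split()

frequency = Counter(colours)
total = sum(frequency.values())

# Relative frequency = frequency / total.
relative_frequency = {c: n / total for c, n in frequency.items()}

for colour in ("Blue", "Red", "Green"):
    print(f"{colour:<6}{frequency[colour]:>4}{relative_frequency[colour]:>8.4f}")
```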

BASICS
Qualitative Data: Example

S. No.  JOB TYPE      SATISFACTION    S. No.  JOB TYPE      SATISFACTION
1       S/W Engineer  Y               16      S/W Engineer  N
2       Carpenter     Y               17      S/W Engineer  N
3       S/W Engineer  N               18      Bank Cashier  N
4       Bank Cashier  Y               19      Carpenter     N
5       Carpenter     Y               20      S/W Engineer  Y
6       S/W Engineer  N               21      Carpenter     N
7       Carpenter     Y               22      Carpenter     Y
8       Carpenter     Y               23      Bank Cashier  N
9       Carpenter     Y               24      Bank Cashier  N
10      S/W Engineer  N               25      Carpenter     Y
11      Carpenter     N               26      Carpenter     Y
12      Carpenter     N               27      S/W Engineer  Y
13      Bank Cashier  Y               28      S/W Engineer  Y
14      Bank Cashier  Y               29      Carpenter     Y
15      S/W Engineer  N               30      Carpenter     Y

BASICS
Qualitative Data: Cross-Tabulation

                 SATISFACTION
JOB              NO     YES
S/W Engineer     6      4
Bank Cashier     3      3
Carpenter        4      10
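
A cross-tabulation like this one can be built by counting (job type, satisfaction) pairs. The sketch below is one possible way to do it in plain Python; the single-letter job codes are an assumption made here to keep the data compact.

```python
from collections import Counter

# Job type and satisfaction for the 30 respondents, in S. No. order.
# Codes (illustrative): S = S/W Engineer, C = Carpenter, B = Bank Cashier.
jobs = (
    "SCSBCSCCCSCCBBS"   # S. No. 1 to 15
    "SSBCSCCBBCCSSCC"   # S. No. 16 to 30
)
satisfied = (
    "YYNYYNYYYNNNYYN"   # S. No. 1 to 15
    "NNNNYNYNNYYYYYY"   # S. No. 16 to 30
)

names = {"S": "S/W Engineer", "B": "Bank Cashier", "C": "Carpenter"}

# Count how often each (job, satisfaction) combination occurs.
crosstab = Counter(zip(jobs, satisfied))

print(f"{'JOB':<14}{'NO':>4}{'YES':>5}")
for code, name in names.items():
    print(f"{name:<14}{crosstab[(code, 'N')]:>4}{crosstab[(code, 'Y')]:>5}")
```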

BASICS
Quantitative Data: Example

S. No.  Marks    S. No.  Marks
1       19       16      47
2       33       17      10
3       22       18      35
4       32       19      12
5       33       20      15
6       34       21      27
7       38       22      19
8       27       23      45
9       27       24      32
10      26       25      14
11      34       26      40
12      35       27      31
13      25       28      17
14      44       29      20
15      26       30      36
BASICS

Quantitative Data: Frequency Distribution

Number of classes ?

Based on judgement, given the number of observations


e.g. for 30 observations we can decide on 5 classes

Class Width = (Max. Value – Min. Value) / (no. of classes)

(47-10)/5 = 7.4, rounded up to 8, the next larger integer
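
The class-width rule can be expressed directly in code. This is a minimal sketch, assuming the minimum (10) and maximum (47) of the marks data and the choice of 5 classes; the variable names are illustrative.

```python
import math

marks_min, marks_max = 10, 47  # smallest and largest marks in the data
num_classes = 5                # chosen by judgement

# Divide the range by the number of classes, then round up to a whole number.
class_width = math.ceil((marks_max - marks_min) / num_classes)
print(class_width)  # 8
```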



BASICS
Quantitative Data: Frequency Distribution

Notation: [10 – 18), i.e. 10 is included and 18 is excluded

Class Intervals   Frequency
10 to 18          5
18 to 26          5
26 to 34          10
34 to 42          7
42 to 50          3

BASICS
Quantitative Data: Frequency Distribution

Notation: [10 – 17], i.e. both 10 and 17 are included

Class Intervals   Frequency
10 to 17          5
18 to 25          5
26 to 33          10
34 to 41          7
42 to 49          3

BASICS
Quantitative Data: Relative Frequency Distribution

Class Intervals   Frequency    Relative Frequency   Cumulative Relative Frequency
10 to 18          5            0.1667               (<18): 0.1667
18 to 26          5            0.1667               (<26): 0.3333
26 to 34          10           0.3333               (<34): 0.6667
34 to 42          7            0.2333               (<42): 0.9
42 to 50          3            0.1                  (<50): 1
                  Total = 30   Total = 1
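
The distribution above can be reproduced from the raw marks. The sketch below assumes the half-open classes [10, 18), [18, 26), ... with a class width of 8; the variable names are illustrative.

```python
from itertools import accumulate

marks = [19, 33, 22, 32, 33, 34, 38, 27, 27, 26, 34, 35, 25, 44, 26,
         47, 10, 35, 12, 15, 27, 19, 45, 32, 14, 40, 31, 17, 20, 36]

lower, width, num_classes = 10, 8, 5

# Count the marks falling in each half-open class [start, start + width).
frequencies = [0] * num_classes
for m in marks:
    frequencies[(m - lower) // width] += 1

relative = [f / len(marks) for f in frequencies]
cumulative_relative = list(accumulate(relative))

for k, (f, r, c) in enumerate(zip(frequencies, relative, cumulative_relative)):
    start = lower + k * width
    print(f"{start} to {start + width}: {f:>2}  {r:.4f}  {c:.4f}")
```

Replacing `relative` with `frequencies` in the `accumulate` call gives the cumulative frequencies instead.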

BASICS
Quantitative Data: Histogram

[Figure: histogram of Frequency by class interval (10 to 18, 18 to 26, 26 to 34, 34 to 42, 42 to 50)]

BASICS
Quantitative Data: Cumulative Frequency Distribution

Class Intervals Frequency Cumulative Frequency

10 to 18 5 5
18 to 26 5 10
26 to 34 10 20
34 to 42 7 27
42 to 50 3 30

BASICS
Quantitative Data: Ogive

[Figure: ogive of Cumulative Frequency against the upper class limits (less than 10, 18, 26, 34, 42, 50)]
BASICS

Quantitative Data: Frequency Distribution

What if we decide on 8 classes for this data set?

Class Width = (Max. Value – Min. Value) / (no. of classes)

(47-10)/8 = 4.625, rounded up to 5, the next larger integer



BASICS
Quantitative Data: Frequency Distribution

Class Intervals Frequency

10-15 3
15-20 4
20-25
25-30
30-35
35-40
40-45
45-50

Question: What happens if there are too many OR too few classes?

BASICS
Quantitative Data: Numeric Summary

Measures of location OR central tendency:


Mean, Median, Mode

Measures of Variability:
Range, IQR, Standard Deviation (S.D), Variance, Coefficient of variation

BASICS
Quantitative Data: Measures of Location OR Central Tendency

e.g., Two measures of central tendency for the dataset:


Mean: 28.5 (Symbol for population: µ; Symbol for sample: x̄)
Mode: 27

Quartiles/Percentiles:

Quartiles are those values that divide the data into 4 parts (Q1, Q2, Q3).
Percentiles are those values that divide the data into 100 parts

Pth percentile is a value such that at least P % of the values are <= this
value and at least (100-P) % of the values are >= this value

BASICS
Quantitative Data: Measures of Location OR Central Tendency

Q1: 25th percentile; Q2: 50th Percentile; Q3: 75th Percentile

Quartiles/Percentiles:

i = (p/100)*n

If i is an integer, then (with the data arranged in ascending order) the pth percentile is the mean of the values in the ith and (i+1)th positions. If i is not an integer, the pth percentile is the value in the position obtained by rounding i up to the next integer.

The median is the 50th percentile and is denoted by Q2.



BASICS
Quantitative Data: Ascending Order

S. No.  Marks    S. No.  Marks
1       10       16      31
2       12       17      32
3       14       18      32
4       15       19      33
5       17       20      33
6       19       21      34
7       19       22      34
8       20       23      35
9       22       24      35
10      25       25      36
11      26       26      38
12      26       27      40
13      27       28      44
14      27       29      45
15      27       30      47

BASICS
Quantitative Data: Measures of Location OR Central Tendency

For Q2, the 50th percentile or median, i = (50/100)*30 = 15

Now since i is an integer, Q2 is the mean of observations in the ith and (i+1) th
position (when the data is arranged in ascending order).

i.e., Q2 = (Value in 15th position + Value in 16th position)/2

i.e., Q2 = (27+31)/2

Thus Q2 or Median is 29

BASICS
Quantitative Data: Measures of Location OR Central Tendency

Example: Finding the 70th percentile,

i = (70/100)*30 = 21

Now since i is an integer, 70th percentile is the mean of observations in the ith
and (i+1) th position (when the data is arranged in ascending order).

i.e., 70th percentile = (Value in 21st position + Value in 22nd position)/2

= (34+34)/2 = 34

Question: Find the 25th percentile



BASICS
Quantitative Data: Measures of Variability

Range: Maximum Value – Minimum Value


=?

Inter Quartile Range (IQR) = Q3 – Q1

Where, Q1 is the first quartile (or 25th Percentile)


and Q3 is the third quartile (or 75th Percentile)

Here Q1 = ? and Q3 = ?
Thus IQR = ?

BASICS
Quantitative Data: Measures of Variability

Range: Maximum Value – Minimum Value


= 47 – 10 = 37

Inter Quartile Range: IQR = Q3 – Q1

Where, Q1 is the first quartile (or 25th Percentile)


and Q3 is the third quartile (or 75th Percentile)

Here Q1 (8th position) = 20 and Q3 (23rd position) = 35


Thus IQR = 35 – 20 = 15
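
The quartiles, range and IQR above can be checked with a short script. The function below is a sketch of the position rule i = (p/100)*n used in these slides (it is not the interpolation method used by most statistics libraries, which can give slightly different quartiles). The name `percentile` is an illustrative choice.

```python
import math

marks = sorted([19, 33, 22, 32, 33, 34, 38, 27, 27, 26, 34, 35, 25, 44, 26,
                47, 10, 35, 12, 15, 27, 19, 45, 32, 14, 40, 31, 17, 20, 36])

def percentile(sorted_values, p):
    """pth percentile using the position rule i = (p/100) * n, for 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_values)
    i = p * n / 100  # multiply first to limit floating-point error
    if i == int(i):
        # i is an integer: average the values in positions i and i + 1 (1-based).
        i = int(i)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    # Otherwise round i up and take the value in that position (1-based).
    return sorted_values[math.ceil(i) - 1]

q1, q2, q3 = (percentile(marks, p) for p in (25, 50, 75))
data_range = marks[-1] - marks[0]
iqr = q3 - q1

print(q1, q2, q3)        # 20 29.0 35
print(data_range, iqr)   # 37 15
```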

BASICS
Quantitative Data: Measures of Variability

Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ = ?

Population S.D = σ = ?

Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ = ?

Sample S.D = s = ?



BASICS
Quantitative Data: Measures of Variability
(Note: data arranged in ascending order)

Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

S. No.  Marks  Mean   (Marks - Mean)^2
1       10     28.5   342.25
2       12     28.5   272.25
3       14     28.5   210.25
4       15     28.5   182.25
5       17     28.5   132.25
6       19     28.5   90.25
7       19     28.5   90.25
8       20     28.5   72.25
9       22     28.5   42.25
10      25     28.5   12.25
11      26     28.5   6.25
12      26     28.5   6.25
…       …      …      …

BASICS
Quantitative Data: Measures of Variability

Example: Calculate variance of this data of 5 observations


Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

S. No.  Marks
1       10
2       12
3       14
4       15
5       17

BASICS
Quantitative Data: Measures of Variability

Example: Calculate variance of this data of 5 observations

Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

S. No.  Marks  Mean   (Marks - Mean)^2
1       10     13.6
2       12     13.6
3       14     13.6
4       15     13.6
5       17     13.6

BASICS
Quantitative Data: Measures of Variability
Example: Calculate variance of this data of 5 observations

Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

S. No.  Marks  Mean   (Marks - Mean)^2
1       10     13.6   12.96
2       12     13.6   2.56
3       14     13.6   0.16
4       15     13.6   1.96
5       17     13.6   11.56

Numerator = 29.2

If this data is treated as population then Pop. Var. = 5.84


If this data is treated as sample then Sample Var. = 7.3

BASICS
Quantitative Data: Measures of Variability: Data of marks of 30 students

Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ = 93.85
(Numerator, Sum of Squared Differences from Mean: $\sum (x_i - \mu)^2$ = 2815.5)

Population S.D, σ = 9.69

Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ = 97.09

Sample S.D, s = 9.85
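
These variance and standard deviation figures, along with those of the five-observation example, can be verified with Python's standard `statistics` module. In the sketch below, `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n - 1; the helper name `summarise` is an illustrative assumption.

```python
import statistics

five_marks = [10, 12, 14, 15, 17]
all_marks = [19, 33, 22, 32, 33, 34, 38, 27, 27, 26, 34, 35, 25, 44, 26,
             47, 10, 35, 12, 15, 27, 19, 45, 32, 14, 40, 31, 17, 20, 36]

def summarise(data):
    """Return population and sample variance and standard deviation."""
    return {
        "pop_var": statistics.pvariance(data),    # divides by N
        "pop_sd": statistics.pstdev(data),
        "sample_var": statistics.variance(data),  # divides by n - 1
        "sample_sd": statistics.stdev(data),
    }

five = summarise(five_marks)
thirty = summarise(all_marks)

print({k: round(v, 2) for k, v in five.items()})
print({k: round(v, 2) for k, v in thirty.items()})
```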



BASICS
Quantitative Data: Measures of Variability

Limitations of Variance, S.D

Two populations: Population 1, Population 2

S.D: 4 kg, 16 kg respectively



BASICS
Quantitative Data: Measures of Variability

Concept of Coefficient of variation: Limitations of Variance, S.D

Coefficient of variation, Population = (σ/µ) * 100

Coefficient of variation, Sample = (s/x̄) * 100



BASICS
Quantitative Data: Measures of Variability

Example: Samples from two populations: Human Beings, Blue Whales


Characteristic: Body Weight

Sample S.D: 4 kg, 16 kg respectively

Sample Mean: 65 kg, 100000 kg respectively

Sample Coefficient of variation = (s/x̄) * 100

Sample 1 C.V: (4/65)*100 = 6.15%


Sample 2 C.V: (16/100000)*100 = 0.016%
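
The comparison above can be sketched as a small function. The name `coefficient_of_variation` is illustrative; the inputs are the sample standard deviations and means given in the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_humans = coefficient_of_variation(4, 65)
cv_whales = coefficient_of_variation(16, 100_000)

print(f"Humans: {cv_humans:.2f}%")       # Humans: 6.15%
print(f"Blue whales: {cv_whales:.3f}%")  # Blue whales: 0.016%
```

Although the whales have the larger standard deviation in kilograms, their weights vary far less relative to the mean, which is what the coefficient of variation captures.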

BASICS
Relationship Between Variables

Qualitative Data: Cross-tabulation (Tabular)

Quantitative Data:

Scatter Plot (Graphical)

Cross-tabulation (Tabular)

Covariance & Correlation (Numeric)



BASICS
Relationship Between Variables
Quantitative Data:
Scatter Plot (Graphical)

S. No.  Hours Studied  Marks Scored
1       4              4
2       8              9
3       6              8
4       2              6
5       9              10
6       7              6
7       15             20
8       3              5
9       12             18
10      1              3

BASICS
Relationship Between Variables
Quantitative Data:
Scatter Plot (Graphical)

[Figure: scatter plot of Marks Scored (vertical axis) against Hours Studied (horizontal axis)]

BASICS
Relationship Between Variables

Quantitative Data:

Covariance: Degree of linear association

Sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$

BASICS
Relationship Between Variables

Quantitative Data: Covariance


S.No. Hours Studied (X) Marks Scored (Y) Xi - MeanX Yi - Mean Y (Xi-MeanX) * (Yi-MeanY)
1 4 4
2 8 9
3 6 8
4 2 6
5 9 10
6 7 6
7 15 20
8 3 5
9 12 18
10 1 3

Mean 6.7 8.9



BASICS
Relationship Between Variables

Quantitative Data: Covariance


S.No. Hours Studied (X) Marks Scored (Y) Xi - MeanX Yi - Mean Y (Xi-MeanX) * (Yi-MeanY)
1 4 4 -2.7 -4.9 13.23
2 8 9 1.3 0.1 0.13
3 6 8 -0.7 -0.9 0.63
4 2 6 -4.7 -2.9 13.63
5 9 10 2.3 1.1 2.53
6 7 6 0.3 -2.9 -0.87
7 15 20 8.3 11.1 92.13
8 3 5 -3.7 -3.9 14.43
9 12 18 5.3 9.1 48.23
10 1 3 -5.7 -5.9 33.63

Mean 6.7 8.9



BASICS
Relationship Between Variables

Quantitative Data: Covariance

Numerator: Sum of the products of differences from the means: 217.7

Sample Covariance: 217.7/9 = 24.19

Population Covariance = 217.7/10 = 21.77
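
The covariance figures can be reproduced from the hours and marks data. The following is a minimal sketch in plain Python; the variable names are illustrative.

```python
hours = [4, 8, 6, 2, 9, 7, 15, 3, 12, 1]
marks = [4, 9, 8, 6, 10, 6, 20, 5, 18, 3]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Sum of the products of deviations from the two means.
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))

sample_cov = numerator / (n - 1)   # divide by n - 1 for a sample
population_cov = numerator / n     # divide by N for a population

print(round(numerator, 1), round(sample_cov, 2), round(population_cov, 2))
```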



BASICS
Relationship Between Variables

Quantitative Data:
Drawback of Covariance
Correlation - Degree of linear association

Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$

Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$

BASICS
Relationship Between Variables

Quantitative Data:
Correlation - Degree of linear association
Population Variance of X = 18.01; Sample Variance of X = 20.01
Population Variance of Y = 29.89; Sample Variance of Y = 33.21

Sample Corr.: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
= 24.19 / (√20.01 × √33.21)
= 0.9384

Population Corr.: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
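
The correlation for the hours and marks data can be computed as below. This is a sketch with illustrative variable names. Note that the denominator uses standard deviations (the square roots of the variances), and that unrounded inputs give a value of about 0.938, differing slightly in the fourth decimal place from a result based on rounded intermediate figures.

```python
import math

hours = [4, 8, 6, 2, 9, 7, 15, 3, 12, 1]
marks = [4, 9, 8, 6, 10, 6, 20, 5, 18, 3]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
sum_xx = sum((x - mean_x) ** 2 for x in hours)
sum_yy = sum((y - mean_y) ** 2 for y in marks)

# Sample version: every divisor is n - 1.
r_xy = (sum_xy / (n - 1)) / (math.sqrt(sum_xx / (n - 1)) * math.sqrt(sum_yy / (n - 1)))

# Population version: every divisor is N. The divisors cancel, so the value is the same.
rho_xy = (sum_xy / n) / (math.sqrt(sum_xx / n) * math.sqrt(sum_yy / n))

print(round(r_xy, 3), round(rho_xy, 3))
```

Unlike covariance, the result has no units and always lies between -1 and +1.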

BASICS
Relationship Between Variables
Quantitative Data:
Cross-tabulation

S. No.  Marks Scored  Hours Spent on Social Media
1       9             8
2       10            8
3       11            7
4       11            6
5       17            6
6       19            7
7       19            2
8       20            5
9       22            5
10      25            4
11      26            5
12      26            6
13      27            4
14      28            2.5
15      29            4
16      31            2.5
17      32            2.5
18      32            2
19      33            2
20      33            2

BASICS
Relationship Between Variables

Quantitative Data:
Cross-tabulation

                 Hours on Social Media
Marks Scored     0 to 3   3 to 6   6 to 9
0 to 12          0        0        4
12 to 24         1        2        2
24 to 36         6        4        1
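
For quantitative data, each value is first assigned to a class and the pairs of classes are then counted. The sketch below assumes half-open classes (for example, 6 hours falls in "6 to 9"), which matches the counts in the table; the helper name `class_label` is illustrative.

```python
from collections import Counter

marks = [9, 10, 11, 11, 17, 19, 19, 20, 22, 25,
         26, 26, 27, 28, 29, 31, 32, 32, 33, 33]
hours = [8, 8, 7, 6, 6, 7, 2, 5, 5, 4,
         5, 6, 4, 2.5, 4, 2.5, 2.5, 2, 2, 2]

def class_label(value, width):
    """Label of the half-open class [start, start + width) containing value."""
    start = int(value // width) * width
    return f"{start} to {start + width}"

# Count how many students fall in each (marks class, hours class) cell.
crosstab = Counter(
    (class_label(m, 12), class_label(h, 3)) for m, h in zip(marks, hours)
)

row_labels = ["0 to 12", "12 to 24", "24 to 36"]
col_labels = ["0 to 3", "3 to 6", "6 to 9"]
for row in row_labels:
    print(row, [crosstab[(row, col)] for col in col_labels])
```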

BASICS
Detecting Outliers
(The extreme values that are either very low OR very high)

Graphical Method: Box Plot

Numeric Method: Z-Score



BASICS
Detecting Outliers

Graphical Method: Box Plot

Identifying (five number summary):

1. Maximum value
2. Minimum value
3. Q1
4. Q2
5. Q3

BASICS
Detecting Outliers

Graphical Method: Box Plot

Identifying (five number summary):

1. Maximum value
2. Minimum value
3. Q1
4. Q2
5. Q3

S. No.  Marks    S. No.  Marks
1       10       16      31
2       12       17      32
3       14       18      32
4       15       19      33
5       17       20      33
6       19       21      34
7       19       22      34
8       20       23      35
9       22       24      35
10      25       25      36
11      26       26      38
12      26       27      40
13      27       28      44
14      27       29      45
15      27       30      75

BASICS
Detecting Outliers

Graphical Method: Box Plot

Identifying (five number summary):

1. Maximum value = 75
2. Minimum value = 10
3. Q1 = 20
4. Q2 = 29
5. Q3 = 35

IQR = 35-20 = 15

S. No.  Marks    S. No.  Marks
1       10       16      31
2       12       17      32
3       14       18      32
4       15       19      33
5       17       20      33
6       19       21      34
7       19       22      34
8       20       23      35
9       22       24      35
10      25       25      36
11      26       26      38
12      26       27      40
13      27       28      44
14      27       29      45
15      27       30      75

BASICS
Detecting Outliers

Graphical Method: Box Plot

Outliers are the values that lie below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR

Lower limit: Q1 - 1.5 * IQR = 20 - 1.5 * 15 = -2.5

Upper limit: Q3 + 1.5 * IQR = 35 + 1.5 * 15 = 57.5

No value lies below the lower limit and only one value lies above the upper limit, so there is one outlier in this data set: the observation with a value of 75
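
The limit check can be sketched as follows, assuming the quartiles Q1 = 20 and Q3 = 35 from the five number summary; the variable names are illustrative.

```python
marks = [10, 12, 14, 15, 17, 19, 19, 20, 22, 25, 26, 26, 27, 27, 27,
         31, 32, 32, 33, 33, 34, 34, 35, 35, 36, 38, 40, 44, 45, 75]

q1, q3 = 20, 35   # quartiles from the five number summary
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Any value outside the two limits is flagged as an outlier.
outliers = [m for m in marks if m < lower_limit or m > upper_limit]

print(lower_limit, upper_limit, outliers)  # -2.5 57.5 [75]
```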

BASICS
Detecting Outliers
Box Plot

BASICS
Detecting Outliers (when data is symmetrically distributed)

Numeric Method: Z-Score

Also called the standardized score, it indicates the number of standard deviations a value is away from the mean.

For a sample, the Z-score for the ith observation (i.e. $x_i$) is calculated as

$z_i = \frac{x_i - \bar{x}}{s}$

where $\bar{x}$ is the sample mean and s is the sample standard deviation

BASICS
Detecting Outliers
Numeric Method: Z-Score

S.No.  Marks Scored (Xi)
1      4
2      9
3      8
4      6
5      10
6      6
7      20
8      5
9      18
10     3

BASICS
Detecting Outliers
Numeric Method: Z-Score

S.No.  Marks Scored (Xi)  Z-score (Zi)
1      4                  -0.850694
2      9                  0.0173611
3      8                  -0.15625
4      6                  -0.503472
5      10                 0.1909722
6      6                  -0.503472
7      20                 1.9270833
8      5                  -0.677083
9      18                 1.5798611
10     3                  -1.024306

Mean                 8.9
Standard Deviation   5.76

BASICS
Detecting Outliers

Numeric Method: Z-Score

Values less than mean have a negative Z-score

Values greater than mean have a positive Z-score

Mean of Z-scores is 0 and standard deviation is 1

Values with a Z-score below -3 or above +3 are termed outliers (extreme values)
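
The Z-score calculation and the outlier rule can be sketched with the `statistics` module. The code uses the unrounded sample standard deviation (about 5.763), so its Z-scores differ slightly, around the third decimal place, from values computed with a standard deviation rounded to 5.76. Variable names are illustrative.

```python
import statistics

marks = [4, 9, 8, 6, 10, 6, 20, 5, 18, 3]

mean = statistics.mean(marks)
s = statistics.stdev(marks)   # sample standard deviation (divides by n - 1)

# Number of standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in marks]

# Flag values more than 3 standard deviations from the mean.
outliers = [x for x, z in zip(marks, z_scores) if abs(z) > 3]

for x, z in zip(marks, z_scores):
    print(f"{x:>3}  {z:+.3f}")
print("Outliers:", outliers)
```

For this data set no Z-score exceeds 3 in absolute value, so no observation is flagged.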
