
26-09-2023

PART 1
BASICS OF STATISTICS

RANDOM VARIABLE
A variable which can take different values is called a random variable. It is generally denoted by capital letters: A, B, ..., X, Y, Z, ...
Values taken by X are either numbers (numeric/quantitative data) or non-numeric (qualitative data).
Examples of Numeric Data
E.g. X = time taken to deliver a particular order,
Y = time from receipt of an order to its delivery, as per shipping records,
Z = scores received by employees in a performance test,
A = time to failure of a motherboard in a test,
B = business per employee in a public sector bank,
X = return on assets of a private bank.
DATA_SCIENCE_2019_20

Examples of Non-Numeric Data
E.g. X = gender of an employee,
Y = graduate stream of a candidate,
Z = specialization offered to MMS students,
A = sector to which a particular industry belongs,
B = names of states in government data,
X = names of car models in the automobile industry,
... etc.
NOTE:
In a given data set there can be a combination of numeric and non-numeric data.
For example:

No.  CGPA  UG Qualification  Specialisation  Work Experience  Age (in years)
1    3.24  B.Com.            Finance         0                23
2    3.14  B.Sc.             HR              1                21
3    3.72  BAF               Finance         2                23
4    3.06  B.E.              Systems         4                21
5    3.14  BMS               HR              7                22
6    3.14  CA                Finance         2                23
7    3.06  B.A. Economics    Operations      0                22
8    3.17  B.Sc.(IT)         Systems         3                21
9    2.97  BCA               Systems         2                22
10   3.14  B.Com.            Finance         0                23
11   3.69  BMS               Marketing       3                24
12   3.85  B.E.              Operations      0                25
13   3.92  BCA               Systems         0                23

Name                     Wealth in Crores  Sector
Ratan Tata               125674.12         Large diversified
P. R. S. Oberoi          183739.00         Hospitality
Azim H. Premji           64855.27          Software
Mukesh Ambani            56414.35          Petrochemicals
Sunil Mittal & Family    35558.22          Telecom
Anil Ambani              34993.98          Large diversified
Tulsi R. Tanti & Family  26139.69          Wind energy
Anil Agarwal             18108.75          Metals
Shiv Nadar               16698.47          Software & hardware
Kumarmangalam Birla      16643.04          Large diversified
Rahul Bajaj              12455.99          Auto
Dilip S. Shanghvi        10584.49          Pharmaceuticals
Baba Kalyani             7857.83           Auto components

Types of Data
Two types of data:
(1) Ungrouped data is data given in the form of scattered values:
X              x1  x2  x3  x4  ...  xn
(2) Grouped data is data consisting of values or class intervals along with their frequencies:
(Here frequency = number of times a particular value repeats)
X              x1  x2  x3  x4  x5  x6  ...  xn
F (frequency)  f1  f2  f3  f4  f5  f6  ...  fn
OR
(Here frequency = number of observations/data points/values falling in a particular interval)
Class Intervals  0 – 10  10 – 20  20 – 30  30 – 40  40 – 50  50 – 60  60 – 70
F (Frequency)    23      12       5        67       52       34       19

DATA ARRAY
Let X be a random variable.
Given the data values of X as: x1, x2, x3, ..., xn.
The arrangement of these values in either ascending or descending order is called a "Data Array".
Example: Given the data
200 156 231 222 96 289 126 308
The data array is:
96 126 156 200 222 231 289 308

Frequency Distribution
To prepare a frequency distribution of n given values with k class intervals, proceed as follows:
Step I: Find the minimum (min) and maximum (max) value of the given data.
Step II: Find the width of each of the k class intervals using Width = w = (max – min)/k, or take a suitable uniform width slightly less than or greater than this value.
Step III: Choose a proper lower limit (everything depends on this) for the 1st class interval and formulate the k class intervals of width w so as to include all the data values.
Step IV: Use the tally mark method to find the data values lying in each class interval. The number of tally marks against each class interval gives the frequency of that class interval.
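Steps I to IV can be sketched in Python as follows. This is only an illustration: the function name `frequency_distribution`, the choice of k = 5, the lower limit 90 and the width 50 are assumptions made for the example, and the class intervals are taken to be of the exclusive type (lower limit included, upper limit not included).

```python
import math

def frequency_distribution(values, k, lower=None, width=None):
    """Return (classes, frequencies) for k exclusive class intervals [L, U)."""
    lo, hi = min(values), max(values)              # Step I: min and max
    if width is None:
        # Step II: w = (max - min) / k; rounding up to a whole number is an assumption.
        width = math.ceil((hi - lo) / k)
    if lower is None:
        lower = lo                                 # Step III: assumed default lower limit
    classes = [(lower + i * width, lower + (i + 1) * width) for i in range(k)]
    if lower > lo or hi >= classes[-1][1]:
        raise ValueError("class intervals do not include all data values")
    freq = [0] * k
    for v in values:                               # Step IV: one "tally mark" per value
        freq[int((v - lower) // width)] += 1
    return classes, freq

data = [200, 156, 231, 222, 96, 289, 126, 308]
data_array = sorted(data)                          # the data array in ascending order

# (308 - 96) / 5 = 42.4, so a convenient uniform width of 50 is chosen here.
classes, freq = frequency_distribution(data, k=5, lower=90, width=50)
for (low, high), f in zip(classes, freq):
    print(f"{low} - {high}: {f}")
```

Choosing a different lower limit or width changes the table, which is why Step III stresses the choice of the first lower limit.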

Types of class intervals
Each class interval has a lower limit (L) and an upper limit (U).
Two types of class intervals:
(1) Exclusive type:
Example: 0 – 10, 10 – 20, 20 – 30, 30 – 40, 40 – 50
Here, the lower limit (L) is included but the upper limit (U) is not included.
(2) Inclusive type:
Example: 0 – 5, 6 – 11, 12 – 17, 18 – 23, 24 – 29
Here, both the lower limit (L) and the upper limit (U) are included in the class interval.
Convention: We will use the exclusive type of class interval unless it is specifically mentioned or asked to use the inclusive type.

Cumulative Frequency
Given a frequency distribution (grouped data) with class intervals, the "cumulative frequency" of a particular class interval
= cumulative freq. of the previous class + freq. of that class
Consider an example:

Class intervals  FREQ  Cumulative Frequency
0 - 10           27    27
10 - 20          12    39
20 - 30          8     47
30 - 40          6     53
40 - 50          2     55
50 - 60          0     55
Total Frequency  55
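The running-total rule for cumulative frequency can be checked with a short Python sketch. The variable names are chosen for illustration; the frequencies are those of the example table.

```python
from itertools import accumulate

class_intervals = ["0 - 10", "10 - 20", "20 - 30", "30 - 40", "40 - 50", "50 - 60"]
freq = [27, 12, 8, 6, 2, 0]

# Each cumulative frequency = cumulative freq. of previous class + freq. of that class.
cum_freq = list(accumulate(freq))

for interval, f, cf in zip(class_intervals, freq, cum_freq):
    print(f"{interval:8}  {f:3}  {cf:3}")
```

The last cumulative frequency always equals the total frequency, which is a quick check on the table.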

Relative Frequency
Given a frequency distribution (grouped data) with class intervals, the "relative frequency" of a particular class interval
= Freq. of that class / Total Frequency

Class intervals  FREQ  Relative Frequency
0 - 10           27    27/55 = 0.490909091
10 - 20          12    12/55 = 0.218181818
20 - 30          8     8/55 = 0.145454545
30 - 40          6     6/55 = 0.109090909
40 - 50          2     2/55 = 0.036363636
50 - 60          0     0/55 = 0
Total            55    1

Finding Frequency Distribution
Using Excel
Given ungrouped data with n values:
Step 1: Find the min, max, width/number of class intervals. Formulate the class intervals to include all the given data values.
Step 2: Find the bin value of every class. Generally it is 1 unit point less than the upper limit of the class interval.
Step 3: Use the function FREQUENCY(range, bin value) to find the cumulative frequency of each class = number of data points <= bin value.
Range = column range in which the values are stored.
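The relative frequencies in the table can be reproduced with a few lines of Python. This is a minimal sketch with illustrative variable names; the frequencies are those of the example table.

```python
freq = [27, 12, 8, 6, 2, 0]
total = sum(freq)                       # total frequency

# Relative frequency = freq. of that class / total frequency.
rel_freq = [f / total for f in freq]

for f, rf in zip(freq, rel_freq):
    print(f"{f}/{total} = {rf:.9f}")
```

The relative frequencies add up to 1, as in the Total row of the table.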

Finding Frequency Distribution
Using Excel
Step 4: Find the frequency of each class using frequency = cumulative freq. of that class – cumulative freq. of the previous class, to get the frequency distribution table.
>freq_distribution.xlsx
>Case_3.1 CGPA.xls

Graphs and diagrams
• Histogram
Given a frequency distribution, find the class intervals i1, i2, i3, ..., in for which x1, x2, ..., xn are the midpoints.
The histogram is the graph of intervals i vs. Freq., where we draw rectangles of (height, width): (f1, i1), (f2, i2), ..., (fn, in).
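The bin-value approach used in the Excel steps (count the data points <= each bin value, then subtract successive cumulative counts) can be mirrored in Python. This is a sketch, not the Excel implementation: the function name `frequencies_from_bins` is hypothetical, and the data and bin values (1 less than the upper limits 140, 190, 240, 290, 340) are assumed for illustration with whole-number data.

```python
from bisect import bisect_right

def frequencies_from_bins(values, bins):
    """Return (cumulative counts, class frequencies) for ascending bin values."""
    ordered = sorted(values)
    # Cumulative frequency of a class = number of data points <= its bin value.
    cum = [bisect_right(ordered, b) for b in bins]
    # Frequency = cumulative freq. of that class - cumulative freq. of previous class.
    freq = [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]
    return cum, freq

values = [200, 156, 231, 222, 96, 289, 126, 308]
bins = [139, 189, 239, 289, 339]        # assumed: upper limit minus 1 unit

cum, freq = frequencies_from_bins(values, bins)
print(cum)
print(freq)
```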

Graphs and diagrams
Histogram
[Figure: histogram with X on the horizontal axis and Frequency on the vertical axis]

• Ogive
Given a frequency distribution
X         x1   x2   ...  xn
Freq (F)  f1   f2   ...  fn
CF        cf1  cf2  ...  cfn
It is a graph of X vs. cumulative frequencies.
Plot the points (x1, cf1), (x2, cf2), ..., (xn, cfn).
Join the points by straight lines / smooth curves.
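The quantities needed to draw a histogram and an ogive by hand can be computed as in the sketch below. It assumes equally spaced midpoints, and the function names and the sample midpoints and frequencies are illustrative only; no plotting library is used.

```python
from itertools import accumulate

def intervals_from_midpoints(midpoints):
    """Class intervals for which the given equally spaced values are midpoints."""
    h = midpoints[1] - midpoints[0]          # assumed common spacing = class width
    return [(x - h / 2, x + h / 2) for x in midpoints]

def ogive_points(x, freq):
    """Points (x_i, cf_i) to be plotted and joined for the ogive."""
    return list(zip(x, accumulate(freq)))

midpoints = [5, 15, 25, 35, 45, 55]
freq = [27, 12, 8, 6, 2, 0]

intervals = intervals_from_midpoints(midpoints)   # bases of the histogram rectangles
points = ogive_points(midpoints, freq)            # heights of the ogive at each x
print(intervals)
print(points)
```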

Ogive Curve
[Figure: ogive curve, cumulative frequency plotted against X]

Frequency Polygon
[Figure: frequency polygon over the class intervals 2.5 - 2.7, 2.7 - 2.9, 2.9 - 3.1, 3.1 - 3.3, 3.3 - 3.5, 3.5 - 3.7, 3.7 - 3.9]

Summarization of Data
Central average:
Mean, Weighted Mean, Geometric Mean
Median
Mode
Other descriptors: Quartiles, Deciles, Percentiles
Measure of Dispersion:
Standard Deviation, Coefficient of variation

For Ungrouped Data
For ungrouped data:
(1) Mean $\bar{X} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, Weighted Mean $= \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$
Geometric Mean $= \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$, Combined Mean $\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$
(2) Median: Arrange the data in ascending order.
If n is odd then median = middle data point = value at the $\left(\frac{n+1}{2}\right)$-th position.
If n is even then there will be two middle data points m1, m2, and median = (m1 + m2)/2.
(3) Mode: If the data points are all distinct then it is not possible to find the mode. If data points repeat then the most frequently appearing data point is the mode.
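The central averages for ungrouped data can be sketched in Python as below. The function names are illustrative, the geometric mean assumes all values are positive, and returning `None` for the mode when every value is distinct is a choice made for this sketch. The CGPA values are those of the 13-row table given earlier.

```python
import math
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def geometric_mean(xs):
    # n-th root of the product, computed via logarithms (positive values only).
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def median(xs):
    s = sorted(xs)                       # arrange in ascending order
    n, mid = len(s), len(s) // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def mode(xs):
    counts = Counter(xs)
    top = max(counts.values())
    if top == 1:
        return None                      # all data points distinct: no mode
    return [v for v, c in counts.items() if c == top]

data = [200, 156, 231, 222, 96, 289, 126, 308]
cgpa = [3.24, 3.14, 3.72, 3.06, 3.14, 3.14, 3.06, 3.17, 2.97, 3.14, 3.69, 3.85, 3.92]

print(mean(data), median(data), mode(data))
print(mode(cgpa))
```

Here n = 8 is even, so the median is the average of the two middle values 200 and 222.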

For Ungrouped Data
For ungrouped data:
(4) Standard Deviation (SD) $= S = \sqrt{\frac{\sum (x_i - \bar{X})^2}{n}} = \sqrt{\frac{\sum x_i^2}{n} - \bar{X}^2}$
(5) Variance $= V(X) = S^2 = \frac{\sum (x_i - \bar{X})^2}{n} = \frac{\sum x_i^2}{n} - \bar{X}^2$
(6) Coefficient of Variation (CV) $= \frac{\text{SD}}{\text{Mean}} = \frac{S}{\bar{X}}$
(7) Quartiles: $Q_1, Q_2, Q_3$ where
$Q_1$ = data point appearing at the $\left[\frac{n+1}{4}\right]$ position,
$Q_2$ = data point appearing at the $\left[\frac{2(n+1)}{4}\right]$ position,
$Q_3$ = data point appearing at the $\left[\frac{3(n+1)}{4}\right]$ position.
(8) Deciles: $D_1, D_2, \ldots, D_9$ where each
$D_i$ = data point appearing at the $\left[i\left(\frac{n+1}{10}\right)\right]$ position.
(9) Percentiles: $P_1, P_2, \ldots, P_{99}$ where each
$P_i$ = data point appearing at the $\left[i\left(\frac{n+1}{100}\right)\right]$ position.
(10) Range = Max – Min
(11) Inter Quartile Range $= Q_3 - Q_1$
Semi-Inter Quartile Range $= (Q_3 - Q_1)/2$
(12) Inter fractile Range between the $i$-th and $j$-th fractile $= D_j - D_i$ (or $= P_j - P_i$)
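The dispersion measures and position-based fractiles can be sketched in Python as follows. The function names are illustrative. The slides do not say what to do when the position i(n+1)/parts is not a whole number, so this sketch assumes linear interpolation between the two neighbouring data points and clamps the position to the range 1..n.

```python
import math

def variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n      # divisor n, as in formula (5)

def std_dev(xs):
    return math.sqrt(variance(xs))

def coeff_of_variation(xs):
    return std_dev(xs) / (sum(xs) / len(xs))

def fractile(xs, i, parts):
    """Value at position i*(n+1)/parts (1-based) in the sorted data."""
    s = sorted(xs)
    n = len(s)
    pos = min(max(i * (n + 1) / parts, 1), n)        # assumption: clamp to 1..n
    k = math.floor(pos)
    frac = pos - k
    if frac == 0:
        return s[k - 1]
    # Assumption: interpolate linearly for fractional positions.
    return s[k - 1] + frac * (s[k] - s[k - 1])

data = [200, 156, 231, 222, 96, 289, 126, 308]
q1, q2, q3 = (fractile(data, i, 4) for i in (1, 2, 3))   # quartiles
d9 = fractile(data, 9, 10)                                # 9th decile
data_range = max(data) - min(data)
iqr = q3 - q1

print(q1, q2, q3, iqr, data_range)
print(std_dev(data), coeff_of_variation(data))
```

Passing `parts=10` or `parts=100` gives deciles or percentiles with the same position rule.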

For Grouped Data
Given a frequency distribution:
(1) Mean $\bar{X} = \frac{\sum fX}{\sum f}$, where X are the given values or the midpoints of the given class intervals and f are the frequencies.
(2) To find the median:
Find the cumulative frequencies for the given frequency distribution.
Find $N = \sum f$ and N/2, then find the first cumulative frequency covering N/2.
The class corresponding to this cumulative frequency is the median class $l_1 - l_2$.
Median $= M = l_1 + \left(\frac{m - cf}{f}\right)(l_2 - l_1)$
where $l_1 - l_2$ is the median class,
f = the frequency of the median class,
cf = cumulative freq. of the class preceding the median class,
m = N/2.
(3) To find the mode:
Identify the highest frequency, say $f_m$.
The class corresponding to this highest frequency $f_m$ is called the modal class: $l_1 - l_2$.
If $f_1$ = frequency of the class preceding the modal class,
$f_m$ = frequency of the modal class,
$f_2$ = frequency of the class succeeding the modal class, then
Mode $= l_1 + \left(\frac{f_m - f_1}{2 f_m - f_1 - f_2}\right)(l_2 - l_1)$
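The grouped-data mean, median and mode formulas can be tried out with the sketch below. The function names are illustrative, the class intervals and frequencies are those of the grouped-data example given under Types of Data, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch.

```python
from itertools import accumulate

def grouped_mean(classes, freq):
    mids = [(l1 + l2) / 2 for l1, l2 in classes]          # class midpoints
    return sum(f * x for f, x in zip(freq, mids)) / sum(freq)

def grouped_median(classes, freq):
    m = sum(freq) / 2                                     # m = N/2
    cum = list(accumulate(freq))
    for i, c in enumerate(cum):
        if c >= m and freq[i] > 0:                        # first cumulative freq covering m
            l1, l2 = classes[i]
            cf = cum[i - 1] if i > 0 else 0               # cf of the preceding class
            return l1 + (m - cf) / freq[i] * (l2 - l1)
    raise ValueError("empty frequency distribution")

def grouped_mode(classes, freq):
    i = freq.index(max(freq))                             # modal class
    l1, l2 = classes[i]
    f_m = freq[i]
    f_1 = freq[i - 1] if i > 0 else 0                     # assumption: 0 if no preceding class
    f_2 = freq[i + 1] if i + 1 < len(freq) else 0         # assumption: 0 if no succeeding class
    return l1 + (f_m - f_1) / (2 * f_m - f_1 - f_2) * (l2 - l1)

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freq = [23, 12, 5, 67, 52, 34, 19]

print(grouped_mean(classes, freq))
print(grouped_median(classes, freq))
print(grouped_mode(classes, freq))
```

For this table N = 212, so N/2 = 106 and both the median class and the modal class are 30 - 40.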

For Grouped Data
Given a frequency distribution:
(4) Standard Deviation (SD) $= S = \sqrt{\frac{\sum f (X - \bar{X})^2}{\sum f}} = \sqrt{\frac{\sum f X^2}{\sum f} - \bar{X}^2}$, where $\bar{X} = \frac{\sum fX}{\sum f}$
(5) Variance $V(X) = S^2 = \frac{\sum f (X - \bar{X})^2}{\sum f} = \frac{\sum f X^2}{\sum f} - \bar{X}^2$
(6) Coefficient of Variation (CV) $= \frac{\text{SD}}{\text{Mean}} = \frac{S}{\bar{X}}$
(7) Quartiles: $Q_1, Q_2, Q_3$ where
$Q_1 = l_1 + \left(\frac{m - cf}{f}\right)(l_2 - l_1)$, $Q_1$-class: $l_1 - l_2$, m = N/4, $N = \sum f$
cf = cumulative freq. of the class preceding the $Q_1$-class
f = frequency of the $Q_1$-class
$Q_2 = l_1 + \left(\frac{m - cf}{f}\right)(l_2 - l_1)$, $Q_2$-class: $l_1 - l_2$, m = N/2, $N = \sum f$
cf = cumulative freq. of the class preceding the $Q_2$-class
f = frequency of the $Q_2$-class
$Q_3 = l_1 + \left(\frac{m - cf}{f}\right)(l_2 - l_1)$, $Q_3$-class: $l_1 - l_2$, m = 3N/4, $N = \sum f$
cf = cumulative freq. of the class preceding the $Q_3$-class
f = frequency of the $Q_3$-class
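The grouped-data variance, standard deviation, CV and quartiles can be sketched as below. The function names are illustrative, and the class intervals and frequencies are those of the grouped-data example given under Types of Data.

```python
import math
from itertools import accumulate

def grouped_variance(classes, freq):
    n = sum(freq)
    mids = [(l1 + l2) / 2 for l1, l2 in classes]
    mean = sum(f * x for f, x in zip(freq, mids)) / n
    # Second form of formula (5): sum(f X^2) / sum(f) - mean^2.
    return sum(f * x * x for f, x in zip(freq, mids)) / n - mean ** 2

def grouped_quartile(classes, freq, i):
    """Q_i = l1 + (m - cf) / f * (l2 - l1) with m = i*N/4."""
    m = i * sum(freq) / 4
    cum = list(accumulate(freq))
    for idx, c in enumerate(cum):
        if c >= m and freq[idx] > 0:                  # Q_i-class: first class covering m
            l1, l2 = classes[idx]
            cf = cum[idx - 1] if idx > 0 else 0       # cf of the preceding class
            return l1 + (m - cf) / freq[idx] * (l2 - l1)
    raise ValueError("empty frequency distribution")

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freq = [23, 12, 5, 67, 52, 34, 19]

var = grouped_variance(classes, freq)
sd = math.sqrt(var)
mean = sum(f * (l1 + l2) / 2 for f, (l1, l2) in zip(freq, classes)) / sum(freq)
cv = sd / mean
q1, q2, q3 = (grouped_quartile(classes, freq, i) for i in (1, 2, 3))

print(var, sd, cv)
print(q1, q2, q3)
```

Note that Q2 computed this way is the same as the grouped median, since both use m = N/2.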

Relationships
(1) Mean – Mode = 3(Mean – Median)
(2) Relative Dispersion = (CV x 100) %
(3) Interpretations of Coefficient of Variation. With it we measure:
• Consistency (performance of cricketers)
• Disparity (Per Capita Income of different states in India)
• Volatility / Risk (return on equity capital invested in some shares)
• Uniformity (workload on different counters in banks, wages in different organizations)

Skewness of Data
Skewness is an indicator of lack of symmetry in data. Data can be "skewed", meaning it tends to have a long tail on one side or the other.
[Figure: Negatively Skewed, Normal and Positively Skewed distributions]
Pearson's Measure of Skewness $= \frac{3(\text{Mean} - \text{Median})}{\text{SD}} = \frac{\text{Mean} - \text{Mode}}{\text{SD}}$
Bowley's coefficient of skewness $= S_k = \frac{Q_3 + Q_1 - 2 Q_2}{Q_3 - Q_1}$
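Both skewness measures are simple ratios of summary statistics, as the sketch below shows. The function names and the sample numbers are illustrative assumptions: the first example uses mean 5, median 4.5, mode 4 and SD 2, and the second uses quartiles 133.5, 211.0 and 274.5.

```python
def pearson_skewness_median(mean, median, sd):
    return 3 * (mean - median) / sd

def pearson_skewness_mode(mean, mode, sd):
    return (mean - mode) / sd

def bowley_skewness(q1, q2, q3):
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Illustrative summary values (assumed, not taken from a real study).
sk_median = pearson_skewness_median(mean=5, median=4.5, sd=2)
sk_mode = pearson_skewness_mode(mean=5, mode=4, sd=2)
sk_bowley = bowley_skewness(q1=133.5, q2=211.0, q3=274.5)

print(sk_median, sk_mode, sk_bowley)
```

A positive value points to a longer tail on the right (positively skewed), a negative value to a longer tail on the left, and a value near zero to roughly symmetric data. The two Pearson forms coincide only when relationship (1) holds exactly.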
