
NPTEL

Course On

STRUCTURAL
RELIABILITY
Module # 02
Lecture 1

Course Format: Web

Instructor:
Dr. Arunasis Chakraborty
Department of Civil Engineering
Indian Institute of Technology Guwahati
1. Lecture 01: Basic Statistics

Scatter Diagram, Histogram and Frequency Polygon

Observed data samples are often presented as scattered points, which may be independent of or dependent on other random variables. Presentation of the sample data is important because it reveals the underlying statistical properties of the sample, such as correlation and range. Generally, a sample of statistical observations is represented using scatter diagrams, histograms and frequency polygons. The random variables associated with these observations are discrete.

Generally, a scatter diagram is presented in 2D or 3D form by plotting two or three random variables, respectively. Figure 2.1.1 shows typical scatter diagrams for two random variables. Each random variable must have observation data that can be represented as discrete points on the graph; thus, the statistical data must relate to simultaneous measurements of the random variables. A scatter diagram shows the nature of the relation between the random variables. For example, if two random variables show an increasing trend in the scatter diagram, they have positive correlation, and vice versa [see Figure 2.1.1 (a) and (b)]; if this increasing or decreasing trend is very strict (i.e. the points nearly follow a straight line), the correlation is +1 or −1, respectively [see Figure 2.1.1 (e) and (f)]. At times the trend in a scatter diagram is neither uniformly increasing nor decreasing; such pairs have nearly zero linear correlation [see Figure 2.1.1 (c) and (d)], although a strong quadratic relationship exists between the pair of random variables shown in Figure 2.1.1 (d).

Histograms are representations of the grouped frequency distribution of observed data. They are bar-like representations in which the width of each bar is the class interval of the data and the height of the bar is the frequency density of the data falling in the associated class (see Figure 2.1.2). The area of each bar therefore represents its class frequency, as expressed in Eq. 2.1.1.

$$
\begin{aligned}
\text{Area of each rectangle} &= \text{width} \times \text{height} \\
&= (\text{width of class}) \times (\text{frequency density}) \\
&= (\text{width of class}) \times \frac{\text{class frequency}}{\text{width of class}} \\
&= \text{class frequency}
\end{aligned}
\tag{2.1.1}
$$


[Figure 2.1.1: Scatter diagrams of $y$ versus $x$ showing different types and degrees of correlation: (a) positive, (b) negative, (c) zero, (d) zero, (e) +1 and (f) −1]

[Figure 2.1.2: Typical example of a histogram and frequency polygon (frequency plotted against values)]

Before plotting a histogram, one has to form a frequency table containing the classes and their frequencies. The choice of the number of classes plays a crucial role in the formulation of the frequency table and, in turn, of the histogram. Generally, an appropriate number of classes may be chosen using

$$c = 1 + 3.3 \log_{10} n \tag{2.1.2}$$

where $n$ is the number of observations (sample size) and $c$ is the number of classes.

An alternative to the histogram is the frequency polygon, which is formed by joining the mid-values of each class as shown in Figure 2.1.2. If the widths of the classes are the same, then the area under the histogram equals the area under the frequency polygon. The curve formed by the frequency polygon gives an idea of the frequency distribution of the data.
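The following Python sketch illustrates these ideas; it is a minimal example assuming NumPy and Matplotlib are available, and the sample data are generated here purely for illustration. It chooses the number of classes with Eq. 2.1.2 and overlays the frequency polygon on the histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample data (for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(loc=30.0, scale=5.0, size=200)

# Number of classes from Eq. 2.1.2
n = len(x)
c = int(np.ceil(1 + 3.3 * np.log10(n)))

# Histogram: class frequencies and class boundaries
freq, edges = np.histogram(x, bins=c)
width = edges[1] - edges[0]
plt.bar(edges[:-1], freq, width=width, align='edge',
        edgecolor='k', alpha=0.5, label='Histogram')

# Frequency polygon: join the frequencies at the class mid-values
mids = 0.5 * (edges[:-1] + edges[1:])
plt.plot(mids, freq, 'o-', label='Frequency polygon')

plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
plt.show()
```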

Measures of Central Tendency

A whole set of observations can often be summarised by a single value. Such a value usually occupies a central position, with some observations larger and some smaller than it; such values are known as measures of central tendency. There are three measures of central tendency: mean, median and mode.

Mean – It is of three types: arithmetic mean, geometric mean and harmonic mean. Unless stated otherwise, the words 'mean' and 'average' refer to the arithmetic mean. In this course only the arithmetic mean is discussed.

Arithmetic Mean (AM) – It is defined as the sum of a set of observations divided by the size of the set. Consider observations $x_1, x_2, \ldots, x_n$, where $n$ is the number of observations; their AM ($\mu_x$) is

$$\mu_x = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum x_i \tag{2.1.3}$$

Now, suppose $x_1, x_2, \ldots, x_n$ occur with frequencies $f_1, f_2, \ldots, f_n$ respectively, i.e. $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times and so on. Then the sum of all the $f_1 + f_2 + \cdots + f_n$ observations is

$$\underbrace{x_1 + x_1 + \cdots + x_1}_{f_1 \text{ terms}} + \underbrace{x_2 + x_2 + \cdots + x_2}_{f_2 \text{ terms}} + \cdots + \underbrace{x_n + x_n + \cdots + x_n}_{f_n \text{ terms}} = f_1 x_1 + f_2 x_2 + \cdots + f_n x_n \tag{2.1.4}$$

Hence, the arithmetic mean is

$$\mu_x = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f x}{\sum f} \tag{2.1.5}$$

This is sometimes referred to as weighted arithmetic mean.
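As a quick illustration of Eqs. 2.1.3 and 2.1.5, the following minimal Python sketch computes the simple and weighted arithmetic means; the values and frequencies are hypothetical.

```python
import numpy as np

x = np.array([12.0, 15.0, 18.0, 20.0, 25.0])   # hypothetical distinct observations
f = np.array([2, 5, 3, 4, 1])                   # hypothetical frequencies of each value

simple_am = x.sum() / len(x)                    # Eq. 2.1.3: treat x as a simple series
weighted_am = (f * x).sum() / f.sum()           # Eq. 2.1.5: frequency-weighted mean

print(simple_am, weighted_am)
```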

Important properties of AM

1. The sum of a set of observations is equal to the number of observations multiplied by the AM:

$$\sum x_i = n\mu_x \quad \text{and} \quad \sum f_i x_i = N\mu_x \tag{2.1.6}$$

where $N = \sum f$ is the total frequency. The first relation in Eq. 2.1.6 refers to the simple sum, whereas the second refers to the weighted sum.

2. For given observations, the sum of deviations from their mean is always 0.

$$\sum (x_i - \mu_x) = 0, \;\text{ where } \mu_x = \frac{\sum x_i}{n}, \quad \text{and} \quad \sum f_i (x_i - \mu_x) = 0, \;\text{ where } \mu_x = \frac{\sum f_i x_i}{N} \tag{2.1.7}$$

3. If two variables $x$ and $y$ are related as $y = ax + b$, where $a$ and $b$ are constants, then

$$\mu_y = a\mu_x + b \tag{2.1.8}$$

and vice versa. Eq. 2.1.8 states that if each observation $x_i$ is increased, decreased, multiplied or divided by a constant, the mean $\mu_x$ undergoes the same operation with the same constant.

4. Let two groups of observations of sizes $n_1$ and $n_2$ have means $\mu_{x_1}$ and $\mu_{x_2}$, respectively. Then the combined mean ($\mu_x$) of the composite group of $n_1 + n_2$ ($= N$) observations is given by

$$N\mu_x = n_1\mu_{x_1} + n_2\mu_{x_2} \tag{2.1.9}$$

This can be generalised to any number of groups as

$$N\mu_x = \sum n_i\mu_{x_i} \quad \text{where } N = \sum n_i \tag{2.1.10}$$

5. The sum of squared deviations takes its smallest value when the deviations are measured from the mean (AM):

$$\sum (x_i - A)^2 \text{ is minimum when } A = \text{simple AM}, \qquad \sum f_i (x_i - A)^2 \text{ is minimum when } A = \text{weighted AM} \tag{2.1.11}$$
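These properties can be verified numerically. The following minimal Python sketch, using hypothetical data, checks Property 2 (deviations sum to zero), Property 4 (Eq. 2.1.9, combined mean) and Property 5 (Eq. 2.1.11, least-squares property of the AM).

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(10.0, 2.0, size=40)    # hypothetical group 1
x2 = rng.normal(14.0, 3.0, size=60)    # hypothetical group 2
x = np.concatenate([x1, x2])

# Property 2: sum of deviations from the mean is zero (up to round-off)
print(np.sum(x - x.mean()))

# Property 4 (Eq. 2.1.9): combined mean from the group means
combined = (len(x1) * x1.mean() + len(x2) * x2.mean()) / len(x)
print(np.isclose(combined, x.mean()))

# Property 5 (Eq. 2.1.11): sum of squared deviations is minimised at the AM
A = np.linspace(x.mean() - 1.0, x.mean() + 1.0, 201)
ssd = [np.sum((x - a) ** 2) for a in A]
print(np.isclose(A[np.argmin(ssd)], x.mean(), atol=0.01))
```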

Median – The middle-most value when a set of observations is arranged in order of magnitude is called the median. It can be calculated from a grouped frequency distribution using the formula

$$\text{Median} = l_1 + \frac{N/2 - F}{f_m} \times c \tag{2.1.12}$$

where $l_1$ is the lower bound of the median class, $N$ is the total frequency, $F$ is the cumulative frequency corresponding to $l_1$, $f_m$ is the frequency of the median class and $c$ is the width of the median class.

The median is, in a certain sense, the real measure of central tendency because it gives the value of the most central observation. Moreover, it is unaffected by extremely large or small values and can easily be calculated from frequency distributions with open-end classes.
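A short Python sketch of Eq. 2.1.12 is given below; the grouped frequency table (class boundaries and frequencies) is hypothetical and serves only to show the calculation.

```python
import numpy as np

# Hypothetical grouped frequency table
edges = np.array([0, 10, 20, 30, 40, 50])   # class boundaries (equal widths)
freq = np.array([5, 12, 20, 9, 4])           # class frequencies

N = freq.sum()
cum = np.cumsum(freq)
m = int(np.searchsorted(cum, N / 2))         # index of the median class

l1 = edges[m]                                # lower bound of the median class
F = cum[m - 1] if m > 0 else 0               # cumulative frequency below l1
fm = freq[m]                                 # frequency of the median class
c = edges[m + 1] - edges[m]                  # width of the median class

median = l1 + (N / 2 - F) / fm * c           # Eq. 2.1.12
print(median)                                # 24.0 for this table
```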

Mode – The value in a set of observations that occurs with the highest frequency is known as the mode; it reflects the most frequently occurring value. For grouped data it is generally calculated as

$$\text{Mode} = l_1 + \frac{d_1}{d_1 + d_2} \times c \tag{2.1.13}$$

where $l_1$ is the lower bound of the highest-frequency (modal) class, $d_1$ is the difference between the frequencies of the modal class and the preceding class, $d_2$ is the difference between the frequencies of the modal class and the following class, and $c$ is the common width of the classes. Eq. 2.1.13 is applicable only when all classes have the same width. One peculiarity of the mode should be noted: if all observations occur with equal frequency, the mode does not exist.
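A corresponding sketch of Eq. 2.1.13, reusing the hypothetical grouped table from the median example, is shown below.

```python
import numpy as np

edges = np.array([0, 10, 20, 30, 40, 50])    # equal class widths (hypothetical)
freq = np.array([5, 12, 20, 9, 4])

k = int(np.argmax(freq))                     # modal (highest-frequency) class
l1 = edges[k]
d1 = freq[k] - (freq[k - 1] if k > 0 else 0)              # gap to preceding class
d2 = freq[k] - (freq[k + 1] if k < len(freq) - 1 else 0)  # gap to following class
c = edges[k + 1] - edges[k]

mode = l1 + d1 / (d1 + d2) * c               # Eq. 2.1.13
print(mode)                                  # about 24.2 for this table
```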

Relation Between Mean, Median and Mode

An interesting approximate empirical relationship between the mean, median and mode exists; it can be expressed as

$$\text{Mean} - \text{Mode} \approx 3\,(\text{Mean} - \text{Median}) \tag{2.1.14}$$

Note: this expression holds only approximately, for unimodal distributions with moderate asymmetry.
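Eq. 2.1.14 is often used in rearranged form to estimate the mode when only the mean and median are available; a small hypothetical example:

```python
mean, median = 24.6, 24.0                        # hypothetical summary statistics
mode_estimate = mean - 3 * (mean - median)       # rearranged Eq. 2.1.14
print(mode_estimate)                             # approximately 22.8
```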

Standard Deviation and Variance

The variance is defined as the arithmetic mean of the squared deviations from the mean. The deviations from the mean, the squared deviations and their mean are shown below:
$$\text{Deviations from mean: } (x_1 - \mu_x), (x_2 - \mu_x), \ldots, (x_n - \mu_x) \tag{2.1.15}$$

$$\text{Squared deviations from mean: } (x_1 - \mu_x)^2, (x_2 - \mu_x)^2, \ldots, (x_n - \mu_x)^2$$

$$\text{Mean squared deviation from mean: } \frac{1}{n}\left[(x_1 - \mu_x)^2 + (x_2 - \mu_x)^2 + \cdots + (x_n - \mu_x)^2\right] = \frac{1}{n}\sum (x_i - \mu_x)^2$$

The variance is generally denoted by $\sigma^2$; expressions for a simple series and for a frequency distribution are given below.

$$\text{For a simple series, } \sigma^2 = \frac{1}{n}\sum (x_i - \mu_x)^2 \tag{2.1.16}$$

$$\text{For a frequency distribution, } \sigma^2 = \frac{1}{N}\sum f_i (x_i - \mu_x)^2 \tag{2.1.17}$$

The standard deviation, $\sigma$, is defined as the square root of the variance. It is evaluated as shown in Eq. 2.1.18.

$$\sigma = \sqrt{\frac{1}{n}\sum (x_i - \mu_x)^2} \tag{2.1.18}$$

Both the variance and the standard deviation are vital tools for representing statistical data, as they show the dispersion of the data about the mean.
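A minimal Python sketch of Eqs. 2.1.16 and 2.1.18 follows; note that it uses the $1/n$ (population) definition given above rather than the $1/(n-1)$ sample estimator, and the data are hypothetical.

```python
import numpy as np

x = np.array([4.0, 7.0, 7.0, 9.0, 13.0])      # hypothetical observations

mu = x.mean()
var = np.mean((x - mu) ** 2)                   # Eq. 2.1.16 (divide by n, not n - 1)
sigma = np.sqrt(var)                           # Eq. 2.1.18

# NumPy's default ddof=0 matches the 1/n definition used here
print(var, np.var(x), sigma, np.std(x))
```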

Covariance and Correlation Coefficient

Covariance is defined for a pair of random variables that are associated or related to each other. It is the average of the products of the individual deviations from the corresponding means. Eq. 2.1.19 gives the covariance $\mathrm{Cov}(x, y)$ between two correlated random variables $x$ and $y$:

$$\mathrm{Cov}(x, y) = \frac{1}{n}\sum (x_i - \mu_x)(y_i - \mu_y) \tag{2.1.19}$$

Expanding Eq. 2.1.19, one can get

$$\mathrm{Cov}(x, y) = \frac{\sum xy}{n} - \frac{\sum x}{n}\,\frac{\sum y}{n} \tag{2.1.20}$$
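A minimal Python sketch of Eqs. 2.1.19 and 2.1.20 is given below; the paired data are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical paired observations
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

cov_def = np.mean((x - x.mean()) * (y - y.mean()))               # Eq. 2.1.19
cov_exp = np.sum(x * y) / n - (np.sum(x) / n) * (np.sum(y) / n)  # Eq. 2.1.20

# np.cov with bias=True also divides by n rather than n - 1
print(cov_def, cov_exp, np.cov(x, y, bias=True)[0, 1])
```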

Generally, the correlation of two random variables is expressed in terms of the coefficient of correlation ($\rho$), which is the ratio of the covariance to the product of the standard deviations of the two random variables:

$$\rho = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} \tag{2.1.21}$$

Substituting the values of $\mathrm{Cov}(x, y)$, $\sigma_x$ and $\sigma_y$ from Eqs. 2.1.19 and 2.1.18 into Eq. 2.1.21, one gets

$$\rho = \frac{\sum (x - \mu_x)(y - \mu_y)}{\sqrt{\sum (x - \mu_x)^2 \cdot \sum (y - \mu_y)^2}} \tag{2.1.22}$$

Expanding Eq. 2.1.22,

$$\rho = \frac{\sum xy - n\mu_x\mu_y}{\sqrt{\left(\sum x^2 - n\mu_x^2\right)\left(\sum y^2 - n\mu_y^2\right)}} \tag{2.1.23}$$

As $n\mu_x = \sum x$ and $n\mu_y = \sum y$, one can substitute these into the above equation and simplify to obtain

$$\rho = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \tag{2.1.24}$$
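The following Python sketch evaluates Eq. 2.1.24 directly and cross-checks the result against NumPy's built-in correlation, using the same hypothetical paired data as in the covariance example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical paired observations
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

num = n * np.sum(x * y) - np.sum(x) * np.sum(y)            # numerator of Eq. 2.1.24
den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
              (n * np.sum(y**2) - np.sum(y)**2))           # denominator of Eq. 2.1.24
rho = num / den

print(rho, np.corrcoef(x, y)[0, 1])            # the two values should agree
```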

Percentile

A percentile is a value below which a given percentage of the observations fall. For example, 99% of the observations fall below the 99th percentile ($P_{99}$). Ranked in order, the values of the different percentiles satisfy $P_1 < P_2 < \cdots < P_{99}$.
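Percentiles of a sample can be obtained directly with NumPy, as in the short sketch below (the data are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(50.0, 10.0, size=1000)          # hypothetical observations

p25, p50, p99 = np.percentile(x, [25, 50, 99])
print(p25, p50, p99)                           # P25 <= P50 <= P99 by definition
```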

Regression

Regression is the process of estimating the average value of one variable for a specified value of another variable. It is carried out using suitable equations (i.e., regression equations) based on the statistical data (joint as well as individual) of the random variables. For simple regression, one can assume a linear relationship between the variables. Hence, the estimate of $y$ (denoted by $y'$) is given by the regression equation of $y$ on $x$ as


$$y' - \mu_y = b_{yx}(x - \mu_x) \tag{2.1.25}$$

where the regression coefficient $b_{yx} = \mathrm{Cov}(x, y)/\sigma_x^2$. Similarly, the regression equation of $x$ on $y$, giving the estimate of $x$ (denoted by $x'$), is given by Eq. 2.1.26:

$$x' - \mu_x = b_{xy}(y - \mu_y) \tag{2.1.26}$$

where the regression coefficient $b_{xy} = \mathrm{Cov}(x, y)/\sigma_y^2$. Now consider the straight-line fit shown below for a better understanding of the formulation and calculations involved in regression.

$$y = a + bx \tag{2.1.27}$$

where the random variable $x$ is independent and $y$ is dependent on $x$. Hence, in Eq. 2.1.27 the coefficients $a$ and $b$ are the unknowns to be evaluated by regression. Multiplying Eq. 2.1.27 by 1 and by $x$, and summing over the observations of the random variables, one gets

$$\sum y = an + b\sum x \tag{2.1.28}$$

$$\sum xy = a\sum x + b\sum x^2 \tag{2.1.29}$$

Considering Eq. 2.1.28 and dividing it by $n$ (the number of observations), one gets

$$\mu_y = a + b\mu_x \tag{2.1.30}$$

$$a = \mu_y - b\mu_x \tag{2.1.31}$$

Thus, the unknown coefficient $a$ is evaluated in terms of the individual means of the two random variables. Now, multiply Eq. 2.1.28 by $\sum x$ and Eq. 2.1.29 by $n$:

$$\sum x \sum y = na\sum x + b\left(\sum x\right)^2 \tag{2.1.32}$$

$$n\sum xy = na\sum x + nb\sum x^2 \tag{2.1.33}$$

Finally, subtracting Eq. 2.1.32 from Eq. 2.1.33,

$$n\sum xy - \sum x \sum y = b\left[n\sum x^2 - \left(\sum x\right)^2\right] \tag{2.1.34}$$

$$\therefore b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \tag{2.1.35}$$

Dividing the numerator and denominator of Eq. 2.1.35 by $n^2$,

$$b = \frac{\dfrac{\sum xy}{n} - \dfrac{\sum x}{n}\,\dfrac{\sum y}{n}}{\dfrac{\sum x^2}{n} - \left(\dfrac{\sum x}{n}\right)^2} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} = \rho\,\frac{\sigma_y}{\sigma_x} \tag{2.1.36}$$

Thus, the other unknown coefficient $b$ is evaluated in terms of the covariance (or, equivalently, the coefficient of correlation) and the variance. Substituting $a$ from Eq. 2.1.31 and $b$ from Eq. 2.1.36, one gets an expression of the same form as Eq. 2.1.25:

$$y - \mu_y = b_{yx}(x - \mu_x) \tag{2.1.37}$$
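The derivation above can be checked numerically. The following minimal Python sketch computes $b$ from Eq. 2.1.35 and $a$ from Eq. 2.1.31 for hypothetical data, and compares the result with NumPy's least-squares straight-line fit.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical independent variable
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9])        # hypothetical dependent variable
n = len(x)

# Slope from Eq. 2.1.35 and intercept from Eq. 2.1.31
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
a = y.mean() - b * x.mean()

# Cross-check with NumPy's degree-1 (least-squares) polynomial fit
b_np, a_np = np.polyfit(x, y, 1)
print(a, b)        # intercept and slope from the derived formulas
print(a_np, b_np)  # should match the values above

y_prime = a + b * x                            # regression estimates of y (Eq. 2.1.27)
```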
