1. Bernoulli distribution
Suppose an experiment has two mutually exclusive outcomes. Examples are (head, tail), (success, failure), (male, female), etc. The sample space is
S = {success, failure}
2. Binomial distribution
Performing N Bernoulli trials and summing the outcomes gives the binomial random variable
B = ∑ X_i
The conditions are:
N random trials
The trials are independent
The probability of success, p, is the same from trial to trial
The mean and standard deviation are
μ = Np
σ = √(Np(1 − p))
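As a sanity check on these formulas, the sketch below simulates binomial draws as sums of Bernoulli trials and compares the sample mean and standard deviation against Np and √(Np(1−p)). The values of N, p, and the number of replications are illustrative, not from the text.

```python
import math
import random

random.seed(0)

N, p = 20, 0.3          # illustrative parameters
trials = 50_000

# A binomial draw is the sum of N independent Bernoulli(p) trials.
def binomial_draw(N, p):
    return sum(1 if random.random() < p else 0 for _ in range(N))

draws = [binomial_draw(N, p) for _ in range(trials)]
mean = sum(draws) / trials
var = sum((d - mean) ** 2 for d in draws) / trials

print(round(mean, 1))               # close to N*p = 6.0
print(round(math.sqrt(var), 1))     # close to sqrt(N*p*(1-p)) ~ 2.0
```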
3. Poisson distribution
This is a discrete distribution which is used to determine the probability of the occurrence of a specific number of successes per unit of time, when the events/successes are independent and the average number of successes per unit of time remains constant. The probability distribution function is given by
p(x) = λ^x e^(−λ) / x!
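A direct translation of this pmf (with an illustrative λ) shows that the probabilities sum to one and that the mean comes out equal to λ:

```python
import math

# Poisson pmf: p(x) = lambda^x * e^(-lambda) / x!
def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 2.0   # illustrative average rate
support = range(50)   # truncation: the tail beyond 49 is negligible for lam = 2

total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)

print(round(total, 6))   # 1.0: the probabilities sum to one
print(round(mean, 6))    # 2.0: the mean equals lambda
```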
4. Discrete uniform distribution
Suppose there are N possible outcomes, each equally likely. The probability function is
f(x) = 1/N for x = 1, 2, …, N
Continuous distributions
1. Normal distribution
f(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²))
2. Exponential distribution
The area below the curve falls as x increases. This makes it suitable for analyzing problems involving the time before a given event occurs (duration). Example: the probability that a machine will remain operational for x years before failing. The cumulative distribution function is
P(X ≤ x) = 1 − e^(−λx)
where λ = the mean number of occurrences per interval of interest. The mean is
E(X) = 1/λ
The variance is
σ² = 1/λ²
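The machine-failure example can be computed directly from the CDF; the rate λ below is illustrative.

```python
import math

lam = 0.5   # illustrative rate: failures average 1/lam = 2 years apart

def exp_cdf(x, lam):
    # P(X <= x) = 1 - e^(-lam * x)
    return 1 - math.exp(-lam * x)

# Probability the machine fails within 3 years
print(round(exp_cdf(3, lam), 4))
# Mean time to failure E(X) = 1/lam
print(1 / lam)   # 2.0
```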
3. Uniform distribution
f(x) = 1/(b − a) for a ≤ x ≤ b
Mathematics of expectations
Probability distributions provide very detailed information about random variables. In the interest of parsimony, but at the cost of detail, we summarize the distributions by some summary statistics. These statistics tell us the central tendency (typical value) and dispersion of the distribution.
Central tendency
1. Mean
E(X) = ∑_{i=1}^n p_i x_i if X is discrete
E(X) = ∫_{−∞}^{∞} x f(x) dx if X is continuous
The probabilities are the weights. The mean is the position of the fulcrum that will keep the distribution in
balance.
Properties
E(aX + b) = aE(X) + b
E(a) = a
E(bX) = bE(X)
E(g(X)) = ∑_{i=1}^n p_i g(x_i) if X is discrete
Note
Moments
E(X^r) is referred to as the rth moment of X about the origin. The mean is said to be the first moment of X about the origin.
E((X − μ)^r) is referred to as the rth moment of X about its mean.
2. The median
p(X < m) ≤ 1/2 and p(X > m) ≤ 1/2
3. The mode
The mode is the value of X associated with the largest probability (probability density). The value of x
associated with the peak of the density curve. It is the most likely outcome.
Comparisons
If the objective is to minimize the mean squared deviation, the best estimator is the mean of X
If the objective is to minimize the mean absolute deviation, the best estimator is median of X
If the objective is to maximize the probability that the error in prediction is zero, the mode of X is best.
Measures of Dispersion
1. The range
The range is simply the difference between the maximum and the minimum values of X.
Range(X) = Max(X) − Min(X)
The range is highly sensitive to extreme values.
2. The variance
σ² = ∑_{i=1}^n p_i (x_i − E(X))² if X is a discrete random variable
σ² = ∫_{−∞}^{∞} (x − E(X))² f(x) dx if X is a continuous random variable
The variance is not in the same unit of measure as the mean: if X is in dollars, the variance is in squared dollars. To put the dispersion in the same unit of measure as the mean, the remedy is to take the square root of the variance and obtain the standard deviation.
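A small worked example (the values and probabilities below are made up for illustration) computes the mean, variance, and standard deviation of a discrete distribution:

```python
import math

# A small discrete distribution: values and their probabilities (illustrative)
xs = [0, 1, 2, 3]
ps = [0.1, 0.4, 0.3, 0.2]

mean = sum(p * x for p, x in zip(ps, xs))
var = sum(p * (x - mean) ** 2 for p, x in zip(ps, xs))
sd = math.sqrt(var)   # back in the same units as the mean

print(round(mean, 2))   # 1.6
print(round(var, 2))    # 0.84
```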
Properties
var(X) = E(X²) − (E(X))²
This means the variance is equal to the mean of the square minus the square of the mean.
var(a) = 0
If Y = bX + a,
var(Y) = b² var(X)
SD(Y) = |b| SD(X)
σ²_X = ∑_{i=1}^n p_i (x_i − E(X))² if X is a discrete random variable
σ²_X = ∫_{−∞}^{∞} (x − E(X))² f(x) dx if X is a continuous random variable
Chebyshev’s inequality
p(X ≤ μ_X − kσ_X or X ≥ μ_X + kσ_X) ≤ 1/k²
Equivalently,
p(|X − μ_X| ≥ kσ_X) ≤ 1/k²
This result doesn’t depend on any assumption about the distribution of the random variable.
A related result is Markov's inequality: for a nonnegative random variable X and any t > 0,
p(X ≥ t) ≤ E(X)/t
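A quick empirical check of Chebyshev's bound (using an exponential population as an arbitrary illustration, with the sample mean and standard deviation standing in for the population values):

```python
import math
import random

random.seed(1)

# Draw from a distribution Chebyshev knows nothing about
draws = [random.expovariate(1.0) for _ in range(100_000)]
mu = sum(draws) / len(draws)
sd = math.sqrt(sum((d - mu) ** 2 for d in draws) / len(draws))

k = 2
# Fraction of draws at least k standard deviations from the mean
tail = sum(1 for d in draws if abs(d - mu) >= k * sd) / len(draws)

print(tail <= 1 / k ** 2)   # True: the observed tail mass respects the 1/k^2 bound
```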
Multivariate distributions
With multivariate distributions, we look at two or more random variables at once and their joint
distribution functions. It is possible to investigate relationships between two or more random variables in
this way.
Bivariate distributions
Discrete Bivariate distributions
Consider two random variables, X and Y. The possible values of X are x1, x2, …, xn and the possible values of Y are y1, y2, …, yn; then there will be a finite set of pairs (x, y) that X and Y may jointly assume.
If we attach a probability to each of the different joint outcomes, we have a discrete bivariate probability distribution or a joint probability function. The joint discrete probability distribution function for two random variables, X and Y, is the function f(x, y) such that for any point (x, y) in the X-Y plane, f(x, y) = p(X = x and Y = y). Thus, f(x, y) gives us the joint probability that the two random variables, X and Y, assume the joint outcome (x, y).
                          Y = 1           Y = 2           Y = 3           Marginal probability of X
X = 0              p(X=0 & Y=1)    p(X=0 & Y=2)    p(X=0 & Y=3)    P(X=0)
X = 1              p(X=1 & Y=1)    p(X=1 & Y=2)    p(X=1 & Y=3)    P(X=1)
Marginal prob. of Y    P(Y=1)          P(Y=2)          P(Y=3)
Property 1
f(x, y) = p(X = x and Y = y) ≥ 0 for all (x, y) pairs. The probabilities assigned to the joint outcomes are nonnegative.
Property 2
F(x, y) = p(X ≤ x and Y ≤ y) = ∑_{s≤x} ∑_{t≤y} f(s, t). This is known as the joint distribution function or the cumulative joint probability function. It is simply the sum over the outcomes satisfying X ≤ x and Y ≤ y.
Given the bivariate distribution of X and Y we may want to move back to obtain the univariate
distributions of X and Y. These are known as the marginal probability distributions for X and Y. The
marginal distributions for X and Y, respectively are given by
g(x) = p(X = x) = ∑_{all y} f(x, y)
h(y) = p(Y = y) = ∑_{all x} f(x, y)
This means, if we want the probability that X equals a specific value x, we simply sum the joint probabilities across all values of y after fixing X = x. Doing this for all values of x gives us the marginal probability distribution of X. Similarly, we do the same for Y to obtain the marginal distribution of Y.
Conditional distributions
When we want to determine the distribution of one random variable conditional upon a second random variable assuming a specific value, we compute the conditional probability distribution for the variable. The conditional probability distributions of X and Y are given by
f_c(x|y) = p(X = x | Y = y) = f(x, y) / h(y)
f_c(y|x) = p(Y = y | X = x) = f(x, y) / g(x)
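A minimal sketch of the marginal and conditional computations on a small joint pmf (the probabilities below are made up for illustration, mirroring the 2-by-3 table above):

```python
# Joint pmf f(x, y) stored as a dict; the probabilities are illustrative.
f = {
    (0, 1): 0.10, (0, 2): 0.20, (0, 3): 0.10,
    (1, 1): 0.15, (1, 2): 0.25, (1, 3): 0.20,
}

xs = sorted({x for x, _ in f})
ys = sorted({y for _, y in f})

# Marginals: sum the joint probabilities over the other variable
g = {x: sum(f[(x, y)] for y in ys) for x in xs}   # g(x) = P(X = x)
h = {y: sum(f[(x, y)] for x in xs) for y in ys}   # h(y) = P(Y = y)

# Conditional distribution of X given Y = 2: f_c(x|y) = f(x, y) / h(y)
f_x_given_y2 = {x: f[(x, 2)] / h[2] for x in xs}

print(round(g[0], 2), round(g[1], 2))         # 0.4 0.6
print(round(sum(f_x_given_y2.values()), 6))   # 1.0: a proper pmf
```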
Continuous bivariate distributions
p(a ≤ X ≤ b and c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx
This function, f(x, y), is called the bivariate joint probability density function for the random variables X and Y.
Properties
p(a ≤ X ≤ b and c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx ≥ 0
p(−∞ ≤ X ≤ ∞ and −∞ ≤ Y ≤ ∞) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1
A function F(x, y) gives the probability that X is less than a given value x and the random variable Y is jointly less than a given value y:
F(x, y) = p(X ≤ x and Y ≤ y) = p(−∞ ≤ X ≤ x and −∞ ≤ Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f(s, t) dt ds
This is known as the cumulative joint density function or the cumulative joint distribution function.
Property 1: F(−∞, −∞) = 0; the probability that X is less than negative infinity and Y is less than negative infinity is zero.
Property 2: F(∞, ∞) = 1; the probability that X is less than positive infinity and Y is less than positive infinity is one.
Property 3
∂²F(x, y)/∂x∂y = f(x, y)
Property 4
p(a ≤ X ≤ b and c ≤ Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)
The marginal density functions are
g(x) = ∫_{−∞}^{∞} f(x, y) dy
h(y) = ∫_{−∞}^{∞} f(x, y) dx
Independence
Two random variables, X and Y, are said to be independent if and only if
f(x, y) = g(x) h(y)
More generally, N random variables are independent if and only if
f(X1, X2, …, XN) = f_X1(X1) f_X2(X2) … f_XN(XN)
Expectations
The expectation of a random variable given the bivariate distribution of this variable and one or more
other random variables can be computed in two ways. First, one can simply obtain the marginal
distribution function for the variable of interest and compute its expectation as usual based on the
marginal distribution. Second, one can also directly compute the expectation of a given random variable
based on the joint density function.
If X and Y are discrete,
E(X) = ∑_{all x} ∑_{all y} x f(x, y)
E(Y) = ∑_{all x} ∑_{all y} y f(x, y)
var(Y) = ∑_{all x} ∑_{all y} (y − E(Y))² f(x, y)
If X and Y are continuous,
E(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dy dx
var(X) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − E(X))² f(x, y) dy dx
var(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (y − E(Y))² f(x, y) dy dx
Property 1
E(X1 + X2 + … + XN) = E(X1) + E(X2) + … + E(XN)
E(∑_{i=1}^N X_i) = ∑_{i=1}^N E(X_i)
The expected value of a sum is the sum of the expected values. This is true whether or not the variables are independent.
E(∑_{i=1}^N a_i X_i) = ∑_{i=1}^N a_i E(X_i)
Property 2
Let X and Y be two random variables with a joint density function f(x,y). If g(x,y) is some function of X
and Y,
E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dy dx if X and Y are continuous
Property 4
If X1, X2, …, XN are independent,
var(∑_{i=1}^N X_i) = ∑_{i=1}^N var(X_i)
Conditional expectations
Often, our interest lies in the expected value of a random variable conditioned on a specific value of another variable. The conditional expectation is given by
E(Y|x) = ∑_{all y} y f(y|x) = ∑_{all y} y f(x, y)/g(x) if Y is discrete
E(Y|x) = ∫_{−∞}^{∞} y f(x, y)/g(x) dy if Y is continuous
Measures of association
Besides the various summary measures used for individual random variables, when dealing with two or
more random variables, we can also measure association between the variables. Two measures of
association are commonly used: covariance and correlation.
Covariance
Covariance indicates direction of association based on the sign of the covariance value (+,-, 0). It is
sensitive to units of measurement such that its interpretation as indicator of strength of association is
doubtful.
Properties
Property 1
cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − E(X))(y − E(Y)) f(x, y) dy dx
Property 3
cov ( X , Y )=E ( XY ) −E ( X ) E (Y ).
Property 4
While independence implies zero covariance, the converse is not true. That is, zero covariance doesn’t
necessarily imply independence.
Rationale: covariance is a measure of linear association between two random variables. Hence, it is
possible that two variables that have strong association in a non linear way may still have close to zero
covariance (covariance measures only linear association).
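The rationale can be made concrete with a standard textbook example (not from the text): X symmetric around zero and Y = X², which is perfectly dependent on X yet has zero covariance with it.

```python
# X uniform on five values symmetric around zero; Y = X^2 is a deterministic
# (hence strongly dependent) function of X -- yet their covariance is zero.
xs = [-2, -1, 0, 1, 2]
n = len(xs)

ex = sum(xs) / n                    # E(X)  = 0 by symmetry
ey = sum(x ** 2 for x in xs) / n    # E(Y)  = E(X^2) = 2
exy = sum(x ** 3 for x in xs) / n   # E(XY) = E(X^3) = 0 by symmetry

cov = exy - ex * ey   # cov(X, Y) = E(XY) - E(X)E(Y)
print(cov)   # 0.0
```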
Property 5
If X = Y, then cov(X, Y) = cov(X, X) = var(X)
Property 6
Correlation
Corr(X, Y) = E[(x − E(X))(y − E(Y)) / (σ_X σ_Y)]
= cov(X, Y) / (σ_X σ_Y)
The advantage of the correlation measure is that it is unit free: it gives a measure of the degree of association between two random variables that is not influenced by the units of measurement used.
Friedman’s Contention
Let the level of economic activity without government intervention be X. Let the amount of economic
activity increment due to government intervention (stabilizing policy) be Y. Both are random variables.
Traditional argument is that the mean level of economic activity with intervention is more stable than
without intervention. This mean level is given by
E( X +Y )
Friedman's argument is that economic swings are due mainly to the variance of GDP rather than its mean level. The variance with intervention is
var(X + Y) = var(X) + var(Y) + 2cov(X, Y)
For intervention to reduce this below var(X), the value of the covariance must be negative as well as large enough to offset the variance of the intervention variable. This is not a sufficient argument to counter the traditional point. Rather, Friedman argued that the covariance term is negligible because intervention affects the economy with a variable and long lag, making the case for the covariance to be very small. Hence, rather than stabilizing, these measures may be surprisingly destabilizing the economy.
Properties of correlation
Property 1
−1 ≤ ρ XY ≤ 1
The proof is based on the Schwarz inequality, which states that for any two random variables U and V,
[E(UV)]² ≤ E(U²) E(V²)
Property 2
Only if there is a perfect linear relationship between two variables that the correlation between the two
variables will equal one in absolute value.
Property 3
Correlation doesn’t imply causality. Two things may move together but there may be no causal relation
whatsoever between them.
The population is the universe of individuals of interest and contains all the characteristics of interest.
The sample is part of the population taken for study. To draw reasonable conclusions about the population
based on sample observations, the sample must be selected so that it mimics the population. It is a replica-
in miniature-of the population.
Of the many ways to select a sample, simple random sampling is one way to ensure the above features of a sample that is representative of the population of interest. Simple random sampling is characterized as follows
Why sample
This depends critically on the precision of the inference we want to make about the population
A sample of size n from a population can be thought of as a set of random variables: X1, X2, …, Xn. The value of the first observation (X1), for instance, is unknown until a particular sample is observed. In one sample it may assume one specific value, another value in a second sample, and so on. Hence, it is a random variable. Similarly, all the observations can be thought of as realizations of random variables. The distribution of each of these variables is determined by the distribution in the population. If a population has mean V and variance D, then each of these random variables has mean V and variance D.
But, once we have taken a specific sample of size n, we have obtained real values that are simply
realizations of the random variables. If we take another sample, we may get yet different realizations for
each of our random variables.
T =r ( X 1 , X 2 , X 3 … . , X n)
It is a formula that shows how to combine the sample data to form a point estimate of the population parameter. As such, it is called an estimator. Once data are obtained and we have realized values x1, x2, …, xn, we plug these into the formula and get an estimate of the population parameter. This estimate is called a point estimate (it is a single number instead of a range of numbers).
Sample statistics
Since T is a function of random variables, it is also a random variable. That is, its value depends on the specific sample drawn, and since which sample will be drawn is unknown, the value of T is uncertain. Thus, the sample statistic is itself a random variable with a probability distribution. The distribution of a sample statistic is known as a sampling distribution.
The normal distribution
The normal distribution is:
Symmetric
Bell shaped
Continuous
Properties
Property 1
If a random variable X is normally distributed with mean μ_X and variance σ²_X, we write
X ~ N(μ_X, σ²_X)
The central limit theorem for the sample mean is written as
X̄ ~ N(μ_X̄, σ²_X̄)
where
μ_X̄ = μ_X and σ²_X̄ = σ²_X / n
Property 2
The normal distribution is completely characterized by two parameters: μ_X and σ²_X. For every different value of the mean and variance we have a different normal distribution. Different means change the center and different variances change the spread of the distribution.
Property 3
The normal distribution is symmetric, with 0.5 probability on each side of the mean. And mean = mode = median for the normal distribution.
Property 4
If Y = a + bX, then E(Y) = a + bμ_X and Var(Y) = b²σ²_X, and
Y ~ N(a + bμ_X, b²σ²_X)
Special case
Setting a = −μ_X/σ_X and b = 1/σ_X, we have
(X − μ_X)/σ_X ~ N(0, 1)
Property
Any linear combination of normal random variables is normally distributed. It is from this result that X̄ is normally distributed even if the sample size is not large, as long as the population, and hence each Xi, is normally distributed.
The sample mean is
X̄ = (1/n) ∑_{i=1}^n X_i
This is the same as the population mean except that the sample mean is a random variable while the
population mean is a constant.
Since a sample statistic is a random variable, and since the sample mean is a sample statistic, it is a random variable and it has a mean and variance.
E(X̄) = E((1/n) ∑_{i=1}^n X_i)
E(X̄) = (1/n) E(∑_{i=1}^n X_i)
E(X̄) = (1/n) ∑_{i=1}^n E(X_i)
E(X̄) = (1/n) ∑_{i=1}^n μ_X
E(X̄) = (1/n) n μ_X
E(X̄) = μ_X
The mean of the sample mean is the population mean. Here use is made of the fact that each random variable (observation) has the same mean and variance as the population, and the population mean is a constant.
Var(X̄) = var((1/n) ∑_{i=1}^n X_i)
var(X̄) = (1/n²) ∑_{i=1}^n var(X_i) (using the independence of the observations)
var(X̄) = (1/n²) ∑_{i=1}^n σ²_X
var(X̄) = (1/n²) n σ²_X
var(X̄) = σ²_X / n
The variance of the sample mean is simply the population variance of the variable X (whose value is not known to us) divided by the sample size.
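The var(X̄) = σ²_X/n result can be checked by simulation; the population (uniform on [0, 1), so σ² = 1/12), the sample size, and the number of repetitions below are all illustrative choices.

```python
import random

random.seed(2)

# Population: uniform on [0, 1), so sigma^2 = 1/12; sample size n = 25.
n, reps = 25, 20_000

# Draw many samples of size n and record each sample mean
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

grand = sum(means) / reps
var_of_mean = sum((m - grand) ** 2 for m in means) / reps

print(round(var_of_mean, 4))   # close to (1/12)/25 ~ 0.0033
```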
Having derived the mean of the sample mean and its variance (standard deviation), the next question is
what is its exact shape or how is it distributed. The central limit theorem provides an answer to this.
The central limit theorem states that the random variable
(X̄ − μ_X) / √(σ²_X / n)
has an approximately normal distribution. The approximation improves with the size of n. The sampling distribution of the sample mean follows a normal distribution when n is large (n > 30). This holds regardless of the distribution of the population. On the other hand, even if the sample size is small, if the population has a normal distribution, then the result holds exactly.
As is done for the sample mean, it is also possible to derive the sampling distribution of the sample variance.
S²_X = (1/(n−1)) ∑_{i=1}^n (x_i − X̄)²
The sample variance differs from the population variance formula in only one aspect: the denominator is n − 1 instead of n. This has its own advantages, and that is the reason for it.
If we draw a random sample of size n repeatedly we will have different sample variances. As such, the
sample variance is a random variable and has its own probability distribution. What distribution does it
follow?
Let Z1, Z2, Z3, …, Zn be n independent random variables, each with a standard normal distribution. Then
∑_{i=1}^n Z_i² ~ χ²_n
where χ²_n denotes the chi-squared distribution. If we square and sum n standard normal variables, we have a new random variable which has a chi-squared distribution. The shape of the chi-squared distribution depends on n, which is known as the degrees of freedom.
E(χ²_n) = n
For the sample variance from a normal population,
(n − 1) S²_X / σ²_X ~ χ²_{n−1}
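A small simulation (illustrative degrees of freedom and repetition count) confirms that sums of squared standard normals average out to the degrees of freedom:

```python
import random

random.seed(3)

n, reps = 5, 50_000   # n standard normals per draw; both values illustrative

# Sum of n squared standard normals follows a chi-squared distribution
# with n degrees of freedom, whose mean is n.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]

mean = sum(draws) / reps
print(round(mean, 1))   # close to n = 5
```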
The t-Distribution
Let z and χ²_v be two independent random variables, where z ~ N(0, 1) and χ²_v is chi-squared distributed with v degrees of freedom. Then, if we form the new random variable
t = z / √(χ²_v / v)
it follows the t-distribution with v degrees of freedom, and
E(t_v) = 0
It was observed that the sample mean is normally distributed. That is,
(X̄ − μ_X) / (σ_X / √n) ~ N(0, 1)
But, usually we don’t know the population standard deviation. Let’s assume we can replace it with the
sample standard deviation, Sx. Now, we have a new random variable given by
(X̄ − μ_X) / (s_X / √n)
But this random variable is not normally distributed. Rather, it has the t-distribution with n − 1 degrees of freedom. This is clear from the following:
(X̄ − μ_X) / (s_X / √n) = [(X̄ − μ_X) / (σ_X / √n)] / √[((n − 1) s²_X / σ²_X) / (n − 1)]
In other words, the random variable (X̄ − μ_X) / (s_X / √n) is put in the form
t_{n−1} = z / √(χ²_{n−1} / (n − 1))
That is, it is a random variable that follows the t-distribution, with n − 1 degrees of freedom.
The F distribution
Consider two independent random variables, χ²_{v1} and χ²_{v2}, which are both chi-squared distributed. A new random variable can be formed as follows
F = (χ²_{v1} / v1) / (χ²_{v2} / v2)
This new variable has the F distribution. There are two parameters that characterize this distribution: the numerator degrees of freedom (v1) and the denominator degrees of freedom (v2).
Estimation
Now, any population distribution can be characterized by a few parameters. The normal distribution, for
instance, is characterized by its mean and variance. The exponential distribution is characterized by λ and
so on. The interest is in making an estimate of these population parameters.
Potentially, there is an infinite number of parameter estimators. An estimator is a rule that indicates how to compute an estimate of a population parameter from the given data. An actual number obtained after plugging data into the estimator function is an estimate. An estimate may be a point or an interval estimate. From an infinite number of potential estimators we have to choose the one that provides us the best estimate based on some criteria.
Denote the population parameter of interest by θ and denote the estimator by θ̂. There are many desirable characteristics of an estimator. Typically, the properties used to judge between estimators are classified as
Finite sample properties: properties that hold regardless of the size of the sample used to generate the estimate
Asymptotic properties: often, finite sample properties of estimators are difficult to derive and we must resort to asymptotic properties that hold only when the sample size is large.
Criteria 1: Unbiasedness
An estimator is said to be unbiased when the expected value of the estimator equals the true value of the
population parameter. That is,
E(θ̂) = θ
Potentially there are several unbiased estimators of the true population parameter. Hence, we need more
criteria.
For example, consider two estimators of the population variance:
S²_X = (1/(n−1)) ∑_{i=1}^n (x_i − X̄)²
Ŝ²_X = (1/n) ∑_{i=1}^n (x_i − X̄)²
The first is unbiased, while for the second
E(Ŝ²_X) = ((n−1)/n) var(X)
so it is biased.
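A simulation makes the bias of the n-divisor estimator visible; the population (standard normal, true variance 1), sample size, and repetition count below are illustrative.

```python
import random

random.seed(4)

n, reps = 5, 50_000   # small samples exaggerate the bias

total = 0.0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]   # true variance is 1.0
    xbar = sum(sample) / n
    total += sum((x - xbar) ** 2 for x in sample) / n   # n in the denominator

avg = total / reps
print(round(avg, 2))   # close to (n-1)/n = 0.8, not 1.0: the estimator is biased
```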
Criteria 2: Efficiency
If two estimators are both unbiased, centered on the true parameter, they will give us the correct value on average. But the two estimators may have different variances. Then we choose the one whose estimates are more concentrated around the true value. We say that one estimator (θ̂1) is more efficient than another (θ̂2) if it has a smaller sampling variance. That is,
var(θ̂1) < var(θ̂2)
Often there is a tradeoff involved: one estimator may be efficient but biased and another might be unbiased but less efficient. Which one is best under this situation depends on several considerations. In general, one way of formalizing these points is by defining a loss function that explicitly shows the loss or costs associated with deviations between θ̂ and θ. Let the loss function be L(θ̂, θ), which measures the loss when the population parameter is θ and we estimate it to be θ̂. The most commonly used loss function is the mean squared error loss function, where
MSE = E(θ̂ − θ)²
MSE = var(θ̂) + bias²
For the two variance estimators above,
S²_X = (1/(n−1)) ∑_{i=1}^n (x_i − X̄)²
Ŝ²_X = (1/n) ∑_{i=1}^n (x_i − X̄)²
the first has zero bias, while the second trades some bias for a smaller variance.
Asymptotic (infinite-sample) properties:
Asymptotic unbiasedness
An estimator is asymptotically unbiased if
lim_{n→∞} E(θ̂) = θ
Consistency
An estimator is said to be consistent when the probability that the estimator differs from the actual population parameter by more than a small value ε goes to zero as n goes to infinity.
lim_{n→∞} p(|θ̂ − θ| ≥ ε) = 0 for any ε > 0
Asymptotic efficiency
An estimator is said to be asymptotically efficient if it converges to the true population parameter value
more quickly than any other estimator
Sufficient estimator
An estimator is sufficient when the conditional joint distribution of the sample observations, given the value of the estimator, doesn't depend on the population parameter value.
Method of moments
The method seeks to equate the moments implied by a statistical model of the population distribution with
the actual moments observed in the sample. It assumes that relations that hold for the population also hold
for the sample.
The method of moments doesn't necessarily generate unique estimators. One can start with the mean or the variance, etc., and still get different estimators. This, however, has benefits: the different estimators should give approximately the same result if the assumed distribution of the population is correct.
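A minimal method-of-moments sketch for an assumed exponential population (the true λ and sample size below are illustrative): since E(X) = 1/λ, equating the sample mean to this population moment gives λ̂ = 1/x̄.

```python
import random

random.seed(5)

# For an exponential population, E(X) = 1/lambda. The method of moments
# equates the sample mean to this population moment: xbar = 1/lambda_hat.
true_lam = 2.0
sample = [random.expovariate(true_lam) for _ in range(10_000)]

xbar = sum(sample) / len(sample)
lam_hat = 1 / xbar

print(round(lam_hat, 1))   # close to the true lambda = 2.0
```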
Method of maximum likelihood
The main idea of maximum likelihood is that the data we observe are more likely to be associated with some distributions than with others. Typically, we start by assuming that our data are drawn from some population distribution with unknown parameter θ. We then select our estimate of this parameter so as to maximize the likelihood of seeing the data we actually saw.
Formally, consider a random drawing of the sample observations X1, X2, …, Xn from a distribution f(X|θ) with unknown parameter θ. The likelihood of observing the sample we saw is the joint density function
L(θ) = f(X1|θ) f(X2|θ) … f(Xn|θ)
This is known as the likelihood function. For convenience we convert this into logarithmic form and get the log-likelihood function. Then we maximize
ln L(θ) = ln f(X1|θ) + ln f(X2|θ) + … + ln f(Xn|θ)
with respect to θ.
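The maximization step can be sketched numerically. Assuming exponential data (an illustrative choice), the log-likelihood is ln L(λ) = n ln(λ) − λ ∑x_i; a simple grid search over candidate λ values locates the maximizer, which for this family also has the closed form 1/x̄.

```python
import math
import random

random.seed(6)

# Illustrative data from an exponential distribution with rate 1.5
data = [random.expovariate(1.5) for _ in range(5_000)]

# For exponential data: ln L(lam) = n*ln(lam) - lam*sum(x_i)
def log_likelihood(lam):
    return len(data) * math.log(lam) - lam * sum(data)

grid = [i / 100 for i in range(50, 301)]   # candidate lambdas 0.50 .. 3.00
lam_mle = max(grid, key=log_likelihood)

# The exponential MLE also has the closed form 1/xbar; the two should agree
# up to the grid spacing of 0.01.
print(round(lam_mle, 2), round(len(data) / sum(data), 2))
```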
Properties
If an ML estimator is found to be unbiased, it is the most efficient of all possible estimators. If we can show the estimator is unbiased, we know it is the best.
Almost all ML estimators are consistent
Most ML estimators follow a normal distribution when the sample size is large. That is, they are asymptotically normal.
Method of least squares
Though it is not as general as the other two methods, least squares is also widely used in some situations, especially in relation to linear models. It selects the estimator that minimizes the sum of squared errors (SSE).
For instance, if we have n random variables, Y1, Y2, …, Yn, we can represent each random observation as
Y_i = μ_Y + ε_i
since each observation provides an unbiased estimator of the mean value of the population, because
E(ε_i) = 0
and
ε_i = Y_i − μ_Y
The sum of squared errors is
∑_i ε_i² = ∑_i (Y_i − μ_Y)²
Minimizing this with respect to μ_Y gives
μ̂_Y = ∑_i Y_i / n
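A tiny grid search (with made-up data) makes this result visible: the value of m that minimizes ∑(y_i − m)² is exactly the sample mean.

```python
# Minimizing SSE(m) = sum((y_i - m)^2) over m recovers the sample mean.
ys = [2.0, 3.0, 5.0, 6.0]   # illustrative observations

def sse(m):
    return sum((y - m) ** 2 for y in ys)

grid = [i / 1000 for i in range(0, 10_001)]   # candidates 0.000 .. 10.000
m_star = min(grid, key=sse)

print(m_star)                # 4.0: the SSE-minimizing value
print(sum(ys) / len(ys))     # 4.0: the sample mean
```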
Often we want to develop a range of numbers within which we expect the population parameter to lie with a certain level of confidence. Interval estimation is intended for this purpose. Put simply, interval estimation involves determining an upper and a lower bound within which the population parameter resides with a certain level of confidence. The degree of our certainty about the estimate is referred to as the confidence coefficient and is denoted by 1 − α. The confidence interval is then the 100(1 − α)% confidence interval.
An interval estimate of the population mean, μ_X, consists of two bounds within which we expect μ_X to reside. That is,
LB ≤ μ_X ≤ UB
where LB and UB are the lower bound and upper bound, respectively.
The probability that μ_X lies within the provided interval is known as the confidence coefficient and is denoted by 1 − α. We call α the significance level. For a specific value of α we refer to LB ≤ μ_X ≤ UB as the 100(1 − α)% confidence interval.
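A large-sample sketch of such an interval for the mean, using the normal critical value 1.96 for a 95% confidence level; the simulated data (true mean 10, true sd 2) and sample size are illustrative assumptions.

```python
import math
import random

random.seed(7)

# Illustrative data: n = 400 draws from a population with mean 10, sd 2
sample = [random.gauss(10, 2) for _ in range(400)]
n = len(sample)

xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))   # sample sd

half = 1.96 * s / math.sqrt(n)   # half-width of a 95% interval
lb, ub = xbar - half, xbar + half
print(round(lb, 2), round(ub, 2))   # the lower and upper bounds
```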
The model without error is
y = f(x)
With a random error term,
y = f(x) + ε
In this model, the value of y is also affected by other random factors: factors other than x.
Because the error term is a random variable, and y is a linear function of this random error, then y is also
a random variable.
For every value of x, there are a number of possible values of y, which vary around the mean value of y; the distance from this mean value is what the error term represents.
Assumptions
The additional assumptions are needed for hypothesis testing related to the estimated population parameters (the coefficient of x and the constant).
Assumption 2: the error term variance is the same for all observations
var(ε_i) = σ²_ε for all i
When this is not satisfied, there is said to be heteroscedasticity (as against homoscedasticity). This assumption means that the errors are drawn from distributions with the same variance.
Assumption 3: the covariance of the error terms for any two observations is zero
cov(ε_i, ε_j) = 0 for i ≠ j
The two assumptions together imply that the errors are identically distributed independent random variables.