For example, in a class we have 5 students of ages 13, 14, 14, 15, and 34; the average comes to 18.
This has happened because one value has distorted the whole picture.
If, on the contrary, we take the variance and standard deviation, this will be revealed at once.
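As a quick sketch of the point above (using Python's standard `statistics` module with the same five ages), the mean hides the outlier while the standard deviation exposes it:

```python
import statistics

# The class from the example: one age (34) distorts the mean.
ages = [13, 14, 14, 15, 34]

mean = statistics.mean(ages)       # 18 -- pulled upward by the outlier
spread = statistics.pstdev(ages)   # population standard deviation, ~8.02

print(mean, round(spread, 2))
```

A spread of about 8 years for a group whose mean is 18 immediately signals that the ages are not clustered around the average.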
In other words, if you cut out a probability distribution (like a bell curve) then you could balance it on your finger right at the value of
its mean.
For example, picture a probability distribution where there is a 20% chance of getting a 5 and an 80% chance of getting a 10.
In that case, the mean is 0.2*5 + 0.8*10 = 1 + 8 = 9. As you can see, there is 80% "mass" exactly 1 away from 9 and there is 20% mass
exactly 4 away from 9. 80%*1 = 20%*4. If the probability distribution were a lever (like a see-saw) with the fulcrum at 9, the 20% weight
sitting 4 units out on one side and the 80% weight 1 unit out on the other, the lever would be perfectly balanced.
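The see-saw balance above can be checked directly; a minimal sketch using the same two-outcome distribution:

```python
# Distribution from the example: P(5) = 0.2, P(10) = 0.8.
dist = {5: 0.2, 10: 0.8}

mean = sum(x * p for x, p in dist.items())   # 0.2*5 + 0.8*10 = 9.0

# "Torque" on each side of the mean balances out, like a see-saw:
left = sum(p * (mean - x) for x, p in dist.items() if x < mean)    # 0.2 * 4
right = sum(p * (x - mean) for x, p in dist.items() if x > mean)   # 0.8 * 1
```

Both sides come out to 0.8, which is exactly the balance condition 80%*1 = 20%*4 described in the text.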
For example, if there were a 100% chance of getting a 5, then the variable would not be random. It would be deterministic. It would
thus have 0 variance. However, if there were only an 80% chance of getting a 5, a 10% chance of getting a 4 and a 10% chance of
getting a 6, then the random variable would have a variance of 0.2, which reflects the 10% and 10% chances on the left and right of 5.
In other words, 10%*1² + 10%*1² = 0.2 (each outlying value sits at squared distance 1 from the mean). If you change those 10% chances
to 5%, you get a variance of 0.1 because 2*5% = 0.1. The variance is somehow measuring the "spread" of the data.
It's measuring the amount of noise you're going to get around your mean (the mean in this case is 5).
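The variance arithmetic above can be reproduced in a few lines; a sketch using the 80/10/10 distribution from the example:

```python
# P(4) = 0.1, P(5) = 0.8, P(6) = 0.1, as in the example.
dist = {4: 0.1, 5: 0.8, 6: 0.1}

mean = sum(x * p for x, p in dist.items())                     # 5.0
variance = sum(p * (x - mean) ** 2 for x, p in dist.items())   # 0.1*1 + 0 + 0.1*1 = 0.2
```

Halving the two side probabilities to 5% halves the variance to 0.1, matching the text's "2*5% = 0.1".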
The variance is in square units because it is actually the "expected" squared distance from the mean. In other words, if the
variance is 1, we expect that if we square the distance from the mean, we'll get a value around 1.
The standard deviation just converts this expectation back into our old units.
That way, if we have a variance of 4, we'll have a standard deviation of 2. It's just more convenient to express the spread in the
original units rather than square units.
A probability density function describes the relative likelihood for a random variable to take on a given value in the
observation space.
The probability of the random variable falling within a given set is given by the integral of its density over that set.
A distribution function describes the range of possible values that a random variable can attain and
the probability that the value of the random variable lies within any (measurable) subset of that range.
The standard deviation (represented by the Greek letter sigma, σ) shows how much variation or "dispersion" exists from the average
(mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean; a high standard
deviation indicates that the data points are spread out over a large range of values.
Interpretation and application
[Figure: two sample populations with the same mean and different standard deviations.]
A large standard deviation indicates that the data points are far from the mean and a small standard deviation indicates that they are
clustered closely around the mean.
For example, each of the three populations {0, 0, 14, 14}, {0, 6, 8, 14} and {6, 6, 8, 8} has a mean of 7. Their standard deviations are
7, 5, and 1, respectively.
The third population has a much smaller standard deviation than the other two because its values are all close to 7.
It will have the same units as the data points themselves.
If, for instance, the data set {0, 6, 8, 14} represents the ages of a population of four siblings in years, the standard deviation is 5
years.
As another example, the population {1000, 1006, 1008, 1014} may represent the distances traveled by four athletes, measured in
meters. It has a mean of 1007 meters, and a standard deviation of 5 meters.
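The standard deviations quoted for these small populations can be verified directly; a sketch with Python's `statistics` module:

```python
import statistics

# The three populations from the example, each with mean 7.
populations = [[0, 0, 14, 14], [0, 6, 8, 14], [6, 6, 8, 8]]
sds = [statistics.pstdev(p) for p in populations]   # 7.0, 5.0, 1.0

# The athletes' distances: same spread as {0, 6, 8, 14}, shifted by 1000 m.
athletes = [1000, 1006, 1008, 1014]
athlete_mean = statistics.mean(athletes)     # 1007
athlete_sd = statistics.pstdev(athletes)     # 5.0
```

Note that shifting every value by a constant (here, by 1000) changes the mean but leaves the standard deviation untouched.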
Standard deviation may serve as a measure of uncertainty. In physical science, for example, the reported standard deviation of a
group of repeated measurements gives the precision of those measurements. When deciding whether measurements agree with a
theoretical prediction, the standard deviation of those measurements is of crucial importance: if the mean of the measurements is
too far away from the prediction (with the distance measured in standard deviations), then the theory being tested probably needs to
be revised. This makes sense, since such measurements fall outside the range of values that could reasonably be expected to occur
if the prediction were correct and the standard deviation were appropriately quantified. See prediction interval.
While the standard deviation does measure how far typical values tend to be from the mean, other measures are available. An
example is the mean absolute deviation, which might be considered a more direct measure of average distance, compared to
the root mean square distance inherent in the standard deviation.
Application examples
The practical value of understanding the standard deviation of a set of values is in appreciating how much variation there is from the
"average" (mean).
Climate
As a simple example, consider the average daily maximum temperatures for two cities, one inland and one on the coast. It is helpful
to understand that the range of daily maximum temperatures for cities near the coast is smaller than for cities inland. Thus, while
these two cities may each have the same average maximum temperature, the standard deviation of the daily maximum temperature
for the coastal city will be less than that of the inland city as, on any particular day, the actual maximum temperature is more likely to
be farther from the average maximum temperature for the inland city than for the coastal one.
Particle physics
Particle physics uses a standard of "5 sigma" for the declaration of a discovery. [3] At five-sigma there is only one chance in nearly
two million that a random fluctuation would yield the result. This level of certainty prompted the announcement that a particle
consistent with the Higgs boson has been discovered in two independent experiments at CERN.[4]
Sports
Another way of seeing it is to consider sports teams. In any set of categories, there will be teams that rate highly at some things and
poorly at others. Chances are, the teams that lead in the standings will not show such disparity but will perform well in most
categories. The lower the standard deviation of their ratings in each category, the more balanced and consistent they will tend to be.
Teams with a higher standard deviation, however, will be more unpredictable. For example, a team that is consistently bad in most
categories will have a low standard deviation. A team that is consistently good in most categories will also have a low standard
deviation. However, a team with a high standard deviation might be the type of team that scores a lot (strong offense) but also
concedes a lot (weak defense), or, vice versa, that might have a poor offense but compensates by being difficult to score on.
Trying to predict which teams will win on any given day may include looking at the standard deviations of the various team "stats"
ratings, in which anomalies can match strengths against weaknesses to reveal which factors may be stronger indicators of eventual
scoring outcomes.
In racing, a driver is timed on successive laps. A driver with a low standard deviation of lap times is more consistent than a driver
with a higher standard deviation. This information can be used to help understand where opportunities might be found to reduce lap
times.
Finance
In finance, standard deviation is often used as a measure of the risk associated with price-fluctuations of a given asset (stocks,
bonds, property, etc.), or the risk of a portfolio of assets [5] (actively managed mutual funds, index mutual funds, or ETFs). Risk is an
important factor in determining how to efficiently manage a portfolio of investments because it determines the variation in returns on
the asset and/or portfolio and gives investors a mathematical basis for investment decisions (known as mean-variance optimization).
The fundamental concept of risk is that as it increases, the expected return on an investment should increase as well, an increase
known as the "risk premium." In other words, investors should expect a higher return on an investment when that investment carries
a higher level of risk or uncertainty. When evaluating investments, investors should estimate both the expected return and the
uncertainty of future returns. Standard deviation provides a quantified estimate of the uncertainty of future returns.
For example, let's assume an investor had to choose between two stocks. Stock A over the past 20 years had an average return of
10 percent, with a standard deviation of 20 percentage points (pp) and Stock B, over the same period, had average returns of 12
percent but a higher standard deviation of 30 pp. On the basis of risk and return, an investor may decide that Stock A is the safer
choice, because Stock B's additional two percentage points of return is not worth the additional 10 pp standard deviation (greater
risk or uncertainty of the expected return). Stock B is likely to fall short of the initial investment (but also to exceed the initial
investment) more often than Stock A under the same circumstances, and is estimated to return only two percent more on average.
In this example, Stock A is expected to earn about 10 percent, plus or minus 20 pp (a range of 30 percent to −10 percent), in about
two-thirds of future yearly returns. When considering more extreme possible returns or outcomes in future, an investor should
expect results of as much as 10 percent plus or minus 60 pp, or a range from 70 percent to −50 percent, which includes outcomes
for three standard deviations from the average return (about 99.7 percent of probable returns).
Calculating the average (or arithmetic mean) of the return of a security over a given period will generate the expected return of the
asset. For each period, subtracting the expected return from the actual return results in the difference from the mean. Squaring the
difference in each period and taking the average gives the overall variance of the return of the asset. The larger the variance, the
greater risk the security carries. Finding the square root of this variance will give the standard deviation of the investment tool in
question.
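The procedure just described (mean, squared deviations, variance, square root) can be sketched in a few lines; the return series here is hypothetical, chosen only for illustration:

```python
import math

# Hypothetical yearly returns, as fractions (illustrative data only).
returns = [0.10, 0.05, -0.02, 0.12, 0.00]

expected = sum(returns) / len(returns)                           # arithmetic mean
variance = sum((r - expected) ** 2 for r in returns) / len(returns)
std_dev = math.sqrt(variance)                                    # back to return units
```

With this data the expected return is 5 percent and the standard deviation about 5.4 pp; a larger variance would signal a riskier security.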
Population standard deviation is used to set the width of Bollinger Bands, a widely adopted technical analysis tool. For example, the
upper Bollinger Band is given as x̄ + nσₓ. The most commonly used value for n is 2; there is about a five percent chance of going
outside, assuming a normal distribution of returns.
Geometric interpretation
To gain some geometric insight and clarification, we will start with a population of three values, x1, x2, x3. This defines a point P =
(x1, x2, x3) in R³. Consider the line L = {(r, r, r) : r ∈ R}. This is the "main diagonal" going through the origin. If our three given values
were all equal, then the standard deviation would be zero and P would lie on L. So it is not unreasonable to assume that the
standard deviation is related to the distance of P to L. And that is indeed the case. To move orthogonally from L to the point P, one
begins at the point

    M = (x̄, x̄, x̄),

whose coordinates are the mean of the values we started out with. A little algebra shows that the distance
between P and M (which is the same as the orthogonal distance between P and the line L) is equal to the standard deviation of
the vector x1, x2, x3, multiplied by the square root of the number of dimensions of the vector (3 in this case).
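This distance identity can be checked numerically; a sketch with an arbitrary three-value population (the values are my own example, not from the text):

```python
import math
import statistics

x = [1, 2, 6]                        # a population of three values: the point P
m = statistics.mean(x)               # 3 -> the point M = (3, 3, 3) on the diagonal L

# Orthogonal distance from P to M (equivalently, from P to the line L):
dist = math.sqrt(sum((xi - m) ** 2 for xi in x))

# The claim: this equals the population standard deviation times sqrt(n).
sd_times_sqrt_n = statistics.pstdev(x) * math.sqrt(len(x))
```

Both quantities come out to √14 ≈ 3.742, confirming the relationship for this population.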
Chebyshev's inequality
Main article: Chebyshev's inequality
An observation is rarely more than a few standard deviations away from the mean.
Chebyshev's inequality ensures that, for all distributions for which the standard deviation is defined,
the amount of data within a number of standard deviations of the mean is at least as much as given in the following table.
Minimum population   Distance from mean
50%                  √2
75%                  2
89%                  3
94%                  4
96%                  5
97%                  6
[6]
Dark blue is less than one standard deviation from the mean. For the normal distribution, this accounts for 68.27 percent of the
set; while two standard deviations from the mean (medium and dark blue) account for 95.45 percent; three standard deviations
(light, medium, and dark blue) account for 99.73 percent; and four standard deviations account for 99.994 percent. The two
points of the curve that are one standard deviation from the mean are also the inflection points.
The central limit theorem says that the distribution of an average of many independent, identically distributed random
variables tends toward the famous bell-shaped normal distribution with a probability density function of

    f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}},

where μ is the expected value of the random variables, σ equals their distribution's standard deviation divided by n^{1/2},
and n is the number of random variables. The standard deviation therefore is simply a scaling variable that adjusts
how broad the curve will be, though it also appears in the normalizing constant.
If a data distribution is approximately normal, then the proportion of data values within z standard deviations of the
mean is given by

    Proportion = erf(z / √2),

where erf is the error function. If a data distribution is approximately normal, then about 68 percent of the data
values are within one standard deviation of the mean (mathematically, μ ± σ, where μ is the arithmetic mean),
about 95 percent are within two standard deviations (μ ± 2σ), and about 99.7 percent lie within three standard
deviations (μ ± 3σ).
This is known as the 68-95-99.7 rule, or the empirical rule.
For various values of z, the percentage of values expected to lie inside and outside the symmetric interval,
CI = (−zσ, zσ), are as follows:

z          Inside   Outside   Fraction outside
1.959964   95%      5%        1 / 20
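The erf formula above reproduces both the 68-95-99.7 rule and the 95% confidence value in the table; a short sketch:

```python
import math

def proportion_within(z):
    """Fraction of a normal distribution within z standard deviations of the mean."""
    return math.erf(z / math.sqrt(2))

# The empirical rule: ~68%, ~95%, ~99.7% for z = 1, 2, 3.
rule = [round(proportion_within(z), 4) for z in (1, 2, 3)]

# The tabulated z for a 95% interval:
ci95 = proportion_within(1.959964)
```

`rule` evaluates to [0.6827, 0.9545, 0.9973], the familiar three percentages, and `ci95` is 0.95 to several decimal places.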
COVARIANCE
In probability theory and statistics, covariance is a measure of how much two random variables change together.
If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for
the smaller values,
i.e., the variables tend to show similar behavior, the covariance is a positive number.
In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the
variables tend to show opposite behavior, the covariance is negative.
The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of
the covariance is not that easy to interpret.
The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the
linear relation.
A distinction must be made between
(1) the covariance of two random variables, which is a population parameter that can be seen as a property of the joint probability
distribution, and
(2) the sample covariance, which serves as an estimated value of the parameter.
VARIANCE
In probability theory and statistics,
the variance is a measure of how far a set of numbers is spread out.
It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value).
In particular, the variance is one of the moments of a distribution.
In that context, it forms part of a systematic approach to distinguishing between probability distributions. While other such
approaches have been developed, those based on moments are advantageous in terms of mathematical and computational
simplicity.
The variance is a parameter describing in part either the actual probability distribution of an observed population of
numbers, or
the theoretical probability distribution of a sample (a not-fully-observed population) of numbers.
In the latter case a sample of data from such a distribution can be used to construct an estimate of its variance: in the simplest
cases this estimate can be the sample variance, defined below.
Definition
The covariance between two jointly distributed real-valued random variables x and y with finite second moments is defined[1] as

    Cov(x, y) = E[(x − E[x])(y − E[y])],

where E[x] is the expected value of x, also known as the mean of x. By using the linearity property of expectations, this can be
simplified to

    Cov(x, y) = E[xy] − E[x] E[y].

For random vectors X and Y (of dimension m and n respectively) the m×n cross-covariance matrix is equal to

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])^T],

where M^T is the transpose of the vector (or matrix) M.
The (i, j)-th element of this matrix is equal to the covariance Cov(xᵢ, yⱼ) between the i-th scalar component of X and
the j-th scalar component of Y. In particular, Cov(Y, X) is the transpose of Cov(X, Y).
Variance is a special case of the covariance when the two variables are identical:

    Var(x) = Cov(x, x).
If x, y, W, and V are real-valued random variables and a, b, c, d are constants ("constant" in this
context means non-random), then the following facts are a consequence of the definition of
covariance:

    Cov(x, a) = 0
    Cov(x, x) = Var(x)
    Cov(x, y) = Cov(y, x)
    Cov(ax, by) = ab Cov(x, y)
    Cov(x + a, y + b) = Cov(x, y)
    Cov(ax + by, cW + dV) = ac Cov(x, W) + ad Cov(x, V) + bc Cov(y, W) + bd Cov(y, V)

For sequences x1, ..., xn and y1, ..., ym of random variables, we have

    Cov(Σᵢ xᵢ, Σⱼ yⱼ) = Σᵢ Σⱼ Cov(xᵢ, yⱼ).

For a sequence x1, ..., xn of random variables, and constants a1, ..., an, we have

    Var(Σᵢ aᵢ xᵢ) = Σᵢ aᵢ² Var(xᵢ) + 2 Σ_{i<j} aᵢ aⱼ Cov(xᵢ, xⱼ).

Let X be a random vector, let Σ denote its covariance matrix, and let A be a
matrix that can act on X. The result of applying this matrix to X is a new vector with
covariance matrix

    Cov(AX) = A Σ A^T.

This is a direct result of the linearity of expectation and is useful when applying
a linear transformation, such as a whitening transformation, to a vector.
Uncorrelatedness and independence
If x and y are independent, then their covariance is zero. This follows because,
under independence,

    E[xy] = E[x] E[y],

so Cov(x, y) = E[xy] − E[x]E[y] = 0.
The converse, however, is not generally true. For example, let x be uniformly
distributed in [−1, 1] and let y = x². Clearly, x and y are dependent, but

    Cov(x, y) = E[xy] − E[x]E[y] = E[x³] − E[x]·E[x²] = 0 − 0·E[x²] = 0.

QED.
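This "dependent but uncorrelated" example can be demonstrated numerically; a sketch approximating the uniform distribution on [−1, 1] by a symmetric grid of points:

```python
# x uniform on a symmetric grid over [-1, 1]; y = x^2 is fully determined by x.
xs = [i / 100 for i in range(-100, 101)]
ys = [x ** 2 for x in xs]

def mean(v):
    return sum(v) / len(v)

# Cov(x, y) = E[xy] - E[x]E[y]; here E[xy] = E[x^3] = 0 by symmetry.
cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
```

The covariance vanishes (up to floating-point noise) even though y is a deterministic function of x, illustrating that zero covariance does not imply independence.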
Calculating the sample covariance
The sample covariance of N observations of K variables is the K-by-K matrix with entries

    q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k),

which is an estimate of the covariance between variable j and variable k.
The sample mean and the sample covariance matrix are unbiased estimates of the mean and
the covariance matrix of the random vector X, a row vector whose jth element (j = 1, ..., K) is one of
the random variables. The reason the sample covariance matrix has N − 1 in the
denominator rather than N is essentially that the population mean E[X] is not known and is
replaced by the sample mean x̄; dividing by N − 1 corrects the resulting bias.
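The entry formula above translates directly into code; a minimal sketch for a single pair of variables:

```python
def sample_cov(xs, ys):
    """Unbiased sample covariance, with N - 1 in the denominator."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Toy data (my own example): y is exactly twice x.
q = sample_cov([1, 2, 3], [2, 4, 6])   # ((-1)(-2) + 0 + (1)(2)) / 2 = 2.0
```

Dividing by N instead of N − 1 would give 4/3 here, the biased (population-style) estimate.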
Comments
Other moments may also be defined. For example, the nth inverse moment about zero is E[X^{−n}] and the nth logarithmic
moment about zero is E[(ln X)^n].
Significance of the moments
The nth moment of a real-valued continuous function f(x) of a real variable about a value c is

    \mu_n' = \int_{-\infty}^{\infty} (x - c)^n f(x)\, dx.
It is possible to define moments for random variables in a more general fashion than moments for real values; see moments
in metric spaces. The moment of a function, without further explanation, usually refers to the above expression with c = 0.
Usually, except in the special context of the problem of moments, the function f(x) will be a probability density function.
The nth moment about zero of a probability density function f(x) is the expected value of X^n and is called a raw
moment or crude moment.[2] The moments about its mean μ are called central moments; these describe the shape of the
function, independently of translation.
If f is a probability density function, then the value of the integral above is called the nth moment of the probability distribution.
More generally, if F is a cumulative probability distribution function of any probability distribution, which may not have a density
function, then the nth moment of the probability distribution is given by the Riemann-Stieltjes integral

    E[X^n] = \int_{-\infty}^{\infty} x^n \, dF(x),

where X is a random variable that has this cumulative distribution F, and E is the expectation operator or mean.
When

    E[|X^n|] = \int_{-\infty}^{\infty} |x^n| \, dF(x) = \infty,

the moment is said not to exist. If the nth moment about any point exists, so does the (n − 1)th moment, and all
lower-order moments, about every point.
The zeroth moment of any probability density function is 1, since the area under any probability density
function must be equal to one.
Significance of moments (raw, central, standardized) and cumulants (raw, standardized), in connection with
named properties of distributions:

Moment order   Named property
1              mean
2              variance
3              skewness
4              kurtosis
5+             higher-order moments
Mean
Main article: Mean
The first raw moment is the mean.
Variance
Main article: Variance
The second central moment is the variance. Its positive square root is the standard deviation σ.
Normalized moments
The normalized nth central moment or standardized moment is the nth central moment divided by σⁿ; the
normalized nth central moment of x is E[(x − μ)ⁿ]/σⁿ. These normalized central moments are dimensionless
quantities, which represent the distribution independently of any linear change of scale.
Skewness
Main article: Skewness
The third central moment is a measure of the lopsidedness of the distribution; any symmetric distribution has a
third central moment of zero. The normalized third central moment is called the skewness.
Kurtosis
Main article: Kurtosis
The fourth central moment is a measure of whether the distribution is tall and skinny or short and squat, compared
to the normal distribution of the same variance. Since it is the expectation of a fourth power, the fourth central
moment, where defined, is always non-negative; and except for a point distribution, it is always strictly positive. The
fourth central moment of a normal distribution is 3σ⁴.
The kurtosis is defined to be the normalized fourth central moment minus 3. (Equivalently, as in the next section, it
is the fourth cumulant divided by the square of the variance.) Some authorities[3][4] do not subtract three, but it is
usually more convenient to have the normal distribution at the origin of coordinates. If a distribution has a peak at
the mean and long tails, the fourth moment will be high and the kurtosis positive (leptokurtic); and conversely; thus,
bounded distributions tend to have low kurtosis (platykurtic).
The kurtosis κ can be positive without limit, but must be greater than or equal to γ² − 2, where γ is the skewness;
equality only holds for binary distributions. For unbounded skew distributions not too far from normal, κ tends to be
somewhere in the area of γ² and 2γ².
The inequality can be proven by considering

    E[(T² − aT − 1)²],

where T = (X − μ)/σ. This is the expectation of a square, so it is non-negative whatever a is; on the other hand,
it is also a quadratic equation in a. Its discriminant must be non-positive, which gives the required relationship.
Mixed moments
Mixed moments are moments involving multiple variables.
Some examples are covariance, coskewness and cokurtosis. While there is a unique covariance, there are
multiple co-skewnesses and co-kurtoses.
Higher moments
High-order moments are moments beyond 4th-order moments. The higher the moment, the harder it is to
estimate, in the sense that larger samples are required in order to obtain estimates of similar quality.[citation needed]
Cumulants
The first moment and the second and third unnormalized central moments are additive in the sense that
if X and Y are independent random variables then

    mean(X + Y) = mean(X) + mean(Y),
    Var(X + Y) = Var(X) + Var(Y),
    μ₃(X + Y) = μ₃(X) + μ₃(Y).

(These can also hold for variables that satisfy weaker conditions than independence. The first
always holds; if the second holds, the variables are called uncorrelated.)
In fact, these are the first three cumulants and all cumulants share this additivity property.
Sample moments
The moments of a population can be estimated using the sample k-th moment

    \frac{1}{n} \sum_{i=1}^{n} X_i^k.

Partial moments
Partial moments are sometimes referred to as "one-sided moments." The nth-order lower
and upper partial moments with respect to a reference point r may be expressed as

    \mu_n^-(r) = \int_{-\infty}^{r} (r - x)^n f(x)\, dx,
    \mu_n^+(r) = \int_{r}^{\infty} (x - r)^n f(x)\, dx.

Partial moments are normalized by being raised to the power 1/n. The upside
potential ratio may be expressed as a ratio of a first-order upper partial
moment to a normalized second-order lower partial moment.
Central moments in metric spaces
Let (M, d) be a metric space, and let B(M) be the Borel σ-algebra on M, the σ-
algebra generated by the d-open subsets of M. (For technical reasons, it is
also convenient to assume that M is a separable space with respect to
the metric d.) Let 1 ≤ p ≤ +∞.
The pth central moment of a measure μ on the measurable space (M, B(M))
about a given point x₀ in M is defined to be

    \int_M d(x, x_0)^p \, d\mu(x).

μ is said to have finite pth central moment if the pth central moment
of μ about x₀ is finite for some x₀ ∈ M.
This terminology for measures carries over to random variables in the
usual way: if (Ω, Σ, P) is a probability space and X : Ω → M is a random
variable, then the pth central moment of X about x₀ ∈ M is defined to be

    E[d(X, x_0)^p],

and X has finite pth central moment if the pth central moment
of X about x₀ is finite for some x₀ ∈ M.
INDEPENDENCE
In probability theory, to say that two events are independent (alternatively statistically independent, marginally
independent or absolutely independent[1]) means that the occurrence of one does not affect the probability of the other. Similarly,
two random variables are independent if the observed value of one does not affect the probability distribution of the other.
The concept of independence extends to dealing with collections of more than two events or random variables.
Definition
For events
Two events
Two events A and B are independent iff their joint probability equals the product of their probabilities:

    P(A ∩ B) = P(A) P(B).

Why this defines independence is made clear by rewriting with conditional probabilities:

    P(A | B) = P(A)  and  P(B | A) = P(B).

Thus, the occurrence of B does not affect the probability of A, and vice versa. Although the derived
expressions may seem more intuitive, they are not the preferred definition, as the conditional probabilities may
be undefined if P(A) or P(B) is 0.
A finite set of events is pairwise independent iff every pair of events is independent[2]; that is, iff for all distinct
pairs of indices i, j,

    P(Aᵢ ∩ Aⱼ) = P(Aᵢ) P(Aⱼ).

A finite set of events is mutually independent iff every event is independent of any intersection of the other
events; that is, iff for every subset {A₁, ..., Aₖ},

    P(A₁ ∩ ⋯ ∩ Aₖ) = P(A₁) ⋯ P(Aₖ).

This is called the multiplication rule for independent events.
For more than two events, a mutually independent set of events is pairwise independent, but the
converse is not necessarily true.
For random variables
Two random variables
Two random variables X and Y are independent iff for every a and b, the events {X ≤ a} and {Y ≤ b} are
independent events as defined above.
Intuitively, two random variables X and Y are conditionally independent given Z if,
once Z is known, the value of Y does not add any additional information about X. For
instance, two measurements X and Y of the same underlying quantity Z are not
independent, but they are conditionally independent given Z (unless the errors in the
two measurements are somehow connected).
The formal definition of conditional independence is based on the idea of conditional
distributions. If X, Y, and Z are discrete random variables, then we define X and Y to
be conditionally independent given Z if

    P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)

for all x, y and z such that P(Z = z) > 0. On the other hand, if the random variables
are continuous and have a joint probability density function p,
then X and Y are conditionally independent given Z if

    p(x, y | z) = p(x | z) p(y | z)

for any x, y and z with p(z) > 0. That is, the conditional distribution
for X given Y and Z is the same as that given Z alone. A similar equation
holds for the conditional probability density functions in the continuous
case.
Independence can be seen as a special kind of conditional
independence, since probability can be seen as a kind of conditional
probability given no events.
Independent σ-algebras
The definitions above are both generalized by the following definition of
independence for σ-algebras. Let (Ω, Σ, Pr) be a probability space and
let A and B be two sub-σ-algebras of Σ. A and B are said to
be independent if, whenever A ∈ A and B ∈ B,

    Pr(A ∩ B) = Pr(A) Pr(B).

Two events are independent (in the old sense) if and only
if the σ-algebras that they generate are independent (in the
new sense). The σ-algebra generated by an event E ∈ Σ is,
by definition,

    σ(E) = {∅, E, Ω \ E, Ω}.
Self-dependence
Note that an event A is independent of itself iff

    Pr(A) = Pr(A ∩ A) = Pr(A)², i.e. Pr(A) = 0 or Pr(A) = 1.

Thus if an event or its complement almost
surely occurs, it is independent of itself. For
example, if the event A is choosing any number but 0.5
from a uniform distribution on the unit interval,
A is independent of itself, even
though, tautologically, A fully determines A.
Expectation and covariance
If X and Y are independent random variables, then E[XY] = E[X] E[Y], so the covariance is
zero. (The converse of these, i.e. the
proposition that if two random variables
have a covariance of 0 they must be
independent, is not true.
See uncorrelated.)
Characteristic function
Two independent random
variables X and Y have the
property that the characteristic
function of their sum is the product of
their marginal characteristic functions:

    φ_{X+Y}(t) = φ_X(t) φ_Y(t).
Rolling a die
The event of getting a 6 the first time a die is rolled and the event of getting a 6 the second time
are independent. By contrast, the event of getting a 6 the first time a die is rolled and the event
that the sum of the numbers seen on the first and second trials is 8 are not independent.
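The die example above can be verified by enumerating all 36 equally likely outcomes; a sketch with exact fractions:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (d1, d2) rolls

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def six_first(o):  return o[0] == 6
def six_second(o): return o[1] == 6
def sum_is_8(o):   return o[0] + o[1] == 8

# A 6 on each roll: joint probability factorizes -> independent.
independent = prob(lambda o: six_first(o) and six_second(o)) == prob(six_first) * prob(six_second)
# A 6 on the first roll vs. the rolls summing to 8: does not factorize.
dependent = prob(lambda o: six_first(o) and sum_is_8(o)) != prob(six_first) * prob(sum_is_8)
```

Concretely, P(first is 6 and sum is 8) = 1/36, but P(first is 6)·P(sum is 8) = (1/6)(5/36) = 5/216, so the events are not independent.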
Drawing cards
If two cards are drawn with replacement from a deck of cards, the event of drawing a red card on
the first trial and that of drawing a red card on the second trial are independent. By contrast, if two
cards are drawn without replacement from a deck of cards, the event of drawing a red card on the
first trial and that of drawing a red card on the second trial are not independent.
Pairwise and mutual independence
[Figure: pairwise independent, but not mutually independent, events.]
[Figure: mutually independent events.]
In the pairwise independent case, the probability of each single event and of each pair of events
factorizes, but the probability of the triple intersection does not. In the mutually independent case,
however, every intersection of events has probability equal to the product of the individual
probabilities. See also [4] for a three-event example in which the triple-intersection probability
factorizes and yet no two of the three events are pairwise independent.
MUTUAL EXCLUSIVITY
Logic
In logic, two mutually exclusive propositions are propositions that logically cannot be true at the same time. Another term for
mutually exclusive is "disjoint". To say that more than two propositions are mutually exclusive, depending on context, means that
one cannot be true if the other one is true, or at least one of them cannot be true. The term pairwise mutually exclusive always
means two of them cannot be true simultaneously.
Probability
In probability theory, events E1, E2, ..., En are said to be mutually exclusive if the occurrence of any one of them automatically
implies the non-occurrence of the remaining n − 1 events. Therefore, two mutually exclusive events cannot both occur. Formally
said, the intersection of each two of them is empty (the null event): A ∩ B = ∅. In consequence, mutually exclusive events have
the property: P(A and B) = 0.[1]
For example, one cannot draw a card that is both red and a club because clubs are always black. If one draws just one card from
the deck, either a red card or a club can be drawn. When A and B are mutually exclusive, P(A or B) = P(A) + P(B).[2] One might ask,
"What is the probability of drawing a red card or a club?" This problem would be solved by adding together the probability of drawing
a red card and the probability of drawing a club. In a standard 52-card deck, there are twenty-six red cards and thirteen clubs: 26/52
+ 13/52 = 39/52 or 3/4.
One would have to draw at least two cards in order to draw both a red card and a club. The probability of doing so in two draws
would depend on whether the first card drawn were replaced before the second drawing, since without replacement there would be
one fewer card after the first card was drawn. The probabilities of the individual events (red, and club) would be multiplied rather
than added. The probability of drawing a red and a club in two drawings without replacement would be 26/52 * 13/51 = 338/2652, or
13/102. With replacement, the probability would be 26/52 * 13/52 = 338/2704, or 13/104.
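The two-draw calculations above can be reproduced with exact rational arithmetic; a short sketch following the text's red-then-club computation:

```python
from fractions import Fraction

red, clubs, deck = Fraction(26), Fraction(13), Fraction(52)

# A red card, then a club, in two draws:
without_replacement = (red / deck) * (clubs / (deck - 1))   # 26/52 * 13/51 = 13/102
with_replacement = (red / deck) * (clubs / deck)            # 26/52 * 13/52 = 13/104
```

Using `Fraction` avoids floating-point rounding and returns the same reduced fractions quoted in the text.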
In probability theory the word "or" allows for the possibility of both events happening. The probability of one or both events occurring
is denoted P(Aor B) and in general it equals P(A) + P(B) P(A and B).[2] Therefore, if one asks, "What is the probability of drawing a
red card or a king?", drawing any of a red king, a red non-king, or a black king is considered a success. In a standard 52-card deck,
there are twenty-six red cards and four kings, two of which are red, so the probability of drawing a red or a king is 26/52 + 4/52
2/52 = 28/52. However, with mutually exclusive events the last term in the formula, P(A and B), is zero, so the formula simplifies to
the one given in the previous paragraph.
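The general addition rule for the red-card-or-king question can likewise be checked with exact fractions (a minimal sketch):

```python
from fractions import Fraction

p_red = Fraction(26, 52)
p_king = Fraction(4, 52)
p_red_and_king = Fraction(2, 52)   # the two red kings are counted in both events

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_red_or_king = p_red + p_king - p_red_and_king
print(p_red_or_king)   # 7/13, i.e. 28/52
```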
Events are collectively exhaustive if all the possibilities for outcomes are exhausted by those possible events, so at least one of those outcomes must occur. The probability that at least one of the events will occur is equal to 1.[3] For example, there are theoretically only two possibilities for flipping a coin. Flipping a head and flipping a tail are collectively exhaustive events, and there is a probability of 1 of flipping either a head or a tail. Events can be both mutually exclusive and collectively exhaustive.[3] In the case of flipping a coin, flipping a head and flipping a tail are also mutually exclusive events. Both outcomes cannot occur for a single trial (i.e., when a coin is flipped only once). The probability of flipping a head and the probability of flipping a tail can be added to yield a probability of 1: 1/2 + 1/2 = 1.[4]
Statistics
In statistics and regression analysis, an independent variable that can take on only two possible values is called a dummy variable.
For example, it may take on the value 0 if an observation is of a male subject or 1 if the observation is of a female subject. The two
possible categories associated with the two possible values are mutually exclusive, so that no observation falls into more than one
category, and the categories are exhaustive, so that every observation falls into some category. Sometimes there are three or more possible categories, which are pairwise mutually exclusive and collectively exhaustive: for example, under 18 years of age, 18 to 64 years of age, and 65 years of age or older. In this case a set of dummy variables is constructed, each dummy variable having two mutually exclusive and jointly exhaustive categories: in this example, one dummy variable (called D1) would equal 1 if age is less than 18, and 0 otherwise; a second dummy variable (called D2) would equal 1 if age is in the range 18-64, and 0 otherwise. In this set-up, the dummy variable pairs (D1, D2) can have the values (1,0) (under 18), (0,1) (between 18 and 64), or (0,0) (65 or older), but not (1,1), which would nonsensically imply that an observed subject is both under 18 and between 18 and 64.
Then the dummy variables can be included as independent (explanatory) variables in a regression. Note that the number of dummy
variables is always one less than the number of categories: with the two categories male and female there is a single dummy
variable to distinguish them, while with the three age categories two dummy variables are needed to distinguish them.
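The dummy coding described above can be sketched as a small function (the names D1 and D2 and the age cutoffs follow the text; the function name is illustrative):

```python
def age_dummies(age):
    """Encode three mutually exclusive, collectively exhaustive age
    categories into two dummy variables; the third category
    (65 or older) is the omitted baseline, coded (0, 0)."""
    d1 = 1 if age < 18 else 0            # under 18
    d2 = 1 if 18 <= age <= 64 else 0     # 18 to 64
    return (d1, d2)

print(age_dummies(10))   # (1, 0)
print(age_dummies(40))   # (0, 1)
print(age_dummies(70))   # (0, 0)
```

Note that (1, 1) can never be produced, matching the mutual exclusivity of the categories, and that only two dummies are needed for three categories because the baseline is implied.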
Such qualitative data can also be used for dependent variables. For example, a researcher might want to predict whether someone
goes to college or not, using family income, a gender dummy variable, and so forth as explanatory variables. Here the variable to be
explained is a dummy variable that equals 0 if the observed subject does not go to college and equals 1 if the subject does go to
college. In such a situation, ordinary least squares (the basic regression technique) is widely seen as inadequate; instead probit regression or logistic regression is used. Further, sometimes there are three or more categories for the dependent variable: for example, no college, community college, and four-year college. In this case, the multinomial probit or multinomial logit technique is used.
In mathematics, two sets are said to be disjoint if they have no element in common. For example, {1, 2, 3} and {4, 5, 6} are disjoint
sets.[1]
Explanation
Formally, two sets A and B are disjoint if their intersection is the empty set, i.e. if A ∩ B = ∅.
This definition extends to any collection of sets. A collection of sets is pairwise disjoint or mutually disjoint if, given any two
sets in the collection, those two sets are disjoint.
Formally, let I be an index set, and for each i in I, let Ai be a set. Then the family of sets {Ai : i ∈ I} is pairwise disjoint if for any i and j in I with i ≠ j, Ai ∩ Aj = ∅.
For example, the collection of sets { {1}, {2}, {3}, ... } is pairwise disjoint. If {Ai} is a pairwise disjoint collection (containing at least two sets), then clearly its intersection is empty: ∩i∈I Ai = ∅.
However, the converse is not true: the intersection of the collection {{1, 2}, {2, 3}, {3, 1}} is empty, but the collection
is not pairwise disjoint. In fact, there are no two disjoint sets in this collection.
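The distinction between an empty overall intersection and pairwise disjointness can be checked with Python's built-in set type (a minimal sketch; the helper name is illustrative):

```python
from itertools import combinations

def pairwise_disjoint(sets):
    """True if every pair of sets in the collection has empty intersection."""
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

print(pairwise_disjoint([{1}, {2}, {3}]))          # True

# The counterexample from the text: overall intersection is empty,
# yet no two of the sets are disjoint.
collection = [{1, 2}, {2, 3}, {3, 1}]
print(set.intersection(*collection))               # set()
print(pairwise_disjoint(collection))               # False
```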
A partition of a set X is any collection of non-empty subsets {Ai : i ∈ I} of X such that the {Ai} are pairwise disjoint and ∪i∈I Ai = X.
In probability theory and statistics, a sequence or other collection of random variables is independent and identically
distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.[1]
The abbreviation i.i.d. is particularly common in statistics (often as iid, sometimes written IID), where observations in a sample are
often assumed to be effectively i.i.d. for the purposes of statistical inference. The assumption (or requirement) that observations be
i.i.d. tends to simplify the underlying mathematics of many statistical methods (see mathematical statistics and statistical theory).
However, in practical applications of statistical modeling the assumption may or may not be realistic. The generalization
of exchangeable random variables is often sufficient and more easily met.
The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum
(or average) of i.i.d. variables with finite variance approaches a normal distribution.
Note that IID refers to sequences of random variables. "Independent and identically distributed" implies an element in the sequence
is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where
the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order
Markov sequence). An IID sequence does not imply the probabilities for all elements of the sample space or event space must be
the same.[2] For example, repeated throws of loaded dice will produce a sequence that is IID, despite the outcomes being biased.
Examples
Uses in modeling
The following are examples or applications of independent and identically distributed (i.i.d.) random variables:
A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands
on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see
the Gambler's fallacy).
A sequence of fair or loaded dice rolls is i.i.d.
A sequence of fair or unfair coin flips is i.i.d.
In signal processing and image processing the notion of transformation to IID implies two specifications, the "ID"
(ID = identically distributed) part and the "I" (I = independent) part:
(I) the signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white signal (one with a flat spectrum).
One of the simplest statistical tests, the z-test, is used to test hypotheses about means of random variables. When using the z-test, one assumes (requires) that all observations are i.i.d. in order to satisfy the conditions of the central limit theorem.
Generalizations
Many results that are first stated for i.i.d. variables hold more generally.
Exchangeable random variables
Main article: Exchangeable random variables
The most general notion which shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced by Bruno de Finetti. Exchangeability means that while variables may not be independent or identically distributed, future ones behave like past ones; formally, any value of a finite sequence is as likely as any permutation of those values, i.e. the joint probability distribution is invariant under the symmetric group.
This provides a useful generalization: for example, sampling without replacement is not independent, but is exchangeable, and it is widely used in Bayesian statistics.
Lévy process
Main article: Lévy process
In stochastic calculus, i.i.d. variables are thought of as a discrete-time Lévy process: each variable gives how much one changes from one time to another. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. One may generalize this to include continuous-time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables; for instance, the Wiener process is the limit of the Bernoulli process.
In linear algebra, a family of vectors is linearly independent if none of them can be written as a linear combination of finitely many
other vectors in the family. A family of vectors which is not linearly independent is called linearly dependent. For instance, in the
three-dimensional real vector space we have the following example.
Here the first three vectors are linearly independent; but the fourth vector equals 9 times the first plus 5 times the second plus
4 times the third, so the four vectors together are linearly dependent. Linear dependence is a property of the family, not of any
particular vector; for example in this case we could just as well write the first vector as a linear combination of the last three.
In probability theory and statistics there is an unrelated measure of linear dependence between random variables.
Definition
A finite subset of n vectors, v1, v2, ..., vn, from the vector space V, is linearly dependent if and only if there exists a set of n scalars, a1, a2, ..., an, not all zero, such that

a1v1 + a2v2 + ... + anvn = 0.

Note that the zero on the right is the zero vector, not the number zero.
If such scalars do not exist, then the vectors are said to be linearly independent.
Alternatively, linear independence can be directly defined as follows: a set of vectors is linearly independent if and only if the only representations of the zero vector as linear combinations of its elements are trivial, i.e., whenever a1, a2, ..., an are scalars such that a1v1 + a2v2 + ... + anvn = 0, then ai = 0 for all i.
A geographic example may help to clarify the concept of linear independence. A person describing the
location of a certain place might say, "It is 5 miles north and 6 miles east of here." This is sufficient
information to describe the location, because the geographic coordinate system may be considered as a
2-dimensional vector space (ignoring altitude). The person might add, "The place is 7.81 miles northeast
of here." Although this last statement is true, it is not necessary.
In this example the "5 miles north" vector and the "6 miles east" vector are linearly independent. That is
to say, the north vector cannot be described in terms of the east vector, and vice versa. The third "7.81
miles northeast" vector is a linear combination of the other two vectors, and it makes the set of
vectors linearly dependent, that is, one of the three vectors is unnecessary.
Also note that if altitude is not ignored, it becomes necessary to add a third vector to the linearly
independent set. In general, n linearly independent vectors are required to describe any location in n-
dimensional space.
Example I
An alternative method uses the fact that n vectors in Rn are linearly dependent if and only if the determinant of the matrix formed by taking the vectors as its columns is zero.
In this case, the matrix formed by the vectors (1, 1) and (3, 2) is

| 1 3 |
| 1 2 |

and its determinant is (1)(2) − (1)(3) = −1. Since the determinant is non-zero, the vectors (1, 1) and (3, 2) are linearly independent.
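For the 2×2 case, the determinant test can be written out directly (a minimal sketch; the function name is illustrative):

```python
def det2(v, w):
    """Determinant of the 2x2 matrix whose columns are v and w."""
    return v[0] * w[1] - v[1] * w[0]

v, w = (1, 1), (3, 2)
d = det2(v, w)
print(d)        # -1
print(d != 0)   # True: (1, 1) and (3, 2) are linearly independent
```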
Otherwise, suppose we have m vectors of n coordinates, with m < n. Then A is an n×m matrix and x is a column vector with m entries, and we are again interested in Ax = 0. As we saw previously, this is equivalent to a list of n equations. Consider the first m rows of A, giving the first m equations; any solution of the full list of equations must also satisfy the reduced list. In fact, if i1, ..., im is any list of m rows, then the equation must hold for those rows, so the vectors are linearly independent exactly when the m×m determinant formed from some list of m rows is non-zero. (In case m = n, this requires only one determinant, as above. If m > n, then it is a theorem that the vectors must be linearly dependent.) This fact is valuable for theory; in practical calculations more efficient methods are available.
Example II
Let V = Rn and consider the standard basis vectors e1 = (1, 0, 0, ..., 0), e2 = (0, 1, 0, ..., 0), ..., en = (0, 0, ..., 0, 1). Then e1, e2, ..., en are linearly independent.
Proof
Suppose that a1, a2, ..., an are elements of R such that

a1e1 + a2e2 + ... + anen = 0.

Since a1e1 + a2e2 + ... + anen = (a1, a2, ..., an), it follows that ai = 0 for all i in {1, ..., n}.
Example III
Let f(t) = e^t and g(t) = e^(2t). These functions are linearly independent in the vector space of all real functions of t.
Proof
Suppose a and b are two real numbers such that

a e^t + b e^(2t) = 0

for all values of t. We need to show that a = 0 and b = 0. In order to do this, we divide through by e^t (which is never zero) and subtract to obtain

b e^t = −a.

In other words, the function b e^t must be independent of t, which only occurs when b = 0. It follows that a is also zero.
Example IV
Proof
Forming the simultaneous equations, we can solve (using, for example, Gaussian elimination) to obtain a solution in which one of the unknowns can be chosen arbitrarily. Since these are nontrivial results, the vectors are linearly dependent.
Projective space of linear dependences
A linear dependence among vectors v1, ..., vn is a tuple (a1, ..., an) with n scalar components, not all zero, such that a1v1 + ... + anvn = 0. If such a linear dependence exists, then the n vectors are linearly dependent. It makes sense to identify two linear dependences if one arises as a non-zero multiple of the other, because in this case the two describe the same linear relationship among the vectors. Under this identification, the set of all linear dependences among v1, ..., vn is a projective space.
Linear dependence between random variables
Here we introduce a nonnegative weight function w(x) in the definition of this inner product. In simple cases, w(x) = 1 exactly:

⟨f, g⟩ = ∫ f(x) g(x) w(x) dx.

We say that these functions are orthogonal if that inner product is zero:

⟨f, g⟩ = 0.

We write the norms with respect to this inner product and the weight function as ||f||. The members of a sequence {fi} are orthonormal if

⟨fi, fj⟩ = δij,

where δij is the "Kronecker delta" function. In other words, any two of them are orthogonal, and the norm of each is 1 in the case of the orthonormal sequence. See in particular the orthogonal polynomials.
Examples
The vectors (1, 3, 2), (3, −1, 0), (1/3, 1, −5/3) are orthogonal to each other, since (1)(3) + (3)(−1) + (2)(0) = 0, (3)(1/3) + (−1)(1) + (0)(−5/3) = 0, and (1)(1/3) + (3)(1) + (2)(−5/3) = 0.
The vectors (1, 0, 1, 0, ...)T and (0, 1, 0, 1, ...)T are orthogonal to each other, as their dot product is 0. We can then generalize to vectors in Z2^n: for some positive integer a, and for 1 ≤ k ≤ a − 1, such vectors are orthogonal; for example, (1, 0, 0, 1, 0, 0, 1, 0)T, (0, 1, 0, 0, 1, 0, 0, 1)T, and (0, 0, 1, 0, 0, 1, 0, 0)T are mutually orthogonal.
Take two quadratic functions 2t + 3 and 5t² + t − 17/9. These functions are orthogonal with respect to a unit weight function on the interval from −1 to 1. The product of these two functions is 10t³ + 17t² − (7/9)t − 17/3, and its integral over [−1, 1] is zero: the odd powers of t integrate to zero by symmetry, while ∫ 17t² dt = 34/3 exactly cancels ∫ 17/3 dt = 34/3.
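The integral can be verified with exact rational arithmetic (a sketch; the coefficient list encodes the product polynomial 10t³ + 17t² − (7/9)t − 17/3):

```python
from fractions import Fraction as F

# Coefficients of 10t^3 + 17t^2 - (7/9)t - 17/3, lowest degree first.
coeffs = [F(-17, 3), F(-7, 9), F(17), F(10)]

# Exact integral of c * t^k over [-1, 1] is c * (1 - (-1)**(k+1)) / (k+1):
# 2c/(k+1) for even k, and 0 for odd k.
integral = sum(c * (1 - F(-1) ** (k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))
print(integral)   # 0
```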
Another scheme is orthogonal frequency-division multiplexing (OFDM), which refers to the use, by a single transmitter, of
a set of frequency multiplexed signals with the exact minimum frequency spacing needed to make them orthogonal so
that they do not interfere with each other. Well known examples include (a, g, and n) versions of 802.11 Wi-
Fi; WiMAX; ITU-T G.hn, DVB-T, the terrestrial digital TV broadcast system used in most of the world outside North
America; and DMT (Discrete Multi Tone), the standard form of ADSL.
In OFDM, the subcarrier frequencies are chosen so that the subcarriers are orthogonal to each other, meaning that
crosstalk between the subchannels is eliminated and intercarrier guard bands are not required. This greatly simplifies the
design of both the transmitter and the receiver. Unlike in conventional FDM, a separate filter for each subchannel is not
required.
In probability theory, two events R and B are conditionally independent given a third event Y precisely if the occurrence or non-occurrence of R and the occurrence or non-occurrence of B are independent events in their conditional probability distribution given Y. In other words, R and B are conditionally independent given Y if and only if, given knowledge that Y occurs, knowledge of whether R occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of R occurring.
In the standard notation of probability theory, R and B are conditionally independent given Y if and only if

Pr(R ∩ B | Y) = Pr(R | Y) Pr(B | Y),

or equivalently,

Pr(R | B ∩ Y) = Pr(R | Y).
Two random variables X and Y are conditionally independent given a third random variable Z if and only if they are independent in their conditional probability distribution given Z. That is, X and Y are conditionally independent given Z if and only if, given any value of Z, the probability distribution of X is the same for all values of Y and the probability distribution of Y is the same for all values of X.
Two random variables X and Y are conditionally independent given a σ-algebra Σ if

Pr(R ∩ B | Σ) = Pr(R | Σ) Pr(B | Σ) almost surely

for all R in σ(X) and B in σ(Y), where Pr(A | Σ) denotes the conditional expectation of the indicator function of the event A given the σ-algebra Σ. That is, Pr(A | Σ) = E[1A | Σ].
Two random variables X and Y are conditionally independent given a random variable W if they are independent given σ(W), the σ-algebra generated by W. This is commonly written:

X ⊥ Y | W.

This is read "X is independent of Y, given W"; the conditioning applies to the whole statement: "(X is independent of Y) given W".
If W assumes a countable set of values, this is equivalent to the conditional independence of X and Y for the events of the
form [W = w]. Conditional independence of more than two events, or of more than two random variables, is defined
analogously.
The following two examples show that X ⊥ Y neither implies nor is implied by X ⊥ Y | W. First, suppose W is 0 with probability 0.5 and is the value 1 otherwise. When W = 0, take X and Y to be independent, each having the value 0 with probability 0.99 and the value 1 otherwise. When W = 1, X and Y are again independent, but this time they take the value 1 with probability 0.99. Then X ⊥ Y | W. But X and Y are dependent, because Pr(X = 0) < Pr(X = 0 | Y = 0). This is because Pr(X = 0) = 0.5, but if Y = 0 then it is very likely that W = 0 and thus that X = 0 as well, so Pr(X = 0 | Y = 0) > 0.5. For the second example, suppose X ⊥ Y, each taking the values 0 and 1 with probability 0.5. Let W be the product XY. Then when W = 0, Pr(X = 0) = 2/3, but Pr(X = 0 | Y = 0) = 1/2, so X ⊥ Y | W is false. This is also an example of explaining away. See Kevin Murphy's tutorial,[2] where X and Y take the values "brainy" and "sporty".
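The first example can be checked by simulation (a minimal sketch using only the standard library; the 0.99/0.01 probabilities follow the text):

```python
import random

random.seed(1)

def draw():
    """One sample of (W, X, Y): given W, X and Y are independent,
    each equal to 0 with probability 0.99 when W = 0 and with
    probability 0.01 when W = 1."""
    w = random.randint(0, 1)
    p_zero = 0.99 if w == 0 else 0.01
    x = 0 if random.random() < p_zero else 1
    y = 0 if random.random() < p_zero else 1
    return w, x, y

samples = [draw() for _ in range(100_000)]
p_x0 = sum(1 for _, x, _ in samples if x == 0) / len(samples)
x_given_y0 = [x for _, x, y in samples if y == 0]
p_x0_given_y0 = x_given_y0.count(0) / len(x_given_y0)

# Unconditionally Pr(X = 0) is about 0.5, but observing Y = 0 makes
# W = 0 (and hence X = 0) far more likely.
print(round(p_x0, 2), round(p_x0_given_y0, 2))
```

The empirical estimates land near the exact values Pr(X = 0) = 0.5 and Pr(X = 0 | Y = 0) ≈ 0.98, confirming that X and Y are dependent despite being conditionally independent given W.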
In probability theory, two random variables being uncorrelated does not imply their independence. In some contexts,
uncorrelatedness implies at least pairwise independence (as when the random variables involved have Bernoulli distributions).
It is sometimes mistakenly thought that one context in which uncorrelatedness implies independence is when the random variables involved are normally distributed. However, this is incorrect if the variables are merely marginally normally distributed but not jointly normally distributed.
Suppose two random variables X and Y are jointly normally distributed. That is the same as saying that the random vector (X, Y)
has a multivariate normal distribution. It means that the joint probability distribution of X and Y is such that for any two constant (i.e.,
non-random) scalars a and b, the random variable aX + bY is normally distributed. In that case if X and Y are uncorrelated, i.e.,
their covariance cov(X, Y) is zero, then they are independent.[1] However, it is possible for two random variables X and Y to be so
distributed jointly that each one alone is marginally normally distributed, and they are uncorrelated, but they are not independent;
examples are given below.
Examples
Suppose X has a normal distribution with expected value 0 and variance 1. Let W = 1 or −1, each with probability 1/2, and assume W is independent of X. Let Y = WX. Then X and Y are uncorrelated; Y has the same normal distribution as X; and X and Y are not independent.
Suppose X has a normal distribution with expected value 0 and variance 1. Let

Y = X if |X| ≥ c, and Y = −X if |X| < c,

where c is a positive number to be specified below. If c is very small, then the correlation corr(X, Y) is near 1; if c is very large, then corr(X, Y) is near −1. Since the correlation is a continuous function of c, the intermediate value theorem implies there is some particular value of c that makes the correlation 0. That value is approximately 1.54. In that case, X and Y are uncorrelated, but they are clearly not independent, since X completely determines Y.
To see that Y is normally distributed (indeed, that its distribution is the same as that of X), let us find its cumulative distribution function:

Pr(Y ≤ x) = Pr(|X| < c and −X ≤ x) + Pr(|X| ≥ c and X ≤ x)
          = Pr(|X| < c and X ≤ x) + Pr(|X| ≥ c and X ≤ x) = Pr(X ≤ x).

(The second equality follows from the symmetry of the distribution of X and the symmetry of the condition that |X| < c.)
Observe that the sum X + Y is nowhere near being normally distributed, since it has a substantial probability (about 0.88) of being equal to 0, whereas the normal distribution, being a continuous distribution, has no discrete part, i.e., does not concentrate more than zero probability at any single point.
Consequently X and Y are not jointly normally distributed, even though they are separately normally
distributed.
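The first example (Y = WX with an independent random sign W) can be illustrated numerically. This sketch, using only the standard library, checks that the sample covariance is near zero even though |Y| always equals |X|:

```python
import random

random.seed(2)
n = 200_000

xs, ys = [], []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    w = random.choice([-1.0, 1.0])   # independent random sign
    xs.append(x)
    ys.append(w * x)                 # Y = WX has the same N(0,1) marginal as X

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Covariance is near zero, yet |Y| = |X| always holds, so X and Y
# are far from independent.
print(round(cov, 2))                                              # close to 0
print(all(abs(x) == abs(y) for x, y in zip(xs, ys)))              # True
```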
In probability theory, a set of events is jointly or collectively exhaustive if at least one of the events must occur. For example,
when rolling a six-sided die, the outcomes 1, 2, 3, 4, 5, and 6 are collectively exhaustive, because they encompass the entire range
of possible outcomes.
Another way to describe collectively exhaustive events is that their union must cover the entire sample space. For example, events A and B are said to be collectively exhaustive if

A ∪ B = S,

where S is the sample space.
Estimator
In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule and its result (the estimate) are distinguished.
There are point and interval estimators. The point estimators yield single-valued results, although this includes the possibility of single vector-valued results and results that can be expressed as a single function. This is in contrast to an interval estimator, where the result would be a range of plausible values (or vectors or functions).
Statistical theory is concerned with the properties of estimators; that is, with defining properties that can be used to compare different estimators (different rules for creating estimates) for the same quantity, based on the same data. Such properties can be used to determine the best rules to use under given circumstances. However, in robust statistics, statistical theory goes on to consider the balance between having good properties, if tightly defined assumptions hold, and having less good properties that hold under wider conditions.
Background
An "estimator" or "point estimate" is a statistic (that is, a function of the data) that is used to infer the value of an
unknown parameter in a statistical model. The parameter being estimated is sometimes called the estimand. It can be either finite-dimensional (in parametric and semi-parametric models) or infinite-dimensional (in semi-nonparametric and non-parametric models). If the parameter is denoted θ, then the estimator is typically written by adding a "hat" over the symbol: θ̂. Being a function of the data, the estimator is itself a random variable; a particular realization of this random variable is
called the "estimate". Sometimes the words "estimator" and "estimate" are used interchangeably.
The definition places virtually no restrictions on which functions of the data can be called "estimators". The attractiveness of different estimators can be judged by looking at their properties, such as unbiasedness, mean square error, consistency, asymptotic distribution, etc. The construction and comparison of estimators are the subjects of estimation theory. In the context of decision theory, an estimator is a type of decision rule, and its performance may be evaluated through the use of loss functions.
When the word "estimator" is used without a qualifier, it usually refers to point estimation. The estimate in this case is a single point
in the parameter space. Other types of estimators also exist: interval estimators, where the estimates are subsets of the parameter
space.
The problem of density estimation arises in two applications. Firstly, in estimating the probability density functions of random
variables and secondly in estimating the spectral density function of a time series. In these problems the estimates are functions that
can be thought of as point estimates in an infinite dimensional space, and there are corresponding interval estimation problems.
Definition
Suppose there is a fixed parameter θ that needs to be estimated. Then an "estimator" is a function that maps the sample space to a set of sample estimates. An estimator of θ is usually denoted by the symbol θ̂. It is often convenient to express the theory using the algebra of random variables: thus if X is used to denote a random variable corresponding to the observed data, the estimator (itself treated as a random variable) is symbolised as a function of that random variable, θ̂(X). The estimate for a particular observed dataset (i.e. for X = x) is then θ̂(x), which is a fixed value. Often an abbreviated notation is used in which θ̂ is interpreted directly as a random variable.
Quantified properties
Error
For a given sample x, the error of the estimator θ̂ is defined as e(x) = θ̂(x) − θ, where θ is the parameter being estimated. Note that the error, e, depends not only on the estimator (the estimation formula or procedure), but also on the sample.
Mean squared error
The mean squared error of θ̂ is defined as the expected value (probability-weighted average, over all samples) of the squared errors; that is,

MSE(θ̂) = E[(θ̂(X) − θ)²].

It is used to indicate how far, on average, the collection of estimates are from the single parameter being estimated.
Consider the following analogy. Suppose the parameter is the bull's-eye of a target, the estimator is the process of
shooting arrows at the target, and the individual arrows are estimates (samples). Then high MSE means the average
distance of the arrows from the bull's-eye is high, and low MSE means the average distance from the bull's-eye is low.
The arrows may or may not be clustered. For example, even if all arrows hit the same point, yet grossly miss the target, the MSE is still relatively large. Note, however, that if the MSE is relatively low, then the arrows are likely more highly clustered (than highly dispersed) around the target.
Sampling deviation
For a given sample x, the sampling deviation of the estimator θ̂ is defined as d(x) = θ̂(x) − E[θ̂(X)], where E[θ̂(X)] is the expected value of the estimator. Note that the sampling deviation, d, depends not only on the estimator, but also on the sample.
Variance
The variance of θ̂ is simply the expected value of the squared sampling deviations; that is, Var(θ̂) = E[(θ̂ − E[θ̂])²]. It is used to indicate how far, on average, the collection of estimates are from the expected value of the estimates. Note the difference between MSE and variance. If the parameter is the bull's-eye of a target, and the arrows are estimates, then a relatively high variance means the arrows are dispersed, and a relatively low variance means the arrows are clustered. Some things to note: even if the variance is low, the cluster of arrows may still be far off-target, and even if the variance is high, the diffuse collection of arrows may still be unbiased. Finally, even if all arrows grossly miss the target, if they nevertheless all hit the same point, the variance is zero.
Bias
The bias of θ̂ is defined as B(θ̂) = E[θ̂] − θ. It is the distance between the average of the collection of estimates and the single parameter being estimated. It is also the expected value of the error, since E[θ̂(X) − θ] = E[θ̂(X)] − θ. If the parameter is the bull's-eye of a target, and the arrows are estimates, then a relatively high absolute value for the bias means the average position of the arrows is off-target, and a relatively low absolute bias means the average position of the arrows is on target. They may be dispersed, or may be clustered. The relationship between bias and variance is analogous to the relationship between accuracy and precision.
Unbiased
The estimator θ̂ is an unbiased estimator of θ if and only if B(θ̂) = 0, that is, E[θ̂] = θ. Note that bias is a property of the
estimator, not of the estimate. Often, people refer to a "biased estimate" or an "unbiased estimate," but they really
are talking about an "estimate from a biased estimator," or an "estimate from an unbiased estimator." Also, people
often confuse the "error" of a single estimate with the "bias" of an estimator. Just because the error for one estimate
is large, does not mean the estimator is biased. In fact, even if all estimates have astronomical absolute values for
their errors, if the expected value of the error is zero, the estimator is unbiased. Also, just because an estimator is
biased, does not preclude the error of an estimate from being zero (we may have gotten lucky). The ideal situation,
of course, is to have an unbiased estimator with low variance, and also try to limit the number of samples where the
error is extreme (that is, have few outliers). Yet unbiasedness is not essential. Often, if just a little bias is permitted,
then an estimator can be found with lower MSE and/or fewer outlier sample estimates.
An alternative to the version of "unbiased" above is "median-unbiased", where the median of the distribution of estimates agrees with the true value; thus, in the long run half the estimates will be too low and half too high. While this applies immediately only to scalar-valued estimators, it can be extended to any measure of central tendency of a distribution.
Relationships
The MSE, variance, and bias are related: MSE(θ̂) = Var(θ̂) + (B(θ̂))², i.e. mean squared error = variance + square of bias. In particular, for an unbiased estimator, the variance equals the MSE.
The standard deviation of an estimator of θ (the square root of the variance), or an estimate of the standard deviation of an estimator of θ, is called the standard error of θ̂.
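The decomposition mean squared error = variance + squared bias can be checked numerically. The sketch below uses a deliberately biased estimator of a mean (the sample mean of 10 draws plus 0.5; the parameter values are illustrative):

```python
import random

random.seed(3)
theta = 5.0          # true parameter being estimated
n_trials = 50_000

# A deliberately biased estimator: sample mean of 10 draws, plus 0.5.
estimates = []
for _ in range(n_trials):
    sample = [random.gauss(theta, 2.0) for _ in range(10)]
    estimates.append(sum(sample) / len(sample) + 0.5)

mean_est = sum(estimates) / n_trials
bias = mean_est - theta
variance = sum((e - mean_est) ** 2 for e in estimates) / n_trials
mse = sum((e - theta) ** 2 for e in estimates) / n_trials

# For the empirical quantities, MSE = variance + bias^2 holds exactly
# (up to floating-point rounding), mirroring the population identity.
print(round(mse - (variance + bias ** 2), 6))   # 0.0
```

The identity holds term by term for the empirical averages as well as for the population quantities, which is why the printed difference vanishes rather than merely being small.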
Behavioural properties
Consistency
A consistent sequence of estimators is a sequence of estimators that converge in probability to the quantity being
estimated as the index (usually the sample size) grows without bound. In other words, increasing the sample size
increases the probability of the estimator being close to the population parameter.
Mathematically, a sequence of estimators {tn; n ≥ 0} is a consistent estimator for parameter θ if and only if, for all ε > 0,

lim(n→∞) Pr{ |tn − θ| < ε } = 1.

The consistency defined above may be called weak consistency. The sequence is strongly consistent if it converges almost surely to the true value.
An estimator that converges to a multiple of a parameter can be made into a consistent estimator by
multiplying the estimator by a scale factor, namely the true value divided by the asymptotic value of the
estimator. This occurs frequently in estimation of scale parameters by measures of statistical dispersion.
Asymptotic normality
An asymptotically normal estimator is a consistent estimator whose distribution around the true parameter θ approaches a normal distribution as the sample size grows. The central limit theorem implies asymptotic normality of the sample mean as an estimator of the true mean. More generally, maximum likelihood estimators are asymptotically normal under fairly weak regularity conditions; see the asymptotics section of the maximum likelihood article. However, not all estimators are asymptotically normal; the simplest examples occur when the true value of a
Efficiency
Two naturally desirable properties of estimators are to be unbiased and to have minimal mean squared error (MSE). These cannot in general both be satisfied simultaneously: a biased estimator may have lower mean squared error (MSE) than any unbiased estimator; despite having bias, the estimator's variance may be sufficiently smaller than that of any unbiased estimator that it may be preferable to use.
Among unbiased estimators, there often exists one with the lowest variance, called the minimum variance unbiased estimator (MVUE). In some cases an unbiased efficient estimator exists, which, in addition to having the lowest variance among unbiased estimators, satisfies the Cramér-Rao bound, an absolute lower bound on the variance of unbiased estimators.
Concerning such "best unbiased estimators", see also the Cramér-Rao bound and the Gauss-Markov