
Properties of the Normal distribution

There are four key properties associated with the normal distribution:
1. It is bell shaped (and thus symmetrical) in appearance.
2. Its measures of central tendency (mean, median, and mode) are all identical.
3. Its middle spread is equal to 1.33 standard deviations. This means the interquartile range is contained within an interval that runs from two-thirds of a standard deviation below the mean to two-thirds of a standard deviation above the mean (a numerical check appears after this list).
4. Its associated random variable has an infinite range (-∞ < X < ∞)
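
The third property can be checked numerically. A minimal sketch in Python, assuming scipy is available, computes the quartiles of a standard normal distribution:

    # Interquartile range of a standard normal distribution
    from scipy.stats import norm

    q1 = norm.ppf(0.25)   # first quartile, about -0.674 standard deviations
    q3 = norm.ppf(0.75)   # third quartile, about +0.674 standard deviations
    print(q3 - q1)        # about 1.35, close to the 1.33 figure quoted above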

The Exponential distribution


The exponential distribution is defined by a single parameter, λ, the mean number of arrivals per unit of time. The value 1/λ is equal to the average time between arrivals. For example, if the mean number of arrivals per minute is λ = 4, then the average time between arrivals is 1/λ = 0.25 minutes, or 15 seconds. The probability that the length of time before the next arrival is less than X is given by the equation below.

Exponential distribution: P(arrival time < X) = 1 − e^(−λX)

where e = the mathematical constant approximated by 2.71828
λ = the population mean number of arrivals per unit of time
X = any value of the continuous variable, where 0 < X < ∞
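
As a quick illustration of this formula, the following sketch (plain Python; λ is taken from the example above, and the value of X is chosen here for illustration) computes the probability that the next arrival occurs within half a minute:

    import math

    lam = 4.0   # mean number of arrivals per minute
    x = 0.5     # time in minutes (30 seconds); illustrative value

    # P(arrival time < X) = 1 - e^(-lambda * X)
    prob = 1 - math.exp(-lam * x)
    print(prob)  # approximately 0.8647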

Confidence interval estimation of the mean (σ known)


The central limit theorem and/or knowledge of the population distribution is used to determine the percentage of sample means that fall within a certain distance of the population mean. For instance, in a hypothetical cereal-filling example, 95% of all sample means are between 362.12 and 373.88 grams. This statement is based on deductive reasoning. However, it is exactly the opposite of the inductive reasoning that is needed here.

Inductive reasoning is needed because, in statistical inference, the results of a single sample are used to draw conclusions about the population, not vice versa. In practice, the population mean is the unknown quantity that is to be estimated. Suppose that, in the cereal-filling example, the population mean μ is unknown but the true population standard deviation σ is known to be 15 grams. Thus, rather than taking μ ± (1.96)(σ/√n) to find the upper and lower limits around μ, the sample mean x̄ is substituted for the unknown μ, and x̄ ± (1.96)(σ/√n) is used as an interval to estimate μ. Although in practice a single sample of size n is selected and the mean x̄ is computed, in order to understand the full meaning of the interval estimate, a hypothetical set of all possible samples, each of size n, should be examined.
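
A minimal sketch of this interval estimate in Python follows. The standard deviation of 15 grams comes from the example above; the sample mean of 368 grams and the sample size of 25 are assumptions chosen so that the limits reproduce the 362.12 to 373.88 interval quoted earlier:

    import math

    x_bar = 368.0   # sample mean in grams (assumed for illustration)
    sigma = 15.0    # known population standard deviation in grams
    n = 25          # sample size (assumed for illustration)
    z = 1.96        # critical value for 95% confidence

    margin = z * sigma / math.sqrt(n)
    print(x_bar - margin, x_bar + margin)   # approximately 362.12 and 373.88
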
Student’s t Distribution
At the beginning of the 20th century, a statistician named William S. Gosset, an employee of
Guinness Breweries in Ireland (see reference 3), was interested in making inferences about
the mean when σ was unknown. Because Guinness employees were not permitted to publish
research work under their own names, Gosset adopted the pseudonym “Student.” The
distribution that he developed has come to be known as Student’s t distribution.
If the random variable X is normally distributed, then the following statistic has a t distribution with n − 1 degrees of freedom:

t = (x̄ − μ) / (S/√n)
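
A short sketch of this statistic in Python, assuming scipy is available; the sample figures are illustrative, not taken from the text:

    import math
    from scipy.stats import t as t_dist

    x_bar = 372.5   # sample mean (illustrative)
    mu = 368.0      # hypothesized population mean (illustrative)
    s = 15.0        # sample standard deviation (illustrative)
    n = 25          # sample size (illustrative)

    t_stat = (x_bar - mu) / (s / math.sqrt(n))   # t = (x-bar - mu) / (S / sqrt(n))
    t_crit = t_dist.ppf(0.975, df=n - 1)         # upper critical value for 95% confidence, n-1 degrees of freedom
    print(t_stat, t_crit)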

Determining Sample Size


In each example concerning confidence interval estimation, the sample size was selected without regard to the width of the resulting confidence interval. In the business world, determining the proper sample size is a complicated procedure, subject to the constraints of budget, time, and ease of selection. If, in the Saxon Home Improvement Company example, you want to estimate the mean dollar amount of the sales invoices or the proportion of sales invoices that contain errors, you must determine in advance how good the estimate must be. Thus, you need to decide how large a sampling error to allow in estimating each of the parameters. You must also determine in advance how confident you want to be of correctly estimating the true population parameter.
The sampling error e is defined as e = Zσ/√n, from the interval estimate x̄ ± Zσ/√n,

where the sample size n is given by n = Z²σ²/e²
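
Applied to illustrative figures, say 95% confidence, an assumed population standard deviation of 25 dollars, and an acceptable sampling error of 5 dollars, the formula gives:

    import math

    z = 1.96        # critical value for 95% confidence
    sigma = 25.0    # assumed population standard deviation (dollars)
    e = 5.0         # acceptable sampling error (dollars)

    n = (z ** 2 * sigma ** 2) / (e ** 2)
    print(n, math.ceil(n))   # 96.04, which is rounded up to a sample size of 97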

Hypothesis-Testing methodology
The Null and alternative hypotheses
Hypothesis testing typically begins with some theory, claim, or assertion about a particular
parameter of a population. For example, for purposes of statistical analysis, your initial
hypothesis about the cereal example is that the process is working properly, meaning that the
mean fill is 368 grams, and no corrective action is needed.
The hypothesis that the population parameter is equal to the company specification is referred to as the null hypothesis. A null hypothesis always represents the status quo and is identified by the symbol H0.
Here the null hypothesis is that the filling process is working properly, that the mean fill per
box is the 368-gram specification. This can be stated as
H0 : μ = 368
Note that even though information is available only from the sample, the null hypothesis is written in terms of the population parameter. This is because the interest is focused on the entire filling process (the population of all the cereal boxes being filled). The sample statistic will be used to make inferences about the entire filling process. One inference may be that the results observed from the sample data indicate that the null hypothesis is false. If the null hypothesis is considered false, something else must be true.

Whenever a null hypothesis is specified, an alternative hypothesis must also be specified, one that must be true if the null hypothesis is found to be false. The alternative hypothesis H1 is the opposite of the null hypothesis H0. This is stated in the cereal example as
H1 : μ ≠ 368

Risks in Decision making using Hypothesis-Testing methodology


When using a sample statistic to make decisions about a population parameter, there is a risk that an incorrect conclusion will be reached. Two different types of errors can occur when applying hypothesis-testing methodology: Type I errors and Type II errors. In the cereal-filling-process example, a Type I error occurs if the conclusion is reached (based on sample information) that the population mean fill amount is not 368 grams when in fact it is 368 grams.

On the other hand, a Type II error occurs if the conclusion (based on sample information) is that the population mean fill amount is 368 grams when in fact it is not 368 grams.

Type I Error 

A Type I error occurs if the null hypothesis H0 is rejected when in fact it is true and should not be rejected. The probability of a Type I error occurring is α.

Type II Error

A Type II error occurs if the null hypothesis H0 is not rejected when in fact it is false and
should be rejected. The probability of a Type II error occurring is β.

The p-Value Approach to Hypothesis Testing

In recent years, with the advent of widely available statistical and spreadsheet software, the
concept of the p-value as an approach to hypothesis testing has increasingly gained
acceptance. The p-value is often referred to as the observed level of significance, which is the smallest level at which H0 can be rejected for a given set of data. The decision rules for rejecting H0 in the p-value approach are as follows.

If the p-value is greater than or equal to α, the null hypothesis is not rejected.

If the p-value is less than α, the null hypothesis is rejected.

To understand the p-value approach, consider the cereal-filling-process example. You tested whether or not the mean fill amount was equal to 368 grams. A Z value of +1.50 was obtained, and the null hypothesis was not rejected because +1.50 was less than the upper critical value of +1.96 and greater than the lower critical value of −1.96.
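
The same decision can be reached with the p-value; the sketch below assumes scipy and a two-tailed test at a significance level of α = 0.05:

    from scipy.stats import norm

    z = 1.50                          # test statistic from the cereal-filling example
    alpha = 0.05                      # significance level (assumed)
    p_value = 2 * norm.sf(abs(z))     # two-tailed p-value, approximately 0.1336
    print(p_value, p_value < alpha)   # since 0.1336 >= 0.05, H0 is not rejected
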
SIMPLE LINEAR REGRESSION MODEL

Yi = β0 + β1Xi + εi

where

β0 = intercept for the population
β1 = slope for the population
εi = random error in Y for observation i

In this model, the slope of the line, β1, represents the expected change in Y per unit change in X. It represents the average amount that Y changes (either positively or negatively) for a particular unit change in X. The intercept, β0, represents the average value of Y when X equals 0. The last component of the model, εi, represents the random error in Y for each observation i. In other words, εi is the vertical distance of Yi above or below the regression line.
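
A minimal least-squares fit of this model in Python is sketched below; the square-footage and sales figures are made-up illustrative values, not the Table 1 data referred to later:

    import numpy as np

    x = np.array([1.1, 2.3, 3.0, 4.2, 5.8])    # store size in thousands of square feet (illustrative)
    y = np.array([3.6, 5.1, 6.8, 8.0, 10.5])   # annual sales in millions of dollars (illustrative)

    b1, b0 = np.polyfit(x, y, 1)   # least-squares estimates of the slope and intercept
    print(b0, b1)                  # sample estimates of beta_0 and beta_1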

Predictions in Regression Analysis: Interpolation versus Extrapolation

When using a regression model for prediction purposes, it is important that you consider only
the relevant range of the independent variable in making predictions. This relevant range
includes all values from the smallest to the largest X used in developing the regression model.
Hence, when predicting Y for a given value of X, you can interpolate within this relevant
range of the X values, but you should not extrapolate beyond the range of X values. When you use the square footage to predict annual sales, note from Table 1 that the square footage (in thousands of square feet) varies from 1.1 to 5.8. Therefore, predictions of annual sales should be made only for stores whose size is between 1.1 and 5.8 thousand square feet. Any prediction of annual sales for stores whose size is outside this range presumes that the fitted relationship holds outside the range of 1.1 to 5.8 thousand square feet. For example, you cannot extrapolate the linear relationship beyond 5,800 square feet in Example 1. It would be
improper to use the regression equation to forecast the sales for a new store containing 8,000
square feet. It is quite possible that store size has a point of diminishing returns. If that were
true as square footage increases beyond 5,800 square feet, the effect on sales might become
smaller and smaller.

RESIDUAL ANALYSIS

In the preceding discussion of the site selection data, a simple linear regression model has
been used.

In this section, a graphical approach called residual analysis is developed to evaluate whether
the regression model that has been fitted to the data is an appropriate model. In addition,
residual analysis also enables potential violations of the assumptions of the regression model
to be evaluated.
Evaluating the Aptness of the Fitted Model

The residual, or estimated error value, ei, is defined as the difference between the observed (Yi) and predicted (Ŷi) values of the dependent variable for a given value of Xi. Graphically, a residual is depicted on a scatter diagram as the vertical distance between an observed value of Y and the line defined by the simple regression equation. Numerically, the residual is defined in Equation (9).

The residual is equal to the difference between the observed value of Y and the predicted
value of Y.

ei = Yi - Ŷi --- (9)
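
Continuing the illustrative least-squares sketch from the regression section, the residuals of Equation (9) can be computed directly; the data values are again made up for illustration:

    import numpy as np

    x = np.array([1.1, 2.3, 3.0, 4.2, 5.8])    # illustrative store sizes (thousands of square feet)
    y = np.array([3.6, 5.1, 6.8, 8.0, 10.5])   # illustrative annual sales (millions of dollars)

    b1, b0 = np.polyfit(x, y, 1)   # fitted slope and intercept
    y_hat = b0 + b1 * x            # predicted values of Y
    residuals = y - y_hat          # e_i = Y_i - Y-hat_i, Equation (9)
    print(residuals)               # a pattern in these values against X suggests the model is not apt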

MEASURING AUTOCORRELATION:

THE DURBIN-WATSON STATISTIC

One of the basic assumptions of the regression model is the independence of the errors. This
assumption is often violated when data are collected over sequential periods of time because
a residual at any one point in time may tend to be similar to residuals at adjacent points in
time. Such a pattern in the residuals is called autocorrelation. When substantial
autocorrelation is present in a set of data, the validity of a regression model can be in serious
doubt.
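
The excerpt does not write the statistic out; as a reference, a common form is D = Σ(e_i − e_(i−1))² / Σ e_i², computed over the residuals in time order. A short sketch, with made-up residuals for illustration:

    import numpy as np

    e = np.array([0.2, 0.3, 0.1, -0.2, -0.4, -0.1, 0.3])   # illustrative residuals in time order

    # Durbin-Watson statistic: squared successive differences over squared residuals
    d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
    print(d)   # values near 2 suggest little autocorrelation; values near 0 suggest positive autocorrelation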
