
Problems in Regression Analysis:

Regression analysis can run into several serious problems. The following are the problems that
most often occur in regression analysis.

Multicollinearity:

It is a phenomenon in which one predictor variable in a multiple regression model can be linearly
predicted from the others with a substantial degree of accuracy. Multicollinearity exists whenever
an independent variable is highly correlated with one or more of the other independent variables
in a multiple regression equation. Multicollinearity is a problem because it undermines the
statistical significance of an independent variable: it inflates the standard errors of the coefficient
estimates, which leads to wider confidence intervals and less reliable p-values for the independent
variables. That is, the statistical inferences from a model with multicollinearity may not be
dependable.

Examples:

Examples of correlated predictor variables (also called multicollinear predictors) are: a person’s
height and weight, age and sales price of a car, or years of education and annual income.

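A common way to screen for multicollinearity is the variance inflation factor (VIF): regress each predictor on all the other predictors and compute VIF = 1 / (1 − R²). Values much larger than 1 signal trouble (a common rule of thumb flags VIF > 10). Below is a minimal Python sketch using only NumPy; the simulated height/weight/age data are illustrative stand-ins for the examples above, not taken from any real study.

import numpy as np

def vif(X):
    # VIF for each column of X (n_samples x n_features): regress column j
    # on the remaining columns (with an intercept) and compute 1 / (1 - R^2).
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS fit
        r2 = 1 - ((y - A @ beta).var() / y.var())
        factors.append(1.0 / (1.0 - r2))
    return factors

# Illustrative data: weight is nearly a linear function of height, so the
# two predictors are multicollinear; age is unrelated to both.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)
weight = 0.9 * height + rng.normal(0, 2, 200)
age = rng.normal(40, 12, 200)
print(vif(np.column_stack([height, weight, age])))  # height and weight VIFs >> 10; age near 1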

HETEROSCEDASTICITY:

Heteroscedasticity means unequal scatter. In regression analysis, we talk about heteroscedasticity
in the context of the residuals or error term. Specifically, heteroscedasticity is a systematic change
in the spread of the residuals over the range of measured values. Heteroscedasticity is a problem
because ordinary least squares (OLS) regression assumes that all residuals are drawn from a
population that has a constant variance (homoscedasticity). To satisfy the regression assumptions
and be able to trust the results, the residuals should have a constant variance.

Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term
differs across values of an independent variable. The impact of violating the assumption of
homoscedasticity is a matter of degree, increasing as heteroscedasticity increases.

Examples:
In regression, an error is how far a point deviates from the regression line. Ideally, your data should
be homoscedastic (i.e., the variance of the errors should be constant). Outside of classroom
examples, this situation rarely happens in real life; most data are heteroscedastic by nature. Take,
for example, predicting women's weight from their height. In a Stepford Wives world, where
everyone is a perfect dress size 6, this would be easy: short women weigh less than tall women.
But in the real world, it is practically impossible to predict weight from height alone. Younger women
(in their teens) tend to weigh less, while post-menopausal women often gain weight, and women
of all shapes and sizes exist at all ages. This creates a cone-shaped scatter plot of variability.
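One way to see this cone in practice is to simulate data whose error spread grows with the predictor, fit an ordinary least-squares line, and compare the residual variance at the low and high ends of X. A minimal NumPy sketch; the heights and the noise model are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(150, 190, n)                      # heights in cm (illustrative)
# Error spread grows with x: the classic cone of heteroscedasticity.
y = -100 + 1.0 * x + rng.normal(0, 0.2 * (x - 140), n)

slope, intercept = np.polyfit(x, y, 1)            # ordinary least-squares line
resid = y - (slope * x + intercept)

low = resid[x < np.median(x)]                     # residuals at the low end of x
high = resid[x >= np.median(x)]                   # residuals at the high end of x
print(low.var(), high.var())                      # high.var() is several times larger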

HOMOSCEDASTICITY:

This assumption means that the variance around the regression line is the same
for all values of the predictor variable (X). A typical violation looks like a fan: for the lower
values on the X-axis the points all lie very near the regression line, while for the higher values
they spread farther from it.

Autocorrelation:
Autocorrelation is a mathematical representation of the degree of similarity between a given time
series and a lagged version of itself over successive time intervals. It is the same as calculating the
correlation between two different time series, except autocorrelation uses the same time series
twice: once in its original form and once lagged by one or more time periods.
Example:
For example, one might expect the air temperature on the 1st day of the month to be more similar
to the temperature on the 2nd day compared to the 31st day. If the temperature values that occurred
closer together in time are, in fact, more similar than the temperature values that occurred farther
apart in time, the data would be autocorrelated.
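Lag-k autocorrelation can be computed by correlating the series with a copy of itself shifted by k steps. A minimal Python sketch; the daily temperatures are simulated (a smooth monthly trend plus noise) purely for illustration:

import numpy as np

def autocorr(series, lag):
    # Correlation between the series and itself shifted by `lag` steps.
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

rng = np.random.default_rng(2)
days = np.arange(31)
# Simulated daily temperatures: a slow trend over the month plus noise.
temp = 20 + 5 * np.sin(days / 31 * np.pi) + rng.normal(0, 0.5, 31)

print(autocorr(temp, 1))    # high: day 1 looks like day 2
print(autocorr(temp, 15))   # much lower: days half a month apart are dissimilar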

MODEL SPECIFICATION:

Model specification refers to the determination of which independent variables should be included
in or excluded from a regression equation. In general, the specification of a regression model
should be based primarily on theoretical considerations rather than empirical or methodological
ones. A multiple regression model is, in fact, a theoretical statement about the causal relationship
between one or more independent variables and a dependent variable. Indeed, it can be observed
that regression analysis involves three distinct stages: the specification of a model, the estimation
of the parameters of this model, and the interpretation of these parameters. Specification is the first
and most critical of these stages. Our estimates of the parameters of a model and our interpretation
of them depend on the correct specification of the model. Consequently, problems can arise
whenever we misspecify a model. There are two basic types of specification error. In the first,
we misspecify a model by including in the regression equation an independent variable that is
theoretically irrelevant. In the second, we misspecify the model by excluding from the regression
equation an independent variable that is theoretically relevant.

Mean:
The statistical mean refers to the mean or average that is used to derive the central tendency of the
data in question. It is determined by adding all the data points in a population and then dividing
the total by the number of points. The resulting number is known as the mean or the average.

Example:
For example, take this list of numbers: 5, 15, 25, 35, 70. The mean is found by adding all of the
numbers together and dividing by the number of items in the set: (5 + 15 + 25 + 35 + 70) / 5 = 150 / 5 = 30.
Median:
The median is a simple measure of central tendency. To find the median, we arrange the
observations in order from smallest to largest value. If there is an odd number of observations,
the median is the middle value. If there is an even number of observations, the median is the
average of the two middle values.

Example:
For example, the median of the following set of numbers is 28, the average of the two middle values: (27 + 29) / 2 = 28.
23, 24, 26, 26, 27, 29, 30, 31, 33, 34

Mode:
The mode of a set of data values is the value that appears most often. If X is a discrete random
variable, the mode is the value x (i.e., X = x) at which the probability mass function takes its
maximum value. In other words, it is the value that is most likely to be sampled.

Example:
For example, the mode in the following set of numbers is 22, which occurs four times:
22, 22, 22, 23, 24, 26, 26, 28, 22, 30, 31, 33
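All three measures can be checked directly with Python's standard statistics module; this short sketch simply re-verifies the three worked examples above.

import statistics

print(statistics.mean([5, 15, 25, 35, 70]))                               # 30
print(statistics.median([23, 24, 26, 26, 27, 29, 30, 31, 33, 34]))        # 28.0
print(statistics.mode([22, 22, 22, 23, 24, 26, 26, 28, 22, 30, 31, 33]))  # 22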

Standard Deviation:
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set
of values. A low standard deviation indicates that the values tend to be close to the mean of the
set, while a high standard deviation indicates that the values are spread out over a wider range.

Example:
The formula for the population standard deviation is:

σ = √[ Σ(x − μ)² / N ]

where μ is the mean of the N values and the sum runs over every value x in the set.
Bias & Unbiasedness:

In statistics, the bias (or bias function) of an estimator is the difference between this estimator's
expected value and the true value of the parameter being estimated. An estimator or decision
rule with zero bias is called unbiased.
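A classic illustration is estimating a population variance: dividing the squared deviations by n gives a biased estimator (its expected value is (n − 1)/n times the true variance), while dividing by n − 1 gives an unbiased one. A small simulation sketch; the population, sample size, and repetition count are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(3)
true_var = 4.0                              # population variance (std dev = 2)

biased, unbiased = [], []
for _ in range(20000):
    sample = rng.normal(0, 2, size=10)
    biased.append(sample.var(ddof=0))       # divide by n
    unbiased.append(sample.var(ddof=1))     # divide by n - 1

print(np.mean(biased))    # ~3.6: expected value falls short of 4.0, so this estimator is biased
print(np.mean(unbiased))  # ~4.0: expected value matches the parameter, so this one is unbiased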

Variance:
Unlike range and quartiles, the variance combines all the values in a data set to produce a measure
of spread. The variance (symbolized by S²) and the standard deviation (the square root of the variance,
symbolized by S) are the most commonly used measures of spread.

Example:

We know that variance is a measure of how spread out a data set is. It is calculated as the average
squared deviation of each number from the mean of the data set. For example, for the numbers 1, 2,
and 3 the mean is 2 and the variance is 0.667:

[(1 − 2)² + (2 − 2)² + (3 − 2)²] ÷ 3 = 0.667

[sum of squared deviations from the mean] ÷ number of observations = variance
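The same numbers can be checked with the statistics module; pvariance and pstdev use the population formulas (divide by N), matching the calculation above.

import statistics

data = [1, 2, 3]
print(statistics.pvariance(data))  # 0.666...: [(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2] / 3
print(statistics.pstdev(data))     # 0.816...: the square root of the variance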


TIME SERIES DATA:

A time series is a sequence of numerical data points in successive order. In investing, a time series
tracks the movement of the chosen data points, such as a security’s price, over a specified period
of time with data points recorded at regular intervals. There is no minimum or maximum amount
of time that must be included, allowing the data to be gathered in a way that provides the
information being sought by the investor or analyst examining the activity.

Time series analysis helps us understand the underlying forces that lead to a particular
trend in the time series data points, and it helps us forecast and monitor the data points by
fitting appropriate models to them.

EXAMPLE:
Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value
of the Dow Jones Industrial Average.

Cross Sectional Data:


Cross-sectional data is data collected at the same point in time for several individuals or units.
The point in time could be a single day, for example, the percentage changes in the stock prices
of several companies on January 29, 2008. Cross-sectional data can also cover a single week,
month, or year.

Examples:
Examples of cross-sectional data could be opinion polls, income distribution, GDP per capita etc.

PANEL DATA:

Panel data, also known as longitudinal data or cross-sectional time series data in some special
cases, is data that is derived from a (usually small) number of observations over time on a (usually
large) number of cross-sectional units like individuals, households, firms, or governments.
In the disciplines of econometrics and statistics, panel data refers to multi-dimensional data that
generally involves measurements over some period of time. As such, panel data consists of a
researcher's observations of numerous phenomena that were collected over several time periods
for the same group of units or entities.

EXAMPLE:

Examples include estimating the effect of education on income, with data across time and
individuals; and estimating the effects of income on savings, with data across years and countries.

Pooled Data:
Pooled data is a mixture of time series data and cross-sectional data. Pooling can refer to
combining raw data, but it can also refer to combining information rather than the raw data.
One of the most common uses of pooling is in estimating a variance. If we believe that two
populations have the same variance, but not necessarily the same mean, then we can calculate
the two estimates of the variance from samples of the two groups and pool them (take a
weighted average) to get a single estimate of the common variance. We do not compute a
single estimate of the variance from the combined data, because if the means are not equal
then that will inflate the variance estimate.

Examples:
An example of pooled data is the GDP per capita of all Asian countries over ten
years.
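The weighted average mentioned above has a standard form: with sample sizes n1 and n2 and sample variances s1² and s2², the pooled estimate is [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2). A minimal Python sketch; the two samples are made up, chosen to have equal spread but different means so the inflation effect described above is visible:

import statistics

def pooled_variance(a, b):
    # Weighted average of the two sample variances,
    # weighting each by its degrees of freedom (n - 1).
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)

group1 = [10, 12, 11, 13, 14]     # same spread as group2, lower mean
group2 = [20, 22, 21, 23, 24]     # same spread as group1, higher mean
print(pooled_variance(group1, group2))       # 2.5: the common variance
print(statistics.variance(group1 + group2))  # 30.0: inflated by the gap between the means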
