
Suppose you have population data (in hundred thousands) for a medium-sized city over 20 years (based on a census every 5 years), as shown in Table 1. You want to predict the population in the year 2005. How do you predict that population? There are many ways to predict the future based on previous data. We are going to learn one of the most famous methods, called linear regression.

Let us plot the data above on a graph. Each point in the graph represents the data of one year. Since we have five data points, we have five points on the graph.

From the figure above, we can see that the five points seem to have a linear relationship (they could be modeled by a straight line). Can you guess a line that passes near those points? That line does not necessarily pass through any of the data points, but it should be the best model that we can use to represent the five points.

We have several proposals here. I will plot the three that seem best:

- Blue line (slash dot line)
- Red line (solid line)
- Green line (small dot line)

If we could find the "best line model", we could use it to predict the next point of the population, in the year 2005. The predicted point is simply the intersection of the vertical line that passes through the year 2005 and the "best line model". Which one of the three proposals is the "best line model"?

To determine the "best" line model, we need a criterion for saying that it is the best. We can come up with many criteria. Can you think of any criteria that we can use to say that a line is the best one?

- The best line model should pass through the most points of the data
- The best line model should pass through the first point of the data (the blue line or slash dot line does that)
- The best line model should be in the middle of all points of the data (perhaps the green line or small dot line does)

Each of the criteria above may seem good to one of you, but how do we know which of them is the best? We need an objective criterion rather than a subjective one that says "It is the best because I said so!" An objective criterion needs measurement. Perhaps we can measure the error (mistake) between the model (as a prediction of the data) and the real data points. The diagram below shows how we measure the error. When the data point is above the line model, we say that the error is positive, while if the line model is above the data point, we say the error is negative.

We need an agreement here on what makes our best line model really the optimum line. Some of you may say that what we need to do is to sum all the errors, and that the best line model is the one that minimizes the sum of errors. That seems a great idea, but it does not work because some errors are positive and some are negative, so they cancel each other. The sum of errors may be zero for many different lines. How do we solve this problem?

The most common agreement among scientists, engineers, mathematicians and statisticians is the criterion that the best line model is the one that minimizes the sum of squared errors. When we square an error, regardless of whether it is positive or negative, the number becomes positive.
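A small sketch of why the plain sum of errors fails as a criterion. It uses the five census points from this tutorial and two candidate lines chosen for illustration (both are assumptions of this sketch): the sloped line that the tutorial later reports as the best fit, and a flat line at the mean population. Both give a sum of errors of about zero, and only the sum of squared errors tells them apart.

```python
# Five census points used throughout this tutorial (population in hundred thousands).
years = [1980, 1985, 1990, 1995, 2000]
population = [2.1, 2.9, 3.2, 4.1, 4.9]


def errors(predict):
    """Signed errors: data minus model, so a point above the line is positive."""
    return [y - predict(x) for x, y in zip(years, population)]


def sloped_line(x):
    # Illustrative candidate: the line this tutorial later reports as the best fit.
    return 0.136 * x - 267.2


def flat_line(x):
    # Illustrative candidate: a horizontal line at the mean population, 3.44.
    return 3.44


results = {}
for name, line in [("sloped", sloped_line), ("flat", flat_line)]:
    e = errors(line)
    results[name] = (sum(e), sum(v * v for v in e))
    print(f"{name}: sum of error = {abs(results[name][0]):.3f}, "
          f"sum of square error = {results[name][1]:.3f}")
```

The positive and negative errors cancel for both lines, so the plain sum cannot rank them, while the squared errors differ by a wide margin.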

Now we come back to our data. We measure the sum of squared errors for the three proposal lines, and the results are shown in the table below. We find that the red line gives the minimum sum of squared errors (= 0.09) among the three proposals.

Year   Population data      Blue line or      Red line or    Green line or
       (hundred thousand)   slash dot line    solid line     small dot line
1985   2.9                  2.90 (0.00)       2.76 (0.02)    2.60 (0.09)
1990   3.2                  3.70 (0.25)       3.44 (0.06)    3.54 (0.12)
1995   4.1                  4.50 (0.16)       4.12 (0.00)    4.48 (0.14)
2000   4.9                  5.30 (0.16)       4.80 (0.01)    5.42 (0.27)

SSE = sum of squared errors; the value in parentheses is the squared error for that year.
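The comparison in the table can be checked with a short sketch. It uses only the four rows printed in the table (1985 to 2000), so the totals are for those rows only; the function and variable names are choices of this sketch.

```python
# Rows as printed in the table above (1985-2000); population in hundred thousands.
data = [2.9, 3.2, 4.1, 4.9]
proposals = {
    "blue": [2.90, 3.70, 4.50, 5.30],
    "red": [2.76, 3.44, 4.12, 4.80],
    "green": [2.60, 3.54, 4.48, 5.42],
}


def sum_square_error(observed, predicted):
    """Add up the squared differences between the data and the line values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


sse = {name: sum_square_error(data, line) for name, line in proposals.items()}
best = min(sse, key=sse.get)

for name, value in sse.items():
    print(f"{name}: SSE over the printed rows = {value:.2f}")
print("smallest SSE:", best)
```

Over these rows the red line has by far the smallest total, which agrees with the conclusion in the text.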

In the procedure above, we proposed several candidate line models by guessing them, and then we found the best one based on the criterion of minimizing the sum of squared errors. The problem is in the guessing of the line. Can we obtain the best line model without guessing it?

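Fortunately, the answer is yes. The best line model can be computed using the formula of linear regression. We denote $\bar{x}$ as the mean of $x$ and $\bar{y}$ as the mean of $y$. The notation $\hat{y}$ is the estimate of $y$ that is represented by the best line model. The linear regression formula is

$$\hat{y} = a + b x, \qquad b = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sum x^2 - n\,\bar{x}^2}, \qquad a = \bar{y} - b\,\bar{x}$$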

Now we can predict the population of the city for the year 2005. By inputting 2005 as the year, we get population = 5.48 (times one hundred thousand persons).
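A minimal sketch of the computation, assuming the usual closed form for simple linear regression: slope = (sum of xy - n * mean of x * mean of y) / (sum of x squared - n * mean of x squared), and intercept = mean of y - slope * mean of x. The function name `linear_regression` is a choice of this sketch, and the data are the five census points used in this tutorial.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (sum_xy - n * mean_x * mean_y) / (sum_xx - n * mean_x ** 2)
    intercept = mean_y - slope * mean_x
    return slope, intercept


years = [1980, 1985, 1990, 1995, 2000]
population = [2.1, 2.9, 3.2, 4.1, 4.9]  # hundred thousands

slope, intercept = linear_regression(years, population)
prediction_2005 = intercept + slope * 2005

print(f"slope = {slope:.3f}, intercept = {intercept:.1f}")
print(f"predicted population in 2005 = {prediction_2005:.2f} (hundred thousands)")
```

The prediction is the point where the fitted line meets the vertical line through the year 2005.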

The purpose of modeling is to find the best model that can represent your data. Suppose you have a regression formula $\hat{y} = a + b x$ as the best line model. How can we be sure that the best model is linear? In other words, how well does the data fit our model? There is an unlimited number of possible models aside from the linear model. Our data may be better represented by a curvilinear or non-linear model.

The first step is to inspect the data visually by plotting it. Use the independent variable as the x-axis and the dependent variable as the y-axis. This plot will give you an idea of what type of model you may use as the best-fit model for your data. Modeling is quite an art, in that we need to "guess" what the best model is. If the plot shows that the data is not linear, you must try another type of model or another combination of variables. Do not force yourself to use a linear model when your data is non-linear!

Several indices can be used to examine the goodness of fit of the model. These indices must be used with care and with an understanding of their meaning. The most common indices are

1. R-squared
2. Adjusted R-squared
3. Standard Error
4. F statistics
5. t statistics

To say that your model fits, you need to show that those indices meet the criteria. Below is a brief discussion of these indices together with the criteria.

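R-squared is also called the coefficient of determination. It is the proportion of variation explained by the best line model. It depends on the ratio of the sum of squared errors from the regression model (SSE) to the sum of squared differences around the mean (SST = sum of squares total):

$$R^2 = 1 - \frac{SSE}{SST}$$

where $SSE = \sum (y_i - \hat{y}_i)^2$ and $SST = \sum (y_i - \bar{y})^2$.

However, SST and SSE are not measures of variance. To use the proportion of variances, we need to average the sums of squares. As a result we have the adjusted R-squared

$$R^2_{adj} = 1 - \frac{SSE / (n - k)}{SST / (n - 1)}$$

where $n$ is the number of samples and $k$ is the number of coefficients in the model. Obviously, the higher the R-squared, the better the model fits the data; its value ranges from 0 (no fit) to 1 (perfect fit). If your R-squared is less than 0.5, it is recommended that you consider another type of model rather than a linear model.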
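As a hedged sketch, the two indices can be computed for the five census points of this tutorial. It assumes the standard definitions R-squared = 1 - SSE/SST and adjusted R-squared = 1 - (SSE/(n - k)) / (SST/(n - 1)), with k = 2 coefficients (intercept and slope), and it takes the fitted line reported in this tutorial (slope 0.136, intercept -267.2) as given.

```python
years = [1980, 1985, 1990, 1995, 2000]
population = [2.1, 2.9, 3.2, 4.1, 4.9]  # hundred thousands
slope, intercept = 0.136, -267.2         # fitted line reported in this tutorial

n = len(years)
k = 2  # coefficients in the model: intercept and slope
mean_y = sum(population) / n

# SSE: squared errors around the fitted line; SST: squared differences around the mean.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, population))
sst = sum((y - mean_y) ** 2 for y in population)

r_squared = 1 - sse / sst
adjusted_r_squared = 1 - (sse / (n - k)) / (sst / (n - 1))

print(f"SSE = {sse:.3f}, SST = {sst:.3f}")
print(f"R-squared = {r_squared:.4f}, adjusted R-squared = {adjusted_r_squared:.4f}")
```

Both values are well above the 0.5 rule of thumb, so a linear model is a reasonable choice for this data.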

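Standard Error is another index that is often used for the goodness of fit of the model:

$$SE = \sqrt{\frac{SSE}{n - k}}$$

Another index for the goodness of fit of the model is the F statistic,

$$F = \frac{(SST - SSE) / (k - 1)}{SSE / (n - k)}$$

which comes from the analysis of variance (ANOVA) table:

Source              Sum of squares    Degrees of freedom
Regression          SST - SSE         k - 1
Residual (Error)    SSE               n - k
Total               SST               n - 1

If the R-squared approaches one, the value of the standard error approaches zero and the value of the F statistic goes to infinity. The F statistic is compared with the F value from the F distribution with degrees of freedom ($k - 1$, $n - k$).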
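A short sketch of these two indices for the tutorial's data, assuming the standard definitions SE = sqrt(SSE / (n - k)) and F = ((SST - SSE) / (k - 1)) / (SSE / (n - k)) with k = 2 coefficients. The fitted line (slope 0.136, intercept -267.2) is taken from this tutorial; the variable names are choices of this sketch.

```python
import math

years = [1980, 1985, 1990, 1995, 2000]
population = [2.1, 2.9, 3.2, 4.1, 4.9]  # hundred thousands
slope, intercept = 0.136, -267.2         # fitted line reported in this tutorial

n = len(years)
k = 2  # coefficients in the model: intercept and slope
mean_y = sum(population) / n

sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, population))
sst = sum((y - mean_y) ** 2 for y in population)

# Residual mean square is SSE / (n - k); regression mean square is (SST - SSE) / (k - 1).
standard_error = math.sqrt(sse / (n - k))
f_statistic = ((sst - sse) / (k - 1)) / (sse / (n - k))

print(f"standard error = {standard_error:.4f}")
print(f"F statistic = {f_statistic:.2f} with ({k - 1}, {n - k}) degrees of freedom")
```

A small standard error and a large F statistic go together here, as the text describes for an R-squared close to one.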

You may allow some degree of error for your model, which should be quite small. This degree of error is called the significance level, denoted by $\alpha$. For many practical purposes, we use $\alpha$ = 5%. If the significance of the F statistic is less than 0.05, the model is said to fit well. Since the three indices are related to each other, for practical purposes we often use only R-squared as the index to represent the goodness of fit of the model.

While the other four indices above represent the fitness of the model, the t statistics explain the fitness of each model parameter. If the absolute value of the t statistic of a parameter is less than the critical value of the t distribution with $n - 2$ degrees of freedom at significance level $\alpha$, that parameter cannot explain the model well. For practical purposes, when you have more than 30 samples ($n > 30$), we can use the value of the Normal distribution to approximate the t distribution. For significance level $\alpha$ = 0.05, you may use a threshold of 1.96. Thus, the absolute value of the t statistic must be greater than 1.96.
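The t statistics can be sketched for the tutorial's data as well. The standard-error formulas used here are the usual ones for simple linear regression and are not spelled out in the text, so treat them as an assumption of this sketch: se(slope) = SE / sqrt(Sxx) and se(intercept) = SE * sqrt(1/n + mean_x^2 / Sxx), where Sxx is the sum of squared deviations of x from its mean.

```python
import math

years = [1980, 1985, 1990, 1995, 2000]
population = [2.1, 2.9, 3.2, 4.1, 4.9]  # hundred thousands
slope, intercept = 0.136, -267.2         # fitted line reported in this tutorial

n = len(years)
mean_x = sum(years) / n
sxx = sum((x - mean_x) ** 2 for x in years)

sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, population))
standard_error = math.sqrt(sse / (n - 2))

# Standard errors of the two coefficients (assumed textbook formulas).
se_slope = standard_error / math.sqrt(sxx)
se_intercept = standard_error * math.sqrt(1 / n + mean_x ** 2 / sxx)

t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print(f"t statistic of slope = {t_slope:.2f}")
print(f"t statistic of intercept = {t_intercept:.2f}")
```

Both absolute values exceed 1.96. Because this example has only five samples, the exact t distribution critical value for n - 2 = 3 degrees of freedom (about 3.18 at the 0.05 level) is the stricter comparison, and both statistics still pass it.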

In the next sections, you may see how to obtain our best line model using the linear regression formula by hand calculation or spreadsheet. If you would rather apply the formula without worrying about how to compute it, check how you could do it with just a few clicks and a little typing using Microsoft Excel.

Suppose we have the following 5 data points and we want to predict the population for the year 2005 using a linear regression model. In this section, we will use hand calculation or a spreadsheet with the linear regression formula.

To use that formula, we make a table that consists of four columns. The first two columns are the data: Year as $x$ and Population as $y$. The third column is the product $xy$ for each row, and the last column is the square of $x$. You can use an MS Excel spreadsheet to ease your hand computation.

          x       y            xy        x^2        square error   sq. mean difference
          Year    Population
          1980    2.1          4158      3920400    0.0004         1.80
          1985    2.9          5756.5    3940225    0.0196         0.29
          1990    3.2          6368      3960100    0.0576         0.06
          1995    4.1          8179.5    3980025    0.0004         0.44
          2000    4.9          9800      4000000    0.0100         2.13
sum       9950    17.2         34262     19800750   0.0880         4.71
average   1990    3.44
count     n = 5

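We sum all five rows, and for the first two columns we calculate the average of the year as $\bar{x} = 1990$ and the average of the population as $\bar{y} = 3.44$. Since we have 5 data points, $n = 5$, and then

$$b = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sum x^2 - n\,\bar{x}^2} = \frac{34262 - 5 \cdot 1990 \cdot 3.44}{19800750 - 5 \cdot 1990^2} = \frac{34}{250} = 0.136$$

$$a = \bar{y} - b\,\bar{x} = 3.44 - 0.136 \cdot 1990 = -267.2$$

so the regression line is $\hat{y} = 0.136\,x - 267.2$.

Using this regression line, we can predict the population of the city for the year 2005. By inputting 2005 as the year, we get population = 5.48 (times one hundred thousand persons).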
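The hand-calculation table can be rebuilt step by step in a few lines. This is a sketch that mirrors the xy and x^2 columns and their sums, assuming the slope is (sum of xy - n * mean x * mean y) / (sum of x^2 - n * mean x squared) and the intercept is mean y - slope * mean x; the variable names are choices of this sketch.

```python
years = [1980, 1985, 1990, 1995, 2000]
population = [2.1, 2.9, 3.2, 4.1, 4.9]  # hundred thousands

# One row per census year: x, y, xy, x^2 (the four columns of the table).
rows = [(x, y, x * y, x * x) for x, y in zip(years, population)]

n = len(rows)
sum_x = sum(r[0] for r in rows)
sum_y = sum(r[1] for r in rows)
sum_xy = sum(r[2] for r in rows)
sum_xx = sum(r[3] for r in rows)
mean_x = sum_x / n
mean_y = sum_y / n

slope = (sum_xy - n * mean_x * mean_y) / (sum_xx - n * mean_x ** 2)
intercept = mean_y - slope * mean_x

print(f"{'x':>6} {'y':>6} {'xy':>10} {'x^2':>10}")
for x, y, xy, xx in rows:
    print(f"{x:>6} {y:>6.1f} {xy:>10.1f} {xx:>10}")
print(f"{sum_x:>6} {sum_y:>6.1f} {sum_xy:>10.1f} {sum_xx:>10}  (sum)")
print(f"slope = {slope:.3f}, intercept = {intercept:.1f}")
```

Printing the intermediate sums makes it easy to compare each column with a spreadsheet before trusting the final slope and intercept.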

Microsoft Excel has the built-in functions SLOPE and INTERCEPT to calculate the slope and intercept of a linear regression.

Suppose we have the following 5 data points and we want to predict the population data for the

year 2005 using linear regression model.

We put the independent variable Year as $x$ and the dependent variable Population as $y$. The systematic procedure is as follows:

1. We put the data in a two-column table in the spreadsheet, as shown in the figure below. For each column, we give the array a name using the MS Excel menu Insert - Name - Define. For example, the names are Year and Population.

2. To compute the slope of the linear regression line using the MS Excel function, type =SLOPE(Population, Year) as shown in the figure below.

3. To determine the intercept of the linear regression line, type =INTERCEPT(Population, Year) as shown in the figure below.

4. To predict the population (Y value) for the year 2005, we can name the slope, intercept and next year cells (using Insert - Name - Define as in step 1) and then compute the prediction = next_year * slope + intercept.

We get the same results as the manual computation: the slope is 0.136, the intercept is -267.2, and the predicted population for the year 2005 is 5.48 (times one hundred thousand people).

Microsoft Excel has an excellent Data Analysis menu, and one of its tools is regression.

To use the Data Analysis menu (if your MS Excel does not have it yet), you need to install the add-in: go to menu Tools - Add-Ins, check the Analysis ToolPak, and then press the OK button. The next time you open the Tools menu, you will see Data Analysis at the bottom of the Tools menu.

1. Click menu Tools - Data Analysis and you will see the Data Analysis dialog. Scroll down and you will see Regression. Select Regression and click the OK button.

2. The Regression dialog will pop up and you can fill in the Input Y Range and Input X Range. If you have defined names for your data (e.g. Population and Year) using menu Insert - Name - Define, you can just type the names of your data instead of selecting the ranges. When you click the OK button in the Regression dialog, MS Excel will compute the regression for you.

Suppose we have the following 5 data points and we want to predict the population data for the

year 2005 using linear regression model.

Here is the result of the Data Analysis Regression tool. There are many numbers in it, and I will tell you only what is important.

First, you need to examine the Coefficients. These are the intercept and the slope of the regression analysis. The slope is 0.136 and the intercept is -267.2 for the 5 observations above. Thus, your regression model is $\hat{y} = 0.136\,x - 267.2$.

Next, your regression model needs to pass all three criteria below:

1. The R-squared must be larger than 0.5
2. The Significance F (from ANOVA) must be smaller than 0.05
3. The absolute value of the t statistics must be larger than 1.96 for $\alpha$ = 0.05 and larger than 1.645 for $\alpha$ = 0.10

If your model fails any of the three criteria above, you should plot your data and check whether it really follows a linear model (could be modeled by a straight line). If your data is not linear, you may transform it to a linear model, or you should use non-linear regression instead. The explanation of why these criteria are important is quite long, and I would rather ask you to read a statistics book for it.

For our example, it seems that our model passes all the criteria above. Using this regression model, we can predict the population of the city for the year 2005. By inputting 2005 as the year, we get population = 5.48 (times one hundred thousand persons).

After you plot your data in, say, an XY type graph, you can also get the regression line directly from the graph. Here is the procedure:

1. I assume you already have the data and have plotted it as an XY type chart. Suppose the data is only five points of population over 20 years, gathered every 5 years, as shown in the figure below.

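2. Click on any data point. Then right-click and a pop-up menu will appear. Select Add Trendline.

3. Choose the Linear trend type.

4. Go to the Options tab by clicking on it. Check "Display equation on chart" and "Display R-squared value on chart", then click the OK button.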

5. The regression line, together with the regression equation and the R-squared value, will appear on the chart.

This trend line is quite dynamic. If you change your data, the trend line (and the regression equation) will also change automatically.
