Mar 25, 2017

Using the kings county dataset you have to perform analysis on the data

© All Rights Reserved

Assignment 1

Due Date: 27th March 2017

question)

You are given a file with 26,000+ entries and many columns (features). You can
open this file in Excel, but try not to change it in any way. We will use MATLAB
to do whatever we need to the data.

The coding tasks below are also marked as To Do comments in the MATLAB code, so
you know where to make your changes. You also need to write answers to a few
questions; put those in a Word document. Your submission consists of both the
completed MATLAB code and the Word document.

Part 1)

Load the contents of the file into a variable named data, using the csvread
command. If you use the load command instead, it will give an error.

Why does the load command give an error? Can you find out what the problem is
with the given data? Be specific to the data set: if some column, row, cell, or
format is causing the error, identify it. (Write the answer in your Word
document.)

You can try things out in the command window rather than changing the code in
the file.
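
As a minimal sketch of how csvread behaves, the snippet below writes a small, purely numeric CSV to a temporary file and reads it back, once in full and once from a row offset. The temporary file and its contents are illustrative stand-ins, not the assignment's data set, and nothing here is meant to answer the question about load.

```matlab
% Create a small numeric CSV in a temporary location (hypothetical stand-in data).
tmpfile = [tempname() '.csv'];
fid = fopen(tmpfile, 'w');
fprintf(fid, '1,2,3\n4,5,6\n7,8,9\n');
fclose(fid);

% Read the whole file into a numeric matrix.
data = csvread(tmpfile);

% Read again, starting at row offset 1 and column offset 0 (offsets are
% zero-based), which skips the first row of the file.
tail = csvread(tmpfile, 1, 0);

disp(size(data));   % number of rows and columns in the full matrix
disp(tail);         % the file contents without its first row

delete(tmpfile);    % clean up the temporary file
```

Trying calls like these in the command window is a quick way to see what a reader function returns before touching the assignment code.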

Part 2)

Ideally, data would just come in and we would hand it to a function to process.
However, life is not that simple. Too often we have to deal with some of the
following issues:

a. Some rows or columns are not needed, extra, or just plain garbage.
b. There are too many missing values in some of the columns for them to be
included.
c. There are garbage (erroneous) values.

Not all of the above apply to the current data set, but some might. So go ahead
and clean the data. Note that opening home_data.csv in WordPad/Notepad, rather
than Excel, might help in figuring out what is going on.

All of the cleaning you want to do should be done in MATLAB via code. That is:
DO NOT CHANGE THE CSV FILE. IF YOU WANT TO TAKE OUT SOME ROWS/COLUMNS ETC., DO
IT PROGRAMMATICALLY IN MATLAB.
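
A minimal sketch of programmatic cleaning on a toy matrix. It assumes, purely for illustration, that missing values appear as NaN, that the first column is an unneeded identifier, and that negative values are erroneous. Which rows or columns of the real data need removing is for you to determine.

```matlab
% Toy matrix (made-up values). Column 1 is a hypothetical ID column,
% NaN marks a missing value, and -1 is treated as erroneous (assumption).
raw = [101  3  1200  250;
       102  2   NaN  180;
       103  4  1500   -1;
       104  3  1100  210];

cleaned = raw;
cleaned(:, 1) = [];                        % drop an unneeded column
cleaned(any(isnan(cleaned), 2), :) = [];   % drop rows with missing values
cleaned(any(cleaned < 0, 2), :) = [];      % drop rows with negative values

disp(cleaned);
```

Assigning the empty matrix `[]` to a row or column selection deletes it from the in-memory variable only, so the CSV file on disk is left untouched.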

Part 3)

By now you have valid data to work with (quite similar to what we had in
data1.txt during the labs). Now you need to decide which features to use. We
will assume we are using all the features except the last column, because that
is the value we are trying to predict.

So set X to all the columns except the last, and set y to the last column.

You are welcome to try other possibilities, such as taking only some of the
features and seeing how that changes the answer.
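
The split described above can be sketched with MATLAB's `end` indexing. The small matrix is a made-up stand-in for the cleaned data, and `X_subset` is a hypothetical name for the "only some of the features" variant.

```matlab
% Toy cleaned data: the last column is the value to predict (made-up values).
data = [3 1200 250;
        2  900 180;
        4 1500 320];

X = data(:, 1:end-1);   % all columns except the last
y = data(:, end);       % the last column only

% Hypothetical alternative: keep only a chosen subset of feature columns.
X_subset = data(:, 2);

disp(size(X));
disp(size(y));
```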

Part 4)

Next you will see a part of the code that initializes theta, numbiters and
alpha.

Note that we are using only 20 iterations. This is because we want to try
various values of alpha, starting from 1.

Plot the errors using alpha = 1, 0.5, 0.25, 0.1. (The code for plotting is
already given.) Which of the alphas are bad? Why? Paste the 4 plots into your
Word file.

Now set alpha to your favorite value from above, and increase the number of
iterations from 20 to 200.
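
For orientation, here is a minimal sketch of batch gradient descent that records the error after every iteration for each alpha. It runs on made-up toy data with an already well-scaled feature, so it says nothing about which alphas behave badly on the real data set. The name `J_history` and the 1/(2m) error convention are assumptions; the provided code may be organized differently.

```matlab
% Toy data generated from y = 1 + 2*x, with a column of ones for the intercept.
x = [-1; -0.5; 0; 0.5; 1];
X = [ones(5, 1), x];
y = 1 + 2 * x;
m = length(y);

numbiters = 20;
alphas = [1, 0.5, 0.25, 0.1];
J_history = zeros(numbiters, numel(alphas));   % one column of errors per alpha

for a = 1:numel(alphas)
    alpha = alphas(a);
    theta = zeros(2, 1);                       % restart from zero for each alpha
    for k = 1:numbiters
        grad  = (X' * (X * theta - y)) / m;    % gradient of the squared error
        theta = theta - alpha * grad;          % gradient descent update
        J_history(k, a) = sum((X * theta - y) .^ 2) / (2 * m);
    end
end

disp(J_history(end, :));   % final error reached by each alpha
```

Each column of `J_history` is the curve you would plot against the iteration number. A curve that grows or oscillates instead of shrinking is the usual sign that alpha is too large for the data.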

Part 5)

There is some good news, though. If the number of features is small (and we
treat even 10,000 as small), we can avoid Gradient Descent for Linear
Regression altogether, because there is a closed-form formula for the best
values of theta. It is:

$$\theta = (X^T X)^{-1} X^T y$$

Here X is the matrix of features that you are using, and y is the desired
output. We might discuss its derivation in class later. For now, what you need
to know is that this is the theta that minimizes the error as much as is
possible with the given training data.

Compute theta2 in MATLAB using the above equation. Does your gradient descent
give the same error as the Normal Equation? (If the answers are very different,
something is wrong.)
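
A minimal sketch of the Normal Equation on made-up toy data. The backslash operator solves the same linear system as multiplying by the inverse of X'X, without forming the inverse explicitly. The error expression uses the 1/(2m) convention, which is an assumption about how computeError is defined.

```matlab
% Toy data generated from y = 1 + 2*x, with a column of ones for the intercept.
x = [0; 1; 2; 3];
X = [ones(4, 1), x];
y = 1 + 2 * x;

% Normal Equation: theta2 = (X'X)^(-1) X'y, solved as a linear system.
theta2 = (X' * X) \ (X' * y);

% Squared error of the fit (1/(2m) convention assumed here).
m = length(y);
err2 = sum((X * theta2 - y) .^ 2) / (2 * m);

disp(theta2);
disp(err2);
```

Because the toy targets lie exactly on a line, the recovered theta2 matches the generating coefficients and the error is zero up to rounding. On real data the error will be positive, and gradient descent should approach it as the number of iterations grows.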

Part 6)

Look at the resulting values of theta. Which are the two most important
features in determining the house price? Which are the two least important
features? (Write this in your Word file.)

You might think that Linear Regression limits the hypothesis to a line. For
example, we are looking for theta such that we predict:

$$y_{\text{predicted}} = x_1 \theta_1 + x_2 \theta_2 + \dots + x_n \theta_n$$

But what if we thought that the relationship was non-linear? For example, what
if we believe that the output should also depend on $(x_1)^2$? Then we should
have:

$$y_{\text{predicted}} = x_1 \theta_1 + x_2 \theta_2 + \dots + x_n \theta_n + x_1^2 \theta_{n+1}$$

Well, the happy news is that Linear Regression can easily adapt to this
situation. All you need to do is add one more feature to X by adding one more
column. In the example above, the new column would be the square of another
column.

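For example, if

$$X = \begin{bmatrix} 1 & 2 & 7 \\ 1 & 3 & 9 \\ 1 & 5 & 3 \end{bmatrix}$$

and we believe that we should include a new feature which is the square of the
3rd feature in the matrix, then we have to make:

$$X = \begin{bmatrix} 1 & 2 & 7 & 49 \\ 1 & 3 & 9 & 81 \\ 1 & 5 & 3 & 9 \end{bmatrix}$$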

So the new features (whether they are logs, square roots, or squares of an
existing feature) are just added as columns. And now, when we do Linear
Regression on this new X, we are actually doing Polynomial Regression! Great!
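
In MATLAB, the matrix example above amounts to a single concatenation. This is a small sketch using the same numbers:

```matlab
X = [1 2 7;
     1 3 9;
     1 5 3];

% Append the element-wise square of the 3rd column as a new 4th column.
X = [X, X(:, 3) .^ 2];

disp(X);
```

The `.^` operator squares each entry separately, whereas `^` without the dot would attempt a matrix power.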

Now open the csv file in Excel so you can read the titles of the features.
Which features do you think are so important that even a small change to them
can affect the price of the house a lot? How about squaring them and adding
those columns to X?

Do this for a couple of features (that is, add a couple more columns to X
before doing Normalization). Does the error reduce somewhat?
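
The ordering mentioned above (add columns first, normalize afterwards) can be sketched as follows. The z-score normalization is an assumption about what the provided code does, the feature values are made up, and the choice of column to square is arbitrary.

```matlab
% Toy feature matrix without an intercept column (made-up values).
X = [3 1200;
     2  900;
     4 1500;
     3 1100];

% Step 1: append the square of a chosen feature (column 2, an arbitrary choice).
X = [X, X(:, 2) .^ 2];

% Step 2: normalize every column to zero mean and unit standard deviation.
mu     = mean(X);
sigma  = std(X);
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);

% Step 3: add the intercept column of ones after normalizing, since a constant
% column has zero standard deviation and cannot be z-scored.
X_norm = [ones(size(X_norm, 1), 1), X_norm];

disp(X_norm);
```

Squaring before normalizing matters because squared values are on a much larger scale than the original feature. Normalizing afterwards puts the new column on the same footing as the others for gradient descent.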

Even after the improvements, you will notice that the root mean squared error
still looks a bit large. But suppose this is the best we can do. Now we need to
sell this to our boss.

How would you argue that this error is bearable? What would you say to make it
sound as good as you can?

1) What is the average price of a house in the data? How does the size of the
error compare to the average price?
2) How would you include the word "outliers" in your convincing speech?
3) Are there any other error measures (such as the mean absolute error) that
may make it look better? NOTE: if you are checking another error metric, do
NOT change the computeError function, because our gradient descent works only
with mean squared error. Just compute the new kind of error separately after
you have obtained the final values of theta.

Write your convincing argument (with any numbers that you might have) in your
Word document.
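
A minimal sketch of computing alternative error figures separately from computeError. The prices and predictions are made-up numbers; in the assignment, the predictions would come from your final theta (for example `X * theta`). All variable names here are hypothetical.

```matlab
% Made-up actual prices and predictions, for illustration only.
y      = [100; 200; 300; 400];
y_pred = [110; 190; 300; 440];

residual  = y_pred - y;
rmse      = sqrt(mean(residual .^ 2));   % root mean squared error
mae       = mean(abs(residual));         % mean absolute error
avg_price = mean(y);                     % average price in the data

% Errors expressed as a fraction of the average price.
rel_rmse = rmse / avg_price;
rel_mae  = mae / avg_price;

fprintf('RMSE: %.2f (%.1f%% of average price)\n', rmse, 100 * rel_rmse);
fprintf('MAE:  %.2f (%.1f%% of average price)\n', mae, 100 * rel_mae);
```

The mean absolute error is never larger than the root mean squared error on the same residuals. The gap between the two widens when a few predictions are far off, which is where a remark about outliers fits naturally.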
