
Why a Pro/Con List is 75% as Good as Your Fancy Machine Learning Algorithm (https://www.chrisstucchio.com/blog/2014/equal_weights.html)

Wed 25 June 2014 / linear regression (https://www.chrisstucchio.com/tag/linear-regression.html) / statistics (https://www.chrisstucchio.com/tag/statistics.html) / unit-weighted regression (https://www.chrisstucchio.com/tag/unit-weighted-regression.html) / decisionmaking (https://www.chrisstucchio.com/tag/decisionmaking.html)

I'm currently dating two lovely women, Svetlana and Elise. Unfortunately, continuing to date both of them is unsustainable, so I must choose one.


In order to make such a choice, I wish to construct a ranking function - a function which takes as input the characteristics of a woman and returns as output a single number. This ranking function is meant to approximate my utility function (http://en.wikipedia.org/wiki/Utility) - a higher number means that by making this choice I will be happier. If the ranking closely approximates utility, then I can use the ranking function as an effective decisionmaking tool.

In concrete terms, I want to build a function f : Women → ℝ which approximately predicts my happiness. If f(Svetlana) > f(Elise) I will choose Svetlana, and vice versa if the reverse inequality holds.

One of the simplest procedures for building a ranking function dates back to 1772, and was described by Benjamin Franklin (http://www.procon.org/view.backgroundresource.php?resourceID=1474):

...my Way is, to divide half a Sheet of Paper by a Line into two Columns, writing over the one Pro, and over the other Con. Then...I put down under the different Heads short Hints of the different Motives...I find at length where the Ballance lies...I come to a Determination accordingly.

The mathematical name for this technique is unit-weighted regression (http://en.wikipedia.org/wiki/Unit-weighted_regression), and the more commonplace name is a pro/con list.


I present the method in a slightly different format - in each column a different choice is listed. Each row represents a characteristic, all of which are pros. A con is transformed into a pro by negation - rather than treating "Fat" as a con, I treat "Not Fat" as a pro. If one of the choices possesses the characteristic under discussion, a +1 is assigned to the relevant row/column, otherwise 0 is assigned:

                  Elise   Svetlana
    Smart           0       +1
    Great Legs     +1       +1
    Black          +1        0
    Rational        0        0
    Exciting       +1        0
    Not Fat        +1       +1
    Lets me work    0        0

Unit-weighted regression consists of taking the values in each column and adding them up. Each value is either zero or one. The net result is that f(Elise) = 4 and f(Svetlana) = 3. Elise it is!
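The whole method fits in a few lines of code. A minimal sketch in Python, using the feature values from the table above (the dictionary keys are just my labels for the rows):

```python
# Unit-weighted regression: a choice's score is the sum of its 0/1 features.
def unit_weighted_score(features):
    return sum(features.values())

# Feature values from the pro/con table (1 = has the pro, 0 = doesn't).
elise    = {"smart": 0, "great_legs": 1, "black": 1, "rational": 0,
            "exciting": 1, "not_fat": 1, "lets_me_work": 0}
svetlana = {"smart": 1, "great_legs": 1, "black": 0, "rational": 0,
            "exciting": 0, "not_fat": 1, "lets_me_work": 0}

print(unit_weighted_score(elise))     # 4
print(unit_weighted_score(svetlana))  # 3
```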

A pro/con list is one of the simplest ranking algorithms you can construct. The mathematical sophistication required is grade school arithmetic, and it's so easy to program that even a RoR hipster could do it. As a result, you should not hesitate to implement a pro/con list for decision processes.

The key factor in the success of unit-weighted regression is feature selection. The rule of thumb here is very simple - choose features which you have good reason to believe are strongly predictive of the quantity you wish to rank. Such features can usually be determined without a lot of data - typically a combination of expert opinion and easy correlations is sufficient. For example, I do not need a large amount of data to determine that I consider "not fat" to be a positive predictor of my happiness with a woman.

Conversely, if the predictiveness of a feature is not clear, it should not be used.

Anyone who has read one of the many (http://www.amazon.com/gp/product/0596529325/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0596529325&linkCode=as2&tag=christuc-20&linkId=SBLFGHS3FMO4V34J) good (http://www.amazon.com/gp/product/0387310738/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0387310738&linkCode=as2&tag=christuc-20&linkId=XP3TXBAHMPIFWUPK) books (http://www.amazon.com/gp/product/0387848576/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0387848576&linkCode=as2&tag=christuc-20&linkId=ME7GPF4I6NS27XDL) on machine learning can probably name several fancy machine learning techniques - neural networks, decision trees, etc. And they are probably asking why I would ever use unit-weighted regression, as opposed to one of these techniques? Why not use linear regression, rather than forcing all the coefficients to be +1?

I'll be concrete, and consider the case of linear regression in particular. Linear regression is a lot like a pro/con list, except that the weight of each feature is allowed to vary. In mathematical terms, we represent each possible choice as a binary vector - for example:

Elise = [0, 1, 1, 0, 1, 1, 0]

Then the predictor function uses a set of weights h which can take on values other than +1:

f(x) = Σ_i h_i x_i

The individual weights h_i represent how important each variable is. For example, "Smart" might receive a weight of +3.3, "Not Fat" a weight of +3.1 and "Black" a weight of +0.9.

The weights can be determined with a reasonable degree of accuracy by taking past data and choosing the weights which minimize the difference between the "true" value and the approximate value - this is what least squares (http://en.wikipedia.org/wiki/Least_squares) does.
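As a sketch of what least squares does, here is a Python example; the data set and the "true" weights are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: 100 past data points, 7 binary features each.
X = rng.integers(0, 2, size=(100, 7)).astype(float)
true_h = np.array([3.3, 1.0, 0.9, 0.2, 1.5, 3.1, 0.4])  # hypothetical "true" weights
y = X @ true_h + rng.normal(scale=0.1, size=100)         # noisy observed utility

# Least squares: choose the weights h minimizing ||X h - y||^2.
h, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(h, 1))  # recovers something close to true_h
```

With plenty of data per feature the fitted weights land close to the true ones; the next paragraph explains why this breaks down when data is scarce.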

The difficulty with using a fancier learning tool is that it only works when you have sufficient data. To robustly fit a linear model, you'll need tens to hundreds of data points per feature. If you have too few data points, you run into a real danger of overfitting - building a model which accurately memorizes the past, but fails to predict the future. You can even run into this problem if you have lots of data points, but those data points don't represent all the features in question.

It also requires more programming sophistication to build, and more mathematical sophistication to recognize when you are running into trouble.

For the rest of this post I'll be comparing a Pro/Con list to Linear Regression (http://en.wikipedia.org/wiki/Linear_regression), since this will make the theoretical comparison tractable and keep the explanation simple. Let me emphasize that I'm not pushing a pro/con list as a solution to all ranking problems - I'm just pushing it as a nice simple starting point.

This is where things get interesting. It turns out that a Pro/Con list is at least 75% as good as a linear regression model.

Suppose we've done linear regression and found linear regression coefficients h. Suppose that instead of using the vector h, we use the uniform vector u = (1/N, ..., 1/N). Since rescaling all the weights by a positive constant doesn't change the ranking, ranking by u is the same as ranking by the pro/con list.

An error is made whenever the pro/con list and linear regression rank two vectors differently - i.e., linear regression says "choose Elise" while the pro/con list says "choose Svetlana". The error rate of the pro/con list is the probability of making an error given two random feature vectors x and y, i.e.:

error rate(h) = P(sign([h · (x − y)][u · (x − y)]) < 0)

Averaged over all vectors h, this error rate is bounded by 1/4. There are of course vectors h for which the error rate is higher, and others for which it is lower. But on average, the error rate is bounded by 1/4.

In this sense, the pro/con list is 75% as good as linear regression.

We can confirm this fact by computer simulation - generating a random ensemble of vectors h, and then measuring how accurately unit-weighted regression agrees with it. The result:

More concretely, I computed this graph via the following procedure. For every dimension N, I created a large number of vectors h by drawing them from the uniform Dirichlet Distribution (http://en.wikipedia.org/wiki/Dirichlet_distribution). This means that the vectors h satisfy

Σ_i |h_i| = 1.0

and

∀i, h_i ≥ 0.

The components of the feature vectors x and y were drawn as Bernoulli variables, taking the values 0 and 1 with probability of 50% each.
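That procedure can be sketched in Python; the sample sizes here are my own choices, smaller than whatever produced the original graph:

```python
import numpy as np

def disagreement_rate(N, num_h=200, num_pairs=500, seed=0):
    """Estimate how often a pro/con list ranks a random pair of binary
    feature vectors differently from a random linear model h."""
    rng = np.random.default_rng(seed)
    errors, total = 0, 0
    for _ in range(num_h):
        h = rng.dirichlet(np.ones(N))             # uniform on the unit simplex
        x = rng.integers(0, 2, size=(num_pairs, N))
        y = rng.integers(0, 2, size=(num_pairs, N))
        d = x - y
        true_score = d @ h                        # linear model's comparison of x vs y
        unit_score = d.sum(axis=1)                # pro/con list's comparison of x vs y
        errors += np.sum(true_score * unit_score < 0)
        total += num_pairs
    return errors / total

for N in (3, 5, 10, 20):
    print(N, disagreement_rate(N))
```

Ties (where either score is zero) are counted as agreements here, matching the strict inequality in the error-rate definition.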


Mathematical results

I don't know how to prove how closely a Pro/Con list approximates linear regression for binary feature vectors. However, if we assume that the feature vectors x and y are normally distributed (http://en.wikipedia.org/wiki/Normal_distribution) instead, I can prove the following theorem:

Theorem: Suppose h is drawn from a uniform Dirichlet distribution and x, y have components which are independent identically normally distributed variables. Then:

E[error rate(h)] < arctan(√((N − 1)/(N + 1)))/π ≤ 1/4

This means that averaged over all vectors h, the error rate is bounded by 1/4. There are of course individual vectors h with a higher or lower error rate, but the typical error rate is 1/4.

Unfortunately I don't know how to prove this is true for Bernoulli (binary) vectors x, y. Any suggestions would be appreciated.

If we run a Monte Carlo simulation, we can see that this theorem appears roughly correct:


In fact, the graph suggests the bound above is close to being exact. The theorem is proved in the appendix.

Let us consider a very simple, 3-dimensional example to build some intuition. In this example, h = [0.9, 0.05, 0.05] - a bit of an extreme case, but reasonable. In this example, what sorts of vectors x, y will result in unit-weighted regression disagreeing with the true ranking? Here is one example:

x = [1, 0, 0]

y = [0, 1, 1]

In this case, h · x = 0.9 while h · y = 0.1, so the true ranking prefers x; but the unit-weighted sums are 1 for x and 2 for y, so the pro/con list prefers y. Intuitively, what is happening here is that the vector h concentrates nearly all of its weight on a single feature, while the unit-weighted vector u spreads its weight equally - so two unimportant features can outvote the one feature that actually matters.
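Checking that arithmetic in Python:

```python
h = [0.9, 0.05, 0.05]
x = [1, 0, 0]
y = [0, 1, 1]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

print(dot(h, x), dot(h, y))  # 0.9 0.1 -> the true ranking prefers x
print(sum(x), sum(y))        # 1 2    -> the pro/con list prefers y
```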

Suppose for simplicity that the feature vectors x and y have independent normally distributed components. Define d = x − y, and note that (after normalizing the variance) d_i ~ N(0, 1). This is an unrealistic assumption, but one which is mathematically tractable. We want to compute the probability:

P(sign([h · d][u · d]) < 0) = 2 P(h · d > 0 and u · d < 0).

For simplicity, we will attempt to compute the latter quantity.

Define h = u + p, where p · u = 0. Then:

P(h · d > 0 and u · d < 0) = P(u · d + p · d > 0 and u · d < 0)

Note that d is generated by a multivariate normal distribution, with covariance matrix equal to the identity. As a result:

u · d ~ N(0, 1/N)

and

p · d ~ N(0, Σ_i p_i²),

and the two are statistically independent (since u · p = 0).

Note: Obtaining this statistical independence is why we needed to assume the feature vectors were normal - showing statistical independence in the case of binary vectors is harder. A potentially easier test case than binary vectors might be random vectors chosen uniformly from the unit ball in l∞, aka vectors for which max_i |x_i| ≤ 1.

We've now reduced the problem to simple calculus. Define σ_u² = 1/N and σ_p² = Σ_i p_i². Let v = u · d and w = p · d. Then:

P(v + w > 0 and v < 0) = ∫_{−∞}^{0} ∫_{−v}^{∞} C exp(−[v²/(2σ_u²) + w²/(2σ_p²)]) dw dv

Changing variables to ṽ = v/σ_u, w̃ = w/σ_p and then to polar coordinates, the region of integration becomes a wedge of angle θ₀, and the integral becomes:

= ∫_{θ₀} ∫_0^∞ C e^{−r²/2} r dr dθ = θ₀/(2π)

Here θ₀ = arccot(σ_u/σ_p) = arctan(σ_p/σ_u), so:

P(v + w > 0 and v < 0) = arctan(σ_p/σ_u)/(2π) = arctan(√N σ_p)/(2π)

The worst case to consider is when h places all of its weight on a single feature, e.g. h = (1, 0, ..., 0). In that case:

σ_p² = |p|² = |h − u|² = (N − 1)/N

So σ_p = √((N − 1)/N) while σ_u = 1/√N, and the probability above is arctan(√(N − 1))/(2π) → (π/2)/(2π) = 1/4 as N → ∞. This implies:

error rate = 2P → 1/2

This means that in the worst case, unit-weighted regression is no better than chance.

Let us now consider the average case over all vectors h. To handle this case, we must impose a probability distribution on such vectors. The natural distribution to consider is the uniform distribution on the unit-simplex, which is equivalent to a Dirichlet (http://en.wikipedia.org/wiki/Dirichlet_distribution) distribution with α₁ = α₂ = ⋯ = α_N = 1.

So what we want to compute is:

E[P] = ∫ arctan(√N |h − u|₂)/(2π) dh

By Jensen's inequality (http://en.wikipedia.org/wiki/Jensen's_inequality), since z ↦ arctan(√N z) is a concave function, we can bound this (writing σ_p = |h − u|₂):

∫ arctan(√N |h − u|₂)/(2π) dh ≤ arctan(√N √(∫ |h − u|₂² dh))/(2π) = arctan(√(N (N − 1)/(N(N + 1))))/(2π) = arctan(√((N − 1)/(N + 1)))/(2π)

For large N this quantity approaches arctan(1)/(2π) = (π/4)/(2π) = 1/8, and hence the error rate 2P is bounded by 1/4.
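As a sanity check of my own (not from the original post), the average-case bound on the error rate 2P can be evaluated numerically:

```python
import math

# Average-case bound on the error rate: arctan(sqrt((N-1)/(N+1))) / pi
def error_bound(N):
    return math.atan(math.sqrt((N - 1) / (N + 1))) / math.pi

for N in (2, 5, 10, 100, 1000):
    print(N, round(error_bound(N), 4))
# The bound increases with N and approaches arctan(1)/pi = 1/4.
```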

Note: I believe the reason the Bernoulli feature vectors appear to have lower error than the Gaussian feature vectors for small N is that for small N, there is a significant possibility that a feature vector might be 0 in the relevant components. The net result of this is that h · (x − y) = 0 fairly often, meaning that many vectors have equal rank. This effect becomes improbable as more features are introduced.


Thus, we have shown that the average error-rate of unit-weighted regression is bounded above by 1/4. The argument also shows that treating feature vectors as Gaussian rather than Boolean vectors appears to be a reasonable approximation to the problem - if anything it introduces extra error.

All the pre-readers I shared this with had two major but tangential questions which are worth answering once and for all.

First, Olga Kurylenko (https://www.google.com/search?q=olga+kurylenko&oq=olga+kurylenko) and Oluchi Onweagba (https://www.google.com/search?q=oluchi+onweagba).

Second, I didn't waste time with gimp. Imagemagick was more than sufficient:

# -resize x594 will shrink height to 594, preserve aspect ratio
$ convert olga-kurylenko-too-big.jpg -resize 'x594' olga-kurylenko.jpg
# -tile x1 means tile the images with 1 row, however many columns are needed
$ montage -mode concatenate -tile x1 olga-kurylenko.jpg oluchi-onweagba.jpg composite.jpg

Comments

Personally, I appreciate the example. I don't know if it was an intentional troll against how overly sensitive the tech crowd is around race and gender, but it worked that way nonetheless. There's nothing even remotely sexist or racist here. Thanks for the informative article, I learned so much about so many things.

Very unfortunately chosen example... people will very easily dismiss it as sexist (plus look at smart and black criteria...) Oh well... BTW a simple Pro/Con list to evaluate the pros and cons of a Pro/Con list vs. Linear Regression would have been 75% as good as your detailed analysis :)

stucchio (Mod): I address this topic in the post: "The difficulty with using a fancier learning tool is that it only works when you have sufficient data. To robustly fit a linear model, you'll need tens to hundreds of data points per feature. If you have too few data points, you run into a real danger of overfitting - building a model which accurately memorizes the past, but fails to predict the future. You can even run into this problem if you have lots of data points, but those data points don't represent all the features in question."

A developer (2 years ago): Interesting article Chris. I think your example is a bit too overtly sexist though. While it may resonate with a male audience, it would be a major put-off to a female audience. That said, I found the article informative and fascinating.

NotSexist (replying to A developer, 2 years ago): An article isn't sexist just because it uses a subset of the sexes as an example. That people jump to such a conclusion is more telling about those people and the state of our culture. The existence of such people surely means this article is higher risk, but it's not an inherently "evil"

