
**CS3244 Machine Learning – Semester 1, 2012/13**

Solution to Tutorial 4

1. What are the values of weights w_0, w_1, and w_2 for the perceptron whose decision surface is illustrated in Figure 4.3? Assume the surface crosses the x_1 axis at −1, and the x_2 axis at 2.

Answer:

The line for the decision surface corresponds to the equation x_2 = 2x_1 + 2, and since all points above the line should be classified as positive, we have x_2 − 2x_1 − 2 > 0. Hence w_0 = −2, w_1 = −2, and w_2 = 1.
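A quick numeric check of these weights (a minimal sketch; the sample points are chosen for illustration, not taken from the figure):

```python
# Sketch: verify w0 = -2, w1 = -2, w2 = 1 against the decision surface
# x2 = 2*x1 + 2 (points strictly above the line classify as positive).

def perceptron(w0, w1, w2, x1, x2):
    """Return 1 if w0 + w1*x1 + w2*x2 > 0, else -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

w0, w1, w2 = -2.0, -2.0, 1.0

# (0, 3) lies above the line x2 = 2*x1 + 2; (0, 1) lies below it.
print(perceptron(w0, w1, w2, 0.0, 3.0))   # 1
print(perceptron(w0, w1, w2, 0.0, 1.0))   # -1
```

Note that a point exactly on the line (e.g. (−1, 0)) gives w_0 + w_1·x_1 + w_2·x_2 = 0 and so is classified negative by the strict inequality.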

2. Consider two perceptrons defined by the threshold expression w_0 + w_1·x_1 + w_2·x_2 > 0. Perceptron A has weight values w_0 = 1, w_1 = 2, w_2 = 1 and Perceptron B has weight values w_0 = 0, w_1 = 2, w_2 = 1. True or false? Perceptron A is more-general-than Perceptron B. (More-general-than is defined in Chapter 2.)

Answer:

True. Perceptron A is more general than B, because any point lying above B's line will also lie above A's line; i.e., as per the definition of more-general-than in Chapter 2:

(∀x ∈ X) [(B(x) = 1) → (A(x) = 1)]

[Figure: the two parallel decision lines, with positive (+) points above and negative (−) points below; B's positive region is contained in A's.]
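The subset relation can also be checked empirically; the sketch below uses an assumed helper `accepts` and a sampled grid of points:

```python
# Sketch: check more-general-than by sampling -- every point B accepts
# must also be accepted by A (the converse should fail).

def accepts(w, x1, x2):
    """True iff w0 + w1*x1 + w2*x2 > 0 for w = (w0, w1, w2)."""
    w0, w1, w2 = w
    return w0 + w1 * x1 + w2 * x2 > 0

A = (1.0, 2.0, 1.0)   # Perceptron A
B = (0.0, 2.0, 1.0)   # Perceptron B

grid = [(x1 / 2, x2 / 2) for x1 in range(-10, 11) for x2 in range(-10, 11)]
a_more_general = all(accepts(A, x1, x2) for x1, x2 in grid if accepts(B, x1, x2))
print(a_more_general)  # True: B's positive region is a subset of A's
```

This only samples finitely many points, but the algebra confirms it in general: 2x_1 + x_2 > 0 implies 1 + 2x_1 + x_2 > 0.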

3. Derive a gradient descent training rule for a single unit with output o, where

o = w_0 + w_1·x_1 + w_1·x_1² + … + w_n·x_n + w_n·x_n²

Answer:

First, the error function is defined as:

E(w) = (1/2) Σ_{d∈D} (t_d − o_d)²

The update rule is the same, namely:

w_i := w_i + Δw_i, where Δw_i = −η ∂E/∂w_i

For w_0,

∂E/∂w_0 = ∂/∂w_0 [ (1/2) Σ_{d∈D} (t_d − o_d)² ]
        = (1/2) Σ_{d∈D} 2 (t_d − o_d) ∂/∂w_0 (t_d − o_d)
        = Σ_{d∈D} (t_d − o_d)(−1)
        = −Σ_{d∈D} (t_d − o_d)

Thus

Δw_0 = η Σ_{d∈D} (t_d − o_d)

For w_1, w_2, …, w_n,

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{d∈D} (t_d − o_d)² ]
        = (1/2) Σ_{d∈D} 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = −Σ_{d∈D} (t_d − o_d)(x_id + x_id²)

Thus

Δw_i = η Σ_{d∈D} (t_d − o_d)(x_id + x_id²)
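The derived rules can be exercised on a toy one-input dataset; the targets and learning rate below are assumed purely for illustration:

```python
# Sketch: batch gradient descent for the unit
#   o = w0 + sum_i (w_i * x_i + w_i * x_i**2)
# using the derived rules
#   dw0 = eta * sum_d (t_d - o_d)
#   dwi = eta * sum_d (t_d - o_d) * (x_id + x_id**2)

def output(w, x):
    # w = [w0, w1, ..., wn]; x = [x1, ..., xn]
    return w[0] + sum(wi * (xi + xi ** 2) for wi, xi in zip(w[1:], x))

def gd_step(w, data, eta):
    errs = [t - output(w, x) for x, t in data]
    new_w = [w[0] + eta * sum(errs)]
    for i in range(1, len(w)):
        grad = sum(e * (x[i - 1] + x[i - 1] ** 2) for e, (x, t) in zip(errs, data))
        new_w.append(w[i] + eta * grad)
    return new_w

# Toy data (assumed): generated by w0 = 1, w1 = 1, i.e. t = 1 + (x + x^2).
data = [([1.0], 3.0), ([2.0], 7.0)]
w = [0.0, 0.0]
for _ in range(5000):
    w = gd_step(w, data, eta=0.02)
print([round(v, 3) for v in w])  # → [1.0, 1.0]
```

The learning rate must be small enough for stability (here η = 0.02; larger values diverge on this dataset).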


4. Consider a two-layer feedforward ANN with two inputs a and b, one hidden unit c, and one output unit d. This network has five weights (w_ca, w_cb, w_c0, w_dc, w_d0), where w_x0 represents the threshold weight for unit x. Initialize these weights to the values (0.1, 0.1, 0.1, 0.1, 0.1), then give their values after each of the first two training iterations of the BACKPROPAGATION algorithm. Assume learning rate η = 0.3, momentum α = 0.9, incremental weight updates, and the following training examples:

a b d
1 0 1
0 1 0

Answer:

The network and the sigmoid activation function are as follows:

σ(y) = 1 / (1 + e^(−y))

[Figure: inputs a and b feed hidden unit c (weights w_ca, w_cb, threshold w_c0); unit c feeds output unit d (weight w_dc, threshold w_d0).]

Training example 1:

The outputs of the two neurons, noting that a = 1 and b = 0:

o_c = σ(0.1×1 + 0.1×0 + 0.1×1) = σ(0.2) = 0.5498
o_d = σ(0.1×0.5498 + 0.1×1) = σ(0.15498) = 0.53867

The error terms for the two neurons, noting that d = 1:

δ_d = 0.53867 × (1 − 0.53867) × (1 − 0.53867) = 0.1146
δ_c = 0.5498 × (1 − 0.5498) × 0.1 × 0.1146 = 0.002836

Compute the correction terms as follows, noting that a = 1, b = 0 and η = 0.3:

Δw_d0 = 0.3 × 0.1146 × 1 = 0.0342
Δw_dc = 0.3 × 0.1146 × 0.5498 = 0.0189
Δw_c0 = 0.3 × 0.002836 × 1 = 0.000849
Δw_ca = 0.3 × 0.002836 × 1 = 0.000849
Δw_cb = 0.3 × 0.002836 × 0 = 0

and the new weights become:

w_d0 = 0.1 + 0.0342 = 0.1342
w_dc = 0.1 + 0.0189 = 0.1189
w_c0 = 0.1 + 0.000849 = 0.100849
w_ca = 0.1 + 0.000849 = 0.100849
w_cb = 0.1 + 0 = 0.1

Training example 2:

The outputs of the two neurons, noting that a = 0 and b = 1:

o_c = σ(0.100849×0 + 0.1×1 + 0.100849×1) = σ(0.200849) = 0.55
o_d = σ(0.1189×0.55 + 0.1342×1) = σ(0.1996) = 0.5497

The error terms for the two neurons, noting that d = 0:

δ_d = 0.5497 × (1 − 0.5497) × (0 − 0.5497) = −0.1361
δ_c = 0.55 × (1 − 0.55) × 0.1189 × (−0.1361) = −0.004

Compute the correction terms as follows, noting that a = 0, b = 1, η = 0.3 and α = 0.9:

Δw_d0 = 0.3 × (−0.1361) × 1 + 0.9 × 0.0342 = −0.01
Δw_dc = 0.3 × (−0.1361) × 0.55 + 0.9 × 0.0189 = −0.0055
Δw_c0 = 0.3 × (−0.004) × 1 + 0.9 × 0.000849 = −0.0004
Δw_ca = 0.3 × (−0.004) × 0 + 0.9 × 0.000849 = 0.00076
Δw_cb = 0.3 × (−0.004) × 1 + 0.9 × 0 = −0.0012

and the new weights become:

w_d0 = 0.1342 − 0.01 = 0.1242
w_dc = 0.1189 − 0.0055 = 0.1134
w_c0 = 0.100849 − 0.0004 = 0.100449
w_ca = 0.100849 + 0.00076 = 0.1016
w_cb = 0.1 − 0.0012 = 0.0988
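The two hand-computed iterations can be reproduced in a short script; the dictionary keys mirror the weight names above, and the final values agree with the hand computation up to rounding of the intermediate steps:

```python
# Sketch: two incremental backpropagation updates with eta = 0.3,
# momentum alpha = 0.9, and sigmoid units, starting from all weights 0.1.
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

eta, alpha = 0.3, 0.9
w = {"ca": 0.1, "cb": 0.1, "c0": 0.1, "dc": 0.1, "d0": 0.1}
prev = {k: 0.0 for k in w}   # previous weight changes, for the momentum term

for a, b, t in [(1, 0, 1), (0, 1, 0)]:
    oc = sigmoid(w["ca"] * a + w["cb"] * b + w["c0"])
    od = sigmoid(w["dc"] * oc + w["d0"])
    dd = od * (1 - od) * (t - od)         # output-unit error term
    dc = oc * (1 - oc) * w["dc"] * dd     # hidden-unit error term
    grads = {"ca": dc * a, "cb": dc * b, "c0": dc,
             "dc": dd * oc, "d0": dd}
    for k in w:
        prev[k] = eta * grads[k] + alpha * prev[k]
        w[k] += prev[k]

print({k: round(v, 4) for k, v in w.items()})
```

Because the script keeps full floating-point precision, the last digit can differ slightly from the hand-rounded numbers above (e.g. w_d0 comes out near 0.1245 rather than 0.1242).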


5. Revise the BACKPROPAGATION algorithm in Table 4.2 so that it operates on units using the squashing function tanh in place of the sigmoid function. That is, assume the output of a single unit is o = tanh(w·x). Give the weight update rule for output layer weights and hidden layer weights. Hint: tanh′(x) = 1 − tanh²(x).

Answer:

Steps T4.3 and T4.4 in Table 4.2 will become as follows, respectively:

δ_k ← (1 − o_k²)(t_k − o_k)
δ_h ← (1 − o_h²) Σ_{k∈outputs} w_kh δ_k
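A small numeric sketch of the revised error terms; the activations (tanh(0.5), tanh(0.2)), the target 1.0, and the weight 0.1 are arbitrary illustration values:

```python
# Sketch: error terms for tanh units, using tanh'(y) = 1 - tanh(y)**2.
import math

def output_delta(o_k, t_k):
    """delta_k = (1 - o_k^2) * (t_k - o_k) for an output unit."""
    return (1 - o_k ** 2) * (t_k - o_k)

def hidden_delta(o_h, downstream):
    """delta_h = (1 - o_h^2) * sum_k w_kh * delta_k; downstream = [(w_kh, delta_k)]."""
    return (1 - o_h ** 2) * sum(w * d for w, d in downstream)

o_k = math.tanh(0.5)
d_k = output_delta(o_k, 1.0)
d_h = hidden_delta(math.tanh(0.2), [(0.1, d_k)])
print(round(d_k, 4), round(d_h, 6))
```

The weight updates themselves are unchanged: Δw_ji = η δ_j x_ji, exactly as in Table 4.2.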


6. Consider the alternative error function described in Section 4.8.1:

E(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)² + γ Σ_{i,j} w_ji²

Derive the gradient descent update rule for this definition of E. Show that it can be implemented by multiplying each weight by some constant before performing the standard gradient descent update given in Table 4.2.

Answer:

Δw_ji = −η ∂E(w)/∂w_ji

∂E(w)/∂w_ji = ∂/∂w_ji [ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)² ] + 2γ w_ji

The first term on the R.H.S. of the above equation can be derived in the same manner as in Equation (4.27), while we continue to work on the second term. For output nodes, this leads to:

∂E(w)/∂w_ji = −(t_j − o_j) o_j (1 − o_j) x_ji + 2γ w_ji

so the update becomes

w_ji ← w_ji − η [ −(t_j − o_j) o_j (1 − o_j) x_ji + 2γ w_ji ]
     = (1 − 2γη) w_ji + η (t_j − o_j) o_j (1 − o_j) x_ji
     = β w_ji + η δ_j x_ji

where β = 1 − 2γη and δ_j = (t_j − o_j) o_j (1 − o_j).

Similarly, for hidden units, we can derive:

w_ji ← β w_ji + η δ_j x_ji

where β = 1 − 2γη and δ_j = o_j (1 − o_j) Σ_{k∈Downstream(j)} δ_k w_kj

The above shows the update rule can be implemented by multiplying each weight by some constant before performing the gradient descent update given in Table 4.2.
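The conclusion reduces to a one-line update, sketched below with arbitrary example values for η, γ, δ_j and x_ji:

```python
# Sketch: the regularized (weight-decay) update -- shrink the weight by
# (1 - 2*gamma*eta), then apply the standard delta-rule step.
eta, gamma = 0.1, 0.05   # example values (assumed)

def regularized_update(w_ji, delta_j, x_ji):
    shrink = 1 - 2 * gamma * eta        # the constant beta = 1 - 2*gamma*eta
    return shrink * w_ji + eta * delta_j * x_ji

w_new = regularized_update(0.5, 0.2, 1.0)
print(round(w_new, 3))  # 0.515
```

With δ_j = 0 the weight simply decays toward zero by the factor β each step, which is the regularization effect of the added γ Σ w_ji² term.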


7. Assume the following error function:

E(w) = 2α − λw + (1/2)μw²

where α, λ and μ are constants. The weight w is updated according to gradient descent with a positive learning rate η. Write down the update equation for w(k+1) given w(k). Find the optimum weight w that gives the minimal error E(w). What is the value of the minimal E(w)? (8 marks)

Answer:

∂E/∂w = −λ + μw

Δw = −η ∂E/∂w = η(λ − μw)

w(k+1) = w(k) + η(λ − μw(k))

When E(w) becomes the smallest, ∂E/∂w = 0. Thus, the optimal weight is w_optimal = λ/μ.

Minimal error:

E(w_optimal) = 2α − λ(λ/μ) + (1/2)μ(λ/μ)² = 2α − λ²/(2μ)
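A short sketch confirms that the update converges to w = λ/μ and that the error there equals 2α − λ²/(2μ); the constants are chosen arbitrarily:

```python
# Sketch: iterate w(k+1) = w(k) + eta*(lam - mu*w(k)) for
# E(w) = 2*alpha - lam*w + 0.5*mu*w**2 and check the fixed point.
alpha, lam, mu, eta = 1.0, 2.0, 4.0, 0.1   # example constants (assumed)

def E(w):
    return 2 * alpha - lam * w + 0.5 * mu * w ** 2

w = 0.0
for _ in range(100):
    w = w + eta * (lam - mu * w)

print(round(w, 4), round(E(w), 4))  # 0.5 1.5
```

Convergence requires η < 2/μ; here the error in w shrinks by the factor |1 − ημ| = 0.6 at every step.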


8. WEKA outputs the following confusion matrix after training a J48 decision tree classifier with the contact-lenses dataset. (a) Count the number of True Positives, True Negatives, False Positives and False Negatives for each of the three classes, i.e. soft, hard and none. (b) Calculate the TP rate (Recall), FP rate, Precision and F-measure for each class.

  a  b  c   <-- classified as
  4  0  1 |  a = soft
  0  1  3 |  b = hard
  1  2 12 |  c = none

Answer:

soft:
(a) TP = 4, TN = 18, FP = 1, FN = 1
(b) TP rate = Recall = TP / (TP + FN) = 4/5 = 0.8
    FP rate = FP / (FP + TN) = 1/19 = 0.053
    Precision = TP / (TP + FP) = 4/5 = 0.8
    F-Measure = 2×0.8×0.8 / (0.8 + 0.8) = 0.8

hard:
(a) TP = 1, TN = 18, FP = 2, FN = 3
(b) TP rate = Recall = TP / (TP + FN) = 1/4 = 0.25
    FP rate = FP / (FP + TN) = 2/20 = 0.1
    Precision = TP / (TP + FP) = 1/3 = 0.333
    F-Measure = 2×0.25×0.333 / (0.25 + 0.333) = 0.286

none:
(a) TP = 12, TN = 5, FP = 4, FN = 3
(b) TP rate = Recall = TP / (TP + FN) = 12/15 = 0.8
    FP rate = FP / (FP + TN) = 4/9 = 0.444
    Precision = TP / (TP + FP) = 12/16 = 0.75
    F-Measure = 2×0.8×0.75 / (0.8 + 0.75) = 0.774
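All of the counts and metrics above can be recomputed mechanically from the matrix (rows are actual classes, columns are predicted classes):

```python
# Sketch: per-class TP/TN/FP/FN and metrics from the J48 confusion matrix.
M = [[4, 0, 1],    # actual soft
     [0, 1, 3],    # actual hard
     [1, 2, 12]]   # actual none
classes = ["soft", "hard", "none"]
total = sum(sum(row) for row in M)   # 24 instances

for i, name in enumerate(classes):
    tp = M[i][i]
    fn = sum(M[i]) - tp                        # rest of the row
    fp = sum(M[r][i] for r in range(3)) - tp   # rest of the column
    tn = total - tp - fn - fp
    recall = tp / (tp + fn)                    # TP rate
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    print(name, tp, tn, fp, fn,
          round(recall, 3), round(fp_rate, 3),
          round(precision, 3), round(f1, 3))
```

Each row printed matches the corresponding hand-worked class above.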
