Liu, Smith 1

Jamie Liu and Adam Smith
6.825 – Project 2
11/4/2004

We learned a lot from this project. Enjoy.

1. Variable Elimination Functionality
After executing our variable elimination procedure, we obtained the following results for each of the queries below. To simplify analysis of the PropCost probability distributions obtained from the insurance network throughout this project, we define the function f as a probability-weighted average over the discrete domain, which yields a single scalar value representative of the overall cost. More specifically,

f = 1E5*PHundredThou + 1E6*PMillion + 1E4*PTenThou + 1E3*PThousand

1. P(Burglary | JohnCalls = true, MaryCalls = true)
<[Burglary] = [false]> = 0.7158281646356072 <[Burglary] = [true]> = 0.284171835364393

2. P(Earthquake | JohnCalls = true, Burglary = true)
<[Earthquake] = [false]> = 0.8239331615949207 <[Earthquake] = [true]> = 0.17606683840507917

3. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou)
<[PropCost] = [HundredThou]> = 0.1729786918964137
<[PropCost] = [Million]> = 0.02709352198178344
<[PropCost] = [TenThou]> = 0.3427002442093675
<[PropCost] = [Thousand]> = 0.45722754191243536

(f = 48275.62)
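As a quick check on the arithmetic, the scalar f can be recomputed from the distribution reported for query 3. This is only a minimal sketch: the names COST and expected_cost are ours for illustration and are not part of the project code.

```python
# Dollar value assigned to each PropCost outcome, taken from the definition of f.
COST = {"Thousand": 1e3, "TenThou": 1e4, "HundredThou": 1e5, "Million": 1e6}


def expected_cost(distribution):
    """Return f: the probability-weighted average cost of a PropCost distribution."""
    return sum(COST[outcome] * p for outcome, p in distribution.items())


# Distribution reported for query 3 above.
query_3 = {
    "HundredThou": 0.1729786918964137,
    "Million": 0.02709352198178344,
    "TenThou": 0.3427002442093675,
    "Thousand": 0.45722754191243536,
}

f_value = expected_cost(query_3)
print(f"f = {f_value:.2f}")  # f = 48275.62
```

The Million term dominates f even though its probability is small, which is why small shifts in P(Million) move f by thousands of dollars.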

These results are consistent with those obtained by executing the given enumeration procedure, and those given in Table 1 of the project hand-out.
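For readers who want to see the mechanics, the following is a minimal sketch of variable elimination over Boolean variables. It is not the project's implementation. The conditional probability values are an assumption: they are the widely used textbook values for the burglary alarm network, which this report does not list. The names Factor, cpt, restrict, multiply, sum_out and eliminate are hypothetical helpers. With these assumed values the sketch gives P(Burglary = true | JohnCalls = true, MaryCalls = true) of about 0.284, in line with query 1.

```python
from itertools import product


class Factor:
    """A table over Boolean variables: maps value tuples to non-negative numbers."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = dict(table)


def cpt(var, parents, p_true):
    """Build the factor P(var | parents) from P(var=True | parent values)."""
    table = {}
    for pv in product([True, False], repeat=len(parents)):
        table[pv + (True,)] = p_true[pv]
        table[pv + (False,)] = 1.0 - p_true[pv]
    return Factor(tuple(parents) + (var,), table)


def restrict(f, var, value):
    """Fix var to its observed value and drop it from the factor's scope."""
    if var not in f.variables:
        return f
    i = f.variables.index(var)
    table = {k[:i] + k[i + 1:]: p for k, p in f.table.items() if k[i] == value}
    return Factor(f.variables[:i] + f.variables[i + 1:], table)


def multiply(f, g):
    """Pointwise product over the union of the two scopes."""
    variables = f.variables + tuple(v for v in g.variables if v not in f.variables)
    table = {}
    for values in product([True, False], repeat=len(variables)):
        a = dict(zip(variables, values))
        table[values] = (f.table[tuple(a[v] for v in f.variables)]
                         * g.table[tuple(a[v] for v in g.variables)])
    return Factor(variables, table)


def sum_out(f, var):
    """Marginalize var out of the factor."""
    i = f.variables.index(var)
    table = {}
    for k, p in f.table.items():
        key = k[:i] + k[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(f.variables[:i] + f.variables[i + 1:], table)


def eliminate(factors, query, evidence, order):
    """Return the normalized posterior over the query variable."""
    for var, value in evidence.items():
        factors = [restrict(f, var, value) for f in factors]
    for var in order:
        touching = [f for f in factors if var in f.variables]
        rest = [f for f in factors if var not in f.variables]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    total = sum(result.table.values())
    return {k[0]: p / total for k, p in result.table.items()}


T, F = True, False
# ASSUMED CPT values (textbook alarm network); not taken from this report.
alarm_network = [
    cpt("B", [], {(): 0.001}),
    cpt("E", [], {(): 0.002}),
    cpt("A", ["B", "E"], {(T, T): 0.95, (T, F): 0.94, (F, T): 0.29, (F, F): 0.001}),
    cpt("J", ["A"], {(T,): 0.90, (F,): 0.05}),
    cpt("M", ["A"], {(T,): 0.70, (F,): 0.01}),
]
posterior = eliminate(alarm_network, "B", {"J": T, "M": T}, order=["E", "A"])
print(posterior)
```

The order argument lists the hidden variables to sum out; the choice of that order is what Sections 3 and 4 vary.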

2. More Variable Elimination Exercise
A. Insurance Network Queries

1. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar)
If the MakeModel of the car in question is a sports car then, based on the network as illustrated in Figure 1 of the handout, we expect that the driver would be less risk averse, the driver would have more money, and the car would be of higher value. All of these things should cause the cost of insurance to “go up,” relative to our previous query, which did not involve any evidence about the MakeModel of the car. An increase in the PropCost domain sense means that the probability distribution should be shifted towards the higher cost elements of the domain (e.g. Million has a higher probability than before, at the expense of Thousand). Indeed, this is what happens. As can be seen below, f is about four thousand dollars greater in this case relative to that from Section 1.3.
<[PropCost] = [HundredThou]> = 0.17179333672003955
<[PropCost] = [Million]> = 0.03093877334365239
<[PropCost] = [TenThou]> = 0.34593039737969233
<[PropCost] = [Thousand]> = 0.45133749255661565

(f = 52028.74)

2. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True)
In this case, counter-intuitive as it may seem, if the driver is a GoodStudent, then the overall cost of insurance goes up. This follows from the network as shown in Figure 1 of the project handout: GoodStudent is only connected to the network through two parents, Age and SocioEcon. Since Age is an evidence variable, SocioEcon is the only node directly affected by adding GoodStudent to the evidence. More specifically, if the adolescent driver is a good student, they are likely to have more money, and thus drive fancier cars, be less risk averse, et cetera. Variable elimination with this evidence bears this out: f is a little less than four thousand dollars greater in this case relative to that from Section 1.3.
<[PropCost] = [HundredThou]> = 0.1837467917616061
<[PropCost] = [Million]> = 0.029748793596801583
<[PropCost] = [TenThou]> = 0.32771416728772235
<[PropCost] = [Thousand]> = 0.4587902473538701

(f = 51859.40)

B. Carpo Network Queries

1. P(N112 | N64 = “3”, N113 = “1”, N116 = “0”)
<[N112] = [0]> = 0.9880400004226929 <[N112] = [1]> = 0.01195999957730707

2. P(N143 | N146 = “1”, N116 = “0”, N121 = “1”)
<[N143] = [0]> = 0.899999996961172 <[N143] = [1]> = 0.10000000303882783

3. Random Elimination Ordering
A. Histograms

[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 1. x-axis: Trials (1 to 10)]

Figure 1. Histogram of Computation Time for P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar).

[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 2. x-axis: Trials (1 to 10)]

Figure 2. Histogram of Computation Time for P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True).

[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 3. x-axis: Trials (1 to 10)]

Figure 3. Histogram of Computation Time for P(N112 | N64 = "3", N113 = "1", N116 = "0").

[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 4. x-axis: Trials (1 to 10)]

Figure 4. Histogram of Computation Time for P(N143 | N146 = "1", N116 = "0", N121 = "1").

B. Discussion
Figures 1 through 4 illustrate the running time of a random-order variable elimination algorithm for each of the problems in Task 2 of the project handout. We ran the algorithm ten times for each problem. If a bar is stacked with a purple bar on top of it, then the heap ran out of memory during that execution. In this case, we know that the execution would have taken at least the amount of time illustrated by the blue bar, which is the time it ran before running out of memory. We suppose that each execution where the computer ran out of memory would have taken at least 5000 seconds to complete. It is worth noting that the time taken on the successful runs (the samples without a purple bar) is much lower than the time the unsuccessful runs took before they crashed; that is, the successful blue bars tend to be shorter than the unsuccessful blue bars. This indicates that random ordering tends to get it either very right or very wrong.

4. Greedy Elimination Ordering
A. Histograms

[Figure: Greedy Variable Elimination Runtimes. x-axis: Problem Number (1 to 4)]
Figure 5. Greedy Variable Elimination Runtimes for 10 trials of running each of 4 problems.

Problem          Average Time (seconds)
Insurance – 1    0.629
Insurance – 2    1.086
Carpo – 1        0.088
Carpo – 2        0.087

Table 1. Average time of execution for variable elimination for the problems from Task 2. Averages are computed across ten independent runs each, which are illustrated in Figure 5.

B. Discussion
As can be seen from Table 1, the time needed for variable elimination is much smaller for a greedy elimination ordering than for a random ordering. This makes a lot of sense, because the random ordering could happen to eliminate a parent of many children early, creating a huge factor which slows down the algorithm and eats up memory. By contrast, greedy ordering variable elimination works very well. Even in the cases from Section 3 in which we did not run out of memory, the greedy algorithm tends to be about 100-200 times faster.
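The effect of eliminating a parent of many children can be illustrated on a toy graph. The sketch below is not the project's code, and it rests on two assumptions: the greedy criterion is taken to be "eliminate the variable with the fewest neighbors" (this report does not state which greedy heuristic was used), and the star-shaped graph with hub H and children C1 to C6 is a hypothetical example rather than part of either test network.

```python
def eliminate_from_graph(adj, var):
    """Remove var from the interaction graph and connect its neighbors.

    Returns the number of variables in the factor formed by this step
    (var itself plus all of its current neighbors).
    """
    neighbors = adj.pop(var)
    for n in neighbors:
        adj[n].discard(var)
        adj[n] |= neighbors - {n}
    return len(neighbors) + 1


def largest_factor(graph, order):
    """Size (in variables) of the biggest factor created under this ordering."""
    adj = {v: set(ns) for v, ns in graph.items()}
    return max(eliminate_from_graph(adj, v) for v in order)


def greedy_order(graph):
    """ASSUMED heuristic: always eliminate the variable with the fewest neighbors."""
    adj = {v: set(ns) for v, ns in graph.items()}
    order = []
    while adj:
        var = min(adj, key=lambda v: (len(adj[v]), v))
        eliminate_from_graph(adj, var)
        order.append(var)
    return order


# Hypothetical star graph: one hub connected to six children.
children = [f"C{i}" for i in range(1, 7)]
star = {"H": set(children), **{c: {"H"} for c in children}}

hub_first = ["H"] + children
print(largest_factor(star, hub_first))           # 7
print(largest_factor(star, greedy_order(star)))  # 2
```

Since a factor's table grows exponentially in the number of variables it covers, the gap between 7 and 2 variables here is the same kind of gap that separates the random and greedy timings above.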

5. Likelihood Weighting and Gibbs Sampling Functionality
Each of our results below looks like it is in the right neighborhood. We give more explicit quality results in the problems that follow this one.

A. Basic Results – Likelihood Weighting
1. P(Burglary | JohnCalls = true, MaryCalls = true)
<[Burglary] = [false]> = 0.5448387970739699 <[Burglary] = [true]> = 0.4551612029260302

2. P(Earthquake | JohnCalls = true, Burglary = true)
<[Earthquake] = [false]> = 0.9997158283603297 <[Earthquake] = [true]> = 2.8417163967036946E-4

3. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou)
<[PropCost] = [HundredThou]> = 0.17105091038203132
<[PropCost] = [Million]> = 0.021563876240368398
<[PropCost] = [TenThou]> = 0.35877461270610517
<[PropCost] = [Thousand]> = 0.44861060067149516

4. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar)
<[PropCost] = [HundredThou]> = 0.16339257873401916
<[PropCost] = [Million]> = 0.030620517617711222
<[PropCost] = [TenThou]> = 0.35048331774243846
<[PropCost] = [Thousand]> = 0.4555035859058312

5. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True)
<[PropCost] = [HundredThou]> = 0.20177159162635994
<[PropCost] = [Million]> = 0.032866049889275516
<[PropCost] = [TenThou]> = 0.30414914618811645
<[PropCost] = [Thousand]> = 0.46121321229624807

6. P(N112 | N64 = “3”, N113 = “1”, N116 = “0”)
<[N112] = [0]> = 0.9910128302117664 <[N112] = [1]> = 0.00898716978823346

7. P(N143 | N146 = “1”, N116 = “0”, N121 = “1”)
<[N143] = [0]> = 0.9172494563262301 <[N143] = [1]> = 0.08275054367376986
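To make the procedure behind these numbers concrete, here is a minimal sketch of likelihood weighting on the burglary alarm network. It is not the project's implementation. The CPT values are an assumption (the common textbook values for this network, which the report does not list), and NETWORK and likelihood_weighting are hypothetical names. Evidence variables are never sampled; instead each sample is weighted by the probability of the evidence given its sampled parents.

```python
import random

T, F = True, False
# (variable, parents, P(variable=True | parent values)), in topological order.
# ASSUMED CPT values (textbook alarm network); not taken from this report.
NETWORK = [
    ("B", (), {(): 0.001}),
    ("E", (), {(): 0.002}),
    ("A", ("B", "E"), {(T, T): 0.95, (T, F): 0.94, (F, T): 0.29, (F, F): 0.001}),
    ("J", ("A",), {(T,): 0.90, (F,): 0.05}),
    ("M", ("A",), {(T,): 0.70, (F,): 0.01}),
]


def likelihood_weighting(query, evidence, n_samples, rng):
    """Estimate P(query | evidence) from n_samples weighted samples."""
    totals = {True: 0.0, False: 0.0}
    for _ in range(n_samples):
        sample, weight = {}, 1.0
        for var, parents, p_true in NETWORK:
            p = p_true[tuple(sample[q] for q in parents)]
            if var in evidence:
                # Fix the evidence value and fold its likelihood into the weight.
                sample[var] = evidence[var]
                weight *= p if evidence[var] else 1.0 - p
            else:
                sample[var] = rng.random() < p
        totals[sample[query]] += weight
    z = totals[True] + totals[False]
    return {value: w / z for value, w in totals.items()}


rng = random.Random(0)
lw_estimate = likelihood_weighting("B", {"J": T, "M": T}, 200_000, rng)
print(lw_estimate)
```

Because Burglary = true is sampled from its prior, only a small fraction of samples carry most of the weight for that outcome, so estimates from small sample counts can vary a lot from run to run.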

B. Basic Results – Gibbs Sampling
1. P(Burglary | JohnCalls = true, MaryCalls = true)
<[Burglary] = [false]> = 0.71 <[Burglary] = [true]> = 0.29

2. P(Earthquake | JohnCalls = true, Burglary = true)
<[Earthquake] = [false]> = 0.842 <[Earthquake] = [true]> = 0.158

3. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou)
<[PropCost] = [HundredThou]> = 0.06
<[PropCost] = [Million]> = 0.01
<[PropCost] = [TenThou]> = 0.355
<[PropCost] = [Thousand]> = 0.5750000000000001

4. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar)
<[PropCost] = [HundredThou]> = 0.09
<[PropCost] = [Million]> = 0.011
<[PropCost] = [TenThou]> = 0.34
<[PropCost] = [Thousand]> = 0.559

5. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True)
<[PropCost] = [HundredThou]> = 0.213
<[PropCost] = [Million]> = 0.038
<[PropCost] = [TenThou]> = 0.372
<[PropCost] = [Thousand]> = 0.377

6. P(N112 | N64 = “3”, N113 = “1”, N116 = “0”)
<[N112] = [0]> = 0.97 <[N112] = [1]> = 0.03

7. P(N143 | N146 = “1”, N116 = “0”, N121 = “1”)
<[N143] = [0]> = 0.922 <[N143] = [1]> = 0.078

6. Ignoring Prefix of Samples in Gibbs Sampling
A. Results
[Figure: Prefix Throwaway in Gibbs Sampling. x-axis: Size of Prefix Thrown Away (0 to 1000); y-axis: KL Divergence]

Figure 6. Quality (KL divergence) of estimates produced by the Gibbs sampler. Each run used 2000 samples and threw away the first x samples, where x is the independent variable shown on the x-axis.

[Figure: Prefix Throwaway in Gibbs Sampling. x-axis: Size of Prefix Thrown Away (0 to 1000); y-axis: Average KL Divergence]

Figure 7. Averages for different prefix throwaway sizes from Figure 6.

B. Discussion
In this analysis, we ran the Gibbs sampler with 2000 samples on the same problem (Carpo – 1). For each run, we threw away a variable number of the first samples. The idea is that since Gibbs sampling is a Markov chain algorithm, each sample depends heavily on the samples before it. Since we choose a random initial value for each variable, it can take some “burn in” time before the algorithm begins to settle into the right global solution. The results of our experiments are shown in Figure 6 and Figure 7. The average graph shows a fairly clean characteristic curve, with the only exception being when we threw away the first 600 samples. Looking at each run, however, at x = 600 there was a single outlier with an extremely high KL divergence; given the many runs that we did, we can ignore it. It seems that the ideal “burn in” time, a tradeoff between good initialization and diversity of counted samples, is 800 samples.
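The prefix throwaway described above can be sketched as a burn_in parameter on a simple Gibbs sampler. This is an illustration, not the project's implementation: it runs on the small burglary alarm network rather than Carpo, the CPT values are assumed textbook values that this report does not list, and the names NETWORK, joint and gibbs are hypothetical. Each hidden variable is resampled from its conditional distribution given all other variables, which is proportional to the full joint probability.

```python
import random

T, F = True, False
# (variable, parents, P(variable=True | parent values)).
# ASSUMED CPT values (textbook alarm network); not taken from this report.
NETWORK = [
    ("B", (), {(): 0.001}),
    ("E", (), {(): 0.002}),
    ("A", ("B", "E"), {(T, T): 0.95, (T, F): 0.94, (F, T): 0.29, (F, F): 0.001}),
    ("J", ("A",), {(T,): 0.90, (F,): 0.05}),
    ("M", ("A",), {(T,): 0.70, (F,): 0.01}),
]


def joint(state):
    """Joint probability of a complete assignment."""
    p = 1.0
    for var, parents, p_true in NETWORK:
        pt = p_true[tuple(state[q] for q in parents)]
        p *= pt if state[var] else 1.0 - pt
    return p


def gibbs(query, evidence, n_samples, burn_in, rng):
    """Estimate P(query | evidence), discarding the first burn_in samples."""
    state = dict(evidence)
    hidden = [var for var, _, _ in NETWORK if var not in evidence]
    for var in hidden:
        state[var] = rng.random() < 0.5  # random initial values
    counts = {True: 0, False: 0}
    for i in range(n_samples):
        for var in hidden:
            state[var] = True
            p_t = joint(state)
            state[var] = False
            p_f = joint(state)
            # P(var=True | everything else) is proportional to the joint.
            state[var] = rng.random() < p_t / (p_t + p_f)
        if i >= burn_in:
            counts[state[query]] += 1
    kept = n_samples - burn_in
    return {value: c / kept for value, c in counts.items()}


rng = random.Random(0)
gibbs_estimate = gibbs("B", {"J": T, "M": T}, n_samples=20_000, burn_in=800, rng=rng)
print(gibbs_estimate)
```

Recomputing the full joint for every resampling step keeps the sketch short; a practical sampler would only multiply the factors in the variable's Markov blanket.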

7. Detailed Analysis – KL Divergences
A. Results
We present results indexed first by the algorithm (Likelihood Weighting, then Gibbs Sampling) and then by the problem. Within each problem we display two graphs: the first showing the results from ten iterations, and the second showing the average KL divergence across those iterations.
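For reference, the quality measure plotted in the graphs of this section can be computed as in the sketch below. The direction of the divergence, KL(exact || estimate), and the use of natural logarithms are our assumptions, since the report does not state them, and kl_divergence is a hypothetical helper name. The two distributions are the variable elimination result from Section 2.A, query 1, and the likelihood weighting result from Section 5.A, query 4.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum over x of p(x) * log(p(x) / q(x)), in nats.

    Assumes q(x) > 0 wherever p(x) > 0; otherwise the divergence is infinite
    and this sketch would raise ZeroDivisionError.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0.0)


# Exact posterior from variable elimination (Section 2.A, query 1).
exact = {
    "HundredThou": 0.17179333672003955,
    "Million": 0.03093877334365239,
    "TenThou": 0.34593039737969233,
    "Thousand": 0.45133749255661565,
}
# Likelihood weighting estimate for the same query (Section 5.A, query 4).
estimate = {
    "HundredThou": 0.16339257873401916,
    "Million": 0.030620517617711222,
    "TenThou": 0.35048331774243846,
    "Thousand": 0.4555035859058312,
}

kl = kl_divergence(exact, estimate)
print(kl)
```

A value of zero means the estimate matches the exact posterior, so lower curves in the figures indicate better estimates.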

1. Likelihood Weighting
[Figure: Likelihood Weighting - Problem Insurance1. x-axis: Number of Samples; y-axis: KL Divergence]

Figure 3. KL Divergences when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock=False, Mileage = FiftyThou, MakeModel = SportsCar).

[Figure: Likelihood Weighting: Average KL Divergence Problem Insurance1. x-axis: Number of Samples; y-axis: KL Divergence]

Figure 4. Average KL Divergence when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock=False, Mileage = FiftyThou, MakeModel = SportsCar) for sample sizes between 100 and 2000.

[Figure: Likelihood Weighting - Problem Insurance2. x-axis: Sample Size; y-axis: Divergence]

Figure 8. KL Divergences when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock=False, Mileage = FiftyThou, GoodStudent = True).
[Figure: Likelihood Weighting: Average KL Divergence Problem Insurance2. x-axis: Number of Samples; y-axis: KL Divergence]

Figure 9. Average KL Divergence when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock=False, Mileage = FiftyThou, GoodStudent = True).

[Figure: Likelihood Weighting - Problem 3. x-axis: Number of Samples (100 to 2000)]

Figure 10. KL Divergences when applying Likelihood Weighting to P(N112 | N64 = "3", N113 = "1", N116 = "0").

[Figure: Likelihood Weighting: Average KL Divergence Problem Carpo1. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence]

Figure 11. Average KL Divergence when applying Likelihood Weighting to P(N112 | N64 = "3", N113 = "1", N116 = "0").

[Figure: Likelihood Weighting - Problem 4. x-axis: Number of Samples (100 to 2000)]

Figure 12. KL Divergences when applying Likelihood Weighting to P(N143 | N146 = "1", N116 = "0", N121 = "1").

[Figure: Likelihood Weighting: Average KL Divergence Problem Carpo2. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence]

Figure 13. Average KL Divergence when applying Likelihood Weighting to P(N143 | N146 = "1", N116 = "0", N121 = "1").

2. Gibbs Sampling

[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 1. x-axis: Number of Samples (1000 to 25000)]

Figure 14. Divergences resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar) for sample sizes between 1000 and 25000.

[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 1. x-axis: Number of Samples (1000 to 25000)]

Figure 15. Average divergence resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar) for sample sizes between 1000 and 25000.

[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 2. x-axis: Number of Samples (1000 to 25000)]

Figure 16. Divergences resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True) for sample sizes between 1000 and 25000.

[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 2. x-axis: Number of Samples (1000 to 25000)]

Figure 17. Average divergence resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True) for sample sizes between 1000 and 25000.

[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 3. x-axis: Number of Samples (1000 to 25000)]

Figure 18. Divergences resulting from Gibbs Sampling applied to P(N112 | N64 = "3", N113 = "1", N116 = "0") for sample sizes between 1000 and 25000.

[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 3. x-axis: Number of Samples (1000 to 25000)]

Figure 19. Average Divergence resulting from Gibbs Sampling applied to P(N112 | N64 = "3", N113 = "1", N116 = "0") for sample sizes between 1000 and 25000.

[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 4. x-axis: Number of Samples (1000 to 25000)]

Figure 20. Divergences resulting from Gibbs Sampling applied to P(N143 | N146 = "1", N116 = "0", N121 = "1") for sample sizes between 1000 and 25000.

[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 4. x-axis: Number of Samples (1000 to 25000)]

Figure 21. Average divergence resulting from Gibbs Sampling applied to P(N143 | N146 = "1", N116 = "0", N121 = "1") for sample sizes between 1000 and 25000.

B. Discussion of Results
Four interesting things:

1. Number of samples in Gibbs versus Likelihood Weighting
As seen from the figures in Section 7.A.1, Likelihood Weighting tends to converge after about 500 samples, and always within 1000 in our problems and analyses. We originally assumed that Gibbs sampling would converge in about the same time, if not faster. It turns out that Gibbs takes much longer; it typically converges by 5000 samples, a full order of magnitude more, as can be seen from the figures in Section 7.A.2. This is likely because of the Markov chain approach used: since each sample depends on the ones before it, it can take many iterations before the algorithm settles into the global optimum, whereas likelihood weighting by definition discovers the appropriate probabilities (i.e. weights).

2. Variance of time to converge can be high
The convergence of Likelihood Weighting in Problem 3, as illustrated in Figure 10 and Figure 11, exhibits very interesting properties. In the other problems, likelihood weighting runs tended to exhibit relatively low variance in time to convergence. However, here we see some runs which converged very quickly, and others that took abnormally long. This high variance occurred consistently in this problem, and thus is likely induced by some characteristic of the problem; one likely explanation is that our query variable is a leaf node in a very poly-tree-like network.

3. Convergence is logarithmic
This is an evident feature of all of the graphs, and it has enormous implications for the choice of algorithm. The criterion for “completeness” of an algorithm is that it arrives at the right answer. In the case of the sampling methods that we surveyed, unfortunately it takes an infinite time to arrive at the right answer. However, it is important to note that variable elimination always arrives at the exact answer. Thus, if a user needs completeness (i.e. the right answer), they should probably use variable elimination. However, if they only need a certain level of completeness, i.e. they want to be x% right, they still cannot rely on sampling methods to deliver that on every run. This gives rise to the “x% correct y% of the time” metric. We certainly see this from our graphs.

4. Local optima in Gibbs sampling, but not in Likelihood Weighting
This is a very interesting point. In both problems 3 and 4 from Task 2 under Gibbs sampling, one of the runs from each of these problems does not converge to zero. Instead, it seems to converge to a local optimum (which is not the global optimum). This can be seen in the pink line in Figure 18 and the jungle green line in Figure 20. This is probably more likely in some networks than others. We could probably construct a very simple network that would not provoke this behavior.

C. Computational Considerations – Sampling versus Variable Elimination
In comparing the computation time of sampling methods to variable elimination, we limit ourselves to greedy ordering variable elimination, since random ordering is very sub-optimal (see Section 4). It turns out that for the networks and queries that we considered, variable elimination wins on both accuracy and speed. As can be seen from Table 2, variable elimination ran in roughly a second or less on each problem, while Gibbs took about 13 to 19 seconds and Likelihood Weighting took around 5 seconds. This is with 1000 samples for the sampling algorithms, and effectively infinite samples for variable elimination. Our results might have been different if the networks involved were much more dense (i.e. connected) or much larger.

                       Task 2. Insurance 1   Task 2. Insurance 2   Task 2. Carpo 1   Task 2. Carpo 2
Variable Elimination   0.741                 1.142                 0.120             0.090
Gibbs Sampling         12.778                13.530                19.228            18.045
Likelihood Weighting   4.377                 4.687                 5.608             5.317

Table 2. Execution time (seconds) of various algorithms on the four problems from Task 2.