
CS412, Fall 2010, Assignment 1: Solution

Question 1 (3 pts):

Question 2 (11 pts):

1 (2 pts):
Mean(price)=96.17
Median(price)=61.87
Q1(price)=41.54
Q3(price)=135.09
st-dev(price)=80.09
Mean(weight)=7.99
Median(weight)=4.85
Q1(weight)=3.55
Q3(weight)=7.55
st-dev(weight)=8.71
Note: Q1 and Q3 can be calculated with different interpolation methods or no interpolation at all. You need to specify your method if you are asked to calculate Q1 in your midterm exam.

2 (2 pts):
Price:

Weight:

3 (2 pts): scatter diagram:

Q-Q diagram:

4, 5 (2 pts):
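As a check on part 1, the price statistics can be recomputed from the 20 price values that appear in the partitions of part 7. The sketch below is not part of the original solution. It assumes that Q1 and Q3 are the medians of the lower and upper halves of the sorted data (no interpolation) and that st-dev is the population standard deviation; these appear to be the conventions that match the numbers listed in part 1. The name summary is illustrative.

```python
import statistics

# Price values as listed in the partitions of part 7.
PRICES = [4.5, 5.89, 17.99, 31, 33.5, 49.59, 52.9, 56.99, 59.98, 61,
          62.75, 84, 102.19, 125.5, 128, 142.19, 159, 189.4, 229, 328]


def summary(values):
    """Mean, median, median-of-halves quartiles and population st-dev."""
    xs = sorted(values)
    half = len(xs) // 2
    # When the number of values is odd, the middle value is in neither half.
    lower, upper = xs[:half], xs[len(xs) - half:]
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "q1": statistics.median(lower),
        "q3": statistics.median(upper),
        "stdev": statistics.pstdev(xs),  # assumption: population, not sample
    }


if __name__ == "__main__":
    for name, value in summary(PRICES).items():
        print(f"{name}: {value:.3f}")
```

Other quartile conventions, such as linear interpolation between neighbouring values, give slightly different Q1 and Q3, which is why the method should be stated.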

6 (1 pt): 0.5544, positively correlated.

7 (2 pts), Equal-width partition:
Bin 1: 4.5, 5.89, 17.99, 31, 33.5, 49.59, 52.9, 56.99, 59.98, 61, 62.75 and 84
Bin 2: 102.19, 125.5, 128, 142.19 and 159
Bin 3: 189.4 and 229
Bin 4: 328
Equal-frequency partition:
Bin 1: 4.5, 5.89, 17.99, 31 and 33.5
Bin 2: 49.59, 52.9, 56.99, 59.98 and 61
Bin 3: 62.75, 84, 102.19, 125.5 and 128
Bin 4: 142.19, 159, 189.4, 229 and 328

Question 3 (6 pts):
1 (2 pts),
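Returning to part 7 of Question 2: a minimal sketch, not part of the original solution, of how the two partitions can be produced. It assumes four bins, equal-width boundaries spaced evenly between the minimum and maximum price, and that the maximum value is placed in the last bin. The function names are illustrative.

```python
# Price values as listed in the partitions of Question 2, part 7.
PRICES = [4.5, 5.89, 17.99, 31, 33.5, 49.59, 52.9, 56.99, 59.98, 61,
          62.75, 84, 102.19, 125.5, 128, 142.19, 159, 189.4, 229, 328]


def equal_width_bins(values, k):
    """Split values into k bins whose value ranges all have the same width."""
    xs = sorted(values)
    lo, hi = xs[0], xs[-1]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for x in xs:
        if width == 0:
            idx = 0  # all values identical: everything goes in the first bin
        else:
            # The maximum would map to index k, so clamp it into the last bin.
            idx = min(int((x - lo) / width), k - 1)
        bins[idx].append(x)
    return bins


def equal_frequency_bins(values, k):
    """Split sorted values into k bins holding (nearly) the same count."""
    xs = sorted(values)
    n = len(xs)
    return [xs[i * n // k:(i + 1) * n // k] for i in range(k)]


if __name__ == "__main__":
    for name, bins in (("Equal-width", equal_width_bins(PRICES, 4)),
                       ("Equal-frequency", equal_frequency_bins(PRICES, 4))):
        print(name)
        for i, b in enumerate(bins, start=1):
            print(f"  Bin {i}: {b}")
```

With the price data the equal-width bin width is (328 - 4.5) / 4 = 80.875, so the first bin covers 4.5 up to 85.375 and holds the twelve values up to 84.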

Note on part 1: Dimensions and attributes can be different.

2 (2 pts)
(1)
Drill down on department from "all" to "department_name"
Slice for "department_name = Computer Science"
Drill down on semester from "all" to "year"
Slice for "year = 2010"
Drill down on semester from "year" to "season"
Slice for "season = Spring"
Drill down on students from "all" to "student_number"
(2)
Drill down on department from "all" to "college"
Slice for "college = Engineering"
Drill down on students from "college" to "student_number"
Denote this intermediate result as A
Starting from A
Drill down on semester from "all" to "year"
Slice for "year = 2010"
Drill down on semester from "year" to "season"
Slice for "season = Spring"
Denote this intermediate result as B
Starting from A
Drill down on semester from "all" to "year"
Slice for "year = 2009"
Drill down on semester from "year" to "season"
Slice for "season = Fall"
Denote this result as C
Compute the difference between B and C and sort the differences in decreasing order
Select the first row

3 (1 pt), To calculate stdev, we have:
stdev(x) = sqrt( sum(x^2)/count(x) - E(x)^2 )
so we can keep count(x), sum(x^2) and E(x) at each level. To calculate stdev at a higher level, we can simply compute sum(x^2) and E(x) in an aggregated way, and plug these numbers back into the formula.

4 (1 pt), This is a holistic measure, so there is no easy way to calculate the top 10% at each level exactly. But we can calculate an approximate value for each level using a formula similar to the one for approximating the median. The top 10% threshold (90th percentile) can be estimated as follows. At each level, we first partition the data into 10 bins using the equal-width method, and for each bin we keep the count, the lower and upper boundaries, and the bin width. To compute the top 10% at a higher level, we can do this:
top10% = L1 + width*(0.9*N - sigma(C)1)/C'
where L1 is the lower boundary of the top 10% bin (the bin that contains the 90th percentile), N is the number of values in the whole cuboid, sigma(C)1 is the sum of the counts of all bins that are lower than the top 10% bin, and C' is the count of the top 10% bin.

Question 4 (10 pts)
1 and 2 (2 + 2 pts),
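Returning to parts 3 and 4 of Question 3: a minimal sketch, not part of the original solution, of both aggregation ideas. For stdev it assumes each lower-level cell stores (count, mean, sum of squares). For the percentile estimate it assumes each cell stores a list of (lower boundary, upper boundary, count) bins in ascending order, and that all cells share the same bin boundaries so counts can be added when rolling up. The percentile function targets the 0.9*N-th value, by analogy with the approximate-median formula that targets N/2. All names are illustrative.

```python
import math


def summarize(values):
    """Summary stored for one cell: (count, mean, sum of squares)."""
    n = len(values)
    return (n, sum(values) / n, sum(v * v for v in values))


def roll_up(summaries):
    """Combine child summaries into the summary of the parent cell."""
    n = sum(c for c, _, _ in summaries)
    mean = sum(c * m for c, m, _ in summaries) / n  # count-weighted mean
    sum_sq = sum(s for _, _, s in summaries)
    return (n, mean, sum_sq)


def stdev(summary):
    """Population stdev from sqrt(sum(x^2)/count - E(x)^2)."""
    n, mean, sum_sq = summary
    return math.sqrt(max(sum_sq / n - mean * mean, 0.0))  # clamp rounding noise


def merge_bins(cells):
    """Add up per-bin counts of cells that share the same bin boundaries."""
    return [(lo, hi, sum(cell[i][2] for cell in cells))
            for i, (lo, hi, _) in enumerate(cells[0])]


def approx_percentile(bins, p=0.9):
    """Linear interpolation inside the bin that contains the p*N-th value."""
    total = sum(count for _, _, count in bins)
    target = p * total
    below = 0  # sum of counts of all bins lower than the current one
    for lower, upper, count in bins:
        if count > 0 and below + count >= target:
            return lower + (upper - lower) * (target - below) / count
        below += count
    return bins[-1][1]


if __name__ == "__main__":
    parts = [summarize([2, 4, 4, 4]), summarize([5, 5, 7, 9])]
    print("rolled-up stdev:", stdev(roll_up(parts)))
    uniform = [(10 * i, 10 * (i + 1), 10) for i in range(10)]
    print("approx. 90th percentile:", approx_percentile(uniform, 0.9))
```

The stdev roll-up is exact because count, sum of squares and the count-weighted mean are all distributive or algebraic; the percentile is only as accurate as the assumption that values are spread evenly inside each bin.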

Note on parts 1 and 2: this is just one possible answer derived from the reference paper.

3 (2 pts), There are a couple of ways to clean up RFID data. We can estimate the data distribution of each type of product for each measure, and treat registrations that drift too far from the other data instances as noise. We can then correct such a registration by using a normalized neighbor value instead.

4 (2 pts), Using a dice operation, find the RFIDs of the TV sets of the given brand and price range that are in BestBuy in Champaign for the given month. We can add brand and price as separate dimensions when implementing the warehouse. Then we simply trace back the RFID tags and look up the stay tables to finally select all TV sets shipped from the LA seaport to BestBuy in Champaign.

5 (2 pts), By looking up the RFID of this product, we can trace the time it was sold. If it was sold before the expiration date, we need to trace all the other milk that was produced and shipped to this store together with the rotten milk, because it is likely to have stayed in the same environment from the factory to the store before customers took it off the shelf. If any of that milk has also gone bad, there must be something wrong with the shipping or storage environment.
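The tracing step in part 5 can be sketched as follows. This is a hypothetical illustration, not the schema of the reference paper: it assumes a flat list of stay records (tag, location, time_in, time_out) and treats two items as having travelled together when the other item entered every location of the spoiled item at the same time. The names Stay and shipped_with and the sample records are made up for the example.

```python
from collections import namedtuple

# Hypothetical stay record: one RFID tag staying at one location for a period.
Stay = namedtuple("Stay", ["tag", "location", "time_in", "time_out"])


def shipped_with(stays, bad_tag):
    """Tags that entered every location of bad_tag at the same time as it did."""
    paths = {}
    for s in stays:
        # time_out is ignored: items leave the store shelf at different times.
        paths.setdefault(s.tag, set()).add((s.location, s.time_in))
    bad_path = paths.get(bad_tag, set())
    if not bad_path:
        return []
    return sorted(tag for tag, path in paths.items()
                  if tag != bad_tag and bad_path <= path)


# Made-up sample data, for illustration only.
STAYS = [
    Stay("milk-1", "factory", 0, 2), Stay("milk-1", "truck", 2, 5),
    Stay("milk-1", "store", 5, 9),
    Stay("milk-2", "factory", 0, 2), Stay("milk-2", "truck", 2, 5),
    Stay("milk-2", "store", 5, 7),
    Stay("milk-3", "factory", 0, 3), Stay("milk-3", "truck", 3, 6),
    Stay("milk-3", "store", 6, 8),
]

if __name__ == "__main__":
    print(shipped_with(STAYS, "milk-1"))
```

The items returned are the candidates to inspect; if they have also gone bad, the shared shipping or storage stages are the likely cause.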
