
Question 1: (Open question, following are some possible answers) In the introduction chapter, we introduced different ways to perform data mining: (1) using a data mining language to write data mining functions in a commercial data mining system, e.g., Microsoft SQL Server Analysis Manager or Oracle Data Mining; (2) developing your own customized data mining functions from scratch for particular applications; and (3) writing functions for invisible data mining.
1. Present examples for each way of building data mining functions.
Answer: (answers from rawlani1 and duan9)
(1) Consider an insurance company that has its user and business data stored in IBM's DB2 relational database. The insurance company can use DB2 Intelligent Miner's Classification function to create a risk-group profile in the form of a data mining model. This profile will contain attributes common to lapsed customers, and can be applied to new customers to see if they belong to the risk group. (Many students think SQL is a data mining language, which is incorrect: SQL is a classical database retrieval technology, not a data mining method.)
(2) Many web data mining problems need to be solved from scratch, because this is a new area and the tasks can be varied: for example, mining the link structure of web pages to decide the importance of web pages, or mining email repositories to discover social networks.
(3) Invisible data mining is extensively used in online shopping websites. When users select an item to purchase, they usually see suggestions from the website based on the popularity of the item and its similarity to other items, saying something like "Customers who viewed this item also viewed the following". This is a classic example of invisible data mining, where the end user does nothing special to get these suggestions. This kind of invisible data mining can be found on websites selling electronic items, cars, clothes, airline tickets, etc., such as Amazon, eBay and eDreams.
2. Discuss the pros and cons of each way to perform data mining.
Answer: (answers from rawlani1)
1) Commercial data mining system. Pros: A vast majority of commonly used data mining functions are already present in these systems, which saves a huge amount of development time; these functions are implemented by data mining experts at high-profile companies, so they are quite reliable. Cons: They may not necessarily suit your needs, and integration with the backend is usually not so straightforward.
2) Customized data mining functions. Pros: They always meet the application-specific needs, and they are customizable when application requirements change. Cons: A lot of experience is required to write them correctly.
3) Invisible data mining. Pros: It has made online shopping very easy. Cons: Not all of the available information can be shown to the user, because of data privacy issues.
3. Suppose you are hired by a web-retail company, such as Amazon.com. Discuss three data mining functions you would build for the company so that customers at home can do "invisible data mining".
Answer: (answers from rawlani1)

1) We can display other items having a similar configuration and price to the item the user chooses to view.
2) We can make suggestions on items that go along with the product being viewed. These are not similar items, but other accessories that go along with it (like headphones or a mic when browsing a laptop).
3) Suggested products can be ranked based on customer reviews and brand popularity.
4) Products can be recommended based on the purchase and browsing history of other users.

Question 2: Suppose a student collected the price and weight of 20 products in a shop, with the following result:
[table of 20 price-weight pairs; prices range from $4.5 to $328, weights from 1.4 to 38.4]
1. Calculate the mean, median, and standard deviation of price and weight.
Answer:
For price: mean: 96.6, median: 60.5, std-dev: 84.22 or 82.09.
For weight: mean: 7.2, median: 4.55, std-dev: 8.9403 or 8.7139.
For standard deviation, there are two definitions. The first definition is

  S_N = sqrt( (1/N) * sum_{i=1..N} (x_i - xbar)^2 )

where x_i is a sample value and xbar is the mean of the sample. This definition, denoted S_N, is called the standard deviation of the sample, and is the maximum-likelihood estimate when the population is normally distributed. The second definition is

  s = sqrt( (1/(N-1)) * sum_{i=1..N} (x_i - xbar)^2 )

and is called the sample standard deviation. This correction (the use of N - 1 instead of N) is known as Bessel's correction. The reason for the correction is that s^2 is then an unbiased estimator for the variance of the underlying population, if that variance exists and the sample values are drawn independently with replacement. The first estimator, when applied to a small or moderately-sized sample, tends to be too low: it is a biased estimator, although it has a uniformly smaller mean squared error than the sample standard deviation. In this assignment, we take both definitions as correct answers, but note that they have different semantic meanings rather than being interchangeable.
2. Draw the boxplots for price and weight.
Answer: (boxplots of price and weight)
Different textbooks and software have different definitions of Q1 and Q3. If Q1 falls between element N and element N+1 of the sorted data, Matlab would return (Element(N) + Element(N+1))/2, while Microsoft Excel defines it as 0.25 * Element(N) + 0.75 * Element(N+1). In this assignment, we take both definitions as correct answers. For weight: Q1: 3.2, Q3: 7.875. For price, Q1 falls between $31 and $33.9 and Q3 between $142.5 and $142.75, with the exact value depending on which convention is used.
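As a sanity check, the two standard-deviation definitions can be computed with Python's standard library. This is a sketch only: the numbers below are toy values, not the assignment's 20-product table.

```python
import statistics

# Toy prices -- illustrative only, not the assignment's data.
prices = [4.5, 22.0, 31.0, 59.99, 128.0, 328.0]

mean = statistics.fmean(prices)     # arithmetic mean
median = statistics.median(prices)  # average of the two middle values here (even N)

# Definition 1: divide by N -> "standard deviation of the sample"
# (biased, but the maximum-likelihood estimate under normality).
std_n = statistics.pstdev(prices)

# Definition 2: divide by N-1 (Bessel's correction) -> "sample standard deviation"
# (its square is an unbiased estimator of the population variance).
std_n1 = statistics.stdev(prices)
```

For the same data, `stdev` is always at least as large as `pstdev`, since the only difference is the N-1 versus N in the denominator; that is exactly the "84.22 or 82.09" pattern in the answer above.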

3. Normalize the two variables based on the min-max normalization (min = 1, max = 10).
Answer:
Norm price = (price - 4.5) / (328 - 4.5) * (10 - 1) + 1
Norm weight = (weight - 1.4) / (38.4 - 1.4) * (10 - 1) + 1
(table of normalized price and weight values)
4. Draw the scatter plot and Q-Q plot based on these two variables.
Answer: (scatter plot and Q-Q plot of price vs. weight)
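The min-max formulas above generalize to a small helper; this is a sketch in which the source min and max are taken from the data being normalized (the function name is illustrative, not from the assignment):

```python
def min_max_normalize(values, new_min=1.0, new_max=10.0):
    """Map values linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

# The endpoints of the data always map to the target range boundaries:
normed = min_max_normalize([4.5, 60.5, 328.0])
```

Note that min-max normalization preserves the ordering and relative spacing of the values; only the scale and offset change.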

5. Normalize the two variables based on the z-score normalization.
Answer: (from luu1)
Norm price = (price - mean)/std_dev = (price - 96.6)/84.22
Norm weight = (weight - 7.2)/8.9403
Or you can use 82.09 and 8.7139 as std_dev for price and weight (the 1/N definition).
(table of z-score normalized price and weight values)
6. Calculate the Pearson correlation coefficient. Are these two variables positively or negatively correlated?
Answer: Pearson correlation coefficient = 0.536307. These two variables are positively correlated.
7. Take the prices of the above 20 products and partition them into four bins by each of the following methods: equal-width partitioning, and equal-depth (equal-frequency) partitioning.
Answer:
a) Equal-width partitioning: the width of each bin = (328 - 4.5)/4 = 80.875, so the bin boundaries are $85.375, $166.25 and $247.125.
Bin 1: all prices from $4.5 through $82.6
Bin 2: $125.19 $128 $142.5 $142.75 $159
Bin 3: $189.98 $229
Bin 4: $328
b) Equal-depth partitioning: sort the 20 prices and place five consecutive values in each bin, so Bin 1 holds the five lowest prices (starting at $4.5) and Bin 4 the five highest (ending at $328).

Question 3: (Open question, following are some possible answers) Design a data warehouse for a Walmart-like chain store to analyze the sales transactions.
1. Suppose the data warehouse consists of the following dimensions: product, sales, store, and time, and a set of measures you would like to define. Draw a star-schema based on your consideration of the power and convenience of analysis of the warehouse.
Answer: (Some students take sale as a measure instead of a dimension, which is feasible in this scenario as well.) From nagaraj3: (star-schema figure)
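The two partitioning methods differ exactly as in the answer above: equal-width fixes the bin boundaries (so counts per bin can be very uneven), while equal-depth fixes the counts (so boundaries adapt to the data). A minimal sketch with toy values, assuming the number of values divides evenly by k:

```python
def equal_width(values, k):
    """Bins of equal value-range; counts per bin may differ."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        i = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
        bins[i].append(v)
    return bins

def equal_depth(values, k):
    """Bins of equal count; value ranges per bin may differ."""
    s = sorted(values)
    depth = len(s) // k  # assumes len(values) is divisible by k
    return [s[i * depth:(i + 1) * depth] for i in range(k)]

vals = [1, 2, 3, 4, 5, 6, 7, 100]
ew = equal_width(vals, 4)  # one outlier pushes most values into the first bin
ed = equal_depth(vals, 4)  # exactly two values per bin regardless of spread
```

The outlier at 100 illustrates why the assignment's equal-width bins for price are so lopsided: a single large value ($328) stretches the bin width.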

Or answer from rawlani1: (alternative star-schema figure)
2. Suppose you start from the top (all-summary) of the multidimensional hierarchy. What are the concrete OLAP operations (drilling, slicing, dicing, etc.) you need to find the following:
* Average daily profit of each product in the Toys department in January 2009.
* Find which store has the highest monthly increase of sales of bread among all the stores in Illinois.
Answer: from duan9:
* For the first query:
drill down on product to department
drill down on time to month
dice on product department = Toys and month = Jan 2009
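The dice-then-drill-down pattern for the first query can be sketched with plain Python over a hypothetical mini fact table (the rows and schema below are toy stand-ins for the warehouse's sales facts, not data from the assignment):

```python
from collections import defaultdict

# Hypothetical fact rows: (department, month, day, profit).
facts = [
    ("Toys", "2009-01", 1, 120.0),
    ("Toys", "2009-01", 1, 80.0),
    ("Toys", "2009-01", 2, 50.0),
    ("Food", "2009-01", 1, 300.0),
]

# Dice: restrict to department = Toys and month = Jan 2009.
diced = [r for r in facts if r[0] == "Toys" and r[1] == "2009-01"]

# Drill down on time to day, then average profit within each day.
by_day = defaultdict(list)
for _dept, _month, day, profit in diced:
    by_day[day].append(profit)
avg_daily_profit = {day: sum(p) / len(p) for day, p in by_day.items()}
```

Here the dice is a filter on two dimensions and the drill-down is a group-by at a finer time granularity, which is exactly the sequence of operations listed in the answer.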

drill down on time to day
average profit on day
* For the second query:
drill down on product to type
drill down on store to state
dice on product type = bread and state of store = Illinois
drill down on time to month
calculate monthly increase for each month
drill down on store to street number
argmax monthly increase on store

3. Median and rank are two holistic measures. Discuss how to develop efficient (maybe approximate) methods to compute these two measures in a multi-dimensional space.
Answer: from luu1
Median: We can compute the approximate median as follows.
Step 1: Partition the data into k bins using the equal-width method.
Step 2: For each bin, store the following values: the frequency and the lower and upper boundaries. We also store the bin width.
Step 3: When the user queries the median of a cell, we compute the median bin (with cost O(k)) and return the approximate median computed as

  Median = L1 + ((N/2 - sum_{lower bins} freq_i) / freq_median) * width   [1]

where L1 is the lower boundary of the median interval, N is the sum of the frequencies of all bins, sum_{lower bins} freq_i is the sum of the frequencies of all bins lower than the median bin, freq_median is the frequency of the median bin, and width is the width of the median bin. (This is the same as formula 2.3 in our textbook.)
However, the efficiency of the above algorithm might depend on the method of binning. A baseline method is equal-width binning, assuming some min and max value of the data. The best-case scenario is when we have equal-depth binning for the query cell.

4. Suppose we want to present the standard deviation of sales by item category, location and week, and freely drill up and down in multidimensional space. Describe how this measure can be computed efficiently.
Answer: from chiwang1
For each cuboid, we calculate the standard deviation as well as the mean and count. Assume the number of values to be aggregated in one dimension of one cuboid is n, and we want to calculate the aggregated measure along this dimension, i.e., compute std_agg, mean_agg and count_agg from std_i, mean_i and count_i, 1 <= i <= n.
In this way, the aggregated measures can be computed in a distributed manner.
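Both ideas can be sketched in a few lines: the partial aggregates (count, mean, std) merge algebraically, and the approximate median follows formula [1] from the stored bin frequencies. A sketch, assuming the 1/N (population) variance definition and illustrative function names:

```python
import math

def merge_stats(parts):
    """Combine (count_i, mean_i, std_i) partials into (count_agg, mean_agg, std_agg).

    Uses the 1/N definition, so E[x^2] = std_i^2 + mean_i^2 within each partition.
    """
    count = sum(c for c, _, _ in parts)
    mean = sum(c * m for c, m, _ in parts) / count
    sumsq = sum(c * (s * s + m * m) for c, m, s in parts)  # total sum of squares
    return count, mean, math.sqrt(sumsq / count - mean * mean)

def approx_median(bins):
    """bins: ordered list of (lower_boundary, width, freq); applies formula [1]."""
    n = sum(f for _, _, f in bins)
    cum = 0  # frequency accumulated in bins below the current one
    for lower, width, freq in bins:
        if cum + freq >= n / 2:  # this is the median bin
            return lower + (n / 2 - cum) / freq * width
        cum += freq

# Merging two partitions of [1,2,3] and [4,5,6] reproduces the global mean/std:
c, m, s = merge_stats([(3, 2.0, math.sqrt(2 / 3)), (3, 5.0, math.sqrt(2 / 3))])
```

The merge shows why count, mean, and standard deviation together form an algebraic measure: each cuboid cell only needs a fixed-size summary from its children, never the raw values.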

For approximate rank, we can do similarly. The difference is that instead of finding the median bin, we find the bin that contains the element of rank m. The approximate rank is

  Rank(m) = L1 + ((m - sum_{lower bins} freq_i) / freq_bin_m) * width

Moreover, space and performance also depend on the parameter k: the bigger k is, the closer the approximate result, but the more memory and computation will be required, so k can be chosen depending on the tolerance of the system. Finally, the more we know about the data, the closer to the ideal case we can get.

Question 4: (Open question, following are some possible answers) Suppose a company would like to design a data warehouse that may facilitate the analysis of moving vehicles in an online analytical processing manner. The company registers huge amounts of auto movement data in the format of (Auto_ID, location, speed, time). Each Auto_ID represents one vehicle associated with information such as vehicle_category, driver_category, etc., and each location could be associated with a street in a city. You may assume a street map is available for the city.
1. Design such a data warehouse that may facilitate effective on-line analytical processing in multidimensional space.
Answer: from rawlani1
To design a data warehouse for the analysis of moving vehicles, we can consider vehicle as a fact table that points to four dimensions: auto, time, location and speed. Here, the measures considered are vehicle_sold and vehicle_mileage. In reality, the measures can vary depending on the desired target and the power required for analysis.
2. The movement data may contain noise. Discuss how you can develop a method that may automatically discover data records that were likely erroneously registered in the data repository.
Answer: from duan9
To handle the noise in the data, we first need to do data cleaning. Then we can use some data smoothing techniques, such as regression and outlier analysis, to remove noisy data points. Moreover, we can also set up some rules to detect inconsistent data and remove them based on domain knowledge.
Missing values may be filled in or dropped entirely.
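One simple instance of the outlier-analysis idea: flag speed readings that deviate from a street's mean by more than a few standard deviations. A sketch with toy numbers; the 2-sigma threshold is an assumption, and in practice it would be tuned per street and time of day:

```python
import statistics

# Hypothetical speed readings (km/h) registered for one street over an hour;
# the 220 is the kind of record we would like to flag as erroneously registered.
speeds = [42, 45, 40, 38, 44, 41, 220, 43]

mu = statistics.fmean(speeds)
sigma = statistics.pstdev(speeds)

# Flag readings more than 2 standard deviations from the mean (assumed threshold).
outliers = [v for v in speeds if abs(v - mu) > 2 * sigma]
```

A robust variant would use the median and interquartile range instead of mean and standard deviation, since one extreme value inflates both statistics; the flagged records would then be reviewed or smoothed rather than silently deleted.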

3. The movement data may be sparse. Discuss how you can develop a method that may construct a reliable data warehouse despite the sparsity of the data.
Answer: from rawlani1
It may be possible that some data in the data warehouse is sparse. Analyzing sparse data is not reliable, as a single outlier may completely shift the results when there are only a few values to work with. We have to evaluate the confidence interval in such cases, where the confidence interval is a measure that defines the reliability of the data. The smaller the confidence interval, the better; if the confidence interval is too large, this indicates larger ambiguity in the data. The confidence interval can be large even when the overall data size is large enough, if the queried data cell had only few or no values. Here, sparsity can be treated as a missing-value problem, and we may use the existing data to fill in the missing value: e.g., if speed is missing, we can look up the information for vehicles of the same vehicle category and driver category, or the speed recorded previously on that particular street at that particular time may be considered instead of the missing value. Beyond that, we have two ways to resolve the issue:
Intra-cuboid query: Expand the sample size by including the nearby cells in the same cuboid as the queried cell, which reduces the confidence interval. For example, if some cell was sparse at query execution for a particular hour, we may generalize it to a day. This may now give some values in that cell.
Inter-cuboid query: This is the extreme case of the intra-cuboid query: we can even reduce the dimensionality of the data cube, i.e., remove a dimension by generalizing it, provided the semantics of the query is not changed. For example, we can then view speed as a function of location or time alone. This technique is most commonly used in machine learning.

4. If one wants to drive from A to B starting at a particular time, discuss how a system may use the data in this warehouse to work out a fast route for the driver.
Answer: from luu1
Using this warehouse, we apply OLAP operations (drill, dice, etc.) to look up the speed of a location at a specific time (at the level of the hour) and use that as the weight for the corresponding street on the city graph. We need to find the weights for all the possible paths from the start location to the end location, and we might need to update the weights every hour. Here we do not care about the direction of the street; if direction matters, we can also integrate that information and create a directed graph. Using this weighted graph, we can work out the fastest route for the driver with any well-known shortest-path algorithm, such as A* or Dijkstra.
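The routing step can be sketched with Dijkstra's algorithm over a toy street graph. The graph, node names, and edge weights below are hypothetical: each weight stands for the minutes needed to traverse a street at the speed looked up from the warehouse for the chosen hour.

```python
import heapq

# Hypothetical street graph for one hour: node -> list of (neighbor, minutes).
graph = {
    "A": [("B", 5.0), ("C", 2.0)],
    "C": [("B", 1.0), ("D", 7.0)],
    "B": [("D", 1.0)],
    "D": [],
}

def dijkstra(graph, src, dst):
    """Return (total_minutes, path) for the fastest src -> dst route."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]  # min-heap ordered by tentative distance
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Walk predecessors backwards to reconstruct the route.
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return dist[dst], path[::-1]

minutes, route = dijkstra(graph, "A", "D")
```

When the hourly speeds change, only the edge weights are rebuilt and the search is rerun; an A* variant would additionally use a distance-based lower bound as its heuristic.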
