# Data Mining - Fall 2011

Homework 1

Due: September 29

Question 1

[10 points] Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio.

1. Size of ten different objects as measured by people's judgments.
2. Distance between two astronomical objects as measured in light-years.
3. Movie ratings given on a scale of ten.
4. Percentage of ones in an m-by-n binary matrix (only 0/1 entries).
5. Seat numbers assigned to passengers on a flight.

Question 2
[20 points] There is a group of n female students, and another group of n male students. Two n-dimensional vectors A and B record the heights of the two groups of students, respectively. Consider the variable transformation defined by

    A = (A - mean(A)) / std(A)    (1)
    B = (B - mean(B)) / std(B)    (2)

where std stands for the standard deviation of a vector.

1. What will be the effect of this transformation?
2. How will the correlation between A and B change after this transformation?
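The transformation in Eq. (1) can be sketched in plain Python (the sample heights and the function name are illustrative, not part of the assignment):

```python
import math

def standardize(v):
    """Apply the transformation of Eq. (1): (v - mean(v)) / std(v)."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / std for x in v]

A = [160.0, 165.0, 158.0, 172.0, 170.0]  # illustrative heights (cm)
A_std = standardize(A)
```

Inspecting the mean and standard deviation of `A_std` is a quick way to check what the transformation does to a vector's location and scale.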

Question 3
[20 points] For the data set described below, give an example of the types of data mining questions that can be asked (one each for the classification, clustering, association rule mining, and anomaly detection tasks) and a description of the data matrix (what the rows and columns are). If necessary, briefly explain the features that need to be constructed. Note that, depending on your data-mining question, the row and column definitions may differ. Example data: a collection of Web pages.


Example answers (for the Web-page data):

- DM Task: Classification of web pages. Question: What type of web page is it? Row: a web page. Column: the vocabulary of words that appear in a web page, plus a class attribute that indicates whether it is a personal home page, a class web page, or a company's web page.
- DM Task: Clustering of web pages. Question: Which documents have similar topics? Row: a web page. Column: the vocabulary of words that appear in a web page.
- DM Task: Association rule mining. Question: Which words appear together frequently? For example, "teaching" and "research" appear together frequently in faculty web pages. Row: a web page. Column: the vocabulary of words that appear in a web page.
- DM Task: Anomaly detection. Question: Is it a legitimate web page or web spam? (Web spam is a web page created to manipulate search engines and to deceive Web users.) Row: a web page. Column: the vocabulary of words that appear in a web page, plus features constructed from the hyperlinks of the page. Examples of constructed features include the fraction of hyperlinks to URLs that reside in the same network domain or in another network domain.

(a) A clinical dataset containing various measures like temperature, blood pressure, blood glucose, and heart rate for each patient during every visit, along with the diagnosis information.

Question 4

[20 points] For each of the following pairs of vectors x and y, calculate the indicated similarity or distance measures.

(a) x = (1, 1, 0, 0, 0), y = (1, 0, 1, 0, 0): Jaccard, Euclidean, Cosine, Correlation
(b) x = (0, 1, 0, 1, 1), y = (0, 0, 0, 1, 1): Jaccard, Euclidean, Cosine, Correlation
(c) x = (0, 1, 2, 4, 5, 3), y = (5, 6, 7, 9, 10, 8): Cosine, Euclidean, Correlation

Question 5

[30 points]

(a) If two objects have a cosine measure of 1, are they identical? Explain. (5 points)

(b) Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1. (9 points)

(c) Given a graph G = (V, E, W), where V is the set of nodes in G, E is the set of edges in G, and W is the set of positive weights assigned to the edges in G, consider the following two similarity measures for a pair of nodes u and v. Mention one advantage and one disadvantage of each measure when used to compute the similarity between u and v. (16 points)

1. Jaccard similarity:

       Sim1(u, v) = |N_u ∩ N_v| / |N_u ∪ N_v|    (3)

   where N_u and N_v are the sets of neighbors of u and v, respectively.

2. Sim2:

       Sim2(u, v) = e^(-x)    (4)

   where x is the length of the shortest path between u and v, in terms of the total weight of the edges in the path.
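The four vector measures used in Question 4 can be sketched in plain Python (the function names are my own; `jaccard` here is the binary-attribute version, defined only for 0/1 vectors):

```python
import math

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: f11 / (f01 + f10 + f11)."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)  # f01 + f10
    return f11 / (f11 + mismatches)

def euclidean(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine similarity: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

# e.g. jaccard((1, 1, 0, 0, 0), (1, 0, 1, 0, 0)) == 1/3  (f11 = 1, mismatches = 2)
```

For the weighted-graph measures in Question 5(c), Sim2 is just `math.exp(-x)` once the weighted shortest-path length x has been computed (e.g. with Dijkstra's algorithm, since the weights are positive).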