
Data Mining Spring 2017 HW2

1) In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.

2) Consider the following data for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
25, 25, 25, 25, 25, 31, 34, 34, 35, 35, 35, 36, 39, 45, 46, 52, 70.
a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
b) Determine the outliers in the data.
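
Hint for Q2: the R sketch below shows one way to set up equal-depth bins and replace each value by its bin mean. It assumes the values are already sorted and that bins are filled in order, three values at a time. The names bin.id, bin.mean and smoothed are illustrative only. The 1.5 x IQR rule used by boxplot.stats() is just one common convention for flagging outliers; other definitions are acceptable if you justify them.

```r
# Age values from Q2, already in increasing order.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
         25, 25, 25, 25, 25, 31, 34, 34, 35, 35,
         35, 36, 39, 45, 46, 52, 70)

depth <- 3                                  # number of values per bin

# Assign consecutive values to bins: 1,1,1,2,2,2,...
bin.id <- ceiling(seq_along(age) / depth)

# Mean of each bin, then replace every value by the mean of its bin.
bin.mean <- tapply(age, bin.id, mean)
smoothed <- as.numeric(bin.mean[bin.id])

print(split(age, bin.id))                   # the bins themselves
print(round(smoothed, 2))                   # the smoothed series

# One possible outlier rule (assumption): points beyond 1.5 * IQR
# from the hinges, as used by a standard box plot.
print(boxplot.stats(age)$out)
```

Smoothing by bin means keeps the overall sum of the data unchanged while removing variation inside each bin, which is worth keeping in mind when you comment on the effect.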


3) Using the data for age given in Q2, answer the following:
a) Use min-max normalization to transform the value 35 for age onto the range
[0.0, 1.0].
b) Use z-score normalization to transform the value 35 for age.
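
Hint for Q3: a minimal R sketch of the two normalizations, assuming the standard definitions v' = (v - min) / (max - min) * (new_max - new_min) + new_min and z = (v - mean) / sd. The helper names min.max and z.score are hypothetical. Note that R's sd() divides by n - 1; the population argument below is an assumption added so you can compare against the divide-by-n version if your course uses that one.

```r
# Age values from Q2.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
         25, 25, 25, 25, 25, 31, 34, 34, 35, 35,
         35, 36, 39, 45, 46, 52, 70)

# Min-max normalization of value v, relative to data x, onto [new.min, new.max].
min.max <- function(v, x, new.min = 0, new.max = 1) {
  (v - min(x)) / (max(x) - min(x)) * (new.max - new.min) + new.min
}

# Z-score normalization of value v relative to data x.
# population = FALSE uses the sample standard deviation (n - 1),
# population = TRUE rescales it to the population version (n).
z.score <- function(v, x, population = FALSE) {
  s <- sd(x)
  if (population) {
    s <- s * sqrt((length(x) - 1) / length(x))
  }
  (v - mean(x)) / s
}

print(min.max(35, age))
print(z.score(35, age))
print(z.score(35, age, population = TRUE))
```

State in your answer which standard deviation you used, since the two versions give slightly different z-scores.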

4) Download the Ionosphere dataset from http://archive.ics.uci.edu/ml/datasets/Ionosphere and
limit your analysis to the following subset of the dataset:

§ If your student number ends with 0, 1, 2, 3 or 4, analyze the 32nd, 33rd and 34th
numerical attributes and the class variable of the dataset.
§ If your student number ends with 5, 6, 7, 8 or 9, analyze the 7th, 32nd and 33rd
numerical attributes and the class variable of the dataset.
Apply the following exploratory data analysis techniques to your dataset:
a. Compute the covariance matrix for the three numerical attributes you are analyzing;
also compute the correlation for each of the three pairs of attributes. Interpret the
statistical findings!
b. Create a scatter plot for the last two numerical attributes of your dataset. Interpret the
scatter plot!
c. Create histograms for each of the 3 attributes. Interpret the histograms!
d. Analyze the spread and distribution of the 33rd numerical attribute of the original
dataset.
e. Use one more display of your own choosing to visualize the 33rd numerical attribute;
compare it with the histogram you created in part c.
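
Hint for Q4: the R sketch below illustrates the statistics asked for in parts a and d on a small synthetic data frame, not on the Ionosphere data. The data frame demo and its columns a, b and c are made up for illustration; with the real data you would read the downloaded file into a data frame first (for example with read.csv() and header = FALSE, assuming the file is comma-separated without a header row) and select your three columns.

```r
# Synthetic stand-in for three numerical attributes (illustration only).
set.seed(1)
a <- rnorm(100)
demo <- data.frame(a = a,
                   b = 0.5 * a + rnorm(100),   # built to correlate with a
                   c = rnorm(100))             # built to be unrelated

# Part a: covariance matrix and pairwise correlations.
cov.mat <- cov(demo)
cor.mat <- cor(demo)
print(round(cov.mat, 3))
print(round(cor.mat, 3))

# Correlation is covariance rescaled by the standard deviations,
# so both matrices carry the same sign information.
print(all.equal(cor.mat, cov2cor(cov.mat)))

# Part d: numeric summaries of spread and distribution for one attribute.
print(summary(demo$c))
print(c(sd = sd(demo$c), IQR = IQR(demo$c)))
print(quantile(demo$c, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)))
```

For the plots in parts b, c and e, the base R functions plot(), hist() and boxplot() are sufficient; any other display is fine for part e as long as you compare it with the histogram.
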
5)

# Load the library.


library("MASS")

# Create a data frame from the main data set.


data <- data.frame(Cars93$AirBags, Cars93$Type)

# Create a table with the needed variables.


car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test.


print(chisq.test(car.data))

For the R code shown above:


a) What is the number of attributes in the complete dataset?
b) Explain the results.
c) Find the correlation between the DriveTrain and Type attributes.
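
Hint for Q5: DriveTrain and Type are both categorical, so one reasonable reading of "correlation" here is a test of association on their contingency table, mirroring the code above. The sketch below is written under that assumption. The variable names drive.type, test and cramers.v are illustrative, and Cramér's V is an optional strength-of-association measure that the assignment does not require.

```r
library("MASS")   # provides the Cars93 data frame

# Rows and columns (attributes) of the complete data set.
print(dim(Cars93))

# Contingency table of the two categorical attributes.
drive.type <- table(Cars93$DriveTrain, Cars93$Type)
print(drive.type)

# Chi-square test of independence. Some expected counts are small,
# so R may warn that the approximation could be inaccurate.
test <- chisq.test(drive.type)
print(test)

# Optional (assumption): Cramer's V, a 0-to-1 measure of association
# computed from the chi-square statistic.
n <- sum(drive.type)
k <- min(dim(drive.type)) - 1
cramers.v <- sqrt(unname(test$statistic) / (n * k))
print(cramers.v)
```

A small p-value suggests the two attributes are not independent; mention any warning R prints when you explain the result.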

6)

Choose a suitable dataset of your own, then apply a data reduction algorithm (PCA, for
example) to reduce the data dimensionality to a suitable size.
Explain your work in detail!
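
Hint for Q6: a minimal PCA sketch in R using prcomp(). The built-in iris data set and the 95% cumulative-variance cutoff are assumptions chosen only for illustration; pick your own data set and justify your own cutoff.

```r
# Numeric columns of a built-in data set (illustration only).
x <- iris[, 1:4]

# PCA on centered and scaled attributes, so that attributes measured
# on larger scales do not dominate the components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
cum.var <- cumsum(var.explained)
print(round(cum.var, 4))

# Keep the smallest number of components reaching the chosen cutoff.
cutoff <- 0.95                              # assumed threshold
k <- which(cum.var >= cutoff)[1]

# The reduced data: one row per observation, k columns.
reduced <- pca$x[, 1:k, drop = FALSE]
print(dim(reduced))
```

In your report, explain why you scaled (or did not scale) the attributes and how you chose the number of components to keep.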

That’s all
Best wishes
