1. In real-world data, tuples with missing values for some attributes are a common occurrence.

Methods for handling this problem are:

- Ignore the tuple(s): usually done when the tuple has many missing values
- Fill in the missing value manually: time-consuming and not feasible with large data sets
- Use a global constant to fill in the missing value
- Use the attribute mean for all samples belonging to the same class as the given tuple
- Use the most probable value to fill in the missing value: determined using regression, inference-based tools, or
decision tree induction.
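
Two of the strategies above (global constant and class-wise attribute mean) can be sketched as follows. This is only an illustration: the toy rows, the attribute they represent, and the function names `fill_with_constant` and `fill_with_class_mean` are assumptions made for the example, not part of the original notes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (class label, attribute value); None marks a missing value.
rows = [
    ("low", 20.0),
    ("low", None),
    ("low", 30.0),
    ("high", 90.0),
    ("high", None),
    ("high", 110.0),
]


def fill_with_constant(rows, constant):
    """Replace every missing value with one global constant."""
    return [(label, constant if value is None else value) for label, value in rows]


def fill_with_class_mean(rows):
    """Replace a missing value with the mean of the known values of the same class.

    Assumes every class has at least one known value; otherwise a KeyError is raised.
    """
    known = defaultdict(list)
    for label, value in rows:
        if value is not None:
            known[label].append(value)
    class_means = {label: mean(values) for label, values in known.items()}
    return [
        (label, class_means[label] if value is None else value)
        for label, value in rows
    ]


constant_filled = fill_with_constant(rows, -1.0)
mean_filled = fill_with_class_mean(rows)
```

The class-wise mean keeps the filled value close to other tuples of the same class, whereas a global constant can make the filled tuples look like a group of their own.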
2. Issues to consider during data aggregation: potential loss of interesting details
3. Some methods of dimensionality reduction:
- Principal component analysis (PCA):
+ Based on the idea of mapping data from a higher-dimensional space to a lower-dimensional space
such that the variance of the data in the lower-dimensional space is maximized.
+ Steps: construct the covariance matrix of the data -> compute the eigenvectors of this matrix. The
eigenvectors with the largest eigenvalues give the directions of the lower-dimensional space.
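
The two PCA steps can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: the toy data, the helper names `covariance_matrix` and `leading_eigenvector`, and the use of power iteration to approximate only the first eigenvector are choices made for the example. In practice a linear algebra library would normally compute the full eigendecomposition.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of `data` (one row per observation)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def leading_eigenvector(matrix, iterations=200):
    """Approximate the eigenvector with the largest eigenvalue by power iteration.

    Assumes the start vector of all ones is not orthogonal to that eigenvector.
    """
    d = len(matrix)
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical 2-D data lying roughly along the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]

cov = covariance_matrix(data)
pc1 = leading_eigenvector(cov)

# Project the centered data onto the first principal component (2-D -> 1-D).
means = [sum(col) / len(data) for col in zip(*data)]
projected = [
    sum((x - m) * w for x, m, w in zip(row, means, pc1)) for row in data
]
```

For this data the first component points roughly along the direction (1, 2), so the one-dimensional projection keeps almost all of the variance.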
4. A ‘good’ subset of the original attributes can be found through the following methods:
- Heuristic methods:
+ Stepwise forward selection: the procedure starts with an empty set of attributes as the reduced set.
At each iteration, the best of the remaining original attributes is determined and added to the reduced
set.

+ Stepwise backward elimination: the procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.

+ Combination of forward selection and backward elimination: at each step, the procedure selects the
best attribute and removes the worst from among the remaining attributes.

+ Decision tree induction: construct a flowchart-like structure where each internal node denotes a
test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction. At each node, the algorithm chooses the best attribute to partition the
data into individual classes. A tree is constructed from the given data, and the attributes that do
not appear in the tree are assumed to be irrelevant.
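
Stepwise forward selection can be sketched as a greedy loop. The notes do not say how the "best" attribute is measured or when to stop, so both are assumptions here: `score` is a hypothetical caller-supplied function where higher is better, and the loop stops once no remaining attribute improves the score. The attribute names and weights are made up for the example.

```python
def forward_selection(attributes, score, max_size=None):
    """Greedy stepwise forward selection.

    `score(subset)` is an assumed quality measure (higher is better).
    Starts from the empty set and adds the best remaining attribute at each
    step until no attribute improves the score or `max_size` is reached.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining and (max_size is None or len(selected) < max_size):
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the reduced set
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness weights, with a small penalty per selected attribute.
weights = {"age": 0.5, "income": 0.3, "zip_code": 0.01, "id": 0.0}


def toy_score(subset):
    return sum(weights[a] for a in subset) - 0.05 * len(subset)


chosen = forward_selection(list(weights), toy_score)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal gives the best score.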
5. Purpose of normalization: to scale the data of an attribute so that it falls within a smaller range.
Some methods of data normalization:
- Decimal scaling: moves the decimal point of the values of the data. Each value is divided by a
power of 10 using the formula: v_i' = v_i / 10^j
where j is the smallest integer such that max(|v_i'|) < 1.

- Min-max normalization: a linear transformation is performed on the original data. The minimum and
maximum values of attribute A are found and each value is replaced according to the formula:
v' = (v - min(A)) * (new_max(A) - new_min(A)) / (max(A) - min(A)) + new_min(A)
- Z-score normalization: values are normalized based on the mean and standard deviation of attribute A.
The formula: v' = (v - mean(A)) / std(A)
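
The three formulas can be written out as a short sketch. The function names and sample values are assumptions for illustration. The sketch also assumes a non-negative j for decimal scaling and the population standard deviation for the z-score; the notes do not specify either choice.

```python
from statistics import mean, pstdev


def decimal_scaling(values):
    """v' = v / 10^j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def min_max(values, new_min=0.0, new_max=1.0):
    """Linear rescaling of values to [new_min, new_max].

    Assumes the values are not all equal (otherwise max - min is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) * (new_max - new_min) / (hi - lo) + new_min for v in values]


def z_score(values):
    """v' = (v - mean) / standard deviation, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = decimal_scaling([-986.0, 200.0, 917.0])      # divides by 10^3
rescaled = min_max([10.0, 20.0, 30.0])                # maps onto [0, 1]
standardized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers, since a single extreme value sets the range. Z-score normalization is useful when the actual minimum and maximum are unknown.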
