5. When dealing with a missing value in a dataset where the missing data
cannot be ignored, there are several imputation methods that can be
used to fill in the gap. Here are three alternatives for imputing the
missing value in the HOURS_PER_WEEK column:
We can see that the frequency increases with age, reaching its highest value
of 10 in the 45-55 bin.
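As a rough check of the histogram counts, the sketch below tallies the 20 AGE values into equal-width bins. The values are taken from the equi-depth bins listed in this answer; the bin edges 25-35 and 35-45 are assumptions, since only the 45-55 bin is named in the text, and `equi_width_counts` is a hypothetical helper.

```python
# AGE values as they appear in the equi-depth bins of this answer.
ages = [27, 28, 29, 30, 35, 36, 37, 38, 40, 44,
        45, 47, 48, 49, 49, 49, 50, 52, 52, 54]

def equi_width_counts(values, start, width, n_bins):
    """Count values in half-open bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = (v - start) // width
        if 0 <= idx < n_bins:  # ignore anything outside the binned range
            counts[idx] += 1
    return counts

# Assumed layout: three bins of width 10 starting at 25.
counts = equi_width_counts(ages, start=25, width=10, n_bins=3)
for i, c in enumerate(counts):
    lo = 25 + 10 * i
    print(f"{lo}-{lo + 10}: {c}")
```

Under these assumed edges the counts rise from bin to bin, and the 45-55 bin holds 10 values, which agrees with the observation above.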
Now we discretize the AGE attribute into 4 equi-depth intervals:
Since there are 20 data points and we want 4 bins, each bin should ideally
contain 20/4 = 5 points. Sort the values in ascending order and fill each bin
in turn until it holds 5 data points. The highest value in one bin and the
lowest value in the next bin mark the boundary between them. Here are the bins:
Bin 1: [27, 28, 29, 30, 35] (ages 27 to 35, inclusive)
Bin 2: [36, 37, 38, 40, 44] (ages 36 to 44, inclusive)
Bin 3: [45, 47, 48, 49, 49] (ages 45 to 49, inclusive)
Bin 4: [49, 50, 52, 52, 54] (ages 49 to 54, inclusive)
Notice that the age 49 appears in both Bin 3 and Bin 4: it occurs three times
in the data set, and the cut between the 15th and 16th sorted values falls
inside that run. In such cases, we need to choose where to place the boundary.
In practice, one might divide the repeated values between the two bins, or keep
them all in one bin, depending on the distribution and the desired analysis.
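The equi-depth procedure can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `equi_depth_bins` is a hypothetical helper, and it simply cuts the sorted list by position, so a repeated value such as 49 can end up on both sides of a boundary.

```python
# AGE values as they appear in the bins listed above.
ages = [27, 28, 29, 30, 35, 36, 37, 38, 40, 44,
        45, 47, 48, 49, 49, 49, 50, 52, 52, 54]

def equi_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `remainder` bins take one extra value when the
        # number of points is not divisible by n_bins.
        end = start + size + (1 if i < remainder else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

bins = equi_depth_bins(ages, 4)
for i, b in enumerate(bins, start=1):
    print(f"Bin {i}: {b} (ages {b[0]} to {b[-1]}, inclusive)")

# A tie at a boundary shows up when one bin ends on the value
# that the next bin starts with.
tied = [i + 1 for i in range(len(bins) - 1) if bins[i][-1] == bins[i + 1][0]]
print("Boundaries with a repeated value after bin(s):", tied)
```

Cutting purely by position keeps every bin at exactly 5 points. Moving all copies of 49 into a single bin would give cleaner boundaries at the cost of unequal bin sizes.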
For the attribute AGE, we will create intervals based on the values of k as
follows, using mean = 42 and sd = 8:
For k = 0: [mean - sd, mean) = [42 - 8, 42) = [34, 42).
For k = -1: [mean, mean + sd) = [42, 42 + 8) = [42, 50).
For k = 1: [mean - 2*sd, mean - sd) = [42 - 16, 42 - 8) = [26, 34).
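The three cases listed here all fit the pattern [mean - (k + 1)*sd, mean - k*sd). That general formula is inferred from the listed cases and is an assumption, as is the hypothetical helper `sd_interval`; the sketch below uses the rounded values mean = 42 and sd = 8 from the text.

```python
def sd_interval(k, mean=42, sd=8):
    """Return (lower, upper) of the half-open interval
    [mean - (k + 1)*sd, mean - k*sd).

    The general formula is an assumption inferred from the three
    cases worked out in the text (k = 0, -1, 1).
    """
    return (mean - (k + 1) * sd, mean - k * sd)

for k in (0, -1, 1):
    lower, upper = sd_interval(k)
    print(f"k = {k}: [{lower}, {upper})")
```

With these inputs the function reproduces the intervals [34, 42), [42, 50), and [26, 34) given for k = 0, -1, and 1.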