
CS 4780

HW 1
Arjun Sarathy (as2784), DB Lee (dl654), Kevin Gao (kg349), Zack Shen (zs225)

Problem 1: Train/Test splits


1. Our data D is the set of audio-label pairs { (x, y) }. Here x is a 2-dimensional array whose entry x_ij is the relative amplitude of the j-th overtone at the i-th time division of the phoneme audio; this array is obtained through a Fourier Transform. y is an integer in the range [1, 44] identifying the phoneme spoken in the audio.

Our loss function is the 0/1 loss, which counts the number of mislabeled examples.

We make the following assumptions. First, each speaker pronounces phonemes the same way regardless of the day, in part because the time interval between recordings is short (24 hours). Second, the phonemes that a speaker records are uniformly distributed. Third, the phoneme recordings of the speakers are mutually independent. Under these assumptions, the data is i.i.d.

Divide D into 44 sets according to the phoneme labels. Because the data is i.i.d., split each phoneme set uniformly at random in the ratio 81:9:10 (10% of the data is for testing, and 10% of the remaining 90% is for validation), and assign the splits to the training, validation, and test sets, respectively. This way, the three sets have roughly the same phoneme distribution and contain examples of all 44 phonemes (see the sketch at the end of this answer).

It may be necessary to add stratification based on race, gender, or ethnicity in order to account for differences in accent and timbre.
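
A minimal Python sketch of the 81:9:10 split described above, assuming D is available as a list of (x, y) tuples; the function name and seed are our own choices, not part of the assignment.

```python
import random
from collections import defaultdict

def stratified_split(data, test_frac=0.10, val_frac=0.09, seed=0):
    """Split (x, y) tuples 81:9:10 within each of the 44 phoneme classes."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in data:
        by_label[y].append((x, y))

    train, val, test = [], [], []
    for examples in by_label.values():
        rng.shuffle(examples)
        n = len(examples)
        n_test = round(test_frac * n)
        n_val = round(val_frac * n)
        test += examples[:n_test]
        val += examples[n_test:n_test + n_val]
        train += examples[n_test + n_val:]
    return train, val, test
```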

2. We will employ the same selection method as above: for each of Kilian’s phonemes, split his recordings uniformly at random in the same ratio and add them to the corresponding training, validation, and test sets. If we want the system to work particularly well on him, we might modify the loss function so that it assigns more weight to mislabeling Kilian’s phonemes (a sketch of such a loss follows).
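
A minimal sketch of such a re-weighted 0/1 loss; the weight of 5 and the is_kilian flag are illustrative assumptions, not values given in the problem.

```python
def weighted_zero_one_loss(y_true, y_pred, is_kilian, w_kilian=5.0):
    """0/1 loss where a mistake on one of Kilian's recordings counts w_kilian times."""
    loss = 0.0
    for yt, yp, k in zip(y_true, y_pred, is_kilian):
        if yt != yp:
            loss += w_kilian if k else 1.0
    return loss
```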

Problem 2: K-NN
1. We first found the perpendicular bisector between each pair of positive and negative points and then took the union of the resulting boundaries. The picture below shows the 1-NN decision boundary for the window [0,5]^2. If a test point is in the shaded area, its nearest neighbor is negative, and vice versa (a grid-based check of the boundary is sketched below).
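
One way to double-check such a boundary is to label a dense grid of points by their nearest neighbor. A minimal sketch, assuming the positive and negative training points are given as lists of (x, y) coordinate pairs; the grid step is an arbitrary choice.

```python
import numpy as np

def one_nn_grid(pos, neg, lo=0.0, hi=5.0, step=0.05):
    """Return grid points in [lo, hi]^2 and a mask that is True where
    the nearest training point is negative (the shaded region)."""
    pts = np.array(pos + neg, dtype=float)                    # (m, 2) training points
    labels = np.array([1] * len(pos) + [-1] * len(neg))
    xs = np.arange(lo, hi + step, step)
    grid = np.array([(a, b) for a in xs for b in xs])
    d2 = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # squared distances
    return grid, labels[d2.argmin(axis=1)] == -1
```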

2. Initially, (0,100) is the nearest neighbor of (4,150), so the test point is labeled positive.

After scaling with respect to the [0,4] × [0,200] window, (2,75) becomes (0.5, 0.375) and is the nearest neighbor of the scaled test point (1, 0.75), so the test point is labeled negative.
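
A quick numerical check of this, assuming (0,100) (positive) and (2,75) (negative) are the only training points relevant to the test point; any other points in the figure are ignored here.

```python
import numpy as np

train = np.array([[0.0, 100.0], [2.0, 75.0]])
labels = ["positive", "negative"]
test = np.array([4.0, 150.0])

def nearest_label(X, labels, x):
    return labels[np.linalg.norm(X - x, axis=1).argmin()]

print(nearest_label(train, labels, test))                   # positive (unscaled)

scale = np.array([4.0, 200.0])                              # the [0,4] x [0,200] window
print(nearest_label(train / scale, labels, test / scale))   # negative (scaled)
```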

3. We assume the true regression function is smooth. The two nearest neighbors of (2,2) in our data set are (2,1) and (2,3), both at Euclidean distance 1. Because we assume smoothness and a real-valued label, we estimate the test point’s label by a distance-weighted average of the neighbors’ labels. The estimate is then (2·1 + 3·1)/(1 + 1) = 2.5.
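
A minimal sketch of this distance-weighted 2-NN estimate; the neighbor labels 2 and 3 are read off from the arithmetic above, and inverse-distance weighting is our interpretation of “weights corresponding to distances.”

```python
import numpy as np

def knn_regress(X, y, x_test, k=2):
    """Inverse-distance-weighted k-NN regression."""
    d = np.linalg.norm(X - x_test, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[idx], 1e-12)   # guard against zero distance
    return float(np.dot(w, y[idx]) / w.sum())

X = np.array([[2.0, 1.0], [2.0, 3.0]])    # the two nearest neighbors of (2, 2)
y = np.array([2.0, 3.0])                  # their (assumed) labels
print(knn_regress(X, y, np.array([2.0, 2.0])))   # 2.5
```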

4. If a large majority of the data points (say, two-thirds) are missing the same feature, we remove that feature from consideration. If a data point is missing many of its features (say, a third), we remove that data point from the data set. Otherwise, if a vector x is missing a feature f_i, we scale f_i to [0,1] across all data points and estimate the missing value in x as 0.5, the midpoint of the scaled range (see the sketch below).
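
A minimal sketch of this policy, assuming the data sits in an (n, d) NumPy array with np.nan marking missing entries; the two-thirds and one-third thresholds are the rough cutoffs suggested above.

```python
import numpy as np

def handle_missing(X, feat_thresh=2/3, point_thresh=1/3):
    """Drop mostly-missing features, then mostly-missing points,
    then min-max scale each remaining feature to [0, 1] and
    fill the remaining gaps with 0.5."""
    X = X[:, np.isnan(X).mean(axis=0) <= feat_thresh]   # drop bad features
    X = X[np.isnan(X).mean(axis=1) <= point_thresh]     # drop bad points
    lo, hi = np.nanmin(X, axis=0), np.nanmax(X, axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.where(np.isnan(X), 0.5, X)
```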
5. We define “training” as the process of constructing the bare minimum setup needed to
make predictions, and “applying” as the process of making predictions. Let n = number
of data points and d = dimension of each point.

“Training”, as defined, only requires storing the data points in their feature space. Given that we do not optimize over k, we can apply the k-NN classifier as soon as there are data points against which to compute distances from the test point. The “training” operation takes O(n) time (or O(nd), if we count copying each point’s d coordinates).

“Applying” a k-NN classifier means (1) computing dist(x_test, x_data) for each of the n data points x_data, (2) sorting the distances to find the k smallest, and (3) taking the mode or average of the corresponding k labels. We assume that one distance evaluation takes O(d) time. The three steps then take O(nd), O(n log n), and O(k) time, respectively, so the overall time is O(nd + n log n), which is O(nd) when d dominates log n.

This implies that “applying” a k-NN classifier takes more time than “training” it.
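
A minimal sketch of the “apply” step with the three operations marked; the function and variable names are our own.

```python
import numpy as np
from collections import Counter

def knn_apply(X_train, y_train, x_test, k=3, classify=True):
    d = np.linalg.norm(X_train - x_test, axis=1)   # (1) n distances, O(nd)
    idx = np.argsort(d)[:k]                        # (2) sort, O(n log n)
    top = y_train[idx]                             # (3) aggregate k labels, O(k)
    if classify:
        return Counter(top.tolist()).most_common(1)[0][0]
    return float(np.mean(top))
```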

6. To quote Prof. Weinberger himself (from the CS 4780 SP17 lecture #8), we can reduce the dimensionality of the data because faces have a much lower intrinsic dimensionality than the number of pixels. Images of faces are not uniformly distributed across pixel space. If the data lies on a subspace or a manifold within our data space, it is possible to come up with a coordinate system of lower dimensionality without losing much information. We can then run the k-NN procedure on these coordinate vectors instead of the raw pixel data.
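
One standard way to build such a lower-dimensional coordinate system is PCA; this is our illustrative choice, not necessarily the one from the lecture, and the number of components (50) is picked arbitrarily. A minimal sketch:

```python
import numpy as np

def pca_fit(X, n_components=50):
    """X: (n, d) array of flattened face images. Returns the mean and a
    (d, n_components) projection matrix onto the top principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

# embed training faces and a test face the same way, then run k-NN on z:
# mean, W = pca_fit(X_train)
# Z_train = (X_train - mean) @ W
# z_test = (x_test - mean) @ W
```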
