Exam Correction: Review Given Answers (Ground Truth Is in Bold, Student Answer Is Underlined)

Question No. 1
State if the following statements about errors are true or false:
Errors related to bias can be reduced by using large sample sizes. (No Answer / TRUE / FALSE)
The impact of random errors can be reduced by using large sample sizes. (No Answer / TRUE / FALSE)
Systematic errors do not happen by pure randomness. (No Answer / TRUE / FALSE)
The variance of a model measures how the expected output of a machine learning model differs from the true value of the function approximated by the model. (No Answer / TRUE / FALSE)
The objective of supervised learning is to find a model that reduces to zero the error between the model and the function approximated by the model. If a model A has a lower error on the training examples with respect to another model B, A must be preferred. (No Answer / TRUE / FALSE)
The expected error between the label $y$ of a data sample $x$ and the output of a supervised model $\hat{f}(x)$ depends on bias or on variance, according to the weights of the model. (No Answer / TRUE / FALSE)
Question No. 2
State if the following statements about nearest neighbours algorithms are true or false:
In the K-nearest-neighbours algorithm for classification, the output of a new input vector cannot be obtained as the majority of the outputs of the K-nearest vectors in memory. (No Answer / TRUE / FALSE)
In the K-nearest-neighbours algorithm, a long training phase is executed with the examples so that the generalization error will decrease. (No Answer / TRUE / FALSE)
Consider how the output of a vector is predicted in weighted k-nearest-neighbours:
$$y = \frac{\sum_{j=1}^{k} \frac{y_j}{d(x_j, x) + d_0}}{\sum_{j=1}^{k} \frac{1}{d(x_j, x) + d_0}}.$$
When $d_0$ tends to infinity, the contribution of the examples $x_j$ closest to $x$ increases. (No Answer / TRUE / FALSE)
When applying the weighted k-nearest-neighbours algorithm for regression, the output of a new input vector is predicted depending on the output of the nearest k vectors in memory, with a weighted average giving more weight to the closest examples. (No Answer / TRUE / FALSE)
Nearest neighbours algorithms can only be applied to classification problems. (No Answer / TRUE / FALSE)
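The weighted k-nearest-neighbours formula in this question can be tried numerically. The following is a minimal sketch, assuming Euclidean distance for $d$ and a small made-up one-dimensional dataset; the function name `weighted_knn_predict` and all data values are hypothetical, not part of the exam.

```python
import math

def weighted_knn_predict(points, targets, query, k, d0):
    """Weighted k-NN regression: average of the k nearest targets,
    each weighted by 1 / (distance + d0). Requires d0 > 0."""
    # Keep the k (distance, target) pairs closest to the query.
    nearest = sorted(
        (math.dist(p, query), y) for p, y in zip(points, targets)
    )[:k]
    weights = [1.0 / (d + d0) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical data: three stored examples and one query point.
points = [(0.0,), (1.0,), (4.0,)]
targets = [0.0, 10.0, 40.0]
query = (0.5,)

near = weighted_knn_predict(points, targets, query, k=3, d0=0.01)
far = weighted_knn_predict(points, targets, query, k=3, d0=1e9)
print(near, far)
```

With a very large `d0` all weights become nearly equal, so `far` approaches the plain mean of the three targets, while a small `d0` lets the closest examples dominate.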
Question No. 3
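Given the dataset in the table, state if the following statements are true or false:
Consider dimension X1 and 3.5 (median) as possible split value. The Gini Impurity of $(Y \mid X_1 < 3.5)$ is $\frac{8}{25}$. (No Answer / TRUE / FALSE)
The entropy of the output is 1. (No Answer / TRUE / FALSE)
The Gini Impurity of the output is $\frac{14}{25}$. (No Answer / TRUE / FALSE)
Consider dimension X2 and 3.5 (median) as possible split value. The Gini Impurity of $(Y \mid X_2 < 3.5)$ is $\frac{3}{5}$. (No Answer / TRUE / FALSE)
Consider dimension X2 and 3.5 (median) as possible split value. The Gini Impurity of $(Y \mid X_2 \ge 3.5)$ is $\frac{3}{5}$. (No Answer / TRUE / FALSE)
Consider dimension X1 and 3.5 (median) as possible split value. The Gini Impurity of $(Y \mid X_1 \ge 3.5)$ is $\frac{12}{25}$. (No Answer / TRUE / FALSE)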
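Gini impurity and entropy of a list of labels can be computed directly from class counts. The sketch below is only an illustration: the exam's table is not available here, so the label lists are hypothetical and the helper names `gini_impurity` and `entropy` are assumptions.

```python
from collections import Counter
from fractions import Fraction
import math

def gini_impurity(labels):
    """1 minus the sum of squared class proportions, as an exact fraction."""
    n = len(labels)
    return 1 - sum(Fraction(c, n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical label lists (not the exam's dataset).
print(gini_impurity(["a", "b", "b", "b", "b"]))  # one class out of five differs
print(gini_impurity([0, 0, 1, 1, 1]))
print(entropy(["a", "a", "b", "b"]))
```

Using `Fraction` keeps the impurity exact, which makes it easy to compare against values stated as fractions.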
Question No. 4
State if the following statements about classification metrics are true or false:
Precision is the proportion of true positives over all the predictions predicted as positive by a classifier. (No Answer / TRUE / FALSE)
Accuracy is defined as the proportion of true positives and true negatives over all the predictions made by a classifier. (No Answer / TRUE / FALSE)
Given a dataset of mushrooms, where edible mushrooms belong to the positive class and deadly mushrooms belong to the negative class, maximizing precision is more important than maximizing recall. (No Answer / TRUE / FALSE)
Recall is the proportion of true negatives over all the elements that actually belong to the positive class. (No Answer / TRUE / FALSE)
Given a dataset of n tuples in m dimensions, the confusion matrix is m × m. (No Answer / TRUE / FALSE)
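The standard definitions of precision, recall and accuracy can be checked on a toy prediction vector. This is a minimal sketch using the usual textbook definitions; the function name `classification_metrics` and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard against empty denominators when no positives are predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, accuracy = classification_metrics(y_true, y_pred)
print(precision, recall, accuracy)
```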
Question No. 5
State if the following statements about linear functions and convexity are true or false:
The union of two convex sets is always convex. (No Answer / TRUE / FALSE)
The function $f(x) = 1 + x_1 + x_2$ is linear. (No Answer / TRUE / FALSE)
A function $f(x)$, where $x$ is a vector, is linear if for every scalar $\alpha$ and $\beta$ and every couple of vectors $x$ and $y$: $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$. (No Answer / TRUE / FALSE)
A function $f(x)$, where $x$ is a vector, is linear if $f(x + y) = f(x) + f(y)$. (No Answer / TRUE / FALSE)
The roots of the equation $x_1 + 3x_2 = 0$ form a convex set. (No Answer / TRUE / FALSE)
The intersection of two convex sets is always convex. (No Answer / TRUE / FALSE)
The points satisfying the inequality $6x_1 + 9x_2 > 0$ do not form a convex set. (No Answer / TRUE / FALSE)
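The condition $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$ can be probed numerically on sample vectors. The sketch below is only a spot check under assumed sample values: a single failing sample disproves linearity, while passing samples do not prove it. The helper name `is_linear_on` and the second function are hypothetical.

```python
def is_linear_on(f, x, y, alpha, beta, tol=1e-9):
    """Check f(alpha*x + beta*y) == alpha*f(x) + beta*f(y) for one sample."""
    combo = tuple(alpha * xi + beta * yi for xi, yi in zip(x, y))
    return abs(f(combo) - (alpha * f(x) + beta * f(y))) <= tol

def with_constant(v):
    # The function from the question: 1 + x1 + x2.
    return 1 + v[0] + v[1]

def without_constant(v):
    # A hypothetical function with no constant term: x1 + 3*x2.
    return v[0] + 3 * v[1]

x, y = (1.0, 2.0), (3.0, 4.0)
print(is_linear_on(with_constant, x, y, alpha=2.0, beta=3.0))
print(is_linear_on(without_constant, x, y, alpha=2.0, beta=3.0))
```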
Question No. 6
State if the following statements about training, validating and testing are true or false:
"Statistically significant result" means that one can estimate probabilities of obtaining specific results, and that the probability to obtain the result by chance is less than a threshold value. Therefore the conclusion "statistically significant or not" depends on the threshold value. (No Answer / TRUE / FALSE)
For every k > 0, in leave-one-out cross-validation, one of the k partitions is left out as validation data and the other partitions are used as training data. (No Answer / TRUE / FALSE)
The number of data examples used for comparing different ML models greatly affects the level of statistical significance of the results. (No Answer / TRUE / FALSE)
The measurements of two phenomena are different in a statistically significant way if one can demonstrate in a theorem that the two measurements will never be equal. (No Answer / TRUE / FALSE)
A result is statistically significant when it is obtained by democratic means, asking for the opinion of the largest possible number of experts. (No Answer / TRUE / FALSE)
In stratified cross-validation, classes are balanced in both training and validation sets. (No Answer / TRUE / FALSE)
In cross-validation, one repeats many train-and-test experiments, by splitting the original set of examples into two sets of different partitions (one for training and one for testing), and then maximizing the obtained result. (No Answer / TRUE / FALSE)
Stratified cross-validation is useful to create learning models with more hidden layers. (No Answer / TRUE / FALSE)
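The partitioning used in k-fold cross-validation can be sketched with plain index lists. This is an illustrative sketch, assuming contiguous, unshuffled folds; the generator name `k_fold_splits` is hypothetical. In this scheme, leave-one-out corresponds to the special case where k equals the number of examples.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    # Spread the remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size

# Leave-one-out on 5 examples: every validation fold holds one index.
for train, validation in k_fold_splits(5, 5):
    print(train, validation)
```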
Question No. 7
Given a dataset composed of n tuples $(x_i, y_i)$ with $1 \le i \le n$, each output $y_i$ is predicted using some regression model $\hat{f}$. State if the following metrics are suitable to assess the performance of $\hat{f}$:
$\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{f}(x_i)|$ (No Answer / TRUE / FALSE)
$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))$ (No Answer / TRUE / FALSE)
$\frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{f}(x_i))^2$ (No Answer / TRUE / FALSE)
$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$ (No Answer / TRUE / FALSE)
$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))}$ (No Answer / TRUE / FALSE)
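The behaviour of these error measures can be compared on a toy prediction vector. The sketch below uses made-up values and hypothetical function names; it shows how signed errors can cancel each other while absolute and squared errors cannot.

```python
def mean_signed_error(y, pred):
    """Average of (y_i - prediction_i); positive and negative errors cancel."""
    return sum(a - b for a, b in zip(y, pred)) / len(y)

def mean_absolute_error(y, pred):
    return sum(abs(a - b) for a, b in zip(y, pred)) / len(y)

def mean_squared_error(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

# Hypothetical targets and a constant prediction of 2.0.
y = [1.0, 2.0, 3.0]
pred = [2.0, 2.0, 2.0]
print(mean_signed_error(y, pred))
print(mean_absolute_error(y, pred))
print(mean_squared_error(y, pred))
```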
Question No. 8
Given the points in the following plot, predict the class of triangle points using 3-nearest neighbours. Consider all blue points as members of the positive class and all red points as members of the negative class. State if the following statements are true or false:
Precision is $\frac{1}{3}$. (No Answer / TRUE / FALSE)
Recall is $\frac{1}{3}$. (No Answer / TRUE / FALSE)
Accuracy is $\frac{2}{3}$. (No Answer / TRUE / FALSE)
Question No. 9
State if the following statements about Locally-Weighted Regression are true or false:
Given a dataset of n examples, the weights $w^*$ found by least squares methods do not define a model which is valid for all possible inputs. (No Answer / TRUE / FALSE)
Given a dataset of n tuples $(x_i, y_i)$ with $1 \le i \le n$, the significance $s_i$ defines the importance that each $x_i$ has on the prediction of the output of some point $x$. (No Answer / TRUE / FALSE)
Given a dataset of n tuples $(x_i, y_i)$ with $1 \le i \le n$, least squares methods are used to minimize the error $\sum_{i=1}^{n} y_i (w^T \cdot x_i - s_i)^2$, where $s_i$ is the significance of $x_i$. (No Answer / TRUE / FALSE)
Given a dataset of n tuples $(x_i, y_i)$ with $1 \le i \le n$, let $x$ be a point whose output needs to be predicted using Locally-Weighted Regression. The significance $s_i$ can be defined as $s_i = \exp\left(-\frac{\|x_i - x\|^2}{W_K}\right)$, where $W_K$ is some constant which controls the sensitivity of the model to points distant from $x$. (No Answer / TRUE / FALSE)
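Locally-weighted regression can be sketched for one-dimensional inputs with a closed-form weighted line fit. This sketch assumes the commonly used weighted objective $\sum_i s_i (w^T x_i - y_i)^2$ with a Gaussian significance, which may differ from the expressions given in the statements; the function name `lwr_predict` and the data are hypothetical.

```python
import math

def lwr_predict(xs, ys, query, w_k):
    """Fit a line a + b*x by weighted least squares around `query`
    and return its value at `query`."""
    # Significance of each example: exp(-(x_i - query)^2 / W_K).
    s = [math.exp(-((x - query) ** 2) / w_k) for x in xs]
    sw = sum(s)
    sx = sum(si * x for si, x in zip(s, xs))
    sy = sum(si * y for si, y in zip(s, ys))
    sxx = sum(si * x * x for si, x in zip(s, xs))
    sxy = sum(si * x * y for si, x, y in zip(s, xs, ys))
    # Solve the 2x2 weighted normal equations for intercept a and slope b.
    det = sw * sxx - sx * sx
    b = (sw * sxy - sx * sy) / det
    a = (sy - b * sx) / sw
    return a + b * query

# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
print(lwr_predict(xs, ys, query=2.5, w_k=1.0))
```

A new set of weights is computed for every query point, so the fitted line is only meant to be evaluated near that query.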
Question No. 10
State if the following statements about bottom-up clustering are true or false:
Consider a dataset of n tuples in 2 dimensions $(x_1, x_2)$, and let $S$ be the covariance matrix of such dataset. Transform the dataset by multiplying $x_1$ by a scalar $\alpha$ and obtain a dataset in 2 dimensions $(\alpha x_1, x_2)$, defining $S_1$ as the covariance matrix of the transformed dataset. Since the covariance matrix only measures how the coordinates of a set of points vary together, $S = S_1$. (No Answer / TRUE / FALSE)
Given a labeled dataset in m > 2 dimensions, consider the first principal component $x_1$ and the second principal component $x_2$ found by PCA. If the label is removed from the dataset and PCA is applied again, $x_1$ and $x_2$ might be different from the two principal components found before removing the label. (No Answer / TRUE / FALSE)
A linear transformation L is a linear projection of vectors from a space of dimension m to a space of dimension p, where p must be lower than m. (No Answer / TRUE / FALSE)
The main diagonal of the covariance matrix corresponds to the variance along each input dimension. (No Answer / TRUE / FALSE)
Given a dataset of centered data, the orthogonal projection found by PCA from a space of dimension m to a space of dimension p corresponds to the p eigenvectors with largest eigenvalues of the covariance matrix. (No Answer / TRUE / FALSE)
Orthogonal projections from a space of dimension m to a space of dimension p are a particular type of linear transformation, which guarantees that projected data maximizes the sum of all squared pairwise distances in the m-dimensional space. (No Answer / TRUE / FALSE)
The vectors x = (1, 1) and y = (2, 2) are collinear. (No Answer / TRUE / FALSE)
The vectors x = (1, 1) and y = (1, 0) are orthogonal. (No Answer / TRUE / FALSE)
Given a labeled dataset in m > 2 dimensions, consider the first principal component $x_1$ and the second principal component $x_2$ found by PCA. Let us define the mutual information of $x_1$ and $x_2$ with respect to the label $y$ as $MI(x_1, y)$ and $MI(x_2, y)$. Since $x_1$ explains the larger amount of variance among input dimensions, $MI(x_1) > MI(x_2)$. (No Answer / TRUE / FALSE)
Data should always be normalized or standardized before applying PCA, in order to remove the effect of outliers. (No Answer / TRUE / FALSE)
Consider the points of the two classes in the following plot, where each class is presented with a different color. If using first PCA to project the points from 2 dimensions to 1 dimension, it is possible to correctly classify all points. (No Answer / TRUE / FALSE)
Consider a dataset of n vectors in m dimensions. The covariance matrix is an n × m matrix. (No Answer / TRUE / FALSE)
The covariance matrix measures how the coordinates of a set of points vary together, so it can only be used to detect variations along one of the dimensions of data. (No Answer / TRUE / FALSE)
In the following plot, the covariance between the input dimensions is positive. Also, if using PCA to project the cloud of points from 2 dimensions to 1 dimension and then to 2 dimensions again, there is no information loss. (No Answer / TRUE / FALSE)
PCA is sensitive to outliers, because points which are far away from most of the other points contribute greatly to the sum of squared distances in the projected space. (No Answer / TRUE / FALSE)
In PCA, input variables are transformed into a possibly lower number of uncorrelated input variables called principal components. In ascending order, principal components take into account as much of the remaining variability of data as possible. (No Answer / TRUE / FALSE)
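For two-dimensional data, the covariance matrix and its leading eigenvector can be computed in closed form. The sketch below is an illustration under assumptions: it uses the population covariance (division by n), hypothetical function names, and a made-up dataset lying on the line $x_2 = x_1$.

```python
import math

def covariance_2d(points):
    """Return (var_x1, cov_x1_x2, var_x2) using division by n."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / n
    s22 = sum((p[1] - m2) ** 2 for p in points) / n
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / n
    return s11, s12, s22

def first_principal_component(points):
    """Largest eigenvalue and unit eigenvector of the 2x2 covariance matrix."""
    a, b, c = covariance_2d(points)
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b != 0:
        v = (b, lam - a)          # solves (a - lam) * v1 + b * v2 = 0
    elif a >= c:
        v = (1.0, 0.0)            # already axis-aligned
    else:
        v = (0.0, 1.0)
    norm = math.hypot(*v)
    return lam, (v[0] / norm, v[1] / norm)

points = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]   # hypothetical data
scaled = [(3.0 * p[0], p[1]) for p in points]      # x1 multiplied by alpha = 3
print(covariance_2d(points), covariance_2d(scaled))
print(first_principal_component(points))
```

Comparing the two printed covariance triples shows how rescaling a single input dimension changes the entries that involve it.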
Question No. 11
State if the following statements about top-down clustering are true or false:
The Euclidean distance is an appropriate similarity metric only when the range of input dimensions is similar and the same unit of measure is used. (No Answer / TRUE / FALSE)
In k-means soft clustering, the batch update should be preferred with respect to the online update when the number of entities is very large. (No Answer / TRUE / FALSE)
The cosine similarity is not influenced by the units of measure used to express the input dimensions of data, because it depends only on the direction of the vectors. (No Answer / TRUE / FALSE)
If an internal representation of the entities to be clustered is available, dissimilarities between entities can be computed and each cluster is summarized by a prototype. (No Answer / TRUE / FALSE)
If only an external representation of the entities to be clustered is available, there is no information about the dissimilarities between entities. (No Answer / TRUE / FALSE)
The Mahalanobis distance is an appropriate similarity metric only when the range of input dimensions is similar and the same unit of measure is used. (No Answer / TRUE / FALSE)
Let $x$ and $y$ be two vectors of length n. The normalization of the distance between $x$ and $y$ is defined as $\sqrt{\sum_{i=1}^{n} \left(\frac{x_i - y_i}{minval_i - maxval_i}\right)^2}$, where $minval_i$ and $maxval_i$ are the minimum and maximum values in input dimension $i$. (No Answer / TRUE / FALSE)
In k-means soft clustering, each entity is assigned probabilistically to all clusters. (No Answer / TRUE / FALSE)
K-means hard clustering is an iterative algorithm where, at each step, each entity is assigned to its nearest prototype. Then, the coordinates of each prototype are updated by averaging the coordinates of the entities assigned to it, and the process is repeated until some stopping criterion is met. (No Answer / TRUE / FALSE)
In bottom-up clustering, at each step the most similar sets of points are merged together until a stopping criterion is met and the merging stops. (No Answer / TRUE / FALSE)
In top-down clustering, at each step the most similar sets of points are merged together until a stopping criterion is met and the merging stops. (No Answer / TRUE / FALSE)
Clustering represents groups of similar points using prototypes, which are single points that summarize the information of observed data. (No Answer / TRUE / FALSE)
The shape of Voronoi cells is independent with respect to the type of distance similarity used by a clustering algorithm. (No Answer / TRUE / FALSE)
Clustering is a multi-objective optimization problem. (No Answer / TRUE / FALSE)
Let $x$ and $y$ be two vectors of length n. $\|x - y\|$ defines the Manhattan distance between $x$ and $y$. (No Answer / TRUE / FALSE)
Clustering is a supervised learning technique. (No Answer / TRUE / FALSE)
In k-means soft clustering with online update, the update of prototypes is done iteratively according to the relationship between single entities and prototypes. Let us define the update according to an entity $x$, for the prototype $p_c$ of a cluster $c$ as
$$\Delta p_c = \eta \cdot \mathrm{membership}(x, c) \cdot (x - p_c)$$
where $\eta$ is a constant and $\mathrm{membership}(x, c)$ defines as a probability the membership of $x$ to $c$. According to this equation, $p_c$ is pulled by $x$ in the direction of $(x - p_c)$ according to $\eta$ and $\mathrm{membership}(x, c)$. (No Answer / TRUE / FALSE)
The quantization error measures the error obtained by substituting the entities of each cluster with the respective prototype. Consequently, a lower quantization error implies a better clustering. (No Answer / TRUE / FALSE)
In k-means hard-clustering, the objective is to partition a set of entities into k disjoint subsets. Such objective is reached by minimizing the dissimilarities between the points and the prototype of each cluster, and maximizing the distances between the borders of different clusters. (No Answer / TRUE / FALSE)
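The assign-then-average loop of k-means hard clustering described in this question can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance, caller-supplied initial prototypes, and "prototypes stopped changing" as the stopping criterion; the function name `k_means` and the data are hypothetical.

```python
import math

def k_means(points, prototypes, max_iters=100):
    """Hard k-means: assign each point to its nearest prototype,
    then move each prototype to the mean of its assigned points."""
    prototypes = [tuple(p) for p in prototypes]
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in prototypes]
        for pt in points:
            nearest = min(
                range(len(prototypes)),
                key=lambda c: math.dist(pt, prototypes[c]),
            )
            clusters[nearest].append(pt)
        # Empty clusters keep their previous prototype.
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else proto
            for members, proto in zip(clusters, prototypes)
        ]
        if updated == prototypes:   # stopping criterion: no movement
            break
        prototypes = updated
    return prototypes, clusters

# Hypothetical data: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
prototypes, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(prototypes)
```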
Question No. 12
Consider a dataset with n points in d dimensions. As a result, in least squares methods, we have to find the solution of a set of n equations in d variables. (No Answer / TRUE / FALSE)
If the number of different points is larger than the number of dimensions, the solution of the linear system can be found by computing the inverse of the matrix. (No Answer / TRUE / FALSE)
In gradient descent, if the step is less than 0.000000001 the function value will always decrease. (No Answer / TRUE / FALSE)
If the number of points is huge, using gradient descent is better than using the pseudo-inverse. (No Answer / TRUE / FALSE)
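Gradient descent and the closed-form least squares solution can be compared on a one-weight model $y \approx w x$. The sketch below is an assumption-laden toy: the function names, the data and the step sizes are hypothetical, chosen only to show that the step size determines whether the iterates approach the closed-form solution.

```python
def fit_slope_gradient_descent(xs, ys, step, iters=1000):
    """Minimize the mean squared error of y = w * x by gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(iters):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w.
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= step * grad
    return w

def fit_slope_closed_form(xs, ys):
    """Least squares solution for a single weight without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0]      # hypothetical inputs
ys = [3.0, 6.0, 9.0]      # generated by y = 3x

print(fit_slope_closed_form(xs, ys))
print(fit_slope_gradient_descent(xs, ys, step=0.05))
print(fit_slope_gradient_descent(xs, ys, step=0.3, iters=50))
```

On this data the small step converges towards the closed-form value, while the larger step overshoots more on every iteration.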
Question No. 13
State if the following statements about the bias-variance dilemma are true or false:
Models with too few parameters tend to produce a large variation of results if runs of training and testing are repeated (with some randomization). (No Answer / TRUE / FALSE)
Models with too many parameters tend to fail because of a large bias, since they define models which might easily overfit data. (No Answer / TRUE / FALSE)
Consider the solution to the regression problem proposed in the following plot. In this case, the model has large bias. (No Answer / TRUE / FALSE)
Having few parameters contributes to the creation of flexible models because they can be easily interpreted by a human person. (No Answer / TRUE / FALSE)
Identifying the best model requires a compromise between bias and variance. (No Answer / TRUE / FALSE)
Question No. 14
State if the following statements about supervised learning are true or false:
A suitable error measure on the training examples for regression models is the sum of squared errors between the known output $y_i$ and the output $\hat{f}(x_i)$ obtained by the models: $\sum_i (y_i - \hat{f}(x_i))^2$. (No Answer / TRUE / FALSE)
The internal parameters (weights) of models created using supervised learning techniques define the flexibility of the models. If a model overfits, a possible explanation is that the number of parameters of the model is too large given the observed data and so the model cannot properly generalize on unseen data. (No Answer / TRUE / FALSE)
Available data is composed of vectors of features $\vec{x}$ associated to an output $y$, and such data is used to build a function which models the relationship between $\vec{x}$ and $y$. (No Answer / TRUE / FALSE)
PCA is a supervised learning technique. (No Answer / TRUE / FALSE)
Models for solving classification and regression problems are built by optimizing an objective function, which is assumed to be sufficiently smooth so that the generalization of a model built on training examples is possible. (No Answer / TRUE / FALSE)
In linear regression, if the number of parameters of the model is too large, learning the examples with zero errors becomes trivial, but the model will have difficulties when generalizing results on unseen data. (No Answer / TRUE / FALSE)
Question No. 15
State if the following statements about neurons and maximum likelihood are true or false:
If $w$ are weights of logistic regression, $\Pr()$ is the output and $y_i$ the correct classification (0 or 1), the probability of obtaining the given output on the examples assuming independence is:
$$\mathrm{Likelihood}(w) = \sum_{i=1}^{\ell} \Pr(y_i \mid x_i, w)^{y_i} + (1 - \Pr(y_i \mid x_i, w))^{(1 - y_i)}.$$
(No Answer / TRUE / FALSE)
Consider a single perceptron defined by the model $\hat{f}(w, x) = w^T \cdot x$, where $w$ is a vector of weights and $x$ is a vector of input data. Such model can be used to linearly separate the two classes in the following plot: (No Answer / TRUE / FALSE)
If $w$ are weights of logistic regression, $\Pr()$ is the output and $y_i$ the correct classification (0 or 1), the probability of obtaining the given output on the examples assuming independence is:
$$\mathrm{Likelihood}(w) = \prod_{i=1}^{\ell} \Pr(y_i \mid x_i, w)^{y_i} (1 - \Pr(y_i \mid x_i, w))^{(1 - y_i)}.$$
(No Answer / TRUE / FALSE)
An autoencoder is a multi-layer perceptron which aims at reproducing the input of the perceptron, using an intermediate encoding with fewer variables than the input. (No Answer / TRUE / FALSE)
The following is a plausible squashing function to be applied to the result $z$ of a scalar product to ensure a biologically-plausible output: $\mathrm{squash}(z) = \frac{1}{1 + e^{-z}}$. (No Answer / TRUE / FALSE)
Given a dataset, after a training phase an autoencoder learns a compressed representation of input data (without considering the labels). Once trained, the hidden layer of the autoencoder can be connected to an additional layer of perceptrons, in order to classify data according to the representation previously learned by the autoencoder. (No Answer / TRUE / FALSE)
The following function of two input variables $x_1$ and $x_2$: $f(x_1, x_2) = \mathrm{BinaryXOR}(x_1, x_2)$ can be realized using a multi-layer perceptron with a single layer. (No Answer / TRUE / FALSE)
Consider a multilayer perceptron with bias units and 2 layers, where the first layer contains 3 perceptrons and the second layer contains 1 perceptron. With such a structure, the multilayer perceptron has 2 bias units. (No Answer / TRUE / FALSE)
Consider a single perceptron defined by the model $\hat{f}(w, x) = w^T \cdot x$, where $w$ is a vector of weights and $x$ is a vector of input data. Such model can be used to linearly separate the two classes in the following plot: (No Answer / TRUE / FALSE)
Maximum likelihood estimation of parameters means that one maximizes the probability that the parameters of a model assume certain values, according to the evidence obtained by observed data. (No Answer / TRUE / FALSE)
In autoencoders, the perceptrons in the middle layer of the neural network extract, from the original dataset, the input dimensions which are used to build a model that minimizes the difference between input data and rebuilt data. (No Answer / TRUE / FALSE)
The following functions of two input variables $x_1$ and $x_2$: $f(x_1, x_2) = \mathrm{BinaryOR}(x_1, x_2)$ and $f(x_1, x_2) = \mathrm{BinaryAND}(x_1, x_2)$ can be respectively realized with an appropriate choice of the vector $w$ by the model $m(w, x) = w^T x$. (No Answer / TRUE / FALSE)
The weights of a perceptron can be updated iteratively using gradient descent. Assuming a fixed learning rate $\epsilon$ in both cases, consider the following scenarios. The model on the right is going to linearly separate the two classes before the model on the left. (No Answer / TRUE / FALSE)
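The squashing function and the product form of the likelihood can be evaluated on a toy dataset. In this sketch `p` denotes the model's probability of class 1 for one example, i.e. the squashed scalar product, which is one common reading of the notation in this question; the function names and data are hypothetical.

```python
import math

def squash(z):
    """Logistic squashing function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(w, xs, ys):
    """Product over examples of p^y * (1 - p)^(1 - y), assuming independence."""
    result = 1.0
    for x, y in zip(xs, ys):
        p = squash(sum(wi * xi for wi, xi in zip(w, x)))
        result *= p ** y * (1.0 - p) ** (1 - y)
    return result

# Hypothetical one-dimensional examples: negative inputs are class 0.
xs = [(-2.0,), (2.0,)]
ys = [0, 1]
print(likelihood((0.0,), xs, ys))   # uninformative weights
print(likelihood((1.0,), xs, ys))   # weights that separate the two examples
```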
Question No. 16
State if the following statements about goodness functions are true or false.
Goodness functions take measurements or decision variables as input and quantify an objective which needs to be optimized. (No Answer / TRUE / FALSE)
Inputs can only be numerical; qualitative measurements need to be transformed into numbers. (No Answer / TRUE / FALSE)
Machine learning techniques can build effective models only if abundant data is available for training and for testing. (No Answer / TRUE / FALSE)
Standard mathematical optimization requires the existence of goodness functions to be optimized, and the definition of these functions usually requires an expensive effort in the real world. In many cases, this can be the most relevant effort in a "data science" project. (No Answer / TRUE / FALSE)
Question No. 17
Given the dataset in the table, state if the following statements are true or false:
The mutual information between $x_3$ and $y$ is 1. (No Answer / TRUE / FALSE)
The mutual information between $x_2$ and $y$ is 0.5. (No Answer / TRUE / FALSE)
In order to predict $y$, $x_3$ is more informative than $x_1$. (No Answer / TRUE / FALSE)
The mutual information between $x_1$ and $y$ is $\frac{3}{2}$. (No Answer / TRUE / FALSE)
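Mutual information between two discrete columns can be estimated from empirical frequencies. The sketch below is illustrative only: the exam's table is not available here, so the columns are made up, and the function name `mutual_information` is hypothetical.

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Empirical mutual information in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical columns (not the exam's dataset).
label = [0, 0, 1, 1]
copy_of_label = [0, 0, 1, 1]     # fully determines the label
unrelated = [0, 1, 0, 1]         # independent of the label in this sample
print(mutual_information(copy_of_label, label))
print(mutual_information(unrelated, label))
```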
Question No. 18
Consider a dataset composed of n tuples $(x_i, y_i)$ with $1 \le i \le n$, where each input vector $x$ is defined as $(x_1, x_2, x_3, x_4)$. The goal is to predict $y_i$ using a linear regression model $\hat{f}(x_i)$. Adopting least squares methods, one is looking for the weight vector $w^*$ which approximates the given tuples as closely as possible. State if the following regression models can be solved using the pseudo-inverse:
$\hat{f}(x) = w_1 x_1 + w_2 x_2^2 + w_3 x_3^3 + w_4 x_4^4$ (No Answer / TRUE / FALSE)
$\hat{f}(x) = 9$ (No Answer / TRUE / FALSE)
$\hat{f}(x) = w_1 x_1 x_2 + w_2 x_2 x_3 + w_3 x_3 x_4 + w_4 x_4^2$ (No Answer / TRUE / FALSE)
$\hat{f}(x) = \frac{1}{w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4}$ (No Answer / TRUE / FALSE)
$\hat{f}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$ (No Answer / TRUE / FALSE)
$\hat{f}(x) = 9 x_3$ (No Answer / TRUE / FALSE)
$\hat{f}(x) = x_1 w_1 + x_2 w_2^2 + x_3 w_3^3 + x_4 w_4^4$ (No Answer / TRUE / FALSE)
$\hat{f}(x) = w_1^{x_1} + w_2^{x_2} + w_3^{x_3} + w_4^{x_4}$ (No Answer / TRUE / FALSE)
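A model that is linear in its weights can be fitted by least squares even when its features are non-linear in the inputs. The sketch below illustrates this for a hypothetical two-weight model $w_1 x + w_2 x^2$ by solving the 2×2 normal equations directly, which coincides with the pseudo-inverse solution when the feature matrix has full column rank; the function name and data are assumptions.

```python
def fit_two_weights(features, ys):
    """Solve the normal equations (A^T A) w = A^T y for two features
    using Cramer's rule."""
    s11 = sum(a * a for a, _ in features)
    s12 = sum(a * b for a, b in features)
    s22 = sum(b * b for _, b in features)
    t1 = sum(a * y for (a, _), y in zip(features, ys))
    t2 = sum(b * y for (_, b), y in zip(features, ys))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det

# Hypothetical data generated by y = 2*x + 3*x^2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 3 * x ** 2 for x in xs]
features = [(x, x ** 2) for x in xs]   # non-linear in x, linear in the weights
w1, w2 = fit_two_weights(features, ys)
print(w1, w2)
```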
Question No. 19
State if the following statements about regression are true or false:
The objective of regression is to model the dependency of a real value on a set of input values. (No Answer / TRUE / FALSE)
The objective of linear regression is to always guarantee zero error on the training examples. (No Answer / TRUE / FALSE)
If the number of input variables is 33, and one starts training from 99 examples, the parameters of the linear model obtaining zero error on the examples can always be determined. (No Answer / TRUE / FALSE)
If the number of input variables is 33, and one starts training from 33 different examples, the parameters of the linear model obtaining zero error on the examples can always be determined. (No Answer / TRUE / FALSE)
Question No. 20
The minimum distance between a pair of clusters C, D is the distance between the nearest pair of points which respectively belong to C and D. (No Answer / TRUE / FALSE)
A dendrogram can be used to identify into how many clusters the entities of a dataset should be grouped, by looking at horizontal distances between merging points. (No Answer / TRUE / FALSE)
The covariance matrix can be used to estimate the shape of a spherical cluster. (No Answer / TRUE / FALSE)
A dendrogram is a plot that can be used to visualize the order in which entities of a dataset have been merged. (No Answer / TRUE / FALSE)
In agglomerative clustering, at each step the most similar clusters are merged. The similarity between clusters can be measured in different ways, so at each step the most similar clusters are the ones which minimize minimum, maximum or average distance between pairs of clusters. (No Answer / TRUE / FALSE)
The average distance between a pair of clusters C, D is defined as
$$\bar{\delta}_{avg}(C, D) = \frac{\sum_{x \in C, y \in D} \delta(x, y)}{|C| \cdot |D|},$$
where $\delta(x, y)$ is a metric used to measure the distance between entities. (No Answer / TRUE / FALSE)
The Mahalanobis distance $\delta$ between two points $x$, $y$ of the same distribution, with covariance matrix $S$, can be defined as
$$\delta(x, y) = \sqrt{(x - y)^T \cdot S^{-1} \cdot (x - y)}.$$
(No Answer / TRUE / FALSE)
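The Mahalanobis expression above can be evaluated by hand for two-dimensional points. This is a minimal sketch assuming an invertible 2×2 covariance matrix given as nested tuples; the function name `mahalanobis_2d` and the matrices are hypothetical.

```python
import math

def mahalanobis_2d(x, y, s):
    """sqrt((x - y)^T S^-1 (x - y)) for 2-D points and a 2x2 matrix S."""
    (a, b), (c, d) = s
    det = a * d - b * c                      # must be non-zero
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)

identity = ((1.0, 0.0), (0.0, 1.0))
stretched = ((4.0, 0.0), (0.0, 1.0))   # variance 4 along the first dimension
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), identity))
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), stretched))
```

With the identity matrix the result equals the Euclidean distance; with a larger variance along one dimension, differences along that dimension count for less.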
Question No. 22
State if the following statements about feature selection are true or false:
Filter methods are not able to identify mutual relationships between different inputs. (No Answer / TRUE / FALSE)
Even if the Pearson correlation coefficient between two data features is zero, it is not possible to state that such features are independent. (No Answer / TRUE / FALSE)
Filter methods consider different subsets of features of data to build a model, and they run training and testing for possible subsets of features in order to decide the best subset to be used. (No Answer / TRUE / FALSE)
If the $\chi^2$ value is computed to test the statistical independence between each input feature and the output label, the resulting $\chi^2$ values can be used to rank input features according to how probably each input feature and the output label are dependent. (No Answer / TRUE / FALSE)
Mutual information can be used to identify sets of features of input data which, when considered individually, do not provide any useful information in order to predict the output. (No Answer / TRUE / FALSE)
The linear correlation coefficient between $x_i$ and $y_i$ may change if $x$ values are normalized by dividing them by their standard deviation $\sigma > 0$ in the following manner: $x_i / \sigma$. (No Answer / TRUE / FALSE)
The entropy of a uniform probability distribution of n events is $\log_2 n$. (No Answer / TRUE / FALSE)
If the Pearson correlation coefficient between two data features is zero, the Mutual Information between such features is also zero. (No Answer / TRUE / FALSE)
The Pearson correlation coefficient measures the linear relationship between numeric data features. It is defined as the covariance of two input features divided by the product of their standard deviations; as a consequence, if the covariance is negative then also the correlation coefficient is going to be negative. (No Answer / TRUE / FALSE)
A possible normalization of the data consists of rescaling all input dimensions so that their range is [0, 1]. (No Answer / TRUE / FALSE)
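The Pearson coefficient, defined in this question as covariance divided by the product of standard deviations, can be computed directly. The sketch below uses made-up data and a hypothetical function name; it evaluates the coefficient for a feature and its square, and for a feature before and after division by its standard deviation.

```python
import math

def pearson(xs, ys):
    """Covariance of xs and ys divided by the product of their
    standard deviations (population versions, division by n)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]          # hypothetical feature
squares = [x * x for x in xs]             # fully determined by xs
other = [1.0, 2.0, 2.0, 4.0, 6.0]         # hypothetical second feature

sigma = math.sqrt(sum(x * x for x in xs) / len(xs))   # mean of xs is 0
rescaled = [x / sigma for x in xs]

print(pearson(xs, squares))
print(pearson(xs, other), pearson(rescaled, other))
```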