
Indian Institute of Technology Madras - Department of Management Studies

MS4610 - Introduction to Data Analytics


Final Exam
Date: November 24, 2021, Duration: 1 hour, Max marks: 75

Instructions:
• Enter your answers in the Excel sheet, and rename it with your roll number before submission.
• For all MCQ type questions, only the option is required.
• For subjective questions (non-MCQ type), please provide appropriate steps/reasoning.
• You may refer to the formulae in the separate formulae sheet provided.
• The answer sheets have to be uploaded in the Google form link that is provided.

1. (1 mark) There is a 68% chance of making a basket on a free throw, and each throw is independent
of every other throw. What is the expected number of throws needed to make the first basket?
a. 2.15
b. 0.68
c. 1.47
d. 0.32
2. (1 mark) Which of the following would affect the results of Linear Regression?
a. A data entry error which inflates one of the y-values
b. Constant variance of error terms
c. Independence of error terms
d. Absence of multicollinearity
3. (1 mark) A test to screen for a serious but curable disease is similar to hypothesis testing, with
a null hypothesis of no disease, and an alternative hypothesis of disease. If the null hypothesis is
rejected treatment will be given. Otherwise, it will not. Assuming the treatment has serious side
effects, in this scenario it is better to:
a. Increase the probability of making a Type 1 error, providing treatment when it is not needed
b. Increase the probability of making a Type 1 error, not providing treatment when it is needed
c. Decrease the probability of making a Type 2 error, providing treatment when it is not needed
d. Decrease the probability of making a Type 2 error, not providing treatment when it is needed
4. (2 marks) A CEO of a certain production company wishes to understand the box office success
as a function of the budget (in crores) incurred in the movie. A logistic regression model is then
trained on the data and the following results are obtained:

                  Estimate
(Intercept)       -0.2851
Budget (in cr)     0.00046

The budget (rounded to 2 decimals) for which the probability of box office success is 0.6 is:

a. 1965.52
b. 2126.62
c. 1501.23
d. 619.78
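
The arithmetic behind question 4 can be sketched in Python as follows. This is only an illustrative check, assuming the usual logistic form p = 1 / (1 + exp(-(intercept + slope × budget))); the function and variable names are hypothetical and not part of the exam.

```python
import math

# Coefficients as printed in the question's output table.
intercept = -0.2851
slope = 0.00046  # change in log-odds per crore of budget


def budget_for_probability(p: float) -> float:
    """Invert the logistic model: solve log(p / (1 - p)) = intercept + slope * budget."""
    log_odds = math.log(p / (1.0 - p))
    return (log_odds - intercept) / slope


budget = budget_for_probability(0.6)
print(round(budget, 2))
```

Inverting the log-odds directly avoids any numerical root finding, since the model is linear on the logit scale.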
5. (2 marks) Identify the probability distribution that is applicable in the following scenario: An old
machine is known to produce 10% defective parts. We are interested in the number of defective
parts in a run of 20 parts.
a. Uniform
b. Binomial
c. Poisson
d. Exponential
e. Geometric
f. Bernoulli
g. Negative binomial
h. Normal
6. (2 marks) Let X be a random variable that takes values -2,-1,0,1,2, and Y be a random variable
that takes values 0,1,4. The following table gives the joint probability distribution of X and Y.

Y \ X    -2     -1      0      1      2
  0       0      0     1/5     0      0
  1       0     1/5     0     1/5     0
  4      1/5     0      0      0     1/5

Find the conditional probability Pr(X = −1 | Y = 1).


a. 1/5
b. 2/5
c. 1/2
d. 1
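
The definition used in question 6, Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y), can be sketched in Python as below. This is a minimal illustration; the dictionary layout and the helper name `conditional` are assumptions made for the example.

```python
from fractions import Fraction

# Joint probabilities P(X = x, Y = y), keyed by (x, y).
# Only the non-zero cells of the table in the question are listed.
fifth = Fraction(1, 5)
joint = {
    (0, 0): fifth,
    (-1, 1): fifth,
    (1, 1): fifth,
    (-2, 4): fifth,
    (2, 4): fifth,
}


def conditional(x: int, y: int) -> Fraction:
    """Return P(X = x | Y = y) by dividing the joint cell by the marginal of Y."""
    p_y = sum(p for (_, y_val), p in joint.items() if y_val == y)
    return joint.get((x, y), Fraction(0)) / p_y


print(conditional(-1, 1))
```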
7. (3 marks) Let x1, x2, ..., x50 be independent observations from a distribution X which is not normal.
Suppose it is known that the mean of this distribution is 48 and the standard deviation is 5. What
can we say about the sample mean x-bar?
a. x-bar = 48
b. x-bar is distributed approximately normal with mean 48 and standard error 5

c. x-bar is distributed approximately normal with mean 48 and standard error 5/√50
d. x-bar cannot be approximated with the normal distribution since X is not normal
8. (2 marks) A marketing decision is to be made on whether a new product can be introduced next
month or not. If sales are high, the new product would yield a payoff of Rs. 5000; if sales are
medium, the payoff is Rs. 3000; and if sales are low, the payoff is Rs. 1000. If the probabilities of
the actual sales being high, medium or low are 0.7, 0.2 and 0.1 respectively, how valuable (in
Rs.) is it to know what the actual sales would be before making the decision on introducing the
product?
a. 3500
b. 4200
c. 0
d. 5000

9. (1 mark) What would happen if instead of using an ANOVA to compare 10 groups, you performed
multiple t-tests?
a. Nothing, there is no difference between using an ANOVA and using a t-test
b. Increases the probability of making a Type 1 error
c. Increases the probability of making a Type 2 error
d. Both (b) and (c)
10. (3 marks) The average score of all students varies Normally from year to year with a mean of
75 and standard deviation of 6. The individual student scores vary Normally with mean w and
known standard deviation of 4, where w is the average score. If one student has scored 85, find
the posterior mean of w.
a. 33.75
b. 85
c. 78.08
d. 81.92
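
For question 10, the standard Normal prior with Normal likelihood (known variance) update weights the prior mean and the observation by their precisions. The sketch below assumes that conjugate update and a single observation; the function name `posterior_mean` is hypothetical.

```python
def posterior_mean(prior_mean: float, prior_sd: float, obs: float, obs_sd: float) -> float:
    """Precision-weighted average of the prior mean and one Normal observation."""
    prior_precision = 1.0 / prior_sd ** 2
    obs_precision = 1.0 / obs_sd ** 2
    weighted_sum = prior_precision * prior_mean + obs_precision * obs
    return weighted_sum / (prior_precision + obs_precision)


# Prior on w: Normal(75, 6^2); one student score of 85 with known sd 4.
w_post = posterior_mean(prior_mean=75, prior_sd=6, obs=85, obs_sd=4)
print(round(w_post, 2))
```

Because the observation has the smaller variance, the posterior mean sits closer to the observed score than to the prior mean.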
11. (2 marks) The figure below represents a dendrogram produced by a hierarchical clustering
algorithm. What should the cluster members be if you choose to have 4 clusters?

a. {3,6},{4},{1},{2,5}
b. {3,6,4},{1},{2},{5}
c. {3,6},{4},{1,2},{5}
d. {3},{6,4},{1}, {2,5}

12. (2 marks) The table below provides a training dataset, where X1, X2 and X3 are the three input
variables, and Class is the dependent variable.

X1    X2    X3    Class
-1     0     1    Blue
 2     0     0    Red
 0     1     3    Blue
 0     1     2    Blue
 0     3     0    Blue
 1     1     1    Red

Suppose we wish to use this data set to make a prediction for Class when X1 = X2 = X3 = 0
using K-nearest neighbors. What is our prediction with K = 5 (Use Euclidean distances)?

a. Red
b. Blue
c. Cannot say
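
The K-nearest-neighbours vote in question 12 can be sketched as below. This is an illustrative implementation that assumes a simple majority vote with no distance weighting; the names `train` and `knn_predict` are made up for the example.

```python
import math
from collections import Counter

# Training rows from the question: ((X1, X2, X3), Class).
train = [
    ((-1, 0, 1), "Blue"),
    ((2, 0, 0), "Red"),
    ((0, 1, 3), "Blue"),
    ((0, 1, 2), "Blue"),
    ((0, 3, 0), "Blue"),
    ((1, 1, 1), "Red"),
]


def knn_predict(query, k):
    """Rank training points by Euclidean distance and take a majority vote of the k closest."""
    ranked = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


prediction = knn_predict((0, 0, 0), k=5)
print(prediction)
```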
13. (2 marks) With an SVM/SVC which is not perfectly separable, consider the following data points
or samples:
(i) All misclassified samples
(ii) All samples inside the margin
(iii) All samples lying on the margin boundary
(iv) All samples outside the margin
Which of the above samples will have non-zero slack variables ηi?
a. Only (i)
b. Only (i) and (ii)
c. Only (ii) and (iii)
d. Only (i), (ii) and (iv)
14. (2 marks) You are interested in determining whether customers would like a new breakfast
cereal. In a market testing, there is an initial roll out of 1000 packets of the new cereal. The new
cereal will be produced in bulk if at least 600 of these packets are sold. Construct the sample space
for the possible number of packets that would be sold in the roll out and the subset representing
the event when it will be decided to produce the new cereal in bulk. (Note: | represents “given
that” and captures the notion of conditional probability.)
a. Sample space: {p|600 ≤ p ≤ 1000}, Event: {p|p ≥ 0}
b. Sample space: {p|0 ≤ p ≤ 600}, Event: {p|p ≥ 600}
c. Sample space: {p|0 ≤ p ≤ 1000}, Event: {p|600 ≤ p ≤ 1000}
d. Sample space: {p|p ≥ 0}, Event: {p|0 ≤ p ≤ 1000}
15. (3 marks) A marketing research firm tests the effectiveness of a new flavoring for a leading beverage
using a sample of 20 people, half of whom taste the beverage with the old flavoring and the other
half taste the new one. The people in the study are then given a questionnaire that evaluates how
enjoyable the beverage was. The total score for each person was then recorded. Assuming the
first group who tasted the old flavoring had an average satisfaction score of 11.1 with a variance
of 18.77 while the ones who tasted the new had an average score of 15 and a variance of 13.33.
Assuming equal variances in the population and a confidence level of 95% (a 5% significance level), consider the following
statements:
(i) There is no significant difference in the average satisfaction scores between the two flavorings
(ii) There is a significant difference in the average satisfaction scores between the two flavorings
(iii) The probability of observing a value greater than 2.18 from the sampling distribution is 0.043

a. Only (i) and (iii) are correct and (iii) is the correct explanation for (i)
b. Only (i) and (iii) are correct and (iii) is not the correct explanation for (i)
c. Only (ii) and (iii) are correct and (iii) is the correct explanation for (ii)
d. Only (ii) and (iii) are correct and (iii) is not the correct explanation for (ii)
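
The pooled two-sample t statistic for question 15 can be sketched as below. The sketch assumes two groups of 10, that the quoted variances are sample variances (n − 1 divisor), and a two-sided critical value of 2.101 for 18 degrees of freedom taken from a standard t-table; all variable names are illustrative.

```python
import math

# Summary statistics from the question (old flavoring, new flavoring).
n1, mean1, var1 = 10, 11.1, 18.77
n2, mean2, var2 = 10, 15.0, 13.33

# Pooled variance under the equal-variance assumption.
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof

# Standard error of the difference in means, then the t statistic.
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean2 - mean1) / std_err

# Assumed table value: two-sided 5% critical point of t with 18 degrees of freedom.
T_CRIT = 2.101
print(round(t_stat, 2), dof, abs(t_stat) > T_CRIT)
```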

16. (2 marks) What effect would increasing the sample size have on a confidence interval?
a. The confidence interval would increase in size.
b. The confidence interval would decrease in size.
c. The confidence interval is unaffected by sample size.
d. The confidence interval could either increase or decrease in size.

17. (1 mark) In a simple linear regression, we use statistical inference to:
a. Fine tune the value of the coefficients in the linear model
b. Ensure that the unexplained error captured in the Mean Squared Error (MSE) is minimized
c. Test the hypothesis that each coefficient in the linear model is statistically different from 0
d. Establish the intercept and slope values in the linear fit based on some objective for fitting a
line through the data
18. (2 marks) Suppose you are handling a training dataset with 2000 observations, 5 predictor variables
and a response variable with two levels - “diabetic” and “non-diabetic”. If the data has 90% of
the responses as “diabetic”, which of the following statements would not apply to the scenario?
a. The accuracy of the model is likely to be greater than 90%
b. Weighted F1-score would provide a better estimate of the model performance
c. The area under the ROC curve for the current data would be comparable to the accuracy
value
d. None of the above
19. (2 marks) The probability function of a discrete random variable X is given by:
f(x) = kx², x = 5, 6, 7, and k is a positive constant. Find P(X ≥ 7).
a. 7/18
b. 13/18
c. 85/110
d. 49/110
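
For question 19, the constant k follows from requiring the probabilities to sum to 1. A small sketch of that normalisation, using exact fractions (variable names are illustrative):

```python
from fractions import Fraction

support = [5, 6, 7]

# k is fixed by the requirement that k * x^2 sums to 1 over the support.
k = Fraction(1, sum(x ** 2 for x in support))
pmf = {x: k * x ** 2 for x in support}

p_at_least_7 = sum(p for x, p in pmf.items() if x >= 7)
print(k, p_at_least_7)
```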
20. (3 marks) Which of the following statements is/are true:
i. In the complete linkage method of hierarchical clustering, the similarity is based on the
maximum distance between objects in two clusters.
ii. Divisive hierarchical clustering starts by considering each observation as a cluster and forms
larger clusters.
iii. Agglomerative clustering is computationally more efficient than Divisive clustering.
iv. Single linkage method suffers from chaining (forming elongated clusters which can be too
spread out)
Choose the correct option:
a. i,ii
b. Only i
c. ii, iii,iv
d. i,iii,iv
e. i,ii,iii,iv
21. (2 marks) Consider the following statements about K-fold cross validation:
(i) K-fold has lower bias compared to LOOCV
(ii) The final Mean Squared Error (MSE) is the average of all folds
(iii) For K=N, K-fold behaves as LOOCV where N is the number of observations
Which of the above statements are true?
a. Only (i)
b. Only (i) and (iii)
c. Only (ii) and (iii)

d. All of the above
e. None of these
22. (1 mark) Which of the following statements is true about LDA (Linear Discriminant Analysis):
a. LDA aims to minimize both distance between class and distance within class
b. LDA is preferred when the number of observations (n) is large and the predictors are
approximately normal
c. If the discriminatory information is in the variance but not in the mean of the data, LDA
will fail
d. If the discriminatory information is in the mean but not in the variance of the data, LDA
will fail
23. (2 marks) Assuming the same minimum support is maintained at each iteration, which of the
following is true about the Apriori approach to Association Rule Mining:
i. If an itemset is frequent, then all its subsets are also frequent
ii. If an itemset is frequent, then all its supersets are also frequent
iii. Algorithm starts with itemsets of size 1 and searches for larger itemsets that are frequent
iv. Algorithm starts with the largest itemsets and searches for smaller itemsets that are frequent
Choose the correct option:

a. ii,iv
b. Only i
c. i, iii
d. i, iv

24. (2 marks) Let X be a continuous random variable where

$$f(x) = \begin{cases} 1/18 & \text{if } -3 \le x \le 15 \\ 0 & \text{otherwise} \end{cases}$$

Then variance of X is:


a. 23
b. 24
c. 25
d. 26
e. 27
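
The variance in question 24 can be checked from Var(X) = E[X²] − (E[X])², using the closed-form integrals of x and x² against a constant density. The sketch below does this with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Support of the density in the question: constant 1/18 on [-3, 15].
a, b = Fraction(-3), Fraction(15)
density = 1 / (b - a)

# Integrals of x and x^2 times the constant density over [a, b].
mean = density * (b ** 2 - a ** 2) / 2
second_moment = density * (b ** 3 - a ** 3) / 3

variance = second_moment - mean ** 2
print(mean, second_moment, variance)
```

The result agrees with the shortcut (b − a)² / 12 for a uniform distribution on [a, b].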

25. (1 mark) Indicate in which scenario(s) a classification problem would be applicable:


a. The reviews of a movie can be considered a proxy for ratings and the sentiment contained in
them can be bucketed into a few categories. Given data on review sentiments of movies, the
runtime, budget, we wish to predict the total box office collection.
b. We wish to predict if a new product when launched in the market will be profitable. We
collected data on similar products and recorded their price, marketing budget, competition
price, and the corresponding profit yielded.
c. Data on the top 500 firms in the US containing information on the profit, number of
employees, industry and the CEO salary. The aim is to understand which factors affect the CEO
salary.
d. Given information on viewing times, activities, shares, likes and posts on a social media
website, we wish to identify if a particular user is an influencer.

26. (2 marks) Consider the data points and the fitted line generated by OLS shown in the figure below.
Of the data points marked, which ones would affect the fitted line significantly if removed?

a. (a) alone
b. (c) alone
c. (d) alone
d. Both (a) and (c)

27. (2 marks) A 95% confidence interval for the population mean was computed to be (1.24, 5.16).
Which of the following statements are true about the confidence interval:
i. We are 95% confident that the true mean is between 1.24 and 5.16
ii. 95% of all samples should have x-bars between 1.24 and 5.16
iii. 95% of all x's have values between 1.24 and 5.16
iv. Of 100 intervals calculated the same way, we expect 95 of them to capture the population
mean

Choose the correct option:


a. ii,iv
b. Only i
c. i, iii
d. i, iv
e. i,ii,iii,iv
28. (2 marks) Consider the following statements about Bagging:
(i) Individual trees are built on a subset of features
(ii) Individual trees are built on a subset of observations
(iii) Individual trees are correlated
(iv) Interpretability of the result is low
Which of the above statements are true?
a. (i) and (iv)
b. (i), (iii) and (iv)
c. (ii) and (iv)
d. All statements are true
29. (1 mark) Which of the following is not a property of the binomial experiment?

a. the trials are independent

b. the experiment consists of a sequence of n identical trials
c. the probability of success does not change from trial to trial
d. there are k possible outcomes on each trial, where k is any positive integer
e. The sample space is {0,1,..., K} where K is the number of trials

30. (3 marks) We wish to classify potential customers as buyers and non-buyers for our newly launched
product, in order to improve the selection of target audience for the product related advertisements.
For this purpose we consider two independent variables for each customer - income in lakhs/annum
(I) and education in years (E). The figure below gives such a partition space, where ‘B’ represents
Buyer and ‘NB’ represents Non-Buyer, indicating the majority response in each region. Find the
root node and depth of the decision tree resulting from this partition space (The depth of a tree
is defined as the length of longest path from the root to a leaf node).

a. Root node is I < 3 and depth is 4


b. Root node is E < 12 and depth is 4
c. Root node is E < 9 and depth is 4
d. Root node is I < 3 and depth is 3
e. Root node is E < 12 and depth is 3
f. Root node is E < 9 and depth is 3

31. (3 marks) The figure below represents a dendrogram produced by a hierarchical clustering algorithm. From
visual inspection, what should be the best choice of the number of clusters and what will be the
cluster members?

Choose the correct option:

a. 3 clusters, {a,b}, {c}, {d,e,f}
b. 2 clusters, {a,b,c}, {d,e,f}
c. 2 clusters, {a,b}, {c,d,e,f}
d. 4 clusters, {a,b}, {c}, {d}, {e,f}

32. (2 marks) For the provided dataset, which of the hyperplanes will be selected by an SVM?

a. a
b. b
c. c
d. None of the above

33. (3 marks) You have 30 data points which you believe come from a Binomial distribution (PMF
is ${}^{n}C_{x}\, p^{x} (1 - p)^{n - x}$). You want to use the concept of maximum likelihood estimation (MLE) to
estimate the parameter p, the probability of success. Explain the steps that you would take for this
computation, and find the maximum likelihood estimate. The formula for MLE as discussed in
class is: $\max \hat{l}(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$
34. (2 marks) Explain the impact of outliers in K-means clustering algorithm.

35. (3 marks) In the given figures, the K-Nearest Neighbour classification method was applied and for
varying values of K, the corresponding decision boundaries were plotted. Arrange the figures in
increasing order of K value. Also discuss the impact of K on bias and variance of the models.

[Figure: four decision-boundary plots, labelled K1, K2, K3 and K4]

36. (2 marks) The OLS solution of a regression problem is given below. Regularization was performed
on the dataset, and the output coefficients are given under Model 1 and Model 2. Which among
Model 1 and Model 2 is likely to be a Ridge regression, and which is likely to be a Lasso regression?
Why?

Variable      OLS        Model 1     Model 2
Intercept     -8.915      18.610      5.179
V1            -0.013       0.015      0
V2            -0.942      -0.876     -0.874
V3            -0.164      -0.053      0
V4            -0.005       0.006      0
V5            -2.599      -2.364     -2.264
V6             0.007       0.005      0.004
V7            -0.004      -0.004     -0.003
V8            14.973     -13.484      0
V9            -0.929      -0.638     -0.657
V10            0.893       0.858      0.8122
V11            0.294       0.251      0.270

37. (3 marks) Consider the contingency table below (M) for two itemsets A and B, where f11 is the
number of transactions where both A and B are present, f10 is the number of transactions where
A is present but not B, f01 is the number of transactions where A is not present but B is present,
and f00 is the number of transactions where both A and B are absent.

A measure is null-invariant if O(M+C) = O(M) where C = [0 0; 0 k] and k is a positive constant.
Write down the Lift measure (P(AB)/(P(A)·P(B))) in terms of f00, f01, f10, f11 from M, and
explain if it is null-invariant or not.
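
The null-invariance check in question 37 can be explored numerically. The sketch below estimates each probability as a count divided by the total number of transactions and then adds k to f00 only. The counts used are hypothetical and chosen purely for illustration; they do not come from the exam.

```python
from fractions import Fraction


def lift(f11: int, f10: int, f01: int, f00: int) -> Fraction:
    """Lift = P(AB) / (P(A) * P(B)), with probabilities estimated from the counts."""
    n = f11 + f10 + f01 + f00
    p_ab = Fraction(f11, n)
    p_a = Fraction(f11 + f10, n)
    p_b = Fraction(f11 + f01, n)
    return p_ab / (p_a * p_b)


# Hypothetical contingency counts, then the same table with k = 100 added to f00.
base = lift(20, 10, 30, 40)
shifted = lift(20, 10, 30, 40 + 100)
print(base, shifted)
```

Comparing the two values shows how the measure responds when only the count of transactions containing neither A nor B changes.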
