SCQs [Paper-I]

Linear Regression
1. Feature engineering is an important step in any model building exercise. It is the process of creating new features
from a given data set using the domain knowledge to leverage the predictive power of a machine learning model.
Which of the following statements are correct?
Statement 1: Feature engineering techniques are applied before train test split.
Statement 2: There is no difference between standardization and normalization.
Statement 3: Mean encoding is a feature engineering technique for handling categorical features.
a. Only 1 and 2 c. Only 2 and 3
b. Only 1 d. Only 3
2. VIF is used to detect Multicollinearity. Which of the following statements is NOT true for VIF?
a. The VIF has a lower bound of 0
b. The VIF has no upper bound
c. VIF for a variable generally changes if you drop one of the predictor variables
d. If a variable is a product of two other variables, it can have a high VIF
3. The distribution of error terms in a linear regression model should look like (the horizontal line represents y=0):

a. A c. B
b. C d. D

4. For the same dependent variable Y, two models were created using the independent variables X1 and X2. The
following graphs represent the fitted lines on the scatterplots. (Both graphs are on the same scale.) Which of the
following is true about the residuals in these two models?
a. The sum of residuals in model 2 is higher than model 1
b. The sum of residuals in model 1 is higher than model 2
c. Both have the same sum of residuals
d. Nothing can be said about the sum of residuals from the given graph

5. You built a simple linear regression model for a problem statement provided by the client. After a few days, the client
asks you to build a new model with an increased number of data points (old dataset + new data points). The count of
new data points exceeds old data points by 20%.
Which of the following statements is TRUE regarding the mean of residuals?
a. Mean of residuals of old model > Mean of residuals of new model
b. Mean of residuals of old model < Mean of residuals of new model
c. Mean of residuals of old model = Mean of residuals of new model
d. Information provided is not enough to comment on the mean of residuals
6. A scatterplot was plotted for two variables – age and income to find out how the income depends on the age of a
person. It was found that as the income increases linearly with age, the variability in income also increases. This is a
violation of which of the following assumptions of linear regression?
a. Homogeneity c. Heterogeneity
b. Homoscedasticity d. Linearity
7. RFE method is used for:
a. Dummy variable creation c. Detecting multicollinearity
b. Feature selection d. Univariate regression
8. Which of the following assumptions do we make while building a simple linear regression model? (Assume X and y to be
independent and dependent variables respectively.)
A. There is a linear relationship between X and y
B. X and y are normally distributed
C. Error terms are independent of each other
D. Error terms have constant variance
a. A, B, C and D c. A, C and D
b. A, B and C d. B, C and D
9. A client approached you with a problem statement. You decided to build a multiple linear regression model on the
dataset provided. The dataset consisted of 40 features. Obviously, all features will not be significant. Selecting the
relevant features manually will be a tougher task. You can use RFE to select relevant features. RFE is an automated
feature selection technique. Initially, you assumed 25 features can explain your whole data.
Which of the following commands correctly calls the RFE technique in Python? (Here “lm” is the fitted instance of
multiple linear regression model)
a. from stastmodel.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.fit(X_train,y_train)
b. from sklearn.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.predict(X_train,y_train)
c. from sklearn.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.fit(X_train,y_train)
d. from RFE import feature_selection
rfe=RFE(lm,25)
rfe=rfe.predict(X_train,y_train)
10. Suppose that on adding a new predictor variable to a linear regression model (model-1), the adjusted r-squared of the
new model (model-2) decreases. Choose the correct statement:
a. The r-squared of model-2 will be less than that of model 1
b. The r-squared of model-2 increases, but the complexity of model-2 also increases
c. The r-squared of model-2 decreases, but the complexity of model-2 also increases
d. Nothing can be said about the r-squared of model-2
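Note: the relationship between R-squared and adjusted R-squared in question 10 can be sketched with the standard formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sample size and R-squared values below are hypothetical and are not taken from this paper.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: a 4th predictor lifts R-squared only marginally.
n = 50
adj_model_1 = adjusted_r2(0.700, n, p=3)
adj_model_2 = adjusted_r2(0.701, n, p=4)

# R-squared went up, yet the penalty for the extra predictor outweighs the gain.
print(round(adj_model_1, 4), round(adj_model_2, 4))
```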
11. Some of the independent variables (predictors) might be interrelated, due to which the presence of a particular
independent variable in the model is redundant. This phenomenon is called Multicollinearity.
Suppose that you are building a multiple linear regression model for a given problem statement, which of the
following statements is TRUE w.r.t. multicollinearity?
a. Multicollinearity is a problem when your only goal is to predict the independent variable from the set of
dependent variables
b. Multicollinearity is a problem when your goal is to infer the effect on the dependent variable due to
independent variable.
c. Multicollinearity is not a problem if a variable is not collinear with your variable of interest
d. Multicollinearity is not a problem if there are multiple dummy(binary) variables that represent a categorical
variable with three or more categories
12. The coefficient of determination between a dependent variable and an independent variable is 0.47. This denotes
that-
a. The relationship between the two variables is not strong
b. The correlation coefficient between the two variables is also 0.47
c. 47% of the variance in the independent variable is explained by the dependent variable
d. 47% of the variance in the dependent variable is explained by the independent variable
13. While solving linear regression, the dependent variable is-
a. Numeric c. Categorical
b. Dummy coded d. Binary
14. Consider the following two assumptions for a simple linear regression model. (Assume X and y to be independent and
dependent variables respectively).
Statement 1: There is a linear relationship between X and y
Statement 2: X and y are normally distributed
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
15. What does standardized scaling do?
a. Bring all data points in the range 0 to 1
b. Bring all data points in the range -1 to 1
c. Bring all the data points in a normal distribution with mean 0 and standard deviation 1
d. Bring all the data points in a normal distribution with mean 1 and standard deviation 0
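Note: standardized scaling (z-scoring) as referred to in question 15 can be sketched as follows. The sample values are hypothetical, and the population standard deviation is assumed as the divisor.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5 and std dev 2
z = standardize(data)

# After scaling, the values have mean 0 and standard deviation 1.
print(z)
print(mean(z), pstdev(z))
```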
16. In the linear regression, F-statistic is used to determine-
a. The significance of the individual beta coefficient
b. The variance explanation strength of the model
c. The significance of the overall model fit
d. Both A and C
17. Suppose you regress one of the feature variables, T, on all the remaining feature variables. The R-squared
of this model was found to be 0.8. What will be the VIF for the variable T?
a. 1.56 c. 2.77
b. 3.33 d. 5.00
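Note: the calculation behind question 17 uses the standard definition VIF = 1 / (1 - R²), where R² comes from regressing the variable on the other predictors. A minimal sketch (the function name is illustrative):

```python
def vif(r_squared: float) -> float:
    """VIF of a predictor, given the R-squared from regressing it on the other predictors."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return 1 / (1 - r_squared)

print(round(vif(0.8), 2))  # 5.0
print(vif(0.0))            # 1.0, the value for a predictor unrelated to the others
```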
18. Which of the following is true regarding the error terms in linear regression?
a. The sum of residuals should be zero
b. The sum of residuals should be lesser than zero
c. The sum of residuals should be greater than zero
d. There is no such restriction on what the sum of residuals should be

Logistic Regression and Classification


19. Suppose an imbalanced data set has a class ratio of 2:3, and you want to run a cross-validation scheme to evaluate a
model's performance. If you apply a stratified k-fold to generate the train-test folds, what will be the distribution of
the classes in the test split?
a. 1:5 c. 2:3
b. 1:7 d. None of these
20. Consider the following two statements-
Statement 1: Suppose the values of Precision and Recall for a model are 0.65 and 0.75 respectively. Then the value
of the F1-score will be 0.696
Statement 2: Mean squared error is a metric that can be used to evaluate logistic regression model.
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
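Note: the F1-score in Statement 1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch of the arithmetic (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.65, 0.75), 3))
```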
21. The output of logistic model is-
a. 0 or 1 c. Any value between 0 and 1
b. 0.5 d. Depends on the business problem
22. What is the use of performing segmentation on a dataset before running a logistic regression on it?
a. It helps in capturing the seasonal fluctuations that might be present in the data
b. It helps to find the optimal cut-off point more easily
c. It helps in finding the different predictive patterns for the different set of data points that might be present in
the data
d. It helps capture the trends easily when there is a class imbalance
23. Given an imbalanced dataset, the ratio of positive to negative class is 1: 10000. You run a logistic regression model and
find out the model has a high value of precision and a low value of recall. Which of the following statements is true?
a. The class is handled well by the data
b. The model is not able to detect the class, but when it does it is highly trustable
c. The model is able to detect the class but it includes data points from the other class as well
d. The class is handled poorly by the data
24. You have to build a logistic regression model that predicts whether a loan is approved or not based on a
person’s FICO score. Here are the model parameters: intercept (𝛽0) = -9.346 and the coefficient of FICO score = 0.0146.
Given these parameters, can you calculate the probability of the loan getting approved for someone with a FICO score of
640?
a. 0.35 c. 0.40
b. 0.45 d. 0.50
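Note: a univariate logistic model turns the log odds 𝛽0 + 𝛽1·x into a probability through the sigmoid function, p = 1 / (1 + e^-(𝛽0 + 𝛽1·x)). The sketch below uses the parameters stated in the question; the function name is illustrative.

```python
import math

def approval_probability(fico: float, intercept: float = -9.346, coef: float = 0.0146) -> float:
    """Probability from a univariate logistic model: sigmoid of the log odds."""
    log_odds = intercept + coef * fico
    return 1 / (1 + math.exp(-log_odds))

for score in (640, 655):
    print(score, round(approval_probability(score), 2))
```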
25. Which of the following is correct for a logistic regression model?
a. The independent variable should not be multicollinear
b. The dependent variable should follow Normal Distribution
c. The log odds in a logistic regression model lies between 0 and 1
d. F1 score is always the best metric for evaluating a logistic regression model
26. You have a problem statement to build a multivariate logistic regression model. There are two features of interest
in the dataset, ‘infected’ and ‘Blood Group’. The feature ‘infected’ takes two values, “yes” or “no”, whereas ‘Blood
Group’ takes multiple levels like A, A+, O, O+ etc.
Now consider the following statements-
Statement 1: For the feature “infected”, mapping is preferred over the creation of dummy variables.
Statement 2: For the feature “Blood Group”, the creation of dummy variables is preferred over mapping.
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
27. For a completely random binary classification model, what will be the area under the curve of the ROC graph?
a. 0 c. 0.25
b. 0.5 d. 1
28. Consider the following univariate logistic model
𝑦 = 𝛽0 + 𝛽1 𝑥1
Which of the following statements is NOT true?
a. The maximum likelihood estimation determines the best combination of 𝛽0 and 𝛽1
b. If 𝛽1 is increased by 1 unit, Y increases by 1 unit
c. 𝛽0 is the y-intercept
d. If 𝛽1 is increased by 1 unit, log odds increases by 1 unit
29. You have to build a logistic regression model that predicts whether a loan is approved or not based on a
person’s FICO score. Here are the model parameters: intercept (𝛽0) = -9.346 and the coefficient of FICO score = 0.0146.
Given these parameters, can you calculate the probability of the loan getting approved for someone with a FICO score of
655?
a. 0.35 c. 0.45
b. 0.55 d. 0.65
30. Consider the following confusion matrix. Which among the following is the lowest for the given confusion matrix?
Total=500             Actual Positive    Actual Negative
Predicted Positive    196                20
Predicted Negative    28                 256
a. Accuracy c. Precision
b. Sensitivity d. Specificity
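Note: the four metrics named in question 30 can be computed directly from the cells of the confusion matrix. A minimal sketch using the standard definitions (variable names are illustrative):

```python
# Cells of the confusion matrix given in the question.
tp, fp = 196, 20   # predicted positive: actual positive / actual negative
fn, tn = 28, 256   # predicted negative: actual positive / actual negative

metrics = {
    "accuracy": (tp + tn) / (tp + fp + fn + tn),
    "sensitivity": tp / (tp + fn),   # also called recall or TPR
    "precision": tp / (tp + fp),
    "specificity": tn / (tn + fp),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

lowest = min(metrics, key=metrics.get)
print("lowest:", lowest)
```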
31. If you use a random number generator to predict the output 0 or 1 for a binary classification problem, what will be the
area under the curve of the ROC curve?
a. 0 c. 0.5
b. 1 d. 100
32. How is regression different from classification?
a. One is supervised while the other is unsupervised
b. One is iterative while the other is closed
c. In regression, the response variable is numeric while it is categorical in classification
d. None of the above
33. Recall the telecom churn example. If the log odds for churn are equal to 0 for a customer, then that means-
a. There is no chance of the customer churning
b. The probability of customer churning is equal to the probability of the customer not churning.
c. The probability of customer churning is very small compared to the probability of the customer not churning.
d. The probability of customer churning is very large compared to the probability of the customer not churning.
34. Recall the telecom churn example. If the odds for churn are equal to 1/3 for a customer, then that means-
a. The probability of customer not churning is 3 times the probability of the customer churning
b. The probability of customer churning is 3 times more than the probability of the customer not churning
c. The probability of customer not churning is 4 times the probability of the customer churning
d. The probability of customer churning is 4 times more than the probability of the customer not churning
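Note: odds, log odds and probability are linked by odds = p / (1 - p) and log odds = ln(odds). A minimal sketch of the conversions (function names are illustrative):

```python
import math

def probability_from_odds(odds: float) -> float:
    """Invert odds = p / (1 - p)."""
    return odds / (1 + odds)

def probability_from_log_odds(log_odds: float) -> float:
    """Exponentiate the log odds to get odds, then convert to a probability."""
    return probability_from_odds(math.exp(log_odds))

p_even = probability_from_log_odds(0.0)   # log odds of 0 means odds of 1
p_third = probability_from_odds(1 / 3)    # odds of 1/3

print(p_even, 1 - p_even)
print(p_third, 1 - p_third)
```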
35. Which of the following statements is NOT true?
a. In the case of a fair coin, the odds of getting heads is 1
b. The error values of linear and logistic regression have to be normally distributed
c. Specificity decreases with the increase in sensitivity
d. As TPR increases, FPR also increases
36. Take a look at the following three problem statements.
Problem statement 1: Let's say that you are building a telecom churn prediction model with the business objective that
your company wants to implement an aggressive customer retention campaign to retain the high churn-risk
customers. This is because a competitor has launched extremely low-cost mobile plans, and you want to avoid churn
as much as possible by incentivising the customers. Assume that budget is not a constraint.
Problem statement 2: Let's say you are building a cancer detection model with the objective that both the patient who
has cancer and the patient who does not have cancer can be detected correctly. It can have serious implications if you
predict either of the classes wrong, i.e., if wrongly detected as "not cancer" the patient will die of cancer, and if wrongly
detected as "cancer" the patient will die of chemotherapy.
Problem statement 3: You have to build an image classification model where 60% of images belong to one class and
rest 40% images belong to another class. You have to predict the class of a new image.
Which is the correctly matched model evaluation metric for the above classification models?
a. Problem Statement 1: Specificity c. Problem Statement 2: Sensitivity
b. Problem Statement 2: Specificity d. Problem Statement 3: Accuracy
37. What will be the accuracy percentage of the given confusion matrix of the three-class classification?
True/Predicted    Class A    Class B    Class C
Class A           13         0          5
Class B           0          4          8
Class C           1          1          9
a. 63% c. 36%
b. 71% d. 45%
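Note: for a multi-class confusion matrix, accuracy is the sum of the diagonal (correct predictions) divided by the total count. A minimal sketch using the matrix from question 37:

```python
# Rows are true classes A, B, C; columns are predicted classes A, B, C.
confusion = [
    [13, 0, 5],
    [0, 4, 8],
    [1, 1, 9],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(correct, total, f"{accuracy:.1%}")
```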

Clustering
38. In hierarchical clustering, the shortest distance and the maximum distance between points in two clusters are
defined as ………. and ………….. respectively.
a. Single linkage and complete linkage c. Complete linkage and single linkage
b. Single linkage and average linkage d. Complete linkage and average linkage
39. Which of the following statements is NOT true?
a. Each time the clusters are made during the K-means algorithm, the centroid is updated.
b. The cluster centres that are computed in the K-means algorithm are given by centroid value of the cluster
points
c. Standardization of the data is not important before applying Euclidean distance as a measure of
similarity/dissimilarity
d. The centroid of a column with data points 25, 32, 34 and 23 is 28.5.
e. The Euclidean distance between two points (10,2) and (4,5) is 7.
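Note: the centroid and Euclidean distance figures quoted in options d and e can be checked with a few lines. This is a minimal sketch; the helper name is illustrative.

```python
import math
from statistics import mean

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

centroid = mean([25, 32, 34, 23])        # centroid of a single column is its mean
distance = euclidean((10, 2), (4, 5))    # sqrt(6**2 + 3**2)

print(centroid)
print(round(distance, 3))
```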
40. Initializing the following command in Python will result in the following:
model_clus= KMeans(n_clusters=6, max_iter=50)
a. Run maximum 6 iterations c. Run maximum 40 iterations
b. Create 6 final clusters d. Create 50 final clusters
41. Which of the following is not true for the Hopkins statistic?
a. The Hopkins statistic decides if the data is suitable for clustering or not
b. The Hopkins statistic lies between -1 and 1
c. If the Hopkins statistic comes out to be 0, then the data is uniformly distributed
d. If the Hopkins statistic comes out to be 1, then the data is highly suitable for clustering
42. Consider the two statements-
Statement 1: The distance between 2 clusters is the maximum distance between 2 points in the clusters in
complete linkage.
Statement 2: Most of the time Complete linkage will produce unstructured dendrograms.
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
43. A client has approached you for a problem statement that requires the use of clustering. You decided to model the
problem statement with hierarchical clustering. Consider the datasets having ‘n’ data points.
Which of the following statements is true for the above problem statement?
a. ‘n*n’ distance matrix should be calculated for the mentioned problem statement
b. Initially ‘n’ clusters are formed for the mentioned problem statement
c. The output of the problem statement above is a dendrogram
d. All the above
44. The Silhouette metric for any ith point is given by S(i) = (b(i) - a(i)) / max(a(i), b(i))
Which of the following is not true about the Silhouette metric?
a. b(i) is the average distance from the nearest neighbour cluster (Separation)
b. a(i) is the average distance from own cluster (Cohesion).
c. If S(i) = 1 the data point is similar to its own cluster.
d. Silhouette metric ranges from 0 to +1
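Note: the silhouette value of a single point can be sketched as below for one-dimensional data, using absolute difference as the distance. The two tiny clusters are hypothetical; a(i) is averaged over the other members of the point's own cluster and b(i) over the nearest other cluster.

```python
def silhouette(point, own_cluster_others, nearest_cluster):
    """Silhouette value of one point: (b - a) / max(a, b)."""
    # a: mean distance to the other members of the point's own cluster (cohesion)
    a = sum(abs(point - x) for x in own_cluster_others) / len(own_cluster_others)
    # b: mean distance to the members of the nearest other cluster (separation)
    b = sum(abs(point - x) for x in nearest_cluster) / len(nearest_cluster)
    return (b - a) / max(a, b)

# Hypothetical clusters: {1.0, 2.0} and {8.0, 9.0}.
well_placed = silhouette(1.0, [2.0], [8.0, 9.0])
# Hypothetical point 7.0 wrongly assigned to the cluster {1.0, 2.0}.
misplaced = silhouette(7.0, [1.0, 2.0], [8.0, 9.0])

print(round(well_placed, 3), round(misplaced, 3))
```

A point that sits closer to another cluster than to its own yields a negative value, which is why the metric is not confined to non-negative numbers.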
45. Clustering is used to identify the below-
a. Data distribution c. Correlation among the data points
b. Principal components d. Subgroups in the data
46. For a K-means clustering process, the Hopkins Statistic for the dataset came out to be 0.8. Hence the dataset is-
a. Suitable for clustering c. Not suitable for clustering
b. Can’t say from the given information d. None of the above
47. For a K-means clustering process, the Hopkins Statistic for the dataset came out to be 0.3. Hence the dataset is-
a. Suitable for clustering c. Not suitable for clustering
b. Can’t say from the given information d. None of the above
48. You observed the following dendrogram after performing K-means clustering on a dataset. Which of the following
statements can be concluded from this dendrogram?

a. The initial number of clusters is 6
b. There are 25 data points used in the above clustering algorithm
c. Single linkage is used to define the distance between two clusters in the above dendrogram
d. The above dendrogram interpretation is not possible for K-means clustering.

49. Refer to the dendrogram image below and answer the question that follows:
Find the number of clusters formed if the dendrogram is cut at 0.25. (Assume agglomerative clustering method)

a. 6 c. 11
b. 13 d. 15

Decision Tree
50. Which of the following is the correct sampling technique that is used by a random forest model to overcome the
problem of overfitting?
a. Random sampling c. Bootstrapping
b. Oversampling d. Stratified sampling
51. Which of the following metrics measures how often a randomly chosen element would be incorrectly identified?
a. Entropy c. Information Gain
b. Gini Index d. None of these
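Note: the Gini index of a node is commonly defined as 1 minus the sum of squared class proportions, which is the chance that a randomly drawn element is mislabelled when labels are assigned according to the class distribution in the node. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

mixed = gini_index(["a", "a", "b", "b"])   # evenly mixed node
pure = gini_index(["a", "a", "a", "a"])    # pure node

print(mixed, pure)
```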
52. Which of the following is true for weight of evidence (WoE) analysis?
a. It helps in finding the different predictive patterns for the different segments that might be present in the data
b. WoE helps in treating missing values for both continuous and categorical variables
c. WoE values should follow an increasing or decreasing trend across bins.
d. All of the above
53. Refer to the decision tree given below and choose the statement that is correct as per this tree.

a. The tree given above will show very good performance on the train data
b. The tree given above is an underfitting tree.
c. If the petal length is more than 2.45, then it is equally likely that the flower is either setosa or virginica.
d. Both B and C
54. Suppose you train a decision tree with the following data. Which feature should we split on at the root?
X Y Z V
T T F 1
F F F 0
T T T 0
F T T 1
a. X c. Y
b. Z d. Cannot be determined
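Note: one common way to choose the root split is to compare the information gain (entropy reduction) of each feature. The sketch below applies that criterion to the four rows in question 54; it assumes entropy-based splitting, which the question does not specify.

```python
import math
from collections import Counter

rows = [
    {"X": "T", "Y": "T", "Z": "F", "V": 1},
    {"X": "F", "Y": "F", "Z": "F", "V": 0},
    {"X": "T", "Y": "T", "Z": "T", "V": 0},
    {"X": "F", "Y": "T", "Z": "T", "V": 1},
]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target="V"):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {f: information_gain(rows, f) for f in ("X", "Y", "Z")}
best = max(gains, key=gains.get)
print(gains, best)
```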
55. Select the correct option based on the following decision tree.

I. Node 8 is the root node
II. Leaf node is 5
III. Nodes 2, 3, 4 are internal nodes.
IV.
a. Only I c. Only II
b. Both II and III d. Both II and IV

NLP and Lexical Processing


56. Choose the correct option from the following:
The difference between “+” and “*” quantifier is-
a. ‘+’ needs the preceding character to be present at least once whereas ‘*’ does not need the same.
b. ‘*’ needs the character to be present at least once whereas ‘+’ does not need the same.
c. Both the quantifiers have the same functionality
d. None of the above
57. What is the Levenshtein distance between ‘decade’ and ‘dictate’?
a. 3 c. 4
b. 5 d. 6
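Note: Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal dynamic-programming sketch (the function name is illustrative):

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

print(levenshtein("decade", "dictate"))
print(levenshtein("shutter", "shelter"))
```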
58. Which of the following strings will match the expression ‘^01+0$’?
1. 0
2. 00
3. 011110
a. Only option 1 c. Only option 3
b. Both 1 and 2 d. Both 2 and 3
59. What is the Levenshtein distance between ‘shutter’ and ‘shelter’?
a. 1 c. 2
b. 3 d. 4
60. Which of the following strings will match with the regular expression ‘^01*0$’?
1. 0
2. 00
3. 01111111110
a. Only option 1 c. Only option 3
b. Both 1 and 2 d. Both 2 and 3
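Note: the difference between the ‘+’ and ‘*’ quantifiers in the two anchored patterns above can be checked with Python's built-in re module. A minimal sketch; the candidate list simply collects the strings used in the questions.

```python
import re

candidates = ["0", "00", "011110", "01111111110"]

one_or_more = re.compile(r"^01+0$")    # '+' requires at least one '1'
zero_or_more = re.compile(r"^01*0$")   # '*' also allows no '1' at all

plus_matches = [s for s in candidates if one_or_more.search(s)]
star_matches = [s for s in candidates if zero_or_more.search(s)]

print(plus_matches)
print(star_matches)
```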
Business Problem Solving
61. The coronavirus disease (COVID-19) was declared a pandemic by the World Health Organization (WHO) in March 2020.
Currently, there are no vaccines or treatments that have been officially approved by WHO after clinical trials. India has
not seen the peak of infection yet and the number of infections is touching a new height daily. The business unit of
an Indian health and hygiene company approaches you to know “Why the sales of masks are decreasing despite the
number of corona infections increasing daily”.
Answer the following questions:
Suppose you mapped the above problem statement with a classification problem, either a customer will buy a mask or
not. You’ll build …….. model as your initial solution.
a. Neural Network c. Logistic Regression
b. Decision Tree d. All of the above
62. The coronavirus disease (COVID-19) was declared a pandemic by the World Health Organization (WHO) in March 2020.
Currently, there are no vaccines or treatments that have been officially approved by WHO after clinical trials. India has
not seen the peak of infection yet and the number of infections is touching a new height daily. The business unit of
an Indian health and hygiene company approaches you to know “Why the sales of masks are decreasing despite the
number of corona infections increasing daily”.
Answer the below questions:
Consider the following two statements:
Statement 1: Understanding the change in customer behaviour is an important factor to be considered for
business understanding for the problem statement above
Statement 2: One of the possible hypotheses for the above problem statement: There is a rise in the number of
companies manufacturing normal/surgical masks due to which the sales of the client’s company is decreasing
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
63. Any Business Problem Solving will have the following steps:
1. To identify the right data sources, that will be useful in formulating the final solution
2. Develop hypothesis and assess the overall impact of the hypothesized solution
3. Asking the right question for business and problem understanding.
4. Define the solution approach: What will be the POC model? What will be the metrics for the model evaluation
etc.
5. Converting business problem to a data science problem
6. Start your model building process with the simple POC model. And then increase the complexity of the POC
model and optimize the parameters to get the best result.
7. Performing EDA on the datasets
8. Model Evaluation
What will be the correct flow for solving the above/any business problem?
a. 3>1>5>2>4>7>6>8
b. 3>2>1>5>4>7>6>8
c. 4>3>1>2>5>7>6>8
d. 3>2>1>5>4>7>8>6
MCQs [Paper-I]
64. ROC curve shows the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR). TPR and FPR are
sensitivity and (1-Specificity) respectively. The following function is written in Python using the metrics package from the
scikit-learn library for the ROC curve function.
    from sklearn import metrics

    def draw_roc(actual, probs):
        fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate=False)
        auc_score = metrics.roc_auc_score(actual, probs)
        return None
Which of the following statements are true? (More than one option may be correct)
a. The area under ROC curve can be more than 1
b. The arguments passed in the above function are actual values of the target variable and the predicted values
(i.e. 0 or 1)
c. The area under the ROC can take any value between 0 and 1
d. Larger the area under the curve, the better will be the model
e. The arguments passed in the above function are actual values of the target variable and the respective
predicted probabilities
65. Observe the following cost function graph with different learning rates.

a. The learning rate of the Curve C is highest among all curves.


b. The learning rate of the Curve B is lower than A.
c. The learning rate of the Curve B is higher than A.
d. The learning rate of Curve C is smallest among all curves.
e. None of the above
66. Which of the following commands correctly build a logistic regression model in Python?
(More than 1 option can be correct)
a. from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(X_train,y_train)
b. import statsmodels.api as sm
lr=sm.GLM(y_train,(sm.add_constant(X_train)),
family =sm.families.Binomial())
lr.fit()
c. from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.predict(X_train,y_train)
d. import statsmodels.api as sm
lr=sm.GLM(y_train,(sm.add_constant(X_train)),
family =sm.families.Binomial())
lr.predict()
67. In a simple linear regression model when you fit a straight line through the data you’ll get the two parameters of the
straight line i.e. the intercept 𝛽0 and the slope 𝛽1. Which of the following is true for 𝛽0 and 𝛽1? (More than one option
may be correct)
a. The null hypothesis for a simple linear regression model is 𝐻0: 𝛽1 = 0
b. If the p-value turns out to be greater than 0.05 for 𝛽1, it means 𝛽1 is significant
c. If 𝛽1 turns out to be insignificant, that means there is no relationship between the dependent and the
independent variable.
d. If the p-value turns out to be less than 0.05 for 𝛽0 it means that 𝛽0 is non-zero
68. Which of the following metrics can be used for finding the appropriate number of clusters in K-means clustering?
(More than one option may be correct)
a. Silhouette Score c. Elbow Curve
b. Hopkins Statistic d. Dendrogram
69. Which of the following statements is true? (More than one option may be correct)
a. TSS(Total Sum of Squares) is defined as the sum of all squared differences between the observed dependent
variable and its mean
b. R-Squared can take any value between 0 and 1
c. Larger the R-squared value, the better the regression model fits the observations
d. If RSS=5.50 and TSS=11, the value of VIF will be 1.33
70. Which of the following statements are correct in the context of logistic regression? (More than one option may be
correct)
a. The dummies for continuous variables make the model more unstable
b. Weight of Evidence (WoE) helps in treating missing values for both continuous and categorical variables
c. WoE should follow a non-monotonic trend across bins.
d. Data clumping can be a problem with transforming continuous variables to dummies.
e. Information Value or IV is an important indicator of predictive power.
71. Which of the following is NOT a methodology by which you can identify the optimal number of clusters for K-means
clustering? (More than one option may be correct)
a. Dendrogram inspection method c. Elbow method
b. Single Linkage method d. Silhouette Score
SCQs [PAPER-II]
SQL
1. Which of the following is/are the correct sequence of steps in the database design creation-manipulation cycle?
1. Development, Design, Manipulation, Maintenance
2. Development, Manipulation, Production, Maintenance
3. Development, Manipulation, Production, Revision
a. 2 and 3 c. 3
b. 2 d. 1, 2 and 3
2. Select which of the statements is true regarding user-defined functions and stored procedures?
a. A User Defined Function cannot call a stored procedure
b. A stored procedure cannot call a user defined function
c. A stored procedure must return a value
d. A user defined function supports both the input and output parameters
3. A CASE statement in SQL is equivalent to which of the following?
a. A way to use CASE based loop in SQL
b. A way to use IF-THEN-ELSE logic in SQL
c. A way to define CASE while creating tables
d. A way to use FUNCTIONS in SQL
4. In a database, a foreign key is-
a. A data element/attribute within a data field of a data record that is not unique and cannot be used to
distinguish one data record in a database from another data record within a database table.
b. A data element/attribute within a data field of a data record within a database table that is a secondary key in
another database table
c. A data element/attribute within a data field of a data record within a database table that is a primary key in
another table
d. A data element/attribute within a data field of a data record that enables a database to uniquely distinguish
one data record in a database from another data record within a database
5. Which of the following statements hold true for an OLAP system?
P) Data is stored in a normalized form
Q) OLAP systems are used for analytical purpose
R) Huge amount of data is stored in OLAP as compared to OLTP system but the query takes less time to execute
S) There is no restriction on data integrity
a. P,Q,S c. P,R,S
b. Q,R,S d. Q and S
6. The correct difference between Star Schema and Snowflake Schema is-
a. A snowflake schema has one central fact table surrounded by multiple dimension tables whereas a Star
Schema can have dimension tables that branch off into more such tables
b. A Star Schema has one central fact table surrounded by multiple dimension tables whereas a snowflake schema
can have dimension tables that branch off into more such tables
c. A snowflake schema makes querying data easier as compared to Star Schema
d. A Star Schema is more efficient in terms of data storage as compared to Snowflake Schema
7. Which of the following statements are TRUE about an SQL query?
P: A SQL query can contain a HAVING clause if it does not have a GROUP BY clause
Q: A SQL query can contain a HAVING clause only if it has a GROUP BY clause
R: All attributes used in the GROUP BY clause must appear in the SELECT clause
S: Not all attributes used in the GROUP BY clause need to appear in the SELECT clause
a. P and R c. P and S
b. Q and R d. Q and S
8. What will be the best data type definition for MySQL when a field is alphanumeric and has a fixed length?
a. VARCHAR c. CHAR
b. LONG d. Text
9. What are the features of Relational schema from the following?
1. Atomic
2. Isolated
3. Soft state
4. Consistent
a. 1, 2 and 3 c. 1, 2 and 4
b. 2 and 4 d. 3 and 4
10. Which of the following is not a DML statement?
a. CREATE TABLE MyGuests (id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY, firstname VARCHAR(30) NOT
NULL, lastname VARCHAR(30) NOT NULL, email VARCHAR(50))
b. INSERT INTO MyGuests (firstname, lastname, email) VALUES ("John", "Doe", "john@example.com")
c. UPDATE MyGuests SET lastname="Doe" WHERE id=2
d. DELETE FROM MyGuests WHERE id=3
11. For the following table 'sales_assistant':
id  first_name  sold_products
1   Manish      2400
3   Lakshay     2700
4   Manish      2700
5   Anand       2900
What will be the output of the following query?
SELECT RANK() OVER(ORDER BY sold_products DESC) AS r,
DENSE_RANK() OVER(ORDER BY sold_products DESC) AS dr,
first_name,
sold_products
FROM sales_assistant;

Ans:
r  dr  first_name  sold_products
1  1   Anand       2900
2  2   Lakshay     2700
2  2   Manish      2700
4  3   Manish      2400
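
The difference between the two ranking functions can be reproduced with a minimal plain-Python sketch. It is only an illustration, not how a database implements window functions; the function name rank_and_dense_rank is hypothetical, and the column is assumed to be named sold_products.

```python
# Rows from the sales_assistant table in the question: (first_name, sold_products).
rows = [("Manish", 2400), ("Lakshay", 2700), ("Manish", 2700), ("Anand", 2900)]

def rank_and_dense_rank(rows):
    """Return (rank, dense_rank, first_name, sold_products), ordered by sold_products DESC."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    result = []
    rank = dense = 0
    previous = None
    for position, (name, sold) in enumerate(ordered, start=1):
        if sold != previous:
            rank = position   # RANK jumps to the row position, leaving gaps after ties
            dense += 1        # DENSE_RANK advances by exactly one per distinct value
            previous = sold
        result.append((rank, dense, name, sold))
    return result

for row in rank_and_dense_rank(rows):
    print(row)
```

Tied rows share both ranks; only RANK skips the next value after a tie. The relative order of the two tied rows is not guaranteed in SQL.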

12. You are given a table that has an "id" column as shown below:
Id
1
2
3
4
5
6
7
8
9
What will be the output of the 5th Row for the query below?
SELECT SUM(id) OVER(ORDER BY id ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING)
FROM id
a. 42 c. 37
b. 28 d. 44
CLOUD AND AWS
13. What is the maximum number of files that can be stored in S3?
a. 65,536 c. 1,024
b. No limit d. 9,999
14. What is MapReduce in relation to Big Data architecture?
a. It’s a programming framework used by Hadoop to process Big Data
b. It is a function to convert the incoming data (stored in blocks) into key-value pairs.
c. It is a function to aggregate the values on the basis of the keys across the blocks in the cluster
d. There is no MapReduce in the Big Data Architecture.
15. Which of the following is not a factor used to identify Big Data?
1. Velocity
2. Variety
3. Veracity
4. Volume
a. 1 and 3 c. 3
b. 2 d. None of the above
16. Out of the following which is NOT an IAM user?
a. Sudo User c. Privileged administrators
b. End users d. Programmatic users
17. Suppose your company wants to move its computing infrastructure to the cloud but does not want to make a huge
upfront investment. Among the following models, which one would be the most cost-effective option for your
company?
a. Community cloud model c. Public cloud model
b. Private cloud model d. None of them
18. Which of the following implements an operating system level virtualization?
a. Type 1 virtualization c. Type 2 virtualization
b. Bare metal virtualization d. Containerization
19. Consider an application that must be run on four EC2 instances. Out of the four EC2 instances, two of the EC2
instances execute mission-critical software and need to be run all the time. The third EC2 instance hosts the web
server, which gets loaded only when the customer accesses it from time to time; customer uptime needs to be
maintained at 100%. Finally, the last EC2 instance runs a background job that collates the logs from time to time.
Which would be a cost-effective combination of instances for this purpose?
a. Three on-demand instances and one spot instance
b. One reserved instance with a partial upfront payment and three spot instances
c. Two reserved instances with an upfront payment, one on-demand instance and one spot instance
d. Four on-demand instances
20. Let’s say your organization wants to move its current computing infrastructure to the cloud. You have been assigned to
assess the difference between the IaaS model and the PaaS model. Among the following options, which one would you
recommend as an advantage of the IaaS model over the PaaS model?
a. IaaS model offers reduced maintenance from the user end
b. IaaS model usually provides more flexibility in selecting the underlying infrastructure
c. IaaS model removes the complexity of setting up, configuring and managing infrastructure such as hardware
and operating systems.
d. All of the above
21. Suppose you have been using services of a cloud service provider for a few years and now you want to move your
current cloud infrastructure from the present cloud provider to another. Which of the following characteristics of
cloud always allows you to do so efficiently and cost effectively?
a. Multi-tenancy c. On-Demand Self-Service
b. Infrastructure as Code (IaC) d. Rapid Elasticity
22. Suppose an organization wants to use computer clusters for a complex project. There are two ongoing projects in the
organization: Project A, where they need 10 computers to train all ML models on weekdays; and Project B, where
they need 5 computers to train the ML models on weekends. What would be a more efficient way to use these
computers?
a. The organization should set up 10+5=15 computers for both the project needs separately
b. The organization should set up max(10,5)=10 computers so that these computers can be shared between
projects
c. The organization should set up just 5 computers
d. None of the above
23. Suppose your organization has a set of web applications that get a highly varying incoming traffic along with a
sensitive image catalog of 80 petabytes, which will be used by these applications. If your organization needs to move
to the cloud, which of the following would be a cost-effective method to achieve this? (consider the fact that the
public cloud is a cheaper alternative but can have privacy issues)
a. Use a hybrid cloud model, maintain the image catalog in a private cloud and move the web applications to the
public cloud
b. Use a hybrid cloud model, maintain the web applications in a private cloud and transfer the images to the
public cloud
c. Put everything in a public cloud
d. Maintain everything in a private cloud.
24. Which of the following versions of Hadoop is capable of running both Spark as well as MapReduce based applications?
a. Hadoop Version 1 c. Hadoop Version 2
b. Hadoop Version 1.2 d. All of the above
25. Which of the following is the best approach to determine the number of partitions that are created while storing an
input file in Hadoop?
a. It is not possible to determine the number of partitions in Hadoop
b. It can be determined by running the getNumPartitions() function in Hadoop
c. It can be determined by running the partition.length() function in Hadoop
d. It can be determined by dividing the file size by block size in Hadoop
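
The division mentioned in the last option can be written out as a small sketch. The 128 MB default block size and the function name hdfs_block_count are assumptions made for illustration; a real cluster may be configured with a different block size.

```python
def hdfs_block_count(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Number of blocks needed: file size divided by block size, rounded up."""
    # Integer ceiling division; the last block may be only partially filled.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

one_gb = 1024 ** 3
print(hdfs_block_count(one_gb))              # 1 GB file, assumed 128 MB blocks
print(hdfs_block_count(300 * 1024 * 1024))   # 300 MB file, last block is partial
```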
26. For the execution of a task in Hadoop 2.0, which of these events occur before node manager launches containers to
host the data processing tasks?
a. Resource Manager launches a container to host the application master
b. The containers in the node manager execute the assigned tasks
c. Application master releases its container
d. The output produced by each task is assembled and the final job run status is reported to the client.
27. Which of the following is the default replication factor applied to a file in HDFS location?
a. 3 c. 2
b. 1 d. None of the above
28. Which of the following statements are false for Yet Another Resource Negotiator (YARN)?
a. The resource manager tracks the resource usage in a node
b. A node manager tracks the resource usage in a node
c. Once a node receives a job, corresponding application master(s) are launched to execute that job
d. The application master(s) negotiate with the resource manager for the containers to execute the task.
29. Is YARN a replacement for the HADOOP framework?
a. Yes b. No
30. Mr. Bean is working on Hadoop MapReduce programming, but what he wants is a sorted output from the reducer. In
order to achieve this, he is thinking of sorting the output while ingesting it as an input to the reducer itself. Which of
the following would be the best possible option to achieve this?
a. It can be achieved by sorting the data in mapper class, so that output produced by mapper would be sorted
b. It can be achieved by sorting the data in reducer class, so that input taken by reducer would get sorted
c. You cannot change the internal functionality of Hadoop MapReduce programming
d. This is an inbuilt property that is already available in Hadoop MapReduce Programming
SPARK
31. Mr. Bean has received the following requirement from a client after loading an input file in Spark. If someone wants
to perform some analysis, for example, aggregation of columns, then they should be aware of the column names in
the first step itself. However, since the file size is huge, it is not possible to determine whether or not the file
contains a specific column. Which of the following methods should Mr. Bean use for loading the input file in this
case, so that the column names can be determined even without opening the file?
a. Input_file=sc.textFile("<path to input file>")
b. Input_file=spark.read.load("<path to input file>", format="csv", inferSchema="True", header="True")
c. It is not possible to infer the schema of the file without opening it at all
d. Input_file=spark.read.load("<path to input file>", format="csv")
32. Mr. Bean wants to store a data file in a particular format so that he can run the following set of queries:
Select * from employee where country='USA';
Select * from employee where age<15;
Select * from employee where age>60;
Which of the following formats will satisfy this requirement?
a. Text file format c. Avro file format
b. Sequence file format d. Parquet file format
33. Which of the following is the main reason(s) why Spark has taken over Hadoop in today’s era?
a. Spark does everything in memory, whereas Hadoop does everything using hard disk (commodity-grade
hardware)
b. Spark provides the flexibility to automatically depict the schema of a file, whereas this is not possible with
Hadoop
c. Spark is 10 times faster than Hadoop MapReduce
d. All of the above
34. You are analysing a Spark program and identify that it is taking more than the expected time to execute. The reason
for this issue is that it is recreating some DataFrames repeatedly for processing the other DataFrames. Spark allows
you to avoid this by storing a DataFrame in memory so that Spark does not need to recreate it. Which strategy would
you use here to store the DataFrame in memory?
a. Add checkpoints to store DataFrames in HDFS
b. Cache the DataFrame that has been used multiple times
c. Create temp tables of DataFrame
d. Merge all the data frames and combine all queries in a single DataFrame query
35. What does the code given below signify in PySpark?
lines = sc.textFile("<path to input file, where file actually exists>")
Output = lines.map(lambda x: (x.split(" ")[0], x))
a. Splitting the lines of a file based on the space between words and retaining only the first word out of the given
line
b. Splitting the lines of a file based on the space and retaining all words except the first word out of the given line
c. Creating a paired RDD, with the first word as the key and the line as the value
d. Creating a paired RDD, with the first word as the value and the line as the key
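
What the lambda in this question produces can be checked without a Spark cluster. The sketch below applies the same lambda with Python's built-in map over two hypothetical sample lines; it is a plain-Python stand-in, not PySpark itself.

```python
# Hypothetical sample lines standing in for the contents of the input file.
lines = ["spark makes rdds", "hadoop stores blocks"]

# Same lambda as in the question: each line becomes a 2-tuple whose
# first element is the line's first word and whose second element is the line.
pairs = list(map(lambda x: (x.split(" ")[0], x), lines))
print(pairs)
```

In Spark, an RDD of 2-tuples like this is treated as a paired RDD, with the first element of each tuple acting as the key.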
36. While performing word count examples using Spark, Mr. Bean wants to split every line on the basis of whitespace
and create an RDD of words out of it. What could be the best possible option to achieve the same?
a. Map c. Filter
b. FlatMap d. ReduceByKey
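
The contrast between a map-style and a flatMap-style split can be sketched in plain Python. This is an analogy under the assumption of two hypothetical sample lines; it does not use the Spark API.

```python
from itertools import chain

lines = ["to be or", "not to be"]  # hypothetical sample lines

# map-style: one output element per input line, so the result is a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-style: the per-line lists are flattened into a single sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```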
37. Which of the following methods can be used to convert a Spark RDD into a Spark DataFrame?
a. RDD.createDF()
b. RDD.convertDF()
c. RDD.toDF()
d. It is not possible to convert an RDD into a DataFrame as RDD does not contain a schema, while DataFrame
contains a schema
38. Which of the following statements is/are correct regarding dataframes?
I. Media content like images and videos should be processed with unstructured APIs.
II. When the data schema is not defined, DataFrames should be used over RDDs.
III. Structured APIs have libraries built on top of them to allow writing code more easily
IV. MapReduce-style commands in RDDs give better control to analysts over how a particular job should be done
V. DataFrames have in-memory processing capabilities as they are built on top of RDDs and, therefore the properties
are inherited
a. I, II, IV, V c. I, II, IV
b. I, III, IV, V d. I, II, IV, V
39. Look at the summarized Spark DataFrame named "df".
root
|-- Rank: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Platform: string (nullable = true)
|-- Year: string (nullable = true)
|-- Genre: string (nullable = true)
|-- Publisher: string (nullable = true)
|-- NA_Sales: double (nullable = true)
|-- EU_Sales: double (nullable = true)
|-- JP_Sales: double (nullable = true)
|-- Other_Sales: double (nullable = true)
|-- Global_Sales: double (nullable = true)
You need to find the genre of game which is most popular in the Other_Sales category. Below is a set of commands;
you need to choose the correct commands and the order in which they should appear to give the output mentioned
below. For example, if you choose commands 1, 4, and 6 and they should appear in the order 4, 6 and 1, the answer
will be 4-6-1. Assume all the required libraries have already been imported.
1. P_Genre = spark.sql("SELECT Genre, SUM(Other_Sales) FROM table GROUP BY Genre ORDER BY
SUM(Other_Sales) DESC").head(0)
2. P_Genre = spark.sql("SELECT Genre, SUM(Other_Sales) FROM table GROUP BY Genre ORDER BY
SUM(Other_Sales) DESC").head(1)[0].asDict()
3. df.createTempView("table")
4. P_Genre['Genre']
5. df.createOrReplaceTempView("table")
a. 5-3-1 c. 5-2-4
b. 5-3-2 d. 3-1-4
40. There is a huge CSV file in terabytes and you have to process it. You can pre-process it and convert it into any format
to reduce the size, as there are storage constraints. Which strategy will you apply to reduce the file size? Which
strategy would you use to ensure the lowest memory consumption possible?
a. Convert the CSV file to a JSON file to reduce the file size
b. Apply the gzip compression technique on the CSV file
c. Convert the CSV to parquet format with snappy compression on it
d. CSV files cannot be reduced further by applying any compression technique
41. Which of the following statements is/are correct?
1. The Pandas API on Spark uses the concept of eager execution to accelerate the data analysis process.
2. The Pandas API on Spark runs over multiple nodes on Spark
3. A broadcast hash join is preferred when both the datasets to be joined are of very large sizes.
4. If skew joins are not enabled in Spark AQE, the larger partitions take much longer to be processed which makes
the entire operation slower.
a. 1, 4 c. 2, 4
b. 2, 3, 4 d. 2, 3
42. Suppose Mr. X is writing code for calculating word count on PySpark, which is given below. However, he realized that
nothing was getting printed to the console. Which of the following is a possible reason for having no output on
execution?
input_file=sc.textFile("<path for the input file, where file actually exists>")
words=input_file.flatMap(lambda line: line.split(" "))
count=words.map(lambda word: (word,1)).reduceByKey(lambda x,y:x+y)
a. The syntax for reading the input file is incorrect
b. No action has been called yet; all are transformations
c. The line containing the flat map operation is causing problems, as it is not receiving any input in the required
format
d. The third line is incorrect since grouping based on key-value pairs is not … RDD
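
Spark evaluates transformations lazily and runs them only when an action is called. A rough plain-Python analogy, using the lazy built-in map rather than Spark itself, shows the same effect; the sample lines and the names split_line and calls are hypothetical.

```python
calls = []  # records every time the splitting function actually runs

def split_line(line):
    calls.append(line)
    return line.split(" ")

lines = ["a b", "c d"]  # hypothetical sample lines

# "Transformation": building the lazy map object runs nothing yet.
pending = map(split_line, lines)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces the work to happen.
result = list(pending)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action)
```

In PySpark the consuming step would be an action such as collect(), count() or saveAsTextFile().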
43. Suppose you want to calculate the average score of each player for four matches. Which of the following functions
correctly calculates the value in the RDD ‘avg_score’?

a. map() c. flatMap()
b. reduce() d. mapValues()
MCQs [Paper-II]
44. Which of the following DDL statements holds true for a MySQL DB?
a. CREATE TABLE upgrad.Product
(
ProductID int PRIMARY KEY DEFAULT 1,
ProductName VARCHAR(100),
OrderID NUMERIC,
OrderDate DATETIME
)
b. CREATE TABLE upgrad.Product
(
ProductID int PRIMARYKEY,
ProductName VARCHAR(100),
OrderID NUMERIC(1, 10),
OrderDate DATETIME
)
c. CREATE TABLE upgrad.Product
(
ProductID int PRIMARY KEY,
ProductName VARCHAR(100),
OrderiD NUMERIC(10,2),
OrderDate DATETIME
)
d. CREATE TABLE upgrad.Product
(
ProductID int PRIMARY KEY,
ProductName VARCHAR(100),
OrderiD NUMERIC(10,2) REFERENCES upgrad.Product(ProductID),
OrderDate DATETIME
)
45. Choose the correct options for the given statements
Statement 1: A maximum cardinality is the maximum number of entity instances that can participate in a
relationship instance.
Statement 2: An identifier determines the type of relationship that an entity has
Statement 3: A disadvantage of a Relational schema is that it’s not horizontally scalable
a. Statement 1 is True while Statement 2 is False
b. Statement 2 is True while Statement 3 is False
c. Statement 3 is True while Statement 1 is False
d. Statement 3 is True while Statement 2 is False
46. Which of the following order of SQL statements is correct?
a. FROM, SELECT, WHERE, GROUP BY, ORDER BY
b. SELECT, FROM, GROUP BY, HAVING, ORDER BY
c. FROM, JOIN, WHERE, WINDOW, ORDER BY
d. FROM, JOIN, GROUP BY, HAVING, WINDOW, ORDER BY
47. You are given two tables: “Student” and “Branch”
Student         Branch
student_id      branch_id
student_name    branch_name
marks_range
year
branch_id
Select the query(s) from the following options that will print the names of the students and their respective years
who belong to the Electrical Engineering branch.
a. SELECT student_name, year
FROM Student a
RIGHT JOIN Branch b
ON a.branch_id=b.branch_id
WHERE branch_name='Electrical Engineering';
b. SELECT student_name, year
FROM Student a
LEFT JOIN Branch b
ON a.branch_id=b.branch_id
WHERE branch_name='Electrical Engineering';
c. SELECT student_name, year
FROM Student
LEFT JOIN Branch
USING branch_id
WHERE branch_name='Electrical Engineering';
d. SELECT student_name, year
FROM Student
LEFT JOIN Branch
USING (branch_id)
WHERE branch_name='Electrical Engineering';
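
A join of this shape can be tried locally with Python's built-in sqlite3 module. The sketch below is only an illustration: it uses SQLite rather than MySQL, the sample rows are hypothetical, the marks_range column is omitted, and Student is assumed to carry a branch_id column as the queries imply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Branch (branch_id INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE Student (
        student_id INTEGER PRIMARY KEY,
        student_name TEXT,
        year INTEGER,
        branch_id INTEGER
    );
    -- Hypothetical sample rows, not taken from the question.
    INSERT INTO Branch VALUES (1, 'Electrical Engineering'), (2, 'Civil Engineering');
    INSERT INTO Student VALUES (10, 'Asha', 2, 1), (11, 'Ravi', 3, 2), (12, 'Meera', 1, NULL);
""")

# LEFT JOIN with a parenthesised USING column list.
query = """
    SELECT student_name, year
    FROM Student
    LEFT JOIN Branch USING (branch_id)
    WHERE branch_name = 'Electrical Engineering'
"""
matches = conn.execute(query).fetchall()
print(matches)
conn.close()
```

The WHERE filter on branch_name discards students with no matching branch, so with this filter the LEFT JOIN returns the same rows an inner join would.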
48. Which of the following could be a part of job execution in Spark? (Multiple options might be correct)
a. Tasks c. Stages
b. Mapper d. Reducer
49. Which of the following is the best possible option to update a file in HDFS? (Multiple options might be correct)
a. Hadoop fs -update <selection_condition> <updation to be done>
b. There is no direct option of updating a file in Hadoop
c. Fetch the file from Hadoop to local, and then update it and store it back to the HDFS location
d. Fetch the file from Hadoop to local, and then update it. Finally, store it forcefully back in the same location in
Hadoop.
50. Which of the following is a characteristic/benefit of cloud computing?
a. Rapid elasticity and scalability
b. On-demand self-service
c. Resource pooling
d. Access only over a peer-to-peer network connection (a peer-to-peer (P2P) network is created when two or
more PCs are connected and share resources without going through a separate server computer)
51. Which of the following statements about Spark and MapReduce is true? (More than one option may be correct)
a. Spark is preferred to MapReduce for processing numerous small files, as it will reduce the overhead in multiple
read and write operations.
b. MapReduce can be more cost-effective than Spark for an extremely large dataset that does not fit in the spark
memory
c. MapReduce is preferred to Spark for iterative processing, as it is much faster than Spark as it can carry out in-
memory computation
d. Spark is preferred to MapReduce to create live dashboards, as Spark’s processing speed is much faster than
that of MapReduce
52. Which statements about type support are true? (More than one option may be correct)
a. It allows conversions from core Pandas to Spark dataframes
b. During type support, the data types need to be physically converted into the appropriate data types
c. When using the type support property of dataframes, the data types are converted automatically to the
appropriate types
d. It allows lazy execution.
53. Which of the following statements is true? (Multiple options may be correct)
a. Virtual Machines load only the required libraries for an application to run.
b. Containers load only the required libraries for an application to run.
c. Virtual Machines are heavier than containers as they load the complete operating system for deploying the
application.
d. Containers are heavier than virtual machines as they load the complete operating system for deploying the
application.
