Linear Regression
1. Feature engineering is an important step in any model building exercise. It is the process of creating new features
from a given data set using the domain knowledge to leverage the predictive power of a machine learning model.
Which of the following statements are correct?
Statement 1: Feature engineering techniques are applied before train test split.
Statement 2: There is no difference between standardization and normalization.
Statement 3: Mean encoding is a feature engineering technique for handling categorical features.
a. Only 1 and 2 c. Only 2 and 3
b. Only 1 d. Only 3
2. VIF is used to detect Multicollinearity. Which of the following statements is NOT true for VIF?
a. The VIF has a lower bound of 0
b. The VIF has no upper bound
c. VIF for a variable generally changes if you drop one of the predictor variables
d. If a variable is a product of two other variables, it can have a high VIF
3. The distribution of error terms in a linear regression model should look like (the horizontal line represents y=0):
a. A c. B
b. C d. D
4. For the same dependent variable Y, two models were created using the independent variables X1 and X2. The
following graphs show the fitted line on each scatterplot (both graphs are on the same scale). Which of the
following is true about the residuals in these two models?
a. The sum of residuals in model 2 is higher than model 1
b. The sum of residuals in model 1 is higher than model 2
c. Both have the same sum of residuals
d. Nothing can be said about the sum of residuals from
the given graph
5. You built a simple linear regression model on a problem statement provided by the client. After a few days, the client
asks you to build a new model with an increased number of data points (old dataset + new data points). The count of
new data points exceeds old data points by 20%.
Which of the following statements is TRUE regarding the mean of residuals?
a. Mean of residuals of old model > Mean of residuals of new model
b. Mean of residuals of old model < Mean of residuals of new model
c. Mean of residuals of old model = Mean of residuals of new model
d. Information provided is not enough to comment on the mean of residuals
6. A scatterplot was plotted for two variables – age and income to find out how the income depends on the age of a
person. It was found that as the income increases linearly with age, the variability in income also increases. This is a
violation of which of the following assumptions of linear regression?
a. Homogeneity c. Heterogeneity
b. Homoscedasticity d. Linearity
7. RFE method is used for:
a. Dummy variable creation c. Detecting multicollinearity
b. Feature selection d. Univariate regression
8. Which of the following assumptions do we make while building a simple linear regression model? (Assume X and y to be
independent and dependent variables respectively.)
A. There is a linear relationship between X and y
B. X and y are normally distributed
C. Error terms are independent of each other
D. Error terms have constant variance
a. A, B, C and D c. A, C and D
b. A, B and C d. B, C and D
9. A client approached you with a problem statement. You decided to build a multiple linear regression model on the
dataset provided. The dataset consisted of 40 features. Obviously, all features will not be significant. Selecting the
relevant features manually will be a tougher task. You can use RFE to select relevant features. RFE is an automated
feature selection technique. Initially, you assumed 25 features can explain your whole data.
Which of the following commands correctly calls the RFE technique in Python? (Here “lm” is the fitted instance of
multiple linear regression model)
a. from stastmodel.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.fit(X_train,y_train)
b. from sklearn.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.predict(X_train,y_train)
c. from sklearn.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.fit(X_train,y_train)
d. from RFE import feature_selection
rfe=RFE(lm,25)
rfe=rfe.predict(X_train,y_train)
10. Suppose that on adding a new predictor variable to a linear regression model (model-1), the adjusted r-squared of the
new model (model-2) decreases. Choose the correct statement:
a. The r-squared of model-2 will be less than that of model 1
b. The r-squared of model-2 increases, but the complexity of model-2 also increases
c. The r-squared of model-2 decreases, but the complexity of model-2 also increases
d. Nothing can be said about the r-squared of model-2
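The relationship between r-squared and adjusted r-squared can be sketched numerically. This is a minimal illustration using the standard formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1); the sample size and the two r-squared values are hypothetical, not taken from the question.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical numbers: adding a 4th predictor nudges R-squared up only slightly.
n = 50
r2_model_1, r2_model_2 = 0.700, 0.701
adj_model_1 = adjusted_r2(r2_model_1, n, n_predictors=3)
adj_model_2 = adjusted_r2(r2_model_2, n, n_predictors=4)

print(round(adj_model_1, 4), round(adj_model_2, 4))
```

In this sketch the plain r-squared rises while the adjusted value falls, because the small gain in fit does not offset the penalty for the extra predictor.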
11. Some of the independent variables (predictors) might be interrelated, due to which the presence of a particular
independent variable in the model is redundant. This phenomenon is called Multicollinearity.
Suppose that you are building a multiple linear regression model for a given problem statement, which of the
following statements is TRUE w.r.t. multicollinearity?
a. Multicollinearity is a problem when your only goal is to predict the independent variable from the set of
dependent variables
b. Multicollinearity is a problem when your goal is to infer the effect on the dependent variable due to
independent variable.
c. Multicollinearity is not a problem if a variable is not collinear with your variable of interest
d. Multicollinearity is not a problem if there are multiple dummy(binary) variables that represent a categorical
variable with three or more categories
12. The coefficient of determination between a dependent variable and an independent variable is 0.47. This denotes
that:
a. The relationship between the two variables is not strong
b. The correlation coefficient between the two variables is also 0.47
c. 47% of the variance in the independent variable is explained by the dependent variable
d. 47% of the variance in the dependent variable is explained by the independent variable
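For a simple linear regression with one predictor, the coefficient of determination equals the square of the Pearson correlation coefficient. A short sketch of that arithmetic, assuming exactly one independent variable:

```python
import math

r_squared = 0.47
# With a single predictor, R-squared = r**2, so |r| is its square root.
abs_correlation = math.sqrt(r_squared)

print(round(abs_correlation, 3))
```

The sign of the correlation cannot be recovered from the coefficient of determination alone.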
13. While solving linear regression, the dependent variable is-
a. Numeric c. Categorical
b. Dummy coded d. Binary
14. Consider the following two assumptions for a simple linear regression model. (Assume X and y to be independent and
dependent variables respectively).
Statement 1: There is a linear relationship between X and y
Statement 2: X and y are normally distributed
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
15. What does standardized scaling do?
a. Bring all data points in the range 0 to 1
b. Bring all data points in the range -1 to 1
c. Bring all the data points in a normal distribution with mean 0 and standard deviation 1
d. Bring all the data points in a normal distribution with mean 1 and standard deviation 0
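The difference between standardized scaling and min-max scaling can be sketched as follows. The feature values are hypothetical, and the population standard deviation is assumed for the z-score.

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature column

# Standardized scaling (z-score): subtract the mean, divide by the standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lowest, highest = min(values), max(values)
min_max_scaled = [(v - lowest) / (highest - lowest) for v in values]

print(standardized)
print(min_max_scaled)
```

The standardized column has mean 0 and standard deviation 1. Note that the rescaling is linear, so it does not by itself change the shape of the distribution.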
16. In the linear regression, F-statistic is used to determine-
a. The significance of the individual beta coefficient
b. The variance explanation strength of the model
c. The significance of the overall model fit
d. Both A and C
17. Suppose you regress one of the feature variables, T, on all the remaining feature variables. The R-squared
of this model was found to be 0.8. What will be the VIF for the variable T?
a. 1.56 c. 2.77
b. 3.33 d. 5.00
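The VIF calculation uses the standard definition VIF = 1 / (1 - R²), where R² comes from regressing the variable on the remaining predictors. A minimal sketch of the arithmetic:

```python
def vif(r_squared):
    """Variance inflation factor from the R-squared of the auxiliary regression."""
    return 1.0 / (1.0 - r_squared)

vif_t = vif(0.8)
print(round(vif_t, 2))
```

Because R² lies between 0 and 1, this expression is never smaller than 1 and grows without bound as R² approaches 1.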
18. Which of the following is true regarding the error terms in linear regression?
a. The sum of residuals should be zero
b. The sum of residuals should be lesser than zero
c. The sum of residuals should be greater than zero
d. There is no such restriction on what the sum of residuals should be
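The behaviour of the residual sum can be checked with a small least-squares fit. This is a sketch in plain Python with hypothetical data, assuming ordinary least squares with an intercept term.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical responses

intercept, slope = fit_simple_ols(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

print(residual_sum)
```

Up to floating-point rounding, the residuals of such a fit sum to zero regardless of how many data points are used.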
Clustering
38. In hierarchical clustering, the shortest distance and the maximum distance between points in two clusters are
defined as ………. and ………….. respectively.
a. Single linkage and complete linkage c. Complete linkage and single linkage
b. Single linkage and average linkage d. Complete linkage and average linkage
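The linkage definitions can be illustrated on two tiny clusters. The points below are hypothetical, and Euclidean distance is assumed as the dissimilarity measure.

```python
from itertools import product
from math import dist

cluster_a = [(0.0, 0.0), (1.0, 0.0)]  # hypothetical points
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

# All pairwise distances between a point in A and a point in B.
pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]

single_linkage = min(pairwise)                   # closest pair
complete_linkage = max(pairwise)                 # farthest pair
average_linkage = sum(pairwise) / len(pairwise)  # mean over all pairs

print(single_linkage, complete_linkage, average_linkage)
```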
39. Which of the following statements is NOT true?
a. Each time the clusters are made during the K-means algorithm, the centroid is updated.
b. The cluster centres that are computed in the K-means algorithm are given by centroid value of the cluster
points
c. Standardization of the data is not important before applying Euclidean distance as a measure of
similarity/dissimilarity
d. The centroid of a column with data points 25, 32, 34 and 23 is 28.5.
e. The Euclidean distance between two points (10,2) and (4,5) is 7.
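The two numeric claims in the options above can be checked directly. A short sketch using only the Python standard library:

```python
from math import dist
from statistics import mean

# Centroid of a single column is the arithmetic mean of its values.
centroid = mean([25, 32, 34, 23])

# Euclidean distance: square root of the sum of squared coordinate differences.
distance = dist((10, 2), (4, 5))

print(centroid, round(distance, 3))
```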
40. Initializing the following command in Python will result in the following:
model_clus= KMeans(n_clusters=6, max_iter=50)
a. Run maximum 6 iterations c. Run maximum 40 iterations
b. Create 6 final clusters d. Create 50 final clusters
41. Which of the following is not true for the Hopkins statistic?
a. The Hopkins statistic decides if the data is suitable for clustering or not
b. The Hopkins statistic lies between -1 and 1
c. If the Hopkins statistic comes out to be 0, then the data is uniformly distributed
d. If the Hopkins statistic comes out to be 1, then the data is highly suitable for clustering
42. Consider the two statements-
Statement 1: The distance between 2 clusters is the maximum distance between 2 points in the clusters in
complete linkage.
Statement 2: Most of the time Complete linkage will produce unstructured dendrograms.
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
43. A client has approached you for a problem statement that requires the use of clustering. You decided to model the
problem statement with hierarchical clustering. Consider a dataset having ‘n’ data points.
Which of the following statements is true for the above problem statement?
a. ‘n*n’ distance matrix should be calculated for the mentioned problem statement
b. Initially ‘n’ clusters are formed for the mentioned problem statement
c. The output of the problem statement above is a dendrogram
d. All the above
44. Silhouette metric for any ith point is given by S(i) = (b(i) - a(i)) / max(a(i), b(i))
Which of the following is not true about the Silhouette metric?
a. b(i) is the average distance from the nearest neighbour cluster (Separation)
b. a(i) is the average distance from own cluster (Cohesion).
c. If S(i) = 1 the data point is similar to its own cluster.
d. Silhouette metric ranges from 0 to +1
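The silhouette formula above can be evaluated for individual points. The cohesion and separation values here are hypothetical inputs chosen for illustration.

```python
def silhouette(a, b):
    """Silhouette value of one point.

    a: average distance to points in its own cluster (cohesion)
    b: average distance to points in the nearest other cluster (separation)
    """
    return (b - a) / max(a, b)

well_placed = silhouette(a=2.0, b=8.0)  # far closer to its own cluster
misplaced = silhouette(a=8.0, b=2.0)    # closer to the neighbouring cluster

print(well_placed, misplaced)
```

The second call shows that the value becomes negative when a point sits closer to a neighbouring cluster than to its own.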
45. Clustering is used to identify the below-
a. Data distribution c. Correlation among the data points
b. Principal components d. Subgroups in the data
46. For a K-means clustering process, the Hopkins statistic for the dataset came out to be 0.8. Hence the dataset is-
a. Suitable for clustering c. Not suitable for clustering
b. Can’t say from the given information d. None of the above
47. For a K-means clustering process, the Hopkins statistic for the dataset came out to be 0.3. Hence the dataset is-
a. Suitable for clustering c. Not suitable for clustering
b. Can’t say from the given information d. None of the above
48. You observed the following dendrogram after performing K-means clustering on a dataset. Which of the following
statements can be concluded from this dendrogram?
49. Refer to the dendrogram image below and answer the question that follows:
Find the number of clusters formed if the dendrogram is cut at 0.25. (Assume agglomerative clustering method)
a. 6 c. 11
b. 13 d. 15
Decision Tree
50. Which of the following is the correct sampling technique that is used by a random forest model to overcome the
problem of overfitting?
a. Random sampling c. Bootstrapping
b. Oversampling d. Stratified sampling
51. Which of the following metrics measures how often a randomly chosen element would be incorrectly identified?
a. Entropy c. Information Gain
b. Gini Index d. None of these
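Gini impurity can be computed from class proportions as 1 minus the sum of squared proportions. A minimal sketch; the class labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Probability that a randomly drawn element is mislabelled when labels
    are assigned at random according to the class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure_node = gini_impurity(["setosa"] * 4)
mixed_node = gini_impurity(["setosa", "setosa", "virginica", "virginica"])

print(pure_node, mixed_node)
```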
52. Which of the following is true for weight of evidence (WoE) analysis?
a. It helps in finding the different predictive patterns for the different segments that might be present in the data
b. WoE helps in treating missing values for both continuous and categorical variables
c. WoE values should follow an increasing or decreasing trend across bins.
d. All of the above
53. Refer to the decision tree given below and choose the statement that is correct as per this tree.
a. The tree given above will show very good performance on the train data
b. The tree given above is an underfitting tree.
c. If the petal length is more than 2.45, then it is equally likely that the flower is either setosa or virginica.
d. Both B and C
54. Suppose you train a decision tree with the following data. Which feature should we split on at the root?
X Y Z V
T T F 1
F F F 0
T T T 0
F T T 1
a. X c. Y
b. Z d. Cannot be determined
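One way to compare candidate root splits is to compute the information gain of each feature on the four rows above. This sketch assumes entropy-based information gain as the splitting criterion and treats V as the target; other criteria such as Gini could be substituted.

```python
from collections import Counter
from math import log2

rows = [
    {"X": "T", "Y": "T", "Z": "F", "V": 1},
    {"X": "F", "Y": "F", "Z": "F", "V": 0},
    {"X": "T", "Y": "T", "Z": "T", "V": 0},
    {"X": "F", "Y": "T", "Z": "T", "V": 1},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target="V"):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    parent = entropy([r[target] for r in rows])
    weighted_children = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        weighted_children += len(subset) / len(rows) * entropy(subset)
    return parent - weighted_children

gains = {f: information_gain(rows, f) for f in ("X", "Y", "Z")}
best_feature = max(gains, key=gains.get)

print(gains, best_feature)
```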
55. Select the correct option based on the following decision tree.
12. You are given a table that has an "id" column as shown below:
Id
1
2
3
4
5
6
7
8
9
What will be the output of the 5th Row for the query below?
SELECT SUM(id) OVER(ORDER BY id ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING)
FROM id
a. 42 c. 37
b. 28 d. 44
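The window frame ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING can be emulated in plain Python to see which rows contribute to each sum. This is an illustrative sketch of the frame semantics, not a database query.

```python
ids = list(range(1, 10))  # the id column, 1..9, already in ORDER BY id order

def frame_sum(values, row_index, preceding=2):
    """Sum from `preceding` rows before the current row to the last row."""
    start = max(0, row_index - preceding)  # the frame is clipped at the first row
    return sum(values[start:])

window_sums = [frame_sum(ids, i) for i in range(len(ids))]
fifth_row = window_sums[4]  # row 5 covers ids 3 through 9

print(window_sums)
```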
CLOUD AND AWS
13. What is the maximum number of files that can be stored in S3?
a. 65,536 c. 1,024
b. No limit d. 9,999
14. What is MapReduce in relation to Big Data architecture?
a. It’s a programming framework used by Hadoop to process Big Data
b. It is a function to convert the incoming data (stored in blocks) into key-value pairs.
c. It is a function to aggregate the values on the basis of the keys across the blocks in the cluster
d. There is no MapReduce in the Big Data Architecture.
15. Which of the following is not a factor used to identify Big Data?
1. Velocity
2. Variety
3. Veracity
4. Volume
a. 1 and 3 c. 3
b. 2 d. None of the above
16. Out of the following which is NOT an IAM user?
a. Sudo User c. Privileged administrators
b. End users d. Programmatic users
17. Suppose your company wants to move its computing infrastructure to cloud but does not want to make a huge
upfront investment. Among the following models, which one would be the most cost-effective option for your
company?
a. Community cloud model c. Public cloud model
b. Private cloud model d. None of them
18. Which of the following implements an operating system level virtualization?
a. Type 1 virtualisation c. Type 2 virtualization
b. Bare metal virtualization d. Containerization
19. Consider an application that must be run on four EC2 instances. Out of the four EC2 instances, two of the EC2
instances execute mission-critical software and need to be run all the time. The third EC2 instance hosts the web
server, which gets loaded only when the customer accesses it from time to time; customer uptime needs to be
maintained at 100%. Finally, the last EC2 instance runs a background job that collates the logs from time to time.
Which would be a cost-effective combination of instances for this purpose?
a. Three on-demand instances and one spot instance
b. One reserved instance with a partial upfront payment and three spot instances
c. Two reserved instances with an upfront payment, one on-demand instance and one spot instance
d. Four on-demand instances
20. Let’s say your organization wants to move its current computing infrastructure to cloud. You have been assigned to
assess the difference between the IaaS model and the PaaS model. Among the following options, which one would you
recommend as an advantage of the IaaS model over the PaaS model?
a. IaaS model offers reduced maintenance from the user end
b. IaaS model usually provides more flexibility in selecting the underlying infrastructure
c. IaaS model removes the complexity of setting-up, configuring and managing infrastructures such as hardware
and operating systems.
d. All of the above
21. Suppose you have been using services of a cloud service provider for a few years and now you want to move your
current cloud infrastructure from the present cloud provider to another. Which of the following characteristics of
cloud always allows you to do so efficiently and cost effectively?
a. Multi-tenancy c. On-Demand Self-Service
b. Infrastructure as Code(IaC) d. Rapid Elasticity
22. Suppose an organization wants to use computer clusters for a complex project. There are two ongoing projects in the
organization: Project A, where they need 10 computers to train all ML models on weekdays, and Project B, where
they need 5 computers to train the ML models on weekends. What would be a more efficient way to use these
computers?
a. The organization should set up 10+5=15 computers for both the project needs separately
b. The organization should set up max(10,5)=10 computers so that these computers can be shared between
projects
c. The organization should set up just 5 computers
d. None of the above
23. Suppose your organization has a set of web applications that get a highly varying incoming traffic along with a
sensitive image catalog of 80 petabytes, which will be used by these applications. If your organization needs to move
to the cloud, which of the following would be a cost-effective method to achieve this? (consider the fact that the
public cloud is a cheaper alternative but can have privacy issues)
a. Use a hybrid cloud model, maintain the image catalog in a private cloud and move the web applications to the
public cloud
b. Use a hybrid cloud model, maintain the web applications in a private cloud and transfer the images to the
public cloud
c. Put everything in a public cloud
d. Maintain everything in a private cloud.
24. Which of the following versions of Hadoop is capable of running both Spark as well as MapReduce based applications?
a. Hadoop Version 1 c. Hadoop Version 2
b. Hadoop Version 1.2 d. All of the above
25. Which of the following is the best approach to determine the number of partitions that are created while storing an
input file in Hadoop?
a. It is not possible to determine the number of partitions in Hadoop
b. It can be determined by running the getNumPartitions() function in Hadoop
c. It can be determined by running the partition.length() function in Hadoop
d. It can be determined by dividing the file size by block size in Hadoop
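The file-size-over-block-size arithmetic can be sketched as below. The 128 MB block size is an assumption (it is the usual default in Hadoop 2.x and is configurable), and the file sizes are hypothetical.

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=128):
    """Number of blocks needed to store a file; the last block may be partial."""
    return math.ceil(file_size_mb / block_size_mb)

print(hdfs_block_count(1024))  # exact multiple of the block size
print(hdfs_block_count(1000))  # last block is only partly filled
print(hdfs_block_count(100))   # smaller than one block
```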
26. For the execution of a task in Hadoop 2.0, which of these events occur before node manager launches containers to
host the data processing tasks?
a. Resource Manager launches a container to host the application master
b. The containers in the node manager execute the assigned tasks
c. Application master releases its container
d. The output produced by each task is assembled and the final job run status is reported to the client.
27. Which of the following is the default replication factor applied to a file in HDFS location?
a. 3 c. 2
b. 1 d. None of the above
28. Which of the following statements are false for Yet Another Resource Negotiator (YARN)?
a. The resource manager tracks the resource usage in a node
b. A node manager tracks the resources usage in a node
c. Once a node receives a job, the corresponding application master(s) are launched to execute that job
d. The application master(s) negotiate with the resource manager for the containers to execute the task.
29. Is YARN a replacement for the HADOOP framework?
a. Yes b. No
30. Mr. Bean is working on Hadoop MapReduce programming, but what he wants is a sorted output from the reducer. In
order to achieve this, he is thinking of sorting the output while ingesting it as an input to the reducer itself. Which of
the following would be the best possible option to achieve this?
a. It can be achieved by sorting the data in mapper class, so that output produced by mapper would be sorted
b. It can be achieved by sorting the data in reducer class, so that input taken by reducer would get sorted
c. You cannot change the internal functionality of Hadoop MapReduce programming
d. This is an inbuilt property that is already available in Hadoop MapReduce Programming
SPARK
31. Mr. Bean has received the following requirement from a client after loading an input file in Spark: if someone wants
to perform some analysis, for example, aggregation of columns, then they should be aware of the column names in
the first step itself. However, since the file size is huge, it is not possible to determine whether or not the file
contains a specific column. Which of the following methods should Mr. Bean use for loading the input file in this
case, so that the column names can be determined even without opening the file?
a. Input_file=sc.textFile("<path to input file>")
b. Input_file=spark.read.load("<path to input file>", format="csv", inferSchema="True", header="True")
c. It is not possible to infer the schema of the file without opening it at all
d. Input_file=spark.read.load("<path to input file>", format="csv")
32. Mr. Bean wants to store a data file in a particular format so that he can run the following set of queries:
Select * from employee where country='USA';
Select * from employee where age<15;
Select * from employee where age>60;
Which of the following formats will satisfy this requirement?
a. Text file format c. Avro file format
b. Sequence file format d. Parquet file format
33. Which of the following is/are the main reason(s) why Spark has taken over Hadoop in today’s era?
a. Spark does everything in memory, whereas Hadoop does everything using hard disk (commodity-grade
hardware)
b. Spark provides the flexibility to automatically infer the schema of a file, whereas this is not possible with
Hadoop
c. Spark is 10 times faster than Hadoop MapReduce
d. All of the above
34. You are analysing a Spark program and identify that it is taking more than the expected time to execute. The reason
for this issue is that it is recreating some DataFrames repeatedly for processing the other DataFrames. Spark allows
you to avoid this by storing a DataFrame in memory so that Spark does not need to recreate it. Which strategy would
you use here to store the DataFrame in memory?
a. Add checkpoints to store DataFrames in HDFS
b. Cache the DataFrame that has been used multiple times
c. Create temp tables of DataFrame
d. Merge all the data frames and combine all queries in a single DataFrame query
35. What does the code given below signify in PySpark?
lines = sc.textFile("<path to input file, where file actually exists>")
Output = lines.map(lambda x: (x.split(" ")[0], x))
a. Splitting the lines of a file based on the space between words and retaining only the first word out of the given
line
b. Splitting the lines of a file based on the space and retaining all words except the first word out of the given line
c. Creating a paired RDD, with the first word as the key and the line as the value
d. Creating a paired RDD, with the first word as the value and the line as the key
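The effect of the lambda in the snippet above can be reproduced without a Spark cluster by applying it to an ordinary Python list. This is only an analogy: the input lines are hypothetical, and a real RDD would evaluate the map lazily and in a distributed fashion.

```python
lines = ["spark is fast", "hadoop uses disk"]  # hypothetical file contents

# Same lambda as in the question, applied with Python's built-in map.
pairs = list(map(lambda x: (x.split(" ")[0], x), lines))

for first_element, second_element in pairs:
    print(repr(first_element), "->", repr(second_element))
```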
36. While performing word count examples using Spark, Mr. Bean wants to split every line on the basis of whitespace
and create an RDD of words out of it. What could be the best possible option to achieve the same?
a. Map c. Filter
b. FlatMap d. ReduceByKey
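The difference between a map-style and a flatMap-style transformation can be sketched on a plain Python list. This is an analogy using the standard library rather than the Spark API, with hypothetical input lines.

```python
from itertools import chain

lines = ["to be or", "not to be"]  # hypothetical input lines

# map-style: one output element per input line (each element is a list of words).
mapped = [line.split(" ") for line in lines]

# flatMap-style: the per-line results are flattened into a single sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```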
37. Which of the following methods can be used to convert a Spark RDD into a Spark DataFrame?
a. RDD.createDF()
b. RDD.convertDF()
c. RDD.toDF()
d. It is not possible to convert an RDD into a DataFrame as RDD does not contain a schema, while DataFrame
contains a schema
38. Which of the following statements is/are correct regarding dataframes?
I. Media content like images and videos should be processed with unstructured APIs.
II. When the data schema is not defined, DataFrames should be used over RDDs.
III. Structured APIs have libraries built on top of them to allow writing code more easily
IV. MapReduce-style commands in RDDs give better control to analysts over how a particular job should be done
V. DataFrames have in-memory processing capabilities as they are built on top of RDDs and therefore inherit their
properties
a. I, II, IV, V c. I, II, IV
b. I, III, IV, V d. I, II, IV, V
39. Look at the schema of the Spark DataFrame named “df”.
root
|-- Rank: Integer (nullable = true)
|-- Name: String (nullable = true)
|-- Platform: String (nullable = true)
|-- Year: String (nullable = true)
|-- Genre: String (nullable = true)
|-- Publisher: String (nullable = true)
|-- NA_Sales: double (nullable = true)
|-- EU_Sales: double (nullable = true)
|-- JP_Sales: double (nullable = true)
|-- Other_Sales: double (nullable = true)
|-- Global_Sales: double (nullable = true)
You need to find the genre of game that is most popular in the Other Sales category. From the set of commands below,
choose the correct commands and the order in which they should appear to give the required output.
For example, if you choose commands 1, 4 and 6 and they should appear in the order 4, 6 and 1, the answer
will be 4-6-1. Assume all the required libraries have already been imported.
1. P_Genre = spark.sql("SELECT Genre, SUM(Other_Sales) FROM table GROUP BY Genre ORDER BY
SUM(Other_Sales) DESC").head(0)
2. P_Genre = spark.sql("SELECT Genre, SUM(Other_Sales) FROM table GROUP BY Genre ORDER BY
SUM(Other_Sales) DESC").head(1)[0].asDict()
3. df.createTempView("table")
4. P_Genre['Genre']
5. df.createOrReplaceTempView("table")
a. 5-3-1 c. 5-2-4
b. 5-3-2 d. 3-1-4
40. There is a huge CSV file in terabytes and you have to process it. You can pre-process it and convert it into any format
to reduce the size, as there are storage constraints. Which strategy will you apply to reduce the file size? Which
strategy would you use to ensure the lowest memory consumption possible?
a. Convert the CSV file to a JSON file to reduce the file size
b. Apply the gzip compression technique on the CSV file
c. Convert the CSV to parquet format with snappy compression on it
d. CSV files cannot be reduced further by applying any compression technique
41. Which of the following statements is/are correct?
1. The Pandas API on Spark uses the concept of eager execution to accelerate the data analysis process.
2. The Pandas API on Spark runs over multiple nodes on Spark
3. A broadcast hash join is preferred when both the datasets to be joined are of very large sizes.
4. If skew joins are not enabled in Spark AQE, the larger partitions take much longer to be processed which makes
the entire operation slower.
a. 1, 4 c. 2, 4
b. 2, 3, 4 d. 2, 3
42. Suppose Mr. X is writing code for calculating word count in PySpark, which is given below. However, he realized that
nothing was getting printed to the console. Which of the following is a possible reason for having no output on
execution?
input_file=sc.textFile("<path for the input file, where file actually exists>")
words=input_file.flatMap(lambda line: line.split(" "))
count=words.map(lambda word: (word,1)).reduceByKey(lambda x,y:x+y)
a. The syntax for reading the input file is incorrect
b. No action has been called yet; all are transformations
c. The line containing the flat map operation is causing problems, as it is not receiving any input in the required
format
d. The third line is incorrect since grouping based on key-value pairs is not … RDD
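Lazy evaluation can be illustrated with a plain-Python analogy: the built-in map, like a Spark transformation, does no work until its result is consumed. This sketch is not Spark code; the recording list and the word list are hypothetical devices for observing when the function body actually runs.

```python
calls = []  # records every time the mapping function really executes

def to_pair(word):
    calls.append(word)
    return (word, 1)

words = ["a", "b", "a"]       # hypothetical input
pairs = map(to_pair, words)   # lazy: builds a plan, runs nothing yet
calls_before = len(calls)

result = list(pairs)          # consuming the iterator plays the role of an action
calls_after = len(calls)

print(calls_before, calls_after, result)
```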
43. Suppose you want to calculate the average score of each player for four matches. Which of the following functions
correctly calculates the value in the RDD ‘avg_score’?
a. map() c. flatMap()
b. reduce() d. mapValues()
MCQs [Paper -II]
44. Which of the following DDL statements holds true for a MySQL database?
a. CREATE TABLE upgrad.Product
(
ProductID int PRIMARY KEY DEFAULT 1,
ProductName VARCHAR(100),
OrderID NUMERIC,
OrderDate DATETIME
)
b. CREATE TABLE upgrad.Product
(
ProductID int PRIMARYKEY,
ProductName VARCHAR(100),
OrderID NUMERIC(1, 10),
OrderDate DATETIME
)
c. CREATE TABLE upgrad.Product
(
ProductID int PRIMARY KEY,
ProductName VARCHAR(100),
OrderID NUMERIC(10,2),
OrderDate DATETIME
)
d. CREATE TABLE upgrad.Product
(
ProductID int PRIMARY KEY,
ProductName VARCHAR(100),
OrderID NUMERIC(10,2) REFERENCES upgrad.Product(ProductID),
OrderDate DATETIME
)
45. Choose the correct options for the given statements
Statement 1: A maximum cardinality is the maximum number of entity instances that can participate in a
relationship instance.
Statement 2: An identifier determines the type of relationship that an entity has
Statement 3: A disadvantage of a Relational schema is that it’s not horizontally scalable
a. Statement 1 is True while Statement 2 is False
b. Statement 2 is True while Statement 3 is False
c. Statement 3 is True while Statement 1 is False
d. Statement 3 is True while Statement 2 is False
46. Which of the following order of SQL statements is correct?
a. FROM, SELECT, WHERE, GROUP BY, ORDER BY
b. SELECT, FROM, GROUP BY, HAVING, ORDER BY
c. FROM, JOIN, WHERE, WINDOW, ORDER BY
d. FROM, JOIN, GROUP BY, HAVING, WINDOW, ORDER BY
47. You are given two tables: “Student” and “Branch”
Student         Branch
student_id      branch_id
student_name    branch_name
marks_range
year
branch_id
Select the query(s) from the following options that will print the names of the students and their respective years
who belong to the Electrical Engineering branch.
a. SELECT student_name, year
FROM Student a
RIGHT JOIN Branch b
ON a.branch_id=b.branch_id
WHERE branch_name="Electrical Engineering";
b. SELECT student_name, year
FROM Student a
LEFT JOIN Branch b
ON a.branch_id=b.branch_id
WHERE branch_name="Electrical Engineering";
c. SELECT student_name, year
FROM Student
LEFT JOIN Branch
USING branch_id
WHERE branch_name="Electrical Engineering";
d. SELECT student_name, year
FROM Student
LEFT JOIN Branch
USING (branch_id)
WHERE branch_name="Electrical Engineering";
48. Which of the following could be a part of job execution in Spark? (Multiple options might be correct)
a. Tasks c. Stages
b. Mapper d. Reducer
49. Which of the following is the best possible option to update a file in HDFS? (Multiple options might be correct)
a. Hadoop fs -update <selection_condition> <updation to be done>
b. There is no direct option of updating a file in Hadoop
c. Fetch the file from Hadoop to local, and then update it and store it back to the HDFS location
d. Fetch the file from Hadoop to local, and then update it. Finally, store it forcefully back in the same location in
Hadoop.
50. Which of the following is a characteristic/benefit of cloud computing?
a. Rapid elasticity and scalability
b. On-demand self-service
c. Resource pooling
d. Access only over a peer-to-peer network connection (a peer-to-peer (P2P) network is created when two or
more PCs are connected and share resources without going through a separate server computer)
51. Which of the following statements about Spark and MapReduce is true? (More than one options may be correct)
a. Spark is preferred to MapReduce for processing numerous small files, as it will reduce the overhead in multiple
read and write operations.
b. MapReduce can be more cost-effective than Spark for an extremely large dataset that does not fit in the spark
memory
c. MapReduce is preferred to Spark for iterative processing, as it is much faster than Spark as it can carry out in-
memory computation
d. Spark is preferred to MapReduce to create live dashboards, as Spark’s processing speed is much faster than
that of MapReduce
52. Which statements about type support are true? (More than one option may be correct)
a. It allows conversions from core Pandas to Spark dataframes
b. During type support, the data types need to be physically converted into the appropriate data types
c. When using the type support property of dataframes, the data types are converted automatically to the
appropriate types
d. It allows lazy execution.
53. Which of the following statements is true? (Multiple options may be correct)
a. Virtual Machines load only the required libraries for an application to run.
b. Containers load only the required libraries for an application to run.
c. Virtual Machines are heavier than containers as they load the complete operating system for deploying the
application.
d. Containers are heavier than virtual machines as they load the complete operating system for deploying the
application.