Nptel Assignments
ASSIGNMENT WEEK 1:
1. What concept does the phrase "turning data tombs into 'golden nuggets' of
knowledge" signify with respect to data mining? (1 mark)
2. Which step involves the extraction of data patterns using intelligent methods? (1 mark)
3. What is the primary purpose of data mining in the context of the data age? (1 mark)
5. What does the architecture of a data warehouse primarily aim to facilitate? (1 mark)
6. What is the primary advantage of using data warehouse systems for OLAP? (1 mark)
Answer: d) Objects that deviate from the general behaviour or model of the data.
8. How does data mining benefit from scalable database technologies? (1 mark)
10. Which phase in the knowledge discovery process involves the removal of noise
and inconsistent data? (1 mark)
13. Which of the following can be called a major driver of data mining? (1 mark)
14. What does the "long tail" phenomenon refer to in business? (1 mark)
15. What can you infer from the following graph? (1 mark)
Answer: a) Less travelled destinations are growing more popular with each passing
year
BUSINESS INTELLIGENCE AND ANALYTICS
ASSIGNMENT WEEK 2:
1. Which term describes the practice of making decisions purely on data analysis rather
than intuition? (1 mark)
3. What does the acronym ACID stand for in the context of databases? (1 mark)
Answer: c) OLAP deals with data retrieval and analysis for revealing business
trends, while OLTP supports a large number of simple transactions.
6. Why does a data warehouse not require transaction processing, recovery, and
concurrency control mechanisms? (1 mark)
7. Which data warehouse model spans the entire organization and provides corporate-
wide data integration? (1 mark)
11. Which of the following gives a logical structure of the database graphically? (1 mark)
12. Which type of DBMS language is used to create the database schema? (1 mark)
13. Which Data Manipulation command is used to add a new record in a database?
(1 mark)
ANS: b) INSERT
14. What does the atomicity property of the ACID database guarantee in a transaction?
(1 mark)
15. What problem does the ACID property of isolation address? (1 mark)
ASSIGNMENT WEEK 3:
4. What does the apex cuboid in a data cube typically represent? (1 Mark)
5. How many cuboids are there in a 4-dimensional cube with 4 levels each? (1 Mark)
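The count asked for in question 5 can be sketched with the standard formula for a cube with concept hierarchies: the total number of cuboids is the product of (L_i + 1) over all dimensions, where L_i is the number of levels of dimension i and the extra 1 stands for the virtual top level "all". This is a minimal sketch that assumes the "4 levels" in the question exclude that virtual level; the function name is illustrative and the result is a check, not the official answer key.

```python
import math


def total_cuboids(levels_per_dimension):
    """Number of cuboids in a data cube with concept hierarchies.

    Each dimension with L levels contributes L + 1 choices
    (its L levels plus the virtual top level "all").
    """
    return math.prod(levels + 1 for levels in levels_per_dimension)


# Assumption: 4 dimensions, each with 4 levels (excluding "all").
print(total_cuboids([4, 4, 4, 4]))  # 625
```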
7. Which schema is commonly used in data warehouses due to its capability to model
multiple, interrelated subjects? (1 Mark)
8. Which normal form deals with atomicity and ensures that each attribute contains only
indivisible values? (1 Mark)
10. Consider the SQL statement: SELECT COUNT (*) FROM table_name. What does
it retrieve? (1 Mark)
13. What is the purpose of generating a lattice of cuboids in a data cube model? (1 Mark)
14. What distinguishes a data mart from a data warehouse in terms of schema
preference? (1 Mark)
Answer: C) Data marts typically utilize star or snowflake schemas, while data
warehouses favour the fact constellation schema.
ASSIGNMENT WEEK 4:
1. The concept of "Survival at time 't'" in survival analysis refers to: (1 Mark)
2. What does the term "Churn Rate" signify in customer analytics? (1 Mark)
Answer: B) The time taken for exactly half of a customer cohort to leave
8. Why is it important for businesses to track their customer acquisition cost (CAC)
alongside CLV? (1 Mark)
11. What are the potential limitations of using survival analysis in customer churn
prediction? (1 Mark)
12. How does survival differ from retention in customer analytics? (1 Mark)
13. Which components are crucial for a full customer value calculation? (1 Mark)
14. How does survival analysis contribute to customer value calculations? (1 Mark)
15. An online gaming platform has 100,000 active users. During a specific month,
10,000 users become inactive. The platform identifies 20,000 users as being at risk
of becoming inactive during that month. What is the hazard probability for the online
gaming platform during that month? (1 Mark)
Ans: c) 0.5
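The answer to question 15 can be reproduced with a short sketch, assuming the hazard probability for a period is the number of users who became inactive divided by the number of users at risk in that period (the reading consistent with the answer 0.5). The function name is illustrative.

```python
def hazard_probability(stopped: int, at_risk: int) -> float:
    """Share of the at-risk population that actually stopped in the period."""
    if at_risk <= 0:
        raise ValueError("at_risk must be positive")
    return stopped / at_risk


# 10,000 users became inactive out of 20,000 identified as at risk.
print(hazard_probability(10_000, 20_000))  # 0.5
```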
ASSIGNMENT WEEK 5:
2. What type of data transformation technique scales data to a specific range, such as
0 to 1? (1 Mark)
ANS: d) Standardization/Normalization
4. What does Ordinary Least Squares (OLS) aim to minimize in the context of linear
regression? (1 Mark)
Answer: A) The sum of squared errors between the predicted and observed
values of the dependent variable.
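To illustrate the answer to question 4, here is a minimal sketch of simple linear regression fitted by OLS in closed form, followed by a check that moving the fitted line in either direction does not lower the sum of squared errors. The data points and function names are made up for illustration.

```python
def fit_ols(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared errors between observed and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_ols(xs, ys)
best = sse(xs, ys, b0, b1)

# Shifting the intercept or the slope away from the OLS fit cannot reduce SSE.
print(best <= sse(xs, ys, b0 + 0.1, b1))  # True
print(best <= sse(xs, ys, b0, b1 - 0.1))  # True
```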
Solution: A) The model is too simple to capture the underlying patterns in the
data.
9. When should one focus on reducing bias in a machine learning model? (1 Mark)
Solution: D) When the model doesn’t fit the data well, and works poorly in
explanatory/predictive performance
10. What is the bias-variance trade-off in machine learning? (1 Mark)
Solution: C) Finding the equilibrium between model complexity and its ability
to generalize to unseen data.
Solution: A) It iteratively uses all but one sample as the training set and the
remaining sample as the test set.
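As a minimal sketch of leave-one-out cross-validation, the generator below holds out each sample once as the test set and trains on all the others. The function name and the toy samples are illustrative only.

```python
def leave_one_out_splits(samples):
    """Yield (train, test) pairs; each sample is the test set exactly once."""
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        test = [samples[i]]
        yield train, test


# With n samples there are n splits, each training on n - 1 samples.
for train, test in leave_one_out_splits(["a", "b", "c"]):
    print(train, test)
```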
14. What are the three sources of error in predicted Y in machine learning? (1 Mark)
15. Which of the following statements most accurately distinguishes supervised learning
from unsupervised learning in machine learning? (1 Mark)
ASSIGNMENT WEEK 6:
Ans: c) They categorize objects into distinct and mutually exclusive groups based
on their characteristics.
3. Which are the two measures used in ROC curves to visualize the performance
of classifiers? (1 Mark)
4. Which metric measures the ratio of correctly predicted positive observations to the
total predicted positives? (1 Mark)
Answer: D) Precision
5. Imagine you're building a spam filter that classifies emails as spam or not spam.
After testing your model, you get the following results:
False Negatives (FN): 10 emails incorrectly classified as not spam but are
actually spam
Ans: d) 0.909
7. How does the test data variation contribute to the errors in predicting Y
values? (1 Mark)
Answer: A) The percentage of test set tuples correctly classified by the classifier.
9. In classification, what does the term "reducible error" primarily refer to? (1 Mark)
10. In a medical study evaluating a diagnostic test for a certain disease, 150 patients
were tested. Of these, 90 patients were diagnosed with the disease, while 60 patients
did not have the disease. The model predictions are as follows:
Choose the correct option that represents the error rate of the diagnostic test based on the
provided classification outcomes. (2 Marks)
Ans: B) 0.2
11. Overfitting occurs when a classifier incorporates anomalies of the training data that
are not present in the general dataset. (True/False) (1 Mark)
Answer: True
Answer: True
13. What is the lift obtained by a marketing team if, without data mining, they achieve a
15% response rate by randomly selecting 20% of potential customers, while with
predictive analytics, they target 20% of likely customers and achieve a response rate
of 25%? (2 Marks)
Answer: B) 1.67
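The answer to question 13 follows from taking lift as the response rate achieved with targeting divided by the response rate achieved without it. A minimal sketch, with an illustrative function name:

```python
def lift(targeted_response_rate: float, baseline_response_rate: float) -> float:
    """Ratio of the response rate with the model to the rate without it."""
    return targeted_response_rate / baseline_response_rate


# 25% response with predictive analytics vs. 15% with random selection.
print(round(lift(0.25, 0.15), 2))  # 1.67
```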
b) Only 1 is correct
c) Only 2 is correct
15. Which of the following is NOT a commonly used classification technique? (1 Mark)
ASSIGNMENT WEEK 7:
3. Why might a decision tree, resulting from the described process, perform poorly on
a test set? (1 Mark)
4. What might a smaller tree with fewer splits achieve in terms of variance and bias?
(1 Mark)
Answer: True
Ans: True
12. What are some common techniques for handling imbalanced data in
classification tasks? (1 Mark)
13. In Random Forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then
aggregate the results of these trees. Which of the following is true about an individual
(Tk) tree in Random Forest? (1 Mark)
14. Consider a dataset with a binary target variable (0 or 1) and a split based on
a feature resulting in two child nodes after the split.
Node 1 (left child): Out of 40 samples, 30 belong to class 0 and 10 belong to class 1.
Node 2 (right child): Out of 60 samples, 20 belong to class 0 and 40 belong to class 1.
Which option has the correct Gini indices of the child nodes? (3 Marks)
Solution: b) Gini index for Node 1: 0.375, Gini index for Node 2: 0.444
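The two Gini indices in question 14 can be checked with a short sketch, using the usual definition: one minus the sum of squared class proportions in the node. The function name is illustrative.

```python
def gini_index(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


node_1 = gini_index([30, 10])  # 30 of class 0, 10 of class 1
node_2 = gini_index([20, 40])  # 20 of class 0, 40 of class 1

print(round(node_1, 3))  # 0.375
print(round(node_2, 3))  # 0.444
```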
15. How does Random Forest aim to reduce correlation among trees? (1 Mark)
ASSIGNMENT WEEK 8:
1. Which of the following is a common method for splitting nodes in a decision tree?
(1 Mark)
3. Which of the following is a popular algorithm for constructing decision trees? (1 Mark)
Answer: A) ID3
7. Which of the following is a common stopping criterion for growing a decision tree?
(1 Mark)
8. For decision trees, what purpose does "one-hot encoding" serve? (1 Mark)
9. What's the primary drawback of utilizing a substantial maximum depth for a decision
tree? (1 Mark)
A. Pruning
B. Bagging
C. Boosting
Answer: D
11. Which of the following is NOT commonly associated with the use of decision trees?
(1 Mark)
12. How can decision trees be made more robust to noise in the data? (1 Mark)
14. If the true positive value is 10 and the false positive value is 15, what is the precision score
for the classification model? (1 Mark)
Answer: B) 0.4
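The answer to question 14 uses precision = TP / (TP + FP). A minimal sketch, with an illustrative function name:

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    return true_positives / (true_positives + false_positives)


# 10 true positives and 15 false positives.
print(precision(10, 15))  # 0.4
```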
ASSIGNMENT WEEK 9:
6. Which of the following is a method of choosing the optimal number of clusters for
k-means? (1 Mark)
A. Shadow method
D. B and C
ANSWER: D ) B and C
7. Which of the following statements best describes the goal of SMOTE preprocessing
technique? (1 Mark)
Explanation: D
10. Which of the following statements about distance between clusters is true? (1 Mark)
11. In a 3-dimensional space represented by coordinates (x, y, z), two cluster centroids,
A and B, have coordinates A(2, 4, 6) and B(5, 1, 3) respectively. What is the precise
Euclidean distance between these centroids, denoting their dissimilarity in the
cluster space? (1 Mark)
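The distance asked for in question 11 can be computed with the standard library. This is a minimal sketch; the value it prints is a check on the arithmetic, not a quotation from the answer key.

```python
import math

# Centroids from the question.
centroid_a = (2, 4, 6)
centroid_b = (5, 1, 3)

# Euclidean distance: square root of the sum of squared coordinate differences,
# here sqrt(3^2 + 3^2 + 3^2) = sqrt(27).
distance = math.dist(centroid_a, centroid_b)
print(round(distance, 3))  # 5.196
```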
12. In K-means clustering, what is the purpose of the "elbow method"? (1 Mark)
13. Suppose that a customer transaction table contains 9 items and 3 customers. What
is the Jaccard coefficient (similarity measure for asymmetric binary variables) for
C1 and C2? (1 Mark)
Ans: b. 0.25
14. In the figure below, if you draw a horizontal line on the y-axis for y=2. What will
be the number of clusters formed? (1 Mark)
Solution: (B) 2
15. Assume you want to cluster 7 observations into 3 clusters using the K-Means
clustering algorithm. After first iteration, clusters C1, C2, C3 have following
observations:
What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the
second iteration? (1 Mark)
Answer: C) Barnacles
7. What SQL function is used for RFM analysis to scale RFM into a
predefined range? (1 Mark)
Answer: C) NTILE()
Answer: B) matplotlib
11. How is Recency (R) scaled after grouping Days since last order into 10
deciles? (1 Mark)
Answer: B) It is reversed, with the most recent customer receiving the
highest R value
12. Which clustering algorithm assigns data points to the nearest cluster centroid? (1 Mark)
Answer: a. K-Means
13. A retail company wants to segment its customers for targeted marketing
campaigns. They have data on customer demographics (age, gender,
income), purchase history (amount, frequency, categories), and online
behaviour (website visits, clicks). Which features are most suitable for k-
means clustering in this scenario? (1 Mark)
14. True or False: In K-means clustering, each cluster is represented by its center
(centroid) which corresponds to the median of points assigned to the cluster.
Ans: False
15. Out of the reasons elicited below, what would be a major reason for you not
to choose K-means for clustering analysis? (1 Mark)
Ans: A) It is sensitive to noise and outlier data points and also sensitive to the
initial placement of its cluster centers (centroids).
7. What does the term 'epoch' refer to in neural network training? (1 Mark)
10. If a neural network has 16 input neurons and 4 output neurons, how many neurons
would be recommended for the hidden layer according to thumb rule? (1 Mark)
Answer: A) 8 neurons
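The question does not say which rule of thumb it means. One common heuristic that reproduces the answer of 8 is the geometric mean of the input and output layer sizes; the sketch below assumes that rule, and the function name is illustrative.

```python
import math


def hidden_neurons_geometric_mean(n_inputs: int, n_outputs: int) -> int:
    """Heuristic hidden-layer size: sqrt(inputs * outputs), rounded.

    Assumption: this is the rule of thumb the question intends.
    """
    return round(math.sqrt(n_inputs * n_outputs))


print(hidden_neurons_geometric_mean(16, 4))  # 8
```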
12. There is a feedback loop in the final stage of a back propagation algorithm. (True/False)
(1 Mark)
Answer: False
13. In time series analysis, which component represents the long-term movement or
the general direction of the data? (1 Mark)
Answer: C) Trend
15. What differentiates a feedforward neural network from other types of neural networks
like recurrent neural networks (RNNs) or convolutional neural networks (CNNs)? (1 Mark)
5. Which type of words are typically considered for removal or stop word
lists in text mining? (1 Mark)
Answer: True
9. What does a higher Phi coefficient value indicate regarding word co-
occurrence? (1 Mark)
10. In a text corpus comprising 200 documents, the words "forest" and "wildlife" do not
co-occur in 120 documents. Both "forest" and "wildlife" co-occur in 50 documents.
Furthermore, "forest" without "wildlife" appears in 10 documents, and "wildlife" without
"forest" appears in 20 documents. What is the Phi coefficient to measure the correlation
between the appearance of the words "forest" and "wildlife" in this dataset?
ANSWER: B ) 0.66
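The answer to question 10 can be reproduced from the 2x2 co-occurrence table with the usual Phi formula: (n11 * n00 - n10 * n01) divided by the square root of the product of the four marginal totals. A minimal sketch, with illustrative names:

```python
import math


def phi_coefficient(n11: int, n10: int, n01: int, n00: int) -> float:
    """Phi coefficient for a 2x2 table of word co-occurrence counts.

    n11: both words present, n10: only the first word,
    n01: only the second word, n00: neither word.
    """
    first_present = n11 + n10
    first_absent = n01 + n00
    second_present = n11 + n01
    second_absent = n10 + n00
    denominator = math.sqrt(
        first_present * first_absent * second_present * second_absent
    )
    return (n11 * n00 - n10 * n01) / denominator


# "forest" and "wildlife": 50 both, 10 forest only, 20 wildlife only, 120 neither.
print(round(phi_coefficient(50, 10, 20, 120), 2))  # 0.66
```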
11. Which of the following datasets provides a polarity score ranging from
-5 to +5 for words in sentiment analysis? (1 Mark)
13. In TF-IDF analysis, what does the term frequency (tf) measure for
a word in a document? (1 Mark)
15. Which of these techniques is used for normalization in text mining? (1 Mark)
A. Stemming
B. Stop words removal
C. Lemmatization
D. All of the above
ANSWER:D