
Data Science – Question Bank

Unit 1
• Explain the different stages of data science.
• Explain Raw Data with an example.
• Illustrate Central Limit Theorem with a neat diagram.
• Describe Bayes' theorem in detail with an example.
• Explain Processed Data with an example.
• Explain Meta Data (Code Book) with an example.
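
For the Bayes' theorem question above, a small numeric sketch may help. The prevalence, sensitivity and false-positive rate used here are hypothetical values chosen only for illustration; they do not come from this question bank.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) from Bayes' theorem.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of observing the evidence E.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical screening test: 1% prevalence, 95% sensitivity, 5% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(disease | positive test) = {posterior:.3f}")
```

With these assumed numbers the posterior stays well below the test's sensitivity, because the low prior dominates.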

Unit 2
• Differentiate between: Point Estimate, Interval Estimate & Confidence Interval.
• Explain the null and alternative hypotheses using the example of flipping a coin.
• Explain Type 1 & Type 2 errors in hypothesis testing with suitable examples.
• What is the significance level? How does it regulate the possibility of occurrence of Type 1 & Type 2
errors?
• Explain p-values with an example.
• Explain the interrelationship of Margin of Error and Standard Error.
• In the population, the average IQ is 100 with a standard deviation of 15. A team of scientists
wants to test a new medication to see if it has a positive effect, a negative effect, or no effect
at all on intelligence. A sample of 30 participants who have taken the medication has a mean of
140. Did the medication affect intelligence?
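
One way to approach the IQ question above is a two-sided one-sample z-test, since the population standard deviation is given. The sketch below assumes a significance level of 0.05, which the question does not state; the function name is illustrative.

```python
import math


def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Two-sided one-sample z-test with a known population standard deviation."""
    standard_error = pop_sd / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Two-sided p-value from the standard normal tails: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = one_sample_z_test(sample_mean=140, pop_mean=100, pop_sd=15, n=30)
alpha = 0.05  # assumed significance level; not given in the question
print(f"z = {z:.2f}, p = {p_value:.3g}, reject H0: {p_value < alpha}")
```

Under these assumptions the z statistic is far beyond the usual critical value of 1.96, so the null hypothesis of no effect would be rejected.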

• Study the data distribution given in the table and answer the questions below.
Value                                                 1   2   3   4   5   6    7    8
No. of data points with that value (i.e. frequency)   1   0   0   3   4   10   12   8
o What is the mean value?
o How would you describe the data distribution? Why?
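
For the frequency-table question above, the mean is the frequency-weighted average of the values. The sketch below reads the frequencies as 1, 0, 0, 3, 4, 10, 12, 8 for the values 1 to 8, which is how the table appears here; comparing the mean with the median is one common (not the only) way to judge skew.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [1, 0, 0, 3, 4, 10, 12, 8]  # as read from the table

n = sum(frequencies)
mean = sum(v * f for v, f in zip(values, frequencies)) / n

# Expand the table into raw observations to get the median.
observations = [v for v, f in zip(values, frequencies) for _ in range(f)]
median = statistics.median(observations)

print(f"n = {n}, mean = {mean:.3f}, median = {median}")
```

A mean below the median, with most of the mass at the high values and a thin tail toward 1, suggests a left-skewed (negatively skewed) distribution.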

Unit 3
• Explain how gradient descent is used to fit parameterized models.
• Explain the concept of Lp norm.
• State the advantages and disadvantages of using L1 norm.
• Illustrate with an example that the L1 metric distance is always larger than or equal to the L2 metric distance.
• Draw a typical Hessian matrix. Indicate how it is used in optimization.
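
For the Lp norm questions above, a short sketch of the Minkowski (Lp) distance can make the L1 versus L2 comparison concrete. The points used are hypothetical.

```python
def lp_distance(a, b, p):
    """Minkowski (Lp) distance between two equal-length points, for p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0, 0), (3, 4)  # hypothetical points
l1 = lp_distance(a, b, 1)  # |3| + |4|
l2 = lp_distance(a, b, 2)  # sqrt(3^2 + 4^2)
print(f"L1 = {l1}, L2 = {l2}")

# When the points differ in only one coordinate, the two distances coincide.
c = (3, 0)
print(lp_distance(a, c, 1), lp_distance(a, c, 2))
```

The second pair shows why the comparison is "larger than or equal to" rather than strictly larger.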

Unit 4
• What is machine learning? What is its role in data science?
• Explain supervised and unsupervised machine learning.

• Explain the standard errors of regression coefficients.


• What is the significance of R² in regression? If a regression analysis returns an R² of 0.9354, how
would you interpret it?
• Which regression is used in modeling of sensor characteristics?
• What do you understand by Logistic regression? What are dichotomous variables in the context of
Logistic regression?
• Explain dichotomous variables in the context of logistic regression using suitable examples.
• What do you mean by interpretation of beta coefficients? Explain with examples.
• A certain regression equation produced the following scores: SSR = 20.24 and SSE = 2.11. What is the
value of R²? Based on the value of R², comment on the relationship between the variables.
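
For the SSR/SSE question above, a minimal sketch follows. It assumes SSR is the regression (explained) sum of squares and SSE is the error (residual) sum of squares, so that SST = SSR + SSE; some texts use SSR for the residual sum instead, in which case the formula would differ.

```python
def r_squared(ssr, sse):
    """R^2 = SSR / SST, with SST = SSR + SSE (SSR taken as the explained sum of squares)."""
    return ssr / (ssr + sse)


r2 = r_squared(ssr=20.24, sse=2.11)
print(f"R^2 = {r2:.4f}")
```

Under that reading, roughly 90% of the variation in the response is explained by the model, which points to a strong relationship.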

• Why do we measure the impurity of a resulting node in a decision tree? List the different measures of
impurity used in decision trees.
• There are 4 coins A, B, C and D, of which 3 coins are of equal weight and one coin is heavier.
Find the heavier coin using a decision tree.
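
For the node-impurity question above, Gini impurity and entropy are two commonly used measures. The sketch below computes both from a list of class labels; the label lists are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's samples that fall in each class."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


pure_node = ["A", "A", "A", "A"]   # hypothetical labels
mixed_node = ["A", "A", "B", "B"]  # hypothetical labels
print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and reach their two-class maximum for an even split.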

Unit 5

• Compare and contrast the Divisive and Agglomerative clustering algorithms.


• How does the KNN algorithm make the predictions on the unseen dataset?
• Is Feature Scaling required for the KNN Algorithm? Explain with proper justification.
• Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2,
10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). Initial cluster centers are:
A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b =
(x2, y2) is defined as ρ(a, b) = |x2 – x1| + |y2 – y1|. Use the K-Means algorithm to find the three
cluster centers after the second iteration.
• Apply KNN and predict the class for the test point (3,7) for k=3. Training points with class are
(x,y,class). (7,7,2), (7,4,2), (3,4,1), (1,4,1), (2,5,2), (3,8,1)
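
The eight-point K-Means question above (points A1 to A8) can be traced with a short sketch. It assumes the usual K-Means update, where each new center is the coordinate-wise mean of its assigned points, while assignment uses the distance ρ given in the question. The function names are illustrative.

```python
def manhattan(a, b):
    """Distance from the question: |x2 - x1| + |y2 - y1|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def kmeans_step(points, centers):
    """One iteration: assign each point to its nearest center, then recompute the means."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(p)
    new_centers = []
    for members, old_center in zip(clusters, centers):
        if members:
            mean_x = sum(x for x, _ in members) / len(members)
            mean_y = sum(y for _, y in members) / len(members)
            new_centers.append((mean_x, mean_y))
        else:
            new_centers.append(old_center)  # assumption: an empty cluster keeps its center
    return new_centers, clusters


points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]  # A1..A8
centers = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7

for iteration in (1, 2):
    centers, clusters = kmeans_step(points, centers)
    print(f"after iteration {iteration}: {centers}")
```

Under these assumptions the centers after the second iteration come out as (3, 9.5), (6.5, 5.25) and (1.5, 3.5).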

• Use the K-Means algorithm to create two clusters. Assume A(2, 2) and C(1, 1) are the centers of the
two clusters.

• Marks scored by 10 students in mathematics and computer science are given in table below along
with their result as Pass or Fail. Pappu scores 41 marks in mathematics and 38 marks in computer
science. Using KNN classifier algorithm, determine whether Pappu has passed or failed using K as
1,2,3,5 and 7.
Student    Mathematics   Computer Science   Result
Naren      80            80                 Pass
Amit       75            40                 Pass
Deven      65            50                 Pass
Surya      40            40                 Pass
Sanjay     70            40                 Pass
Teja       65            37                 Fail
Akhilesh   70            25                 Fail
Sharad     38            38                 Fail
Ajit       35            59                 Fail
Shivraj    70            65                 Pass
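
The KNN question on the marks table above can be sketched as follows. The question does not name a distance measure, so Euclidean distance is assumed here, and a tie in the vote (possible for even K such as K = 2) is resolved in favour of the label of the nearer neighbour; both choices are assumptions.

```python
import math
from collections import Counter

# (student, mathematics, computer science, result) rows from the table
training = [
    ("Naren", 80, 80, "Pass"), ("Amit", 75, 40, "Pass"),
    ("Deven", 65, 50, "Pass"), ("Surya", 40, 40, "Pass"),
    ("Sanjay", 70, 40, "Pass"), ("Teja", 65, 37, "Fail"),
    ("Akhilesh", 70, 25, "Fail"), ("Sharad", 38, 38, "Fail"),
    ("Ajit", 35, 59, "Fail"), ("Shivraj", 70, 65, "Pass"),
]


def knn_predict(query, data, k):
    """Majority vote among the k rows nearest to the query (Euclidean distance)."""
    ranked = sorted(data, key=lambda row: math.dist(query, row[1:3]))
    votes = Counter(row[3] for row in ranked[:k])
    # On equal counts, most_common keeps first-seen order, i.e. the nearer neighbour's label.
    return votes.most_common(1)[0][0]


pappu = (41, 38)
results = {k: knn_predict(pappu, training, k) for k in (1, 2, 3, 5, 7)}
for k, label in results.items():
    print(f"K = {k}: {label}")
```

With K = 2 the two nearest neighbours disagree, so the answer for that K depends entirely on the tie-breaking rule chosen.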

• Using the Naïve Bayes classifier approach and the training data set given in the table,
predict Class = Buy Laptop: Yes or No for the feature set: {Income = Low; Student =
No; Credit Rating = Excellent}

Sr. No.   Income   Student   Credit Rating   Buy Laptop
1         High     No        Fair            No
2         High     No        Excellent       No
3         High     No        Fair            Yes
4         Medium   No        Fair            Yes
5         Low      Yes       Fair            Yes
6         Low      Yes       Excellent       No
7         Low      Yes       Excellent       Yes
8         Medium   No        Fair            No
9         Low      Yes       Fair            Yes
10        Medium   Yes       Fair            Yes
11        Medium   Yes       Excellent       Yes
12        Medium   No        Excellent       Yes
13        High     Yes       Fair            Yes
14        Medium   No        Excellent       No
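
The Naïve Bayes question above can be worked through with a small counting sketch. It uses plain relative frequencies with no Laplace smoothing, which is an assumption; the function name is illustrative.

```python
from collections import Counter

# (income, student, credit rating, buy laptop) rows copied from the table
rows = [
    ("High", "No", "Fair", "No"), ("High", "No", "Excellent", "No"),
    ("High", "No", "Fair", "Yes"), ("Medium", "No", "Fair", "Yes"),
    ("Low", "Yes", "Fair", "Yes"), ("Low", "Yes", "Excellent", "No"),
    ("Low", "Yes", "Excellent", "Yes"), ("Medium", "No", "Fair", "No"),
    ("Low", "Yes", "Fair", "Yes"), ("Medium", "Yes", "Fair", "Yes"),
    ("Medium", "Yes", "Excellent", "Yes"), ("Medium", "No", "Excellent", "Yes"),
    ("High", "Yes", "Fair", "Yes"), ("Medium", "No", "Excellent", "No"),
]


def naive_bayes_scores(rows, query):
    """Unnormalised score per class: P(class) * product of P(feature value | class)."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(class)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / count  # conditional P(value | class), no smoothing
        scores[label] = score
    return scores


scores = naive_bayes_scores(rows, ("Low", "No", "Excellent"))
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

Under these assumptions the score for No (12/350) exceeds the score for Yes (1/42), so the predicted class is No.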

Unit 6

• Define Gini impurity and Entropy impurity. What will their values be for the purest node?
• How would you execute the k-fold cross-validation strategy? Why is the leave-one-out method a
special case of it?
• A confusion matrix for a classification exercise returns the following values: TP = 0.962, TN = 0.93,
FP = 0.12, FN = 0.07. Calculate accuracy, precision, recall, sensitivity, specificity and F-score.
• The confusion matrix for a certain classification activity is as shown in Table no. 2
              Predicted: NO   Predicted: YES
Actual: NO    50              10
Actual: YES   5               100
Find the following classifier performance measures –

1. Accuracy
2. Precision
3. Recall
4. Specificity
5. F-Score
6. Error rate
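
The six measures listed above can be computed directly from the four cells of Table no. 2. The sketch assumes YES is the positive class, so TP = 100, TN = 50, FP = 10 and FN = 5.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard measures derived from a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_score": 2 * precision * recall / (precision + recall),
        "error_rate": (fp + fn) / total,
    }


# Assumes YES is the positive class in Table no. 2.
metrics = classification_metrics(tp=100, tn=50, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to 1, which gives a quick check on the arithmetic.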
• Explain the following methods used for training and testing –
1. Resubstitution
2. K-fold cross-validation
3. Bootstrapping
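
For the k-fold cross-validation items above, a minimal index-splitting sketch is shown below. It uses contiguous folds without shuffling, which is a simplifying assumption; the function name is illustrative. Setting k equal to the number of samples gives the leave-one-out method.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)

# Leave-one-out: k equals the number of samples, so each test fold has one item.
leave_one_out = list(k_fold_indices(n_samples=4, k=4))
```

Each sample appears in exactly one test fold, so every observation is used for testing once and for training k - 1 times.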
