QB - Data Science

Data Science – Question Bank - CVV
Unit 1
• Explain the different stages of data science.

• Explain Raw Data with examples.
• Explain Processed Data with examples.
• Explain Metadata (Code Book) with an example.
• Illustrate Central Limit Theorem with a neat diagram.
• Describe Bayes' theorem in detail with an example.
• Give examples of discrete data and continuous data.
• What are primary and secondary sources of data?
MCQs
1. Which of the following is an example of raw data?

a) original swath files generated from a sonar system
b) initial time-series file of temperature values
c) a real-time GPS-encoded navigation file
d) all of the mentioned
2. Which of the following data is put into a formula to produce commonly accepted results?
a) Raw
b) Processed
c) Synchronized
d) All of the Mentioned
3. Which of the following is another name for raw data?

a) destination data
b) eggy data
c) secondary
d) machine learning
Unit 2
• Differentiate between descriptive statistics and inferential statistics with examples.
• Difference between mean, mode, median
• Difference between standard deviation, variance, range, inter quartile range
• Differentiate between population, sample, parameter, statistic
• Real life examples of normal distribution and binomial distribution
• Differentiate between: Point Estimate, Interval Estimate & Confidence Interval.
• Explain null and alternative hypothesis by considering the example for a flipping coin.
• Explain Type 1 & Type 2 errors in hypothesis testing with suitable examples.
• What is Significance Level? How does it regulate the possibility of occurrence of Type 1 & Type 2
errors?
• Explain p values with example.
• Explain the interrelationship of Margin of Error and Standard Error?
• In the population, the average IQ is 100 with a standard deviation of 15. A team of scientists
wants to test a new medication to see if it has a positive effect, a negative effect, or no effect
at all on intelligence. A sample of 30 participants who have taken the medication has a mean IQ of
140. Did the medication affect intelligence?
• Study the data distribution given in the table and answer the questions below.
Value                                   1  2  3  4  5  6   7   8
Frequency (no. of data points)          1  0  0  3  4  10  12  8
o What is the mean value?
o How would you describe the data distribution? Why?
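The two numerical questions above (the IQ test and the frequency table) can be checked with a short sketch. It assumes a two-sided one-sample z-test, since the population standard deviation is given, and a significance level of 0.05, which the question does not state. All variable names are illustrative.

```python
import math
from statistics import NormalDist, median

# --- IQ question: two-sided one-sample z-test (population sigma is given) ---
pop_mean, pop_sd = 100, 15
n, sample_mean = 30, 140
alpha = 0.05  # assumed significance level; the question does not give one

standard_error = pop_sd / math.sqrt(n)
z = (sample_mean - pop_mean) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value < alpha
print(f"z = {z:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")

# --- Frequency-table question: mean, median and mode from (value, frequency) ---
values = [1, 2, 3, 4, 5, 6, 7, 8]
freqs = [1, 0, 0, 3, 4, 10, 12, 8]

total = sum(freqs)
mean = sum(v * f for v, f in zip(values, freqs)) / total
expanded = [v for v, f in zip(values, freqs) for _ in range(f)]
med = median(expanded)
mode = values[freqs.index(max(freqs))]
print(f"n = {total}, mean = {mean:.2f}, median = {med}, mode = {mode}")
```

Under these assumptions z is about 14.6, far beyond the critical value of 1.96, so the null hypothesis of no effect is rejected. For the table, the mean (about 6.34) lies below the median and mode (both 7) and the tail stretches towards the low values, which points to a negatively (left) skewed distribution.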
Unit 3
• Differentiate between Euclidean distance and Manhattan distance.
• Explain how gradient descent is used to fit parameterized models.
• Constrained optimization vs. unconstrained optimization
• Linear vs. Non Linear optimization
• Discrete optimization vs. Non Discrete optimization
• Explain the concept of Lp norm.
• State the advantages and disadvantages of using L1 norm.
• Illustrate with an example that the L1 metric distance is never smaller than the L2 metric distance.
• Draw a typical Hessian matrix and indicate how it is used in optimization.
• Explain Gradient Descent.
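As a rough illustration of two Unit 3 topics, the sketch below computes Lp (Minkowski) distances for p = 1 and p = 2, and fits a straight line by gradient descent on the mean squared error. The points, the data set, the learning rate and the step count are all hypothetical choices made for the example.

```python
def lp_distance(a, b, p):
    """Minkowski (Lp) distance between two equal-length points, for p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


p1, p2 = (1, 2), (4, 6)            # hypothetical points
manhattan = lp_distance(p1, p2, 1)  # |1 - 4| + |2 - 6|
euclidean = lp_distance(p1, p2, 2)  # sqrt(3**2 + 4**2)
print(manhattan, euclidean)


def fit_line_gd(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= lr * grad_w  # step against the gradient
        b -= lr * grad_b
    return w, b


xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]  # hypothetical data lying on y = 2x + 1
w, b = fit_line_gd(xs, ys)
print(round(w, 4), round(b, 4))
```

For these points the L1 distance is 7 and the L2 distance is 5. The two distances are equal only when the points differ in a single coordinate. The gradient descent loop recovers a slope close to 2 and an intercept close to 1.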
Unit 4
• What is machine learning? What is its role in data Science?
• Explain supervised and unsupervised machine learning?
• Applications of classifications in real life situations
• Regression vs. Clustering vs. Classification
• Explain the standard errors of regression coefficients.

• What is the significance of R² in regression? If a regression activity returns R² as 0.9354, what is
your interpretation of the same?
• Which regression is used in modeling of sensor characteristics?
• What do you understand by Logistic regression? What are dichotomous variables in the context of
Logistic regression?
• Explain dichotomous variables in the context of Logistic regression using suitable examples.
• What do you mean by interpretation of beta coefficients? Explain with examples.
• A certain regression yielded the following scores: SSR = 20.24 and SSE = 2.11. What is the value of
R²? Based on the value of R², comment on the relationship between the variables.
• Why do we measure the impurity of a resulting node in a Decision Tree? List the different measures
of impurity used in Decision Trees.
• There are 4 coins A, B, C and D out of which 3 coins are of equal weight and one coin is heavier.
Find out the heavier coin using Decision Tree.
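The R² question above (SSR: 20.24, SSE: 2.11) can be checked with a few lines. The sketch assumes SSR is the regression (explained) sum of squares and SSE the error (residual) sum of squares, so that the total is SSR + SSE. Some textbooks use SSR for the residual sum of squares, in which case the formula changes.

```python
ssr = 20.24  # assumed: regression (explained) sum of squares
sse = 2.11   # assumed: error (residual) sum of squares

sst = ssr + sse        # total sum of squares under this convention
r_squared = ssr / sst  # share of the total variation explained by the model
print(f"SST = {sst:.2f}, R^2 = {r_squared:.4f}")
```

Under this convention R² is about 0.906, meaning roughly 90.6% of the variation in the response is explained by the regression, which suggests a strong relationship between the variables.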
Unit 5
• Compare and contrast the Divisive and Agglomerative clustering algorithms.
• How does the KNN algorithm make predictions on an unseen dataset?
• Is Feature Scaling required for the KNN Algorithm? Explain with proper justification.
• Concept of Decision Tree
• K Mean clustering
• KNN
• Define Gini impurity and Entropy impurity. What will their values be for the purest node?
• Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2,
10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). Initial cluster centers are:
A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b =
(x2, y2) is defined as ρ(a, b) = |x2 - x1| + |y2 - y1|. Use the K-Means algorithm to find the three
cluster centers after the second iteration.
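The two iterations asked for in the question above can be traced with a minimal sketch. It assumes the usual K-Means update, in which each new center is the coordinate-wise mean of its members, combined with the Manhattan distance defined in the question. The function names are illustrative.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}


def manhattan(a, b):
    """Distance from the question: |x2 - x1| + |y2 - y1|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def kmeans_step(points, centers):
    """Assign each point to its nearest center, then recompute the centers."""
    clusters = [[] for _ in centers]
    for p in points.values():
        distances = [manhattan(p, c) for c in centers]
        clusters[distances.index(min(distances))].append(p)
    new_centers = []
    for members, old in zip(clusters, centers):
        if members:  # assumed update rule: coordinate-wise mean of the members
            xs = [x for x, _ in members]
            ys = [y for _, y in members]
            new_centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
        else:        # keep an empty cluster's center unchanged
            new_centers.append(old)
    return new_centers


centers = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7
for iteration in (1, 2):
    centers = kmeans_step(points, centers)
    print(iteration, centers)
```

With these assumptions the centers after the first iteration are (2, 10), (6, 6) and (1.5, 3.5), and after the second iteration (3, 9.5), (6.5, 5.25) and (1.5, 3.5).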
• Apply KNN and predict the class for the test point (3,7) for k=3. Training points with class are
(x,y,class). (7,7,2), (7,4,2), (3,4,1), (1,4,1), (2,5,2), (3,8,1)
• Use the K-Means algorithm to create two clusters. Assume A(2, 2) and C(1, 1) are centers of the
two clusters.
• Marks scored by 10 students in mathematics and computer science are given in table below along
with their result as Pass or Fail. Pappu scores 41 marks in mathematics and 38 marks in computer
science. Using KNN classifier algorithm, determine whether Pappu has passed or failed using K as
1,2,3,5 and 7.
Student    Mathematics  Computer Science  Result
Naren      80           80                Pass
Amit       75           40                Pass
Deven      65           50                Pass
Surya      40           40                Pass
Sanjay     70           40                Pass
Teja       65           37                Fail
Akhilesh   70           25                Fail
Sharad     38           38                Fail
Ajit       35           59                Fail
Shivraj    70           65                Pass
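The KNN question above can be worked through with a small sketch. It assumes Euclidean distance, which the question does not specify, and breaks voting ties in favour of the label of the closest tied neighbour. Both choices are assumptions and `knn_predict` is an illustrative name.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote over the k nearest (x, y, label) rows, Euclidean distance."""
    nearest = sorted(train, key=lambda row: math.dist(row[:2], query))[:k]
    votes = Counter(label for _, _, label in nearest)
    top = max(votes.values())
    # Assumed tie-break: among tied labels, take the one seen first (closest).
    return next(label for _, _, label in nearest if votes[label] == top)


students = [
    (80, 80, "Pass"),  # Naren
    (75, 40, "Pass"),  # Amit
    (65, 50, "Pass"),  # Deven
    (40, 40, "Pass"),  # Surya
    (70, 40, "Pass"),  # Sanjay
    (65, 37, "Fail"),  # Teja
    (70, 25, "Fail"),  # Akhilesh
    (38, 38, "Fail"),  # Sharad
    (35, 59, "Fail"),  # Ajit
    (70, 65, "Pass"),  # Shivraj
]
pappu = (41, 38)  # (Mathematics, Computer Science)

predictions = {k: knn_predict(students, pappu, k) for k in (1, 2, 3, 5, 7)}
print(predictions)
```

Under these assumptions Pappu is classified as Pass for K = 1 (Surya is the nearest neighbour), the vote is tied for K = 2 (Surya: Pass, Sharad: Fail) and resolves to Pass by the assumed tie-break, and the prediction is Fail for K = 3, 5 and 7. The same function applied to the earlier test point (3, 7) with k = 3 gives class 1.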
• Using the Naïve Bayes Classifier approach and the training data set given in the table,
predict Class = Buy Laptop: Yes or No for the feature set: {Income = Low; Student =
No; Credit Rating = Excellent}
Sr. No.  Income  Student  Credit Rating  Buy Laptop
1        High    No       Fair           No
2        High    No       Excellent      No
3        High    No       Fair           Yes
4        Medium  No       Fair           Yes
5        Low     Yes      Fair           Yes
6        Low     Yes      Excellent      No
7        Low     Yes      Excellent      Yes
8        Medium  No       Fair           No
9        Low     Yes      Fair           Yes
10       Medium  Yes      Fair           Yes
11       Medium  Yes      Excellent      Yes
12       Medium  No       Excellent      Yes
13       High    Yes      Fair           Yes
14       Medium  No       Excellent      No
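The Naïve Bayes question above can be checked by counting directly from the table. The sketch below assumes plain relative-frequency estimates without Laplace smoothing. The variable names are illustrative.

```python
from collections import Counter

# (Income, Student, Credit Rating, Buy Laptop), rows 1 to 14 of the table
rows = [
    ("High", "No", "Fair", "No"),
    ("High", "No", "Excellent", "No"),
    ("High", "No", "Fair", "Yes"),
    ("Medium", "No", "Fair", "Yes"),
    ("Low", "Yes", "Fair", "Yes"),
    ("Low", "Yes", "Excellent", "No"),
    ("Low", "Yes", "Excellent", "Yes"),
    ("Medium", "No", "Fair", "No"),
    ("Low", "Yes", "Fair", "Yes"),
    ("Medium", "Yes", "Fair", "Yes"),
    ("Medium", "Yes", "Excellent", "Yes"),
    ("Medium", "No", "Excellent", "Yes"),
    ("High", "Yes", "Fair", "Yes"),
    ("Medium", "No", "Excellent", "No"),
]
query = ("Low", "No", "Excellent")  # Income, Student, Credit Rating

class_counts = Counter(row[-1] for row in rows)
scores = {}
for label, count in class_counts.items():
    score = count / len(rows)  # prior P(class)
    for i, value in enumerate(query):
        matches = sum(1 for r in rows if r[-1] == label and r[i] == value)
        score *= matches / count  # P(feature = value | class), no smoothing
    scores[label] = score

prediction = max(scores, key=scores.get)
print(scores, prediction)
```

With these estimates the unnormalised scores are about 0.0238 for Yes (9/14 × 3/9 × 3/9 × 3/9) and about 0.0343 for No (5/14 × 1/5 × 4/5 × 3/5), so the predicted class is Buy Laptop = No.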
Unit 6
• Confusion Matrix – Accuracy, Precision, Recall, Specificity.

• Explain specificity and sensitivity in the context of an ROC curve.
• How do you determine if a classifier is better using an ROC curve?
• How would you execute the k-fold cross-validation strategy? Why is the leave-one-out method a
special case of it?
• A confusion matrix for a classification exercise returns the following values: TP:0.912, TN:0.93,
FP:0.12, FN:0.05. Calculate accuracy, precision, recall, sensitivity, specificity and f-score.
• The confusion matrix for a certain classification activity is as shown in Table no. 2
             Predicted: NO  Predicted: YES
Actual: NO   50             10
Actual: YES  5              100
Find the following classifier performance measures –
1. Accuracy
2. Precision
3. Recall
4. Specificity
5. F-Score
6. Error rate
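The six measures listed above follow directly from the four cells of the confusion matrix. The sketch assumes YES is the positive class, which the table implies but does not state.

```python
# Cells of the confusion matrix, taking YES as the (assumed) positive class
tn, fp = 50, 10   # Actual: NO  row
fn, tp = 5, 100   # Actual: YES row

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # also called sensitivity
specificity = tn / (tn + fp)
f_score = 2 * precision * recall / (precision + recall)
error_rate = (fp + fn) / total

for name, value in [
    ("Accuracy", accuracy), ("Precision", precision), ("Recall", recall),
    ("Specificity", specificity), ("F-Score", f_score), ("Error rate", error_rate),
]:
    print(f"{name}: {value:.4f}")
```

Under this assumption the values are approximately: accuracy 0.9091, precision 0.9091, recall 0.9524, specificity 0.8333, F-score 0.9302 and error rate 0.0909.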
• Explain the following methods used for training and testing –
1. Resubstitution
2. K fold Cross-validation
3. Bootstrapping
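To make the k-fold idea concrete, the following sketch splits sample indices into k folds and uses each fold once as the test set. It is a simplified illustration: it does not shuffle the data first, which is usually done in practice, and `k_fold_indices` is a hypothetical helper name. Setting k equal to the number of samples gives the leave-one-out method.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(6, 3))  # 3-fold split of 6 samples
loo = list(k_fold_indices(6, 6))    # k = n: leave-one-out
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, and the model would be trained and evaluated k times, with the k scores averaged. In the leave-one-out case every test fold holds a single sample.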
