DMDW Question Bank

1. What is Data Mining?

2. What is Supervised and Unsupervised Learning?
3. What are the different tasks of Data Mining?
4. Discuss the life cycle of Data Mining projects.

5. Explain the process of KDD.
6. What is Prediction?
7. What are the different fields where data mining is used? Explain any one field in detail.

8. Name some data mining tools. Which of them are best suited to data analysis?
9. What do you understand by data aggregation and data generalization?

10. What do you understand by a model in Data Mining?
11. What is the difference between univariate, bivariate, and multivariate analysis?
12. What is Visualization?

13. What is Data Preprocessing? What preprocessing steps do you know?
14. What is the difference between Data Processing and Data Mining?
15. What is Data Binning?
16. What's the difference between Feature Engineering and Feature Selection?
17. What's the difference between Covariance and Correlation?
18. What is Cross-Validation and why is it important in supervised learning?
19. Explain what you understand by Dimensionality Reduction.
20. Name some benefits of Feature Selection.
21. Should your Test Data be Cleaned the same way that the Training Data is?
22. What is the difference between Normalization and Scaling?
23. What is the Importance of Data Reduction?
24. How can less Training Data give Higher Accuracy?
25. Why is data more sparse in a high-dimensional space?
26. What is the difference between Test Set and Validation Set?
27. How would you determine the needed Sample Size?
28. What are some common steps in Data Cleaning?
29. What Distance Function do you use for Quantitative Data?
30. How do you Normalize Real-Time Data?
31. What is Normalization of data? Can you use different Normalization methods on
different features?
32. Explain the two types of Data Reduction Algorithms.
33. When would you use Equal Frequency Binning and when would you use Equal Width
Binning?
34. How do you cope with Missing data in Regression?
35. Are there any problems with splitting data randomly into Training, Validation,
and Test datasets?

36. How would you deal with Outliers in your dataset?
37. How would you use a Confusion Matrix to determine a model's performance?

38. When would you use the Chi-Square test?
39. How would you handle Missing Data?
40. How could I (statistically) find features that are more important than others?

41. Differentiate between Data Mining and Data Warehousing.
42. What are Cubes?

43. What are the differences between OLAP and OLTP? [IMP]
44. What is the difference between variance and covariance?
45. Why should we use data warehousing and how can you extract data for analysis?
46. What are the different storage models available in OLAP?

47. What are Discrete and Continuous data in Data Mining?
48. What is a Dimension Table?
49. What is a Fact Table?
50. What is ETL?

51. What is a Data Mart?
52. What are the key columns in Fact and Dimension tables?
53. What is Star Schema? What is Snowflake Schema?
54. What is a core dimension?
55. What are the steps to build a data warehouse?


56. What are the parameters by which you can evaluate a classifier? Explain.
57. What are the methods to validate a classifier?
58. What are the techniques to select features?
59. What is data normalization? How can you normalize data? Explain with an example.
60. What is Bayes classification? Explain with an example.
61. What are the applications of the Bayes classifier?
62. What is a nearest-neighbour-based classifier?
63. What is a k-nearest-neighbour-based classifier?
64. What is clustering? Why is clustering important?
65. What is supervised / unsupervised learning or classification?
66. Explain k-means clustering.
67. What is hierarchical clustering?
68. Explain the steps of the k-Means Clustering Algorithm.
69. What are some Stopping Criteria for k-Means Clustering?

70. While performing K-Means Clustering, how do you determine the value of K?

71. What is the DBSCAN Algorithm?
72. Explain the DBSCAN Algorithm step by step.
73. Which is the most widely used Density-based Clustering Algorithm?

74. What is Density-based Clustering?
75. Explain the Input parameters given to the DBSCAN Algorithm.
76. What are density reachability and density connectivity?

77. Explain the following terms related to the DBSCAN Algorithm:
• Directly Density Reachable
• Density Reachable
• Density Connected
78. What are the advantages of the DBSCAN density-based Clustering Algorithm?

79. What are the disadvantages of the DBSCAN density-based Clustering Algorithm?
80. What is a Decision Tree?


81. For which kinds of problems are decision trees most suitable?
82. On what basis is an attribute selected as a node in the decision tree?
83. What is Information Gain? What are its disadvantages?
84. Name some algorithms used for deriving decision trees.
85. How are outliers detected?
86. What is Classification?
87. What are ‘Training set’ and ‘Test set’?
88. Explain the Decision Tree Classifier.
89. What are the advantages of a decision tree classifier?
90. Explain Bayes classification with an example.
91. Why is KNN preferred when determining missing numbers in data?
92. Explain Over-fitting.
93. Define Tree Pruning.
94. What is Classification Accuracy?
95. What are precision and recall?
96. What is Naive Bayes? How does Naive Bayes work?
97. What are some benefits of Naive Bayes?
98. What are the applications of Naive Bayes?
99. What is posterior probability and prior probability in Naive Bayes?
100. State the difference between classification and clustering.
101. How does clustering help in data reduction?

102. Define the terms: frequent itemsets, patterns, and market basket analysis.

103. Illustrate market basket analysis.
104. What is meant by association rule mining? Explain the process of association rule mining,
using an example.
105. Explain the steps in the Apriori algorithm used for frequent pattern mining with the help of an
example.
106. Write the Apriori algorithm for frequent pattern mining.

107. Explain the procedure for generating strong association rules from frequent itemsets.
108. Explain the different algorithms for improving the efficiency of the Apriori algorithm.
109. Explain a method for finding frequent patterns without candidate generation.

110. Write and explain the FP-growth algorithm.
111. Explain the method of using the vertical data format for generating frequent itemsets.
112. What is a closed frequent itemset? Explain the approaches to mining closed frequent
itemsets.
113. Explain the mining of multilevel association rules using the top-down approach, with a
suitable example.
114. Explain the variations of the top-down approach in multilevel association rule
mining.
115. Explain multidimensional association rules using suitable examples.
116. What is a categorical attribute? What is a quantitative attribute?
117. Explain the approaches for categorizing the techniques for mining quantitative
attributes in multidimensional association rules.
118. On which kind of data is Text Mining performed?
119. What does TF-IDF do?
120. What is removing words like “and”, “is”, “a”, “an”, “the” from a sentence called?
121. The process of deriving high quality information from text is referred to as ________.
122. The various aspects of text mining are ____________.
123. ________ is fundamentally defining unstructured data to structured data and applying
text.
124. A structured and annotated text dataset that you can import directly into your program to apply
text mining operations is referred to as _______.
125. Bag of words is referred to as ________.
126. Machine learning algorithms cannot work with raw text directly; the text must be
converted into numbers, specifically vectors of numbers. This is called _________.
127. For a very large corpus, the length of the vector might be thousands or millions of
positions, and each document may contain very few of the known words in the vocabulary.
This results in a vector with lots of zero scores, called a ________.
128. Creating a vocabulary of two-word pairs is, in turn, called a _________ model.