Data Analytics Questions
1. If a pair of six-sided dice is rolled, what is the probability that both dice show a 1? - 1/36
2. What is the primary function of transfer RNA in the process of protein synthesis? - It carries amino acids to the ribosome and matches them with the mRNA codon during translation
3. When is an event A independent of itself? - Only when P(A) = 0 or P(A) = 1
4. Which language is most important for data science? - Python
5. What is the denominator of the z-score formula? - The sample standard deviation
6. In the regression equation y = 65.57 + 0.50x, what is the intercept? - The predicted value of y when x is 0
7. What is the purpose of performing cross-validation? - To estimate how well the model will generalize to new, unseen data
8. Where is logistic regression used? - To predict the risk of developing a given disease
9. Which data mining technique is used to uncover patterns in data? - Clustering
10. Which standard probability density function is applicable to a discrete random variable? - Poisson distribution
11. Which method does data analysis use to get insights from data? - Machine learning
12. The branch of statistics that deals with the development of statistical methods is classified as - Mathematical statistics
13. Linear regression is a supervised machine learning model in which the model finds the best fit between the independent and dependent variables - True
14. What are the types of linear regression? - Simple linear regression and multiple linear regression
15. Justify: linear regression analysis is used to predict the value of a variable based on the value of another variable - True
16. The process of quantifying data is referred to as - Quantitative analysis
17. Justify: text analysis is also referred to as text mining - True
18. A scatter plot is used when we want to visually examine the relationship between two quantitative variables - True
19. A graph that uses vertical bars to represent data is called a - Bar graph
20. Data analysis is the process of - Analysing datasets in order to derive conclusions about the information contained within them
21. What is a hypothesis? - A proposed explanation for a phenomenon
22. Linear regression models are relatively simple and provide an easy-to-interpret mathematical formula that can generate - Predictions
23. The alternative hypothesis is also called the - Research hypothesis
24. If the null hypothesis is false, which hypothesis is accepted? - The alternative hypothesis
25. Justify: the mean squared error is a measure of the average of the sequence of the residuals - False (it is the average of the squares of the residuals)
26. Which equation is used to find the probability of event = success and event = failure? - Logistic regression
1. Describe why SVMs offer more accurate results than Logistic Regression.
- SVMs try to maximize the margin between the closest support vectors, whereas logistic regression maximizes the posterior class probability.
- LR is used for solving classification problems, while the SVM model is used for both classification and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.
- Probability Distributions
A probability distribution is a statistical function that describes all the possible values and probabilities for a random variable within a given range. This range is bounded by the minimum and maximum possible values, but where a possible value falls on the probability distribution is determined by a number of factors such as the mean (average), standard deviation, skewness, and kurtosis. The two types are: Discrete Probability Distributions and Continuous Probability Distributions.
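The two types above can be contrasted with a minimal sketch (the sample sizes and parameters are illustrative, not from the notes): a discrete distribution assigns probability to countable outcomes, while a continuous one describes a density over a range.

```python
import random
import statistics

rng = random.Random(42)

# Discrete: outcomes of a fair six-sided die, P(k) = 1/6 for k = 1..6.
die_rolls = [rng.randint(1, 6) for _ in range(10000)]

# Continuous: a normal (Gaussian) distribution with mean 0 and sd 1.
normal_draws = [rng.gauss(0.0, 1.0) for _ in range(10000)]

print(statistics.mean(die_rolls))    # close to the theoretical mean 3.5
print(statistics.mean(normal_draws)) # close to 0
```

Either way, the sample mean approaches the distribution's mean as the number of draws grows, which is what lets us reason from samples back to the underlying distribution.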
- Entropy
Entropy measures the amount of surprise, or information, present in a variable. In information theory, a random variable's entropy reflects the average uncertainty level in its possible outcomes. Events with higher uncertainty have higher entropy.
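The definition above is the Shannon entropy, H = -sum(p * log2 p); a minimal sketch (the function name is ours) shows that a fair coin, being maximally uncertain over two outcomes, has the highest entropy:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1 bit, maximal for two outcomes
print(entropy([0.9, 0.1]))  # biased coin: less surprising, lower entropy
print(entropy([1.0]))       # certain outcome: no surprise, 0 bits
```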
3. Identify the difference between Active Learning and Reinforcement Learning. Explain with suitable examples and a diagram.
Active learning is based on the concept that if a learning algorithm can choose the data it wants to learn from, it can perform better than traditional methods with substantially less training data. It is a kind of semi-supervised machine learning.
[Diagram: the active learner sends a query to the world and uses the response to update the classifier/model]
Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. It is based on a rewards-and-punishments mechanism, which can be both active and passive.
[Diagram: the agent acts on the environment and receives state and reward feedback]
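The trial-and-error loop can be sketched with a two-armed bandit (a deliberately minimal example; the arm payoffs and epsilon value are illustrative): the agent knows nothing about the rewards at first, but by acting, observing feedback, and updating its value estimates, it comes to prefer the better action.

```python
import random

def run_bandit(steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    true_means = [0.3, 0.7]          # hidden reward probabilities
    estimates = [0.0, 0.0]           # agent's learned value per action
    counts = [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:   # explore: try a random action
            a = rng.randrange(2)
        else:                        # exploit: pick the best-looking action
            a = 0 if estimates[0] >= estimates[1] else 1
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        # Incremental running mean of the rewards seen for action a.
        estimates[a] += (reward - estimates[a]) / counts[a]
    return estimates

est = run_bandit()
print(est)  # estimate for action 1 ends up higher, matching its true payoff
```

Note the contrast with active learning: here the feedback is a scalar reward from the environment, not a label requested from an oracle.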
- A one-way ANOVA is primarily designed to enable equality testing between three or more means. A two-way ANOVA is designed to assess the interrelationship of two independent variables on a dependent variable.
- A one-way ANOVA involves only one factor or independent variable, whereas there are two independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- A one-way ANOVA needs to satisfy only two principles of the design of experiments, i.e., replication and randomization, as opposed to a two-way ANOVA, which meets all three principles of the design of experiments: replication, randomization, and local control.
Big Data is a massive amount of data sets that cannot be stored, processed, or analysed using
traditional tools. There are millions of data sources that generate data at a very rapid rate. These data
sources are present across the world. Some of the largest sources of data are social media platforms
and networks. Let’s use Facebook as an example—it generates more than 500 terabytes of data every
day. This data includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data, and unstructured data.
For example, in a regular Excel sheet, data is classified as structured data—with a definite format. In
contrast, emails fall under semi-structured, and your pictures and videos fall under unstructured data.
All this data combined makes up Big Data.
The activation function in Neural Networks takes an input 'x' multiplied by a weight 'w'. Bias allows
you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in Neural
Networks can be thought of as analogous to the role of a constant in a linear function, whereby the
line is effectively transposed by the constant value.
With no bias, the input to the activation function is 'x' multiplied by the connection weight 'w0'.
In a scenario with bias, the input to the activation function is 'x' times the connection weight 'w0'
plus the bias times the connection weight for the bias 'w1'. This has the effect of shifting the
activation function by a constant amount (b * w1).
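A minimal sketch of this shift (the function and weight names follow the text; the example values are ours): at x = 0 an unbiased sigmoid neuron is pinned to 0.5, while the bias term b * w1 moves the input away from zero and shifts the output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w0, w1=0.0, b=0.0):
    # Input to the activation: x*w0 without bias, x*w0 + b*w1 with bias.
    return sigmoid(x * w0 + b * w1)

print(neuron(0.0, 2.0))                  # no bias: output is 0.5 at x = 0
print(neuron(0.0, 2.0, w1=-4.0, b=1.0))  # bias shifts the output toward 0
```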
Several approaches to clustering exist. Each approach is best suited to a particular data distribution.
Focusing on centroid-based clustering using k-means :
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical
clustering defined below. k-means is the most widely-used centroid-based clustering algorithm.
Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
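The two alternating steps of k-means can be sketched in pure Python on one-dimensional points (an assumption for brevity; real use would rely on a library such as scikit-learn): assign each point to its nearest centroid, then move each centroid to its cluster's mean, and repeat.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)     # random initial centroids
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid moves to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans(data, 2))  # centroids settle near the two groups, 1.0 and 10.0
```

The sensitivity to initial conditions mentioned above is visible here: a different seed changes the starting centroids, which on harder data can lead to a different final clustering.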
8. Write a short note on Deep Learning.
Deep learning is a branch of machine learning which is based on artificial neural networks (ANNs), also known as deep neural networks (DNNs). It is capable of learning complex patterns and relationships within data; in deep learning, we do not need to explicitly program everything. It has become increasingly popular in recent years due to advances in processing power and the availability of large datasets. These neural networks are inspired by the structure and function of the human brain's biological neurons, and they are designed to learn from large amounts of data.
9. Explain ANOVA.
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups or treatments. It is often used to determine whether
there are any statistically significant differences between the means of different groups.
ANOVA compares the variation between group means to the variation within the groups. If the variation
between group means is significantly larger than the variation within groups, it suggests a
significant difference between the means of the groups.
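The comparison of between-group to within-group variation is the F statistic, F = (SS_between / (k - 1)) / (SS_within / (n - k)). A minimal sketch on made-up group data (the numbers are illustrative):

```python
import statistics

def one_way_anova_f(groups):
    n = sum(len(g) for g in groups)       # total observations
    k = len(groups)                       # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares, k - 1 degrees of freedom.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares, n - k degrees of freedom.
    ss_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[2.0, 3.0, 2.5], [6.0, 7.0, 6.5], [10.0, 11.0, 10.5]]
print(one_way_anova_f(groups))  # large F: group means differ markedly
```

Here the group means (2.5, 6.5, 10.5) are far apart relative to the tiny spread inside each group, so F is large and the difference between means is significant.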
Quadratic Discriminant Analysis (QDA) is the general form of Bayesian discrimination. Discriminant analysis is used to determine which variables discriminate between two or more naturally occurring groups. The difference from LDA is that QDA relaxes the assumption that the mean and covariance of all the classes are equal.
Working :
QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of observations. QDA is particularly useful if there is prior knowledge that individual classes exhibit distinct covariances.
QDA assumes that the observations of each class are drawn from a normal distribution (similar to linear discriminant analysis).
QDA assumes that each class has its own covariance matrix (different from linear discriminant analysis).
When the decision boundaries are linear, linear discriminant analysis and logistic regression perform well; the advantage of quadratic discriminant analysis is that it can also model non-linear (quadratic) decision boundaries between classes.
11. Describe Probability Distribution in details.
Probability distributions enable us to analyze data and draw meaningful conclusions by describing the
likelihood of different outcomes or events.
In statistical analysis, these distributions play a pivotal role in parameter estimation, hypothesis
testing, and data inference. They also find extensive use in risk assessment, particularly in finance and
insurance, where they help assess and manage financial risks by quantifying the likelihood of various
outcomes.
Examples: descriptive statistics include the mean, median, mode, range, and variance; common probability distributions include the normal, exponential, binomial, and Poisson distributions.
Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is a
mathematical representation (or mathematical model) of observed data.
When data analysts apply various statistical models to the data they are investigating, they are able
to understand and interpret the information more strategically. Rather than sifting through the raw
data, this practice allows them to identify relationships between variables, make predictions about
future sets of data, and visualize that data so that non-analysts and stakeholders can consume and
leverage it.
SVM performs and generalizes well on out-of-sample data. Note, however, that to classify a single sample, the kernel function must be evaluated for each and every support vector, so prediction speed depends on the number of support vectors.
Data analytics is the process of examining, cleaning, transforming, and interpreting data to extract valuable insights and support decision-making. The main characteristics of data analytics are:
● Data collection from various sources.
● Data cleaning and preprocessing for accuracy.
● Descriptive, diagnostic, predictive, and prescriptive analytics.
● Data visualization for better understanding.
● Use of machine learning and AI.
● Handling big data and real-time analytics.
● Emphasis on data security and privacy.
● An iterative process to refine insights.
● Domain-specific and fosters a data-driven culture
Class Test II
Short Answer Type Questions
Goal: to make data usable and to support data-driven decisions.
2. Can multiple one-way repeated-measures ANOVAs be expressed as a two-way design?
- A one-way ANOVA is primarily designed to enable equality testing between three or more means. A two-way ANOVA is designed to assess the interrelationship of two independent variables on a dependent variable.
- A one-way ANOVA involves only one factor or independent variable, whereas there are two independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- A one-way ANOVA needs to satisfy only two principles of the design of experiments, i.e., replication and randomization, as opposed to a two-way ANOVA, which meets all three principles of the design of experiments: replication, randomization, and local control.
Overfitting : Overfitting occurs when the model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts caching the noise and inaccurate values present in the dataset, and all of these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance. Some ways by which the occurrence of overfitting can be reduced are : cross-validation, training with more data, removing features, early stopping of training, regularization, etc.
Underfitting : Underfitting occurs when the model is not able to capture the underlying trend of the data. When, to avoid overfitting, the feeding of training data is stopped too early, the model is not able to learn enough from the training data, which causes underfitting. This reduces the accuracy and produces unreliable predictions. An underfitted model has high bias and low variance. We can avoid underfitting by increasing the training time of the model and the number of features.
[Diagram: overfitted fit vs underfitted fit]
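The bias/variance contrast can be sketched on toy data (the dataset and the two extreme models are illustrative): a constant model underfits a linear trend, while a model that simply memorizes the training points gets zero training error but does worse on unseen test points.

```python
import random

# Noisy linear trend y = 2x + noise; test points fall between train points.
rng = random.Random(1)
train = [(x, 2 * x + rng.gauss(0, 0.5)) for x in range(10)]
test = [(x + 0.5, 2 * (x + 0.5) + rng.gauss(0, 0.5)) for x in range(10)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit: a constant model ignores the trend (high bias, low variance).
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfit: memorize the training set (low bias, high variance); predict
# the y of the nearest memorized x for any input.
overfit = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

print(mse(underfit, train), mse(underfit, test))  # large error everywhere
print(mse(overfit, train), mse(overfit, test))    # 0 on train, worse on test
```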
- SVMs try to maximize the margin between the closest support vectors, whereas logistic regression maximizes the posterior class probability.
- LR is used for solving classification problems, while the SVM model is used for both classification and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.
The activation function in Neural Networks takes an input 'x' multiplied by a weight 'w'. Bias allows
you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in Neural
Networks can be thought of as analogous to the role of a constant in a linear function, whereby the
line is effectively transposed by the constant value.
With no bias, the input to the activation function is 'x' multiplied by the connection weight 'w0'.
In a scenario with bias, the input to the activation function is 'x' times the connection weight
'w0' plus the bias times the connection weight for the bias 'w1'. This has the effect of
shifting the activation function by a constant amount (b * w1).
Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and
presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's main
features and characteristics without making any generalizations or inferences to a larger population.
The primary goal of descriptive statistics is to provide a clear and concise summary of the data,
enabling researchers or analysts to gain insights and understand patterns, trends, and distributions
within the dataset. This summary typically includes measures such as central tendency (e.g., mean,
median, mode), dispersion (e.g., range, variance, standard deviation), and shape of the distribution
(e.g., skewness, kurtosis).
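The central-tendency and dispersion measures listed above can be computed with Python's standard `statistics` module; the sample data here is made up for illustration.

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 4, 8]

print(statistics.mean(data))       # central tendency: mean
print(statistics.median(data))     # central tendency: median
print(statistics.mode(data))       # central tendency: mode
print(max(data) - min(data))       # dispersion: range
print(statistics.pvariance(data))  # dispersion: population variance
print(statistics.pstdev(data))     # dispersion: population std deviation
```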
8. Applications of Clustering.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are :
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits
are divided into several groups with similar properties.
9. How can the initial number of clusters for k-means algorithm be estimated? Give example.
Elbow Method : It is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS (Within Cluster Sum of Squares) value, which defines the total
variations within a cluster.
Gap Statistic Method : It compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from a random uniform distribution of points.
Silhouette Approach : It measures the quality of a clustering, i.e., it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
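The elbow method's WCSS quantity can be sketched directly (the 1-D data and the hand-picked partitions are illustrative; a real run would produce the partitions with k-means for each k): WCSS falls sharply until k matches the true number of groups, then levels off.

```python
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # three clear groups

def wcss(clusters):
    # Within Cluster Sum of Squares: squared distance to each cluster mean.
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total

# Partitions for k = 1, 2, 3 (hand-picked here instead of running k-means).
partitions = {
    1: [data],
    2: [data[:3], data[3:]],
    3: [data[:3], data[3:6], data[6:]],
}
for k in sorted(partitions):
    print(k, round(wcss(partitions[k]), 2))  # big drop until k = 3: the elbow
```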
10. Describe data cleansing. What are the best ways to practice data cleansing?
Data Cleansing, also called Data Scrubbing, is the first step of data preparation. It is the process of finding and correcting or removing incorrect, incomplete, inaccurate, or irrelevant data in the dataset. If the data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.
11. Define the Complexity theory of Map Reduce. What is the reduce size of Map Reduce?
In the context of MapReduce, complexity theory refers to the analysis of the computational complexity of
algorithms and problems when using the MapReduce programming model.
Time Complexity :
Map Phase : If n is the number of input records and m is the size of the input data, the time complexity of
the Map phase is typically O(n+m).
Shuffle and Sort : The time complexity depends on the efficiency of the shuffling mechanism and the size
of the data being shuffled.
Reduce Phase : If r is the number of reducer nodes and k is the number of unique keys, the time
complexity of the Reduce phase is often O(r+k).
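The three phases can be sketched as a single-process word count (a toy stand-in for a real distributed run; the records are made up): Map emits key-value pairs, Shuffle groups values by key, and Reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle and sort: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values into one result.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data big insight", "data data pipeline"]
print(reduce_phase(shuffle(map_phase(records))))
```

In a real cluster, the map and reduce functions run in parallel on many nodes, which is where the O(n + m) and O(r + k) costs above come from.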
Fundamental challenges
- Storage
- Processing
- Security
- Finding and Fixing Data Quality Issues
- Evaluating and Selecting Big Data Technologies
- Data Validation
- Scaling Big Data Systems
13. What are the different hierarchical methods for cluster analysis?
- Agglomerative Clustering :
Agglomerative clustering is a bottom-up approach. It starts by treating each individual data point as a single cluster, then merges clusters continuously based on similarity until one big cluster containing all objects is formed. It is good at identifying small clusters.
- Divisive Clustering :
Divisive clustering works in just the opposite way to agglomerative clustering. It starts by considering all the data points as one big cluster and then splits them into smaller heterogeneous clusters continuously until every data point is in its own cluster. Divisive methods are therefore good at identifying large clusters. Divisive clustering follows a top-down approach and can be more efficient than agglomerative clustering, but, due to its implementation complexity, it has no predefined implementation in any of the major machine learning frameworks.
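The bottom-up merging can be sketched on 1-D points with single linkage (nearest-member distance between clusters); this is an illustrative toy, not a library API: start with every point in its own cluster and repeatedly merge the closest pair until the desired number of clusters remains.

```python
def agglomerative(points, n_clusters):
    clusters = [[p] for p in points]  # each point starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return [sorted(c) for c in clusters]

print(agglomerative([1.0, 1.1, 5.0, 5.2, 9.0], 2))
```

A divisive method would run the same idea in reverse: start from one all-inclusive cluster and pick the best split at each step.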
14. State the Quadratic Discriminant Analysis
Quadratic Discriminant Analysis (QDA) is the general form of Bayesian discrimination. Discriminant analysis is used to determine which variables discriminate between two or more naturally occurring groups. The difference from LDA is that QDA relaxes the assumption that the mean and covariance of all the classes are equal.
Working :
QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of observations. QDA is particularly useful if there is prior knowledge that individual classes exhibit distinct covariances.
QDA assumes that the observations of each class are drawn from a normal distribution (similar to linear discriminant analysis).
QDA assumes that each class has its own covariance matrix (different from linear discriminant analysis).
When the decision boundaries are linear, linear discriminant analysis and logistic regression perform well; the advantage of quadratic discriminant analysis is that it can also model non-linear (quadratic) decision boundaries between classes.
Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is a
mathematical representation (or mathematical model) of observed data.
When data analysts apply various statistical models to the data they are investigating, they are able
to understand and interpret the information more strategically. Rather than sifting through the raw
data, this practice allows them to identify relationships between variables, make predictions about
future sets of data, and visualize that data so that non-analysts and stakeholders can consume and
leverage it.
16. Explain how HADOOP is related to Big Data? What are the features of HADOOP?
HADOOP is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance.
HDFS
- Hadoop comes with a distributed file system called the Hadoop Distributed File System (HDFS), which was designed for Big Data processing.
- It enables the storage of large files by distributing the data among a pool of data nodes.
- It holds very large amounts of data and provides easy access.
- It is highly fault tolerant and designed using low-cost hardware.
17. Explain why Big Data Analytics is helpful in business. Explain the steps to be followed to deploy a Big Data solution.
- Improved Accuracy : Big data analytics enables businesses to make decisions based on facts and
evidence rather than intuition or guesswork. By analyzing large volumes of data, patterns and
trends that may not be apparent at a smaller scale can be identified.
- Real-time Insights : In the fast-paced business environment, real-time insights are essential for timely decision-making. Big data analytics allows organizations to process and analyze data in real time or near real time, enabling them to respond quickly to emerging trends, market shifts, and customer demands.
- Customer Understanding : Understanding customers is vital for tailoring products, services, and
marketing strategies. Big data analytics provides a holistic view of customer behavior, preferences,
and needs by analyzing multiple data sources, such as online interactions, social media sentiment,
purchase history, and demographic information. This knowledge enables businesses to personalize
their offerings, deliver targeted marketing campaigns, and enhance the overall customer
experience.
- Competitive Advantage : In today’s competitive landscape, gaining an edge over rivals is crucial.
Big data analytics helps companies uncover insights that can differentiate them from competitors.
By identifying market trends, consumer preferences, and emerging opportunities, businesses can
develop innovative products, optimize pricing strategies, and deliver superior customer service.
- Risk Management : Big data analytics plays a vital role in risk management by identifying
potential risks and predicting future outcomes. By analyzing historical data and using predictive
modelling techniques, businesses can identify potential threats, fraud patterns, and anomalies.
This empowers organizations to take proactive measures to mitigate risks, improve security, and
safeguard their operations, reputation, and financial well-being.
- Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis. It is one of the key techniques used by large retailers to show associations between items, allowing them to identify relationships between the items that people frequently buy together.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
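The two core measures behind such rules can be sketched on toy basket data (the transactions are made up): support is the fraction of transactions containing an itemset, and the confidence of a rule A -> B is support(A and B) / support(A).

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # How often the consequent appears given the antecedent appears.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # {bread, milk} in 2 of 4 baskets
print(confidence({"bread"}, {"milk"}))  # 2/3 of bread baskets also have milk
```

Algorithms such as Apriori scale this idea up by pruning itemsets whose support falls below a chosen threshold.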
- Deep learning is a branch of machine learning which is based on artificial neural networks (ANNs), also known as deep neural networks (DNNs). It is capable of learning complex patterns and relationships within data; in deep learning, we do not need to explicitly program everything. It has become increasingly popular in recent years due to advances in processing power and the availability of large datasets. These neural networks are inspired by the structure and function of the human brain's biological neurons, and they are designed to learn from large amounts of data.
Practice Set
(Multiple Choice Questions)
2. Which of the following is the most important language for Data Science?
a) Java b) Ruby
c) R d) None of the mentioned
3. Which of the following data mining techniques is used to uncover patterns in data?
a) Data bagging b) Data booting
c) Data merging d) Data Dredging
5. Which of the following approach should be used if you can’t fix the variable?
a) randomize it b) non stratify it
c) generalize it d) none of the mentioned
6. Focusing on describing or explaining data versus going beyond the immediate data and making
inferences is the difference between ______.
a) Central tendency and common tendency
b) Mutually exclusive and mutually exhaustive properties
c) Descriptive and inferential
d) Positive skew and negative skew
7. Which of the Standard Probability density functions is applicable to discrete Random Variables?
a) Gaussian Distribution b) Poisson Distribution
c) Rayleigh Distribution d) Exponential Distribution
10. If a test was generally very easy, except for a few students who had very low scores, then
the distribution of scores would be ______.
a) Positively skewed b) Negatively skewed
c) Not skewed at all d) Normal
11. Which of the following approach should be used to ask Data Analysis question?
a) Find only one solution for particular problem
b) Find out the question which is to be answered
c) Find out answer from dataset without asking question
d) None of the mentioned
12. What is the mean of this set of numbers: 4, 6, 7, 9, 2000000
a) 7.5 b) 400,005.2
c) 7 d) 4
15. How can you prevent a clustering algorithm from getting stuck in bad local optima?
a) Set the same seed value for each run b) Use multiple random initializations
c) Both A and B d) None of the above
16. You run gradient descent for 15 iterations with a=0.3 and compute J(theta) after each
iteration. You find that the value of J(Theta) decreases quickly and then levels off. Based on
this, which of the following conclusions seems most plausible?
a) Rather than using the current value of a, use a larger value of a (say a=1.0)
b) Rather than using the current value of a, use a smaller value of a (say a=0.1)
c) a=0.3 is an effective choice of learning rate
d) None of the above
18. A fair six-sided die is rolled twice. What is the probability of getting 2 on the first roll and not
getting 4 on the second roll?
a) 1/36 b) 1/18
c) 5/36 d) 1/6
e) 1/3
19. Suppose you have trained a logistic regression classifier and it outputs on a new example x a
prediction hθ(x) = 0.2. This means
a) Our estimate for P(y=1 | x) is 0.2 b) Our estimate for P(y=0 | x) is 0.8
c) Our estimate for P(y=1 | x) is 0.8 d) Our estimate for P(y=0 | x) is 0.2
20. For the t distribution, increasing the sample size will have an effect on ______?
a) Standard Error of the Means b) The t-ratio
c) Degrees of Freedom d) All of the above
24. Which of the following is defined as the rule or formula to test a Null Hypothesis?
a) Test statistic b) Population statistic
c) Variance statistic d) Null statistic
25. The point where the Null Hypothesis gets rejected is called as?
a) Significant Value b) Rejection Value
c) Acceptance Value d) Critical Value
28. The process of constructing a mathematical model or function that can be used to predict or
determine one variable by another variable is called
a) regression b) correlation
c) residual d) outlier plot
32. Which would have a constant input in each epoch of training a Deep Learning model?
a) Weight between input and hidden layer b) Weight between hidden and output layer
c) Biases of all hidden layer neurons d) Activation function of output layer
e) None of the above
33. Identify the correct one -
Statement 1: It is possible to train a network well by initializing all the weights as 0
Statement 2: It is possible to train a network well by initializing biases as 0
Which of the statements given above is true?
a) Statement 1 is true while Statement 2 is false
b) Statement 2 is true while statement 1 is false
c) Both statements are true
d) Both statements are false
39. How are input layer units connected to second layer in competitive learning networks?
a) feedforward manner b) feedback manner
c) feedforward and feedback d) feedforward or feedback.
42. In Random Forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the
results of these trees. Which of the following is true about an individual (Tk) tree in Random Forest?
1. Individual tree is built on a subset of the features
2. Individual tree is built on all the features
3. Individual tree is built on a subset of observations
4. Individual tree is built on full set of observations
a) 1 and 3 b) 1 and 4
c) 2 and 3 d) 2 and 4
43. Which of the following algorithms would you take into consideration in your final
model building based on performance?
Suppose you are given the following graph, which shows the ROC curves for two different
classification algorithms, Random Forest (red) and Logistic Regression (blue).
44. In random forest or gradient boosting algorithms, features can be of any type. For example, it can
be a continuous feature or a categorical feature. Which of the following option is true when
you consider these types of features?
a) Only Random Forest algorithm handles real valued attributes by discretizing them
b) Only Gradient boosting algorithm handles real valued attributes by discretizing them
c) Both algorithms can handle real valued attributes by discretizing them
d) None of these.
45. The cell body of neuron can be analogous to what mathematical operation?
a) summing b) differentiator
c) integrator d) none of the mentioned.
46. What is the advantage of basis function over multilayer feedforward neural networks?
a) training of basis function is faster than MLFFNN
b) training of basis function is slower than MLFFNN
c) storing in basis function is faster than MLFFNN
d) none of the mentioned.
47. Suppose you are using a bagging-based algorithm say a Random Forest in model building. Which
of the following can be true?
1. Number of trees should be as large as possible
2. You will have interpretability after using Random Forest
a) 1 b) 2
c) 1 and 2 d) None of these
51. In which of the following cases will K-means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with nonconvex shapes
a) 1 and 2 b) 2 and 3
c) 1, 2 and 3 d) 1 and 3
52. Which scenario prefers failover cluster instance over standalone instance in SQL Server?
a) High Confidentiality b) High Availability
c) High Integrity d) None of the mentioned.
55. An exciting new feature in SQL Server 2014 is the support for the deployment of a
Failover Cluster Instance (FCI) with
a) Cluster Shared Volumes (CSV). b) In memory database.
c) Column oriented database. d) All of the mentioned.
56. Which of the following is a Windows Failover Cluster quorum mode?
a) Node Majority b) No Majority: Read Only
c) File Read Majority d) None of the mentioned.
58. Which of the following arguments is used to set importance values? (Correct: a)
a) scale b) set
c) value d) all of the mentioned.
60. To register a “watch” on a znode's data, you need to use commands that access the
current content or metadata.
a) stat b) put
c) receive d) gets
61. Which of the following has a design policy of using ZooKeeper only for transient data?
a) Hive b) Impala
c) HBase d) Oozie
62. Which of the following specifies the required minimum number of observations for each column
pair in order to have a valid result?
a) min_periods b) max_periods
c) minimum_periods d) all of the mentioned.
63. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?
a) Big data management and data mining b) Data warehousing and business intelligence
c) Management of Hadoop clusters d) Collecting and storing unstructured data.
68. What is the unit of data that flows through a Flume agent?
a) Log b) Row
c) Record d) Event
69. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management, and SQL support.
72. Which RNA carries the genetic information from the DNA to the ribosome for protein synthesis?
a) tRNA b) mRNA
c) rRNA d) DNA
73. What is the primary function of tRNA (transfer RNA) in the process of protein synthesis?
a) Transcribing genetic information from DNA b) Carrying amino acids to the ribosome
c) Providing a template for protein synthesis d) Forming the ribosomal structure.
74. Which would be most appropriate to replace the question mark in the following figure?
75. Which of the following approach should be used to ask Data Analysis question?
a) Find only one solution for a particular problem
b) Find out the question which is to be answered
c) Find out answer from dataset without asking question
d) None of the mentioned.
76. Choose which of the following design terms is perfectly applicable to the below figure?
a) correlation b) confounding
c) causation d) none of the mentioned.
77. Choose the option whose goal is to focus on summarizing and explaining a specific set of data.
a) inferential statistics b) descriptive statistics
c) none of these d) all of these.
78. Anita randomly picks 4 cards from a deck of 52 cards and places them back into the deck (any
set of 4 cards is equally likely).
a) 48C4 x 52C4 b) 48C4 x 52C8
c) 48C8 x 52C8 d) None of these
79. If a fair six-sided die is rolled 6 times. What is the probability of getting all outcomes as unique?
a) 0.01543 b) 0.01993
c) 0.23148 d) 0.03333
81. Some test scores follow a normal distribution with a mean of 18 and a standard deviation of
6. Select what proportion of test takers have scored between 18 and 24?
a) 0.2 b) 0.22
c) 0.34 d) none of these
82. Weight (Y) is regressed on height (X) of 40 adults. The height range in the data is 50-100 and
the regression line is Y = 100+0.1X with R^2= 0.12. Choose which of the conclusions below
does not necessarily follow?
a) the data suggests a weak relationship between X and Y
b) an adult with an X-value of 60 has an estimated Y-value of 106
c) an adult with an X-value of 80 has an estimated Y-value of 108
d) an adult with an X-value of 90 has an estimated Y-value of 10
83. What is the number of restrictions in the calculation of the F-statistics in question 22 above?
a) 1 b) 2
c) 3 d) 4
86. In a study, subjects are randomly assigned to one of three groups: control, experimental A, or
experimental B. After treatment, the mean scores for the three groups are compared. Choose
the appropriate statistical test for comparing these means is:
a) the analysis of variance b) the correlation coefficient
c) chi square d) the t-test
87. Assume that there is no overlap between the box and whisker plots for three drug
treatments where each drug was administered to 35 individuals. Choose the box plots for
these data:
a) represent evidence against the null hypothesis of ANOVA
b) provide no evidence for, or against, the null hypothesis of ANOVA
c) represent evidence for the null hypothesis of ANOVA
d) none of the mentioned.
88. Select what would happen if instead of using an ANOVA to compare 10 groups, you
performed multiple t-tests
a) making multiple comparisons with t-test increases the probability of making a Type I error
b) Sir Ronald Fisher would be turning over in his grave; he put all that work into developing
ANOVA, and you use multiple t-tests
c) nothing serious, except that making multiple comparisons with a t-test requires more computation
than doing a single ANOVA.
d) Nothing, there is no difference between using an ANOVA and using a t-test.
89. If you pooled all the individuals from all three lakes into a single group, select they would have
a standard deviation of
a) 1.257 b) 1.58
c) 3.767 d) 14.19
91. Consider a hypothesis H0 where ϕ0 = 5 against H1 where ϕ1 > 5. Select the test is
a) right tailed b) left tailed
c) center tailed d) cross tailed
96. Choose in supervised learning, class labels of the training samples are
a) known b) unknown
c) does not matter d) partially known
97. Choose if Sw is singular and N < D, its rank is at most (N is the total number of samples, D
the dimension of the data, C the number of classes):
a) N+C b) N
c) C d) N-C
98. Select if Sw is singular and N < D, the alternative solution is to use (N is the total number of
samples, D the dimension of the data):
a) EM b) PCA
c) ML d) any of these
99. Choose which of the following method options is provided by the train function for bagging
a) bagEarth b) treebag
c) bagFDA d) all of the mentioned
100. Select which of the following is statistical boosting based on additive logistic regression
a) gamboost b) gbm
c) ada d) All of these
103. What conditions must hold for a competitive network to perform pattern clustering?
a) non linear output layers
b) connection to neighbours is excitatory and to the farther units inhibitory
c) on centre off surround connections
d) none of the mentioned fulfils the whole criteria
1. Describe data cleansing. What are the best ways to practice data cleansing?
Data Cleansing, also called Data Scrubbing is the first step of data preparation. It is a process of
finding out and correcting or removing incorrect, incomplete, inaccurate, or irrelevant data in the
dataset. If data is incorrect, outcomes and algorithms are unreliable, though they may look correct.
The best practices for Data Cleaning are :
a. Develop a data quality strategy.
b. Correct data at the point of entry.
c. Validate the accuracy of data.
d. Manage removing duplicate data.
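The practices above can be sketched on toy records; the field names ("name", "age") and the validation rules below are hypothetical, not from the original.

```python
# A minimal data-cleansing sketch: dedupe, validate, and correct records.

def cleanse(records):
    seen, cleaned = set(), []
    for rec in records:
        key = (rec.get("name"), rec.get("age"))
        if key in seen:                            # d. remove duplicate data
            continue
        seen.add(key)
        if not rec.get("name"):                    # c. validate completeness
            continue
        age = rec.get("age")
        if age is None or not (0 <= age <= 120):   # b./c. reject invalid entries
            continue
        cleaned.append({"name": rec["name"].strip().title(), "age": age})
    return cleaned

raw = [
    {"name": " alice ", "age": 34},
    {"name": " alice ", "age": 34},   # exact duplicate
    {"name": "bob", "age": -5},       # impossible age
    {"name": "", "age": 20},          # missing name
]
print(cleanse(raw))  # [{'name': 'Alice', 'age': 34}]
```

In practice such rules belong at the point of entry (practice b), so that bad values never reach the dataset in the first place.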
2. Discuss a few problems that a data analyst usually encounters while performing analysis.
Biased Data : Data could be biased due to the source from which it is collected. For instance,
suppose you collect data to determine the winner of an electoral campaign, collecting from a specific
region alone introduces one form of a bias, while collecting data from a specific income group
introduces another form of bias.
Duplicates in the data : Data could have duplicates which may impact the result of analysis.
Missing data : All data points might not have values for all the attributes you are analyzing.
Noisy data : The data could be noisy; a high value of variance usually indicates noise.
Outliers in the data : Points outside the expected range of the data that introduce inconsistencies into the model.
Difference in formats in various data sources : Some data could be crawled and collected in html
format, while other data might be collected from online reviews in text format. A third source of
data might be structured data already in the database. A data analyst usually must ingest several
data sources to get richer data.
Data Volume : A large amount of data will require a different class of algorithms for processing to
handle efficiently.
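The outlier problem above is often checked with a z-score rule. A minimal sketch (the threshold and data values are hypothetical):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # 95 is an obvious outlier
print(zscore_outliers(data, threshold=2.0))  # [95]
```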
4. Why are SVMs often more accurate than logistic regression?
- SVM tries to maximize the margin between the closest support vectors, whereas
logistic regression maximizes the posterior class probability.
- LR is used for solving classification problems, while the SVM model is used for both classification
and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.
SVM outperforms LR in classifying grayscale images of handwritten digits (like the digits 0 to 9).
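The margin-versus-probability contrast above can be illustrated by the two loss functions; this is a sketch of the standard formulas, not of any particular library:

```python
import math

def hinge_loss(y, score):
    """SVM: loss is zero once the margin y*score >= 1 is achieved."""
    return max(0.0, 1.0 - y * score)

def log_loss(y, score):
    """Logistic regression: loss keeps shrinking but never reaches zero."""
    return math.log(1.0 + math.exp(-y * score))

# y in {-1, +1}; a confidently correct prediction:
print(hinge_loss(+1, 2.5))  # 0.0 -> SVM stops caring beyond the margin
print(log_loss(+1, 2.5))    # small but non-zero
```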
5. Classify overfitting and underfitting and how to combat them?
Overfitting : Overfitting occurs when the model tries to cover all the data points, or more than the
required data points present in the given dataset. Because of this, the model starts capturing noise
and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy
of the model. The overfitted model has low bias and high variance.
Some ways by which the occurrence of overfitting can be reduced are : Cross-Validation, Training
with more data, Removing features, Early stopping the training, Regularization, etc.
Underfitting : Underfitting occurs when the model is not able to capture the underlying trend of
the data. When the feeding of training data is stopped too early (e.g., to avoid overfitting), the model
is not able to learn enough from the training data, which causes underfitting. Hence it reduces the
accuracy and produces unreliable predictions. An underfitted model has high bias and low variance.
We can avoid underfitting by increasing : the training time of the model and the number of features.
Agglomerative Clustering :
Agglomerative clustering is a bottom-up approach. It starts clustering by treating the individual
data points as a single cluster then it is merged continuously based on similarity until it forms one
big cluster containing all objects. It is good at identifying small clusters.
Divisive Clustering :
Divisive clustering follows a top-down approach and is more efficient than agglomerative
clustering. It starts by considering all the data points into a big single cluster and later splitting
them into smaller heterogeneous clusters continuously until all data points are in their own cluster.
Thus, they are good at identifying large clusters.
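The bottom-up merging described above can be sketched on 1-D points with single linkage (merge the two closest clusters until k remain); the data values are hypothetical:

```python
# A minimal agglomerative (bottom-up) clustering sketch.

def agglomerative(points, k):
    clusters = [[p] for p in sorted(points)]  # every point starts as its own cluster
    while len(clusters) > k:
        best = None
        # find the pair of clusters with the smallest minimum distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair
    return clusters

print(agglomerative([1, 2, 9, 10, 25], k=3))  # [[1, 2], [9, 10], [25]]
```

Divisive clustering runs the same idea in reverse: start from one cluster of all points and split recursively.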
8. Explain how Hadoop is related to big data. What are the features of Hadoop?
HADOOP is an open source, Java based framework used for storing and processing big data. The
data is stored on inexpensive commodity servers that run as clusters. Its distributed file system
enables concurrent processing and fault tolerance.
HDFS
Hadoop comes with a distributed file system called the Hadoop Distributed File System (HDFS)
which was designed for Big Data processing.
It attempts to enable storage of large files, by distributing the data among a pool of data nodes.
It holds very large amount of data and provides easier access.
It is highly fault tolerant and designed using low-cost hardware.
9. Explain in detail about the probability distribution and entropy.
- Probability Distributions
A probability distribution is a statistical function that describes all the possible values and
probabilities for a random variable within a given range. This range will be bound by the
minimum and maximum possible values, but where the possible value would be plotted on the
probability distribution will be determined by a number of factors like mean (average), standard
deviation, skewness, and kurtosis. Two types are : Discrete Probability Distributions and
Continuous Probability Distributions.
- Entropy
Entropy measures the amount of surprise and data present in a variable. In information theory, a
random variable’s entropy reflects the average uncertainty level in its possible outcomes. Events
with higher uncertainty have higher entropy.
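The entropy definition above is H = -Σ p·log2(p); a minimal sketch with hypothetical probability values:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: higher uncertainty -> higher entropy."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: maximum uncertainty for two outcomes
print(entropy([0.9, 0.1]))  # biased coin: less surprise, lower entropy
print(entropy([1.0]))       # certain outcome: zero entropy
```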
10. Identify the difference between active learning and reinforcement learning; explain it
with a suitable example and diagram.
Active learning is based on the concept that if a learning algorithm can choose the data it wants to
learn from, it can perform better than traditional methods with substantially less data for training. So
it is a kind of semi-supervised machine learning.
(Diagram: the active learner queries the world and uses the response to train the classifier/model.)
Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an
interactive environment by trial and error using feedback from its own actions and experiences. It is
based on rewards and punishments mechanism which can be both active and passive.
(Diagram: the agent takes actions in the environment and receives state and reward feedback.)
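The trial-and-error, reward-driven loop of reinforcement learning can be sketched with a two-armed bandit and epsilon-greedy action selection; the reward values and learning setup below are hypothetical:

```python
import random

random.seed(0)
rewards = {0: 1.0, 1: 0.2}    # true (hidden) reward of each action
Q = {0: 0.0, 1: 0.0}          # agent's value estimates, learned from feedback
counts = {0: 0, 1: 0}
epsilon = 0.1                 # exploration rate

for _ in range(500):
    if random.random() < epsilon:
        a = random.choice([0, 1])   # explore: try a random action
    else:
        a = max(Q, key=Q.get)       # exploit: pick the best-known action
    r = rewards[a]                  # environment feedback (reward)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]  # incremental mean update of the estimate

print(Q)  # the agent's estimates converge toward the true rewards
```

Unlike active learning, no labels are requested here: the agent only ever sees the reward consequences of its own actions.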
11. Write the apriori algorithm for mining frequent item sets with an example.
The Apriori algorithm is used to calculate the association rules between
objects, i.e., how two or more objects are related to one another. In other words, we can say
that the Apriori algorithm is an association rule learning method that analyzes whether people who
bought product A also bought product B.
Support : For an association rule, support is the percentage of transactions in the database that
contain A U B, i.e.,
Support (A => B) = Number of transactions containing A U B / Total number of transactions
Confidence : For an association rule, it is the ratio of the number of transactions containing A U B to
the number of transactions containing A, i.e.,
Confidence (A => B) = No. of transactions containing A U B / No. of transactions containing A
The Apriori algorithm operates on a straightforward premise. When the support value of an item
set exceeds a certain threshold, it is considered a frequent item set. To begin, set the support
criterion, meaning that only those things that have more than the support criterion are considered
relevant.
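The support and confidence formulas above can be sketched on a toy transaction database (the item names are hypothetical):

```python
# Support and confidence for association rules over a toy database.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B):
    """support(A U B) / support(A) for the rule A => B."""
    return support(A | B) / support(A)

print(support({"bread", "milk"}))       # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75
```

An itemset whose support exceeds the chosen threshold is a frequent itemset, which is exactly the pruning criterion Apriori applies level by level.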
12. Write the difference between data mining and data analysis?
14. Can you express multiple one way repeated measure ANOVA on a two way design?
- A one-way ANOVA is primarily designed to enable the equality testing between three or
more means. A two-way ANOVA is designed to assess the interrelationship of two
independent variables on a dependent variable.
- A one-way ANOVA only involves one factor or independent variable, whereas there are two
independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more
categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- One-way ANOVA needs to satisfy only two principles of design of experiments, i.e., replication
and randomization, as opposed to two-way ANOVA, which meets all three principles of design
of experiments: replication, randomization and local control.
15. Group the main characteristics of big data. Why do you need big data?
Five Vs of Big Data :
Volume - The name Big Data itself is related to an enormous size. Big Data is a vast volume of
data generated from many sources daily, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Variety - Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was only collected from databases and spreadsheets; these days
data comes in many forms, such as PDFs, emails, audio, photos, videos, etc.
Veracity - Veracity means how much the data is reliable. It has many ways to filter or translate the
data. Veracity is the process of being able to handle and manage data efficiently.
Value - Value is an essential characteristic of big data. It is valuable and reliable data that are used
for storing, processing, and analysing.
Velocity - Velocity refers to the speed at which data is created in real time. It covers the speed of
incoming data streams and the rate of change. The primary aspect of Big Data is to provide
the demanded data rapidly.
With no bias, the input to the activation function is 'x' multiplied by the connection weight 'w0'.
In a scenario with bias, the input to the activation function is 'x' times the connection weight 'w0'
plus the bias times the connection weight for the bias 'w1'. This has the effect of shifting the
activation function by a constant amount (b * w1).
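The shifting effect described above can be sketched with a single sigmoid neuron; the weight values are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w0, b=0.0, w1=0.0):
    """Single neuron output: activation of x*w0, plus b*w1 when biased."""
    return sigmoid(x * w0 + b * w1)

x, w0 = 0.0, 1.5
print(neuron(x, w0))                 # 0.5: without bias, input 0 maps to 0.5
print(neuron(x, w0, b=1.0, w1=2.0))  # the bias term shifts the curve
```

Without the bias the sigmoid is pinned at 0.5 for x = 0; the bias lets the network move that crossing point wherever the data requires.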
17. Compare and contrast the relationship between clustering and centroids.
Several approaches to clustering exist. Each approach is best suited to a particular data distribution.
Focusing on centroid-based clustering using k-means :
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to
hierarchical clustering. k-means is the most widely used centroid-based clustering
algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
Centroid-based Clustering
19. Judge how can the initial number of clusters for k-means algorithm be estimated? - Ques (7)
20. Write how big data analysis is helpful in increasing business revenue. Explain the steps to be
followed to deploy a big data solution.
- Improved Accuracy : Big data analytics enables businesses to make decisions based on facts and
evidence rather than intuition or guesswork. By analyzing large volumes of data, patterns and
trends that may not be apparent at a smaller scale can be identified.
- Real-time Insights : In the fast-paced business environment, real-time insights are essential for
timely decision making. Big data analytics allows organizations to process and analyze data in
real-time or near real-time, enabling them to respond quickly to emerging trends, market shifts,
and customer demands.
- Customer Understanding : Understanding customers is vital for tailoring products, services, and
marketing strategies. Big data analytics provides a holistic view of customer behavior, preferences,
and needs by analyzing multiple data sources, such as online interactions, social media sentiment,
purchase history, and demographic information. This knowledge enables businesses to personalize
their offerings, deliver targeted marketing campaigns, and enhance the overall customer
experience.
- Competitive Advantage : In today’s competitive landscape, gaining an edge over rivals is crucial.
Big data analytics helps companies uncover insights that can differentiate them from competitors.
By identifying market trends, consumer preferences, and emerging opportunities, businesses can
develop innovative products, optimize pricing strategies, and deliver superior customer service.
- Risk Management : Big data analytics plays a vital role in risk management by
identifying potential risks and predicting future outcomes. By analyzing historical data
and using predictive modelling techniques, businesses can identify potential threats,
fraud patterns, and anomalies. This empowers organizations to take proactive measures
to mitigate risks, improve security, and safeguard their operations, reputation, and
financial well-being.
21. Describe K-Means, is it necessary to convert the data into zero mean and unit covariance?
K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning or data science. It groups the unlabeled dataset into different clusters.
The k-means clustering algorithm mainly performs two tasks:
- Determines the best value for K center points or centroids by an iterative process.
- Assigns each data point to its closest k-center; the data points near a particular k-center form a
cluster.
Converting the data to zero mean and unit variance is not strictly necessary, but standardization is
recommended when the features are on different scales, because k-means relies on Euclidean
distance and features with larger ranges would otherwise dominate the clustering.
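The two iterative tasks of k-means can be sketched in one dimension; the data values and the initial centroid positions are hypothetical:

```python
# A minimal 1-D k-means sketch: alternate assignment and centroid update.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1, 2, 3, 10, 11, 12], centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```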
22. Define the complexity theory for MapReduce. What is the Reducer of MapReduce?
In the context of MapReduce, complexity theory refers to the analysis of the computational
complexity of algorithms and problems when using the MapReduce programming model.
Map Phase : If n is the number of input records and m is the size of the input data, the time
complexity of the Map phase is typically O(n+m).
Shuffle and Sort : The time complexity depends on the efficiency of the shuffling mechanism and the
size of the data being shuffled.
Reduce Phase : If r is the number of reducer nodes and k is the number of unique keys, the time
complexity of the Reduce phase is often O(r+k).
The Reducer of MapReduce mainly consists of 3 processes/phases:
Shuffle : Shuffling helps to carry data from the Mapper to the required Reducer. With the help
of HTTP, the framework calls for applicable partition of the output in all Mappers.
Sort : In this phase, the output of the mapper that is actually the key-value pairs will be sorted
on the basis of its key value.
Reduce : Once shuffling and sorting are done, the Reducer combines the obtained results and
performs the computation operation as per the requirement. The OutputCollector.collect() method is
used for writing the output to HDFS.
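The Map, Shuffle/Sort, and Reduce phases above can be sketched in-process with a word count; this is an illustration of the model only, not Hadoop's actual API:

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)                  # Map: emit key-value pairs

def shuffle_sort(pairs):
    groups = defaultdict(list)
    for key, value in pairs:             # Shuffle: route each value to its key
        groups[key].append(value)
    return sorted(groups.items())        # Sort: order the groups by key

def reducer(key, values):
    return (key, sum(values))            # Reduce: combine the per-key values

lines = ["big data", "big deal"]
pairs = [kv for line in lines for kv in mapper(line)]
result = [reducer(k, vs) for k, vs in shuffle_sort(pairs)]
print(result)  # [('big', 2), ('data', 1), ('deal', 1)]
```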
23. Group few problems that data analyst usually encounter while performing analysis. - Ques (2)
Apart from these general usages, clustering is used by Amazon in its recommendation system to provide
recommendations based on a user's past product searches. Netflix also uses this technique to
recommend movies and web series to its users based on their watch history.
A typical diagram of the clustering algorithm shows different fruits being divided into several groups
with similar properties.
The advantage of quadratic discriminant analysis over linear discriminant analysis and logistic
regression is that it can model non-linear (quadratic) decision boundaries; when the decision
boundaries are linear, linear discriminant analysis and logistic regression perform well.
28. Explain what is ANOVA?
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups or treatments. It is often used to determine whether there
are any statistically significant differences between the means of different groups.
ANOVA compares the variation between group means to the variation within the groups. If the
variation between group means is significantly larger than the variation within groups, it suggests a
significant difference between the means of the groups.
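The between-versus-within comparison above is captured by the F statistic; a minimal one-way ANOVA sketch with hypothetical group data:

```python
import statistics

def f_statistic(groups):
    """F = (between-group variance) / (within-group variance)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    # between-group sum of squares (k - 1 degrees of freedom)
    ssb = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    # within-group sum of squares (n - k degrees of freedom)
    ssw = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

groups = [[2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(f_statistic(groups))  # a large F suggests the group means differ
```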
29. Describe in detail about the role of Probability Distribution in data analytics.
A probability distribution is a statistical function that describes all the possible values and
probabilities for a random variable within a given range. This range will be bound by the minimum
and maximum possible values, but where the possible value would be plotted on the probability
distribution will be determined by a number of factors like mean (average), standard deviation,
skewness, and kurtosis. Two types are : Discrete Probability Distributions and Continuous
Probability Distributions. Probability distributions enable us to analyze data and draw meaningful
conclusions by describing the likelihood of different outcomes or events.
In statistical analysis, these distributions play a pivotal role in parameter estimation, hypothesis
testing, and data inference. They also find extensive use in risk assessment, particularly in finance
and insurance, where they help assess and manage financial risks by quantifying the likelihood of
various outcomes.
30. Describe briefly descriptive statistics.
Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and
presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's main
features and characteristics without making any generalizations or inferences to a larger population.
The primary goal of descriptive statistics is to provide a clear and concise summary of the data,
enabling researchers or analysts to gain insights and understand patterns, trends, and distributions
within the dataset. This summary typically includes measures such as central tendency (e.g., mean,
median, mode), dispersion (e.g., range, variance, standard deviation), and shape of the distribution
(e.g., skewness, kurtosis).
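The summary measures listed above can be computed directly with the standard library; the data values are hypothetical:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# central tendency
print(statistics.mean(data))       # 5
print(statistics.median(data))     # 4.5
print(statistics.mode(data))       # 4
# dispersion
print(statistics.pvariance(data))  # 4
print(statistics.pstdev(data))     # 2.0
print(max(data) - min(data))       # range: 7
```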
31. Write the difference between probability distribution and descriptive statistic.
32. Describe in detail about the role of statistical model in data analytics.
Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is
a mathematical representation (or mathematical model) of observed data.
When data analysts apply various statistical models to the data they are investigating, they are able
to understand and interpret the information more strategically. Rather than sifting through the raw
data, this practice allows them to identify relationships between variables, make predictions about
future sets of data, and visualize that data so that non-analysts and stakeholders can consume and
leverage it.
33. Illustrate Why are SVMs often more accurate than logistic regression? - Ques (4)
34. Explain What is overfitting and underfitting and how to combat them? - Ques (5)
Input neurons
- This is the number of features the neural network uses to make its predictions.
- The input vector needs one input neuron per feature. For tabular data, this is the number of
relevant features in the dataset.
Output neurons
- This is the number of predictions the user wants to make.
- Regression : For regression tasks, this can be one value (e.g. housing price). For multi-variate
regression, it is one neuron per predicted value (e.g. for bounding boxes it can be 4 neurons —
one each for bounding box height, width, x-coordinate, y-coordinate).
- Classification : For binary classification (spam-not spam), we use one output neuron per positive
class, wherein the output represents the probability of the positive class.
39. Explain Why are SVMs often more accurate than logistic regression with examples. - Ques (4)
45. Write a few problems data analyst usually encounter during performing the analysis. - Ques (2)
47. Analyze the relation between Hadoop and big data. - Ques (8)
48. Illustrate the different features of Hadoop. - Ques (8)
51. Analyze why SVMs are often more accurate than logistic regression with examples. - Ques (4)
53. Assess what are the best practices in big data analytics. - Ques (3)
54. Evaluate the techniques used in big data analytics. - Ques (42)
55. Explain the concept of core points in the DBSCAN (Density-Based Spatial Clustering of
Applications with Noise) algorithm, and discuss their significance in the clustering process.
Provide an example to illustrate your explanation.
Clusters are dense regions in the data space, separated by regions of lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is
that for each point of a cluster, the neighborhood of a given radius (eps) has to contain at least a
minimum number of points (MinPts). A point satisfying this condition is called a core point; core
points form the interior of clusters, and a cluster grows by connecting core points that lie within
eps of one another, together with the border points in their neighborhoods.
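As the question asks for an example, here is a minimal 1-D sketch of identifying core points (the data values, eps, and MinPts are hypothetical):

```python
# A point is a core point if its eps-neighborhood (including itself)
# contains at least min_pts points.

def core_points(points, eps, min_pts):
    return [p for p in points
            if sum(1 for q in points if abs(p - q) <= eps) >= min_pts]

points = [1.0, 1.2, 1.4, 5.0, 9.0, 9.1, 9.2]
print(core_points(points, eps=0.5, min_pts=3))
# [1.0, 1.2, 1.4, 9.0, 9.1, 9.2]; 5.0 has too few neighbours, so it is noise
```

The two dense groups around 1.2 and 9.1 would form clusters, while the isolated point 5.0 is labelled noise, which is exactly the density intuition DBSCAN formalizes.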
- Building the CF Tree : BIRCH summarizes large datasets into smaller, dense regions called
Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered
triple, (N, LS, SS) where ’N’ is the number of data points in the cluster, ‘LS’ is the linear sum of the
data points and ‘SS’ is the squared sum of the data points in the cluster. It is possible for a CF
entry to be composed of other CF entries. Optionally, we can condense this initial CF tree into a
smaller CF tree.
- Global Clustering : Applies an existing clustering algorithm on the leaves of the CF tree. A CF
tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a
pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes.
Optionally, we can refine these clusters.
Due to this two-step process, BIRCH is also called Two Step Clustering.
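The CF triple (N, LS, SS) defined above can be sketched for 1-D data; note that two CF entries merge by simple component-wise addition, which is what makes BIRCH incremental:

```python
# Clustering Feature entries for 1-D points: (count, linear sum, squared sum).

def cf(points):
    return (len(points), sum(points), sum(p * p for p in points))

def merge(cf1, cf2):
    """Merging two CF entries is just element-wise addition."""
    return tuple(a + b for a, b in zip(cf1, cf2))

a = cf([1.0, 2.0])  # (2, 3.0, 5.0)
b = cf([4.0])       # (1, 4.0, 16.0)
print(merge(a, b))  # (3, 7.0, 21.0) == cf([1.0, 2.0, 4.0])
```

From a CF entry alone the cluster centroid (LS/N) and radius can be derived without revisiting the raw points, which is why the CF tree summarizes large datasets so compactly.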
57. Reframe the basic idea behind the DIANA (Divisive Analysis) clustering algorithm in data
mining, and describe the key steps involved in its process.
DIANA is also known as the DIvisive ANAlysis clustering algorithm. It is the top-down form of
hierarchical clustering where all data points are initially assigned a single cluster. Further, the
clusters are split into two least similar clusters. This is done recursively until clusters groups are
formed which are distinct to each other.
In step 1, the blue outlined circle can be thought of as all the points assigned to a single
cluster. Moving forward, it is divided into 2 red-colored clusters based on the distances/density of
points. Now, we have two red-colored clusters in step 2. Lastly, in step 3 the two red clusters are
further divided into 2 black dotted each, again based on density and distances to give us final four
clusters. Since the points in the respective 4 clusters are very similar to each other and very different
when compared to the other cluster groups they are not further divided. Thus, this is how we get
DIANA clusters or top-down approached Hierarchical clusters.
58. Express the concept of state transitions in Hidden Markov Models (HMMs) and their significance
in modeling sequential data. Provide an example to illustrate your explanation.
Hidden Markov Model (HMM) is a statistical model that is used to describe the probabilistic
relationship between a sequence of observations and a sequence of hidden states. It is often used in
situations where the underlying system or process that generates the observations is unknown or
hidden, hence it got the name “Hidden Markov Model.”
It is used to predict future observations or classify sequences, based on the underlying hidden
process that generates the data.
An HMM consists of two types of variables: hidden states and observations.
- The hidden states are the underlying variables that generate the observed data, but they are not
directly observable.
- The observations are the variables that are measured and observed.
The relationship between the hidden states and the observations is modeled using a probability
distribution. The HMM captures the relationship between the hidden states and the
observations using two sets of probabilities :
- The transition probabilities describe the probability of transitioning from one hidden state to another.
- The emission probabilities describe the probability of observing an output given a hidden state.
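The transition and emission probabilities above can be put together in a small forward-algorithm sketch; the weather/activity model and all probability values below are hypothetical:

```python
# A minimal 2-state HMM and the forward algorithm for sequence likelihood.

states = ["Rainy", "Sunny"]
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},   # transition probabilities
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "umbrella": 0.9},  # emission probabilities
        "Sunny": {"walk": 0.8, "umbrella": 0.2}}

def likelihood(observations):
    """P(observations) by summing over all hidden state paths."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(likelihood(["umbrella", "umbrella"]))  # more likely under this model
print(likelihood(["walk", "walk"]))
```

Here the weather states are hidden and only the activities are observed; state transitions are what let the model prefer sequences consistent with persistent weather.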