
Course Name : Data Analytics

Course Code : PEC-702A

1. If a pair of six-sided dice is rolled, what is the probability of getting a particular pair of
outcomes (for example, both dice showing six) - 1/36
2. What is the primary function of transfer RNA in the process of protein synthesis - to carry amino
acids to the ribosome and match them with the mRNA codon during translation
3. Find out when an event A is independent of itself - if and only if P(A) = 0 or P(A) = 1
4. Which language is the most important for data science - Python
5. How would we define the denominator of the z-score formula - the sample standard deviation
6. In the regression equation y = 65.57 + 0.50x, the intercept is defined as what - the predicted value of
y when x is 0
7. Write down the purpose of performing cross validation - To estimate how well the model will
generalize to new, unseen data
8. Where is logistic regression used - to predict the risk of developing a given disease
9. What type of data mining technique is used to uncover patterns in data - Clustering
10. What kind of standard probability density function is applicable to a discrete random variable
- Poisson distribution

11. Data analysis uses which method to get insights from data - Machine learning
12. Find out the branch of statistics which deals with development of statistical methods is classified as
- Mathematical statistics

13. Linear regression is a supervised machine learning model in which the model finds the best fit
between the independent and dependent variables - True
14. Find out the types of linear regression - simple linear regression and multiple linear regression
15. Justify: linear regression analysis is used to predict the value of a variable based on the value of
another variable - True
16. The process of quantifying data is referred to as - Quantitative analysis
17. Justify: text analysis is also referred to as text mining - True
18. A scatter plot is used when we want to visually examine the relationship between two
quantitative variables - True
19. A graph that uses vertical bars to represent data is called - bar graph
20. Data analysis is a process of - analysing datasets in order to derive conclusions about the
information contained within them.
21. What is a hypothesis - a proposed explanation for a phenomenon.
22. Linear regression models are relatively simple and provide an easy-to-interpret mathematical formula
that can generate - predictions
23. Alternative hypothesis is called as - Research hypothesis
24. If the null hypothesis is false then which is accepted - Alternative hypothesis
25. Justify: the mean square error is a measure of the simple average of the residuals - False (it is the average of the squared residuals)
26. Which equation is used to find the probability of event = success and event = failure - Logistic
regression
1. Describe why SVMs offer more accurate results than Logistic Regression.

- SVM tries to maximize the margin between the closest support vectors, whereas
logistic regression maximizes the posterior class probability.
- LR is used for solving classification problems, while the SVM model is used for both classification
and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.
A short comparison sketch follows below.
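A minimal sketch, assuming scikit-learn and a synthetic dataset (both are illustrative choices, not part of the original answer), that fits the two classifiers side by side: the SVM maximizes the margin while logistic regression models class probabilities.

```python
# Hedged sketch: compare an SVM and logistic regression on a toy dataset.
# The dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)          # margin maximization
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # class-probability model

print("SVM accuracy:", svm.score(X_test, y_test))
print("LR  accuracy:", lr.score(X_test, y_test))
```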

2. Explain about Probability Distribution and Entropy.

- Probability Distributions
A probability distribution is a statistical function that describes all the possible values and
probabilities for a random variable within a given range. This range will be bound by the minimum
and maximum possible values, but where the possible value would be plotted on the probability
distribution will be determined by a number of factors like mean (average), standard deviation,
skewness, and kurtosis. Two types are : Discrete Probability Distributions and Continuous
Probability Distributions.

- Entropy
Entropy measures the amount of surprise and data present in a variable. In information theory, a
random variable’s entropy reflects the average uncertainty level in its possible outcomes. Events
with higher uncertainty have higher entropy.
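A minimal sketch of how entropy can be computed for a discrete probability distribution; the distributions passed in are illustrative assumptions, and the helper function is a hypothetical name.

```python
# Hedged sketch: Shannon entropy of a discrete distribution, in bits.
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes are ignored."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # 1.0 bit: maximum uncertainty for two outcomes
print(entropy([0.9, 0.1]))   # ~0.47 bits: less surprise, lower entropy
```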

3. Identify the difference between Active Learning and Reinforcement Learning. Explain it with
suitable examples and diagram.

Active learning is based on the concept that, if a learning algorithm can choose the data it wants to learn
from, it can perform better than traditional methods with substantially less training data. It is therefore a
kind of semi-supervised machine learning.

[Diagram : the Active Learner sends a Query to the World, receives a Response, and uses it to update the Classifier/Model, which produces the Output.]

Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an
interactive environment by trial and error using feedback from its own actions and experiences. It is
based on rewards and punishments mechanism which can be both active and passive.

[Diagram : the Agent takes an Action in the Environment and receives the Next State and a Reward/Penalty in return.]
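A minimal sketch of the active-learning loop described above, using pool-based uncertainty sampling; the dataset, the scikit-learn model, and the query budget are illustrative assumptions rather than part of the original answer.

```python
# Hedged sketch: pool-based active learning with uncertainty sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = list(range(10))                         # start with a few labeled points
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                               # query budget of 20 labels
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # query the pool point the model is least certain about (closest to 0.5)
    query = pool[int(np.argmin(np.abs(proba[:, 1] - 0.5)))]
    labeled.append(query)                         # the "World" supplies its label
    pool.remove(query)

model.fit(X[labeled], y[labeled])
print("accuracy after active queries:", model.score(X, y))
```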

4. Express multiple one-way ANOVA on a two-way design.

- A one-way ANOVA is primarily designed to enable the equality testing between three or
more means. A two-way ANOVA is designed to assess the interrelationship of two
independent variables on a dependent variable.
- A one-way ANOVA only involves one factor or independent variable, whereas there are two
independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more
categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- One-way ANOVA needs to satisfy only two principles of design of experiments, i.e., replication and
randomization, as opposed to two-way ANOVA, which meets all three principles of design of
experiments: replication, randomization and local control. A short one-way ANOVA example follows.
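A minimal sketch of a one-way ANOVA test with SciPy; the three treatment groups are illustrative assumptions.

```python
# Hedged sketch: one-way ANOVA across three treatment groups.
from scipy import stats

group_a = [23, 25, 21, 24, 26]
group_b = [30, 28, 31, 27, 29]
group_c = [22, 20, 24, 23, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # small p: at least one group mean differs
```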

5. What do you mean by Big Data?

Big Data is a massive amount of data sets that cannot be stored, processed, or analysed using
traditional tools. There are millions of data sources that generate data at a very rapid rate. These data
sources are present across the world. Some of the largest sources of data are social media platforms
and networks. Let’s use Facebook as an example—it generates more than 500 terabytes of data every
day. This data includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data, and unstructured data.
For example, in a regular Excel sheet, data is classified as structured data—with a definite format. In
contrast, emails fall under semi-structured, and your pictures and videos fall under unstructured data.
All this data combined makes up Big Data.

6. Write the role of Activation Function in Neural Networks.

The activation function in Neural Networks takes an input 'x' multiplied by a weight 'w'. Bias allows
you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in Neural
Networks can be thought of as analogous to the role of a constant in a linear function, whereby the
line is effectively transposed by the constant value.

With no bias, the input to the activation function is 'x' multiplied by the connection weight 'w0'.

In a scenario with bias, the input to the activation function is 'x' times the connection weight 'w0'
plus the bias times the connection weight for the bias 'w1'. This has the effect of shifting the
activation function by a constant amount (b * w1).
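A minimal sketch of a single neuron that makes the shift concrete; the sigmoid activation and the numeric values of x, w0, b and w1 are illustrative assumptions.

```python
# Hedged sketch: how the bias term shifts the input to the activation function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, w0 = 0.7, 1.5       # input and its connection weight
b, w1 = 1.0, -2.0      # bias input (constant 1) and its connection weight

no_bias = sigmoid(x * w0)             # input to activation: x * w0
with_bias = sigmoid(x * w0 + b * w1)  # shifted by the constant amount b * w1
print(no_bias, with_bias)
```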

7. Compare and contrast the relationship between Clustering and Centroid.

Several approaches to clustering exist. Each approach is best suited to a particular data distribution.
Focusing on centroid-based clustering using k-means :
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical
clustering defined below. k-means is the most widely-used centroid-based clustering algorithm.
Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
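A minimal sketch of centroid-based clustering with k-means in scikit-learn; the sample points and the choice of k = 2 are illustrative assumptions. The fitted `cluster_centers_` attribute exposes one centroid per cluster.

```python
# Hedged sketch: k-means assigns each point to its nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:", kmeans.cluster_centers_)   # one centroid per cluster
print("labels:   ", kmeans.labels_)            # cluster assignment of each point
```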
8. Write short note on Deep Learning.

Deep learning is a branch of machine learning which is based on artificial neural networks. It is
capable of learning complex patterns and relationships within data. In deep learning, we don’t
need to explicitly program everything. It has become increasingly popular in recent years due to
the advances in processing power and the availability of large datasets. It is based on
artificial neural networks (ANNs), also known as deep neural networks (DNNs). These neural
networks are inspired by the structure and function of the human brain’s biological neurons, and
they are designed to learn from large amounts of data.

9. Explain ANOVA.

ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups or treatments. It is often used to determine whether
there are any statistically significant differences between the means of different groups.

ANOVA compares the variation between group means to the variation within the groups. If the variation
between group means is significantly larger than the variation within groups, it suggests a
significant difference between the means of the groups.

10. Explain Quadratic Discriminant Analysis.

Quadratic discrimination is the general form of Bayesian discrimination. Discriminant analysis is used
to determine which variables discriminate between two or more naturally occurring groups. The difference
from LDA is that QDA relaxes the assumption that the covariance matrices of all the classes are equal.

Working :
QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of
observations. QDA is particularly useful if there is prior knowledge that individual classes exhibit
distinct covariances.

QDA assumes that observation of each class is drawn from a normal distribution (similar to linear
discriminant analysis).

QDA assumes that each class has its own covariance matrix (different from linear discriminant
analysis)

Linear discriminant analysis and logistic regression perform well when the decision boundaries are
linear; the advantage of quadratic discriminant analysis is that it can model quadratic (non-linear)
decision boundaries.
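A minimal sketch comparing LDA and QDA in scikit-learn; the generated dataset is an illustrative assumption.

```python
# Hedged sketch: LDA uses a shared covariance matrix, QDA estimates one per class.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = make_classification(n_samples=400, n_features=5, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X, y)      # shared covariance matrix
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # per-class covariance matrices
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))
```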
11. Describe Probability Distribution in details.

A probability distribution is a mathematical function that defines the likelihood of different
outcomes or values of a variable. This function is commonly represented by a graph or probability
table, and it provides the probabilities of various possible results of an experiment or random
phenomenon based on the sample space and the probabilities of events. Probability distributions are
fundamental in probability theory and statistics for analyzing data and making predictions.

Probability distributions enable us to analyze data and draw meaningful conclusions by describing the
likelihood of different outcomes or events.

In statistical analysis, these distributions play a pivotal role in parameter estimation, hypothesis
testing, and data inference. They also find extensive use in risk assessment, particularly in finance and
insurance, where they help assess and manage financial risks by quantifying the likelihood of various
outcomes.
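A minimal sketch evaluating one discrete and one continuous distribution with SciPy; the parameter values are illustrative assumptions.

```python
# Hedged sketch: a discrete (Poisson) and a continuous (normal) distribution.
from scipy import stats

# Discrete: Poisson with mean rate 3 -- probability of exactly 2 events
print(stats.poisson.pmf(2, mu=3))

# Continuous: standard normal -- probability of a value falling below 1.96
print(stats.norm.cdf(1.96, loc=0, scale=1))
```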

12. Differentiate Descriptive Statistics and Probability Distribution.

Criteria : Descriptive Statistics | Probability Distribution
- Purpose : Summarize and describe an existing dataset. | Model and predict outcomes from random variables.
- Data Type : Applied to observed data. | Mathematical model for random variables.
- Example : Mean, median, mode, range, variance. | Normal, exponential, binomial, Poisson.
- Use Cases : Data exploration, data presentation. | Hypothesis testing, risk assessment, modelling random processes.
- Goal : Describe and summarize data properties. | Understand randomness and uncertainty in data-generating processes.

13. What is the role of Statistical Model in data analytics?

Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is a
mathematical representation (or mathematical model) of observed data.

When data analysts apply various statistical models to the data they are investigating, they are able
to understand and interpret the information more strategically. Rather than sifting through the raw
data, this practice allows them to identify relationships between variables, make predictions about
future sets of data, and visualize that data so that non-analysts and stakeholders can consume and
leverage it.

14. Justify why SVM is so fast?

SVM performs and generalizes well on out-of-sample data. In addition, at prediction time the
classification of a single sample only requires evaluating the kernel function against each support
vector rather than the whole training set, and the support vectors are usually a small fraction of the
training data, which is why SVM proves itself to be fast.

15. Main characteristics of Data Analytics.

Data analytics is the process of examining, cleaning, transforming, and interpreting data to extract
valuable insights and support decision-making. The main characteristics of data analytics are:
● Data collection from various sources.
● Data cleaning and preprocessing for accuracy.
● Descriptive, diagnostic, predictive, and prescriptive analytics.
● Data visualization for better understanding.
● Use of machine learning and AI.
● Handling big data and real-time analytics.
● Emphasis on data security and privacy.
● An iterative process to refine insights.
● Domain-specific and fosters a data-driven culture
Class Test II
Short Answer Type Questions

1. Differentiate between Data Mining and Data analytics.

Criteria : Data Mining | Data Analysis
- Definition : It is the process of extracting important patterns from large datasets. | It is the process of analyzing and organizing raw data to determine useful information and decisions.
- Function : It is used in discovering hidden patterns in raw data sets. | It involves all the operations needed to examine data sets and find conclusions.
- Data set : Data sets are generally large and structured. | Data sets can be large, medium, or small, and structured, semi-structured or unstructured.
- Models : Often requires mathematical and statistical models. | Analytical and business intelligence models.
- Visualization : Generally does not require visualization. | Surely requires data visualization.
- Goal : The goal is to make data usable. | It is used to make data-driven decisions.
- Required Knowledge : It involves the intersection of machine learning, statistics, and databases. | It requires knowledge of computer science, statistics, mathematics, subject knowledge and AI/Machine Learning.
- Output : It shows the data trends and patterns. | The output is a verified or discarded hypothesis.

2. Express multiple one-way repeated measures ANOVA on a two-way design.

- A one-way ANOVA is primarily designed to enable the equality testing between three or
more means. A two-way ANOVA is designed to assess the interrelationship of two
independent variables on a dependent variable.
- A one-way ANOVA only involves one factor or independent variable, whereas there are two
independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more
categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- One-way ANOVA needs to satisfy only two principles of design of experiments, i.e., replication and
randomization, as opposed to two-way ANOVA, which meets all three principles of design of
experiments: replication, randomization and local control.

3. Classify overfitting and underfitting and how to combat them?

Overfitting : Overfitting occurs when the model tries to cover all the data points, or more than the
required data points present in the given dataset. Because of this, the model starts caching noise and
inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of
the model. The overfitted model has low bias and high variance. Some ways by which the occurrence
of overfitting can be reduced are : Cross-Validation, Training with more data, Removing features, Early
stopping the training, Regularization, etc.
Underfitting : Underfitting occurs when the model is not able to capture the underlying trend of the
data. When, to avoid overfitting, the feeding of training data is stopped too early, the model is not
able to learn enough from the training data, and this causes underfitting. Hence it reduces the accuracy
and produces unreliable predictions. An underfitted model has high bias and low variance. We can
avoid underfitting by increasing the training time of the model and the number of features.

[Figure : an overfitted fit versus an underfitted fit on the same data.]
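A minimal sketch of two of the listed remedies, cross-validation and regularization; the synthetic data, the degree-15 polynomial model, and the Ridge penalty are illustrative assumptions.

```python
# Hedged sketch: use cross-validation to spot overfitting and regularization to combat it.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)

flexible = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))

print("unregularized CV score:", cross_val_score(flexible, X, y, cv=5).mean())
print("regularized  CV score:", cross_val_score(regularized, X, y, cv=5).mean())
```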

4. Describe why SVM is more accurate than Logistic Regression?

- SVM tries to maximize the margin between the closest support vectors, whereas logistic regression
maximizes the posterior class probability.
- LR is used for solving classification problems, while the SVM model is used for both classification
and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.

5. Write the best practice for big data analytics.

Some of the best practices in big data analytics are :

- Defining clear objectives.


- Knowing which data is important and which is not.
- Assuring data quality.
- Committing to proper data labelling.
- Choosing proper data storage locations.
- Managing data lifecycle.
- Simplifying procedures for backup.
- Implementing security measures.
- Ensuring scalable infrastructure.
- Arranging data audits at regular basis.

6. Write the role of activation function in neural network.

The activation function in Neural Networks takes an input 'x' multiplied by a weight 'w'. Bias allows
you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in Neural
Networks can be thought of as analogous to the role of a constant in a linear function, whereby the
line is effectively transposed by the constant value.

With no bias, the input to the activation function is 'x' multiplied by the connection weight 'w0'.
In a scenario with bias, the input to the activation function is 'x' times the connection weight
'w0' plus the bias times the connection weight for the bias 'w1'. This has the effect of
shifting the activation function by a constant amount (b * w1).

7. Describe Descriptive Statistics.

Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and
presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's main
features and characteristics without making any generalizations or inferences to a larger population.
The primary goal of descriptive statistics is to provide a clear and concise summary of the data,
enabling researchers or analysts to gain insights and understand patterns, trends, and distributions
within the dataset. This summary typically includes measures such as central tendency (e.g., mean,
median, mode), dispersion (e.g., range, variance, standard deviation), and shape of the distribution
(e.g., skewness, kurtosis).
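A minimal sketch of computing these summary measures with pandas and SciPy; the sample scores are illustrative assumptions.

```python
# Hedged sketch: central tendency, dispersion, and shape of a small sample.
import pandas as pd
from scipy import stats

scores = pd.Series([12, 15, 14, 10, 18, 20, 15, 13, 16, 14])

print(scores.describe())          # count, mean, std, min, quartiles, max
print("mode:    ", scores.mode().tolist())
print("skewness:", stats.skew(scores))
print("kurtosis:", stats.kurtosis(scores))
```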

8. Applications of Clustering.

The clustering technique can be widely used in various tasks. Some most common uses of this
technique are :
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation system to provide
recommendations based on a user's past product searches. Netflix also uses this technique to
recommend movies and web series to its users based on their watch history.
A typical illustration of a clustering algorithm shows different fruits being divided into several
groups with similar properties.
9. How can the initial number of clusters for k-means algorithm be estimated? Give example.
Elbow Method : It is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS (Within Cluster Sum of Squares) value, which defines the total
variations within a cluster.

Gap Statistic Method : It compares the total intra-cluster variation for different values of k
with their expected values under a null reference distribution of the data. The estimate of the optimal
number of clusters is the value that maximizes the gap statistic (i.e., that yields the largest gap statistic).
This means that the clustering structure is far from a random uniform distribution of points.

Silhouette Approach : It measures the quality of a clustering, i.e., it determines how well each
object lies within its cluster. A high average silhouette width indicates a good clustering.
The average silhouette method computes the average silhouette of observations for different
values of k. The optimal number of clusters k is the one that maximizes the average silhouette
over a range of possible values for k. A short sketch of the elbow method follows.
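A minimal sketch of the elbow method: fit k-means for a range of k values and inspect the WCSS (scikit-learn's `inertia_`); the generated blobs and the range of k are illustrative assumptions.

```python
# Hedged sketch: WCSS drops sharply up to the "elbow", then flattens.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))   # the elbow here should appear near k = 4
```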

10. Describe cleansing and what are the best ways to practice data cleansing?

Data Cleansing, also called Data Scrubbing is the first step of data preparation. It is a process of
finding out and correcting or removing incorrect, incomplete, inaccurate, or irrelevant data in the
dataset. If data is incorrect, outcomes and algorithms are unreliable, though they may look correct.

The best practices for Data Cleaning are :


- Develop a data quality strategy.
- Correct data at the point of entry.
- Validate the accuracy of data.
- Manage removing duplicate data.
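A minimal sketch of these practices with pandas; the toy DataFrame, its column names, and the plausibility range for ages are illustrative assumptions.

```python
# Hedged sketch: basic cleansing steps -- deduplicate, correct, validate, impute.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 25, np.nan, 40, 200],                 # a missing value and an implausible 200
    "city": ["Delhi", "Delhi", "Mumbai", " Pune ", "Kolkata"],
})

df = df.drop_duplicates()                             # manage removing duplicate data
df["city"] = df["city"].str.strip()                   # correct data formatting at entry
df = df[df["age"].between(0, 120) | df["age"].isna()] # validate the accuracy of data
df["age"] = df["age"].fillna(df["age"].median())      # handle missing values
print(df)
```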

11. Define the Complexity theory of Map Reduce. What is the reduce size of Map Reduce?

In the context of MapReduce, complexity theory refers to the analysis of the computational complexity of
algorithms and problems when using the MapReduce programming model.
Time Complexity :
Map Phase : If n is the number of input records and m is the size of the input data, the time complexity of
the Map phase is typically O(n+m).
Shuffle and Sort : The time complexity depends on the efficiency of the shuffling mechanism and the size
of the data being shuffled.
Reduce Phase : If r is the number of reducer nodes and k is the number of unique keys, the time
complexity of the Reduce phase is often O(r+k).

The Reducer of MapReduce consists mainly of 3 processes/phases:


Shuffle : Shuffling helps to carry data from the Mapper to the required Reducer. With the help of HTTP,
the framework calls for applicable partition of the output in all Mappers.
Sort : In this phase, the output of the mapper that is actually the key-value pairs will be sorted on the
basis of its key value.
Reduce : Once shuffling and sorting are done, the Reducer combines the obtained results and performs
the computation operation as per the requirement. The OutputCollector.collect() method is used for writing
the output to HDFS. Note that the output of the Reducer is not sorted.
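A minimal, single-process sketch of the map, shuffle/sort, and reduce flow for word counting; it illustrates the programming model only and is not the Hadoop API, and the sample documents are illustrative assumptions.

```python
# Hedged sketch of the MapReduce flow: map -> shuffle & sort -> reduce.
from itertools import groupby
from operator import itemgetter

documents = ["big data big insight", "data data everywhere"]

# Map phase: emit (key, 1) pairs for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle & sort: group the intermediate pairs by key
mapped.sort(key=itemgetter(0))

# Reduce phase: combine the values for each key
for word, pairs in groupby(mapped, key=itemgetter(0)):
    print(word, sum(count for _, count in pairs))
```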
12. Challenges of Conventional System.

Fundamental challenges
- Storage
- Processing
- Security
- Finding and Fixing Data Quality Issues
- Evaluating and Selecting Big Data Technologies
- Data Validation
- Scaling Big Data Systems

13. What are the different hierarchical methods for cluster analysis?

Hierarchical clustering refers to an unsupervised learning procedure that determines successive
clusters based on previously defined clusters. It works by grouping data into a tree of clusters.
Hierarchical clustering starts by treating each data point as an individual cluster.

There are two methods of Hierarchical Clustering :

- Agglomerative Clustering :
Agglomerative clustering is a bottom-up approach. It starts clustering by treating the individual
data points as a single cluster then it is merged continuously based on similarity until it forms one
big cluster containing all objects. It is good at identifying small clusters.

- Divisive Clustering :
Divisive clustering works just the opposite of agglomerative clustering. It starts by considering all
the data points into a big single cluster and later on splitting them into smaller heterogeneous
clusters continuously until all data points are in their own cluster. Thus, they are good at
identifying large clusters. It follows a top-down approach and is more efficient than
agglomerative clustering. But, due to its complexity in implementation, it doesn’t have any
predefined implementation in any of the major machine learning frameworks.
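A minimal sketch of the agglomerative (bottom-up) method using scikit-learn; the sample points, the Ward linkage, and n_clusters=2 are illustrative assumptions.

```python
# Hedged sketch: agglomerative clustering merges points upward into clusters.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(agg.labels_)   # each point starts as its own cluster and is merged bottom-up
```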
14. State the Quadratic Discriminant Analysis

Quadratic discrimination is the general form of Bayesian discrimination. Discriminant analysis is used
to determine which variables discriminate between two or more naturally occurring groups. The difference
from LDA is that QDA relaxes the assumption that the covariance matrices of all the classes are equal.

Working :
QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of
observations. QDA is particularly useful if there is prior knowledge that individual classes exhibit
distinct covariances.

QDA assumes that observation of each class is drawn from a normal distribution (similar to linear
discriminant analysis).

QDA assumes that each class has its own covariance matrix (different from linear discriminant
analysis)

Linear discriminant analysis and logistic regression perform well when the decision boundaries are
linear; the advantage of quadratic discriminant analysis is that it can model quadratic (non-linear)
decision boundaries.

15. What is the role of statistical model in Data Analytics?

Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is a
mathematical representation (or mathematical model) of observed data.

When data analysts apply various statistical models to the data they are investigating, they are able
to understand and interpret the information more strategically. Rather than sifting through the raw
data, this practice allows them to identify relationships between variables, make predictions about
future sets of data, and visualize that data so that non-analysts and stakeholders can consume and
leverage it.

16. Explain how HADOOP is related to Big Data? What are the features of HADOOP?

HADOOP is an open source, Java based framework used for storing and processing big data. The data
is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables
concurrent processing and fault tolerance.

HDFS
- Hadoop comes with a distributed file system called the Hadoop Distributed File System (HDFS)
which was designed for Big Data processing.
- It attempts to enable storage of large files, by distributing the data among a pool of data nodes.
- It holds very large amount of data and provides easier access.
- It is highly fault tolerant and designed using low-cost hardware.

17. Explain why Big Data Analytics is helpful in business. Explain the steps to be followed
to deploy a Big Data solution.

- Improved Accuracy : Big data analytics enables businesses to make decisions based on facts and
evidence rather than intuition or guesswork. By analyzing large volumes of data, patterns and
trends that may not be apparent at a smaller scale can be identified.

- Real-time Insights : In the fast-paced business environment, real-time insights are essential for
timely decision making. Big data analytics allows organizations to process and analyze data in
real- time or near real-time, enabling them to respond quickly to emerging trends, market shifts,
and customer demands.

- Customer Understanding : Understanding customers is vital for tailoring products, services, and
marketing strategies. Big data analytics provides a holistic view of customer behavior, preferences,
and needs by analyzing multiple data sources, such as online interactions, social media sentiment,
purchase history, and demographic information. This knowledge enables businesses to personalize
their offerings, deliver targeted marketing campaigns, and enhance the overall customer
experience.

- Competitive Advantage : In today’s competitive landscape, gaining an edge over rivals is crucial.
Big data analytics helps companies uncover insights that can differentiate them from competitors.
By identifying market trends, consumer preferences, and emerging opportunities, businesses can
develop innovative products, optimize pricing strategies, and deliver superior customer service.

- Risk Management : Big data analytics plays a vital role in risk management by identifying
potential risks and predicting future outcomes. By analyzing historical data and using predictive
modelling techniques, businesses can identify potential threats, fraud patterns, and anomalies.
This empowers organizations to take proactive measures to mitigate risks, improve security, and
safeguard their operations, reputation, and financial well-being.

18. Short Note on Association Rule Mining and Deep Learning.

- Association rule mining finds interesting associations and relationships among large sets of
data items. This rule shows how frequently an itemset occurs in a transaction. A typical
example is Market Basket Analysis. It is one of the key techniques used by large retailers to
show associations between items. It allows retailers to identify relationships between the items
that people buy together frequently.

Given a set of transactions, we can find rules that will predict the occurrence of an item based
on the occurrences of other items in the transaction. A small worked example of support and
confidence is given after this note.

- Deep learning is a branch of machine learning which is based on artificial neural networks. It is
capable of learning complex patterns and relationships within data. In deep learning, we don’t
need to explicitly program everything. It has become increasingly popular in recent years due to
the advances in processing power and the availability of large datasets. It is based on
artificial neural networks (ANNs), also known as deep neural networks (DNNs). These neural
networks are inspired by the structure and function of the human brain’s biological neurons, and
they are designed to learn from large amounts of data.
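A minimal sketch of the association-rule idea mentioned above: computing the support and confidence of the rule {bread} -> {butter} from a few transactions. The transactions and item names are illustrative assumptions.

```python
# Hedged sketch: support and confidence for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
support_both = sum({"bread", "butter"} <= t for t in transactions) / n   # 0.50
support_bread = sum("bread" in t for t in transactions) / n              # 0.75
confidence = support_both / support_bread                                # ~0.67

print(f"support(bread, butter)      = {support_both:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")
```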
Practice Set
(Multiple Choice Questions)

1. Which of the following is one of the largest boost subclass in boosting?


a) variance boosting b) gradient boosting
c) mean boosting d) all of the mentioned

2. Which of the following is the most important language for Data Science?
a) Java b) Ruby
c) R d) None of the mentioned

3. Which of the following data mining technique is used to uncover patterns in data?
a) Data bagging b) Data booting
c) Data merging d) Data Dredging

4. Non-overlapping categories or intervals are known as ______.


a) Inclusive b) Exhaustive
c) Mutually exclusive d) Mutually exclusive and exhaustive

5. Which of the following approach should be used if you can’t fix the variable?
a) randomize it b) non stratify it
c) generalize it d) none of the mentioned

6. Focusing on describing or explaining data versus going beyond immediate data and making
inferences is the difference between ______.
a) Central tendency and common tendency
b) Mutually exclusive and mutually exhaustive properties
c) Descriptive and inferential
d) Positive skew and negative skew

7. Which of the Standard Probability density functions is applicable to discrete Random Variables?
a) Gaussian Distribution b) Poisson Distribution
c) Rayleigh Distribution d) Exponential Distribution

8. The expected value of a discrete random variable ‘x’ is given by


a) P(x) b) ∑ P(x)
c) ∑ x P(x) d) 1

9. The denominator (bottom) of the z-score formula is


a) the standard deviation b) the difference between a score and the mean
c) the range d) the mean

10. If a test was generally very easy, except for a few students who had very low scores, then
the distribution of scores would be ______.
a) Positively skewed b) Negatively skewed
c) Not skewed at all d) Normal

11. Which of the following approach should be used to ask Data Analysis question?
a) Find only one solution for particular problem
b) Find out the question which is to be answered
c) Find out answer from dataset without asking question
d) None of the mentioned
12. What is the mean of this set of numbers: 4, 6, 7, 9, 2000000
a) 7.5 b) 400,005.2
c) 7 d) 4

13. Which of the following are the advantage/s of Decision Trees?


a) Possible Scenarios can be added
b) Use a white box model, If given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of the mentioned

14. What is the purpose of performing cross-validation?


a) To assess the predictive performance of the models
b) To judge how the trained model performs outside the sample on test data
c) Both A and B
d) None of the mentioned

15. How can you prevent a clustering algorithm from getting stuck in bad local optima?
a) Set the same seed value for each run b) Use multiple random initializations
c) Both A and B d) None of the above

16. You run gradient descent for 15 iterations with a=0.3 and compute J(theta) after each
iteration. You find that the value of J(Theta) decreases quickly and then levels off. Based on
this, which of the following conclusions seems most plausible?
a) Rather than using the current value of a, use a larger value of a (say a=1.0)
b) Rather than using the current value of a, use a smaller value of a (say a=0.1)
c) a=0.3 is an effective choice of learning rate
d) None of the above

17. If P(x) = 0.5 and x = 4, then E(x) = ?


a) 1 b) 0.5
c) 4 d) 2

18. A fair six-sided die is rolled twice. What is the probability of getting 2 on the first roll and not
getting 4 on the second roll?
a) 1/36 b) 1/18
c) 5/36 d) 1/6
e) 1/3

19. Suppose you have trained a logistic regression classifier and it outputs a new example x with a
prediction ho(x) = 0.2. This means
a) Our estimate for P(y=1 | x) is 0.2 b) Our estimate for P(y=0 | x) is 0.2
c) Our estimate for P(y=1 | x) is 0.8 d) Our estimate for P(y=0 | x) is 0.8

20. For the t distribution, increasing the sample size will have an effect on?
a) Standard Error of the Means b) The t-ratio
c) Degrees of Freedom d) All of the above

21. In random experiment, the observations of random variable are classified as


a) events b) composition
c) trials d) functions
22. When do the conditional density functions get converted into the marginal density functions?
a) Only if random variables exhibit statistical dependency
b) Only if random variables exhibit statistical independency
c) Only if random variables exhibit deviation from its mean value
d) If random variables do not exhibit deviation from its mean value

23. Which of the following are universal approximators?


a) Kernel SVM b) Neural Networks
c) Boosted Decision Trees d) All of the above

24. Which of the following is defined as the rule or formula to test a Null Hypothesis?
a) Test statistic b) Population statistic
c) Variance statistic d) Null statistic

25. The point where the Null Hypothesis gets rejected is called as?
a) Significant Value b) Rejection Value
c) Acceptance Value d) Critical Value

26. Which of the following are universal approximators?


a) Kernel SVM b) Neural Networks
c) Boosted Decision Trees d) All of the above

27. What is the function of a post-test in ANOVA?


a) Determine if any statistically significant group differences have occurred.
b) Describe those groups that have reliable differences between group means.
c) Set the critical value for the F test (or chi-square)
d) None of the mentioned

28. The process of constructing a mathematical model or function that can be used to predict or
determine one variable by another variable is called
a) regression b) correlation
c) residual d) outlier plot

29. In the regression equation Y = 75.65 + 0.50X, the intercept is


a) 0.50 b) 75.65
c) 1.00 d) indeterminable

30. The probability of Type 1 error is referred as?


a) 1-α b) β
c) α d) 1-β

31. Large values of the log-likelihood statistic indicate -


a) That there are a greater number of explained vs. unexplained observations.
b) That the statistical model fits the data well.
c) That as the predictor variable increases, the likelihood of the outcome occurring decreases.
d) That the statistical model is a poor fit of the data.

32. Which would have a constant input in each epoch of training a Deep Learning model?
a) Weight between input and hidden layer b) Weight between hidden and output layer
c) Biases of all hidden layer neurons d) Activation function of output layer
e) None of the above
33. Identify the correct one -
Statement 1: It is possible to train a network well by initializing all the weights as 0
Statement 2: It is possible to train a network well by initializing biases as 0
Which of the statements given above is true?
a) Statement 1 is true while Statement 2 is false
b) Statement 2 is true while statement 1 is false
c) Both statements are true
d) Both statements are false

34. What is stability plasticity dilemma?


a) system can neither be stable nor plastic
b) static inputs & categorization can’t be handled
c) dynamic inputs & categorization can’t be handled
d) none of the mentioned.

35. Drawbacks of template matching are?


a) time consuming b) highly restricted
c) more generalized d) none of the mentioned.

36. What is true regarding backpropagation rule?


a) it is a feedback neural network
b) actual output is determined by computing the outputs of units for each hidden layer
c) hidden layers output is not all important, they are only meant for supporting input & output layers
d) none of the mentioned.

37. What is meant by generalized in statement “backpropagation is a generalized delta rule”?


a) because delta rule can be extended to hidden layer units
b) because delta is applied to only input and output layers, making it more simple and generalized.
c) it has no significance
d) none of the mentioned.

38. Correlation learning law can be represented by equation?


a) ∆wij= µ(si) aj
b) ∆wij= µ(bi – si) aj
c) ∆wij= µ(bi – si) aj f′(xi), where f′(xi) is the derivative of xi
d) ∆wij= µ bi aj

39. How are input layer units connected to second layer in competitive learning networks?
a) feedforward manner b) feedback manner
c) feedforward and feedback d) feedforward or feedback.

40. What is the name of the model in figure below?

a) Rosenblatt perceptron model b) McCulloch-Pitts model


c) Widrow’s Adaline model d) None of the mentioned.
41. Which is/are true about Random Forest and Gradient Boosting ensemble methods?
1. Both methods can be used for classification task
2. Random Forest is use for classification whereas Gradient Boosting is use for regression task
3. Random Forest is use for regression whereas Gradient Boosting is use for Classification task
4. Both methods can be used for regression task
a) 1 b) 2
c) 1 and 4 d) 3

42. In Random Forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the
results of these trees. Which of the following is true about an individual tree (Tk) in Random Forest?
1. Individual tree is built on a subset of the features
2. Individual tree is built on all the features
3. Individual tree is built on a subset of observations
4. Individual tree is built on full set of observations
a) 1 and 3 b) 1 and 4
c) 2 and 3 d) 2 and 4

43. Which of the following algorithm would you take into the consideration in your final
model building based on performance?
Suppose you have given the following graph which shows the ROC curve for two different
classification algorithms such as Random Forest (Red) and Logistic Regression (Blue)

a) Random Forest b) Logistic Regression


c) Both of the above d) None of these

44. In random forest or gradient boosting algorithms, features can be of any type. For example, it can
be a continuous feature or a categorical feature. Which of the following option is true when
you consider these types of features?
a) Only Random Forest algorithm handles real valued attributes by discretizing them
b) Only Gradient boosting algorithm handles real valued attributes by discretizing them
c) Both algorithms can handle real valued attributes by discretizing them
d) None of these.

45. The cell body of neuron can be analogous to what mathematical operation?
a) summing b) differentiator
c) integrator d) none of the mentioned.
46. What is the advantage of basis function over multilayer feedforward neural networks?
a) training of basis function is faster than MLFFNN
b) training of basis function is slower than MLFFNN
c) storing in basis function is faster than MLFFNN
d) none of the mentioned.

47. Suppose you are using a bagging-based algorithm say a Random Forest in model building. Which
of the following can be true?
1. Number of trees should be as large as possible
2. You will have interpretability after using Random Forest
a) 1 b) 2
c) 1 and 2 d) None of these

48. What consist of a basic counter propagation network?


a) a feedforward network only b) a feedforward network with hidden layer
c) two feedforward networks with hidden layer d) none of the mentioned.

49. The process of adjusting the weight is known as?


a) activation b) synchronisation
c) learning d) none of the mentioned.

50. How do you handle missing or corrupted data in a dataset?


a) Drop missing rows or columns
b) Replace missing values with mean/median/mode
c) Assign a unique category to missing values
d) All of the above.

51. In which of the following cases will K-means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with nonconvex shapes
a) 1 and 2 b) 2 and 3
c) 1, 2 and 3 d) 1 and 3

52. Which scenario prefers failover cluster instance over standalone instance in SQL Server?
a) High Confidentiality b) High Availability
c) High Integrity d) None of the mentioned.

53. The resources owned by WSFC node include


a) Destination address
b) SQL Server Browser
c) One file share resource, if the FILESTREAM feature is installed
d) None of the mentioned.

54. A Windows Failover Cluster can support up to how many nodes?


a) 12 b) 14
c) 16 d) 18.

55. An exciting new feature in SQL Server 2014 is the support for the deployment of a
Failover Cluster Instance (FCI) with
a) Cluster Shared Volumes (CSV). b) In memory database.
c) Column oriented database. d) All of the mentioned.
56. Which of the following is a Windows Failover Cluster quorum mode?
a) Node Majority b) No Majority: Read Only
c) File Read Majority d) None of the mentioned.

57. Benefits that SQL Server failover cluster instances provide


a) Protection at the instance level through redundancy
b) Disaster recovery solution using a multi-subnet FCI
c) Zero reconfiguration of applications and clients during failovers
d) All of the mentioned.

58. Which of the following argument is used to set importance values? *Correct - (a)
a) scale b) set
c) value d) all of the mentioned.

59. Point out the correct statement.


a) All z nodes are ephemeral, which means they are describing a “temporary” state
b) /hbase/replication/state contains the list of RegionServers in the main cluster
c) Offline snapshots are coordinated by the Master using ZooKeeper to communicate with the
RegionServers using a two-phase-commit-like transaction
d) None of the mentioned.

60. To register a “watch” on znode data, you need to use the ______ commands to access the
current content or metadata.
a) stat b) put
c) receive d) gets

61. Which of the following has a design policy of using ZooKeeper only for transient data?
a) Hive b) Impala
c) Hbase d) Oozie

62. Which of the following specifies the required minimum number of observations for each column
pair in order to have a valid result?
a) min_periods b) max_periods
c) minimum_periods d) all of the mentioned.

63. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?
a) Big data management and data mining b) Data warehousing and business intelligence
c) Management of Hadoop clusters d) Collecting and storing unstructured data.

64. All of the following accurately describe Hadoop, EXCEPT


a) Open-source b) Real-time
c) Java-based d) Distributed computing approach.

65. What are the five V's of Big Data?


a) Volume. b) Velocity.
c) Variety. d) All the above.

66. What are the different features of Big Data Analytics?


a) Open-Source b) Scalability
c) Data Recovery d) All the above.
67. Facebook tackles Big Data with ______, based on Hadoop.
a) ‘Project Prism’ b) ‘Prism’
c) ‘Project Big’ d) ‘Project Data’.

68. What is the unit of data that flows through a Flume agent?
a) Log b) Row
c) Record d) Event

69. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management, and SQL support.

70. What was Hadoop named after?


a) Creator Doug Cutting’s favourite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop development.

71. When a JobTracker schedules a task, it first looks for?


a) A node with empty slot in the same rack as datanode
b) Any node on the same rack as the datanode
c) Any node on the rack adjacent to rack of the datanode
d) Just any node in the cluster

72. Which RNA carries the genetic information from the DNA to the ribosome for protein synthesis?
a) tRNA b) mRNA
c) rRNA d) DNA

73. What is the primary function of tRNA (transfer RNA) in the process of protein synthesis?
a) Transcribing genetic information from DNA b) Carrying amino acids to the ribosome
c) Providing a template for protein synthesis d) Forming the ribosomal structure.

74. Which would be more appropriate to be replaced with question mark in the following figure?

a) data analysis b) data science


c) descriptive analytics d) none of the mentioned

75. Which of the following approach should be used to ask Data Analysis question?
a) Find only one solution for a particular problem
b) Find out the question which is to be answered
c) Find out answer from dataset without asking question
d) None of the mentioned.
76. Choose which of the following design term is perfectly applicable to the below figure?

a) correlation b) confounding
c) causation d) none of the mentioned.

77. Choose: the goal of ______ is to focus on summarizing and explaining a specific set of data.
a) inferential statistics b) descriptive statistics
c) none of these d) all of these.

78. Anita randomly picks 4 cards from a deck of 52 cards and places them back into the deck (any
set of 4 cards is equally likely).
a) 48C4 x 52C4 b) 48C4 x 52C8
c) 48C8 x 52C8 d) None of these

79. If a fair six-sided die is rolled 6 times. What is the probability of getting all outcomes as unique?
a) 0.01543 b) 0.01993
c) 0.23148 d) 0.03333

80. Select when an event A independent of itself?


a) always b) if and only if P(A)=0
c) if and only if P(A)=1 d) if and only if P(A)=0 or 1

81. Some test scores follow a normal distribution with a mean of 18 and a standard deviation of
6. Select what proportion of test takers have scored between 18 and 24?
a) 0.2 b) 0.22
c) 0.34 d) none of these

82. Weight (Y) is regressed on height (X) of 40 adults. The height range in the data is 50-100 and
the regression line is Y = 100+0.1X with R^2= 0.12. Choose which of the conclusions below
does not necessarily follow?
a) the data suggests a weak relationship between X and Y
b) an adult with an X-value of 60 has an estimated Y-value of 106
c) an adult with an X-value of 80 has an estimated Y-value of 108
d) an adult with an X-value of 90 has an estimated Y-value of 10

83. What is the number of restrictions in the calculation of the F-statistics in question 22 above?
a) 1 b) 2
c) 3 d) 4

84. Choose what is the degree of freedom of any t-statistic calculated?


a) 30 b) 29
c) 28 d) 5
85. Which of the following is an assumption of one-way ANOVA comparing samples from three ‘or
more experimental treatments?
a) All the response variables within the k populations follow Normal distributions.
b) The samples associated with each population are randomly selected and are independent from all
other samples.
c) The response variable within each of the k populations has equal variances.
d) All of the above.

86. In a study, subjects are randomly assigned to one of three groups: control, experimental A, or
experimental B. After treatment, the mean scores for the three groups are compared. Choose
the appropriate statistical test for comparing these means is:
a) the analysis of variance b) the correlation coefficient
c) chi square d) the t-test

87. Assume that there is no overlap between the box and whisker plots for three drug
treatments where each drug was administered to 35 individuals. Choose the box plots for
these data:
a) represent evidence against the null hypothesis of ANOVA
b) provide no evidence for, or against, the null hypothesis of ANOVA
c) represent evidence for the null hypothesis of ANOVA
d) none of the mentioned.

88. Select what would happen if instead of using an ANOVA to compare 10 groups, you
performed multiple t- tests
a) making multiple comparisons with t-test increases the probability of making a Type I error
b) sir Ronald Fischer would be turning over in his grave; he put all that work into developing
ANOVA, and you use multiple t-tests
c) nothing serious, except that making multiple comparisons with a t-test requires more computation
than doing a single ANOVA.
d) Nothing, there is no difference between using an ANOVA and using a t-test.

89. If you pooled all the individuals from all three lakes into a single group, select they would have
a standard deviation of
a) 1.257 b) 1.58
c) 3.767 d) 14.19

90. Choose Big data is used to uncover -


a) hidden patterns and unknown correlations b) market trends and customer preferences
c) other useful information d) all of these

91. Consider a hypothesis H0 where ϕ0 = 5 against H1 where ϕ1 > 5. Select the test is
a) right tailed b) left tailed
c) center tailed d) cross tailed

92. Logistic regression is used when you select to


a) predict a dichotomous variable from continuous or dichotomous variables
b) predict a continuous variable from dichotomous variables
c) predict any categorical variable from several other categorical variables
d) predict a continuous variable from dichotomous or continuous variable

93. Select Linear discriminant analysis is


a) unsupervised learning b) supervised learning
c) semi-supervised learning d) none of these
94. Choose in binary logistic regression
a) The dependent variable is continuous.
b) The dependent variable is divided into two equal subcategories.
c) The dependent variable consists of two categories.
d) There is no dependent variable.

95. Select Logistic regression assumes a


a) Linear relationship between continuous predictor variables and the outcome variable.
b) Linear relationship between continuous predictor variables and the logit of the outcome variable.
c) Linear relationship between continuous predictor variables.
d) Linear relationship between observations.

96. Choose in supervised learning, class labels of the training samples are
a) known b) unknown
c) does not matter d) partially known

97. Choose: if Sw is singular and N < D, its rank is at most (N is the total number of samples, D the
dimension of the data, C the number of classes)
a) N+C b) N
c) C d) N-C

98. Select: if Sw is singular and N < D, the alternative solution is to use (N is the total number of samples,
D the dimension of the data)
a) EM b) PCA
c) ML d) any of these

99. Choose which of the following method options is provided by the train function for bagging
a) bagEarth b) treebag
c) bagFDA d) all of the mentioned

100. Select which of the following is statistical boosting based on additive logistic regression
a) gamboost b) gbm
c) ada d) All of these

101. Which of the following are the advantage/s of Decision Trees?


a) Possible Scenarios can be added
b) Use a white box model, If given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of the mentioned

102. In Hebbian learning initial weights are set?


a) random b) near to zero
c) near to target value d) near to target value

103. What conditions are must for competitive network to perform pattern clustering?
a) non linear output layers
b) connection to neighbours is excitatory and to the farther units inhibitory
c) on centre off surround connections
d) none of the mentioned fulfils the whole criteria
1. Describe cleansing and what are the best ways to practice data cleansing?
Data Cleansing, also called Data Scrubbing is the first step of data preparation. It is a process of
finding out and correcting or removing incorrect, incomplete, inaccurate, or irrelevant data in the
dataset. If data is incorrect, outcomes and algorithms are unreliable, though they may look correct.
The best practices for Data Cleaning are :
a. Develop a data quality strategy.
b. Correct data at the point of entry.
c. Validate the accuracy of data.
d. Manage removing duplicate data.

2. Discuss a few problems that data analyst usually encounters while performing the analysis?
Biased Data : Data could be biased due to the source from which it is collected. For instance,
suppose you collect data to determine the winner of an electoral campaign, collecting from a specific
region alone introduces one form of a bias, while collecting data from a specific income group
introduces another form of bias.
Duplicates in the data : Data could have duplicates which may impact the result of analysis.
Missing data : All data points might not have values for all the attributes you are analyzing.
Noisy data : The data could be noisy; usually a high value of variance indicates noise.
Outliers in the data : Points outside the expected range of the data that introduce inconsistencies into the model.
Difference in formats in various data sources : Some data could be crawled and collected in html
format, while other data might be collected from online reviews in text format. A third source of
data might be structured data already in the database. A data analyst usually must ingest several
data sources to get richer data.

Data Volume : A large amount of data will require a different class of algorithms for processing to
handle efficiently.

3. Write the best practices in big data analytics?


Some of the best practices in big data analytics are :
- Defining clear objectives.
- Knowing which data is important and which is not.
- Assuring data quality.
- Committing to proper data labelling.
- Choosing proper data storage locations.
- Simplifying procedures for backup.
- Implementing security measures.
- Ensuring scalable infrastructure.
- Arranging data audits at regular basis.

4. Describe why SVMs often more accurate than logistic regression?

- SVM tries to maximize the margin between the closest support vectors whereas
logistic regression maximizes the posterior class probability.
- LR is used for solving Classification problems, while SVM model is used for both Classification
and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.
SVM outperforms LR in classifying grayscale images of handwritten digits (like the digits 0 to 9).
5. Classify overfitting and underfitting and how to combat them?
Overfitting : Overfitting occurs when the model tries to cover all the data points, or more than the
required data points present in the given dataset. Because of this, the model starts caching noise
and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy
of the model. The overfitted model has low bias and high variance.

Some ways by which the occurrence of overfitting can be reduced are : Cross-Validation, Training
with more data, Removing features, Early stopping the training, Regularization, etc.
Underfitting : Underfitting occurs when the model is not able to capture the underlying trend of
the data. When, to avoid overfitting, the feeding of training data is stopped too early, the model
is not able to learn enough from the training data, and this causes underfitting. Hence it reduces the
accuracy and produces unreliable predictions. An underfitted model has high bias and low variance.

We can avoid underfitting by increasing : the training time of the model and the number of features.

6. Define the various hierarchical methods of cluster analysis.


Hierarchical clustering refers to an unsupervised learning procedure that determines successive
clusters based on previously defined clusters.

There are two methods of Hierarchical Clustering :

Agglomerative Clustering :
Agglomerative clustering is a bottom-up approach. It starts by treating each individual data point
as a single cluster; clusters are then merged continuously based on similarity until one big cluster
containing all objects is formed. It is good at identifying small clusters.
Divisive Clustering :
Divisive clustering follows a top-down approach and is more efficient than agglomerative
clustering. It starts by considering all the data points as one big single cluster and then splits
them into smaller heterogeneous clusters continuously until every data point is in its own cluster.
Thus, it is good at identifying large clusters.
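
A minimal sketch of the agglomerative (bottom-up) variant, assuming scikit-learn is available (scikit-learn does not ship a divisive implementation); the blob dataset and the linkage choice are illustrative:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# "ward" linkage merges the pair of clusters that least increases within-cluster variance.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print("Cluster labels of the first ten points:", labels[:10])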

7. Tell how the initial number of clusters for the k-means algorithm can be estimated.


Elbow Method : It is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS (Within Cluster Sum of Squares) value, which defines the total
variations within a cluster.
Gap Statistic Method : It compares the total intra-cluster variation for different values of k
with their expected values under a null reference distribution of the data. The estimate of the optimal
number of clusters is the value of k that maximizes the gap statistic (i.e., that yields the largest gap
statistic). This means that the clustering structure is far away from a random uniform distribution of points.
Silhouette Approach : It measures the quality of a clustering, i.e., it determines how well each object
lies within its cluster. A high average silhouette width indicates a good clustering. The average
silhouette method computes the average silhouette of observations for different values of k. The
optimal number of clusters k is the one that maximizes the average silhouette over a range of
possible values for k.
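
A minimal sketch, assuming scikit-learn, that prints the WCSS (exposed as KMeans.inertia_) and the average silhouette width for a range of candidate k values on an illustrative dataset:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: WCSS={km.inertia_:.1f}, "
          f"average silhouette={silhouette_score(X, km.labels_):.3f}")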

8. Explain how Hadoop is related to big data? What are the features of Hadoop?
Hadoop is an open-source, Java-based framework used for storing and processing big data. The
data is stored on inexpensive commodity servers that run as clusters. Its distributed file system
enables concurrent processing and fault tolerance.

HDFS
- Hadoop comes with a distributed file system called the Hadoop Distributed File System (HDFS),
which was designed for Big Data processing.
- It enables the storage of large files by distributing the data among a pool of data nodes.
- It holds very large amounts of data and provides easier access.
- It is highly fault tolerant and designed to run on low-cost hardware.
9. Explain in detail about the probability distribution and entropy.

- Probability Distributions
A probability distribution is a statistical function that describes all the possible values and
probabilities for a random variable within a given range. This range will be bound by the
minimum and maximum possible values, but where the possible value would be plotted on the
probability distribution will be determined by a number of factors like mean (average), standard
deviation, skewness, and kurtosis. Two types are : Discrete Probability Distributions and
Continuous Probability Distributions.

- Entropy
Entropy measures the amount of surprise and data present in a variable. In information theory, a
random variable’s entropy reflects the average uncertainty level in its possible outcomes. Events
with higher uncertainty have higher entropy.
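
A minimal NumPy sketch of Shannon entropy for a discrete probability distribution; the two example distributions are illustrative:

import numpy as np

def entropy(p):
    # Shannon entropy (in bits) of a discrete distribution p whose probabilities sum to 1.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # ignore zero-probability outcomes
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: a uniform distribution is most uncertain
print(entropy([0.9, 0.05, 0.03, 0.02]))   # much lower: the outcomes are largely predictable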

10. Identify the difference between active learning and reinforcement learning and explain it
with a suitable example and diagram.
Active learning is based on the concept that, if a learning algorithm can choose the data it wants to learn
from, it can perform better than traditional methods with substantially less data for training. It is
therefore a kind of semi-supervised machine learning.

[Diagram : the active learner sends queries to the world/oracle, receives responses, and passes the labelled data to the classifier/model.]

Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an
interactive environment by trial and error using feedback from its own actions and experiences. It is
based on rewards and punishments mechanism which can be both active and passive.

[Diagram : the agent takes an action in the environment and receives the next state and a reward/penalty in return.]

11. Write the apriori algorithm for mining frequent item sets with an example.
The Apriori algorithm refers to the algorithm that is used to calculate the association rules between
objects, i.e., how two or more objects are related to one another. In other words, we can say that the
Apriori algorithm is an association rule learning method that analyzes, for example, whether people
who bought product A also bought product B.
Support : For an association rule, it is the percentage of transactions in the database that
contain A U B. For a single itemset A,
Support (A) = Number of transactions containing A / Total number of transactions
Confidence : For an association rule, it is the ratio of the number of transactions containing A U B to
the number of transactions containing A, i.e.,
Confidence (A => B) = No. of transactions containing A U B / No. of transactions containing A
The Apriori algorithm operates on a straightforward premise: when the support value of an item
set exceeds a certain threshold, it is considered a frequent item set. To begin, set the support
criterion, so that only those itemsets whose support exceeds the criterion are considered
relevant.
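
A minimal pure-Python sketch of the support and confidence formulas above, using a small hypothetical list of market-basket transactions:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Confidence(A => B) = Support(A U B) / Support(A)
    return support(antecedent | consequent) / support(antecedent)

print("Support({bread, milk}) =", support({"bread", "milk"}))          # 3/5 = 0.6
print("Confidence(bread => milk) =", confidence({"bread"}, {"milk"}))  # 0.6 / 0.8 = 0.75
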
12. Write the difference between data mining and data analysis?

Definition :
Data Mining - It is the process of extracting important patterns from large datasets.
Data Analysis - It is the process of analyzing and organizing raw data to determine useful information and decisions.
Function :
Data Mining - It is used in discovering hidden patterns in raw data sets.
Data Analysis - All operations involved in examining data sets to find conclusions.
Data set :
Data Mining - The data sets are generally large and structured.
Data Analysis - The dataset can be large, medium, or small, and can be structured, semi-structured, or unstructured.
Models :
Data Mining - Often requires mathematical and statistical models.
Data Analysis - Uses analytical and business intelligence models.
Visualization :
Data Mining - Generally does not require data visualization.
Data Analysis - Surely requires data visualization.
Goal :
Data Mining - The goal is to make data usable.
Data Analysis - It is used to make data-driven decisions.
Required Knowledge :
Data Mining - It involves the intersection of machine learning, statistics, and databases.
Data Analysis - It requires knowledge of computer science, statistics, mathematics, subject knowledge and AI/Machine Learning.
Output :
Data Mining - It shows the data trends and patterns.
Data Analysis - The output is a verified or discarded hypothesis.

13. State in detail the challenges of conventional systems.


Fundamental challenges
- Storage
- Processing
- Security
- Finding and Fixing Data Quality Issues
- Evaluating and Selecting Big Data Technologies
- Data Validation
- Scaling Big Data Systems

14. Can you express multiple one-way repeated measures ANOVAs on a two-way design?

- A one-way ANOVA is primarily designed to enable the equality testing between three or
more means. A two-way ANOVA is designed to assess the interrelationship of two
independent variables on a dependent variable.
- A one-way ANOVA only involves one factor or independent variable, whereas there are two
independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more
categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- A one-way ANOVA needs to satisfy only two principles of the design of experiments, i.e., replication
and randomization, as opposed to a two-way ANOVA, which meets all three principles of the design
of experiments: replication, randomization and local control.
15. Group the main characteristics of big data. Why do you need big data?
Five Vs of Big Data :
Volume - The name Big Data itself relates to an enormous size. Big Data refers to the vast 'volumes' of
data generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Variety - Big Data can be structured, unstructured, and semi-structured, collected from
different sources. In the past, data was collected only from databases and spreadsheets, but these
days data comes in an array of forms such as PDFs, emails, audio, photos, videos, etc.
Veracity - Veracity means how reliable the data is. There are many ways to filter or translate the
data; veracity is about being able to handle and manage data efficiently.
Value - Value is an essential characteristic of big data. It is valuable and reliable data that are used
for storing, processing, and analysing.
Velocity - Velocity refers to the speed at which data is created in real time. It covers the speed of
incoming data streams and their rate of change. A primary aspect of Big Data is to provide
the demanded data rapidly.

16. Write the role of activation function in neural network.


The activation function in Neural Networks takes an input 'x' multiplied by a weight 'w'. Bias allows
you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in
Neural Networks can be thought of as analogous to the role of a constant in a linear function,
whereby the line is effectively transposed by the constant value.

With no bias, the input to the activation function is 'x' multiplied by the connection weight 'w0'.

In a scenario with bias, the input to the activation function is 'x' times the connection weight 'w0'
plus the bias times the connection weight for the bias 'w1'. This has the effect of shifting the
activation function by a constant amount (b * w1).
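
A minimal NumPy sketch of a single neuron with a sigmoid activation, showing how the bias term b * w1 shifts the input of the activation function; the weights and input values are arbitrary illustrative numbers:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, w0 = 0.5, 2.0          # input and its connection weight
b, w1 = 1.0, -3.0         # bias input (usually 1) and the bias connection weight

print("Without bias:", sigmoid(x * w0))            # activation of w0 * x
print("With bias:   ", sigmoid(x * w0 + b * w1))   # same curve, shifted by b * w1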

17. Compare and contrast the relationship between clustering and centroids.
Several approaches to clustering exist. Each approach is best suited to a particular data distribution.
Focusing on centroid-based clustering using k-means :
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to
hierarchical clustering. k-means is the most widely used centroid-based clustering
algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
[Figure : example of centroid-based clustering]

18. Show in detail the applications of association rules.


The association rule is a learning technique that helps identify dependencies between two data
items. Based on the dependency, it then maps the items accordingly so that the association can be exploited more profitably.
Association rule furthermore looks for interesting associations among the variables of the dataset.
Applications :
- Market Basket Analysis
- Medical Diagnosis
- Census Data
- Building an Intelligent Transportation System
- Synthesis of Protein Sequences

19. Judge how can the initial number of clusters for k-means algorithm be estimated? - Ques (7)

20. Write how big data analytics is helpful in increasing business revenue. Explain the steps to be followed
to deploy a big data solution.
- Improved Accuracy : Big data analytics enables businesses to make decisions based on facts and
evidence rather than intuition or guesswork. By analyzing large volumes of data, patterns and
trends that may not be apparent at a smaller scale can be identified.
- Real-time Insights : In the fast-paced business environment, real-time insights are essential for
timely decision making. Big data analytics allows organizations to process and analyze data in
real-time or near real-time, enabling them to respond quickly to emerging trends, market shifts,
and customer demands.
- Customer Understanding : Understanding customers is vital for tailoring products, services, and
marketing strategies. Big data analytics provides a holistic view of customer behavior, preferences,
and needs by analyzing multiple data sources, such as online interactions, social media sentiment,
purchase history, and demographic information. This knowledge enables businesses to personalize
their offerings, deliver targeted marketing campaigns, and enhance the overall customer
experience.
- Competitive Advantage : In today’s competitive landscape, gaining an edge over rivals is crucial.
Big data analytics helps companies uncover insights that can differentiate them from competitors.
By identifying market trends, consumer preferences, and emerging opportunities, businesses can
develop innovative products, optimize pricing strategies, and deliver superior customer service.
- Risk Management : Big data analytics plays a vital role in risk management by
identifying potential risks and predicting future outcomes. By analyzing historical data
and using predictive modelling techniques, businesses can identify potential threats,
fraud patterns, and anomalies. This empowers organizations to take proactive measures
to mitigate risks, improve security, and safeguard their operations, reputation, and
financial well-being.
21. Describe K-Means. Is it necessary to convert the data to zero mean and unit variance?
K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning or data science. It groups the unlabeled dataset into different clusters.
The k-means clustering algorithm mainly performs two tasks:
- Determines the best value for K center points or centroids by an iterative process.
- Assigns each data point to its closest k-center. Those data points which are near to a particular
k-center form a cluster.
Because k-means relies on Euclidean distance, it is not strictly necessary to standardize the data to
zero mean and unit variance, but it is usually recommended when features are on very different
scales; otherwise features with large scales dominate the distance computation.
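
A minimal sketch, assuming scikit-learn, that standardizes illustrative data before running k-means:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
X[:, 0] *= 100            # exaggerate the scale of one feature

pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=7))
labels = pipeline.fit_predict(X)
print("First ten cluster assignments:", labels[:10])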

22. Define the complexity theory for MapReduce. What is the Reducer side of MapReduce?
In the context of MapReduce, complexity theory refers to the analysis of the computational
complexity of algorithms and problems when using the MapReduce programming model.
Map Phase : If n is the number of input records and m is the size of the input data, the time
complexity of the Map phase is typically O(n+m).
Shuffle and Sort : The time complexity depends on the efficiency of the shuffling mechanism and the
size of the data being shuffled.
Reduce Phase : If r is the number of reducer nodes and k is the number of unique keys, the time
complexity of the Reduce phase is often O(r+k).
The Reducer side of MapReduce consists mainly of 3 processes/phases :
Shuffle : Shuffling carries data from the Mapper to the required Reducer. With the help
of HTTP, the framework fetches the applicable partition of the output of all Mappers.
Sort : In this phase, the output of the mapper, which consists of key-value pairs, is sorted
on the basis of the keys.
Reduce : Once shuffling and sorting are done, the Reducer combines the obtained results and
performs the required computation. The OutputCollector.collect() method is used for writing
the output to HDFS.
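
A minimal pure-Python sketch of the map, shuffle, sort and reduce steps for a word-count job (no Hadoop cluster is assumed; the documents are illustrative):

from collections import defaultdict

documents = ["big data needs big storage", "data drives decisions"]

# Map: emit (key, value) pairs, here (word, 1).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values that belong to the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Sort the keys, then Reduce: combine the grouped values.
for key in sorted(groups):
    print(key, sum(groups[key]))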

23. Group few problems that data analyst usually encounter while performing analysis. - Ques (2)

24. Write down short notes on Association Rule Mining.


Association rule mining finds interesting associations and relationships among large sets of data
items. This rule shows how frequently an itemset occurs in a transaction. A typical example is
Market Basket Analysis. It is one of the key techniques used by large retailers to show associations
between items. It allows retailers to identify relationships between the items that people buy
together frequently.
Given a set of transactions,
Support : It indicates how frequently the if/then relationship appears in the database.
Support (A) = Number of transactions containing A / Total number of transactions
Confidence : It tells about the number of times these relationships have been found to be true, i.e.,
Confidence (A => B) = No. of transactions containing A U B / No. of transactions containing A
25. Write down the short notes about Deep learning.
Deep learning is a branch of machine learning which is based on artificial neural networks. It is
capable of learning complex patterns and relationships within data. In deep learning, we don’t
need to explicitly program everything. It has become increasingly popular in recent years due to
the advances in processing power and the availability of large datasets. It is based on
artificial neural networks (ANNs) with many layers, also known as deep neural networks (DNNs). These neural
networks are inspired by the structure and function of the human brain’s biological neurons, and
they are designed to learn from large amounts of data.

26. Explain the applications of clustering.


The clustering technique can be widely used in various tasks. Some most common uses of this
technique are :
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation system to provide
recommendations based on a user's past product searches. Netflix also uses this technique to
recommend movies and web series to its users based on their watch history.
Conceptually, a clustering algorithm would divide a mixed collection of fruits into several groups
with similar properties.

27. Define Quadratic Discriminant Analysis.


Quadratic Discriminant Analysis (QDA) is the general form of Bayesian discrimination. Discriminant
analysis is used to determine which variables discriminate between two or more naturally occurring
groups. The difference from LDA is that QDA relaxes the assumption that the covariance matrices of
all the classes are equal.
Working :
QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of
observations. QDA is particularly useful if there is prior knowledge that individual classes exhibit
distinct covariances.
QDA assumes that observation of each class is drawn from a normal distribution (similar to linear
discriminant analysis).
QDA assumes that each class has its own covariance matrix (different from linear discriminant
analysis)

The advantage of quadratic discriminant analysis over linear discriminant analysis and logistic
regression appears when the decision boundaries are non-linear, since QDA can fit quadratic
boundaries; when the decision boundaries are linear, linear discriminant analysis and logistic
regression will perform well.
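
A minimal sketch, assuming scikit-learn, fitting LDA and QDA side by side on synthetic data whose two classes have different covariance structures (all numbers are illustrative):

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
class0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=200)
class1 = rng.multivariate_normal([2, 2], [[3.0, 0.8], [0.8, 0.5]], size=200)
X = np.vstack([class0, class1])
y = np.array([0] * 200 + [1] * 200)

# QDA estimates a separate covariance matrix per class, so it can exploit the difference.
for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    print(type(model).__name__, "training accuracy:", model.fit(X, y).score(X, y))
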
28. Explain what is ANOVA?
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups or treatments. It is often used to determine whether there
are any statistically significant differences between the means of different groups.
ANOVA compares the variation between group means to the variation within the groups. If the
variation between group means is significantly larger than the variation within groups, it suggests a
significant difference between the means of the groups.
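
A minimal sketch of a one-way ANOVA, assuming SciPy is available; the three groups of values are purely illustrative:

from scipy.stats import f_oneway

group_a = [23, 25, 27, 22, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # a small p-value suggests the group means differ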

29. Describe in detail about the role of Probability Distribution in data analytics.
A probability distribution is a statistical function that describes all the possible values and
probabilities for a random variable within a given range. This range will be bound by the minimum
and maximum possible values, but where the possible value would be plotted on the probability
distribution will be determined by a number of factors like mean (average), standard deviation,
skewness, and kurtosis. Two types are : Discrete Probability Distributions and Continuous
Probability Distributions. Probability distributions enable us to analyze data and draw meaningful
conclusions by describing the likelihood of different outcomes or events.
In statistical analysis, these distributions play a pivotal role in parameter estimation, hypothesis
testing, and data inference. They also find extensive use in risk assessment, particularly in finance
and insurance, where they help assess and manage financial risks by quantifying the likelihood of
various outcomes.
30. Describe briefly descriptive statistics.
Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and
presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's main
features and characteristics without making any generalizations or inferences to a larger population.
The primary goal of descriptive statistics is to provide a clear and concise summary of the data,
enabling researchers or analysts to gain insights and understand patterns, trends, and distributions
within the dataset. This summary typically includes measures such as central tendency (e.g., mean,
median, mode), dispersion (e.g., range, variance, standard deviation), and shape of the distribution
(e.g., skewness, kurtosis).
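
A minimal NumPy/SciPy sketch computing the descriptive measures named above for an illustrative sample:

from collections import Counter
import numpy as np
from scipy import stats

data = np.array([12, 15, 14, 10, 18, 15, 11, 20, 15, 13])

print("mean:", np.mean(data))
print("median:", np.median(data))
print("mode:", Counter(data.tolist()).most_common(1)[0][0])
print("range:", np.ptp(data))
print("variance:", np.var(data, ddof=1))      # ddof=1 gives the sample variance
print("std dev:", np.std(data, ddof=1))
print("skewness:", stats.skew(data))
print("kurtosis:", stats.kurtosis(data))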

31. Write the difference between probability distribution and descriptive statistic.

Purpose :
Descriptive Statistics - Summarize and describe an existing dataset.
Probability Distribution - Model and predict outcomes from random variables.
Data Type :
Descriptive Statistics - Applied to observed data.
Probability Distribution - Mathematical model for random variables.
Example :
Descriptive Statistics - Mean, median, mode, range, variance.
Probability Distribution - Normal, exponential, binomial, Poisson.
Use Cases :
Descriptive Statistics - Data exploration, data presentation.
Probability Distribution - Hypothesis testing, risk assessment, modelling random processes.
Goal :
Descriptive Statistics - Describe and summarize data properties.
Probability Distribution - Understand randomness and uncertainty in data-generating processes.

32. Describe in detail about the role of statistical model in data analytics.
Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is
a mathematical representation (or mathematical model) of observed data.

When data analysts apply various statistical models to the data they are investigating, they are able
to understand and interpret the information more strategically. Rather than sifting through the raw
data, this practice allows them to identify relationships between variables, make predictions about
future sets of data, and visualize that data so that non-analysts and stakeholders can consume and
leverage it.

33. Illustrate Why are SVMs often more accurate than logistic regression? - Ques (4)

34. Explain What is overfitting and underfitting and how to combat them? - Ques (5)

35. Write how Hadoop is related to big data? - Ques (8)

36. Illustrate about the features of Hadoop. - Ques (8)


37. Discriminate overfitting and underfitting. - Ques (5)

38. Explain the design principles of neural network.

Input neurons
- This is the number of features the neural network uses to make its predictions.
- The input vector needs one input neuron per feature. For tabular data, this is the number of
relevant features in the dataset.
Output neurons
- This is the number of predictions the user wants to make.
- Regression : For regression tasks, this can be one value (e.g. housing price). For multi-variate
regression, it is one neuron per predicted value (e.g. for bounding boxes it can be 4 neurons —
one each for bounding box height, width, x-coordinate, y-coordinate).
- Classification : For binary classification (spam-not spam), we use one output neuron per positive
class, wherein the output represents the probability of the positive class.
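
A minimal NumPy sketch of these sizing rules for a binary classifier: one input neuron per feature and a single sigmoid output neuron. The weights are random placeholders, not a trained model, and the hidden-layer width is an arbitrary choice:

import numpy as np

n_features = 4            # tabular data with 4 relevant features -> 4 input neurons
n_hidden = 8              # arbitrary hidden-layer width
n_outputs = 1             # binary classification -> one output neuron, P(positive class)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_outputs)), np.zeros(n_outputs)

x = rng.normal(size=(1, n_features))            # one example with 4 features
hidden = np.tanh(x @ W1 + b1)
prob_positive = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
print("Predicted probability of the positive class:", prob_positive[0, 0])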

39. Explain Why are SVMs often more accurate than logistic regression with examples. - Ques (4)

40. Justify why SVM is so fast.


Kernel Trick for Non-Linearity:
By avoiding the explicit computation of high-dimensional features, SVMs can handle non-linearities
without a significant increase in computational cost.
Sequential Minimal Optimization (SMO) Algorithm:
SMO is an algorithm commonly used for training SVMs. It breaks down the large optimization
problem into smaller, more manageable subproblems.
Efficient Caching and Memory Usage:
SVM implementations often incorporate efficient caching mechanisms and memory optimizations.
Parallelization:
SVM computations can be parallelized, taking advantage of multiple processors or distributed
computing environments.
Well-Optimized Libraries:
These libraries are implemented in efficient programming languages and are carefully optimized for
performance, contributing to the overall speed of SVMs.
Versatility in Linear Cases:
In cases where the decision boundary is mostly linear, SVMs with a linear kernel can be particularly
fast. The linear kernel simplifies the decision function, resulting in efficient computations.
41. Justify What are the best practices in big data analytics. - Ques (3)

42. Assess the techniques used in big data analytics.


Classification
These techniques are used to identify categories in which new data points belong based on data
points that have already been categorized in a training set.
Cluster analysis
Statistical method for classifying objects based on similarities among diverse groups of objects, but
without knowing in advance what characteristics make them similar.
Crowdsourcing
Crowdsourcing is a method for collecting data submitted by a large group of people or crowd
through an open call, usually through a networked medium such as the Internet.
Ensemble learning
Multiple predictive models (each constructed using statistics and/or machine learning) are used to
achieve better performance than any of the constituent models.
Neural networks
Finding patterns in data using computational models inspired by the structure and workings of
biological neural networks (such as the cells and connections found in the brain).
Network analysis
An analysis technique for describing relationships among discrete nodes in a graph or a network.
The social network analysis aims to investigate the connections among individuals in a group.
Sentiment analysis
The process of extracting and identifying subjective information from a text source using natural
language processing and analytic techniques.

43. Write the relationship of Centroid with clustering - Ques (17)

44. How to solve overfitting and underfitting problem? - Ques (5)

45. Write a few problems data analyst usually encounter during performing the analysis. - Ques (2)

46. Discuss about the application of ANOVA.

- Experimental studies : ANOVA is frequently used to evaluate the impact of independent
variables on the dependent variable in experimental studies. It helps researchers determine
which variables have a significant effect on the outcome of the study.
- Quality control : ANOVA finds its application in quality control to ascertain whether there
exist any discrepancies among the means of multiple groups. For instance, in manufacturing,
ANOVA can be employed to examine whether there is any variation in the quality of products
manufactured by different machines.
- Medical research : ANOVA is used in medical research to test the effectiveness of various
treatments for a particular disease. Researchers can compare the mean outcomes of different
treatments to identify the most effective one.
- Market research : ANOVA is used in market research to compare the mean responses of
customers to different products or advertising campaigns.
- Agriculture : ANOVA is employed in agriculture to compare the growth rates of crops
under different environmental conditions or in different soil types.

47. Analyze the relation between Hadoop and big data. - Ques (8)
48. Illustrate the different features of Hadoop. - Ques (8)

49. Compare overfitting and underfitting. - Ques (5)

50. Analyze the design principles of neural network. - Ques (38)

51. Analyze why SVMs are often more accurate than logistic regression with examples. - Ques (4)

52. Judge why SVM is so fast. - Ques (40)

53. Assess what are the best practices in big data analytics. - Ques (3)

54. Evaluate the techniques used in big data analytics. - Ques (42)

55. Explain the concept of core points in the DBSCAN (Density-Based Spatial Clustering of
Applications with Noise) algorithm, and discuss their significance in the clustering process.
Provide an example to illustrate your explanation.
Clusters are dense regions in the data space, separated by regions of the lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea is
that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum
number of points.

In this algorithm, we have 3 types of data points.


Core Point : A point is a core point if it has more than MinPts points within eps.
Border Point : A point which has fewer than MinPts within eps but lies in the neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.

Steps Used In DBSCAN Algorithm


- Find all the neighbor points within eps and identify the core points or visited with more than
MinPts neighbors.
- For each core point if it is not already assigned to a cluster, create a new cluster.
- Find recursively all its density-connected points and assign them to the same cluster as the core
point.
- Two points a and b are said to be density-connected if there exists a point c which has a sufficient
number of points in its neighborhood and both points a and b are within the eps distance of it. This is a
chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which
in turn is a neighbor of a, then b is connected to a.
- Iterate through the remaining unvisited points in the dataset. Those points that do not belong
to any cluster are noise.
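
A minimal sketch, assuming scikit-learn: the fitted model exposes the core points through core_sample_indices_, and points labelled -1 are treated as noise (the eps and min_samples values below are illustrative choices for this dataset):

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_core = len(db.core_sample_indices_)            # indices of the core points
n_noise = list(db.labels_).count(-1)             # label -1 marks noise/outliers
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"clusters={n_clusters}, core points={n_core}, noise points={n_noise}")
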
56. Write the main objective of the BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies) algorithm in data mining, and explain how it achieves this objective.
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that
can cluster large datasets by first generating a small and compact summary of the large dataset
that retains as much information as possible. This smaller summary is then clustered instead of
clustering the larger dataset. The BIRCH clustering algorithm consists of two stages :

- Building the CF Tree : BIRCH summarizes large datasets into smaller, dense regions called
Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered
triple, (N, LS, SS) where ’N’ is the number of data points in the cluster, ‘LS’ is the linear sum of the
data points and ‘SS’ is the squared sum of the data points in the cluster. It is possible for a CF
entry to be composed of other CF entries. Optionally, we can condense this initial CF tree into a
smaller CF tree.

- Global Clustering : Applies an existing clustering algorithm on the leaves of the CF tree. A CF
tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a
pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes.
Optionally, we can refine these clusters.
Due to this two-step process, BIRCH is also called Two Step Clustering.
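
A minimal sketch, assuming scikit-learn, showing the two stages: threshold and branching_factor control the CF tree, while n_clusters drives the global clustering step (the parameter values and dataset are illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import Birch

X, _ = make_blobs(n_samples=1000, centers=5, random_state=3)

# Stage 1 (CF tree) is controlled by threshold/branching_factor;
# stage 2 (global clustering) groups the leaf subclusters into n_clusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
print("Number of CF subclusters in the leaves:", len(model.subcluster_centers_))
print("Final cluster labels of the first ten points:", labels[:10])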

57. Reframe the basic idea behind the DIANA (Divisive Analysis) clustering algorithm in data
mining, and describe the key steps involved in its process.
DIANA is also known as the DIvisive ANAlysis clustering algorithm. It is the top-down form of
hierarchical clustering, where all data points are initially assigned to a single cluster. The
cluster is then split recursively into the two least similar clusters, until cluster groups are
formed which are distinct from each other.

In step 1, all the points can be thought of as belonging to a single cluster (drawn as one blue outlined
circle in a typical diagram). Moving forward, it is divided into two red-coloured clusters based on the
distances/density of points, so in step 2 there are two red-coloured clusters. Lastly, in step 3 the two
red clusters are further divided into two black dotted clusters each, again based on density and
distances, giving the final four clusters. Since the points within each of the 4 clusters are very similar
to each other and very different from the other cluster groups, they are not further divided. This is how
we obtain DIANA clusters, i.e., top-down hierarchical clusters.
58. Express the concept of state transitions in Hidden Markov Models (HMMs) and their significance
in modeling sequential data. Provide an example to illustrate your explanation.
Hidden Markov Model (HMM) is a statistical model that is used to describe the probabilistic
relationship between a sequence of observations and a sequence of hidden states. It is often used in
situations where the underlying system or process that generates the observations is unknown or
hidden, hence it got the name “Hidden Markov Model.”

It is used to predict future observations or classify sequences, based on the underlying hidden
process that generates the data.
An HMM consists of two types of variables: hidden states and observations.
- The hidden states are the underlying variables that generate the observed data, but they are not
directly observable.
- The observations are the variables that are measured and observed.
The relationship between the hidden states and the observations is modeled using probability
distributions. The Hidden Markov Model (HMM) describes this relationship using two sets of probabilities :
- The transition probabilities describe the probability of transitioning from one hidden state to another.
- The emission probabilities describe the probability of observing an output given a hidden state.

[Figure : Block diagram of a Hidden Markov Model]
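
A minimal NumPy sketch of the two probability tables for a toy weather model (hidden states Rainy/Sunny, observations Walk/Shop/Clean), plus the forward-algorithm probability of one observation sequence; all numbers are illustrative:

import numpy as np

states = ["Rainy", "Sunny"]
observations = ["Walk", "Shop", "Clean"]

start_prob = np.array([0.6, 0.4])                  # P(first hidden state)
transition = np.array([[0.7, 0.3],                 # P(next state | current state)
                       [0.4, 0.6]])
emission = np.array([[0.1, 0.4, 0.5],              # P(observation | hidden state)
                     [0.6, 0.3, 0.1]])

obs_seq = [0, 1, 2]                                # Walk, Shop, Clean

# Forward algorithm: alpha[i] = P(observations so far, current hidden state = i)
alpha = start_prob * emission[:, obs_seq[0]]
for o in obs_seq[1:]:
    alpha = (alpha @ transition) * emission[:, o]
print("P(observation sequence) =", alpha.sum())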
