
1. Describe why SVMs offer more accurate results than Logistic Regression.
- SVM tries to maximize the margin between the closest support vectors, whereas logistic regression maximizes the posterior class probability.
- LR is used for solving classification problems, while the SVM model is used for both classification and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.
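
For illustration only (this sketch is not part of the original answer), both models can be fitted side by side with scikit-learn; the synthetic dataset and hyperparameters are arbitrary choices.

```python
# Illustrative comparison of SVM and Logistic Regression on a made-up dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf").fit(X_train, y_train)                  # margin-maximizing classifier
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # probabilistic classifier

print("SVM accuracy:", svm.score(X_test, y_test))
print("LogReg accuracy:", lr.score(X_test, y_test))
print("LogReg class probabilities (first sample):", lr.predict_proba(X_test[:1]))
```
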
2. Explain about Probability Distribution and Entropy.
- Probability Distributions: A probability distribution is a statistical function that describes all the possible values and probabilities for a random variable within a given range. This range is bounded by the minimum and maximum possible values, but where a possible value falls on the probability distribution is determined by a number of factors such as the mean (average), standard deviation, skewness, and kurtosis. The two types are discrete probability distributions and continuous probability distributions.
- Entropy: Entropy measures the amount of surprise, or information, present in a variable. In information theory, a random variable's entropy reflects the average uncertainty level in its possible outcomes. Events with higher uncertainty have higher entropy.
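
As a small illustrative sketch (not from the original notes), the entropy of a discrete distribution can be computed directly from its probabilities; the probability vectors below are made-up examples.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # outcomes with zero probability contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))             # fair coin: 1.0 bit (maximum uncertainty)
print(entropy([0.9, 0.1]))             # biased coin: ~0.47 bits (less surprise)
```
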
3. Identify the difference between Active Learning and Reinforcement Learning. Explain it with suitable examples and a diagram.
Active learning is based on the idea that if a learning algorithm can choose the data it wants to learn from, it can perform better than traditional methods with substantially less training data, so it is a kind of semi-supervised machine learning. (Diagram in the original notes: the active learner sends queries to the world/oracle and uses the responses to update the classifier/model.)
Reinforcement learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. It is based on a rewards-and-punishments mechanism, which can be both active and passive. (Diagram in the original notes: the agent takes an action in the environment and receives the next state and a reward or penalty.)

4. Express multiple one-way ANOVA on a two-way design.
- A one-way ANOVA is primarily designed to enable equality testing between three or more means. A two-way ANOVA is designed to assess the interrelationship of two independent variables on a dependent variable.
- A one-way ANOVA only involves one factor or independent variable, whereas there are two independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- A one-way ANOVA needs to satisfy only two principles of the design of experiments, i.e., replication and randomization, as opposed to a two-way ANOVA, which meets all three principles of the design of experiments: replication, randomization, and local control.

5. Why do you need to take Big Data?
Big Data is a massive amount of data sets that cannot be stored, processed, or analysed using traditional tools. There are millions of data sources that generate data at a very rapid rate, and these data sources are present across the world. Some of the largest sources of data are social media platforms and networks. Facebook, for example, generates more than 500 terabytes of data every day, including pictures, videos, messages, and more. Data also exists in different formats, like structured data, semi-structured data, and unstructured data. For example, in a regular Excel sheet, data is classified as structured data.

6. Write the role of the Activation Function in Neural Networks.
The activation function in neural networks takes an input 'x' multiplied by a weight 'w'. Bias allows you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in neural networks can be thought of as analogous to the role of a constant in a linear function, whereby the line is effectively transposed by the constant value. With no bias, the input to the activation function is 'x' multiplied by the connection weight 'w0'. In a scenario with bias, the input to the activation function is 'x' times the connection weight 'w0' plus the bias times the connection weight for the bias 'w1'. This has the effect of shifting the activation function by a constant amount (b * w1).
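
A minimal numeric sketch of the shift described above (not part of the original notes); the weights, bias, and sigmoid activation are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    # A common activation function, used here purely for illustration.
    return 1.0 / (1.0 + np.exp(-z))

x, w0 = 0.7, 1.5      # input and its connection weight
b, w1 = 1.0, -2.0     # bias input (fixed at 1) and the bias connection weight

z_no_bias = x * w0                 # without bias: input to the activation is x * w0
z_with_bias = x * w0 + b * w1      # with bias: shifted by the constant amount b * w1

print(sigmoid(z_no_bias), sigmoid(z_with_bias))
```
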
7. Compare and construct the relationship between Clustering and Centroid.
Several approaches to clustering exist, and each approach is best suited to a particular data distribution. Focusing on centroid-based clustering using k-means: centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical clustering (discussed later in these notes). k-means is the most widely used centroid-based clustering algorithm; each cluster is represented by its centroid, and points are assigned to the cluster whose centroid is nearest. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
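
A hedged scikit-learn sketch (not in the original answer) showing how k-means exposes the centroid of each cluster; the blob dataset and the choice of k = 3 are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Centroids:\n", km.cluster_centers_)   # one centroid per cluster
print("First 10 labels:", km.labels_[:10])   # each point is assigned to its nearest centroid
```
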
8. Write a short note on Deep Learning.
Deep learning is a branch of machine learning which is based on artificial neural networks. It is capable of learning complex patterns and relationships within data, and we do not need to explicitly program everything. It has become increasingly popular in recent years due to advances in processing power and the availability of large datasets. It is built on artificial neural networks (ANNs), also known as deep neural networks (DNNs). These neural networks are inspired by the structure and function of the human brain's biological neurons, and they are designed to learn from large amounts of data.

9. Explain ANOVA.
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences between the means of two or more groups or treatments. It is often used to determine whether there are any statistically significant differences between the means of different groups. ANOVA compares the variation between group means to the variation within the groups. If the variation between group means is significantly larger than the variation within groups, it suggests a significant difference between the means of the groups.
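
For a concrete example (not part of the original answer), a one-way ANOVA can be run with SciPy; the three groups below are invented numbers.

```python
from scipy import stats

# Three hypothetical treatment groups (illustrative numbers only).
group_a = [23, 25, 21, 22, 24]
group_b = [30, 29, 31, 28, 32]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one group mean differs from the others.
```
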
10. Explain Quadratic Discriminant Analysis.
Quadratic discrimination is the general form of Bayesian discrimination. Discriminant analysis is used to determine which variables discriminate between two or more naturally occurring groups. The difference from LDA is that QDA relaxes the assumption that the mean and covariance of all the classes are equal. Working: QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of observations. QDA is particularly useful if there is prior knowledge that individual classes exhibit distinct covariances. QDA assumes that the observations of each class are drawn from a normal distribution (similar to linear discriminant analysis), but that each class has its own covariance matrix (different from linear discriminant analysis). Compared with linear discriminant analysis and logistic regression, QDA is more flexible: when the decision boundaries are linear, LDA and logistic regression perform well, whereas QDA can also capture non-linear (quadratic) boundaries.
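
As an illustrative sketch (not from the original notes), LDA and QDA can be fitted on the same toy data with scikit-learn; the dataset parameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

# Toy data; QDA estimates one covariance matrix per class, LDA shares a single one.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_classes=2, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```
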
11. Describe Probability Distribution in detail.
A probability distribution is a mathematical function that defines the likelihood of different outcomes or values of a variable. This function is commonly represented by a graph or probability table, and it provides the probabilities of various possible results of an experiment or random phenomenon based on the sample space and the probabilities of events. Probability distributions are fundamental in probability theory and statistics for analyzing data and making predictions. They enable us to analyze data and draw meaningful conclusions by describing the likelihood of different outcomes or events. In statistical analysis, these distributions play a pivotal role in parameter estimation, hypothesis testing, and data inference. They also find extensive use in risk assessment.
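
A brief illustrative sketch (not in the original notes) of evaluating one discrete and one continuous distribution with SciPy; the distributions and parameters are arbitrary examples.

```python
from scipy import stats

# Discrete example: Binomial(n=10, p=0.5), e.g. the number of heads in 10 coin flips.
print("P(X = 5):", stats.binom.pmf(5, n=10, p=0.5))
print("P(X <= 3):", stats.binom.cdf(3, n=10, p=0.5))

# Continuous example: standard normal distribution.
print("density at 0:", stats.norm.pdf(0))
print("P(X <= 1.96):", stats.norm.cdf(1.96))
```
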
13. What is the role of a Statistical Model in data analytics?
Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is a mathematical representation (or mathematical model) of observed data. When data analysts apply various statistical models to the data they are investigating, they are able to understand and interpret the information more strategically. Rather than sifting through the raw data, this practice allows them to identify relationships between variables, make predictions about future sets of data, and visualize that data so that non-analysts and stakeholders can consume and leverage it.

14. Justify why SVM is so fast.
SVM performs and generalizes well on out-of-sample data. In addition, classifying a single new sample only requires evaluating the kernel function against the support vectors, rather than against the entire training set, which keeps prediction fast.

15. Main characteristics of Data Analytics.
Data analytics is the process of examining, cleaning, transforming, and interpreting data to extract valuable insights and support decision-making. The main characteristics of data analytics are:
● Data collection from various sources.
● Data cleaning and preprocessing for accuracy.
● Descriptive, diagnostic, predictive, and prescriptive analytics.
● Data visualization for better understanding.
● Use of machine learning and AI.
● Handling big data and real-time analytics.
● Emphasis on data security and privacy.
● An iterative process to refine insights.
● Domain-specific focus that fosters a data-driven culture.

Class Test I

2. Express multiple one-way repeated measures ANOVA on a two-way design.
- A one-way ANOVA is primarily designed to enable equality testing between three or more means. A two-way ANOVA is designed to assess the interrelationship of two independent variables on a dependent variable.
- A one-way ANOVA only involves one factor or independent variable, whereas there are two independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- A one-way ANOVA needs to satisfy only two principles of the design of experiments, i.e., replication and randomization.

3. Classify overfitting and underfitting and how to combat them.
Overfitting: Overfitting occurs when the model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance. Some ways by which the occurrence of overfitting can be reduced are cross-validation, training with more data, removing features, early stopping of training, regularization, etc.
Underfitting: Underfitting occurs when the model is not able to capture the underlying trend of the data. It often happens when, in order to avoid overfitting, the feeding of training data is stopped too early, so the model cannot learn enough from the training data. Underfitting reduces accuracy and produces unreliable predictions.
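
As a hedged illustration of two of the remedies listed above, cross-validation and regularization (not part of the original answer), the sketch below uses ridge regression on a small, easy-to-overfit synthetic dataset; the alpha values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Small noisy dataset with many features, which is easy to overfit.
X, y = make_regression(n_samples=60, n_features=40, noise=10.0, random_state=0)

for alpha in (0.01, 1.0, 100.0):                              # larger alpha = stronger regularization
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)  # 5-fold cross-validation
    print(f"alpha={alpha:6.2f}  mean CV R^2 = {scores.mean():.3f}")
```
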
4. Describe why SVM is more accurate than Logistic Regression.
- SVM tries to maximize the margin between the closest support vectors, whereas logistic regression maximizes the posterior class probability.
- LR is used for solving classification problems, while the SVM model is used for both classification and regression.
- SVM is deterministic while LR is probabilistic.
- LR is vulnerable to overfitting, while the risk of overfitting is less in SVM.

5. Write the best practices for big data analytics.
Some of the best practices in big data analytics are:
- Defining clear objectives.
- Knowing which data is important and which is not.
- Assuring data quality.
- Committing to proper data labelling.
- Choosing proper data storage locations.
- Managing the data lifecycle.
- Simplifying procedures for backup.
- Implementing security measures.
- Ensuring scalable infrastructure.
- Arranging data audits on a regular basis.

6. Write the role of the activation function in a neural network.
The activation function in neural networks takes an input 'x' multiplied by a weight 'w'. Bias allows you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in neural networks can be thought of as analogous to the role of a constant in a linear function, whereby the line is effectively transposed by the constant value. With no bias, the input to the activation function is 'x' multiplied by the connection weight.

7. Describe Descriptive Statistics.
Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's main features and characteristics without making any generalizations or inferences to a larger population. The primary goal of descriptive statistics is to provide a clear and concise summary of the data, enabling researchers or analysts to gain insights and understand patterns, trends, and distributions within the dataset. This summary typically includes measures such as central tendency (e.g., mean, median, mode), dispersion (e.g., range, variance, standard deviation), and shape of the distribution (e.g., skewness, kurtosis).
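
For illustration (not from the original notes), the listed descriptive measures can be computed with pandas on a made-up sample.

```python
import pandas as pd

data = pd.Series([12, 15, 14, 10, 18, 20, 15, 14, 13, 16])   # illustrative sample

print("mean:", data.mean())
print("median:", data.median())
print("mode:", data.mode().tolist())
print("std:", data.std())
print("skewness:", data.skew())
print("kurtosis:", data.kurt())
print(data.describe())   # count, mean, std, min, quartiles, max in one call
```
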
8. Applications of Clustering.
The clustering technique can be widely used in various tasks. Some of the most common uses of this technique are:
o Market segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history. A diagram in the original notes illustrates the working of the clustering algorithm, showing different fruits divided into several groups with similar properties.

17. Explain why Big Data Analytics is helpful in increasing business revenue. Explain the steps to be followed to deploy a Big Data solution.
- Improved Accuracy: Big data analytics enables businesses to make decisions based on facts and evidence rather than intuition or guesswork. By analyzing large volumes of data, patterns and trends that may not be apparent at a smaller scale can be identified.
- Real-time Insights: In the fast-paced business environment, real-time insights are essential for timely decision making. Big data analytics allows organizations to process and analyze data in real time or near real time, enabling them to respond quickly to emerging trends, market shifts, and customer demands.
- Customer Understanding: Understanding customers is vital for tailoring products, services, and marketing strategies. Big data analytics provides a holistic view of customer behavior, preferences, and needs by analyzing multiple data sources, such as online interactions, social media sentiment, purchase history, and demographic information. This knowledge enables businesses to personalize their offerings.

18. Short note on Association Rule Mining and Deep Learning.
- Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis. It is one of the key techniques used by large retailers to show associations between items: it allows retailers to identify relationships between the items that people buy together frequently. Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
- Deep learning is a branch of machine learning which is based on artificial neural networks. It is capable of learning complex patterns and relationships within data, and we do not need to explicitly program everything. It has become increasingly popular in recent years due to advances in processing power and the availability of large datasets. It is built on artificial neural networks (ANNs), also known as deep neural networks (DNNs). These neural networks are inspired by the structure and function of the human brain's biological neurons, and they are designed to learn from large amounts of data.
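
As a small illustrative sketch (not part of the original answer), the support and confidence of one candidate rule can be computed by hand; the transactions and the rule {bread} -> {butter} are invented.

```python
# Hand-rolled support and confidence for the hypothetical rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread"}, {"butter"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
```
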
16. Write the role of the activation function in a neural network.
The activation function in neural networks takes an input 'x' multiplied by a weight 'w'. Bias allows you to shift the activation function by adding a constant (i.e. the given bias) to the input. Bias in neural networks can be thought of as analogous to the role of a constant in a linear function, whereby the line is effectively transposed by the constant value.

20. Why is big data analysis helpful in increasing business revenue? Explain the steps to be followed to deploy a big data solution.
Improved Accuracy: Big data analytics enables businesses to make decisions based on facts and evidence rather than intuition or guesswork. By analyzing large volumes of data, patterns and trends that may not be apparent at a smaller scale can be identified.
Real-time Insights: In the fast-paced business environment, real-time insights are essential for timely decision making. Big data analytics allows organizations to process and analyze data in real time or near real time, enabling them to respond quickly to emerging trends, market shifts, and customer demands.
Customer Understanding: Understanding customers is vital for tailoring products, services, and marketing strategies. Big data analytics provides a holistic view of customer behavior, preferences, and needs by analyzing multiple data sources, such as online interactions, social media sentiment, purchase history, and demographic information. This knowledge enables businesses to personalize their offerings, deliver targeted marketing campaigns, and enhance the overall customer experience.
Competitive Advantage: In today's competitive landscape, gaining an edge over rivals is crucial. Big data analytics helps companies uncover insights that can differentiate them from competitors. By identifying market trends, consumer preferences, and emerging opportunities, businesses can develop innovative products, optimize pricing strategies, and deliver superior customer service.
Risk Management: Big data analytics plays a vital role in risk management by identifying potential risks and predicting future outcomes. By analyzing historical data and using predictive modelling techniques, businesses can identify potential threats, fraud patterns, and anomalies. This empowers organizations to take proactive measures to mitigate risks, improve security, and safeguard their operations, reputation, and financial well-being.

9. How can the initial number of clusters for the k-means algorithm be estimated? Give an example.
Elbow Method: It is one of the most popular ways to find the optimal number of clusters. This method uses the WCSS (Within-Cluster Sum of Squares) value, which defines the total variation within a cluster.
Gap Statistic Method: It compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from a random uniform distribution of points.
Silhouette Approach: It measures the quality of a clustering, i.e., it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
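
A hedged scikit-learn sketch (not in the original notes) of the elbow and silhouette ideas described above; the blob dataset and the range of k are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                        # within-cluster sum of squares (elbow method)
    sil = silhouette_score(X, km.labels_)     # average silhouette width
    print(f"k={k}  WCSS={wcss:10.1f}  silhouette={sil:.3f}")
# Pick k at the 'elbow' of the WCSS curve, or the k with the largest average silhouette.
```
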
10. Describe data cleansing and what are the best ways to practice data cleansing?
Data cleansing, also called data scrubbing, is the first step of data preparation. It is the process of finding out and correcting or removing incorrect, incomplete, inaccurate, or irrelevant data in the dataset. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. The best practices for data cleaning are:
- Develop a data quality strategy.
- Correct data at the point of entry.
- Validate the accuracy of data.
- Manage and remove duplicate data.
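
For illustration only (not from the original notes), a few common cleansing steps can be expressed with pandas; the raw table, the valid age range, and the fill value are made-up examples.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value, and an invalid age.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age":  [34, 29, 29, 41, -1],
    "city": ["Pune", "Delhi", "Delhi", np.nan, "Mumbai"],
})

df = df.drop_duplicates()                    # remove duplicate rows
df = df[df["age"].between(0, 120)]           # drop rows that fail a validity check
df["city"] = df["city"].fillna("Unknown")    # fill missing values
print(df)
```
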
11. Define the complexity theory of MapReduce. What is the reduce side of MapReduce?
In the context of MapReduce, complexity theory refers to the analysis of the computational complexity of algorithms and problems when using the MapReduce programming model. Time complexity: Map phase: if n is the number of input records and m is the size of the input data, the time complexity of the Map phase is typically O(n + m). Shuffle and sort: the time complexity depends on the efficiency of the shuffling mechanism and the size of the data being shuffled. Reduce phase: if r is the number of reducer nodes and k is the number of unique keys, the time complexity of the Reduce phase is often O(r + k).
The Reducer of MapReduce consists of mainly three phases:
Shuffle: Shuffling carries data from the Mapper to the required Reducer. With the help of HTTP, the framework fetches the applicable partition of the output of all Mappers.
Sort: In this phase, the output of the mapper, i.e. the key-value pairs, is sorted on the basis of the key value.
Reduce: Once shuffling and sorting are done, the Reducer combines the obtained results and performs the computation operation as per the requirement. The OutputCollector.collect() method is used for writing the output to HDFS. Note that the output of the Reducer is not sorted.
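
A toy, single-process sketch (not part of the original answer) of the map, shuffle/sort, and reduce phases using word count; a real Hadoop job would distribute these phases across nodes.

```python
from collections import defaultdict

documents = ["big data needs big storage", "map reduce processes big data"]

# Map phase: emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group the emitted values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce phase: combine the values of each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)
```
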
12. Challenges of a Conventional System.
Fundamental challenges:
- Storage
- Processing
- Security
- Finding and fixing data quality issues
- Evaluating and selecting big data technologies
- Data validation
- Scaling big data systems

13. What are the different hierarchical methods for cluster analysis?
Hierarchical clustering refers to an unsupervised learning procedure that determines successive clusters based on previously defined clusters. It works by grouping data into a tree of clusters. Hierarchical clustering starts by treating each data point as an individual cluster. There are two methods of hierarchical clustering:
- Agglomerative Clustering: Agglomerative clustering is a bottom-up approach. It starts by treating the individual data points as single clusters, which are then merged continuously based on similarity until one big cluster containing all objects is formed. It is good at identifying small clusters.
- Divisive Clustering: Divisive clustering works just the opposite of agglomerative clustering. It starts by considering all the data points as one big cluster and then splits them into smaller heterogeneous clusters continuously until every data point is in its own cluster. Thus, it is good at identifying large clusters. It follows a top-down approach and is more efficient than agglomerative clustering, but due to its implementation complexity it does not have a predefined implementation in any of the major machine learning frameworks.
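
As an illustrative sketch (not from the original notes), agglomerative (bottom-up) clustering can be run with SciPy; the blob data, Ward linkage, and the choice of three clusters are arbitrary.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, centers=3, random_state=0)

Z = linkage(X, method="ward")                      # bottom-up merge tree
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree if matplotlib is available.
```
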
14. State the Quadratic Discriminant Analysis.
Quadratic discrimination is the general form of Bayesian discrimination. Discriminant analysis is used to determine which variables discriminate between two or more naturally occurring groups. The difference from LDA is that QDA relaxes the assumption that the mean and covariance of all the classes are equal. Working: QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of observations. QDA is particularly useful if there is prior knowledge that individual classes exhibit distinct covariances. QDA assumes that the observations of each class are drawn from a normal distribution (similar to linear discriminant analysis), but that each class has its own covariance matrix (different from linear discriminant analysis). Compared with linear discriminant analysis and logistic regression, QDA is more flexible: when the decision boundaries are linear, LDA and logistic regression perform well, whereas QDA can also capture non-linear (quadratic) boundaries.

15. What is the role of a statistical model in Data Analytics?
Statistical modelling is the process of applying statistical analysis to a dataset. A statistical model is a mathematical representation (or mathematical model) of observed data. When data analysts apply various statistical models to the data they are investigating, they are able to understand and interpret the information more strategically. Rather than sifting through the raw data, this practice allows them to identify relationships between variables, make predictions about future sets of data, and visualize that data so that non-analysts and stakeholders can consume and leverage it.

16. Explain how Hadoop is related to Big Data. What are the features of Hadoop?
Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. HDFS: Hadoop comes with a distributed file system called the Hadoop Distributed File System (HDFS), which was designed for Big Data processing.
- It attempts to enable storage of large files by distributing the data among a pool of data nodes.
- It holds a very large amount of data and provides easier access.
- It is highly fault tolerant and designed using low-cost hardware.

14. Can you express multiple one-way repeated measures ANOVA on a two-way design?
- A one-way ANOVA is primarily designed to enable equality testing between three or more means. A two-way ANOVA is designed to assess the interrelationship of two independent variables on a dependent variable.
- A one-way ANOVA only involves one factor or independent variable, whereas there are two independent variables in a two-way ANOVA.
- In a one-way ANOVA, the one factor or independent variable analysed has three or more categorical groups. A two-way ANOVA instead compares multiple groups of two factors.
- A one-way ANOVA needs to satisfy only two principles of the design of experiments, i.e., replication and randomization, as opposed to a two-way ANOVA, which meets all three principles of the design of experiments: replication, randomization, and local control.

15. Group the main characteristics of big data. Why do you need to take big data?
Five Vs of Big Data:
Volume - The name Big Data itself is related to an enormous size. Big Data is a vast 'volume' of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Variety - Big Data can be structured, unstructured, and semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in many forms, such as PDFs, emails, audio, photos, videos, etc.
Veracity - Veracity means how reliable the data is. There are many ways to filter or translate the data; veracity is about being able to handle and manage data of uncertain quality efficiently.
Value - Value is an essential characteristic of big data. It refers to the valuable and reliable data that is stored, processed, and analysed.
Velocity - Velocity refers to the speed at which data is created in real time. It covers the speed and rate of change of incoming data sets. The primary aspect of Big Data is to provide demanding data rapidly.
