
A quality product by

Brainheaters™ LLC

Brainheaters Notes
Applied Data Science (ADS)
COMPUTER Semester-8

‘C’ SCHEME - 2022-2023


© 2016-23 | Proudly Powered by www.brainheaters.in

BH.Index
(Learn as per the Priority to prepare smartly)

Sr No  Chapter/Module Name                  Priority  Pg no
1      Introduction to Data Science         4         02
2      Data Exploration                     1         13
3      Methodology and Data Visualization   2         38
4      Anomaly Detection                    2         62
5      Time Series Forecasting              3         72
6      Applications of Data Science         3         89

MODULE-1

Q1. Write a short note on Data Science. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: Data Science is a multidisciplinary field that involves the use of
statistical, computational, and machine learning techniques to extract
insights and knowledge from complex data sets.
● It encompasses a range of topics including data analysis, data
mining, machine learning, and artificial intelligence.
● Data Science involves a structured approach to analyzing and
interpreting large and complex data sets.
● This includes identifying patterns, making predictions, and
developing models that can be used to gain insights and drive
business decisions.
● Data Scientists use a variety of tools and techniques, such as
statistical programming languages like R and Python, to work with
data sets and extract meaningful information.
● The applications of Data Science are widespread and can be found
in various industries, such as healthcare, finance, marketing, and
transportation. It plays a crucial role in providing businesses with
valuable insights into customer behavior, market trends, and
operational efficiencies.
● To become a Data Scientist, one typically needs to have a strong
background in mathematics, statistics, and computer science.
● Many universities now offer Data Science programs, which provide
students with the necessary skills to work in this field.
● The demand for Data Scientists is growing rapidly, and it is
expected to continue to increase in the future.

Q2. Explain in detail the Data Science Process. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: Data Science process is a structured and iterative approach to solving
complex problems using data.
● It involves a set of steps that help to extract insights and knowledge
from data to inform business decisions. The following are the steps
involved in the Data Science process:
1. Problem Formulation:
In this step, the problem is identified, and the business
objective is defined. The goal is to determine what questions
need to be answered using data and to ensure that they
align with the business goals.
2. Data Collection:
The next step is to collect relevant data. This can involve
various sources, such as internal databases, public data
repositories, or web scraping. The data must be accurate,
complete, and representative of the problem.
3. Data Cleaning and Preparation:
In this step, the collected data is cleaned, pre-processed,
and transformed to ensure that it is consistent and ready for
analysis. This involves tasks such as removing duplicates,
handling missing values, and encoding categorical
variables.
4. Data Exploration:
The goal of this step is to understand the data and gain
insights. This involves the use of descriptive statistics, data
visualization, and exploratory analysis techniques to identify
patterns, correlations, and trends.

5. Feature Engineering:
This step involves the creation of new features or variables
from the existing data. This is done to improve the
performance of the predictive model or to gain further
insights into the problem.
6. Model Selection and Training:
In this step, a suitable model is selected to solve the
problem. This involves a trade-off between accuracy,
interpretability, and complexity. Once the model is selected,
it is trained on the data to learn the underlying patterns and
relationships.
7. Model Evaluation:
The performance of the model is evaluated using metrics
such as accuracy, precision, recall, and F1-score. The goal is
to determine whether the model is performing well or
whether it needs to be improved.
8. Model Deployment:
Once the model has been evaluated and deemed
acceptable, it can be deployed into production. This involves
integrating the model into the existing business processes or
systems.
9. Model Monitoring and Maintenance:
The final step involves monitoring the performance of the
model in the real-world environment. This involves tracking
the model's performance, detecting any drift, and retraining
the model as necessary.
● In summary, the Data Science process is a systematic and iterative
approach that involves several steps, including problem
formulation, data collection, data cleaning and preparation, data
exploration, feature engineering, model selection and training,
model evaluation, model deployment, and model monitoring and
maintenance.
● The goal is to extract insights and knowledge from data to inform
business decisions.
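
As a rough illustration only, the sketch below strings these steps together in
Python with pandas and scikit-learn. The file name customers.csv and its
columns (including a 0/1 churn label) are hypothetical placeholders, not part
of the syllabus.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# 2. Data collection (hypothetical file)
df = pd.read_csv("customers.csv")

# 3. Data cleaning and preparation
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))   # simple missing-value strategy
df = pd.get_dummies(df, drop_first=True)       # encode categorical variables

# 5./6. Feature selection and model training (assumes a 0/1 'churn' column)
X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 7. Model evaluation
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("F1-score:", f1_score(y_test, pred))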

Q3. Describe Motivation to use Data Science Techniques: Volume,
Dimensions and Complexity. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data Science techniques are used to solve problems involving large,
complex, and high-dimensional data sets that cannot be analyzed using
traditional methods.
● The motivation to use Data Science techniques is driven by three
main factors: volume, dimensions, and complexity.
1. Volume:
The amount of data being generated is growing at an
unprecedented rate. Large volumes of data are being
generated from various sources such as social media, IoT
devices, and sensors. Data Science techniques are used to
manage and analyze these large volumes of data efficiently.
2. Dimensions:
The number of variables or dimensions in a data set can be
very high. For example, in genetics research, the number of
genes being analyzed can be in the millions. Data Science
techniques such as dimensionality reduction, feature
selection, and feature extraction are used to reduce the
number of dimensions in the data and to identify the most
relevant features.
3. Complexity:
Data sets can be complex, containing non-linear
relationships between variables, missing values, and noisy
data. Data Science techniques such as machine learning,
deep learning, and natural language processing are used to
analyze and make predictions from these complex data
sets.
● Data Science techniques are used in various industries, including
healthcare, finance, marketing, and transportation, to name a few.
● For example, in healthcare, Data Science techniques are used to
analyze medical images, identify genetic markers, and develop
personalized medicine.
● In finance, Data Science techniques are used to detect fraud,
forecast market trends, and optimize trading strategies.

Q4. Explain in detail Data Science Tasks and Examples. (P4 - Appeared 1
Time) (5-10 Marks)
Ans: Data Science tasks can be broadly classified into four categories:
descriptive analysis, exploratory analysis, predictive analysis, and
prescriptive analysis.
These tasks involve using statistical, machine learning, and data
visualization techniques to extract insights and knowledge from data. The
following are examples of each type of Data Science task:
1. Descriptive Analysis:
Descriptive analysis involves summarizing and describing the
characteristics of a data set. This includes measures such as mean,
median, mode, standard deviation, and frequency distribution.
Descriptive analysis is used to gain an understanding of the data
and to identify patterns and trends.
Examples of descriptive analysis include:
● Calculating the average age of customers in a retail store.
● Determining the percentage of males and females in a
population.
● Analyzing the distribution of income levels in a city.

2. Exploratory Analysis:
Exploratory analysis involves visualizing and exploring the data to
identify relationships and patterns. This includes techniques such as
scatter plots, histograms, box plots, and heat maps. Exploratory
analysis is used to gain a deeper understanding of the data and to
identify potential insights.
Examples of exploratory analysis include:
● Visualizing the relationship between temperature and
humidity in a climate data set.
● Creating a histogram of the distribution of customer
purchase amounts in an e-commerce store.
● Plotting the distribution of crime rates in different areas of a
city.

3. Predictive Analysis:
Predictive analysis involves using machine learning and
statistical techniques to make predictions based on the
data. This includes techniques such as regression,
classification, clustering, and time series analysis. Predictive
analysis is used to make forecasts and predictions based on
past data.
Examples of predictive analysis include:
● Predicting the sales volume of a product based on past
sales data.
● Forecasting the stock price of a company based on
historical data.
● Identifying potential customer churn based on past
purchase behavior.

4. Prescriptive Analysis:
Prescriptive analysis involves using optimization and simulation
techniques to make recommendations based on the data. This
includes techniques such as linear programming, decision trees,
and Monte Carlo simulation. Prescriptive analysis is used to make
data-driven recommendations for decision-making.
Examples of prescriptive analysis include:
● Optimizing production schedules in a manufacturing plant
to minimize costs.
● Determining the optimal route for delivery trucks to minimize
travel time and distance.
● Recommending the most profitable investment portfolio
based on risk and return.

Q5. Write a short note on Data Preparation. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: Data preparation, also known as data preprocessing, is a crucial step
in the Data Science process.
● It involves cleaning, transforming, and organizing raw data into a
format suitable for analysis. The quality of data preparation has a
significant impact on the accuracy and reliability of the results
obtained from data analysis.
● Data preparation involves several steps, including:
1. Data Cleaning: Data cleaning involves identifying and
correcting errors, inconsistencies, and missing data in the
data set. This includes removing duplicate entries, correcting
typos, and filling in missing data.
2. Data Transformation: Data transformation involves
converting data from one format to another to make it
suitable for analysis. This includes converting categorical
variables into numerical variables, normalizing data to a
common scale, and creating new variables by combining or
splitting existing variables.
3. Data Integration: Data integration involves combining data
from multiple sources to create a single, unified data set.
This includes resolving discrepancies in data format and
merging data based on common variables.
4. Data Reduction: Data reduction involves reducing the size
and complexity of the data set without losing significant
information. This includes techniques such as feature
selection, dimensionality reduction, and data sampling.
● The importance of data preparation cannot be overstated. Poor
quality data preparation can lead to inaccurate analysis and
incorrect conclusions.
● Therefore, it is essential to devote sufficient time and effort to data
preparation to ensure that the data set is suitable for analysis and
that the results obtained are accurate and reliable.
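
A minimal pandas sketch of these preparation steps, using a small made-up
DataFrame (all values and column names are illustrative):

import pandas as pd

# Made-up raw data with a duplicate row, missing values, and a categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 45, 32],
    "income": [40000, 52000, 61000, None, 52000],
    "city": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"],
})

# Data cleaning: remove duplicates, fill missing values
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# Data transformation: encode the categorical variable, normalize to a common scale
df = pd.get_dummies(df, columns=["city"], drop_first=True)
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Data reduction: keep only the columns needed for the analysis
reduced = df[["age", "income_scaled"]]
print(reduced)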

Q6. Write a short note on Modeling. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Modeling is a critical step in the Data Science process that involves
building mathematical or computational models to represent real-world
phenomena.
● The goal of modeling is to gain a better understanding of the data
and to make predictions or decisions based on the model's output.
The following are some key aspects of modeling in Data Science:
1. Model Selection: Choosing the appropriate model for a given
problem is critical to the success of modeling. The model
should be able to capture the essential features of the data
and should be able to make accurate predictions or
decisions.

2. Model Training: Model training involves fitting the model to
the data set. This includes determining the model
parameters that best fit the data and optimizing the model's
performance using techniques such as cross-validation.
3. Model Evaluation: Model evaluation involves assessing the
quality of the model's predictions or decisions. This includes
measuring the model's accuracy, precision, recall, and other
performance metrics.
4. Model Deployment: Model deployment involves integrating
the model into the decision-making process. This includes
implementing the model in a software application,
developing an API for the model, or integrating the model
into a larger system.
● Models can be broadly classified into two categories: supervised
and unsupervised.
● In supervised learning, the model is trained on a labeled data set,
where the output variable is known. The goal of supervised learning
is to predict the output variable for new, unseen data. Examples of
supervised learning include regression and classification.
● In unsupervised learning, the model is trained on an unlabeled data
set, where the output variable is unknown. The goal of unsupervised
learning is to identify patterns and structure in the data. Examples
of unsupervised learning include clustering and dimensionality
reduction.
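
A brief scikit-learn sketch contrasting the two categories on synthetic data
(the dataset and parameter choices are illustrative only):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Supervised learning: labels are known, the goal is prediction
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Unsupervised learning: no labels, the goal is finding structure in the data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(2)])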

Q7. Difference between data science and data analytics. (P4 - Appeared 1 Time) (5-10 Marks)
Ans:

Definition:
● Data Science: A field that encompasses a wide range of techniques,
tools, and methodologies for working with data.
● Data Analytics: A subset of data science that focuses on using
statistical and computational methods to explore, analyze, and extract
insights from data.

Focus:
● Data Science: Extraction of insights and knowledge from complex and
large data sets using advanced statistical and computational methods.
● Data Analytics: Uncovering patterns, trends, and relationships in data
and using that information to make data-driven decisions.

Stages of Analysis:
● Data Science: Data preparation, data modeling, data visualization, and
communication of insights.
● Data Analytics: Exploratory data analysis, statistical analysis, and
predictive modeling.

Applications:
● Data Science: Used in various fields, including business, healthcare,
finance, and more.
● Data Analytics: Mainly used in business settings to optimize processes,
improve customer experience, and increase profitability.

Techniques Used:
● Data Science: Machine learning, deep learning, natural language
processing, computer vision, and more.
● Data Analytics: Statistical analysis, regression analysis, clustering,
data visualization, and more.

Programming Skills:
● Data Science: In-depth knowledge of programming is required.
● Data Analytics: Basic programming skills are necessary.

Use of Machine Learning:
● Data Science: Makes use of machine learning algorithms to get insights.
● Data Analytics: Typically does not make use of machine learning.

MODULE-2

Q1. Write down Types of data. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: There are several types of data that are commonly used in Data
Science and other fields. These types of data include:
1. Nominal Data: Nominal data is categorical data that cannot be
ranked or ordered. Examples include gender, race, and type of car.
2. Ordinal Data: Ordinal data is categorical data that can be ranked or
ordered. Examples include education level (e.g., high school,
college, graduate degree), socioeconomic status (e.g., low, middle,
high), and customer satisfaction ratings (e.g., poor, fair, good,
excellent).
3. Interval Data: Interval data is numerical data that has a consistent
scale and unit of measurement, but does not have a true zero point.
Examples include temperature in Celsius or Fahrenheit and dates
on a calendar.
4. Ratio Data: Ratio data is numerical data that has a consistent scale
and unit of measurement, and does have a true zero point.
Examples include weight, height, and income.
5. Time Series Data: Time series data is a type of data where the
observations are recorded at regular intervals over time. Examples
include stock prices, weather data, and website traffic.
6. Text Data: Text data is unstructured data in the form of text that can
be analyzed using Natural Language Processing techniques.
Examples include social media posts, customer reviews, and news
articles.
7. Spatial Data: Spatial data is data that is associated with
geographic locations. Examples include GPS data, satellite imagery,
and maps.

Understanding the type of data is important because it can influence the
choice of statistical and computational techniques used for analysis.

Q2. Discuss in detail Properties of data. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In Data Science, data is often described in terms of its properties,
which are characteristics that define the data and influence how it can be
analyzed and processed. Here are some of the key properties of data:
1. Scale: Scale refers to the range and distribution of values in the
data. Data can have a small or large scale, depending on the range
of values that it encompasses. For example, a dataset containing
the ages of people in a population might have a scale of 0 to 100
years.
2. Resolution: Resolution refers to the level of detail or granularity in the
data. Data can be high resolution, with a fine level of detail, or low
resolution, with a coarser level of detail. For example, satellite
imagery can have a high resolution, allowing for the identification of
small details on the ground, while weather data might have a lower
resolution, providing broader information about a region.
3. Accuracy: Accuracy refers to the degree to which the data
represents the true or intended values. Accurate data is essential
for making informed decisions and drawing accurate conclusions.
For example, a dataset containing inaccurate or incomplete
customer information could lead to incorrect conclusions about
customer behavior.
4. Completeness: Completeness refers to the extent to which the data
represents the full set of values or observations that are needed.
Incomplete data can result in gaps or biases in the analysis, and
can limit the ability to draw accurate conclusions. For example, a
dataset that contains only a subset of the population may not
accurately represent the true population.
5. Consistency: Consistency refers to the degree to which the data is
uniform and follows a consistent format or structure. Inconsistent
data can make it difficult to analyze and compare data, and can
lead to errors in analysis. For example, a dataset that contains
inconsistent date formats could make it difficult to accurately
analyze time-series data.
6. Relevance: Relevance refers to the extent to which the data is useful
for the intended analysis or application. Relevant data is essential
for making informed decisions and drawing accurate conclusions.
For example, a dataset containing irrelevant variables or data
points could lead to incorrect conclusions about the relationships
between variables.
7. Timeliness: Timeliness refers to the degree to which the data is
current and relevant to the intended analysis or application. Timely
data is essential for making informed decisions and drawing
accurate conclusions. For example, stock price data that is delayed
or outdated could lead to incorrect conclusions about market
trends.

Q3. Explain in detail Descriptive Statistics. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Descriptive statistics refers to the process of analyzing and
summarizing data using various statistical methods.
● The purpose of descriptive statistics is to provide an overview of the
data and to help identify patterns, trends, and relationships that
may be present.
● Here are some of the key methods used in descriptive statistics:

1. Measures of Central Tendency: Measures of central
tendency are statistics that represent the "center" of the
data, or the typical or average value. The three most
common measures of central tendency are the mean
(average), median (middle value), and mode (most
common value).
2. Measures of Variability: Measures of variability are statistics
that describe how spread out or varied the data is. The most
common measures of variability are the range (difference
between the highest and lowest values), variance (average
squared deviation from the mean), and standard deviation
(square root of the variance).
3. Frequency Distributions: Frequency distributions show how
often each value or range of values occurs in the data.
Frequency distributions can be displayed using histograms,
bar charts, or pie charts.
4. Correlation Analysis: Correlation analysis is used to identify
the relationship between two variables. Correlation
coefficients range from -1 to +1, with a value of 0 indicating
no correlation and a value of +1 indicating a perfect positive
correlation.
5. Regression Analysis: Regression analysis is used to model
the relationship between two or more variables. Simple
linear regression models the relationship between two
variables, while multiple regression models the relationship
between multiple variables.
6. Percentiles and Quartiles: Percentiles and quartiles are used
to divide the data into equal parts based on their rank or
position. The median represents the 50th percentile, while
the quartiles divide the data into quarters (25th, 50th, and
75th percentiles).
● Descriptive statistics is an important tool for analyzing and
summarizing data in a meaningful way. It is often used to provide a
baseline understanding of the data before more complex analyses
are performed.
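
These statistics can be computed in a few lines with pandas; the numbers
below are made up purely for illustration:

import pandas as pd

scores = pd.Series([45, 52, 52, 61, 68, 70, 75, 75, 75, 90])

print("mean:", scores.mean())                  # central tendency
print("median:", scores.median())
print("mode:", scores.mode().tolist())
print("range:", scores.max() - scores.min())   # variability
print("variance:", scores.var())
print("std dev:", scores.std())
print("quartiles:\n", scores.quantile([0.25, 0.5, 0.75]))
print(scores.describe())                       # frequency-style summary in one call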

Q4. Describe in detail Univariate Exploration. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Univariate exploration is a data analysis technique that focuses on
examining a single variable at a time.
● The purpose of univariate exploration is to gain an understanding of
the distribution and characteristics of a single variable, which can
help in identifying any patterns or outliers in the data.
● Here are some of the key methods used in univariate exploration:
1. Histograms: Histograms are used to visualize the distribution
of a single variable. A histogram is a graph that displays the
frequency of data within different intervals or bins. The
height of each bar represents the number of data points
within that interval.
2. Box plots: Box plots are used to visualize the distribution of a
single variable by displaying the median, quartiles, and
outliers. A box plot consists of a box that spans the
interquartile range (IQR) and whiskers that extend to the
highest and lowest values within 1.5 times the IQR.
3. Density plots: Density plots are used to visualize the
probability density function of a single variable. A density
plot is a smoothed version of a histogram that represents
the distribution of the variable as a continuous curve.

4. Bar charts: Bar charts are used to visualize the distribution of
categorical variables. A bar chart displays the frequency or
proportion of each category as a bar.
5. Summary statistics: Summary statistics such as mean,
median, mode, variance, and standard deviation can be
used to describe the central tendency and variability of a
single variable.
6. Skewness and Kurtosis: Skewness and Kurtosis are used to
measure the shape of the distribution of a variable.
Skewness measures the asymmetry of the distribution, while
kurtosis measures the degree of peakedness or flatness of
the distribution.
● Univariate exploration is an important step in data analysis as it
provides a detailed understanding of a single variable and helps in
identifying any patterns or outliers in the data.
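
A minimal matplotlib sketch of two common univariate plots, using a
synthetic variable:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)   # synthetic single variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)        # histogram: frequency of data within bins
ax1.set_title("Histogram")
ax2.boxplot(values)              # box plot: median, quartiles, and outliers
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()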

Q5. Explain in detail the Measure of Central Tendency. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Measures of central tendency are statistics that describe the "center"
or typical value of a dataset. There are three common measures of central
tendency: the mean, median, and mode.
1. Mean: The mean is calculated by adding up all the values in a
dataset and then dividing by the number of values. It is the most
commonly used measure of central tendency. However, it can be
sensitive to outliers, or extreme values, which can skew the mean.
2. Median: The median is the middle value of a dataset when the
values are arranged in order. If there is an even number of values,
the median is the average of the two middle values. The median is
less sensitive to outliers than the mean.

3. Mode: The mode is the value that occurs most frequently in a
dataset. It can be useful in identifying the most common value in a
dataset, but it may not be a good measure of central tendency if
there are multiple modes or if the dataset is continuous.
Each measure of central tendency has its advantages and disadvantages,
and the choice of which to use depends on the nature of the data and the
research question. For example, if the data is normally distributed, the
mean may be the most appropriate measure of central tendency.
However, if the data is skewed or contains outliers, the median may be a
better choice. Similarly, the mode may be useful for categorical data, but
not for continuous data.
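
A tiny example with Python's statistics module, showing how a single extreme
value pulls the mean but barely moves the median (the values are made up):

import statistics

data = [20, 22, 22, 24, 25, 27]
print(statistics.mean(data), statistics.median(data), statistics.mode(data))
# 23.33..., 23.0, 22

data_with_outlier = data + [200]               # one extreme value
print(statistics.mean(data_with_outlier))      # mean jumps to about 48.6
print(statistics.median(data_with_outlier))    # median barely moves (24)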

Q6. Write down Measures of Spread, Symmetry. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Measures of spread and symmetry are important descriptive statistics
that help to characterize the distribution of a dataset. Here are some of the
most common measures of spread and symmetry:
● Measures of spread:
1. Range: The range is the difference between the largest and
smallest values in a dataset.
2. Interquartile range (IQR): The IQR is the range of the middle
50% of the dataset, calculated by subtracting the 25th
percentile from the 75th percentile.
3. Variance: The variance is the average of the squared
differences from the mean. It measures how much the data
varies from the mean.
4. Standard deviation: The standard deviation is the square
root of the variance. It measures the spread of the data
around the mean.

● Measures of symmetry:
1. Skewness: Skewness is a measure of the asymmetry of the
distribution. A positive skew indicates that the tail of the
distribution is longer on the positive side, while a negative
skew indicates that the tail is longer on the negative side. A
skewness value of zero indicates that the distribution is
perfectly symmetrical.
2. Kurtosis: Kurtosis is a measure of the "peakedness" of the
distribution. A high kurtosis value indicates that the
distribution has a sharp peak and heavy tails, while a low
kurtosis value indicates a flat or rounded distribution.
● These measures are important because they provide valuable
information about the shape and variability of a dataset.
Understanding the spread and symmetry of a dataset can help in
making informed decisions about the data and in identifying any
patterns or outliers that may be present.
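
These measures can be computed directly with pandas; the sample below is
made up, with one large value (40) included to produce a visible right skew:

import pandas as pd

x = pd.Series([12, 15, 15, 18, 21, 22, 25, 40])

print("range:", x.max() - x.min())
print("IQR:", x.quantile(0.75) - x.quantile(0.25))
print("variance:", x.var())        # sample variance
print("std dev:", x.std())
print("skewness:", x.skew())       # > 0 here: longer right tail (the value 40)
print("kurtosis:", x.kurt())       # excess kurtosis relative to a normal curve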

Q7. Write a short note on Skewness. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Skewness is a measure of the asymmetry of a probability distribution.
It describes the extent to which a distribution deviates from symmetry
around its mean.
● A distribution can be skewed to the left (negative skewness) or
skewed to the right (positive skewness).
● If a distribution is skewed to the left, the tail of the distribution is
longer on the left-hand side, and the mean is less than the median.
● This is because there are more extreme values on the left side of the
distribution. Conversely, if a distribution is skewed to the right, the
tail of the distribution is longer on the right-hand side, and the
mean is greater than the median.

● This is because there are more extreme values on the right side of
the distribution.
● Skewness can be quantified using a number of different measures,
such as Pearson's moment coefficient of skewness, the Bowley
skewness, or the quartile skewness coefficient.
● Skewness is an important measure of distributional shape, as it
provides insight into the direction and degree of deviation from
symmetry.
● Skewed distributions can have important implications in statistical
analysis, as they can affect the interpretation of statistical tests,
such as the t-test or ANOVA.

Q8. Discuss in detail Karl Pearson Coefficient of skewness. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Karl Pearson Coefficient of skewness is a measure of the skewness
of a distribution.
● It is defined as the ratio of the difference between the mean and the
mode of a distribution, to the standard deviation of the distribution.
● This measure was developed by Karl Pearson, a British
mathematician and statistician.
● The formula for the Karl Pearson Coefficient of skewness (first
coefficient, based on the mode) is:
Skewness = (Mean - Mode) / Standard deviation
● When the mode is not well defined, Pearson's second coefficient,
based on the median, is used instead:
Skewness = 3 * (Mean - Median) / Standard deviation
Where,
Mean = arithmetic mean of the dataset
Mode = most frequent value in the dataset
Median = median of the dataset
Standard deviation = standard deviation of the dataset
● The Karl Pearson Coefficient of skewness is a dimensionless
measure of skewness, meaning that it has no units.
● The measure is always zero for a perfectly symmetrical distribution.

● Positive values of skewness indicate that the tail of the distribution is
longer on the right-hand side, while negative values of skewness
indicate that the tail is longer on the left-hand side.
● One of the main advantages of the Karl Pearson Coefficient of
skewness is that it is easy to calculate and interpret.
● However, it can be sensitive to outliers in the data, and it may not be
appropriate for distributions that are heavily skewed.
● Overall, the Karl Pearson Coefficient of skewness is a useful measure
of the skewness of a distribution, and it can provide valuable insight
into the shape and characteristics of the data.
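
A quick numerical sketch of both Pearson coefficients on a small made-up
sample (the mode is hard-coded for clarity):

import numpy as np

data = np.array([2, 3, 3, 4, 5, 6, 9, 12])
mean, median, std = data.mean(), np.median(data), data.std(ddof=1)
mode = 3   # most frequent value in this sample (hard-coded for clarity)

first_coeff = (mean - mode) / std          # Pearson's first coefficient (mode-based)
second_coeff = 3 * (mean - median) / std   # Pearson's second coefficient (median-based)
print(round(first_coeff, 3), round(second_coeff, 3))   # both positive: right-skewed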

Q9. Explain in detail Bowley's Coefficient. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Bowley's coefficient of skewness, also known as quartile skewness
coefficient, is a measure of skewness of a distribution.
● It is based on the difference between the upper and lower quartiles
of a dataset. The measure was developed by Arthur Bowley, an
English statistician.
● The formula for Bowley's coefficient of skewness is:
Skewness = (Q3 + Q1 - 2 * Q2) / (Q3 - Q1)
Where,
Q1 = first quartile of the dataset
Q2 = second quartile or median of the dataset
Q3 = third quartile of the dataset
● The coefficient of skewness can take values ranging from -1 to +1. A
value of zero indicates that the distribution is perfectly symmetrical,
while negative and positive values indicate that the distribution is
skewed to the left or right, respectively.

● Overall, Bowley's coefficient of skewness is a useful measure of the
skewness of a distribution, and it can provide valuable insight into
the shape and characteristics of the data.
● One of the main advantages of Bowley's coefficient of skewness is
that it is less sensitive to extreme values or outliers than other
measures of skewness, such as the Karl Pearson coefficient. This is
because it is based on quartiles, which are less affected by extreme
values than the mean and standard deviation.
● However, one limitation of Bowley's coefficient of skewness is that it
is only based on three quartiles of the dataset, and may not be as
accurate as other measures of skewness for distributions that are
heavily skewed.
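
A short sketch computing Bowley's coefficient from the quartiles of a
made-up sample:

import numpy as np

data = np.array([5, 7, 8, 9, 10, 12, 15, 18, 30])
q1, q2, q3 = np.percentile(data, [25, 50, 75])

bowley = (q3 + q1 - 2 * q2) / (q3 - q1)
print(q1, q2, q3, round(bowley, 3))   # positive value: right-skewed (long tail toward 30)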

Q10. Discuss in detail Kurtosis in Multivariate Exploration. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Kurtosis is a statistical measure that describes the shape of a
distribution. It is a measure of the "peakedness" or "flatness" of a distribution
compared to a normal distribution.
● Multivariate exploration refers to the analysis of more than one
variable at a time. In this context, kurtosis can be used to analyze
the relationship between multiple variables in a dataset.
● The most commonly used measure of kurtosis is the Pearson's
coefficient of kurtosis, which is calculated by dividing the fourth
moment by the square of the variance.
● The formula for Pearson's coefficient of kurtosis is:
Kurtosis = (M4 / S^4) - 3
Where,
M4 = fourth central moment of the dataset
S = standard deviation of the dataset, so S^4 is the square of the variance
(Subtracting 3 makes the kurtosis of a normal distribution equal to zero.)

● The value of kurtosis can be positive or negative. A positive value
indicates that the distribution is more peaked than a normal
distribution, while a negative value indicates that the distribution is
flatter than a normal distribution.
● A value of zero indicates that the distribution is similar in shape to a
normal distribution.
● In multivariate exploration, kurtosis can be used to analyze the
relationship between multiple variables in a dataset.
● For example, if two variables have similar levels of kurtosis, it may
indicate that they are related or have a similar distribution.
● On the other hand, if two variables have different levels of kurtosis, it
may indicate that they are not related or have different
distributions.
● One limitation of kurtosis in multivariate exploration is that it only
measures the shape of a distribution, and does not take into
account other factors such as the location and spread of the data.
● Therefore, it is important to use kurtosis in combination with other
measures such as central tendency and dispersion when analyzing
relationships between multiple variables.
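
For reference, scipy.stats.kurtosis returns exactly this excess kurtosis by
default (Fisher's definition); the comparison below uses synthetic data:

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
normal_like = rng.normal(size=10_000)              # excess kurtosis near 0
heavy_tailed = rng.standard_t(df=6, size=10_000)   # heavier tails, positive excess kurtosis

print("normal-like:", round(kurtosis(normal_like), 2))    # Fisher definition: (M4 / S^4) - 3
print("heavy-tailed:", round(kurtosis(heavy_tailed), 2))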

Q11. Explain in detail Central Data Point. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: Central data point is a term used to describe a specific value that
represents the central tendency or central location of a dataset.
● In statistics, measures of central tendency are used to describe the
typical or most common value in a dataset.
● There are three commonly used measures of central tendency:
mean, median, and mode.
● The mean is calculated by adding up all the values in a dataset and
dividing by the number of observations. The mean is a sensitive
measure of central tendency and can be affected by outliers or
extreme values in the dataset.
● The median is the middle value in a dataset when the observations
are arranged in order from smallest to largest. The median is less
sensitive to extreme values and is a more robust measure of central
tendency than the mean.
● The mode is the value that appears most frequently in a dataset.
The mode is useful when there is a high frequency of one or a few
specific values in the dataset.
● The choice of central data point depends on the characteristics of
the dataset and the research question being addressed.
● In some cases, the mean may be more appropriate, while in other
cases the median or mode may be more appropriate.
● For example, if the dataset contains extreme values or outliers, the
median may be a better measure of central tendency than the
mean.

Q12. Write a short note on Correlation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Correlation is a statistical measure that describes the relationship
between two variables.
● It is used to determine whether there is a statistical association
between the two variables, and if so, the strength and direction of
the association. Correlation can be expressed as a numerical value
known as the correlation coefficient.
● The correlation coefficient ranges from -1 to +1.
● A value of -1 indicates a perfect negative correlation, meaning that
as one variable increases, the other variable decreases in a linear
fashion.

● A value of +1 indicates a perfect positive correlation, meaning that
as one variable increases, the other variable increases in a linear
fashion. A value of 0 indicates no correlation between the two
variables.
● Correlation is commonly used in research to investigate the
relationship between variables.
● For example, in a medical study, correlation may be used to
determine whether there is a relationship between a specific
treatment and patient outcomes.
● In finance, correlation may be used to determine whether there is a
relationship between the performance of two stocks or investments.
● It is important to note that correlation does not imply causation.
Just because two variables are correlated does not mean that one
causes the other.
● It is also important to consider other factors that may influence the
relationship between the variables.
● Correlation is a useful statistical tool for investigating the
relationship between two variables.
● It can help researchers identify patterns and trends in the data and
make predictions about future outcomes.
● However, it is important to interpret correlation in the context of the
research question being addressed and consider other factors that
may influence the relationship between the variables.

Q13. Different forms of correlation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: There are different forms of correlation that can be used to measure
the strength and direction of the relationship between two variables. Here
are some of the most common forms of correlation:
1. Pearson correlation: This is the most commonly used form of
correlation and measures the linear relationship between two
continuous variables. It ranges from -1 to +1, with 0 indicating no
correlation and values closer to -1 or +1 indicating a stronger
correlation.
2. Spearman correlation: This form of correlation measures the
relationship between two variables based on their rank order. It is
used when the variables are ordinal or when the relationship is
non-linear.
3. Kendall correlation: This form of correlation is also based on rank
order and measures the strength and direction of the relationship
between two variables. It is often used when the data is
non-parametric or when the variables are not normally distributed.
4. Point-biserial correlation: This form of correlation measures the
relationship between a continuous variable and a dichotomous
variable. It is used when one variable is continuous and the other
variable is binary.
5. Biserial correlation: This form of correlation measures the
relationship between two dichotomous variables. It is used when
both variables are binary.
6. Phi coefficient: This form of correlation is used when both variables
are dichotomous and is similar to the biserial correlation.
The choice of correlation method depends on the type of data being
analyzed and the research question being addressed. It is important to
select the appropriate method to accurately measure the relationship
between two variables.

Q14. Describe Karl Pearson Correlation Coefficient for bivariate
distribution. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Karl Pearson correlation coefficient is a statistical measure used to
quantify the strength and direction of the linear relationship between two
continuous variables.

● It is denoted by the symbol "r" and ranges between -1 and +1. A
value of +1 indicates a perfect positive correlation, while a value of
-1 indicates a perfect negative correlation.
● A value of 0 indicates no correlation between the two variables.
● The formula to calculate Pearson correlation coefficient is as
follows:
r = (Σ((x - x̄)(y - ȳ)))/√((Σ(x - x̄)²)(Σ(y - ȳ)²))
where x and y are the two variables, x̄ and ȳ are their respective
means, and Σ represents the sum of the values.
● To interpret the correlation coefficient, we can use the following
guidelines:
1. A value of r between -0.7 and -1 or between 0.7 and 1
indicates a strong correlation.
2. A value of r between -0.5 and -0.7 or between 0.5 and 0.7
indicates a moderate correlation.
3. A value of r between -0.3 and -0.5 or between 0.3 and 0.5
indicates a weak correlation.
4. A value of r between -0.3 and 0.3 indicates little or no correlation.
● It is important to note that correlation does not imply causation. A
strong correlation between two variables does not necessarily
mean that one variable causes the other. Other factors may be
responsible for the observed relationship.
● Pearson correlation coefficient is commonly used in research and
data analysis to investigate the relationship between two variables.
● It is a useful tool for identifying patterns and trends in the data and
making predictions about future outcomes.
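
The coefficient can be computed from the formula directly or with numpy;
the small made-up arrays below show that the two approaches agree:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7], dtype=float)

# Manual application of the formula above
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
r_numpy = np.corrcoef(x, y)[0, 1]        # numpy's built-in Pearson correlation
print(round(r_manual, 3), round(r_numpy, 3))   # the two values agree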

Q15. Give an overview of the various forms of distributions: Normal and
Poisson. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: There are various types of probability distributions, each with its own
unique characteristics and properties.
Two common types of distributions are the normal distribution and the
Poisson distribution.
1. Normal Distribution:
● The normal distribution, also known as the Gaussian
distribution, is a continuous probability distribution that is
symmetric and bell-shaped.
● It is often used to model naturally occurring phenomena
such as heights, weights, and IQ scores.
● The normal distribution is characterized by two parameters,
the mean (μ) and the standard deviation (σ).
● The mean represents the central tendency of the
distribution, while the standard deviation represents the
spread or variability of the distribution.
● The area under the normal curve is equal to 1, and about
68% of the data falls within one standard deviation of the
mean, about 95% falls within two standard deviations, and
about 99.7% falls within three standard deviations.

2. Poisson Distribution:
● The Poisson distribution is a discrete probability distribution
that is used to model the number of times an event occurs in
a fixed interval of time or space, given that the events occur
independently and at a constant rate.
● It is named after French mathematician Siméon Denis
Poisson. The Poisson distribution is characterized by a single
parameter, λ (lambda), which represents the average rate of
occurrence of the event.
● The probability of observing a certain number of events in a
fixed interval of time or space can be calculated using the
Poisson probability mass function.
● The Poisson distribution is commonly used in fields such as
biology, physics, and telecommunications.
Both the normal distribution and the Poisson distribution are important
tools in statistics and data analysis. They allow us to model and analyze
data in a meaningful way, and make predictions about future outcomes
based on past observations.
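
Both distributions are available in scipy.stats; the sketch below checks the
quantities described above, with illustrative parameter values (mu = 170,
sigma = 10, lambda = 3 are assumptions for the example):

from scipy.stats import norm, poisson

# Normal distribution: mean mu = 170, standard deviation sigma = 10 (e.g. heights in cm)
mu, sigma = 170, 10
print(norm.cdf(180, mu, sigma) - norm.cdf(160, mu, sigma))   # ~0.68 within 1 sigma
print(norm.cdf(190, mu, sigma) - norm.cdf(150, mu, sigma))   # ~0.95 within 2 sigma

# Poisson distribution: average rate lambda = 3 events per interval
lam = 3
print(poisson.pmf(0, lam))        # probability of observing no events
print(poisson.pmf(5, lam))        # probability of observing exactly 5 events
print(1 - poisson.cdf(5, lam))    # probability of observing more than 5 events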

Q16. Describe in detail Hypothesis Testing. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, a hypothesis is a proposed explanation or prediction for a
phenomenon or set of observations.
● Hypothesis testing is a statistical technique that allows us to test the
validity of a hypothesis by comparing it to an alternative hypothesis
using a set of statistical tools and methods.
● The goal of hypothesis testing is to determine whether there is
enough evidence to support or reject the null hypothesis.
● The hypothesis testing process typically involves the following steps:
1. Formulate the null and alternative hypotheses: The null
hypothesis is the default hypothesis that there is no
significant difference or effect between two populations or
samples, while the alternative hypothesis is the opposite
hypothesis that there is a significant difference or effect.
2. Choose a significance level: The significance level (denoted
as α) is the probability of rejecting the null hypothesis when it
is actually true. The most common significance level is 0.05
or 5%.
3. Collect data and calculate test statistic: Collect the data
and calculate a test statistic, which is a numerical value that
measures the difference between the observed data and
the expected values under the null hypothesis.
4. Determine the p-value: The p-value is the probability of
observing the test statistic or a more extreme value if the
null hypothesis is true. It is used to determine the statistical
significance of the test result.
5. Compare the p-value to the significance level: If the p-value
is less than the significance level, we reject the null
hypothesis and accept the alternative hypothesis. If the
p-value is greater than the significance level, we fail to reject
the null hypothesis.
6. Draw conclusions: Based on the results of the hypothesis
test, we can draw conclusions about the relationship
between the variables being tested.
● Hypothesis testing is used in many different fields, including
business, medicine, and social sciences.
● It is a powerful tool for making decisions and drawing conclusions
based on data and statistical analysis.
● However, it is important to carefully formulate the null and
alternative hypotheses, choose an appropriate significance level,
and properly interpret the results of the test to ensure accurate and
meaningful conclusions.
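
A compact sketch of these steps using a one-sample t-test in scipy (the
sample values and the hypothesized mean of 50 are made up):

from scipy import stats

sample = [52, 48, 55, 51, 53, 49, 56, 54, 50, 57]   # made-up observations
hypothesized_mean = 50     # null hypothesis: population mean = 50
alpha = 0.05               # significance level

t_stat, p_value = stats.ttest_1samp(sample, hypothesized_mean)
print("t statistic:", round(t_stat, 3), "p-value:", round(p_value, 3))

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")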

Q17. Write a short note on the Central limit theorem. (P4 - Appeared 1
Time) (5-10 Marks)
Ans: The Central Limit Theorem (CLT) is a key concept in data science that
is used to make statistical inferences about large datasets.
● It states that if a large number of independent and identically
distributed random variables are added or averaged together, the
resulting distribution will be approximately normal, regardless of the
underlying distribution of the individual variables.
● In practical terms, this means that if we have a large enough
sample size, we can use the CLT to make accurate estimates about
the population mean and standard deviation, even if we don't know
the underlying distribution of the data.
● This is important in data science because it allows us to draw
meaningful conclusions from data, even when we have incomplete
information.
● For example, the CLT is often used in hypothesis testing, where we
compare a sample mean to a hypothesized population mean.
● By calculating the standard error of the mean using the CLT, we can
estimate the probability of observing a sample mean as extreme as
the one we have, given the hypothesized population mean.
● This helps us determine whether the difference between the sample
mean and the hypothesized mean is statistically significant.
● Overall, the CLT is a fundamental concept in data science that helps
us make sense of large datasets and draw meaningful conclusions
from them.
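
A quick simulation illustrating the theorem: means of samples drawn from a
clearly skewed (exponential) population are themselves approximately
normally distributed:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Draw 5000 samples of size 50 from an exponential population (mean 1, heavily skewed)
sample_means = np.array([rng.exponential(1.0, size=50).mean() for _ in range(5000)])

print("mean of the sample means:", round(sample_means.mean(), 3))    # close to 1.0
print("std of the sample means:", round(sample_means.std(), 3))      # close to 1/sqrt(50) ≈ 0.141
print("skewness of the sample means:", round(skew(sample_means), 3)) # near 0: roughly bell-shaped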

Q18. Confidence Interval, Z-test, t-test. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Confidence Interval:
● A confidence interval is a range of values that is likely to contain the
true population parameter with a certain level of confidence.
● It is an important tool in statistical inference that is used to estimate
the range of values that the true population parameter could fall
within based on a sample of data.
● The level of confidence is typically expressed as a percentage, such
as 95% or 99%.
● A wider confidence interval implies a lower level of confidence in
the estimate, and a narrower interval implies a higher level of
confidence.
Z-test:
● A Z-test is a hypothesis test that is used to determine whether a
sample mean is significantly different from a hypothesized
population mean, when the population variance is known.
● It involves calculating the Z-score of the sample mean, which is the
number of standard deviations the sample mean is from the
hypothesized mean.
● If the Z-score falls outside a certain range of values, the null
hypothesis is rejected and the sample mean is deemed to be
significantly different from the hypothesized mean.
T-test:
● A T-test is a hypothesis test that is used to determine whether a
sample mean is significantly different from a hypothesized
population mean, when the population variance is unknown.
● The T-test is used instead of the Z-test when the sample size is
small, or when the population variance is unknown.

● The T-test involves calculating the T-score of the sample mean,
which is similar to the Z-score, but takes into account the sample
size and the sample variance.
● If the T-score falls outside a certain range of values, the null
hypothesis is rejected and the sample mean is deemed to be
significantly different from the hypothesized mean.
● In summary, confidence intervals are used to estimate the range of
values that the true population parameter could fall within based
on a sample of data. Z-tests are used when the population variance
is known, and T-tests are used when the population variance is
unknown or when the sample size is small.
● Both tests are used to determine whether a sample mean is
significantly different from a hypothesized population mean.
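
A brief sketch computing a 95% confidence interval, a t-test, and a z-test on
a made-up sample (the known population standard deviation of 3 used for the
z-test is an assumption for the example):

import numpy as np
from scipy import stats

sample = np.array([98, 102, 101, 97, 103, 100, 99, 104, 96, 105], dtype=float)
n, mean = len(sample), sample.mean()

# 95% confidence interval using the t distribution (population variance unknown)
ci = stats.t.interval(0.95, n - 1, loc=mean, scale=stats.sem(sample))
print("95% CI:", tuple(round(v, 2) for v in ci))

# t-test against a hypothesized population mean of 100
t_stat, p_t = stats.ttest_1samp(sample, 100)
print("t-test:", round(t_stat, 3), round(p_t, 3))

# z-test, assuming the population standard deviation is known to be 3
z = (mean - 100) / (3 / np.sqrt(n))
p_z = 2 * (1 - stats.norm.cdf(abs(z)))
print("z-test:", round(z, 3), round(p_z, 3))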

Q19. Describe in detail Type-I, Type-II Errors. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In data science, Type-I and Type-II errors have the same definitions as
in statistical hypothesis testing.
● However, in the context of data science, these errors can occur in
different ways and have different consequences.

Type-I Error:
● In data science, a Type-I error occurs when a statistical model or
algorithm incorrectly identifies a pattern or relationship in the data
as being significant, when in fact it is not.
● This is similar to a false positive in statistical hypothesis testing. For
example, suppose a predictive model incorrectly identifies a feature
as being important for predicting a target variable, when in fact it is
not.

● This can lead to inaccurate predictions and misleading insights,
which can have serious consequences in fields such as healthcare
or finance.

Type-II Error
● In data science, a Type-II error occurs when a statistical model or
algorithm fails to identify a significant pattern or relationship in the
data, when in fact it is present. This is similar to a false negative in
statistical hypothesis testing.
● For example, suppose a predictive model fails to identify an
important feature for predicting a target variable, leading to
inaccurate predictions and missed opportunities. This can also
have serious consequences in fields such as healthcare or finance,
where missing an important trend or relationship can have
significant implications.

● It is important to note that in data science, the consequences of
Type-I and Type-II errors can vary depending on the specific
application and the cost of making an incorrect decision.
● For example, in a medical diagnosis application, the cost of a
Type-II error (failing to identify a disease) may be much higher than
the cost of a Type-I error (identifying a disease that is not present),
as the latter can be verified with additional tests, whereas the
former may result in delayed treatment and worse health
outcomes. Therefore, it is important for data scientists to consider
the consequences of both types of errors when designing and
evaluating models and algorithms.

Q20. Write a short note on ANOVA. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: ANOVA, or Analysis of Variance, is a statistical method that can be
used in data science to compare means between three or more groups.
● ANOVA can help to determine whether there are significant
differences in the means of multiple groups and can be useful for
making decisions based on the data.
● In data science, ANOVA can be used in a variety of applications,
such as comparing the performance of different algorithms,
evaluating the effectiveness of different marketing campaigns, or
comparing the results of different A/B tests.
● By comparing the means of multiple groups, ANOVA can provide
insights into which groups are performing better or worse, which
can help to guide decision-making and improve outcomes.
● One important consideration in using ANOVA in data science is the
assumption of equal variances between the groups.
● If the variances are not equal, then the ANOVA results may be
biased and incorrect conclusions may be drawn.
● Therefore, it is important to check the variance assumptions before
applying ANOVA and to use appropriate methods to adjust for
unequal variances if necessary.
● Another consideration in using ANOVA in data science is the
potential for Type-I and Type-II errors.
● Type-I errors occur when a significant difference is detected
between groups, even though there is no true difference.
● Type-II errors occur when no significant difference is detected
between groups, even though there is a true difference.
● These errors can be controlled by setting appropriate significance
levels and sample sizes.

● Overall, ANOVA is a powerful tool in data science for comparing
means between multiple groups and can provide valuable insights
for decision-making.
● However, it is important to carefully consider the assumptions and
potential errors associated with ANOVA in order to obtain accurate
and reliable results.
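
A one-way ANOVA on three made-up groups using scipy:

from scipy import stats

group_a = [23, 25, 28, 30, 27]    # made-up scores for three campaigns
group_b = [31, 33, 29, 35, 32]
group_c = [22, 24, 26, 23, 25]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F statistic:", round(f_stat, 3), "p-value:", round(p_value, 4))

if p_value < 0.05:
    print("At least one group mean differs significantly")
else:
    print("No significant difference between group means")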

MODULE-3

Q1. Write a short note on Methodology. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Methodology refers to the set of principles, techniques, and
procedures that are used to conduct research, analyze data, and draw
conclusions. It is an important aspect of research, as it helps to ensure that
the research is conducted in a rigorous and systematic manner, and that
the results are valid, reliable, and generalizable.
In general, methodology includes several key components, such as:
1. Research design: This involves defining the research question or
problem, selecting the appropriate research design (such as
experimental, observational, or survey research), and designing the
study to ensure that it is feasible and ethical.
2. Sampling: This involves selecting a representative sample from the
population of interest, and ensuring that the sample is sufficiently
large and diverse to provide reliable and valid results.
3. Data collection: This involves selecting the appropriate data
collection methods (such as surveys, interviews, or observations),
collecting the data in a systematic and standardized manner, and
ensuring that the data is accurate and complete.
4. Data analysis: This involves analyzing the data using appropriate
statistical methods, such as descriptive statistics, inferential
statistics, or machine learning algorithms, and interpreting the
results in light of the research question or problem.
5. Conclusions: This involves drawing conclusions from the results, and
making recommendations for future research or practice.

Overall, methodology is essential for ensuring that research is conducted in
a rigorous and systematic manner, and that the results are valid, reliable,
and generalizable. By following sound methodology principles, researchers
can increase the likelihood that their research will be useful and impactful.

Q2. Write an overview of model building. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Model building is the process of creating a mathematical or statistical
representation of a real-world system, process, or phenomenon.
● The model is constructed using available data and knowledge of
the system, and is used to make predictions or test hypotheses
about the system.
● In data science, model building is an important step in the process
of developing predictive models that can be used to make
informed decisions or gain insights into complex data.
● The process of model building typically involves several key steps,
including:
1. Data collection: This involves gathering relevant data from
various sources, such as databases, surveys, or sensors.
2. Data cleaning and preparation: This involves cleaning the
data to remove errors, missing values, or outliers, and
preparing the data for analysis by transforming or
normalizing it as necessary.
3. Feature engineering: This involves selecting and
transforming the variables or features that will be used to
build the model, such as selecting relevant variables,
creating new variables based on existing ones, or scaling
variables to ensure comparability.
4. Model selection: This involves selecting the appropriate
model or algorithm that will be used to build the predictive
model, based on the nature of the data and the research
question or problem.
5. Model training and evaluation: This involves fitting the model
to the data using available algorithms and techniques, and
evaluating the performance of the model using various
metrics and techniques, such as cross-validation or
confusion matrices.
6. Model deployment and maintenance: This involves
deploying the model in a real-world setting and monitoring
its performance over time to ensure that it continues to
provide accurate and reliable predictions.
● Model building is a complex and iterative process that requires
careful attention to data quality, feature selection, and algorithm
selection, as well as ongoing monitoring and maintenance to
ensure that the model remains valid and reliable.
● By following sound model building practices, data scientists can
create predictive models that provide useful insights and inform
decision-making.

Q3. Write a short note on Cross Validation. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: Cross-validation is a statistical technique used to evaluate the
performance of predictive models.
● The basic idea of cross-validation is to split the available data into
two or more subsets, one for training the model and the other for
testing the model's performance.
● This helps to ensure that the model is not overfitting to the training
data and provides an unbiased estimate of the model's
performance on new data.

● The most common type of cross-validation is k-fold
cross-validation, where the data is split into k subsets of roughly
equal size.
● The model is then trained on k-1 of these subsets and tested on the
remaining subset.
● This process is repeated k times, with each subset used exactly
once for testing.
● The results of each test are then averaged to provide an overall
estimate of the model's performance.
● Cross-validation is a useful technique for evaluating the
performance of different predictive models and selecting the best
model for a given problem.
● It can also be used to tune the parameters of a model, such as the
regularization parameter in linear regression or the number of trees
in a random forest, by testing different values of the parameter on
different subsets of the data.
● Cross-validation is a powerful technique for evaluating the
performance of predictive models and ensuring that they
generalize well to new data.
● By using cross-validation, data scientists can make more informed
decisions about which models to use and how to tune their
parameters, resulting in more accurate and reliable predictions.

Q4. Explain in detail K-fold cross validation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: K-fold cross-validation is a technique used to evaluate the
performance of a predictive model.
● It involves dividing the available data into k subsets of roughly
equal size, and then training and testing the model k times, each
time using a different subset as the test set.

● The basic steps involved in k-fold cross-validation are as follows:
1. Split the data into k equally sized subsets.
2. Train the model on k-1 of the subsets.
3. Use the trained model to predict the outcomes of the test
set.
4. Calculate the performance metric (such as accuracy,
precision, recall, or F1 score) for the test set.
5. Repeat steps 2-4 k times, each time using a different subset
as the test set.
6. Average the performance metrics over the k folds to get an
overall estimate of the model's performance.
● The advantage of k-fold cross-validation is that it allows for a more
accurate and reliable estimate of the model's performance than
simply splitting the data into a training set and a test set.
● By repeating the process k times and averaging the results, we can
reduce the variance of the performance estimate and obtain a
more accurate assessment of the model's ability to generalize to
new data.
● The choice of k depends on the size of the dataset and the
complexity of the model.
● In general, larger values of k are more computationally expensive
but provide more accurate estimates of the model's performance,
while smaller values of k are less computationally expensive but
may be more prone to overfitting.
● K-fold cross-validation is widely used in data science and machine
learning, as it provides a robust and reliable way to evaluate the
performance of predictive models and select the best model for a
given problem.
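
As an illustration, here is a minimal k-fold cross-validation sketch in Python using
scikit-learn (the dataset, classifier, and k = 5 are illustrative choices, not part of
the method itself):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # 5-fold CV: train on 4 folds, test on the remaining fold, repeat 5 times
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
    print("fold accuracies:", scores)
    print("mean accuracy:", scores.mean())

Averaging the per-fold accuracies gives the overall performance estimate described in
step 6 above.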

Q5. Explain in detail leave-one-out cross-validation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Leave-one-out cross-validation (LOOCV) is a variation of k-fold
cross-validation where k is equal to the number of samples in the dataset.
● In other words, LOOCV involves training the model on all but one of
the samples in the dataset, and using the remaining sample as the
test set.
● The basic steps involved in LOOCV are as follows:
1. Remove one sample from the dataset and use the
remaining samples to train the model.
2. Use the trained model to predict the outcome of the
removed sample.
3. Calculate the performance metric (such as accuracy,
precision, recall, or F1 score) for the removed sample.
4. Repeat steps 1-3 for each sample in the dataset.
5. Average the performance metrics over all samples to get an
overall estimate of the model's performance.
● The advantage of LOOCV is that it provides the most accurate
estimate of the model's performance possible, as each sample in
the dataset is used as the test set exactly once.
● However, LOOCV can be computationally expensive, especially for
large datasets, as it requires training the model on almost all of the
samples multiple times.
● LOOCV is useful for evaluating the performance of models that have
a small number of samples or that are prone to overfitting.
● By using LOOCV, data scientists can obtain a more accurate
estimate of the model's ability to generalize to new data and select
the best model for a given problem.
● However, LOOCV should be used judiciously, as it can be
computationally expensive and may not be necessary for all
datasets and models.
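
A minimal LOOCV sketch using scikit-learn is shown below; it fits the model once per
sample, so it is only practical for small datasets (the dataset and classifier are
illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # LeaveOneOut creates as many folds as there are samples (150 here)
    loo = LeaveOneOut()
    scores = cross_val_score(model, X, y, cv=loo, scoring="accuracy")
    print("number of fits:", len(scores))
    print("LOOCV accuracy estimate:", scores.mean())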

Q6. Write a short note on Bootstrapping. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Bootstrapping is a resampling technique used in statistics and
machine learning to estimate the uncertainty of a statistical estimator or to
generate new samples from an existing dataset.
● It involves repeatedly sampling from the original dataset with
replacement, creating new datasets of the same size as the original
dataset.
● The basic steps involved in bootstrapping are as follows:
1. Take a sample of size n from the original dataset.
2. Randomly sample from the n observations with
replacement, creating a new dataset of size n.
3. Repeat steps 1-2 a large number of times (typically 1,000 or
more).
4. Calculate the statistics of interest (such as mean, median,
variance, or confidence intervals) for each of the
bootstrapped datasets.
5. Compute the mean and standard error of the statistics of
interest across all the bootstrapped datasets.
● Bootstrapping can be used to estimate the sampling distribution of
a statistic, such as the mean or standard deviation, when the
population distribution is unknown or when the sample size is too
small to assume normality.
● It can also be used to generate new datasets that are similar to the
original dataset, which can be useful for testing different machine
learning models or for data augmentation.
● The advantage of bootstrapping is that it can provide a more
accurate estimate of the variability of a statistic or the distribution
of the data than traditional methods, especially when the
underlying population distribution is unknown or non-normal.

● It can also be applied to a wide range of statistical problems and is
relatively easy to implement. However, bootstrapping can be
computationally intensive and may not be necessary for datasets
with large sample sizes.
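
For example, a bootstrap estimate of the mean and its 95% confidence interval can be
sketched in plain NumPy as follows (the sample values and the 1,000 resamples are
illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    sample = rng.normal(loc=50, scale=10, size=200)   # stand-in for an observed sample

    boot_means = []
    for _ in range(1000):
        # resample with replacement, same size as the original sample
        resample = rng.choice(sample, size=sample.size, replace=True)
        boot_means.append(resample.mean())

    boot_means = np.array(boot_means)
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    print("bootstrap mean:", boot_means.mean())
    print("95% confidence interval:", (lower, upper))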

Q7. Discuss in detail Univariate Visualization. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Univariate visualization is a technique used in data analysis and
visualization to explore and understand a single variable in a dataset.
● It involves plotting and summarizing the distribution of a single
variable, such as a numerical or categorical variable, without taking
into account the relationship between multiple variables.
● There are several techniques for univariate visualization, including:
1. Histogram: A histogram is a graphical representation of the
distribution of a numerical variable. It displays the frequency
or count of observations in each bin or interval of the
variable.
2. Boxplot: A boxplot is a graphical representation of the
distribution of a numerical variable that displays the
median, quartiles, and outliers of the data.
3. Bar chart: A bar chart is a graphical representation of the
distribution of a categorical variable. It displays the
frequency or count of observations in each category or
group of the variable.
4. Pie chart: A pie chart is a graphical representation of the
distribution of a categorical variable that displays the
proportion or percentage of observations in each category
or group of the variable.

5. Density plot: A density plot is a graphical representation of
the distribution of a numerical variable that displays the
density of observations across the range of the variable.
6. Frequency polygon: A frequency polygon is a graphical
representation of the distribution of a numerical variable
that displays the frequency or count of observations as a
line connecting the midpoints of each bin or interval of the
variable.
● Univariate visualization can help data analysts and scientists
identify patterns, outliers, and anomalies in the distribution of a
variable.
● It can also provide insights into the skewness, kurtosis, and central
tendency of the data, and help determine the appropriate
statistical tests and models to use for further analysis.
● However, univariate visualization should be complemented with
bivariate and multivariate visualization techniques to explore the
relationships between variables and to gain a more comprehensive
understanding of the data.

Q8. Describe in detail Histogram, Quartile. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A histogram is a graphical representation of the distribution of a set of
numerical data.
● It is a type of bar chart where the x-axis represents the range of the
data, divided into intervals or bins, and the y-axis represents the
frequency of data points falling within each bin.
● The histogram is a useful tool for visualizing the shape, center, and
spread of the distribution of the data.
● It allows us to see how the data is distributed and how frequently
certain values occur.

● For example, a histogram can show if the data is skewed to one
side, if it has a bell-shaped distribution, or if it has multiple peaks.
● To create a histogram, we first choose the number and size of the
bins. The number of bins should be large enough to capture the
shape of the data but not so large that individual bins have very few
data points.
● The size of the bins determines the width of the bars in the
histogram.
● After selecting the bins, we count the number of data points that fall
within each bin and plot these counts as the height of the bars. The
resulting histogram provides a visual representation of the
distribution of the data.
● The following are some typical histogram shapes, described along with the
characteristics of the mean, median, and mode:
1. Bell-shaped distribution: has a single peak and tapers off both to the
left and to the right of the peak. The shape appears symmetric about the
center of the histogram, and the single peak indicates that the
distribution is unimodal.
2. Approximately uniform distribution: forms a rectangular, flat shape,
with the frequency of each class approximately the same.
3. Left-skewed distribution: has a peak to the right of the distribution
and data values that taper off to the left. It has a single peak and is
also unimodal. For a histogram that is skewed to the left, the mean is
located to the left on the distribution and is the smallest of the
measures of central tendency.
● Quartiles are values that divide a dataset into four equal parts. They
are a measure of central tendency and variability in a dataset.

● Quartiles are a type of percentile. A percentile is a value with a
certain percentage of the data falling below it. In general terms, k%
of the data falls below the kth percentile.
1. The first quartile (Q1, or the lowest quartile) is the 25th
percentile, meaning that 25% of the data falls below the first
quartile.
2. The second quartile (Q2, or the median) is the 50th
percentile, meaning that 50% of the data falls below the
second quartile.
3. The third quartile (Q3, or the upper quartile) is the 75th
percentile, meaning that 75% of the data falls below the third
quartile.
● By splitting the data at the 25th, 50th, and 75th percentiles, the
quartiles divide the data into four equal parts.
● In a sample or dataset, the quartiles divide the data into four
groups with equal numbers of observations.
● In a probability distribution, the quartiles divide the distribution’s
range into four intervals with equal probability.

● Quartiles are often used to describe the spread of data in a dataset,
and they can be used to identify outliers or extreme values.

● They are also used in statistical calculations, such as calculating
the interquartile range (IQR), which is the difference between the
third and first quartiles.
● The IQR is a measure of the spread of the middle 50% of the data
and is often used to identify potential outliers in a dataset.
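
The quartile and IQR calculations can be sketched with NumPy as follows (the data
values are illustrative; the 1.5 × IQR fences follow the usual Tukey rule):

    import numpy as np

    data = np.array([2, 4, 4, 5, 7, 9, 11, 12, 15, 40])   # 40 is a suspiciously large value

    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    outliers = data[(data < lower_fence) | (data > upper_fence)]
    print("Q1, median, Q3:", q1, q2, q3)
    print("IQR:", iqr, "fences:", (lower_fence, upper_fence))
    print("potential outliers:", outliers)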

Q9. Write a short note on the Distribution Chart. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: A distribution chart is a graphical representation of the frequency
distribution of a dataset. It shows how the data is distributed across
different values or ranges. There are different types of distribution charts,
each of which is used for a specific purpose:
1. Histogram: A histogram is a type of distribution chart that shows the
frequency distribution of continuous data. It is used to visualize the
shape, center, and spread of the distribution of the data.
2. Box plot: A box plot is a type of distribution chart that shows the
distribution of continuous data using quartiles. It is used to identify
outliers, skewness, and the spread of the data.
3. Stem and leaf plot: A stem and leaf plot is a type of distribution
chart that shows the distribution of discrete data. It is used to
visualize the frequency distribution of the data.
4. Probability density function: A probability density function is a type
of distribution chart that shows the probability of a continuous
random variable falling within a certain range. It is used to model
and analyze continuous data.
5. Bar chart: A bar chart is a type of distribution chart that shows the
frequency distribution of categorical data. It is used to visualize the
distribution of data across different categories.
Distribution charts are an important tool in data analysis and are used to
gain insights into the distribution and characteristics of a dataset. They are
useful for identifying patterns, trends, and outliers in the data, and can be
used to inform decision-making in various fields, including finance,
healthcare, and marketing.

Q10. Discuss in detail Scatter Plot. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A scatter plot is a type of data visualization that displays the
relationship between two variables in a dataset.
● It is a graph with points that represent individual data points and
show their positions along two axes.
● Here is an example of a scatter plot:

● In this scatter plot, the x-axis represents one variable, and the y-axis
represents the other variable. Each point represents a single data
point and its value for both variables. The pattern of the points on
the plot provides insight into the relationship between the two
variables.

● Scatter plots can be used to analyze relationships between
variables in many fields, including finance, science, and social
science.
● For example, a scatter plot can show the relationship between a
stock's price and its earnings per share, or the relationship between
a person's height and their weight.
● Scatter plots can reveal a number of important characteristics
about the data.
1. The first is the strength of the relationship between the two
variables.
2. Another important characteristic of a scatter plot is the
direction of the relationship between the two variables.
● Scatter plots can also be used to identify outliers, which are data
points that fall far outside the normal range of values.
● Outliers can have a significant impact on the relationship between
the two variables, and can often be identified as points that are far
away from the main cluster of data points.

Q11. Write a short note on Scatter Matrix. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: A scatter matrix is also known as a pair plot or pair scatter plot.
● It is a type of data visualization that allows for the exploration of the
relationships between multiple variables in a dataset.
● It consists of a matrix of scatter plots, with each scatter plot
showing the relationship between two variables.
● Here's an example of a scatter matrix:

● In this scatter matrix, each diagonal plot shows a histogram of a
single variable, while the off-diagonal plots show the scatter plots
of two variables.
● The patterns in the scatter plots can provide insight into the
relationships between the variables.
● Scatter matrices are a useful tool for exploring multivariate
datasets and identifying patterns or relationships between
variables.
● They can help identify variables that are highly correlated or
variables that are not related at all. Scatter matrices can also help
detect outliers and provide a quick overview of the distribution of
each variable in the dataset.
● Scatter matrices can be created using various software tools, such
as Python's Seaborn library or R's GGally package.

● It is important to choose an appropriate size for the scatter matrix
to ensure that all the plots are visible and the patterns are easy to
interpret.
● Scatter matrices are a powerful tool for exploring the relationships
between multiple variables in a dataset. By using scatter matrices
to visualize and analyze data, researchers and analysts can gain
deeper insights and make more informed decisions.
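
A minimal sketch of a scatter matrix with seaborn's pairplot (the iris dataset is
illustrative; seaborn's load_dataset fetches it, so an internet connection may be
needed the first time):

    import seaborn as sns
    import matplotlib.pyplot as plt

    df = sns.load_dataset("iris")                 # four numeric columns plus a class label

    # Off-diagonal cells: pairwise scatter plots; diagonal cells: per-variable histograms
    sns.pairplot(df, hue="species", diag_kind="hist")
    plt.show()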

Q12. Write a short note on the Bubble chart. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: A bubble chart is a type of data visualization that is used to display
three dimensions of data in a two-dimensional space.
● In a bubble chart, data points are represented by bubbles of
varying sizes and colors. The x and y axes represent two variables,
and the size and color of the bubbles represent a third variable.
● Bubble charts are useful for displaying data with multiple
dimensions, as they allow for the visualization of relationships
between three variables.
● They are often used in business and economics to show market
trends, and in social science to visualize relationships between
multiple variables.
● When creating a bubble chart, it is important to choose an
appropriate size and color scheme to ensure that the bubbles are
easy to read and interpret.
● It is also important to label the axes and provide a clear legend to
explain the meaning of the bubble sizes and colors.
● In conclusion, bubble charts are a powerful tool for displaying data
with multiple dimensions. By using bubble charts to visualize and
analyze data, researchers and analysts can gain deeper insights
and make more informed decisions.

Q13. Discuss in detail Density Chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A density chart is a useful data visualization tool that displays the
distribution of a variable in a dataset.
● It provides a continuous estimate of the probability density function
of the data by smoothing out the frequency distribution of a
histogram.
● The area under the curve of a density chart represents the
probability of the variable being in a certain range of values.
● Density charts are particularly useful for displaying continuous
variables, such as age or weight, where there are many potential
values in the data.
● They can help identify patterns in the data, such as whether the
variable is normally distributed or skewed to one side. By comparing
the density charts of different variables, analysts can also identify
relationships between variables, such as correlation or causation.
● There are several ways to create a density chart, including using
statistical software packages such as R or Python.
● One common method for creating a density chart is to use kernel
density estimation (KDE), which is a non-parametric way of
estimating the probability density function of a variable.
● In KDE, a kernel function is used to smooth the frequency distribution
of the variable, resulting in a continuous curve that approximates
the probability density function.
● When creating a density chart, it is important to choose an
appropriate bandwidth for the kernel function.
● A higher bandwidth will result in a smoother curve, but it may also
hide important features in the data, while a lower bandwidth may
result in a more jagged curve that is difficult to interpret.

● It is also important to label the axes and provide a clear legend to
explain the meaning of the density values.
● Density charts are a powerful tool for displaying the distribution of a
variable in a dataset. By using density charts to visualize and
analyze data, researchers and analysts can gain deeper insights
and make more informed decisions.
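
A short sketch of a kernel density plot with seaborn, comparing a narrow and a wide
bandwidth (the simulated data and the bw_adjust values are illustrative):

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(seed=1)
    values = rng.normal(loc=170, scale=8, size=500)   # e.g. simulated heights in cm

    sns.kdeplot(x=values, bw_adjust=0.5, label="narrow bandwidth (more detail, more noise)")
    sns.kdeplot(x=values, bw_adjust=2.0, label="wide bandwidth (smoother, may hide features)")
    plt.legend()
    plt.show()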

Q14. Explain in detail Roadmap for Data Exploration. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data exploration is a crucial step in the data analysis process, as it
allows analysts to understand the data and identify patterns or
relationships between variables. A roadmap for data exploration can help
guide analysts through this process and ensure that all important aspects
of the data are considered.
Here is a roadmap for data exploration:
1. Define the problem: Start by defining the research question or
problem that the data will be used to address. This will help guide
the exploration process and ensure that the analysis is focused on
relevant variables.
2. Collect the data: Gather all relevant data, whether it be from
surveys, experiments, or other sources. Ensure that the data is clean
and organized, and that missing values are appropriately handled.
3. Describe the data: Begin by describing the data using basic
statistical measures such as mean, median, mode, and standard
deviation. This will provide an overview of the data and help identify
any outliers or unusual values.
4. Visualize the data: Use data visualization techniques such as
histograms, box plots, and scatter plots to explore the relationships
between variables and identify any patterns or trends.

5. Check for correlations: Analyze the correlation matrix to identify
strong correlations between variables. This can help identify
potential predictors or explanatory variables for further analysis.
6. Identify outliers: Use outlier detection techniques to identify any
data points that are significantly different from the rest of the data.
This can help identify potential errors or unusual events that may
need to be further investigated.
7. Test hypotheses: Use statistical tests to test hypotheses and explore
the relationships between variables in more depth. This can include
regression analysis, ANOVA, or other techniques depending on the
research question.
8. Communicate results: Finally, communicate the results of the data
exploration process to relevant stakeholders. This can include
visualizations, tables, and reports that summarize the key findings
and insights from the data.
By following this roadmap for data exploration, analysts can gain a deeper
understanding of the data and make more informed decisions. It is
important to note that this process is iterative and may require multiple
rounds of exploration and analysis to fully understand the data and
address the research question.

Q15. Explain Visualizing high dimensional data: Parallel chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Visualizing high-dimensional data can be challenging, as it can be
difficult to represent all of the dimensions in a way that is easy to
understand.
● One technique for visualizing high-dimensional data is the parallel
coordinates chart, also known as a parallel plot or parallel chart.

● A parallel coordinates chart displays multivariate data by
representing each dimension as a separate axis, all parallel to each
other.
● Each data point is represented as a line that intersects with each
axis at the value of that dimension for that particular data point.
● By displaying all of the dimensions in parallel, the parallel
coordinates chart can provide a comprehensive view of the
relationships between different variables.
● Here are some key features and considerations of parallel
coordinates charts:
1. Axes and scaling: Each axis represents a single dimension or
variable, and it is important to scale the axes appropriately
to ensure that all data points are visible. Nonlinear scaling
may be required for data that is highly skewed.
2. Interactivity: Parallel coordinates charts can be made
interactive by allowing users to highlight or filter data points
based on specific values or ranges.
3. Overplotting: When multiple data points overlap each other
in a parallel coordinates chart, it can be difficult to
distinguish between them. Techniques such as
transparency, jittering, and bundling can be used to alleviate
this issue.
4. Clustering: Parallel coordinates charts can be used to
identify clusters or patterns in the data, as groups of data
points that are close together along multiple axes may
represent a distinct subgroup.
● Parallel coordinates charts can be useful for a variety of
applications, such as exploratory data analysis, cluster analysis,
and classification.

● They can help identify patterns and relationships between variables
that may be difficult to see with other visualization techniques.
● However, they can also be complex and difficult to interpret,
especially for large datasets with many dimensions.
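
pandas ships a ready-made parallel coordinates helper; a minimal sketch is shown below
(the iris dataset and colour map are illustrative):

    import matplotlib.pyplot as plt
    import seaborn as sns
    from pandas.plotting import parallel_coordinates

    df = sns.load_dataset("iris")                       # any labelled multivariate DataFrame works

    # One vertical axis per numeric column; each observation becomes one coloured line
    parallel_coordinates(df, class_column="species", colormap="viridis")
    plt.show()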

Q16. Discuss in detail Deviation chart. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A deviation chart, also known as a diverging stacked bar chart or a
butterfly chart, is a visualization technique used to compare two groups of
data with a common metric.
● It displays the difference between two sets of data using stacked
bars that deviate from a central axis.
● The deviation chart is particularly useful when comparing positive
and negative values, as it allows for a clear comparison of the
differences between the two groups.
● The central axis represents the point of balance between the two
groups, with the positive values on one side and the negative values
on the other.
● Here are some key features and considerations of deviation charts:
1. Layout and design: Deviation charts can be designed in
various ways, with either horizontal or vertical bars. The bars
can be colored or shaded to indicate positive and negative
values.
2. Labels and annotations: It is important to label the bars and
provide clear annotations to explain the meaning of the
chart. This can include axis labels, legend, and annotations
to indicate the source of the data and any important
insights.
3. Data preparation: To create a deviation chart, the data must
be prepared by calculating the differences between the two
groups and stacking the positive and negative values on
either side of the central axis.
4. Interpretation: Deviation charts can help identify the
magnitude and direction of differences between two groups.
They can be used to compare a variety of metrics, such as
revenue, profit, or performance indicators.
5. Limitations: Deviation charts can be difficult to read if there
are many categories or if the differences between the
groups are small. They may also be less effective for
comparing more than two groups of data.
● Overall, deviation charts can be a useful tool for comparing two
groups of data with a common metric, especially when there are
significant differences in positive and negative values. They can
help highlight patterns and trends in the data, and provide a clear
visual representation of the differences between the two groups.

Q17. Write a short note on Andrews Curves. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Andrews curves is a technique used for visualizing high-dimensional
data by representing each observation as a smooth curve.
● It was introduced by D. F. Andrews in 1972, and has been used for
data exploration, data classification, and data clustering.
● The Andrews curve plot is created by mapping each data point to a
Fourier series of sine and cosine functions, where each coefficient
represents a feature or variable in the data.
● The curve is then generated by summing up the Fourier series. Each
curve represents a single observation, and the shape of the curve
reflects the values of the features in the data.
● The Andrews curve plot is useful for visualizing high-dimensional
data and identifying patterns and relationships between variables.

● By comparing the shapes of the curves, it is possible to determine
which features have the most influence on the data.
● The curves can also be colored or labeled to represent different
groups or classes of observations.
● One limitation of Andrews curves is that they may not be suitable
for very large datasets, as the computation of Fourier coefficients
can be computationally expensive.
● They also may not be as effective for datasets with complex
nonlinear relationships between features.
● Overall, Andrews curves can be a useful tool for visualizing
high-dimensional data and exploring the relationships between
features.
● They provide a way to represent complex data in a simple and
intuitive way, and can help to uncover patterns and insights that
may not be apparent with other visualization techniques.
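
pandas also provides an Andrews curves helper; a minimal sketch (the iris dataset is
illustrative):

    import matplotlib.pyplot as plt
    import seaborn as sns
    from pandas.plotting import andrews_curves

    df = sns.load_dataset("iris")

    # Each row is mapped to a Fourier series; observations from the same class
    # tend to produce similarly shaped curves
    andrews_curves(df, class_column="species")
    plt.show()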

MODULE-4

Q1. Write a short note on Outliers. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, outliers are data points that are significantly different from
other observations in a dataset.
● Outliers can have a significant impact on statistical analysis, as
they can affect the mean, standard deviation, and other measures
of central tendency and dispersion.
● Outliers can occur due to various reasons, such as measurement
errors, natural variation, extreme events, and data manipulation.
● Outliers can be detected using various techniques, such as box
plots, scatter plots, and statistical tests like Z-score, Tukey's method,
and Grubbs' test.
● However, identifying an outlier does not necessarily mean it should
be removed from the dataset.
● Outliers can sometimes provide valuable insights into the data and
need to be analyzed further.
● It is essential to understand the cause of the outliers and their
impact on the data before deciding whether to remove them or not.
● To summarize, outliers are data points that are
significantly different from other observations in a dataset, and they
can occur due to various reasons. Identifying and analyzing outliers
is essential for effective statistical analysis, but it is equally
important to understand their cause and impact before deciding
whether to remove them or not.

Q2. Discuss in detail Causes of Outliers. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, an outlier is an observation that is significantly different
from other observations in a dataset.
● Outliers can occur due to a variety of reasons, and understanding
the causes of outliers is crucial to effectively analyze and interpret
data.
● Here are some of the common causes of outliers:
1. Measurement error: Outliers can occur due to errors in the
measurement process. For example, a device used to
measure temperature may malfunction and produce an
inaccurate reading that is significantly different from other
readings in the dataset.
2. Data entry errors: Human error during data entry can result
in outliers. For instance, a data entry operator may
accidentally enter a wrong value, leading to an outlier.
3. Sampling errors: Outliers can occur due to sampling errors,
where the sample data is not representative of the
population. For instance, if a sample of a population is
skewed towards one end of the distribution, it may result in
outliers in the dataset.
4. Natural variation: In some cases, outliers can occur naturally
due to variation in the data. For example, in a dataset of
heights of adult humans, there may be a few individuals who
are significantly taller or shorter than the rest of the
population.
5. Extreme events: Outliers can occur due to extreme events
that are not representative of the typical behavior of the
system being observed. For example, a stock market crash
or a natural disaster can cause outliers in financial or
weather datasets, respectively.
6. Data manipulation: Outliers can also be deliberately
introduced into a dataset through data manipulation. For
example, an individual may add an outlier to a dataset to
achieve a particular result or to influence a decision.
● It is essential to identify the cause of outliers in a dataset to
determine the appropriate course of action.
● In some cases, outliers may need to be removed from the dataset,
while in others, they may be valuable data points that need to be
analyzed further.
● Techniques such as data visualization, statistical tests, and outlier
detection algorithms can help identify and analyze outliers in a
dataset.

Q3. Anomaly detection techniques. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Anomaly detection techniques are used to identify and isolate
unusual data points or patterns that deviate from the norm or expected
behavior in a dataset.
● Here are some of the commonly used anomaly detection
techniques:
1. Statistical methods: Statistical techniques such as Z-score,
Grubbs' test, and the modified Thompson Tau test are used
to identify outliers or anomalies in a dataset based on their
deviation from the mean or other statistical measures.
2. Machine learning algorithms: Machine learning algorithms,
such as clustering, classification, and regression algorithms,
can be used to identify anomalies in a dataset. These
algorithms are trained on normal data patterns and can
detect any deviations from the expected patterns as
anomalies.
3. Time-series analysis: Time-series analysis techniques can
be used to identify anomalies in time-series data by
identifying any sudden or unexpected changes in the data
patterns.
4. Pattern recognition: Pattern recognition techniques, such as
neural networks and decision trees, can be used to identify
anomalies in a dataset based on their deviation from the
expected patterns.
5. Visualization techniques: Visualization techniques, such as
scatter plots, histograms, and heat maps, can be used to
identify anomalies visually by identifying any unusual
patterns or outliers in the data.
6. Rule-based methods: Rule-based methods involve setting
predefined rules or thresholds to identify anomalies based
on specific criteria or domain knowledge.
● To summarize, anomaly detection techniques are used to identify
unusual data points or patterns in a dataset.
● These techniques include statistical methods, machine learning
algorithms, time-series analysis, pattern recognition, visualization
techniques, and rule-based methods.
● The choice of the appropriate technique depends on the type and
nature of the data and the specific anomaly detection
requirements.

Q4. Discuss Outlier Detection using Statistics. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Outlier detection is the process of identifying data points that deviate
significantly from the expected behavior of a system or dataset.
● Outliers can affect statistical analysis, as they can skew the mean,
standard deviation, and other measures of central tendency and
dispersion.
● Statistical methods are widely used for outlier detection, as they
provide a systematic approach to identifying anomalies based on
statistical significance.
● Here are some of the commonly used statistical methods for outlier
detection:
1. Z-score: The Z-score is a measure of how many standard
deviations a data point is from the mean. A data point is
considered an outlier if its Z-score is greater than a
predefined threshold, typically 2 or 3.
2. Percentile rank: The percentile rank is the percentage of data
points that fall below a particular value. An observation can be
flagged as an outlier if its percentile rank falls below a low
threshold (for example, the 1st or 5th percentile) or above a high
threshold (for example, the 95th or 99th percentile).
3. Tukey's method: Tukey's method, also known as the
box-and-whisker plot, is a graphical method for outlier
detection. The method uses the interquartile range (IQR) to
define the range of typical values and identifies outliers as
data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
4. Grubbs' test: Grubbs' test is a statistical method that tests
whether a data point is significantly different from other
observations in the dataset. The test computes a test
statistic, which is compared to a critical value to determine
whether the data point is an outlier.

5. Mahalanobis distance: Mahalanobis distance is a measure
of the distance between a data point and the mean of the
dataset, adjusted for the correlation between the variables.
An observation is considered an outlier if its Mahalanobis
distance is greater than a predefined threshold.
● These statistical methods can be used individually or in
combination to identify outliers in a dataset.
● The choice of method depends on the characteristics of the data
and the application requirements.
● It is important to note that identifying an outlier does not
necessarily mean it should be removed from the dataset.
● Outliers can sometimes provide valuable insights into the data and
need to be analyzed further.
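
A minimal Z-score sketch in NumPy (the data are simulated, with one injected anomaly,
and the threshold of 3 follows the rule of thumb above):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    data = np.concatenate([rng.normal(loc=10, scale=0.5, size=50), [25.0]])   # 25.0 is the anomaly

    z_scores = (data - data.mean()) / data.std()
    outliers = data[np.abs(z_scores) > 3]      # flag points more than 3 standard deviations away
    print("flagged as outliers:", outliers)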

Q5. Explain Outlier Detection using Distance-based methods. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Distance-based methods are a common approach for outlier
detection in data analysis.
● These methods identify outliers based on the distance between
data points, using various distance metrics such as Euclidean
distance, Mahalanobis distance, and cosine similarity.
● Below are some commonly used distance-based methods for
outlier detection:
1. K-nearest neighbor (KNN): In KNN, the distance between a
data point and its k-nearest neighbors is computed, and the
point is considered an outlier if its distance to the kth
neighbor exceeds a predefined threshold.
2. Local Outlier Factor (LOF): LOF measures the local density of
a data point relative to its neighbors. An observation is
considered an outlier if its local density is much lower than that
of its neighbors, which corresponds to an LOF score substantially
greater than 1.
3. Distance to cluster center: In clustering-based outlier
detection, the distance between a data point and the center
of the cluster is computed. If a data point is significantly far
from the center of the cluster, it is considered an outlier.
4. DBSCAN: Density-Based Spatial Clustering of Applications
with Noise (DBSCAN) is a clustering algorithm that identifies
core points, border points, and noise points in a dataset.
Data points that are classified as noise points are
considered outliers.
5. One-class SVM: One-Class Support Vector Machines (SVM)
is a machine learning technique that learns a boundary
around the normal data points and identifies outliers as
data points outside the boundary.
● These distance-based methods can be used alone or in
combination to identify outliers in a dataset.
● The choice of method depends on the characteristics of the data
and the application requirements.
● It is important to note that distance-based methods have
limitations, such as sensitivity to the choice of distance metric,
scalability, and parameter tuning.
● Therefore, careful consideration is required when selecting and
applying these methods for outlier detection.
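
A minimal sketch of the LOF approach with scikit-learn (the two-dimensional data and
the two injected outliers are illustrative; n_neighbors usually needs tuning):

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(seed=0)
    X = np.vstack([rng.normal(loc=0, scale=1, size=(100, 2)),
                   [[8.0, 8.0], [9.0, -7.0]]])          # two points far from the main cloud

    lof = LocalOutlierFactor(n_neighbors=20)
    labels = lof.fit_predict(X)                         # -1 = outlier, 1 = inlier
    print("detected outliers:\n", X[labels == -1])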

Q6. Describe Outlier detection using density-based methods. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Density-based methods are a popular approach for outlier detection
in data analysis.

● These methods identify outliers based on the density of data points
in a region of the feature space. In these methods, an outlier is
defined as a data point that is located in a low-density region, while
normal data points are located in high-density regions.
● Some commonly used density-based methods for outlier detection
are:
1. Local Outlier Factor (LOF): LOF measures the local density of
a data point relative to its neighbors. It computes the ratio of
the average density of the k-nearest neighbors of a data
point to its own density. An observation is considered an
outlier if its LOF score is substantially greater than 1, meaning
its own density is much lower than that of its neighbors.
2. Density-Based Spatial Clustering of Applications with Noise
(DBSCAN): DBSCAN is a clustering algorithm that identifies
core points, border points, and noise points in a dataset.
Core points are defined as data points with a minimum
number of neighbors within a specific distance. Border
points are neighbors of core points that are not core points
themselves, while noise points have no neighbors within the
specified distance. Data points that are classified as noise
points are considered outliers.
3. Gaussian Mixture Model (GMM): GMM is a probabilistic
clustering technique that models the data distribution as a
mixture of Gaussian distributions. Outliers are identified as
data points that have low probabilities of being generated
by the model.
4. Local Density-Based Outlier Factor (LDOF): LDOF is a
density-based method that measures the degree of
outlierness of a data point based on its local density and
distance to high-density regions. It computes a score that
represents the deviation of a data point from the density
distribution of the dataset.
5. Kernel Density Estimation (KDE): KDE is a non-parametric
method that estimates the density of data points in a region
of the feature space. Outliers are identified as data points
with low probability density values.
● These density-based methods can be used alone or in combination
to identify outliers in a dataset.
● The choice of method depends on the characteristics of the data
and the application requirements.
● It is important to note that density-based methods have limitations,
such as sensitivity to the choice of parameters, scalability, and
robustness to high-dimensional data.
● Therefore, careful consideration is required when selecting and
applying these methods for outlier detection.
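
A minimal DBSCAN-based sketch with scikit-learn, treating noise points as outliers
(the data, eps, and min_samples values are illustrative and normally require tuning):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(seed=0)
    X = np.vstack([rng.normal(loc=0, scale=0.5, size=(100, 2)),
                   [[5.0, 5.0], [-6.0, 4.0]]])          # two points in low-density regions

    db = DBSCAN(eps=0.8, min_samples=5).fit(X)
    noise_points = X[db.labels_ == -1]                  # label -1 marks noise / potential outliers
    print("points labelled as noise:\n", noise_points)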

Q7. Write a short note on SMOTE. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: SMOTE (Synthetic Minority Over-sampling Technique) is a technique
used in machine learning to address the problem of imbalanced datasets.
● In imbalanced datasets, the number of instances in the minority
class (e.g., fraud cases) is much smaller than the number of
instances in the majority class (e.g., non-fraud cases).
● This can lead to biased models that are unable to accurately
predict the minority class.
● SMOTE is a technique that generates synthetic samples of the
minority class to balance the dataset.
● The algorithm works by selecting a minority class instance and
computing its k nearest neighbors in the feature space.
● Synthetic instances are then created by interpolating between the
selected instance and its neighbors.

● The SMOTE algorithm has several advantages over other
over-sampling techniques, such as random over-sampling,
including:
1. SMOTE generates synthetic instances that are more
representative of the minority class than random
over-sampling.
2. SMOTE does not create exact copies of existing instances,
reducing the risk of overfitting.
3. SMOTE can be combined with other techniques, such as
under-sampling, to further balance the dataset.
4. SMOTE can improve the performance of machine learning
models, especially in cases where the minority class is
under-represented.
● However, SMOTE also has some limitations, such as:
1. SMOTE can create noisy samples that do not accurately
represent the minority class.
2. SMOTE can increase the risk of overfitting if the synthetic
samples are too similar to the existing samples.
3. SMOTE may not be effective in cases where the minority
class is very small or the data is highly imbalanced.
● It is a useful technique for balancing imbalanced datasets and
improving the performance of machine learning models. However,
it should be used with caution and in combination with other
techniques to ensure the validity and generalizability of the results.
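
A minimal SMOTE sketch using the third-party imbalanced-learn package (assumed to be
installed; the synthetic dataset and the 95/5 class split are illustrative):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn

    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.95, 0.05], random_state=42)
    print("class counts before:", Counter(y))

    # Interpolate between each minority sample and its k nearest minority neighbours
    X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
    print("class counts after: ", Counter(y_res))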

MODULE-5

Q1. Explain Taxonomy of Time Series Forecasting methods. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Time series forecasting methods can be classified into various
categories based on the underlying model, algorithm, and approach. Here
is a common taxonomy of time series forecasting methods:
1. Statistical Methods: These methods use statistical techniques to
model the time series and make predictions. Some of the popular
statistical methods include ARIMA (Autoregressive Integrated
Moving Average), SARIMA (Seasonal ARIMA), and exponential
smoothing.
2. Machine Learning Methods: These methods use machine learning
algorithms to learn the patterns in the time series and make
predictions. Some of the popular machine learning methods
include artificial neural networks (ANNs), support vector regression
(SVR), decision trees, and random forests.
3. Deep Learning Methods: These methods use deep neural networks
to learn the complex patterns in the time series and make
predictions. Some of the popular deep learning methods include
Long Short-Term Memory (LSTM), Convolutional Neural Networks
(CNNs), and Gated Recurrent Units (GRUs).
4. Hybrid Methods: These methods combine multiple models or
algorithms to improve the accuracy of the forecasts. For example, a
hybrid model may use a statistical method to model the trend and
seasonality of the time series and a machine learning method to
model the residual errors.
5. Expert Opinion Methods: These methods rely on the knowledge and
expertise of domain experts to make forecasts. For example, a sales
manager may use their experience and knowledge of the market to
forecast sales for the next quarter.
6. Ensemble Methods: These methods combine the forecasts from
multiple models or algorithms to produce a more accurate and
robust forecast. For example, an ensemble model may combine the
forecasts from a statistical method, a machine learning method,
and an expert opinion method.
7. Probabilistic Methods: These methods provide a range of possible
outcomes with associated probabilities. For example, a probabilistic
method may provide a 95% confidence interval for the forecasted
value.

Q2. Discuss in detail Time Series Decomposition. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Time series decomposition is a statistical method that separates a
time series into its different components, namely trend, seasonality, and
residual, to better understand its underlying patterns and behavior.
● Decomposition is a commonly used technique in time series
analysis and forecasting, as it can help identify important patterns
and trends in the data, remove noise and outliers, and improve the
accuracy of forecasts.
● The decomposition of a time series involves the following steps:
1. Trend Component: The first step is to estimate the trend
component of the time series, which represents the
long-term behavior of the data over time. This can be done
using various statistical techniques, such as moving
averages, exponential smoothing, or regression analysis. The
trend component captures the overall direction and
magnitude of the time series and is usually the most
important component for forecasting.

2. Seasonal Component: The second step is to estimate the
seasonal component of the time series, which represents the
periodic patterns or cycles in the data that repeat over fixed
time intervals. This can be done using various methods, such
as seasonal indices, Fourier analysis, or seasonal regression
models. The seasonal component captures the systematic
variation in the time series due to seasonal effects, such as
weather, holidays, or economic cycles.
3. Residual Component: The third step is to estimate the
residual component of the time series, which represents the
random or unpredictable variation in the data that is not
explained by the trend or seasonal components. This can be
done by subtracting the estimated trend and seasonal
components from the original time series. The residual
component captures the unexplained variation in the time
series and is usually the least important component for
forecasting.
● Once the time series has been decomposed into its components,
each component can be analyzed separately to better understand
its characteristics and behavior.
● For example, the trend component can be used to identify
long-term patterns and changes in the data, while the seasonal
component can be used to identify seasonal effects and patterns.
● The residual component can be used to identify outliers and
random fluctuations in the data.
● The decomposition of a time series can also be used for forecasting
by extrapolating the trend and seasonal components into the future
and adding them together to obtain a forecast.
● The residual component can be used to estimate the uncertainty or
variability in the forecast.
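
A minimal decomposition sketch with statsmodels' seasonal_decompose (the synthetic
monthly series, the additive model, and the period of 12 are illustrative):

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Synthetic monthly series: upward trend plus a summer bump
    idx = pd.date_range("2018-01-01", periods=48, freq="MS")
    values = [100 + 2 * i + (10 if i % 12 in (5, 6, 7) else 0) for i in range(48)]
    series = pd.Series(values, index=idx)

    result = seasonal_decompose(series, model="additive", period=12)
    trend, seasonal, resid = result.trend, result.seasonal, result.resid
    result.plot()                     # four panels: observed, trend, seasonal, residual
    plt.show()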

Q3. Write a short note on the Average method. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: The average method is a simple and widely used forecasting
technique that involves calculating the average of past observations and
using it as a forecast for future periods.
● This method is based on the assumption that future values will be
similar to past values, and that the average provides a reasonable
estimate of the future trend.
● To use the average method, the first step is to collect historical data
for the time series. Then, the average of the past observations is
calculated and used as the forecast for the next period.
● This process is repeated for each future period, with the forecast for
each period equal to the average of the past observations.
● The average method is easy to use and does not require any
complex mathematical calculations or statistical analysis.
● It is often used as a benchmark or baseline method for comparing
the performance of more advanced forecasting techniques.
● However, the average method has several limitations, including:
1. It does not capture any trend or seasonal patterns in the
data, and assumes that future values will be the same as
past values.
2. It is sensitive to outliers and extreme values in the data,
which can affect the accuracy of the forecast.
3. It does not take into account any external factors or events
that may affect the time series, such as changes in the
economy or market conditions.
● Despite these limitations, the average method can be useful for
short-term forecasting of stable and relatively predictable time
series.

● It is also a useful tool for generating quick and simple forecasts,
especially when more advanced methods are not available or
necessary.
● However, for more complex time series and longer-term forecasts,
other forecasting methods may be more appropriate.

Q4. Explain in detail Moving Average smoothing. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Moving Average (MA) smoothing is a widely used statistical method
for analyzing time-series data.
● It is a technique for identifying trends and patterns in a time series
by calculating an average of the values in a sliding window over the
series.
● In simple terms, MA smoothing involves calculating the average of
a fixed number of previous data points in a time series.
● This creates a smoothed version of the series, which is useful for
identifying trends and patterns that may be difficult to see in the
raw data.
● The basic idea behind MA smoothing is to calculate the moving
average of a fixed window size (usually denoted as k) for each point
in the series.
● The window size determines how many data points are included in
the average calculation.
● For example, if the window size is 3, the moving average at each
point is calculated by taking the average of the current point and
the two previous points.
● To calculate the moving average for each point in the series, we
start with the first k data points and calculate the average. We then
move the window one data point at a time and recalculate the
average for each new window.

● The formula for calculating the moving average for each point in
the series is:
MA(t) = (Y(t) + Y(t-1) + ... + Y(t-k+1)) / k
where Y(t) is the value at time t, k is the window size, and MA(t) is
the moving average at time t.
● MA smoothing has several advantages. It can help to reduce noise
in the data, making it easier to identify trends and patterns. It is also
a simple and intuitive method that requires only basic
mathematical skills.
● However, MA smoothing also has some limitations. One of the main
limitations is that it is sensitive to the choice of window size.
● A small window size may not smooth the data enough to reveal
meaningful patterns, while a large window size may smooth the
data too much and obscure important features.
● Another limitation is that it is a backward-looking method and may
not be suitable for predicting future values beyond the range of the
data used to calculate the moving averages.
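
The same calculation is a one-liner with pandas' rolling window (the series values and
the window size of 3 are illustrative):

    import pandas as pd

    series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

    # 3-point moving average: MA(t) = (Y(t) + Y(t-1) + Y(t-2)) / 3
    ma3 = series.rolling(window=3).mean()
    print(ma3)    # the first two values are NaN because a full window is not yet available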

Q5. Explain Time series analysis using linear regression. (P4 - Appeared
1 Time) (5-10 Marks)
Ans: Time series analysis is a statistical method used to analyze data that
is collected over a period of time.
● In time series analysis, the data is often plotted over time to identify
patterns, trends, and other useful information.
● Linear regression is a statistical method that can be used to analyze
time series data. Linear regression is a technique used to model the
relationship between a dependent variable and one or more
independent variables.
● In time series analysis, the dependent variable is usually the
variable of interest that changes over time, and the independent
variables are time-related variables such as time itself or other factors that may affect the dependent variable.
● The basic idea behind linear regression in time series analysis is to
use a linear equation to model the relationship between the
dependent variable and the independent variables.
● The linear equation takes the form of:
Y = a + bX + e
where Y is the dependent variable, X is the independent variable, a
is the intercept, b is the slope, and e is the error term.
● The slope and intercept can be estimated using least-squares
regression, which minimizes the sum of the squared differences
between the predicted values and the actual values.
● Once the slope and intercept are estimated, they can be used to
make predictions about future values of the dependent variable
based on the independent variables.
● This is useful in time series analysis because it allows us to identify
trends and patterns in the data and make predictions about future
behavior.
● Linear regression can also be used to test hypotheses about the
relationship between the dependent variable and the independent
variables.
● For example, we may want to test whether a particular independent
variable has a significant effect on the dependent variable, or
whether there is a trend in the data over time.
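● As a brief illustration, the trend line Y = a + bX can be fitted by least squares in a few lines of Python (a minimal sketch assuming NumPy; the series below is generated purely for illustration):

  import numpy as np

  t = np.arange(10)                                      # time index 0, 1, ..., 9
  y = 2.0 + 0.5 * t + np.random.normal(0, 0.3, size=10)  # hypothetical series with an upward trend

  b, a = np.polyfit(t, y, deg=1)                         # slope b and intercept a via least squares
  future = a + b * np.array([10, 11, 12])                # extrapolate the fitted line to future periods
  print(a, b, future)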

Q6. Write a short note on ARIMA Model. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: ARIMA (Autoregressive Integrated Moving Average) model is a popular
statistical method used for time series analysis and forecasting.
● It is a combination of three methods: autoregression, differencing, and moving average.
● The ARIMA model is a generalization of the simpler ARMA
(Autoregressive Moving Average) model, which assumes that the
time series is stationary (i.e., the statistical properties of the series
do not change over time).
● However, many time series in real-world applications are
non-stationary, meaning that the statistical properties change over
time.
● ARIMA models can handle non-stationary time series by
incorporating differencing, which removes the trend or seasonality
component from the series. The ARIMA model also includes
autoregression and moving average components, which capture
the autocorrelation and noise components of the series,
respectively.
● ARIMA models are specified by three parameters:
p, d, and q.
● The parameter p represents the autoregression order, which is the
number of lagged values of the dependent variable used to predict
the current value.
● The parameter q represents the moving average order, which is the
number of lagged errors used to predict the current value.
● The parameter d represents the differencing order, which is the
number of times the series is differenced to make it stationary.
● ARIMA models can be used for both time series analysis and
forecasting.
● For time series analysis, ARIMA models can be used to identify the
underlying patterns and trends in the data, and to test for the
presence of seasonality or other cyclical components.
● For forecasting, ARIMA models can be used to predict future values of the time series based on its past behavior.
● Here's a flowchart for building an ARIMA model:
1. Examine the time series data to determine if it is stationary. If
it is not stationary, apply differencing to make it stationary.
2. Determine the order of differencing (d) required to make the
series stationary. This can be done by examining the
autocorrelation function (ACF) and partial autocorrelation
function (PACF) plots of the differenced series.
3. Determine the order of the autoregressive (p) and moving
average (q) terms required for the model. This can also be
done by examining the ACF and PACF plots of the
differenced series.
4. Fit the ARIMA model to the time series data using the chosen
values of p, d, and q. This can be done using a variety of
statistical software packages.
5. Check the residuals of the ARIMA model for autocorrelation,
non-normality, and heteroscedasticity. If any of these issues
are present, re-estimate the model or consider using a
different model.
6. Use the ARIMA model to make predictions for future time
periods.
7. Check the accuracy of the ARIMA model predictions using
metrics such as mean absolute error (MAE) or root mean
squared error (RMSE).
8. Refine the ARIMA model as necessary based on the
prediction accuracy and additional insights gained from the
analysis of the time series data.
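● The workflow above maps to a short Python sketch using the statsmodels library (a minimal example, assuming statsmodels is installed and that an order of (1, 1, 1) happens to suit the hypothetical series; in practice p, d, and q come from the ACF/PACF analysis in steps 2-3):

  import pandas as pd
  from statsmodels.tsa.arima.model import ARIMA

  # Hypothetical monthly series; in practice this comes from your own data source
  y = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
                index=pd.period_range("2022-01", periods=12, freq="M"))

  model = ARIMA(y, order=(1, 1, 1))   # p = 1, d = 1, q = 1
  fitted = model.fit()
  print(fitted.summary())             # step 5: inspect coefficients and residual diagnostics
  print(fitted.forecast(steps=3))     # step 6: predict the next three periods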

Q7. Discuss in detail Mean Absolute Error. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: Mean Absolute Error (MAE) is a popular metric used to measure the
accuracy of a predictive model.
● It is particularly useful for evaluating models that predict
continuous variables, such as regression models.
● MAE measures the average absolute difference between the actual
and predicted values of a variable.
● It is calculated by taking the absolute value of the difference
between each predicted value and its corresponding actual value,
and then taking the average of these differences.
● The formula for calculating MAE is:
MAE = (1/n) * Σ| yi - ŷi |
where yi is the actual value of the variable, ŷi is the predicted value
of the variable, n is the total number of observations, and Σ is the
summation symbol.
● MAE is a useful metric because it is easy to understand and
interpret. It measures the average size of the errors made by the
model, with larger errors contributing more to the overall score than
smaller errors.
● MAE is expressed in the same units as the variable being predicted,
which makes it easy to compare the accuracy of different models.
● One limitation of MAE is that it treats all errors as equally important,
regardless of whether they are positive or negative.
● This means that a model that consistently underestimates the
variable of interest will have the same MAE as a model that
consistently overestimates the variable, even though these errors
may have different implications for the practical use of the model.
● Despite this limitation, MAE is a widely used metric for evaluating the
accuracy of predictive models.
● It is often used in conjunction with other metrics, such as Root Mean Squared Error (RMSE), to provide a more comprehensive evaluation of model performance.
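● For example, a minimal Python sketch of the MAE calculation (the numbers are made up purely for illustration):

  actual    = [10, 12, 15, 11]
  predicted = [11, 14, 13, 11]

  n = len(actual)
  mae = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / n
  print(mae)  # (1 + 2 + 2 + 0) / 4 = 1.25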

Q8. Write a short note on Root Mean Square Error. (P4 - Appeared 1
Time) (5-10 Marks)
Ans: Root Mean Square Error (RMSE) is a commonly used metric for
evaluating the accuracy of a predictive model.
● It measures the average magnitude of the errors made by the
model, with larger errors contributing more to the overall score than
smaller errors. RMSE is particularly useful for evaluating models that
predict continuous variables, such as regression models.
● The formula for calculating RMSE is:
RMSE = sqrt((1/n) * Σ(yi - ŷi)²)
where yi is the actual value of the variable,
ŷi is the predicted value of the variable,
n is the total number of observations,
Σ is the summation symbol, and sqrt is the square root function.
● RMSE is expressed in the same units as the variable being predicted,
which makes it easy to compare the accuracy of different models.
● One advantage of RMSE over Mean Absolute Error (MAE) is that it
gives more weight to larger errors, which may be more important to
consider in certain applications.
● However, one limitation of RMSE is that it is sensitive to outliers in the
data, which can inflate the value of the metric. Another limitation is
that it can be difficult to interpret in practical terms, since it does
not have a direct relationship to the performance of the model.
● Despite these limitations, RMSE is widely used as a metric for
evaluating the accuracy of predictive models, especially in cases
where the variable being predicted has a wide range of possible values.
● It is often used in conjunction with other metrics, such as MAE or
Mean Absolute Percentage Error (MAPE), to provide a more
comprehensive evaluation of model performance.
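● For example, a minimal Python sketch of the RMSE calculation (made-up numbers), which also shows how a single large error raises RMSE more than MAE:

  import math

  actual    = [10, 12, 15, 11]
  predicted = [11, 14, 13, 19]   # the last prediction is badly off

  n = len(actual)
  errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
  mae  = sum(abs(e) for e in errors) / n
  rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
  print(mae, rmse)  # MAE = 3.25, RMSE ≈ 4.27: squaring weights the large error more heavily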

Q9. Mean Absolute Percentage Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Percentage Error (MAPE) is a commonly used metric for
evaluating the accuracy of a predictive model.
● It measures the average percentage difference between the actual
and predicted values of a variable, making it particularly useful for
evaluating models that predict variables with varying scales or
magnitudes.
● The formula for calculating MAPE is:
MAPE = (1/n) * Σ | (yi - ŷi) / yi | * 100%
where yi is the actual value of the variable, ŷi is the predicted value
of the variable, n is the total number of observations, Σ is the
summation symbol, and | | represents absolute value.
● MAPE is expressed as a percentage, which makes it easy to interpret
and compare across different variables and models.
● It measures the average size of the errors made by the model
relative to the actual values of the variable, with larger errors
contributing more to the overall score than smaller errors.
● Unlike RMSE, MAPE is not sensitive to the scale of the variable being
predicted, which can be an advantage in certain applications.
● One limitation of MAPE is that it is undefined when the actual value
of the variable is zero, which can occur in some applications.
● In addition, MAPE can be affected by outliers in the data, which can skew the overall score. Finally, MAPE can be less intuitive to interpret than other metrics, such as RMSE or MAE.
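● For example, a minimal Python sketch of the MAPE calculation (made-up numbers; note that an actual value of zero would make the division undefined, as discussed above):

  actual    = [100, 80, 50, 40]
  predicted = [110, 76, 60, 38]

  n = len(actual)
  mape = (sum(abs((y - y_hat) / y) for y, y_hat in zip(actual, predicted)) / n) * 100
  print(mape)  # 10.0, i.e. the predictions are off by 10% on average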

Q10. Discuss in detail Mean Absolute Scaled Error. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Mean Absolute Scaled Error (MASE) is a commonly used metric for
evaluating the accuracy of a predictive model.
● It measures the average magnitude of the errors made by the
model relative to the errors made by a simple benchmark model,
making it particularly useful for evaluating models that make
predictions over time series data.
● The formula for calculating MASE is:
MASE = (1/n) * Σ | (yi - ŷi) / MAE |
where yi is the actual value of the variable,
ŷi is the predicted value of the variable,
n is the total number of observations,
Σ is the summation symbol,
| | represents absolute value, and
MAE is the mean absolute error of a benchmark model that always
predicts the value of the variable as its most recent observation.
● MASE is a unitless measure, which makes it easy to interpret and
compare across different variables and models.
● It measures the average size of the errors made by the model
relative to the errors made by a simple benchmark model, with
values less than 1 indicating that the model is better than the
benchmark model and values greater than 1 indicating that the
model is worse than the benchmark model.
● One advantage of MASE over other metrics, such as RMSE or MAPE, is that it provides a standardized way to compare the accuracy of
different models over time series data, without being affected by
the scale of the variable being predicted or by outliers in the data.
● In addition, MASE is less sensitive to changes in the distribution of
the data over time, which can be an advantage in applications
where the underlying data is subject to external factors, such as
seasonality or trend.
● Despite these advantages, MASE can be more difficult to calculate than other metrics, since it requires the calculation of a benchmark model for each time series.
● In addition, MASE can be affected by the choice of benchmark
model, which can influence the overall score.
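● For example, a minimal Python sketch of the MASE calculation (made-up numbers; the benchmark is the naive forecast that simply repeats the previous observation, as described above):

  actual    = [20, 22, 25, 24, 27]
  predicted = [21, 23, 24, 25, 26]

  # MAE of the naive benchmark: predict each value as the previous observation
  naive_errors = [abs(actual[i] - actual[i - 1]) for i in range(1, len(actual))]
  mae_naive = sum(naive_errors) / len(naive_errors)

  # MASE: average of the model's absolute errors scaled by the benchmark MAE
  scaled = [abs(y - y_hat) / mae_naive for y, y_hat in zip(actual, predicted)]
  mase = sum(scaled) / len(scaled)
  print(mase)  # about 0.44 here; values below 1 mean the model beats the naive benchmark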

Q11. Explain Evaluation parameters for Classification. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Evaluation parameters for classification are metrics that are used to
assess the performance of a classification model.
● These metrics provide a quantitative measure of how well the
model is able to correctly classify instances into their respective
classes.
● Some of the commonly used evaluation parameters for
classification are:
1. Accuracy: It is the proportion of correct predictions made by
the model out of the total number of predictions.
It is calculated as follows:
Accuracy = Number of correct predictions / Total number of predictions
2. Precision: It is the proportion of true positive predictions out of the total number of positive predictions made by the model.
It is calculated as follows:
Precision = True positives / (True positives + False positives)

3. Recall (Sensitivity): It is the proportion of true positive predictions out of the total number of actual positive instances in the dataset.
It is calculated as follows:
Recall = True positives / (True positives + False negatives)

4. F1 Score: It is the harmonic mean of precision and recall. It is a measure that balances between precision and recall.
It is calculated as follows:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

5. Specificity: It is the proportion of true negative predictions out of the total number of actual negative instances in the dataset.
It is calculated as follows:
Specificity = True negatives / (True negatives + False positives)

6. Area under the receiver operating characteristic curve (AUC-ROC): It is a measure of the classifier's ability to distinguish between positive and negative classes.
It is calculated as the area under the ROC curve, which is a
plot of true positive rate (sensitivity) against the false
positive rate (1 - specificity) at various classification
thresholds.
● These evaluation parameters help in understanding the
performance of the classification model in terms of its ability to
accurately classify instances into their respective classes.
● Based on the specific requirements of the classification task, one or more of these parameters can be used to evaluate and compare the performance of different classification models.
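● As an illustration, the formulas above can be computed directly from confusion-matrix counts in Python (a minimal sketch with made-up counts):

  tp, fp, fn, tn = 40, 10, 5, 45   # hypothetical confusion-matrix counts

  accuracy    = (tp + tn) / (tp + tn + fp + fn)
  precision   = tp / (tp + fp)
  recall      = tp / (tp + fn)
  f1          = 2 * (precision * recall) / (precision + recall)
  specificity = tn / (tn + fp)
  print(accuracy, precision, recall, f1, specificity)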

Q12. Write a short note on regression and clustering. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Regression and clustering are two different types of machine learning
algorithms that are used for different purposes.
Regression:
● It is a type of supervised learning algorithm that is used to model
the relationship between a dependent variable and one or more
independent variables.
● The goal of regression is to find a mathematical function that can
predict the value of the dependent variable based on the values of
the independent variables.
● Regression is used to solve problems such as predicting house
prices based on their features, estimating sales based on
advertising spend, or forecasting the stock market based on
historical data.
Clustering:
● Clustering, on the other hand, is a type of unsupervised learning
algorithm that is used to group similar data points together based
on their features.
● The goal of clustering is to identify clusters or groups of data points
that share similar characteristics, without any prior knowledge of
the groups.
● Clustering is used to solve problems such as customer
segmentation, fraud detection, or image segmentation.
● While regression and clustering are different types of algorithms,
they can be used together in certain scenarios.
● For example, clustering can be used to identify groups of similar data points, which can then be used as input features for a regression model.
● This approach can be used to improve the accuracy of the
regression model by capturing non-linear relationships between
the input features and the dependent variable.
● Another use case for combining regression and clustering is to use
clustering to identify outliers in the data, which can then be
removed before fitting a regression model.
● Outliers can have a significant impact on the regression model,
leading to overfitting or underfitting.
● By removing the outliers identified through clustering, the accuracy
of the regression model can be improved.
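● A minimal sketch of the first combination described above, using scikit-learn (assuming it is installed; the data is randomly generated purely for illustration):

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 2))                              # two made-up input features
  y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

  labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
  X_aug = np.column_stack([X, labels])                       # cluster label added as an extra feature

  model = LinearRegression().fit(X_aug, y)
  print(model.score(X_aug, y))                               # R² of the augmented regression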

MODULE-6

Q1. Write a short note on Predictive Modeling. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Predictive modeling is a statistical and data mining technique used to
create a model that can predict future events or behaviors based on
historical data.
● It involves identifying patterns and relationships within data sets to
make predictions about future outcomes.
● The process of predictive modeling typically involves several steps,
including data collection, data cleaning and preprocessing, feature
selection, model training, model evaluation, and deployment.
● The goal is to build a model that accurately predicts future
outcomes based on the available data.
● Predictive modeling has many practical applications, including
fraud detection, marketing analytics, credit scoring, and risk
management.
● It is widely used in industries such as finance, insurance, healthcare,
and retail to help organizations make better decisions and improve
their performance.
● Predictive modeling is a powerful tool that enables businesses to
use data to make more informed decisions and gain a competitive
advantage.
● However, it requires a deep understanding of statistical and data
science concepts and techniques, as well as access to high-quality
data and advanced analytics tools.

Q2. Describe in detail House price prediction. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: House price prediction is a popular application of predictive modeling
in the real estate industry. The goal is to build a model that can accurately
predict the price of a house based on various features or attributes such as
location, size, number of bedrooms, bathrooms, and other amenities.
Here are the steps involved in building a house price prediction model:
1. Data collection: The first step is to gather data on houses that have
been sold in the target area. This data can be obtained from real
estate websites, property databases, and local agencies.
2. Data preprocessing: Once the data has been collected, it needs to
be cleaned and processed. This involves removing duplicates, filling
in missing values, and converting categorical variables into
numerical values.
3. Feature selection: The next step is to select the relevant features
that will be used to build the model. This can be done using various
techniques such as correlation analysis, principal component
analysis, or domain expertise.
4. Model selection: There are several models that can be used for
house price prediction, including linear regression, decision trees,
and random forests. The choice of model depends on the
complexity of the problem and the quality of the data.
5. Model training: Once the model has been selected, it needs to be
trained using the available data. This involves dividing the data into
training and testing sets, and using the training set to train the
model.
6. Model evaluation: After the model has been trained, it needs to be
evaluated to determine its accuracy and performance. This can be
done by comparing the predicted prices with the actual prices of
houses in the testing set.
7. Model deployment: Finally, the model can be deployed to predict the prices of new houses based on their features. This can be done using a web-based interface or a mobile app.
House price prediction is a complex task that requires a deep
understanding of statistical and machine learning concepts. However, with
the right data and tools, it can be a powerful tool for real estate
professionals and home buyers alike.
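For illustration, steps 4 to 6 above can be sketched in a few lines with scikit-learn (a minimal example assuming the library is installed; the feature matrix and prices below are entirely made up):

  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import RandomForestRegressor
  from sklearn.metrics import mean_absolute_error

  # Hypothetical features: area (sq. ft.), bedrooms, bathrooms
  X = np.array([[1000, 2, 1], [1500, 3, 2], [2000, 3, 2], [1200, 2, 1],
                [1800, 4, 3], [2500, 4, 3], [900, 1, 1], [1600, 3, 2]])
  y = np.array([50, 75, 95, 58, 90, 120, 45, 80])   # made-up prices

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
  model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
  print(mean_absolute_error(y_test, model.predict(X_test)))  # step 6: error on held-out houses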

Q3. Explain in detail Fraud Detection. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Fraud detection is a critical application of predictive modeling in
industries such as finance, insurance, and e-commerce. The goal is to
identify and prevent fraudulent activities such as credit card fraud,
insurance fraud, and identity theft.
Here are the steps involved in building a fraud detection model:
1. Data collection: The first step is to gather data on past transactions,
including information such as transaction amount, date and time,
location, and user information.
2. Data preprocessing: Once the data has been collected, it needs to
be cleaned and processed. This involves removing duplicates, filling
in missing values, and converting categorical variables into
numerical values.
3. Feature engineering: The next step is to create new features that
can be used to identify fraudulent activities. For example, the time
of day or day of the week may be an important indicator of fraud.
4. Model selection: There are several models that can be used for
fraud detection, including logistic regression, decision trees, and
neural networks. The choice of model depends on the complexity of
the problem and the quality of the data.
5. Model training: Once the model has been selected, it needs to be trained using the available data. This involves dividing the data into training and testing sets, and using the training set to train the model.
6. Model evaluation: After the model has been trained, it needs to be
evaluated to determine its accuracy and performance. This can be
done by comparing the predicted results with the actual results of
past transactions.
7. Model deployment: Finally, the model can be deployed to identify
fraudulent activities in real-time. This can be done using a
web-based interface or a mobile app.
Overall, fraud detection is a challenging task that requires a deep
understanding of statistical and machine learning concepts. However, with
the right data and tools, it can be a powerful tool for preventing financial
losses and protecting consumer data.
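For illustration, steps 4 to 6 above can be sketched with a logistic regression classifier in scikit-learn (a minimal example assuming the library is installed; the transaction records are made up):

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  # Hypothetical features: amount, hour of day, is_foreign (0/1); label 1 = fraudulent
  X = np.array([[20, 14, 0], [5000, 3, 1], [35, 10, 0], [4200, 2, 1],
                [15, 18, 0], [3900, 1, 1], [60, 12, 0], [25, 9, 0]])
  y = np.array([0, 1, 0, 1, 0, 1, 0, 0])

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
  clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  print(clf.predict_proba(X_test)[:, 1])  # estimated probability that each held-out transaction is fraud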

Q4. Write a short note on Clustering. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Clustering is a popular unsupervised learning technique used in
machine learning and data mining. The goal of clustering is to group
similar data points together based on their characteristics or features,
without prior knowledge of the specific classes or labels.
The process of clustering typically involves several steps:
1. Data preprocessing: The first step is to clean and preprocess the
data, including removing duplicates, filling in missing values, and
normalizing the data.
2. Feature selection: The next step is to select the relevant features
that will be used to group the data points. This can be done using
various techniques such as correlation analysis, principal component analysis, or domain expertise.
3. Similarity or distance measure: The similarity or distance measure is
a key component of clustering, which is used to determine how
similar or dissimilar two data points are. Common measures
include Euclidean distance, cosine similarity, and Manhattan
distance.
4. Clustering algorithm: There are several clustering algorithms that
can be used, including K-means, hierarchical clustering, and
density-based clustering. The choice of algorithm depends on the
specific problem and the characteristics of the data.
5. Evaluation: After clustering, it is important to evaluate the quality of
the clustering results. This can be done using various metrics such
as silhouette score, Dunn index, or Calinski-Harabasz index.
Clustering has many practical applications, including customer
segmentation, image segmentation, and anomaly detection. It is widely
used in industries such as marketing, healthcare, and finance to gain
insights from large datasets and make better decisions.
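For illustration, steps 4 and 5 above can be sketched with scikit-learn's K-means and silhouette score (a minimal example assuming the library is installed; the points are randomly generated for illustration):

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.metrics import silhouette_score

  rng = np.random.default_rng(42)
  # Three made-up blobs of 2-D points
  X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

  kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
  print(silhouette_score(X, kmeans.labels_))  # closer to 1 means well-separated clusters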

Q5. Describe in detail Customer Segmentation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Customer segmentation is a technique used by businesses to divide
their customers into groups based on common characteristics such as
demographics, behavior, and purchasing habits. The goal of customer
segmentation is to better understand the needs and preferences of
different customer groups and tailor marketing strategies to each group
accordingly.
Here are the steps involved in customer segmentation:
1. Data collection: The first step is to gather data on customers, including information such as age, gender, income, and purchase history.
2. Data preprocessing: Once the data has been collected, it needs to
be cleaned and processed. This involves removing duplicates, filling
in missing values, and converting categorical variables into
numerical values.
3. Feature selection: The next step is to select the relevant features
that will be used to group the customers. This can be done using
various techniques such as correlation analysis, principal
component analysis, or domain expertise.
4. Similarity or distance measure: The similarity or distance measure is
a key component of customer segmentation, which is used to
determine how similar or dissimilar two customers are. Common
measures include Euclidean distance, cosine similarity, and
Manhattan distance.
5. Clustering algorithm: There are several clustering algorithms that
can be used for customer segmentation, including K-means,
hierarchical clustering, and density-based clustering. The choice of
algorithm depends on the specific problem and the characteristics
of the data.
6. Evaluation: After clustering, it is important to evaluate the quality of
the clustering results. This can be done using various metrics such
as silhouette score, Dunn index, or Calinski-Harabasz index.
7. Marketing strategy: Once the customers have been segmented into
groups, businesses can tailor their marketing strategies to each
group. For example, a company may create different advertising
campaigns for high-income customers versus low-income
customers.
Customer segmentation has many practical applications, including improving customer retention, increasing customer lifetime value, and
optimizing marketing spend. It is widely used in industries such as retail,
e-commerce, and healthcare to gain insights from large datasets and
make better decisions.
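For illustration, steps 3 to 7 above can be sketched by scaling the features, clustering, and then profiling each segment (a minimal example assuming pandas and scikit-learn are installed; the customer records are made up):

  import pandas as pd
  from sklearn.preprocessing import StandardScaler
  from sklearn.cluster import KMeans

  customers = pd.DataFrame({
      "age":         [22, 25, 47, 52, 46, 56, 23, 30],
      "income":      [15, 18, 60, 75, 58, 80, 20, 25],   # made-up values, in thousands
      "spend_score": [80, 76, 40, 30, 45, 20, 90, 70],
  })

  X = StandardScaler().fit_transform(customers)          # put features on a common scale
  customers["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
  print(customers.groupby("segment").mean())             # average profile of each segment, e.g. for targeted marketing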

Q6. Explain in detail Time series forecasting. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Time series forecasting is a popular technique in data science used to
predict future values of a time-dependent variable based on historical
data. Time series data consists of observations taken at regular time
intervals, such as daily, weekly, or monthly, and includes data from a wide
range of domains such as finance, economics, weather, and energy.
Here are the steps involved in time series forecasting:
1. Data collection: The first step is to gather historical data on the time
series variable of interest, including the time stamps and values.
2. Data preprocessing: Once the data has been collected, it needs to
be cleaned and processed. This involves removing duplicates, filling
in missing values, and handling any outliers or anomalies.
3. Visualization: It is helpful to visualize the data to understand its
trends and patterns over time. This can be done using various
visualization techniques such as line charts, scatterplots, and
heatmaps.
4. Time series modeling: There are several time series models that can
be used for forecasting, including ARIMA (autoregressive integrated
moving average), exponential smoothing, and seasonal
decomposition. The choice of model depends on the specific
problem and the characteristics of the data.
5. Model evaluation: After building the time series model, it is
important to evaluate its performance using various metrics such
as mean absolute error (MAE), mean squared error (MSE), and root
mean squared error (RMSE). This helps to ensure that the model is
accurate and reliable.
6. Forecasting: Once the time series model has been evaluated, it can
be used to make predictions about future values of the
time-dependent variable. This can help businesses and
organizations make better decisions and plan for the future.
Time series forecasting has many practical applications, including
predicting stock prices, forecasting demand for products, and estimating
energy consumption. It is widely used in industries such as finance, retail,
and manufacturing to gain insights from historical data and make better
decisions.

Q7. Write a short note on Weather Forecasting. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Weather forecasting is the process of predicting the future state of the
atmosphere at a given location and time. Weather forecasting is an
important application of data science and is widely used in a range of
fields such as agriculture, aviation, transportation, and emergency
management.
The process of weather forecasting typically involves the following steps:
1. Data collection: The first step is to collect a range of weather data,
including temperature, humidity, pressure, wind speed, and
precipitation. This data can be collected from various sources such
as weather stations, satellites, and radars.
2. Data preprocessing: Once the data has been collected, it needs to
be cleaned and processed. This involves removing duplicates, filling
in missing values, and handling any outliers or anomalies.
3. Weather modeling: There are several models that can be used for
weather forecasting, including numerical weather prediction
models, statistical models, and machine learning models. The choice of model depends on the specific problem and the characteristics of the data.
4. Model evaluation: After building the weather model, it is important
to evaluate its performance using various metrics such as
accuracy, precision, and recall. This helps to ensure that the model
is accurate and reliable.
5. Forecasting: Once the weather model has been evaluated, it can be
used to make predictions about future weather conditions. These
predictions can be used to provide weather alerts and advisories to
the public, as well as to inform decision-making in various
industries.
Weather forecasting has many practical applications, including predicting
storms, droughts, and heat waves, as well as forecasting crop yields and
informing transportation planning. It is a critical tool for emergency
management and disaster response, helping to save lives and minimize
property damage.

Q8. Explain in detail Product recommendation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Product recommendation is a technique in data science and machine
learning used to suggest products to users based on their past behavior
and preferences.
● This is done by analyzing the user's historical data, such as their
purchase history, search history, and clickstream data, and using
this information to make personalized recommendations.
● Product recommendation is a powerful technique in data science
and machine learning that can help businesses and organizations
provide personalized recommendations to users, leading to
increased sales and customer satisfaction.
● Here are the steps involved in product recommendation:
1. Data collection: The first step is to collect data on user
behavior, such as their purchase history, search history, and
clickstream data. This data can be collected from various
sources such as e-commerce websites, social media
platforms, and mobile apps.
2. Data preprocessing: Once the data has been collected, it
needs to be cleaned and processed. This involves removing
duplicates, filling in missing values, and handling any outliers
or anomalies.
3. Feature extraction: The next step is to extract features from
the data that are relevant to the product recommendation
task. For example, features such as the user's age, gender,
location, and previous purchases can be used to make
recommendations.
4. Recommendation engine: There are several
recommendation algorithms that can be used, including
collaborative filtering, content-based filtering, and hybrid
models. The choice of algorithm depends on the specific
problem and the characteristics of the data.
5. Model training: After choosing the recommendation
algorithm, the model needs to be trained on the historical
data to learn patterns and relationships between users and
products.
6. Model evaluation: Once the model has been trained, it needs
to be evaluated using various metrics such as precision,
recall, and F1 score. This helps to ensure that the model is
accurate and reliable.
7. Recommendation generation: Once the model has been
evaluated, it can be used to generate personalized
recommendations for users based on their past behavior and preferences.
● Product recommendation has many practical applications,
including in e-commerce, advertising, and entertainment.
● It can help businesses increase sales by suggesting relevant
products to customers, as well as improve customer satisfaction by
providing personalized recommendations.
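● As an illustration of the collaborative filtering idea mentioned in step 4, here is a minimal Python sketch that recommends items to a user based on the most similar other user (assuming NumPy; the ratings matrix is made up):

  import numpy as np

  # Rows = users, columns = items; 0 means the user has not rated or bought the item
  ratings = np.array([[5, 4, 0, 1],
                      [4, 5, 1, 0],
                      [1, 0, 5, 4],
                      [0, 1, 4, 5]], dtype=float)

  def cosine(u, v):
      return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

  target = 0                                      # recommend for user 0
  others = [u for u in range(len(ratings)) if u != target]
  sims = [cosine(ratings[target], ratings[u]) for u in others]
  most_similar = others[int(np.argmax(sims))]

  # Suggest items the similar user liked that the target user has not tried yet
  suggestions = np.where((ratings[target] == 0) & (ratings[most_similar] > 0))[0]
  print(most_similar, suggestions)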
