Brainheaters™ LLC
Brainheaters Notes
Applied Data Science (ADS)
COMPUTER Semester-8
BH.Index
(Learn as per the Priority to prepare smartly)
2. Data Exploration (Priority 1)
4. Anomaly Detection (Priority 2)
MODULE-1
Q1. Write a short note on Data Science. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data Science is a multidisciplinary field that involves the use of
statistical, computational, and machine learning techniques to extract
insights and knowledge from complex data sets.
● It encompasses a range of topics including data analysis, data
mining, machine learning, and artificial intelligence.
● Data Science involves a structured approach to analyzing and
interpreting large and complex data sets.
● This includes identifying patterns, making predictions, and
developing models that can be used to gain insights and drive
business decisions.
● Data Scientists use a variety of tools and techniques, such as
statistical programming languages like R and Python, to work with
data sets and extract meaningful information.
● The applications of Data Science are widespread and can be found
in various industries, such as healthcare, finance, marketing, and
transportation. It plays a crucial role in providing businesses with
valuable insights into customer behavior, market trends, and
operational efficiencies.
● To become a Data Scientist, one typically needs to have a strong
background in mathematics, statistics, and computer science.
● Many universities now offer Data Science programs, which provide
students with the necessary skills to work in this field.
● The demand for Data Scientists is growing rapidly, and it is
expected to continue to increase in the future.
Q2. Explain in detail the Data Science Process. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: The Data Science process is a structured and iterative approach to solving complex problems using data.
● It involves a set of steps that help to extract insights and knowledge
from data to inform business decisions. The following are the steps
involved in the Data Science process:
1. Problem Formulation:
In this step, the problem is identified, and the business
objective is defined. The goal is to determine what questions
need to be answered using data and to ensure that they
align with the business goals.
2. Data Collection:
The next step is to collect relevant data. This can involve
various sources, such as internal databases, public data
repositories, or web scraping. The data must be accurate,
complete, and representative of the problem.
3. Data Cleaning and Preparation:
In this step, the collected data is cleaned, pre-processed,
and transformed to ensure that it is consistent and ready for
analysis. This involves tasks such as removing duplicates,
handling missing values, and encoding categorical
variables.
4. Data Exploration:
The goal of this step is to understand the data and gain
insights. This involves the use of descriptive statistics, data
visualization, and exploratory analysis techniques to identify
patterns, correlations, and trends.
5. Feature Engineering:
This step involves the creation of new features or variables
from the existing data. This is done to improve the
performance of the predictive model or to gain further
insights into the problem.
6. Model Selection and Training:
In this step, a suitable model is selected to solve the
problem. This involves a trade-off between accuracy,
interpretability, and complexity. Once the model is selected,
it is trained on the data to learn the underlying patterns and
relationships.
7. Model Evaluation:
The performance of the model is evaluated using metrics
such as accuracy, precision, recall, and F1-score. The goal is
to determine whether the model is performing well or
whether it needs to be improved.
8. Model Deployment:
Once the model has been evaluated and deemed
acceptable, it can be deployed into production. This involves
integrating the model into the existing business processes or
systems.
9. Model Monitoring and Maintenance:
The final step involves monitoring the performance of the
model in the real-world environment. This involves tracking
the model's performance, detecting any drift, and retraining
the model as necessary.
● In summary, the Data Science process is a systematic and iterative approach that involves several steps, including problem formulation, data collection, data cleaning and preparation, data exploration, feature engineering, model selection and training, model evaluation, model deployment, and model monitoring and maintenance.
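As a rough illustration of steps 3 to 7, here is a minimal sketch using pandas and scikit-learn; the file name (customers.csv) and the column names (age, region, churned) are hypothetical.

```python
# A minimal, hypothetical sketch of the cleaning-to-evaluation steps.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("customers.csv")                 # Step 2: collect data
df = df.drop_duplicates()                         # Step 3: remove duplicates
df["age"] = df["age"].fillna(df["age"].median())  # Step 3: handle missing values
df = pd.get_dummies(df, columns=["region"])       # Step 3: encode categoricals

X = df.drop(columns=["churned"])                  # features
y = df["churned"]                                 # binary target (hypothetical)

# Step 6: train a model on a held-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 7: evaluate with accuracy and F1-score
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("F1-score:", f1_score(y_test, pred))
```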
Q4. Explain in detail Data Science Tasks and Examples. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Data Science tasks can be broadly classified into four categories:
descriptive analysis, exploratory analysis, predictive analysis, and
prescriptive analysis.
These tasks involve using statistical, machine learning, and data
visualization techniques to extract insights and knowledge from data. The
following are examples of each type of Data Science task:
1. Descriptive Analysis:
Descriptive analysis involves summarizing and describing the
characteristics of a data set. This includes measures such as mean,
median, mode, standard deviation, and frequency distribution.
Descriptive analysis is used to gain an understanding of the data
and to identify patterns and trends.
Examples of descriptive analysis include:
● Calculating the average age of customers in a retail store.
● Determining the percentage of males and females in a
population.
● Analyzing the distribution of income levels in a city.
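For illustration, the first example can be sketched in a few lines of pandas (the ages below are made up):

```python
# A minimal descriptive-analysis sketch with pandas; the data are made up.
import pandas as pd

ages = pd.Series([23, 35, 31, 47, 35, 29, 52, 41])

print("mean:", ages.mean())            # average age
print("median:", ages.median())        # middle value
print("mode:", ages.mode().tolist())   # most frequent value(s)
print("std:", ages.std())              # standard deviation
print(ages.value_counts())             # frequency distribution
```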
2. Exploratory Analysis:
Exploratory analysis involves visualizing and exploring the data to
identify relationships and patterns. This includes techniques such as
scatter plots, histograms, box plots, and heat maps. Exploratory
analysis is used to gain a deeper understanding of the data and to
identify potential insights.
Examples of exploratory analysis include:
● Visualizing the relationship between temperature and
humidity in a climate data set.
● Creating a histogram of the distribution of customer
purchase amounts in an e-commerce store.
● Plotting the distribution of crime rates in different areas of a
city.
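A minimal matplotlib sketch of the histogram example, with synthetic data standing in for a real e-commerce dataset:

```python
# Exploratory analysis: histogram of (synthetic) customer purchase amounts.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
purchases = rng.lognormal(mean=3.0, sigma=0.5, size=1000)  # synthetic amounts

plt.hist(purchases, bins=30, edgecolor="black")
plt.xlabel("Purchase amount")
plt.ylabel("Frequency")
plt.title("Distribution of customer purchase amounts")
plt.show()
```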
3. Predictive Analysis:
● Predictive analysis involves using machine learning and
statistical techniques to make predictions based on the
data. This includes techniques such as regression,
classification, clustering, and time series analysis. Predictive
analysis is used to make forecasts and predictions based on
past data.
Examples of predictive analysis include:
● Predicting the sales volume of a product based on past
sales data.
● Forecasting the stock price of a company based on
historical data.
● Identifying potential customer churn based on past
purchase behavior.
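A minimal scikit-learn sketch of the first example, fitting a linear trend to twelve months of made-up sales figures:

```python
# Predictive analysis: forecast next month's sales from past (made-up) sales.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)   # past 12 months as the feature
sales = np.array([100, 108, 115, 120, 131, 135, 142, 150, 158, 162, 170, 178])

model = LinearRegression().fit(months, sales)
print("forecast for month 13:", model.predict([[13]])[0])
```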
4. Prescriptive Analysis:
Prescriptive analysis involves using optimization and simulation
techniques to make recommendations based on the data. This
includes techniques such as linear programming, decision trees,
and Monte Carlo simulation. Prescriptive analysis is used to make
data-driven recommendations for decision-making.
Examples of prescriptive analysis include:
● Optimizing production schedules in a manufacturing plant
to minimize costs.
● Determining the optimal route for delivery trucks to minimize
travel time and distance.
● Recommending the most profitable investment portfolio
based on risk and return.
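A minimal linear-programming sketch with SciPy; the products, profits, and resource limits are invented for illustration:

```python
# Prescriptive analysis: choose production quantities x1, x2 to maximize
# profit 40*x1 + 30*x2 under machine-hour and labour-hour limits (made up).
from scipy.optimize import linprog

c = [-40, -30]            # profits are negated because linprog minimizes
A_ub = [[1, 2],           # machine hours used per unit of each product
        [3, 1]]           # labour hours used per unit of each product
b_ub = [40, 60]           # available machine hours and labour hours

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal production plan:", res.x)
print("maximum profit:", -res.fun)
```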
Q6. Write a short note on Modeling. (P4 -Appeared 1 Time) (5-10 Marks)
Ans: Modeling is a critical step in the Data Science process that involves
building mathematical or computational models to represent real-world
phenomena.
● The goal of modeling is to gain a better understanding of the data
and to make predictions or decisions based on the model's output.
The following are some key aspects of modeling in Data Science:
1. Model Selection: Choosing the appropriate model for a given
problem is critical to the success of modeling. The model
should be able to capture the essential features of the data
and should be able to make accurate predictions or
decisions.
MODULE-2
Q1. Write down Types of data. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: There are several types of data that are commonly used in Data
Science and other fields. These types of data include:
1. Nominal Data: Nominal data is categorical data that cannot be
ranked or ordered. Examples include gender, race, and type of car.
2. Ordinal Data: Ordinal data is categorical data that can be ranked or
ordered. Examples include education level (e.g., high school,
college, graduate degree), socioeconomic status (e.g., low, middle,
high), and customer satisfaction ratings (e.g., poor, fair, good,
excellent).
3. Interval Data: Interval data is numerical data that has a consistent
scale and unit of measurement, but does not have a true zero point.
Examples include temperature in Celsius or Fahrenheit and dates
on a calendar.
4. Ratio Data: Ratio data is numerical data that has a consistent scale
and unit of measurement, and does have a true zero point.
Examples include weight, height, and income.
5. Time Series Data: Time series data is a type of data where the
observations are recorded at regular intervals over time. Examples
include stock prices, weather data, and website traffic.
6. Text Data: Text data is unstructured data in the form of text that can
be analyzed using Natural Language Processing techniques.
Examples include social media posts, customer reviews, and news
articles.
7. Spatial Data: Spatial data is data that is associated with
geographic locations. Examples include GPS data, satellite imagery,
and maps.
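As a small illustration, pandas represents nominal data as unordered categoricals and ordinal data as ordered categoricals (the values below are made up):

```python
# Nominal vs. ordinal data in pandas, with made-up values.
import pandas as pd

# Nominal: categories with no inherent order
car_type = pd.Categorical(["SUV", "sedan", "SUV", "hatchback"])

# Ordinal: categories with a meaningful order
satisfaction = pd.Categorical(
    ["good", "poor", "excellent", "fair"],
    categories=["poor", "fair", "good", "excellent"],
    ordered=True,
)

print(car_type)
print("worst:", satisfaction.min(), "| best:", satisfaction.max())
```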
● The quartiles divide the data into quarters (the 25th, 50th, and 75th percentiles).
● Descriptive statistics is an important tool for analyzing and
summarizing data in a meaningful way. It is often used to provide a
baseline understanding of the data before more complex analyses
are performed.
● Measures of symmetry:
1. Skewness: Skewness is a measure of the asymmetry of the
distribution. A positive skew indicates that the tail of the
distribution is longer on the positive side, while a negative
skew indicates that the tail is longer on the negative side. A
skewness value of zero indicates that the distribution is
perfectly symmetrical.
2. Kurtosis: Kurtosis is a measure of the "peakedness" of the
distribution. A high kurtosis value indicates that the
distribution has a sharp peak and heavy tails, while a low
kurtosis value indicates a flat or rounded distribution.
● These measures are important because they provide valuable
information about the shape and variability of a dataset.
Understanding the spread and symmetry of a dataset can help in
making informed decisions about the data and in identifying any
patterns or outliers that may be present.
● In a positively skewed distribution, the mean typically lies to the right of the median. This is because there are more extreme values on the right side of the distribution.
● Skewness can be quantified using a number of different measures,
such as Pearson's moment coefficient of skewness, the Bowley
skewness, or the quartile skewness coefficient.
● Skewness is an important measure of distributional shape, as it
provides insight into the direction and degree of deviation from
symmetry.
● Skewed distributions can have important implications in statistical
analysis, as they can affect the interpretation of statistical tests,
such as the t-test or ANOVA.
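A minimal SciPy sketch computing both measures on a synthetic, right-skewed sample:

```python
# Skewness and kurtosis of a synthetic right-skewed sample.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)  # right-skewed by construction

print("skewness:", skew(data))             # positive => longer right tail
print("excess kurtosis:", kurtosis(data))  # 0 for a normal distribution
```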
Q11. Explain in detail Central Data Point. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: Central data point is a term used to describe a specific value that
represents the central tendency or central location of a dataset.
● In statistics, measures of central tendency are used to describe the
typical or most common value in a dataset.
● There are three commonly used measures of central tendency:
mean, median, and mode.
● The mean is calculated by adding up all the values in a dataset and dividing by the number of observations. The mean is a sensitive measure, as it can be strongly pulled by outliers and extreme values.
2. Poisson Distribution:
● The Poisson distribution is a discrete probability distribution
that is used to model the number of times an event occurs in
a fixed interval of time or space, given that the events occur
independently and at a constant rate.
● It is named after the French mathematician Siméon Denis Poisson. The Poisson distribution is characterized by a single parameter, λ (lambda), which represents the average number of events in a fixed interval.
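As a small illustration, the probability of observing k events can be computed with SciPy, assuming a rate of λ = 4 events per interval:

```python
# Poisson probabilities for an assumed rate of lambda = 4 events per interval.
from scipy.stats import poisson

lam = 4  # average number of events per interval (assumed for illustration)
for k in range(7):
    print(f"P(X = {k}) = {poisson.pmf(k, lam):.4f}")
```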
Q17. Write a short note on the Central limit theorem. (P4 - Appeared 1
Time) (5-10 Marks)
Ans: The Central Limit Theorem (CLT) is a key concept in data science that
is used to make statistical inferences about large datasets.
● It states that if a large number of independent and identically
distributed random variables are added or averaged together, the
resulting distribution will be approximately normal, regardless of the
underlying distribution of the individual variables.
● In practical terms, this means that if we have a large enough
sample size, we can use the CLT to make accurate estimates about
the population mean and standard deviation, even if we don't know
the underlying distribution of the data.
● This is important in data science because it allows us to draw
meaningful conclusions from data, even when we have incomplete
information.
● For example, the CLT is often used in hypothesis testing, where we
compare a sample mean to a hypothesized population mean.
● By calculating the standard error of the mean using the CLT, we can
estimate the probability of observing a sample mean as extreme as
the one we have, given the hypothesized population mean.
● This helps us determine whether the difference between the sample
mean and the hypothesized mean is statistically significant.
● Overall, the CLT is a fundamental concept in data science that helps
us make sense of large datasets and draw meaningful conclusions
from them.
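A minimal NumPy simulation of the CLT, averaging samples drawn from a decidedly non-normal (uniform) distribution:

```python
# CLT simulation: means of uniform samples are approximately normal.
import numpy as np

rng = np.random.default_rng(0)
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

# For Uniform(0, 1): population mean = 0.5, std of the mean = sqrt(1/12)/sqrt(50)
print("mean of sample means:", sample_means.mean())   # ~0.5
print("std of sample means:", sample_means.std())     # ~0.0408
print("theoretical std:", (1 / 12) ** 0.5 / 50 ** 0.5)
```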
Type-I Error:
● In data science, a Type-I error occurs when a statistical model or
algorithm incorrectly identifies a pattern or relationship in the data
as being significant, when in fact it is not.
● This is similar to a false positive in statistical hypothesis testing. For
example, suppose a predictive model incorrectly identifies a feature
as being important for predicting a target variable, when in fact it is
not.
Type-II Error
● In data science, a Type-II error occurs when a statistical model or
algorithm fails to identify a significant pattern or relationship in the
data, when in fact it is present. This is similar to a false negative in
statistical hypothesis testing.
● For example, suppose a predictive model fails to identify an
important feature for predicting a target variable, leading to
inaccurate predictions and missed opportunities. This can also
have serious consequences in fields such as healthcare or finance,
where missing an important trend or relationship can have
significant implications.
MODULE-3
Q3. Write a short note on Cross Validation. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: Cross-validation is a statistical technique used to evaluate the
performance of predictive models.
● The basic idea of cross-validation is to split the available data into
two or more subsets, one for training the model and the other for
testing the model's performance.
● This helps to ensure that the model is not overfitting to the training
data and provides an unbiased estimate of the model's
performance on new data.
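A minimal 5-fold cross-validation sketch with scikit-learn; the bundled Iris dataset is used purely for illustration:

```python
# 5-fold cross-validation of a classifier on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5 train/test splits
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```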
Q5. Explain in detail leave-one-out cross-validation. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: Leave-one-out cross-validation (LOOCV) is a variation of k-fold
cross-validation where k is equal to the number of samples in the dataset.
● In other words, LOOCV involves training the model on all but one of
the samples in the dataset, and using the remaining sample as the
test set.
● The basic steps involved in LOOCV are as follows:
1. Remove one sample from the dataset and use the
remaining samples to train the model.
2. Use the trained model to predict the outcome of the
removed sample.
3. Calculate the performance metric (such as accuracy,
precision, recall, or F1 score) for the removed sample.
4. Repeat steps 1-3 for each sample in the dataset.
5. Average the performance metrics over all samples to get an
overall estimate of the model's performance.
● The advantage of LOOCV is that it provides the most accurate
estimate of the model's performance possible, as each sample in
the dataset is used as the test set exactly once.
● However, LOOCV can be computationally expensive, especially for
large datasets, as it requires training the model on almost all of the
samples multiple times.
● LOOCV is useful for evaluating the performance of models that have
a small number of samples or that are prone to overfitting.
● By using LOOCV, data scientists can obtain a more accurate
estimate of the model's ability to generalize to new data and select
the best model for a given problem.
● However, LOOCV should be used judiciously, as it can be
computationally expensive and may not be necessary for all
datasets and models.
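A minimal LOOCV sketch with scikit-learn, again using the bundled Iris dataset for illustration:

```python
# Leave-one-out cross-validation: n folds, one held-out sample per fold.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("number of folds:", len(scores))   # equals the number of samples
print("LOOCV accuracy:", scores.mean())
```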
Q9. Write a short note on the Distribution Chart. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: A distribution chart is a graphical representation of the frequency
distribution of a dataset. It shows how the data is distributed across
different values or ranges. There are different types of distribution charts,
each of which is used for a specific purpose:
1. Histogram: A histogram is a type of distribution chart that shows the
frequency distribution of continuous data. It is used to visualize the
shape, center, and spread of the distribution of the data.
2. Box plot: A box plot is a type of distribution chart that shows the
distribution of continuous data using quartiles. It is used to identify
outliers, skewness, and the spread of the data.
3. Stem and leaf plot: A stem and leaf plot is a type of distribution
chart that shows the distribution of discrete data. It is used to
visualize the frequency distribution of the data.
4. Probability density function: A probability density function is a type
of distribution chart that shows the probability of a continuous
random variable falling within a certain range. It is used to model
and analyze continuous data.
5. Bar chart: A bar chart is a type of distribution chart that shows the
frequency distribution of categorical data. It is used to visualize the
distribution of data across different categories.
Distribution charts are an important tool in data analysis and are used to
gain insights into the distribution and characteristics of a dataset. They are
useful for identifying patterns, trends, and outliers in the data, and can be
used to inform decision-making in various fields, including finance,
healthcare, and marketing.
Q10. Discuss in detail Scatter Plot. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: A scatter plot is a type of data visualization that displays the
relationship between two variables in a dataset.
● It is a graph with points that represent individual data points and
show their positions along two axes.
● Here is an example of a scatter plot:
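The figure from the original notes is not reproduced here; the following matplotlib sketch draws an equivalent scatter plot from synthetic temperature and humidity data:

```python
# A scatter plot of two synthetic variables (temperature vs. humidity).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, size=100)
humidity = 100 - 2 * temperature + rng.normal(0, 5, size=100)  # synthetic link

plt.scatter(temperature, humidity)
plt.xlabel("Temperature (°C)")
plt.ylabel("Humidity (%)")
plt.title("Temperature vs. humidity")
plt.show()
```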
● In this scatter plot, the x-axis represents one variable, and the y-axis
represents the other variable. Each point represents a single data
point and its value for both variables. The pattern of the points on
the plot provides insight into the relationship between the two
variables.
Q11. Write a short note on Scatter Matrix. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: A scatter matrix is also known as a pair plot or pair scatter plot.
● It is a type of data visualization that allows for the exploration of the
relationships between multiple variables in a dataset.
● It consists of a matrix of scatter plots, with each scatter plot
showing the relationship between two variables.
● Here's an example of a scatter matrix:
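The original figure is likewise omitted; a scatter matrix can be drawn with pandas, here on the bundled Iris data for illustration:

```python
# A scatter matrix (pair plot) of the four Iris measurement columns.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.drop(columns=["target"])

pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```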
Q12. Write a short note on the Bubble chart. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: A bubble chart is a type of data visualization that is used to display
three dimensions of data in a two-dimensional space.
● In a bubble chart, data points are represented by bubbles of
varying sizes and colors. The x and y axes represent two variables,
and the size and color of the bubbles represent a third variable.
● Bubble charts are useful for displaying data with multiple
dimensions, as they allow for the visualization of relationships
between three variables.
● They are often used in business and economics to show market
trends, and in social science to visualize relationships between
multiple variables.
● When creating a bubble chart, it is important to choose an
appropriate size and color scheme to ensure that the bubbles are
easy to read and interpret.
● It is also important to label the axes and provide a clear legend to
explain the meaning of the bubble sizes and colors.
● In conclusion, bubble charts are a powerful tool for displaying data
with multiple dimensions. By using bubble charts to visualize and
analyze data, researchers and analysts can gain deeper insights
and make more informed decisions.
MODULE-4
Q1. Write a short note on Outliers. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: In statistics, outliers are data points that are significantly different from
other observations in a dataset.
● Outliers can have a significant impact on statistical analysis, as
they can affect the mean, standard deviation, and other measures
of central tendency and dispersion.
● Outliers can occur due to various reasons, such as measurement
errors, natural variation, extreme events, and data manipulation.
● Outliers can be detected using various techniques, such as box
plots, scatter plots, and statistical tests like Z-score, Tukey's method,
and Grubbs' test.
● However, identifying an outlier does not necessarily mean it should
be removed from the dataset.
● Outliers can sometimes provide valuable insights into the data and
need to be analyzed further.
● It is essential to understand the cause of the outliers and their
impact on the data before deciding whether to remove them or not.
● To summarize this we can say, outliers are data points that are
significantly different from other observations in a dataset, and they
can occur due to various reasons. Identifying and analyzing outliers
is essential for effective statistical analysis, but it is equally
important to understand their cause and impact before deciding
whether to remove them or not.
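A minimal sketch of two of the detection techniques mentioned above, the Z-score rule and Tukey's IQR rule, on made-up data with one injected outlier:

```python
# Outlier detection with the Z-score rule and Tukey's IQR rule.
import numpy as np

data = np.array([12, 14, 15, 13, 16, 14, 15, 250, 13, 14])  # 250 is injected

# Z-score rule: flag points far from the mean (2.5 is used here because the
# sample is small; 3 is a common cutoff for larger samples)
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 2.5])

# Tukey's IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])
```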
Q7. Write a short note on SMOTE. (P4 - Appeared 1 Time) (5-10 Marks)
Ans: SMOTE (Synthetic Minority Over-sampling Technique) is a technique
used in machine learning to address the problem of imbalanced datasets.
● In imbalanced datasets, the number of instances in the minority
class (e.g., fraud cases) is much smaller than the number of
instances in the majority class (e.g., non-fraud cases).
● This can lead to biased models that are unable to accurately
predict the minority class.
● SMOTE is a technique that generates synthetic samples of the
minority class to balance the dataset.
● The algorithm works by selecting a minority class instance and
computing its k nearest neighbors in the feature space.
● Synthetic instances are then created by interpolating between the
selected instance and its neighbors.
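A minimal sketch using the imbalanced-learn library, a common open-source implementation of SMOTE; the imbalanced dataset here is synthetic:

```python
# Balancing a synthetic imbalanced dataset with SMOTE (imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)
print("before:", Counter(y))     # roughly 95% majority, 5% minority

X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # the two classes are now balanced
```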
MODULE-5
Q3. Write a short note on the Average method. (P4 - Appeared 1 Time)
(5-10 Marks)
Ans: The average method is a simple and widely used forecasting
technique that involves calculating the average of past observations and
using it as a forecast for future periods.
● This method is based on the assumption that future values will be
similar to past values, and that the average provides a reasonable
estimate of the future trend.
● To use the average method, the first step is to collect historical data
for the time series. Then, the average of the past observations is
calculated and used as the forecast for the next period.
● This process is repeated for each future period, with the forecast for
each period equal to the average of the past observations.
● The average method is easy to use and does not require any
complex mathematical calculations or statistical analysis.
● It is often used as a benchmark or baseline method for comparing
the performance of more advanced forecasting techniques.
● However, the average method has several limitations, including:
1. It does not capture any trend or seasonal patterns in the
data, and assumes that future values will be the same as
past values.
2. It is sensitive to outliers and extreme values in the data,
which can affect the accuracy of the forecast.
3. It does not take into account any external factors or events
that may affect the time series, such as changes in the
economy or market conditions.
● Despite these limitations, the average method can be useful for
short-term forecasting of stable and relatively predictable time
series.
● A related technique, moving average (MA) smoothing, averages the observations within a sliding window of size k. The formula for calculating the moving average for each point in the series is:
MA(t) = (Y(t) + Y(t-1) + ... + Y(t-k+1)) / k
where Y(t) is the value at time t, k is the window size, and MA(t) is
the moving average at time t.
● MA smoothing has several advantages. It can help to reduce noise
in the data, making it easier to identify trends and patterns. It is also
a simple and intuitive method that requires only basic
mathematical skills.
● However, MA smoothing also has some limitations. One of the main
limitations is that it is sensitive to the choice of window size.
● A small window size may not smooth the data enough to reveal
meaningful patterns, while a large window size may smooth the
data too much and obscure important features.
● Another limitation is that it is a backward-looking method and may
not be suitable for predicting future values beyond the range of the
data used to calculate the moving averages.
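A minimal pandas sketch of MA smoothing with a window size of k = 3 on a made-up series:

```python
# Moving-average smoothing with a window of 3 observations.
import pandas as pd

y = pd.Series([10, 12, 13, 12, 15, 16, 18, 17, 19, 21])
ma = y.rolling(window=3).mean()   # MA(t) = (Y(t) + Y(t-1) + Y(t-2)) / 3

print(pd.DataFrame({"Y": y, "MA(3)": ma}))  # first two values are NaN
```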
Q5. Explain Time series analysis using linear regression. (P4 - Appeared
1 Time) (5-10 Marks)
Ans: Time series analysis is a statistical method used to analyze data that
is collected over a period of time.
● In time series analysis, the data is often plotted over time to identify
patterns, trends, and other useful information.
● Linear regression is a statistical method that can be used to analyze
time series data. Linear regression is a technique used to model the
relationship between a dependent variable and one or more
independent variables.
● In time series analysis, the dependent variable is usually the variable of interest that changes over time, and the independent variable is typically time itself (or values derived from it, such as lagged observations).
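A minimal scikit-learn sketch, regressing a synthetic series on its time index to estimate the trend:

```python
# Time series trend estimation by linear regression on the time index.
import numpy as np
from sklearn.linear_model import LinearRegression

t = np.arange(24).reshape(-1, 1)   # 24 time steps as the independent variable
y = 5.0 + 0.8 * t.ravel() + np.random.default_rng(0).normal(0, 1, 24)

model = LinearRegression().fit(t, y)
print("estimated trend slope:", model.coef_[0])       # close to 0.8
print("forecast for t = 24:", model.predict([[24]])[0])
```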
Q6. Write a short note on ARIMA Model. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: The ARIMA (Autoregressive Integrated Moving Average) model is a popular statistical method used for time series analysis and forecasting.
● It combines an autoregressive (AR) component of order p, a differencing (I) step of order d that makes the series stationary, and a moving average (MA) component of order q, and is written as ARIMA(p, d, q).
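A minimal statsmodels sketch; the synthetic series and the order (1, 1, 1) are chosen only for illustration:

```python
# Fitting an ARIMA(1, 1, 1) model and forecasting five periods ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 100)))  # synthetic trend

fit = ARIMA(series, order=(1, 1, 1)).fit()   # p=1 (AR), d=1 (I), q=1 (MA)
print(fit.forecast(steps=5))                 # forecast the next 5 periods
```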
Q7. Discuss in detail Mean Absolute Error. (P4 - Appeared 1 Time) (5-10
Marks)
Ans: Mean Absolute Error (MAE) is a popular metric used to measure the
accuracy of a predictive model.
● It is particularly useful for evaluating models that predict
continuous variables, such as regression models.
● MAE measures the average absolute difference between the actual
and predicted values of a variable.
● It is calculated by taking the absolute value of the difference
between each predicted value and its corresponding actual value,
and then taking the average of these differences.
● The formula for calculating MAE is:
MAE = (1/n) * Σ| yi - ŷi |
where yi is the actual value of the variable, ŷi is the predicted value
of the variable, n is the total number of observations, and Σ is the
summation symbol.
● MAE is a useful metric because it is easy to understand and
interpret. It measures the average size of the errors made by the
model, with larger errors contributing more to the overall score than
smaller errors.
● MAE is expressed in the same units as the variable being predicted,
which makes it easy to compare the accuracy of different models.
● One limitation of MAE is that it treats all errors as equally important,
regardless of whether they are positive or negative.
● This means that a model that consistently underestimates the
variable of interest will have the same MAE as a model that
consistently overestimates the variable, even though these errors
may have different implications for the practical use of the model.
● Despite this limitation, MAE is a widely used metric for evaluating the
accuracy of predictive models.
Q8. Write a short note on Root Mean Square Error. (P4 - Appeared 1
Time) (5-10 Marks)
Ans: Root Mean Square Error (RMSE) is a commonly used metric for
evaluating the accuracy of a predictive model.
● It measures the average magnitude of the errors made by the
model, with larger errors contributing more to the overall score than
smaller errors. RMSE is particularly useful for evaluating models that
predict continuous variables, such as regression models.
● The formula for calculating RMSE is:
RMSE = sqrt((1/n) * Σ(yi - ŷi)²)
where yi is the actual value of the variable,
ŷi is the predicted value of the variable,
n is the total number of observations,
Σ is the summation symbol, and sqrt is the square root function.
● RMSE is expressed in the same units as the variable being predicted,
which makes it easy to compare the accuracy of different models.
● One advantage of RMSE over Mean Absolute Error (MAE) is that it
gives more weight to larger errors, which may be more important to
consider in certain applications.
● However, one limitation of RMSE is that it is sensitive to outliers in the
data, which can inflate the value of the metric. Another limitation is
that it can be difficult to interpret in practical terms, since it does
not have a direct relationship to the performance of the model.
● Despite these limitations, RMSE is widely used as a metric for evaluating the accuracy of predictive models, especially in cases where large errors are particularly costly.
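A minimal scikit-learn sketch computing both MAE and RMSE on made-up actual and predicted values:

```python
# MAE and RMSE on made-up actual vs. predicted values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print("MAE:", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # weights large errors more
```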
MODULE-6
5. Model Evaluation: The performance of the time series model is evaluated using metrics such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). This helps to ensure that the model is accurate and reliable.
6. Forecasting: Once the time series model has been evaluated, it can
be used to make predictions about future values of the
time-dependent variable. This can help businesses and
organizations make better decisions and plan for the future.
Time series forecasting has many practical applications, including
predicting stock prices, forecasting demand for products, and estimating
energy consumption. It is widely used in industries such as finance, retail,
and manufacturing to gain insights from historical data and make better
decisions.