
Mid Sem Exam Question Bank Predictive Analytics

Unit 1: Introduction to Predictive Analytics

1) What is Predictive Analytics?

Predictive analytics is a type of data analytics that uses statistical, machine learning, and artificial intelligence (AI) techniques to identify patterns in data and predict future outcomes. It can be used to make predictions about a wide range of events, including customer behavior, product demand, fraud, and equipment failure.

2) Explain Analytics lifecycle in detail.

https://intellipaat.com/blog/tutorial/data-analytics-tutorial/data-analytics-lifecycle/
3) What is Machine Learning? Explain Supervised and Unsupervised learning
in brief.
Arthur Samuel described it as "the field of study that gives computers the
ability to learn without being explicitly programmed."
Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.
Tom Mitchell provides a more modern definition: "A computer program is said to
learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with
experience E."

Supervised Learning
● Supervised learning is a machine learning approach in which the training data sets are labeled.
● Each data set consists of training examples, i.e., pairs of an input (a feature vector) and the desired output (a supervisory signal) that go hand in hand.
● A supervised model is trained on this predetermined set of labeled training examples (features) and is then used to predict output values for new inputs that follow the same pattern.

Unsupervised Learning
● The data set is not labeled (or every data item carries the same label).
● The goal is to find a structure or pattern in the problem set without labeling the data.
● Since we do not know exactly what the data should be labeled as, we look for patterns rather than a "right or wrong" answer to the problem.
● There is no error signal indicating the correctness of a solution to the problem.
● We derive this structure by clustering the data based on relationships among the variables in the data.
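A minimal sketch contrasting the two paradigms, assuming scikit-learn and its bundled Iris data set are available (the choice of logistic regression and K-means here is purely illustrative):

# Supervised vs. unsupervised learning with scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are used to fit a classifier and measure accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised: labels are ignored; K-means only looks for structure in X.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [list(km.labels_).count(c) for c in range(3)])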

4) Differentiate between Covariance and Correlation.

Definition
  Covariance: A statistical term describing a systematic association between two random variables, where a change in one variable is mirrored by a change in the other.
  Correlation: Tells us the direction and strength of the relationship between multiple variables; it assesses the extent to which two or more random variables move together.

Interpretation and Scale of Values
  Covariance: A change in scale changes the value of covariance. A higher covariance means higher dependency, but interpreting covariance is difficult.
  Correlation: The correlation value remains unaffected by a change in scale. Correlation coefficients range from -1 to 1, which allows a more straightforward interpretation than covariance.

Relationship to the Units of Measurement
  Covariance: The variables' measurement units affect covariance, making it hard to compare covariance values across different datasets or across variables with different units.
  Correlation: Correlation coefficients have no units and do not depend on the units of measurement, allowing comparisons between variables with different units.

Standardization and Comparison Across Datasets
  Covariance: Covariance is not standardized, so comparing covariances across different datasets is challenging.
  Correlation: Correlation coefficients are standardized, so they can be compared directly across variables, datasets, or contexts.

Robustness to Outliers
  Covariance: Outliers hugely impact the value of covariance, so it is sensitive to their presence.
  Correlation: Correlation coefficients offer a more robust measure of the relationship between variables, as they are less susceptible to outliers.

Values
  Covariance: Lies between -infinity and +infinity.
  Correlation: Lies between -1 and 1.

Unit
  Covariance: Expressed in the product of the units of the two variables.
  Correlation: A unit-free measure.

Change in Scale
  Covariance: Even minor changes in scale affect covariance.
  Correlation: There is no change in correlation because of a change in scale.

Measure of Correlation
  Covariance: Covariance is not a standardized measure of correlation.
  Correlation: The scaled version of covariance.

Application
  Covariance: Market research, portfolio analysis, and risk assessment.
  Correlation: Medical research, data analysis, and forecasting.
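A minimal numerical sketch of the two measures, assuming NumPy is available (the hours/score values are invented for illustration):

import numpy as np

# Two illustrative variables: hours studied and exam score (hypothetical data)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 57.0, 66.0, 70.0, 78.0])

# Covariance: unit-dependent and unbounded
cov = np.cov(hours, score)[0, 1]

# Correlation: unit-free and bounded in [-1, 1]
corr = np.corrcoef(hours, score)[0, 1]

print("covariance:", cov)
print("correlation:", corr)

# Rescaling a variable changes covariance but leaves correlation unchanged
print(np.cov(hours * 60, score)[0, 1])        # covariance scales with the units
print(np.corrcoef(hours * 60, score)[0, 1])   # correlation is identical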

5) What is regression? Explain Simple Linear regression in detail


Regression is a statistical technique used in data analysis to model the relationship
between a dependent variable (also known as the target or response variable) and
one or more independent variables (also known as predictor or explanatory
variables). The primary goal of regression analysis is to understand and quantify
the relationship between these variables, which can help in making predictions,
drawing inferences, and understanding the underlying patterns in the data.

The two basic types of regression are simple linear regression and multiple linear
regression, although there are non-linear regression methods for more complicated
data and analysis. Simple linear regression uses one independent variable to
explain or predict the outcome of the dependent variable Y, while multiple linear
regression uses two or more independent variables to predict the outcome (while
holding all others constant).

Simple linear regression


Simple Linear Regression is a specific type of regression analysis that focuses on
modeling the relationship between a single independent variable and a single
dependent variable. It is called "simple" because it deals with only one predictor
variable. The mathematical equation for simple linear regression is typically
represented as:
Y = β0 + β1X + ϵ, where Y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope, and ϵ is the error term.
The goal in simple linear regression is to estimate the values of β0 and β1 that best
fit the data. This is typically done using a method called least squares, which
minimizes the sum of the squared differences between the actual values of Y and
the predicted values based on the linear equation.
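A minimal least-squares sketch, assuming NumPy is available (the hours/score data are invented; the closed-form formulas below are one of several equivalent ways to obtain the estimates):

import numpy as np

# Hypothetical data: hours studied (X) vs. test score (Y)
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
Y = np.array([50, 55, 65, 70, 74, 82], dtype=float)

# Closed-form least-squares estimates for Y = b0 + b1*X + e
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

Y_hat = b0 + b1 * X
sse = np.sum((Y - Y_hat) ** 2)   # sum of squared errors minimized by least squares

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}, SSE = {sse:.2f}")
print("prediction for 7 hours of study:", b0 + b1 * 7)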

6) Differentiate: Simple Linear and Multiple Linear regression.

Number of Independent Variables
  Simple Linear Regression: 1
  Multiple Linear Regression: More than 1 (usually 2 or more)

Equation
  Simple Linear Regression: Y = β0 + β1X + ϵ
  Multiple Linear Regression: Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ

Purpose
  Simple Linear Regression: Modeling the relationship between a single independent variable (X) and the dependent variable (Y).
  Multiple Linear Regression: Modeling the relationship between multiple independent variables (X1, X2, …, Xp) and the dependent variable (Y).

Complexity
  Simple Linear Regression: Simpler to implement and understand due to a single independent variable.
  Multiple Linear Regression: More complex due to handling multiple independent variables, requiring multivariate techniques.

Visualization
  Simple Linear Regression: Typically represented as a straight line on a scatterplot.
  Multiple Linear Regression: Represented as a hyperplane in a multidimensional space, making visualization challenging beyond three dimensions.

Interpretation of Coefficients
  Simple Linear Regression: Easy to interpret: β0 represents the intercept and β1 represents the slope.
  Multiple Linear Regression: More complex interpretation, as each β coefficient represents the change in Y for a one-unit change in the corresponding X, while holding the other variables constant.

Example Use Case
  Simple Linear Regression: Predicting a student's test score based on the number of hours spent studying.
  Multiple Linear Regression: Predicting a house's price based on features such as square footage, number of bedrooms, and neighborhood.

Limitations
  Simple Linear Regression: Limited ability to capture complex relationships between variables.
  Multiple Linear Regression: Can handle more complex relationships but may suffer from multicollinearity (high correlation between predictors) and overfitting if not properly managed.
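A minimal sketch of a multiple linear regression fit, assuming scikit-learn is available (the house-price numbers and features are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: [square footage, number of bedrooms]
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [1100, 2], [2350, 5]])
y = np.array([245000, 312000, 279000, 308000, 199000, 405000])  # prices

model = LinearRegression().fit(X, y)

print("intercept (b0):", model.intercept_)
print("coefficients (b1, b2):", model.coef_)   # change in price per unit of each feature
print("predicted price for 2000 sqft, 4 bedrooms:",
      model.predict(np.array([[2000, 4]]))[0])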

7) What is Autoregression?
Autoregression (AR) is a statistical modeling technique used in time series
analysis. It models the relationship between a variable and its own past values,
essentially using its own historical data to make predictions about future values. In
autoregressive models, the dependent variable (the variable being predicted) is
regressed on its own lagged (past) values.
The key idea behind autoregression is that past values of a time series can provide
valuable information for predicting future values, especially in cases where there is
some inherent temporal structure or autocorrelation in the data. Autoregressive
models are particularly useful when dealing with time-dependent data, such as
stock prices, weather patterns, or economic indicators.
The autoregressive model of order p, often denoted as AR(p), can be
mathematically represented as follows:
Yt = c + ϕ1Yt−1 + ϕ2Yt−2 + … + ϕpYt−p + ϵt, where c is a constant, ϕ1, …, ϕp are the autoregressive coefficients, and ϵt is a white-noise error term.

The autoregressive order p represents how many lagged values are included in the
model. For example, in an AR(1) model, only the immediately preceding value
(Yt−1) is used to predict Yt. In an AR(2) model, both Yt−1 and Yt−2 are used, and
so on.
To use autoregressive models effectively, one needs to determine the appropriate
order (p) of the model, which often involves statistical methods and diagnostics to
analyze the autocorrelation structure in the data. Autoregressive models can be
extended and combined with other time series models, such as moving average
(MA) models and autoregressive integrated moving average (ARIMA) models, to
handle more complex time series data with trends and seasonality.
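A minimal sketch of fitting an AR(2) model by ordinary least squares on lagged values, assuming only NumPy (libraries such as statsmodels provide full AR/ARIMA implementations; the simulated series below is purely illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: Y_t = 5 + 0.6*Y_{t-1} + 0.2*Y_{t-2} + eps_t
n = 500
y = np.zeros(n)
for t in range(2, n):
    y[t] = 5 + 0.6 * y[t - 1] + 0.2 * y[t - 2] + rng.normal()

# Build the lagged design matrix [1, Y_{t-1}, Y_{t-2}] and fit by least squares
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
target = y[2:]
c, phi1, phi2 = np.linalg.lstsq(X, target, rcond=None)[0]
print(f"estimated c={c:.2f}, phi1={phi1:.2f}, phi2={phi2:.2f}")

# One-step-ahead forecast from the last two observations
forecast = c + phi1 * y[-1] + phi2 * y[-2]
print("next-value forecast:", forecast)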

8) Explain TimeSeries Regression.


Time series analysis is a specific way of analyzing a sequence of data points
collected over time. In TSA, analysts record data points at consistent intervals over
a set period rather than just recording the data points intermittently or randomly.
Time series regression is a statistical method for predicting a future response based on the response
history (known as autoregressive dynamics) and the transfer of dynamics from relevant predictors.
Time series regression can help you understand and predict the behavior of dynamic systems from
experimental or observational data. Common uses of time series regression include modeling and
forecasting of economic, financial, biological, and engineering systems.

You can start a time series analysis by building a design matrix (Xt), also called a feature or regressor matrix, which can include current and past observations of predictors ordered by time (t). Then, apply ordinary least squares (OLS) to the multiple linear regression (MLR) model

yt = Xt β + et

to get an estimate of a linear relationship of the response (yt) to the design matrix. Here β represents the linear parameter estimates to be computed and (et) represents the innovation terms. This form can be generalized to the multivariate case (a vector yt), including exogenous inputs such as control signals and correlation effects in the residuals. For more difficult cases, the linear relationship can be replaced by a nonlinear one, yt = f(Xt, et), where f() is a nonlinear function such as a neural network.
Typically, time series modeling involves picking a model structure (such as an ARMA form or a
transfer function) and incorporating known attributes of the system such as non-stationarities. Some
examples are:

 Autoregressive integrated moving average with exogenous predictors (ARIMAX)


 Distributed lag models (transfer functions)
 State space models
 Spectral models
 Nonlinear ARX models
https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-to-time-
series-analysis/
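A minimal sketch of time series regression with lagged predictors, assuming pandas and NumPy (the monthly sales/ad-spend series is invented; specialized tools such as statsmodels would be used for ARIMAX-style models in practice):

import numpy as np
import pandas as pd

# Hypothetical monthly data: response y_t (sales) and one predictor (ad spend)
df = pd.DataFrame({
    "sales": [10, 12, 13, 15, 16, 18, 21, 22, 24, 27, 28, 31],
    "ads":   [ 2,  2,  3,  3,  4,  4,  5,  5,  6,  6,  7,  7],
}, dtype=float)

# Design matrix X_t: intercept, lagged response y_{t-1}, current and lagged predictor
df["sales_lag1"] = df["sales"].shift(1)
df["ads_lag1"] = df["ads"].shift(1)
df = df.dropna()

X = np.column_stack([np.ones(len(df)), df["sales_lag1"], df["ads"], df["ads_lag1"]])
y = df["sales"].to_numpy()

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimate of y_t = X_t * beta + e_t
print("estimated coefficients:", beta)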

9) Describe ANOVA.

 ANOVA stands for Analysis of Variance. It is a statistical test that is used to compare the
means of two or more groups.
 ANOVA is a parametric test, which means that it makes certain assumptions about the
data, such as that the data is normally distributed and that the variances of the groups are
equal.
 There are different types of ANOVA, depending on the number of independent variables
and the number of levels of each independent variable.
 The most common type of ANOVA is the one-way ANOVA, which is used to compare
the means of two or more groups when there is only one independent variable.
 Another common type of ANOVA is the two-way ANOVA, which is used to compare
the means of two or more groups when there are two independent variables.
 ANOVA can be used to test a variety of hypotheses, such as whether the means of two or
more groups are equal, or whether the mean of one group is different from the mean of
another group.
 The results of an ANOVA test are typically reported in a table that shows the F-statistic,
the p-value, and the degrees of freedom.
 The F-statistic is the ratio of the between-group variance to the within-group variance, and the
p-value measures the statistical significance of the difference between the group means.
 If the p-value is less than the significance level, then the null hypothesis is rejected,
which means that there is a significant difference between the means of the groups.

Here are some of the assumptions of ANOVA:

 The data is normally distributed.


 The variances of the groups are equal.
 The data is independent.

If any of these assumptions are violated, then the results of the ANOVA test may not be valid.

ANOVA is a powerful statistical tool that can be used to compare the means of two or more
groups. However, it is important to understand the assumptions of ANOVA and to make sure
that the data meets these assumptions before conducting the test.
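A minimal one-way ANOVA sketch, assuming SciPy is available (the three groups of test scores are invented for illustration):

from scipy import stats

# Hypothetical test scores for three teaching methods (one independent variable)
group_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Reject the null hypothesis of equal group means if p < significance level (e.g. 0.05)
if p_value < 0.05:
    print("At least one group mean differs significantly.")
else:
    print("No significant difference between group means.")
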
Unit 2: Data Pre-processing

10) Define Data preprocessing.


https://monkeylearn.com/blog/data-preprocessing/#transformation

What Is Data Preprocessing?


Data preprocessing is a step in the data mining and data analysis process that takes
raw data and transforms it into a format that can be understood and analyzed by
computers and machine learning.

Raw, real-world data in the form of text, images, video, etc., is messy. Not only may
it contain errors and inconsistencies, but it is often incomplete, and doesn’t have a
regular, uniform design.

Machines like to process nice and tidy information – they read data as 1s and 0s. So
calculating structured data, like whole numbers and percentages, is easy.
However, unstructured data, in the form of text and images must first be cleaned
and formatted before analysis.

Data Preprocessing Steps


Let’s take a look at the established steps you’ll need to go through to make sure
your data is successfully preprocessed.

1. Data quality assessment


2. Data cleaning
3. Data transformation
4. Data reduction

Data Preprocessing Importance


When using data sets to train machine learning models, you’ll often hear the
phrase “garbage in, garbage out.” This means that if you use bad or “dirty” data to
train your model, you’ll end up with a bad, improperly trained model that won’t
actually be relevant to your analysis.
Good, preprocessed data is even more important than the most powerful
algorithms, to the point that machine learning models trained with bad data could
actually be harmful to the analysis you’re trying to do – giving you “garbage” results.

Data preprocessing is the process of cleaning, transforming, and formatting data so that
it can be used for analysis and modeling. It is an important step in machine learning and
data science, as it can help to improve the accuracy and performance of models.

Data preprocessing can be divided into the following steps:

 Data cleaning: This involves identifying and correcting errors or inconsistencies


in the data. This can include removing duplicate records, correcting typos, and
filling in missing values.
 Data standardization: This involves converting the data into a common format so
that it can be easily manipulated and analyzed. This can include converting text
to numbers, converting dates to a standard format, and removing outliers.
 Data transformation: This involves changing the way the data is presented or
stored. This can include creating new features, aggregating data, and reducing
the dimensionality of the data.
 Feature selection: This involves selecting the most important features for the
analysis. This can be done using statistical methods or machine learning
algorithms.
 Data sampling: This involves selecting a subset of the data for analysis. This can
be done to reduce the size of the data or to improve the performance of the
analysis.

The specific steps involved in data preprocessing will vary depending on the type of
data and the goals of the analysis. However, the steps listed above are some of the
most common and important steps involved in data preprocessing.

Here are some examples of data preprocessing:

 Removing duplicate records from a dataset.


 Correcting typos in a dataset.
 Filling in missing values in a dataset.
 Converting text to numbers in a dataset.
 Converting dates to a standard format in a dataset.
 Removing outliers from a dataset.
 Creating new features from a dataset.
 Aggregating data in a dataset.
 Reducing the dimensionality of a dataset.
 Selecting the most important features for a machine learning model.
 Sampling a subset of a dataset for analysis.

Data preprocessing is an essential step in machine learning and data science. By


following the right steps, you can improve the quality and usability of your data, which
will lead to better insights and decisions.
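A minimal preprocessing sketch with pandas, assuming a small invented customer table (the column names and values are hypothetical):

import pandas as pd

# Hypothetical raw data with a duplicate row, a typo, and a missing value
raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara", "Dev"],
    "gender":   ["female", "female", "male", "femal", "male"],
    "income":   [52000.0, 52000.0, None, 61000.0, 48000.0],
    "signup":   ["2021-01-05", "2021-01-05", "2021-02-05", "2021-03-01", "2021-04-11"],
})

clean = raw.drop_duplicates().copy()                            # data cleaning: remove duplicate records
clean["gender"] = clean["gender"].replace({"femal": "female"})  # correct typos / unify descriptors
clean["income"] = clean["income"].fillna(clean["income"].median())  # fill missing values
clean["signup"] = pd.to_datetime(clean["signup"])               # standardize dates to one format

print(clean)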

11) Explain various steps taken for pre-processing data.


https://monkeylearn.com/blog/data-preprocessing/#transformation

Data Preprocessing Steps


Let’s take a look at the established steps you’ll need to go through to make sure
your data is successfully preprocessed.

1. Data quality assessment


2. Data cleaning
3. Data transformation
4. Data reduction

1. Data quality assessment

Take a good look at your data and get an idea of its overall quality, relevance to
your project, and consistency. There are a number of data anomalies and inherent
problems to look out for in almost any data set, for example:

 Mismatched data types: When you collect data from many different
sources, it may come to you in different formats. While the ultimate goal of
this entire process is to reformat your data for machines, you still need to
begin with similarly formatted data. For example, if part of your analysis
involves family income from multiple countries, you’ll have to convert each
income amount into a single currency.
 Mixed data values: Perhaps different sources use different descriptors for
features – for example, man or male. These value descriptors should all be
made uniform.
 Data outliers: Outliers can have a huge impact on data analysis results. For
example, if you're averaging test scores for a class and one student didn't
respond to any of the questions, their 0% could greatly skew the results.
 Missing data: Take a look for missing data fields, blank spaces in text, or
unanswered survey questions. This could be due to human error or
incomplete data. To take care of missing data, you’ll have to perform data
cleaning.

2. Data cleaning
Data cleaning is the process of adding missing data and correcting, repairing, or
removing incorrect or irrelevant data from a data set. Data cleaning is the most
important step of preprocessing because it will ensure that your data is ready to go
for your downstream needs.
Data cleaning will correct all of the inconsistent data you uncovered in your data
quality assessment. Depending on the kind of data you’re working with, there are
a number of possible cleaners you’ll need to run your data through.

Missing data

There are a number of ways to correct for missing data, but the two most common
are:

 Ignore the tuples: A tuple is an ordered list or sequence of numbers or


entities. If multiple values are missing within tuples, you may simply discard
the tuples with that missing information. This is only recommended for large
data sets, when a few ignored tuples won’t harm further analysis.
 Manually fill in missing data: This can be tedious, but is definitely
necessary when working with smaller data sets.

Noisy data

Data cleaning also includes fixing “noisy” data. This is data that includes
unnecessary data points, irrelevant data, and data that’s more difficult to group
together.

 Binning: Binning sorts data of a wide data set into smaller groups of more
similar data. It’s often used when analyzing demographics. Income, for
example, could be grouped: $35,000-$50,000, $50,000-$75,000, etc.
 Regression: Regression is used to decide which variables will actually apply
to your analysis. Regression analysis is used to smooth large amounts of
data. This will help you get a handle on your data, so you’re not
overburdened with unnecessary data.
 Clustering: Clustering algorithms are used to properly group data, so that it
can be analyzed with like data. They’re generally used in unsupervised
learning, when not a lot is known about the relationships within your data.

If you’re working with text data, for example, some things you should consider
when cleaning your data are:

 Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
 Translate all text into the language you’ll be working in
 Remove HTML tags
 Remove boilerplate email text
 Remove unnecessary blank text between words
 Remove duplicate data
After data cleaning, you may realize you have insufficient data for the task at hand.
At this point you can also perform data wrangling or data enrichment to add new
data sets and run them through quality assessment and cleaning again before
adding them to your original data.

3. Data transformation

With data cleaning, we’ve already begun to modify our data, but data
transformation will begin the process of turning the data into the proper format(s)
you’ll need for analysis and other downstream processes.

This generally happens in one or more of the below:

1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation
 Aggregation: Data aggregation combines all of your data together in a
uniform format.
 Normalization: Normalization scales your data into a regularized range so
that you can compare it more accurately. For example, if you’re comparing
employee loss or gain within a number of companies (some with just a dozen
employees and some with 200+), you’ll have to scale them within a specified
range, like -1.0 to 1.0 or 0.0 to 1.0.
 Feature selection: Feature selection is the process of deciding which
variables (features, characteristics, categories, etc.) are most important to
your analysis. These features will be used to train ML models. It’s important
to remember, that the more features you choose to use, the longer the
training process and, sometimes, the less accurate your results, because
some feature characteristics may overlap or be less present in the data.

Data preprocessing is the process of cleaning, transforming, and formatting data so that
it can be used for analysis and modeling. It is an important step in machine learning and
data science, as it can help to improve the accuracy and performance of models.

The following are some of the most common steps involved in data preprocessing:

1. Data cleaning: This involves identifying and correcting errors or inconsistencies


in the data. This can include removing duplicate records, correcting typos, and
filling in missing values.
2. Data standardization: This involves converting the data into a common format so
that it can be easily manipulated and analyzed. This can include converting text
to numbers, converting dates to a standard format, and removing outliers.
3. Data transformation: This involves changing the way the data is presented or
stored. This can include creating new features, aggregating data, and reducing
the dimensionality of the data.
4. Feature selection: This involves selecting the most important features for the
analysis. This can be done using statistical methods or machine learning
algorithms.
5. Data sampling: This involves selecting a subset of the data for analysis. This can
be done to reduce the size of the data or to improve the performance of the
analysis.

The specific steps involved in data preprocessing will vary depending on the type of
data and the goals of the analysis. However, the steps listed above are some of the
most common and important steps involved in data preprocessing.

Here are some additional details about each of the data preprocessing steps mentioned
above:
 Data cleaning: This is the most basic step in data preprocessing, and it involves
identifying and correcting errors or inconsistencies in the data. This can be a
time-consuming and challenging task, but it is essential for ensuring the quality of
the data.
 Data standardization: This step involves converting the data into a common
format so that it can be easily manipulated and analyzed. This can be done by
converting text to numbers, converting dates to a standard format, and removing
outliers.
 Data transformation: This step involves changing the way the data is presented
or stored. This can be done by creating new features, aggregating data, and
reducing the dimensionality of the data.
 Feature selection: This step involves selecting the most important features for the
analysis. This can be done using statistical methods or machine learning
algorithms.
 Data sampling: This step involves selecting a subset of the data for analysis. This
can be done to reduce the size of the data or to improve the performance of the
analysis.

Data preprocessing is an essential step in machine learning and data science. By


following the right steps, you can improve the quality and usability of your data, which
will lead to better insights and decisions.

12) What is Data cleaning? Explain various types of data cleaning methods.
https://monkeylearn.com/blog/data-preprocessing/#transformation

Data cleaning
Data cleaning is the process of adding missing data and correcting, repairing, or
removing incorrect or irrelevant data from a data set. Data cleaning is the most
important step of preprocessing because it will ensure that your data is ready to go
for your downstream needs.
Data cleaning will correct all of the inconsistent data you uncovered in your data
quality assessment. Depending on the kind of data you’re working with, there are
a number of possible cleaners you’ll need to run your data through.

Missing data

There are a number of ways to correct for missing data, but the two most common
are:

 Ignore the tuples: A tuple is an ordered list or sequence of numbers or


entities. If multiple values are missing within tuples, you may simply discard
the tuples with that missing information. This is only recommended for large
data sets, when a few ignored tuples won’t harm further analysis.
 Manually fill in missing data: This can be tedious, but is definitely
necessary when working with smaller data sets.

Noisy data

Data cleaning also includes fixing “noisy” data. This is data that includes
unnecessary data points, irrelevant data, and data that’s more difficult to group
together.

 Binning: Binning sorts data of a wide data set into smaller groups of more
similar data. It’s often used when analyzing demographics. Income, for
example, could be grouped: $35,000-$50,000, $50,000-$75,000, etc.
 Regression: Regression is used to decide which variables will actually apply
to your analysis. Regression analysis is used to smooth large amounts of
data. This will help you get a handle on your data, so you’re not
overburdened with unnecessary data.
 Clustering: Clustering algorithms are used to properly group data, so that it
can be analyzed with like data. They’re generally used in unsupervised
learning, when not a lot is known about the relationships within your data.

If you’re working with text data, for example, some things you should consider
when cleaning your data are:

 Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
 Translate all text into the language you’ll be working in
 Remove HTML tags
 Remove boilerplate email text
 Remove unnecessary blank text between words
 Remove duplicate data
After data cleaning, you may realize you have insufficient data for the task at hand.
At this point you can also perform data wrangling or data enrichment to add new
data sets and run them through quality assessment and cleaning again before
adding them to your original data.
Data cleaning is the process of identifying and correcting errors or inconsistencies in
data. It is an important step in data preparation and analysis, as it can help to ensure
that the data is accurate and reliable.

There are many different types of data cleaning methods, each with its own advantages
and disadvantages. Some of the most common methods include:

 Data validation: This involves checking the data for errors such as typos, missing
values, and out-of-range values.
 Data standardization: This involves converting the data into a common format so
that it can be easily manipulated and analyzed.
 Data cleansing: This involves removing noise from the data, such as outliers and
duplicate records.
 Data integration: This involves combining data from different sources into a
single dataset.
 Data transformation: This involves changing the way the data is presented or
stored.

The best data cleaning method for a particular dataset will depend on the specific needs
of the project. However, some general guidelines can be followed:

 Start by identifying the most common errors in the data.


 Use a combination of automated and manual methods to clean the data.
 Test the data to make sure that the errors have been corrected.
 Document the data cleaning process so that it can be repeated if necessary.

Data cleaning is an essential step in data preparation and analysis. By following the
right methods, you can ensure that your data is accurate and reliable, which will lead to
better insights and decisions.
Here are some additional details about each of the data cleaning methods mentioned
above:

 Data validation: This is the most basic type of data cleaning, and it involves
checking the data for errors such as typos, missing values, and out-of-range
values. This can be done manually or using a data validation tool.
 Data standardization: This involves converting the data into a common format so
that it can be easily manipulated and analyzed. For example, if you have data in
different date formats, you can standardize it by converting it all to the same
format.
 Data cleansing: This involves removing noise from the data, such as outliers and
duplicate records. Outliers are data points that are significantly different from the
rest of the data. Duplicate records are data points that appear multiple times in
the dataset.
 Data integration: This involves combining data from different sources into a
single dataset. This can be a complex task, as it requires ensuring that the data
is compatible and that the different sources are aligned.
 Data transformation: This involves changing the way the data is presented or
stored. For example, you might want to convert the data into a different format or
create a summary of the data.

Data cleaning can be a time-consuming and challenging task, but it is essential for
ensuring the quality and usability of data. By following the right methods, you can
improve the accuracy and reliability of your data, which will lead to better insights and
decisions.
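A minimal sketch of the missing-data and binning ideas above, assuming pandas is available (the income values are invented):

import pandas as pd

incomes = pd.DataFrame({"income": [38000, 42000, None, 51000, 69000, None, 72000, 350000]})

# Missing data: either drop the affected rows ("ignore the tuples") ...
dropped = incomes.dropna()

# ... or fill them in, here using the median as a simple imputation choice
filled = incomes.fillna(incomes["income"].median())

# Noisy data / binning: group incomes into broader, more comparable ranges
bins = [0, 35000, 50000, 75000, float("inf")]
labels = ["<35k", "35k-50k", "50k-75k", ">75k"]
filled["income_band"] = pd.cut(filled["income"], bins=bins, labels=labels)

print(filled)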

13) What is Data Transformation?


https://monkeylearn.com/blog/data-preprocessing/#transformation
Data transformation

With data cleaning, we’ve already begun to modify our data, but data
transformation will begin the process of turning the data into the proper format(s)
you’ll need for analysis and other downstream processes.

This generally happens in one or more of the below:

1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation
 Aggregation: Data aggregation combines all of your data together in a
uniform format.
 Normalization: Normalization scales your data into a regularized range so
that you can compare it more accurately. For example, if you’re comparing
employee loss or gain within a number of companies (some with just a dozen
employees and some with 200+), you’ll have to scale them within a specified
range, like -1.0 to 1.0 or 0.0 to 1.0.
 Feature selection: Feature selection is the process of deciding which
variables (features, characteristics, categories, etc.) are most important to
your analysis. These features will be used to train ML models. It’s important
to remember, that the more features you choose to use, the longer the
training process and, sometimes, the less accurate your results, because
some feature characteristics may overlap or be less present in the data.
 Discretization: Discretization pools data into smaller intervals. It’s
somewhat similar to binning, but usually happens after data has been
cleaned. For example, when calculating average daily exercise, rather than
using the exact minutes and seconds, you could join together data to fall into
0-15 minutes, 15-30, etc.
 Concept hierarchy generation: Concept hierarchy generation can add a
hierarchy within and between your features that wasn’t present in the
original data. If your analysis contains wolves and coyotes, for example, you
could add the hierarchy for their genus: Canis.
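A minimal sketch of normalization and discretization, assuming NumPy and pandas are available (the daily exercise minutes are invented):

import numpy as np
import pandas as pd

minutes = pd.Series([5, 12, 17, 29, 33, 48, 55, 61], name="daily_exercise_min")

# Normalization: rescale to the 0.0-1.0 range so different features become comparable
normalized = (minutes - minutes.min()) / (minutes.max() - minutes.min())

# Discretization: pool exact values into 15-minute intervals
intervals = pd.cut(minutes, bins=[0, 15, 30, 45, 60, np.inf],
                   labels=["0-15", "15-30", "30-45", "45-60", "60+"])

print(pd.DataFrame({"minutes": minutes, "normalized": normalized, "interval": intervals}))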

14) What is Data? Explain Categorical and Numerical data in detail.

Data is a collection of facts or figures that can be used to answer questions or make decisions. It
can be qualitative or quantitative.

 Qualitative data is non-numerical data that describes characteristics or attributes of a


population. It is often collected through surveys, interviews, or observations. Examples of
qualitative data include:

o Gender
o Marital status
o Race
o Occupation
o Education level
o Product preference
o Customer satisfaction
 Quantitative data is numerical data that can be counted or measured. It is often collected
through experiments or surveys. Examples of quantitative data include:

o Height
o Weight
o Income
o Sales
o Temperature
o Number of employees
o Number of customers

Categorical data and numerical data are two main types of data.

 Categorical data is data that can be classified into categories. It is often used to describe
characteristics of a population. Examples of categorical data include:
o Gender (male, female)
o Marital status (single, married, divorced)
o Occupation (doctor, lawyer, teacher)
o Product preference (Apple, Samsung, Google)
o Customer satisfaction (very satisfied, satisfied, dissatisfied, very dissatisfied)

Categorical data can be further divided into two types:
o Nominal data is categorical data that does not have a natural order. Examples of nominal data include gender, marital status, and product preference.
o Ordinal data is categorical data that has a natural order. Examples of ordinal data include customer satisfaction and educational level.

 Numerical data is data that can be measured or counted. It is often used to describe
quantities or variables. Examples of numerical data include:
o Height
o Weight
o Income
o Sales
o Temperature
o Number of employees
o Number of customers

Numerical data can be further divided into two types:
o Discrete data is numerical data that can be counted. Examples of discrete data include the number of employees or the number of customers.
o Continuous data is numerical data that can take any value within a range and is measured rather than counted. Examples of continuous data include height, weight, and temperature.
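A minimal sketch of how the two data types are typically handled in pandas, assuming an invented survey table (the column names and ordering of satisfaction levels are illustrative):

import pandas as pd

survey = pd.DataFrame({
    "gender":        ["male", "female", "female", "male"],                              # nominal categorical
    "satisfaction":  ["satisfied", "very satisfied", "dissatisfied", "satisfied"],      # ordinal categorical
    "income":        [48000.0, 61000.5, 52500.0, 58000.0],                              # continuous numerical
    "num_purchases": [3, 7, 2, 5],                                                      # discrete numerical
})

# Ordinal categories keep their natural order; nominal ones do not
order = ["very dissatisfied", "dissatisfied", "satisfied", "very satisfied"]
survey["satisfaction"] = pd.Categorical(survey["satisfaction"], categories=order, ordered=True)

print(survey.dtypes)                    # categorical vs. numeric column types
print(survey.describe(include="all"))   # numeric columns get mean/std, categorical get counts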

15) Define features.


Features refer to the individual measurable properties or characteristics of data that
are used as input variables for modeling, analysis, or prediction. Features are
essentially the variables or attributes that describe and represent the data points in a
dataset.

16) Justify the importance of Data reduction in data pre-processing.


Data reduction is a vital step in data pre-processing with several significant
justifications:
1. Improved Efficiency: High-dimensional datasets with a large number of features can be computationally expensive and time-consuming to process. Data reduction techniques, such as dimensionality reduction, can significantly speed up the training and evaluation of machine learning models. Fewer features mean less computation, which is especially important when working with large datasets.
2. Overfitting Prevention: High-dimensional data can lead to overfitting, where a model performs well on the training data but poorly on new, unseen data. Overfitting occurs when a model learns noise or spurious relationships in the data. By reducing the number of features, you reduce the model's complexity and its tendency to overfit.
3. Simpler Models: Models with fewer features are simpler and easier to interpret. Simplicity is desirable because it allows you to gain insights into the relationships between variables and the model's decision-making process. Simpler models are also more likely to generalize well to new data.
4. Noise Reduction: High-dimensional data often contains noisy or irrelevant features that do not contribute to the prediction or analysis. Data reduction techniques help eliminate or reduce the impact of such noise, improving the signal-to-noise ratio in the data.
5. Visualization: Reducing the dimensionality of data makes it easier to visualize. Two- or three-dimensional data can be visualized on a scatterplot or a 3D plot, making it more accessible for exploratory data analysis and communication of results.
6. Collinearity Management: In some datasets, features are highly correlated (collinear), which can lead to multicollinearity issues in regression models. Data reduction can help manage collinearity by selecting a subset of uncorrelated or less correlated features.
7. Memory Usage: Large datasets can consume significant amounts of memory. Reducing the data's dimensionality reduces memory usage, making it more manageable for storage and processing.
8. Feature Engineering Focus: Data reduction allows data scientists and analysts to focus their efforts on feature engineering, where they create or select the most informative and relevant features. This can lead to better model performance and insights.
9. Preserving Key Information: Effective data reduction techniques aim to retain as much important information as possible while discarding less valuable information. Dimensionality reduction methods like Principal Component Analysis (PCA) accomplish this by capturing the most significant patterns in the data.
10. Comprehensibility: In some cases, reducing data dimensionality leads to a more interpretable dataset. This can be especially important in fields like healthcare, finance, and social sciences, where interpretability is a key requirement.
In summary, data reduction is essential in data pre-processing because it addresses
the challenges associated with high-dimensional data, leading to more efficient,
interpretable, and accurate data analysis and modeling. By selecting or creating a
subset of relevant features, data reduction simplifies the data while preserving its
essential characteristics, ultimately enhancing the quality of the analysis or
prediction tasks.
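A minimal dimensionality-reduction sketch with PCA, assuming scikit-learn and its bundled Iris data are available (4 features reduced to 2 principal components):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # 150 samples x 4 features

X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive, so standardize first

pca = PCA(n_components=2)                     # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X_scaled)

print("reduced shape:", X_reduced.shape)                       # (150, 2)
print("variance explained:", pca.explained_variance_ratio_)    # share of information retained
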
17) How do you handle missing values?

Handling missing values in a dataset is a crucial step in data preprocessing, as missing


data can lead to biased or inaccurate results when analyzing or building machine
learning models. Here are several common strategies to handle missing values in a
dataset:

1. Identify Missing Values:


 Begin by identifying and understanding the extent of missing data in your
dataset. You can use summary statistics or data visualization techniques to
visualize missing values.
2. Remove Rows with Missing Values:
 One straightforward approach is to remove rows (samples) with missing
values. This is a simple solution but may lead to a loss of valuable
information, especially if the missing values are not randomly distributed.
3. Remove Columns with Many Missing Values:
 If a column (feature) has a high percentage of missing values and it's not
crucial for your analysis or modeling, you can consider removing it from
the dataset.
4. Impute Missing Values:
 Imputation involves filling in missing values with estimated or calculated
values. Here are some common imputation techniques:
a. Mean/Median/Mode Imputation: Fill missing values in a numeric column
with the mean, median, or mode of that column.
b. Constant Value Imputation: Replace missing values with a predefined
constant value (e.g., 0 or -1).
c. Interpolation: For time series data or data with a natural order, you can use
interpolation methods to estimate missing values based on neighboring values.
d. Regression Imputation: Use regression models to predict missing values
based on other features in the dataset.
5. Use Advanced Imputation Techniques:
 Some advanced imputation methods, such as K-nearest neighbors
imputation, matrix factorization techniques (e.g., SVD), or machine learning
models (e.g., decision trees or random forests), can be more effective in
capturing complex relationships in the data.
6. Create Indicator Variables:
 For categorical data, create an indicator variable (dummy variable) to
represent missing values in a separate category. This allows the model to
learn whether missingness is informative.
7. Consider Domain Knowledge:
 Depending on the nature of your data and the problem you're solving,
consider consulting domain experts to determine the best approach for
handling missing values.
8. Multiple Imputations:
 In some cases, it may be appropriate to perform multiple imputations to
account for the uncertainty introduced by imputing missing values. This
can be especially useful in cases where missingness is not completely
random.
9. Regularization Techniques:
 In machine learning, some algorithms, such as Lasso or Ridge regression,
handle missing values as a part of their regularization process. These
models can be useful when you have a large number of features with
missing data.
10. Record Missingness Information:
 Create an additional binary column to indicate whether a value in a
particular feature was missing or not. This can provide additional
information to the model.
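A minimal imputation sketch, assuming pandas and scikit-learn are available (the small table and its column names are invented for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 44, np.nan, 52],
    "income": [42000, 55000, np.nan, 61000, 47000, np.nan],
    "city":   ["Pune", "Mumbai", None, "Pune", "Delhi", "Mumbai"],
})

# 1) Record missingness before imputing (indicator variable)
df["income_missing"] = df["income"].isna().astype(int)

# 2) Mean imputation for the numeric columns
num_imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = num_imputer.fit_transform(df[["age", "income"]])

# 3) Mode (most frequent) imputation for the categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
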
18) Differentiate between Underfitting and Overfitting.

Definition
  Underfitting: The model is too simple to capture the underlying patterns.
  Overfitting: The model is too complex and captures noise.

Characteristics
  Underfitting: High training error, high bias, poor generalization.
  Overfitting: Low training error, high variance, poor generalization.

Causes
  Underfitting: Very simple model, insufficient training, over-regularization.
  Overfitting: Overly complex model, excessive training, too few examples.

Solutions
  Underfitting: Use a more complex model, increase training data, reduce regularization.
  Overfitting: Use a simpler model, increase training data, apply regularization, use cross-validation.

Training Error
  Underfitting: High training error, indicating a poor fit to the training data.
  Overfitting: Low training error, suggesting a good fit to the training data.

Test Error
  Underfitting: Typically high test error; poor performance on new data.
  Overfitting: High test error; poor performance on new data due to memorization of noise.

Bias-Variance Tradeoff
  Underfitting: Often associated with high bias and low variance.
  Overfitting: Characterized by low bias and high variance.
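A minimal sketch contrasting the two behaviours, assuming scikit-learn is available (polynomial degree stands in for model complexity, the noisy sine data are simulated, and the exact error values will vary):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # degree 1 tends to underfit, 4 is reasonable, 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")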
