https://intellipaat.com/blog/tutorial/data-analytics-tutorial/data-analytics-lifecycle/#:~:text=What%20is%20Data%20Analytics%20Life,working%20on%20data%20analytics%20initiatives.
3) What is Machine Learning? Explain Supervised and Unsupervised learning
in brief.
Arthur Samuel described it as: "The field of study that gives computers the ability to learn without being explicitly programmed."
Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.
Tom Mitchell provides a more modern definition: "A computer program is said to
learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with
experience E."
Supervised Learning
● Supervised Learning is a branch of Machine Learning in which the training data sets are labeled.
● The data sets consist of training examples, each pairing an input (a feature vector) with the desired output (the supervisory signal).
● A supervised model learns from this predetermined set of training examples (features) so that it can predict output values for new inputs that follow the same patterns.
Unsupervised Learning
● The data set is not labeled, or every data item carries the same label.
● We look for structure or patterns in the problem set without labeling the data.
● Because we do not know what the data should be labeled as, we look for patterns rather than a "right or wrong" answer to the problem.
● There is no error signal to measure the correctness of a candidate solution.
● We derive this structure by clustering the data based on relationships among the variables in the data, as illustrated in the sketch below.
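As a minimal illustration of the two paradigms (a sketch assuming scikit-learn is installed; the iris dataset and model choices are just convenient examples, not prescribed ones):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model is given inputs X *and* labels y (the supervisory signal).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised predictions:", clf.predict(X[:3]))

# Unsupervised: the model sees only X and must discover structure (clusters) itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_[:3])
```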
The two basic types of regression are simple linear regression and multiple linear regression, although there are non-linear regression methods for more complicated data and analysis. Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, whereas multiple linear regression uses two or more independent variables to predict the outcome (each coefficient holding the other variables constant).
| Aspect | Simple Linear Regression | Multiple Linear Regression |
| --- | --- | --- |
| Number of Independent Variables | 1 | More than 1 (usually 2 or more) |
| Purpose | Modeling the relationship between a single independent variable (X) and the dependent variable (Y). | Modeling the relationship between multiple independent variables (X1, X2, …, Xp) and the dependent variable (Y). |
| Complexity | Simpler to implement and understand due to a single independent variable. | More complex due to handling multiple independent variables, requiring multivariate techniques. |
| Visualization | Typically represented as a straight line on a scatterplot. | Represented as a hyperplane in multidimensional space, making visualization challenging beyond three dimensions. |
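To make the difference concrete, here is a minimal sketch (assuming scikit-learn and NumPy; the toy data is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: one independent variable X.
X_simple = rng.uniform(0, 10, size=(50, 1))
y_simple = 3.0 * X_simple[:, 0] + rng.normal(0, 1, 50)
simple = LinearRegression().fit(X_simple, y_simple)

# Multiple linear regression: several independent variables X1, X2, X3.
X_multi = rng.uniform(0, 10, size=(50, 3))
y_multi = X_multi @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, 50)
multiple = LinearRegression().fit(X_multi, y_multi)

print("Simple model slope:", simple.coef_)       # one coefficient
print("Multiple model slopes:", multiple.coef_)  # one coefficient per predictor
```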
7) What is Autoregression?
Autoregression (AR) is a statistical modeling technique used in time series
analysis. It models the relationship between a variable and its own past values,
essentially using its own historical data to make predictions about future values. In
autoregressive models, the dependent variable (the variable being predicted) is
regressed on its own lagged (past) values.
The key idea behind autoregression is that past values of a time series can provide
valuable information for predicting future values, especially in cases where there is
some inherent temporal structure or autocorrelation in the data. Autoregressive
models are particularly useful when dealing with time-dependent data, such as
stock prices, weather patterns, or economic indicators.
The autoregressive model of order p, often denoted as AR(p), can be
mathematically represented as follows:
Y_t = c + ϕ_1·Y_{t−1} + ϕ_2·Y_{t−2} + … + ϕ_p·Y_{t−p} + ε_t

where c is a constant, ϕ_1, …, ϕ_p are the model coefficients, and ε_t is a white-noise error term.
The autoregressive order p represents how many lagged values are included in the model. For example, in an AR(1) model, only the immediately preceding value (Y_{t−1}) is used to predict Y_t. In an AR(2) model, both Y_{t−1} and Y_{t−2} are used, and so on.
To use autoregressive models effectively, one needs to determine the appropriate
order (p) of the model, which often involves statistical methods and diagnostics to
analyze the autocorrelation structure in the data. Autoregressive models can be
extended and combined with other time series models, such as moving average
(MA) models and autoregressive integrated moving average (ARIMA) models, to
handle more complex time series data with trends and seasonality.
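As a hedged illustration, an AR(2) model can be fitted as below, assuming the statsmodels package is available (the simulated series and its coefficients are invented for the example):

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(42)

# Simulate an AR(2) process: Y_t = c + phi1*Y_{t-1} + phi2*Y_{t-2} + eps_t
y = np.zeros(300)
for t in range(2, 300):
    y[t] = 0.5 + 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

res = AutoReg(y, lags=2).fit()          # order p = 2
print(res.params)                       # estimates of c, phi1, phi2
print(res.predict(start=300, end=304))  # forecast the next 5 values
```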
You can start a time series analysis by building a design matrix (X_t), also called a feature or regressor matrix, which can include current and past observations of predictors ordered by time (t). Then, apply ordinary least squares (OLS) to the multiple linear regression (MLR) model

y_t = X_t·β + e_t

to get an estimate of a linear relationship of the response (y_t) to the design matrix. β represents the linear parameter estimates to be computed and (e_t) represents the innovation terms. This form can be generalized to the multivariate case (vector y_t), including exogenous inputs such as control signals, and correlation effects in the residuals. For more difficult cases, the linear relationship can be replaced by a nonlinear one, y_t = f(X_t, e_t), where f() is a nonlinear function such as a neural network.
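The OLS step described above can be sketched directly with NumPy (the random-walk series, the lag count p = 2, and the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=200))  # a toy time series (random walk)

p = 2  # number of lagged observations in the design matrix
target = y[p:]  # y_t
# Each row of X is [1, y_{t-1}, y_{t-2}], observations ordered by time t.
X = np.column_stack([np.ones(len(y) - p)] +
                    [y[p - k:len(y) - k] for k in range(1, p + 1)])

beta, *_ = np.linalg.lstsq(X, target, rcond=None)  # OLS estimate of beta
print("Estimated [c, phi1, phi2]:", beta)
```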
Typically, time series modeling involves picking a model structure (such as an ARMA form or a transfer function) and incorporating known attributes of the system, such as non-stationarities.
9) Describe ANOVA.
ANOVA stands for Analysis of Variance. It is a statistical test that is used to compare the
means of two or more groups.
ANOVA is a parametric test, which means that it makes certain assumptions about the
data, such as that the data is normally distributed and that the variances of the groups are
equal.
There are different types of ANOVA, depending on the number of independent variables
and the number of levels of each independent variable.
The most common type of ANOVA is the one-way ANOVA, which is used to compare
the means of two or more groups when there is only one independent variable.
Another common type of ANOVA is the two-way ANOVA, which is used to compare
the means of two or more groups when there are two independent variables.
ANOVA can be used to test a variety of hypotheses, such as whether the means of two or
more groups are equal, or whether the mean of one group is different from the mean of
another group.
The results of an ANOVA test are typically reported in a table that shows the F-statistic,
the p-value, and the degrees of freedom.
The F-statistic is the ratio of the variance between the groups to the variance within the groups, and the p-value is a measure of the significance of the difference between the groups.
If the p-value is less than the significance level, then the null hypothesis is rejected,
which means that there is a significant difference between the means of the groups.
If any of these assumptions (normality, equal variances) are violated, then the results of the ANOVA test may not be valid.
ANOVA is a powerful statistical tool that can be used to compare the means of two or more
groups. However, it is important to understand the assumptions of ANOVA and to make sure
that the data meets these assumptions before conducting the test.
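As an illustration, a one-way ANOVA can be run with SciPy (the three sample groups below are made up):

```python
from scipy import stats

group_a = [85, 90, 88, 92, 87]
group_b = [78, 82, 80, 85, 79]
group_c = [91, 95, 89, 94, 93]

# f_oneway returns the F-statistic and the p-value for the null hypothesis
# that all group means are equal.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# If p is below the chosen significance level (e.g. 0.05), reject the null.
```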
Unit 2: Data Pre-processing
Raw, real-world data in the form of text, images, video, etc., is messy. Not only may
it contain errors and inconsistencies, but it is often incomplete, and doesn’t have a
regular, uniform design.
Machines like to process nice and tidy information – they read data as 1s and 0s. So calculating structured data, like whole numbers and percentages, is easy. However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.
Data preprocessing is the process of cleaning, transforming, and formatting data so that
it can be used for analysis and modeling. It is an important step in machine learning and
data science, as it can help to improve the accuracy and performance of models.
The specific steps involved in data preprocessing will vary depending on the type of data and the goals of the analysis. However, the steps described below are some of the most common and important ones.
1. Data quality assessment
Take a good look at your data and get an idea of its overall quality, relevance to your project, and consistency. There are a number of data anomalies and inherent problems to look out for in almost any data set, for example:
Mismatched data types: When you collect data from many different
sources, it may come to you in different formats. While the ultimate goal of
this entire process is to reformat your data for machines, you still need to
begin with similarly formatted data. For example, if part of your analysis
involves family income from multiple countries, you’ll have to convert each
income amount into a single currency.
Mixed data values: Perhaps different sources use different descriptors for
features – for example, man or male. These value descriptors should all be
made uniform.
Data outliers: Outliers can have a huge impact on data analysis results. For example, if you're averaging test scores for a class and one student didn't respond to any of the questions, their 0% could greatly skew the results.
Missing data: Take a look for missing data fields, blank spaces in text, or
unanswered survey questions. This could be due to human error or
incomplete data. To take care of missing data, you’ll have to perform data
cleaning.
2. Data cleaning
Data cleaning is the process of adding missing data and correcting, repairing, or removing incorrect or irrelevant data from a data set. Data cleaning is the most important step of preprocessing because it will ensure that your data is ready to go for your downstream needs.
Data cleaning will correct all of the inconsistent data you uncovered in your data
quality assessment. Depending on the kind of data you’re working with, there are
a number of possible cleaners you’ll need to run your data through.
Missing data
There are a number of ways to correct for missing data, but the two most common are to drop the records that contain missing values, or to fill in (impute) the missing values, for example with a statistic such as the field's mean, median, or most frequent value. A sketch of both options follows.
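A minimal pandas sketch of both options (the toy DataFrame and column names are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "income": [50000, 62000, np.nan, 58000]})

dropped = df.dropna()                            # option 1: drop incomplete records
imputed = df.fillna(df.mean(numeric_only=True))  # option 2: impute with column means
print(imputed)
```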
Noisy data
Data cleaning also includes fixing “noisy” data. This is data that includes
unnecessary data points, irrelevant data, and data that’s more difficult to group
together.
Binning: Binning sorts data of a wide data set into smaller groups of more similar data. It's often used when analyzing demographics. Income, for example, could be grouped: $35,000-$50,000, $50,000-$75,000, etc. (see the sketch after this list).
Regression: Regression is used to decide which variables will actually apply
to your analysis. Regression analysis is used to smooth large amounts of
data. This will help you get a handle on your data, so you’re not
overburdened with unnecessary data.
Clustering: Clustering algorithms are used to properly group data, so that it
can be analyzed with like data. They’re generally used in unsupervised
learning, when not a lot is known about the relationships within your data.
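A minimal sketch of binning with pandas, mirroring the income example above (bin edges and labels are illustrative):

```python
import pandas as pd

incomes = pd.Series([36000, 48000, 52000, 67000, 71000])
bins = [35000, 50000, 75000]                     # bin edges
labels = ["$35,000-$50,000", "$50,000-$75,000"]
print(pd.cut(incomes, bins=bins, labels=labels))  # assigns each income to a bin
```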
If you’re working with text data, for example, some things you should consider
when cleaning your data are:
Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
Translate all text into the language you’ll be working in
Remove HTML tags
Remove boilerplate email text
Remove unnecessary blank text between words
Remove duplicate data
After data cleaning, you may realize you have insufficient data for the task at hand.
At this point you can also perform data wrangling or data enrichment to add new
data sets and run them through quality assessment and cleaning again before
adding them to your original data.
3. Data transformation
With data cleaning, we’ve already begun to modify our data, but data
transformation will begin the process of turning the data into the proper format(s)
you’ll need for analysis and other downstream processes.
1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation
Aggregation: Data aggregation combines all of your data together in a
uniform format.
Normalization: Normalization scales your data into a regularized range so that you can compare it more accurately. For example, if you're comparing employee loss or gain within a number of companies (some with just a dozen employees and some with 200+), you'll have to scale them within a specified range, like -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
Feature selection: Feature selection is the process of deciding which variables (features, characteristics, categories, etc.) are most important to your analysis. These features will be used to train ML models. It's important to remember that the more features you choose to use, the longer the training process and, sometimes, the less accurate your results, because some feature characteristics may overlap or be less present in the data.
Discretization: Discretization pools data into smaller intervals. It's somewhat similar to binning, but usually happens after data has been cleaned. For example, when calculating average daily exercise, rather than using the exact minutes and seconds, you could join together data to fall into 0-15 minutes, 15-30, etc.
Concept hierarchy generation: Concept hierarchy generation can add a hierarchy within and between your features that wasn't present in the original data. If your analysis contains wolves and coyotes, for example, you could add the hierarchy for their genus: canis.
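A minimal sketch of normalization with scikit-learn, scaling invented employee-change figures into the -1.0 to 1.0 range mentioned above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

changes = np.array([[-12.0], [3.0], [250.0], [-40.0]])  # illustrative values
scaler = MinMaxScaler(feature_range=(-1.0, 1.0))
print(scaler.fit_transform(changes))  # all values now fall within [-1, 1]
```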
Here are some additional details about the most common data preprocessing steps:
Data cleaning: This is the most basic step in data preprocessing, and it involves
identifying and correcting errors or inconsistencies in the data. This can be a
time-consuming and challenging task, but it is essential for ensuring the quality of
the data.
Data standardization: This step involves converting the data into a common
format so that it can be easily manipulated and analyzed. This can be done by
converting text to numbers, converting dates to a standard format, and removing
outliers.
Data transformation: This step involves changing the way the data is presented
or stored. This can be done by creating new features, aggregating data, and
reducing the dimensionality of the data.
Feature selection: This step involves selecting the most important features for the
analysis. This can be done using statistical methods or machine learning
algorithms.
Data sampling: This step involves selecting a subset of the data for analysis. This
can be done to reduce the size of the data or to improve the performance of the
analysis.
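As a small illustration of the sampling step, pandas can draw a random subset (the 10% fraction is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})
sample = df.sample(frac=0.1, random_state=0)  # keep a random 10% of the rows
print(len(sample))  # 100
```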
12) What is Data cleaning? Explain various types of data cleaning methods.
https://monkeylearn.com/blog/data-preprocessing/#transformation
Data cleaning is the process of identifying and correcting errors or inconsistencies in
data. It is an important step in data preparation and analysis, as it can help to ensure
that the data is accurate and reliable.
There are many different types of data cleaning methods, each with its own advantages
and disadvantages. Some of the most common methods include:
Data validation: This involves checking the data for errors such as typos, missing
values, and out-of-range values.
Data standardization: This involves converting the data into a common format so
that it can be easily manipulated and analyzed.
Data cleansing: This involves removing noise from the data, such as outliers and
duplicate records.
Data integration: This involves combining data from different sources into a
single dataset.
Data transformation: This involves changing the way the data is presented or
stored.
The best data cleaning method for a particular dataset will depend on the specific needs of the project.
Data cleaning is an essential step in data preparation and analysis. By following the
right methods, you can ensure that your data is accurate and reliable, which will lead to
better insights and decisions.
Here are some additional details about each of the data cleaning methods mentioned
above:
Data validation: This is the most basic type of data cleaning, and it involves
checking the data for errors such as typos, missing values, and out-of-range
values. This can be done manually or using a data validation tool.
Data standardization: This involves converting the data into a common format so
that it can be easily manipulated and analyzed. For example, if you have data in
different date formats, you can standardize it by converting it all to the same
format.
Data cleansing: This involves removing noise from the data, such as outliers and duplicate records. Outliers are data points that are significantly different from the rest of the data. Duplicate records are data points that appear multiple times in the dataset (a sketch follows this list).
Data integration: This involves combining data from different sources into a
single dataset. This can be a complex task, as it requires ensuring that the data
is compatible and that the different sources are aligned.
Data transformation: This involves changing the way the data is presented or
stored. For example, you might want to convert the data into a different format or
create a summary of the data.
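A minimal sketch of the cleansing step with pandas, dropping duplicate records and filtering an outlier by z-score (the data and the threshold of 2 standard deviations are illustrative; a threshold of 3 is common with larger samples):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 11, 13, 9, 500]})
df = df.drop_duplicates()  # remove repeated records

# z-score: distance from the mean in units of standard deviation
z = (df["value"] - df["value"].mean()) / df["value"].std()
cleaned = df[z.abs() < 2]  # drops the 500 outlier in this toy example
print(cleaned)
```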
Data cleaning can be a time-consuming and challenging task, but it is essential for
ensuring the quality and usability of data. By following the right methods, you can
improve the accuracy and reliability of your data, which will lead to better insights and
decisions.
Data is a collection of facts or figures that can be used to answer questions or make decisions. It can be qualitative or quantitative.
Qualitative data is descriptive data that characterizes but does not measure. It is often collected through interviews, observations, or surveys. Examples of qualitative data include:
o Gender
o Marital status
o Race
o Occupation
o Education level
o Product preference
o Customer satisfaction
Quantitative data is numerical data that can be counted or measured. It is often collected
through experiments or surveys. Examples of quantitative data include:
o Height
o Weight
o Income
o Sales
o Temperature
o Number of employees
o Number of customers
Categorical data and numerical data are two main types of data.
Categorical data is data that can be classified into categories. It is often used to describe
characteristics of a population. Examples of categorical data include:
o Gender (male, female)
o Marital status (single, married, divorced)
o Occupation (doctor, lawyer, teacher)
o Product preference (Apple, Samsung, Google)
o Customer satisfaction (very satisfied, satisfied, dissatisfied, very dissatisfied)
Categorical data can be further divided into two types:
o Nominal data is categorical data that does not have a natural order. Examples of nominal data include gender, marital status, and product preference.
o Ordinal data is categorical data that has a natural order. Examples of ordinal data include customer satisfaction and education level.
Numerical data is data that can be measured or counted. It is often used to describe
quantities or variables. Examples of numerical data include:
o Height
o Weight
o Income
o Sales
o Temperature
o Number of employees
o Number of customers
Numerical data can be further divided into two types:
o Discrete data is numerical data that can be counted. Examples of discrete data include the number of employees or the number of customers.
o Continuous data is numerical data that can be measured on a continuous scale. Examples of continuous data include height, weight, and temperature.
| Aspect | Underfitting | Overfitting |
| --- | --- | --- |
| Characteristics | High training error, high bias, poor generalization. | Low training error, high variance, poor generalization. |
| Causes | Very simple model, insufficient training, over-regularization. | Overly complex model, excessive training, too few examples. |
| Solutions | Use a more complex model, increase training data, reduce regularization. | Use a simpler model, increase training data, apply regularization, use cross-validation. |
| Training Error | High training error, indicating poor fit to training data. | Low training error, suggesting a good fit to training data. |
| Test Error | Typically high test error; poor performance on new data. | High test error; poor performance on new data due to memorization of noise. |
| Bias-Variance | Underfitting is often associated with high bias and low variance. | Overfitting is characterized by low bias and high variance. |
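A minimal sketch illustrating the contrast above, using polynomial fits of increasing degree on noisy data (the data, noise level, and degrees are invented; results typically show training error falling with degree while test error rises once the model starts memorizing noise):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)  # noisy samples of a sine wave
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                 # noise-free ground truth

for degree in (1, 3, 9):  # underfit, reasonable fit, likely overfit
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```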