Regression analysis
Regression analysis focuses on finding a relationship between a dependent variable
and one or more independent variables. It predicts the value of a dependent variable
based on the value of at least one independent variable, and explains the impact of
changes in an independent variable on the dependent variable.
We use linear or logistic regression techniques to develop accurate models for
predicting an outcome of interest.
Often, we create separate models for separate segments.
Y = f(X, β), where Y is the dependent variable, X is the independent variable, and β
represents the unknown coefficients.
Widely used in prediction and forecasting.
o Regression estimates the relationship between the target and the independent
variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can determine the most important factor,
the least important factor, and how each factor affects the others.
Overall, the purpose of regression analysis is to provide insight into the relationship
between the independent and dependent variables, allowing researchers and analysts
to make informed decisions and predictions based on the data.
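As a minimal illustration of the model Y = f(X, β), a simple linear regression can be fit by ordinary least squares. This is a sketch in plain Python with made-up data, not a production implementation:

```python
# Minimal sketch: fitting Y = b0 + b1*X by ordinary least squares.
# The data below is invented purely for illustration.

def fit_linear(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b1 = covariance(X, Y) / variance(X)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    b1 = num / den
    b0 = mean_y - b1 * mean_x          # intercept
    return b0, b1

# Example: hypothetical advertising spend (X) vs. sales (Y)
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]                   # exactly y = 2x + 1
b0, b1 = fit_linear(x, y)
print(b0, b1)                          # → 1.0 2.0
```

Once b0 and b1 are estimated, predicting a new Y is just `b0 + b1 * x_new`, which is how regression is used for forecasting.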
Application of Regression
Regression is a very popular technique with wide applications in businesses and
industries. The regression procedure involves a predictor variable and a response
variable. The major applications of regression are given below.
o Environmental modeling
o Analyzing Business and marketing behavior
o Financial predictors or forecasting
o Analyzing the new trends and patterns.
Types of Regression
There are various types of regression used in data science and machine learning.
Each type has its own importance in different scenarios, but at the core, all
regression methods analyze the effect of the independent variable on the dependent
variable. Some important types of regression are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
2. Classification
https://www.youtube.com/watch?v=0FLmrC3-P1A
https://www.geeksforgeeks.org/getting-started-with-classification/
Classification: It is a data analysis task, i.e. the process of finding a
model that describes and distinguishes data classes and concepts.
Classification is the problem of identifying which of a set of categories
(subpopulations) a new observation belongs to, on the basis of a
training set of data containing observations whose category
membership is known.
Classification is a technique in data analytics used to categorize data into specific
groups or classes based on a set of pre-defined criteria. In classification, the goal is
to accurately predict the class of new or previously unseen data points based on a
model trained on existing data.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the posterior probability of hypothesis A given evidence B, P(B|A) is the
likelihood of the evidence given the hypothesis, P(A) is the prior probability of the
hypothesis, and P(B) is the probability of the evidence.
There are three main types of Naive Bayes that are used in practice:
Multinomial
Multinomial Naive Bayes assumes that each P(xn|y) follows a multinomial
distribution. It is mainly used in document classification problems, where it looks at
the frequency of words in each document.
Bernoulli
Bernoulli Naive Bayes is similar to Multinomial Naive Bayes, except that the
predictors are boolean (True/False), e.g. whether a given word occurs in a document at all.
Gaussian
Gaussian Naive Bayes assumes that continuous values are sampled from a Gaussian
(normal) distribution.
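As a minimal sketch of Bayes' theorem in action, the following computes a posterior probability from made-up screening-test numbers (a 1% prior, 90% sensitivity, 5% false-positive rate; all three figures are assumptions chosen only for illustration):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical disease-screening numbers, invented for this sketch.
p_disease = 0.01              # prior P(A): 1% of people have the disease
p_pos_given_disease = 0.90    # likelihood P(B|A): test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate

# Total probability of a positive result, P(B), by the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))   # → 0.1538
```

Note how the posterior (about 15%) is far below the test's 90% sensitivity because the prior is so low; this is exactly the "prior knowledge" role that Bayes' theorem formalizes.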
4. Logistic Regression
Logistic Regression in Machine Learning - Javatpoint
What is Logistic regression? | IBM
Understanding Logistic Regression - GeeksforGeeks
Logistic regression is basically a supervised classification algorithm. In
a classification problem, the target variable(or output), y, can take only
discrete values for a given set of features(or inputs), X.
Contrary to popular belief, logistic regression is a regression model. The
model builds a regression model to predict the probability that a given
data entry belongs to the category numbered as “1”.
Logistic regression models the data using the sigmoid function.
Logistic regression is a statistical algorithm used for binary classification, where the
goal is to predict whether a binary outcome (e.g., yes/no, true/false) occurs based on
one or more input features. It models the relationship between the input features and
the probability of the binary outcome using a logistic function
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or
1, True or False, etc. But instead of giving the exact values 0 and 1, it gives
probabilistic values which lie between 0 and 1.
o Logistic Regression is similar to Linear Regression except in how it is used.
Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such
as whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used for the
classification. The logistic (sigmoid) function is:
f(x) = 1 / (1 + e^(-x))
For the diabetic-prediction example, the model takes the form:
p(diabetic) = 1 / (1 + exp(-(b0 + b1·age + b2·BMI + b3·blood pressure + b4·glucose)))
where p(diabetic) is the probability of the patient being diabetic; b0, b1, b2, b3, and
b4 are the parameters to be estimated; and exp() is the exponential function.
1. Preparing the data: We split the patient records (age, BMI, blood pressure,
glucose level, and the diabetic/non-diabetic label) into a training set and a test
set.
2. Training the model: We fit the logistic regression model to the training set to
estimate the parameters b0 through b4.
3. Evaluating the model: We evaluate the performance of the model on the test
set by comparing the predicted probabilities to the true labels. We can use
metrics such as accuracy, precision, recall, and F1-score to evaluate the
model's performance.
4. Using the model: Once we are satisfied with the model's performance, we can
use it to predict the probability of a new patient being diabetic based on their
age, BMI, blood pressure, and glucose level. If the probability is above a
certain threshold (e.g., 0.5), we predict that the patient is diabetic, otherwise
we predict that they are non-diabetic.
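The prediction step above can be sketched as follows. The coefficients b0..b4 here are made up for illustration; in practice they would be estimated from training data as described in the steps above:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, made-up coefficients b0..b4 for the diabetic example;
# real values would come from fitting the model to training data.
b = [-8.0, 0.05, 0.1, 0.02, 0.03]

def p_diabetic(age, bmi, blood_pressure, glucose):
    """Predicted probability that a patient is diabetic."""
    z = b[0] + b[1] * age + b[2] * bmi + b[3] * blood_pressure + b[4] * glucose
    return sigmoid(z)

# Classify a new patient using the 0.5 threshold from step 4
p = p_diabetic(age=55, bmi=32, blood_pressure=80, glucose=140)
label = "diabetic" if p >= 0.5 else "non-diabetic"
print(round(p, 3), label)
```

The sigmoid guarantees the output stays strictly between 0 and 1, which is why the model's output can be read as a probability.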
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic Regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types
of the dependent variable, such as "low", "medium", or "high".
5. Classification methods
6. Analysis of variance
https://www.geeksforgeeks.org/one-way-anova/
Analysis of Variance (ANOVA): Types, Examples & Uses (formpl.us)
ANOVA is a parametric statistical technique that helps in finding out if there is a significant
difference between the mean of three or more groups. It checks the impact of various factors
by comparing groups (samples) on the basis of their respective mean.
We can use this only when:
• the samples have a normal distribution.
• the samples are selected at random and should be independent of one another.
• all groups have equal standard deviations.
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an
observed aggregate variability found inside a data set into two parts: systematic
factors and random factors. The systematic factors have a statistical influence on the
given data set, while the random factors do not. Analysts use the ANOVA test to
determine the influence that independent variables have on the dependent variable in
a regression study.
Analysis of variance (ANOVA) is a statistical method used to test whether there are significant
differences between two or more groups. ANOVA is used to determine whether the mean
differences between groups are likely due to chance or to some other factor, such as a treatment
or intervention.
The basic idea of ANOVA is to partition the total variation in the data into two components: the
variation between the groups and the variation within the groups. If the variation between the
groups is significantly greater than the variation within the groups, then there is evidence to
suggest that the groups are different.
The main types of ANOVA are:
1. One-way ANOVA: This is used when there is only one factor that is being tested. For
example, a study may compare the effectiveness of three different treatments for a
medical condition.
2. Two-way ANOVA: This is used when there are two factors that are being tested. For
example, a study may compare the effectiveness of two different treatments for a medical
condition in both men and women.
3. Mixed ANOVA: This is used when there are both within-subjects and between-subjects
factors being tested. For example, a study may compare the effects of a treatment over
time (within-subjects) and between different groups of participants (between-subjects).
The basic steps in carrying out an ANOVA are:
1. Determine the null and alternative hypotheses: The null hypothesis is that there is no
difference between the groups, while the alternative hypothesis is that there is a
difference.
2. Calculate the mean square: This involves calculating the variation between the groups
and the variation within the groups.
3. Calculate the F-statistic: The F-statistic is the ratio of the variation between the groups to
the variation within the groups.
4. Determine the p-value: The p-value is the probability of obtaining a result as extreme as
the observed result, assuming that the null hypothesis is true.
5. Draw conclusions: If the p-value is less than the significance level (usually set at 0.05),
then we reject the null hypothesis and conclude that there is a significant difference
between the groups.
In summary, ANOVA is a useful statistical method for testing whether there are significant
differences between two or more groups. It can be used in a variety of contexts and involves
partitioning the total variation in the data into two components to determine whether the
variation between the groups is significantly greater than the variation within the groups.
Worked example: suppose we draw a sample of three values from each of four
populations.
• Sample from population #1: 12, 9, 12. This has a sample mean of 11.
• Sample from population #2: 7, 10, 13. This has a sample mean of 10.
• Sample from population #3: 5, 8, 11. This has a sample mean of 8.
• Sample from population #4: 5, 8, 8. This has a sample mean of 7.
The sum of squared deviations from each sample mean is:
• For the sample from population #1: (12 – 11)² + (9 – 11)² + (12 – 11)² = 6
• For the sample from population #2: (7 – 10)² + (10 – 10)² + (13 – 10)² = 18
• For the sample from population #3: (5 – 8)² + (8 – 8)² + (11 – 8)² = 18
• For the sample from population #4: (5 – 7)² + (8 – 7)² + (8 – 7)² = 6
Degrees of Freedom
Before proceeding to the next step, we need the degrees of freedom. There are
12 data values and four samples. Thus the number of degrees of freedom of
treatment is 4 – 1 = 3. The number of degrees of freedom of error is 12 – 4 = 8.
Mean Squares
We now divide our sums of squares by the appropriate numbers of degrees of
freedom to obtain the mean squares. The sum of squares for error is
6 + 18 + 18 + 6 = 48, so the mean square for error is 48 / 8 = 6. The grand mean of
all 12 values is 9, so the sum of squares for treatment is
3[(11 – 9)² + (10 – 9)² + (8 – 9)² + (7 – 9)²] = 30, and the mean square for
treatment is 30 / 3 = 10.
The F-statistic
The final step is to divide the mean square for treatment by the mean square for
error. This is the F-statistic for the data. Thus for our example
F = 10/6 = 5/3 ≈ 1.667.
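The whole calculation above can be reproduced in a short script (plain Python, using the four samples from the example):

```python
# One-way ANOVA on the four samples above, computed from first principles.
samples = [
    [12, 9, 12],   # population #1, mean 11
    [7, 10, 13],   # population #2, mean 10
    [5, 8, 11],    # population #3, mean 8
    [5, 8, 8],     # population #4, mean 7
]

k = len(samples)                               # number of groups (4)
n = sum(len(s) for s in samples)               # total observations (12)
grand_mean = sum(sum(s) for s in samples) / n  # 108 / 12 = 9.0

# Between-group (treatment) sum of squares: 30
ss_treatment = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2
                   for s in samples)
# Within-group (error) sum of squares: 6 + 18 + 18 + 6 = 48
ss_error = sum(sum((x - sum(s) / len(s)) ** 2 for x in s)
               for s in samples)

ms_treatment = ss_treatment / (k - 1)   # 30 / 3 = 10
ms_error = ss_error / (n - k)           # 48 / 8 = 6
F = ms_treatment / ms_error             # 10 / 6 ≈ 1.667
print(F)
```

To finish the test one would compare this F value against the F-distribution with (3, 8) degrees of freedom to obtain a p-value (e.g. via `scipy.stats.f_oneway`, which reproduces the same statistic).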
7. Data Analytics
What is Data Analysis? Process, Types, Methods and Techniques (simplilearn.com)
What is Data Analysis? - GeeksforGeeks
Data Analysis Examples (careerkarma.com)
Data analysis is the systematic process of acquiring data, evaluating it, and drawing
conclusions through visual tools like charts and graphs. It’s largely used in business,
manufacturing, and technological industries to help in their daily operations. Research
firms, universities, and laboratories also apply data analytics and statistical techniques
in their academic and scientific endeavors. Data analysis is important because of the
valuable insights that it provides through various data gathering techniques and
examination. This helps organizations improve their business performance and
provides an effective analysis of what should be their next move. Advanced analysis
can predict patterns and define phenomena that are crucial in creating business
strategies and making informed decisions.
• Business Processes
• Technology
• Healthcare
• Engineering
• Academics
• Descriptive Analysis
• The main aim of descriptive analysis is to shed light on what happened over
the respective period being analyzed. For instance, how many sales of certain
products were realized in the previous week/month, did they increase or
decrease, etc. However, this type of analysis ends here and does not further
elaborate on the root cause of what has happened, as that is done through
diagnostic analysis, explained in the next section.
• Diagnostic Analysis
• As already indicated, a diagnostic analysis is interested in finding out the root
cause that impacted the happening of a particular cause, for instance, an
increase/decrease in sales. This can either be a specific season of time when
the increase or decrease happened, the latest marketing campaign of the
company, or any other reason.
• Predictive Analysis
• After the descriptive and diagnostic analysis has taken place, data is oriented
into predictive analysis, through which data analysts try to predict what will
happen in the near future or how a process will develop. This analysis occurs
through a combination of statistics and data mining that ends up with the
creation of a visual representation in order to make it understandable and
useful.
• Prescriptive Analysis
• Finally, a prescriptive analysis gives suggestions. Taking on board the
findings of predictive analysis, it suggests a particular course of action to be
taken and likewise assesses the potential implications that would come with
it.
5. Statistical Analysis
Statistical Analysis is an approach or technique for analyzing data
sets in order to summarize their important and main characteristics, generally
by using some visual aids. This approach can be used to gather knowledge
about the following aspects of data:
1. Main characteristics or features of the data.
2. The variables and their relationships.
3. Finding out the important variables that can be used in our problem.
8. Probability Distribution
Probability Distribution - GeeksforGeeks
GRE Data Analysis | Distribution of Data, Random Variables, and Probability
Distributions - GeeksforGeeks
5 Probability distribution you should know as a data scientist | by Harsh Maheshwari
| Towards Data Science
Types Of Distribution In Statistics | Probability Distribution Explained | Statistics |
Simplilearn - YouTube
Probability Distribution - Definition, Types and Formulas (vedantu.com)
In the field of Statistics, Probability Distribution plays a major role in giving out
the possibility of every outcome of a random experiment or event. It
gives the probabilities of the various possible occurrences. Probability refers
to the measure of the uncertainty found in different phenomena.
Types of Probability Distribution:
There are two types of Probability Distribution, used for distinct purposes and
different types of data generation processes:
o Discrete Probability Distribution (e.g. Binomial, Poisson), where the random
variable can take only countable values.
o Continuous Probability Distribution (e.g. Normal, Uniform), where the random
variable can take any value within a range.
These are just a few examples of the many different types of probability distributions
that are used in data analysis. By understanding the characteristics and parameters of
these distributions, analysts can make more informed decisions and predictions based
on the data.
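As a small illustration of a discrete probability distribution, the binomial PMF can be computed directly from its formula. This is standard-library Python; the coin-toss numbers are chosen only for illustration:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 5 fair-coin tosses
print(binomial_pmf(3, 5, 0.5))     # → 0.3125

# Sanity check: the probabilities over all possible outcomes sum to 1
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
print(round(total, 10))            # → 1.0
```

The sum-to-one check is the defining property of any probability distribution, discrete or continuous.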
Randomization test:
Example: Suppose you want to test whether there is a significant difference in weight
between male and female students in a school. You randomly sample 50 male and 50
female students and record their weights. The null hypothesis is that there is no
significant difference in weight between the two groups. You can perform a
randomization test by randomly assigning the weights to two new groups, calculating
the difference in mean weight between the groups, and repeating the process many
times to generate a distribution of test statistics. If the difference in mean weight
between the original male and female groups falls outside the range of test statistics
generated by the randomization, you can reject the null hypothesis and conclude that
there is a significant difference in weight between male and female students.
Permutation test:
Example: Suppose you want to test whether there is a significant difference in IQ scores
between left-handed and right-handed individuals. You randomly sample 50 left-
handed and 50 right-handed individuals and record their IQ scores. The null hypothesis
is that there is no significant difference in IQ scores between the two groups. You can
perform a permutation test by permuting the labels of the IQ scores, recalculating the
difference in mean IQ between the groups, and repeating the process many times to
generate a distribution of test statistics. If the difference in mean IQ between the
original left-handed and right-handed groups falls outside the range of test statistics
generated by the permutations, you can reject the null hypothesis and conclude that
there is a significant difference in IQ scores between left-handed and right-handed
individuals.
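Both tests above follow the same recipe: shuffle the group labels many times and see how often the shuffled difference in means is at least as extreme as the observed one. A minimal sketch with made-up scores (standard-library Python only):

```python
import random

random.seed(42)  # for reproducibility of the shuffles

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(group_a, group_b, n_perm=10_000):
    """Two-sided permutation test for a difference in means.
    Returns the fraction of shuffled label assignments whose
    |mean difference| is at least as large as the observed one."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / n_perm       # approximate p-value

# Hypothetical, invented scores for two small groups
left = [98, 102, 110, 95, 105]
right = [100, 101, 99, 103, 97]
p_value = permutation_test(left, right)
print(p_value)  # a small p-value is evidence against the null hypothesis
```

With full enumeration of all label assignments this is a permutation test; sampling a random subset of assignments, as here, is the randomization-test variant described above.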
10. Summary of modern analytics tools
What are Analytics Tools? | Anodot
Modern data analytic tools | Data Analytics - Infinity Lectures
Modern Digital Analytics Tools - Ambition Data
Data Analytics is an important aspect of many organizations nowadays. Real-
time data analytics is essential for the success of a major organization and
helps drive decision making. There are many tools that are used for deriving
useful insights from the given data. Some are programming based and others
are non-programming based. Some of the most popular tools are:
(1) Python - a powerful high-level programming language used for
general-purpose programming. Python supports both structured and
functional programming methods, and its extensive collection of
libraries makes it very useful in data analysis.
(2) Power BI - Microsoft's Power BI is one of the most widely used Data Analysis
tools. Power BI has been in the market since the very beginning of the data
revolution. While many Data Analysis tools faded out, Microsoft has
ensured Power BI kept evolving and catering to changing business
needs. Started as a straightforward analytics tool, Power BI is now
equipped with Machine Learning capabilities.
(3) RapidMiner - a fully automated visual workflow design tool used for
data analytics. It's a no-code platform: users aren't required to code
to segregate data. Today it is heavily used in many industries
such as ed-tech, training, and research, and it is an open-source
platform.
(4) Qlik Sense - Qlik has been helping organizations
harness the power of data since the early '90s with its end-to-end data
analytics tools.
(5) R - one of the leading programming languages for performing
complex statistical computations and graphics. It is a free and open-
source language that can be run on various UNIX platforms, Windows,
and macOS, and it has an easy-to-use command line interface.
(6) Excel - Microsoft Excel is an important spreadsheet application that can be
useful for recording expenses, charting data, performing easy manipulation
and lookup, and generating pivot tables to provide summarized
reports of large datasets that contain significant data
findings. It is less suited to complex analyses of data than
tools such as R or Python, but it is a common tool among financial
analysts and sales managers for solving business problems.
(7) KNIME - KNIME, the Konstanz Information Miner, is a free and open-
source data analytics software. It is also used as a reporting and
integration platform. It integrates various components
for Machine Learning and data mining through modular data
pipelining. It is written in Java and developed by KNIME.com AG.
(8) Tableau - Tableau Public is free software developed by the public company
"Tableau Software" that allows users to connect to any spreadsheet or
file and create interactive data visualizations. It can also be used to
create maps and dashboards, with real-time updates, for easy
presentation on the web.
(9) SAS - SAS is a statistical software suite and programming language developed
by the SAS Institute for advanced analytics, multivariate analysis, business
intelligence, data management, and predictive analytics.
It is proprietary software written in C, and its software suite contains
more than 200 components. Its programming language is considered
to be high-level.
(10) Apache
(a) Spark - Apache Spark is a framework used to process
data and perform numerous tasks at a large scale. It is also used
to process data across multiple computers with the help of distributed
tools. It is widely used among data analysts as it offers easy-to-use
APIs with simple data-pulling methods, and it is capable of
handling multi-petabyte datasets.
(b) Hadoop - a Java-based open-source platform used to
store and process big data. It is built on a cluster system that allows
the system to process data efficiently and in parallel.
It can process both structured and unstructured data, from one
server up to multiple computers, and also offers cross-
platform support for its users. Today, it is one of the most widely used
big data analytics tools, used by many tech giants such as
Amazon, Microsoft, IBM, etc.
(11) OpenRefine - a free, open-source tool for cleaning messy data and
transforming it between formats.
(12) Cassandra - Apache Cassandra is an open-source NoSQL
distributed database used to handle large amounts of data. It's one
of the most popular tools for data analytics and has been praised by
many tech companies for its high scalability and availability without
compromising speed and performance.
(13) MongoDB - came into the limelight in 2010; it is a free, open-source,
document-oriented (NoSQL) database used to
store high volumes of data. It uses collections and documents for
storage; a document consists of key-value pairs, which are
considered the basic unit of MongoDB. It is popular among
developers due to its support for multiple programming languages such
as Python, JavaScript, and Ruby.
(14) Qubole - an open-source big data tool that helps in fetching
data in a value chain using ad-hoc analysis and machine learning.
Qubole is a data lake platform that offers end-to-end service, reducing
the time and effort required in moving data pipelines, and it
is capable of configuring multi-cloud services.
(15) SAP Analytics Cloud
Since SAP has penetrated blue-chip companies for enterprise resource
planning (ERP), SAP Analytics Cloud (SAC) has become one of the natural Data
Analysis Tools for gaining insights into data.
(16) Google Analytics - one of the most effective Data Analysis tools for
analyzing website traffic and user behavior. Unlike other Data Analysis
tools that require data cleaning before finding insights, Google Analytics
can be used for streaming analytics without the need for data
engineers to create data pipelines. A simple JavaScript snippet pulls the
data from the website to analyze information specific to business
requirements.
Statistical inference is a core concept in data analytics that involves using statistical
methods to draw conclusions about a population based on a sample of data. Here
are some key statistical concepts related to statistical inference in data analytics:
For example, an interval estimate is a statement such as: "The average number of
bikes a Dutch person owns is between 3.5 and 6."
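An interval estimate like the one above can be produced from a sample with a confidence interval. This sketch uses made-up survey data and the normal (z = 1.96) approximation for simplicity rather than a t-distribution:

```python
import math
import statistics

# Hypothetical sample: number of bikes owned by 10 surveyed people
bikes = [3, 5, 4, 6, 4, 5, 3, 6, 5, 4]

n = len(bikes)
sample_mean = statistics.mean(bikes)
sample_sd = statistics.stdev(bikes)   # sample standard deviation

# Approximate 95% confidence interval for the population mean,
# using the normal approximation (z = 1.96)
margin = 1.96 * sample_sd / math.sqrt(n)
low, high = sample_mean - margin, sample_mean + margin
print(f"mean lies in [{low:.2f}, {high:.2f}] with ~95% confidence")
```

For a sample this small, a t-critical value (about 2.26 for 9 degrees of freedom) would give a slightly wider, more honest interval; the z approximation is shown only to keep the sketch short.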
• Business Analysis
• Artificial Intelligence
• Financial Analysis
• Fraud Detection
• Machine Learning
• Pharmaceutical Sector
• Share market.