
1. Regression Analysis
Regression analysis focuses on finding a relationship between a dependent variable
and one or more independent variables. It predicts the value of the dependent variable
based on the value of at least one independent variable, and it explains the impact of
changes in an independent variable on the dependent variable.
We use linear or logistic regression techniques to develop accurate models for
predicting an outcome of interest. Often, we create separate models for separate segments.
Y = f(X, β), where Y is the dependent variable, X is the independent variable, and β is
the unknown coefficient.
Regression is widely used in prediction and forecasting.
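To make the idea concrete, here is a minimal sketch of fitting Y = f(X, β) as a simple linear model in Python, assuming scikit-learn and NumPy are available; the numbers are purely illustrative.

# Minimal sketch: fitting Y = f(X, beta) as a linear model (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression

# X: one independent variable (e.g., advertising spend), Y: dependent variable (e.g., sales)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, Y)            # estimates the unknown coefficients beta
print("intercept:", model.intercept_)           # beta_0
print("slope:", model.coef_[0])                 # beta_1
print("prediction for X = 6:", model.predict([[6.0]])[0])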

Why do we use Regression Analysis?


As mentioned above, regression analysis helps in the prediction of a continuous
variable. There are many real-world scenarios where we need future predictions,
such as weather conditions, sales, or marketing trends, and for such cases we need a
technique that can make predictions accurately. Regression analysis is a statistical
method used for this purpose in machine learning and data science. Below are some
other reasons for using regression analysis:

o Regression estimates the relationship between the target and the independent
variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can determine the most
important factors, the least important factors, and how the factors
influence one another.

Regression analysis is a statistical method used to explore the relationship between a
dependent variable and one or more independent variables. The purpose of regression
analysis is to determine how well the independent variables explain or predict the
variation in the dependent variable.

Regression analysis can be used for several purposes, including:


1. Prediction: Regression analysis can be used to predict the values of the
dependent variable based on the values of the independent variables. For
example, a marketing department might use regression analysis to predict the
sales of a product based on its price, advertising spend, and other factors.
2. Causation: Regression analysis can help establish causal relationships between
the independent and dependent variables. For example, a healthcare researcher
might use regression analysis to determine if a new drug is effective in reducing
blood pressure.
3. Control: Regression analysis can help control for the effects of other variables
when examining the relationship between the independent and dependent
variables. For example, a social scientist might use regression analysis to
examine the relationship between income and education while controlling for
other factors such as age, gender, and race.
4. Explanation: Regression analysis can help explain the relationship between the
independent and dependent variables. For example, an economist might use
regression analysis to determine how changes in interest rates affect the
demand for housing.

Overall, the purpose of regression analysis is to provide insight into the relationship
between the independent and dependent variables, allowing researchers and analysts
to make informed decisions and predictions based on the data.
Application of Regression
Regression is a very popular technique, and it has wide applications in businesses and
industries. The regression procedure involves a predictor variable and a response
variable. The major applications of regression are given below.

o Environmental modeling
o Analyzing Business and marketing behavior
o Financial predictors or forecasting
o Analyzing the new trends and patterns.

Some examples of regression are:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Types of Regression
There are various types of regression used in data science and machine
learning. Each type has its own importance in different scenarios, but at the core, all
regression methods analyze the effect of the independent variables on the dependent
variable. Here we discuss some important types of regression, given
below:

o Linear Regression
o Logistic Regression
o Polynomial Regression

o Decision tree regression, etc.

What is Regression Analysis? Types and Applications | Analytics Steps


Regression Analysis - Formulas, Explanation, Examples and Definitions
(corporatefinanceinstitute.com)
Quora

2. Classification
https://www.youtube.com/watch?v=0FLmrC3-P1A
https://www.geeksforgeeks.org/getting-started-with-classification/
Classification: It is a data analysis task, i.e. the process of finding a
model that describes and distinguishes data classes and concepts.
Classification is the problem of identifying to which of a set of categories
(subpopulations) a new observation belongs, on the basis of a
training set of data containing observations whose category
membership is known.
Classification is a technique in data analytics used to categorize data into specific
groups or classes based on a set of pre-defined criteria. In classification, the goal is
to accurately predict the class of new or previously unseen data points based on a
model trained on existing data.

Classification is used in a variety of applications, such as:

1. Image recognition: Classifying images into categories such as animals, objects,
or people based on their visual features.
2. Sentiment analysis: Classifying text into positive, negative, or neutral
sentiment based on the words and phrases used.
3. Fraud detection: Classifying financial transactions as legitimate or fraudulent
based on a set of features such as transaction amount, location, and time.
4. Medical diagnosis: Classifying medical images or patient data into specific
diseases or conditions based on a set of symptoms or biomarkers.

Some common algorithms used in classification include:

1. Logistic Regression: A statistical algorithm used to predict the probability of a
binary outcome (e.g., yes or no, true or false).
2. Decision Trees: A tree-like model that classifies data by recursively partitioning
the feature space into smaller and smaller regions.
3. Support Vector Machines (SVM): A machine learning algorithm that finds the
optimal hyperplane in a high-dimensional feature space to separate data
points into different classes.
4. Random Forest: An ensemble learning method that combines multiple
decision trees to improve classification accuracy.
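As an illustration of the algorithms listed above, the following sketch trains one of them (a decision tree) on scikit-learn's built-in iris dataset; it assumes scikit-learn is installed and only shows the general train/predict/evaluate pattern.

# Illustrative sketch: training a classifier (here, a decision tree) on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)                      # assign unseen inputs to classes
print("accuracy:", accuracy_score(y_test, y_pred))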

Overall, classification is a powerful technique in data analytics that helps us better
understand and organize complex data by automatically categorizing it into different
classes or categories.
Basic Terminology in Classification Algorithms

• Classifier: An algorithm that maps the input data to a specific category.


• Classification model: A classification model tries to draw some conclusion from
the input values given for training. It will predict the class labels/categories for the
new data.
• Feature: A feature is an individual measurable property of a phenomenon being
observed.
• Binary Classification: Classification task with two possible outcomes. Eg: Gender
classification (Male / Female)
• Multi-class classification: Classification with more than two classes. In multi-
class classification, each sample is assigned to one and only one target label. Eg:
An animal can be a cat or dog but not both at the same time.
• Learning Step (Training Phase): Construction of the classification model.
Different algorithms are used to build a classifier by making the model
learn from the available training set. The model must be trained so that
it can make accurate predictions.
• Classification Step: The model is used to predict class labels; it is tested
on test data to estimate the accuracy of the classification rules.

Another categorization of machine-learning tasks arises when one considers the
desired output of a machine-learned system. In classification, inputs are divided into
two or more classes, and the learner must produce a model that assigns unseen inputs
to one or more (multi-label classification) of these classes. This is typically tackled
in a supervised way. Spam filtering is an example of classification, where the inputs
are email (or other) messages and the classes are "spam" and "not spam".

3. Naïve Bayes algorithm


Naive Bayes Classifiers - GeeksforGeeks
Naive Bayes Classifier in Machine Learning - Javatpoint
Naive Bayes is a probabilistic algorithm used for classification in data analysis. It is
based on Bayes' theorem, which states that the probability of a hypothesis (in this case,
a class label) given the evidence (the input features) is proportional to the probability
of the evidence given the hypothesis, multiplied by the prior probability of the
hypothesis. Naive Bayes assumes that the input features are independent of each other,
hence the name "naive".

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Therefore, the Bayes theorem of conditional probability can be restated as:

Posterior = Likelihood * Prior / Evidence
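A small worked computation of this restated formula in Python; the probabilities are hypothetical numbers chosen only to illustrate the calculation.

# Hypothetical numbers, chosen only to illustrate P(A|B) = P(B|A) * P(A) / P(B).
p_A = 0.01            # prior: P(hypothesis), e.g., probability a message is spam
p_B_given_A = 0.90    # likelihood: P(evidence | hypothesis)
p_B_given_not_A = 0.05

# marginal probability of the evidence, P(B)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

posterior = p_B_given_A * p_A / p_B   # Posterior = Likelihood * Prior / Evidence
print(round(posterior, 3))            # ~0.154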
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below
example:

Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on
a particular day according to the weather conditions. To solve this problem, we need
to follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Overall, Naive Bayes is a simple yet effective algorithm for classification in data analysis,
especially for text classification applications. It is computationally efficient and requires
relatively little training data compared to other machine learning algorithms. However,
its assumption of feature independence may not hold in all cases, which can lead to
lower accuracy in some applications.

Types of Naive Bayes

There are three main types of Naive Bayes that are used in practice:

Multinomial
Multinomial Naive Bayes assumes that each P(xn|y) follows a multinomial
distribution. It is mainly used in document classification problems and looks at the
frequency of words, similar to the example above.

Bernoulli
Bernoulli Naive Bayes is similar to Multinomial Naive Bayes, except that the
predictors are boolean (True/False), like the “Windy” variable in the example above.

Gaussian
Gaussian Naive Bayes assumes that continuous values are sampled from a gaussian
distribution
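A minimal sketch of Gaussian Naive Bayes in Python, assuming scikit-learn is available; the weather-style data here is made up purely for illustration.

# Sketch of Gaussian Naive Bayes on continuous features (hypothetical data).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two continuous features (e.g., temperature, humidity) and a binary "Play" label.
X = np.array([[25.0, 80.0], [27.0, 85.0], [20.0, 60.0],
              [22.0, 65.0], [30.0, 90.0], [21.0, 55.0]])
y = np.array([0, 0, 1, 1, 0, 1])          # 0 = don't play, 1 = play

clf = GaussianNB().fit(X, y)
print(clf.predict([[23.0, 70.0]]))        # predicted class for a new day
print(clf.predict_proba([[23.0, 70.0]]))  # posterior probabilities per class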

4. Logistic Regression
Logistic Regression in Machine Learning - Javatpoint
What is Logistic regression? | IBM
Understanding Logistic Regression - GeeksforGeeks
DA Unit-1_merged (1).pdf page 34
Logistic regression is basically a supervised classification algorithm. In
a classification problem, the target variable (or output), y, can take only
discrete values for a given set of features (or inputs), X.
Contrary to popular belief, logistic regression is a regression model: it
builds a regression model to predict the probability that a given
data entry belongs to the category numbered "1".
Logistic regression models the data using the sigmoid function.
Logistic regression is a statistical algorithm used for binary classification, where the
goal is to predict whether a binary outcome (e.g., yes/no, true/false) occurs based on
one or more input features. It models the relationship between the input features and
the probability of the binary outcome using a logistic function
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either
Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0
and 1, it gives the probabilistic values which lie between 0 and 1.
o Logistic Regression is similar to Linear Regression except in how they
are used. Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
o In Logistic Regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, whose output always lies between 0 and 1.
o The curve from the logistic function indicates the likelihood of something such
as whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify observations using different
types of data and can easily determine the most effective variables for the
classification. The logistic (sigmoid) function is σ(z) = 1 / (1 + e^(−z)).

Here's an example to illustrate how logistic regression works. Suppose we have a
dataset of patients with diabetes, where each patient is characterized by their age,
BMI, blood pressure, and glucose level, and is labeled as either diabetic or non-
diabetic. We want to build a model to predict whether a new patient with similar
characteristics is diabetic or not.
1. Preparing the data: We split the dataset into a training set and a test set, and
normalize the input features to ensure they have similar scales.
2. Training the model: We use the training set to estimate the parameters of the
logistic function, which relates the input features to the probability of the
patient being diabetic. The logistic function takes the form:

p(diabetic) = 1 / (1 + exp(-(b0 + b1 * age + b2 * BMI + b3 * blood_pressure + b4 * glucose)))

where p(diabetic) is the probability of the patient being diabetic, b0, b1, b2, b3, and
b4 are the parameters to be estimated, and exp() is the exponential function.

3. Evaluating the model: We evaluate the performance of the model on the test
set by comparing the predicted probabilities to the true labels. We can use
metrics such as accuracy, precision, recall, and F1-score to evaluate the
model's performance.
4. Using the model: Once we are satisfied with the model's performance, we can
use it to predict the probability of a new patient being diabetic based on their
age, BMI, blood pressure, and glucose level. If the probability is above a
certain threshold (e.g., 0.5), we predict that the patient is diabetic, otherwise
we predict that they are non-diabetic.
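A hedged sketch of the four steps above in Python, assuming scikit-learn is available; synthetic data stands in for the patient dataset, so the four generated features are only placeholders for age, BMI, blood pressure, and glucose.

# Sketch of the prepare / train / evaluate / use workflow with synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1. Prepare the data: 4 synthetic features standing in for age, BMI, blood pressure, glucose.
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 2. Train the model: estimates the coefficients b0..b4 of the logistic function.
model = LogisticRegression().fit(X_train, y_train)

# 3. Evaluate on the test set (default decision threshold is 0.5).
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))

# 4. Use the model: probability that a new (standardized) record belongs to class 1.
print(model.predict_proba(X_test[:1])[:, 1])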
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".

For example,

• To predict whether an email is spam (1) or not (0)
• To predict whether a tumor is malignant (1) or not (0)

5. Classification methods

6. Analysis of variance

https://www.geeksforgeeks.org/one-way-anova/
Analysis of Variance (ANOVA): Types, Examples & Uses (formpl.us)
ANOVA is a parametric statistical technique that helps in finding out if there is a significant
difference between the mean of three or more groups. It checks the impact of various factors
by comparing groups (samples) on the basis of their respective mean.
We can use this only when:
• the samples have a normal distribution.
• the samples are selected at random and should be independent of one another.
• all groups have equal standard deviations.
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an
observed aggregate variability found inside a data set into two parts: systematic
factors and random factors. The systematic factors have a statistical influence on the
given data set, while the random factors do not. Analysts use the ANOVA test to
determine the influence that independent variables have on the dependent variable in
a regression study.

Analysis of variance (ANOVA) is a statistical method used to test whether there are significant
differences between two or more groups. ANOVA is used to determine whether the mean
differences between groups are likely due to chance or to some other factor, such as a treatment
or intervention.

The basic idea of ANOVA is to partition the total variation in the data into two components: the
variation between the groups and the variation within the groups. If the variation between the
groups is significantly greater than the variation within the groups, then there is evidence to
suggest that the groups are different.

ANOVA can be used in several different contexts, including:

1. One-way ANOVA: This is used when there is only one factor that is being tested. For
example, a study may compare the effectiveness of three different treatments for a
medical condition.
2. Two-way ANOVA: This is used when there are two factors that are being tested. For
example, a study may compare the effectiveness of two different treatments for a medical
condition in both men and women.
3. Mixed ANOVA: This is used when there are both within-subjects and between-subjects
factors being tested. For example, a study may compare the effects of a treatment over
time (within-subjects) and between different groups of participants (between-subjects).

The basic steps in conducting an ANOVA analysis are:

1. Determine the null and alternative hypotheses: The null hypothesis is that there is no
difference between the groups, while the alternative hypothesis is that there is a
difference.
2. Calculate the mean square: This involves calculating the variation between the groups
and the variation within the groups.
3. Calculate the F-statistic: The F-statistic is the ratio of the variation between the groups to
the variation within the groups.
4. Determine the p-value: The p-value is the probability of obtaining a result as extreme as
the observed result, assuming that the null hypothesis is true.
5. Draw conclusions: If the p-value is less than the significance level (usually set at 0.05),
then we reject the null hypothesis and conclude that there is a significant difference
between the groups.

In summary, ANOVA is a useful statistical method for testing whether there are significant
differences between two or more groups. It can be used in a variety of contexts and involves
partitioning the total variation in the data into two components to determine whether the
variation between the groups is significantly greater than the variation within the groups.

Data and Sample Means


Suppose we have four independent populations that satisfy the conditions for
single factor ANOVA. We wish to test the null hypothesis H0: μ1 = μ2 = μ3 = μ4.
For purposes of this example, we will use a sample of size three from each of
the populations being studied. The data from our samples is:

• Sample from population #1: 12, 9, 12. This has a sample mean of 11.
• Sample from population #2: 7, 10, 13. This has a sample mean of 10.
• Sample from population #3: 5, 8, 11. This has a sample mean of 8.
• Sample from population #4: 5, 8, 8. This has a sample mean of 7.

The mean of all of the data is 9.

Sum of Squares of Error


We now calculate the sum of the squared deviations from each sample mean.
This is called the sum of squares of error.

• For the sample from population #1: (12 – 11)² + (9 – 11)² + (12 – 11)² = 6
• For the sample from population #2: (7 – 10)² + (10 – 10)² + (13 – 10)² = 18
• For the sample from population #3: (5 – 8)² + (8 – 8)² + (11 – 8)² = 18
• For the sample from population #4: (5 – 7)² + (8 – 7)² + (8 – 7)² = 6

We then add all of these sums of squared deviations and obtain 6 + 18 + 18 + 6 = 48.
Sum of Squares of Treatment
Now we calculate the sum of squares of treatment. Here we look at the
squared deviations of each sample mean from the overall mean, and multiply
each by the common sample size (three observations per group):

3[(11 – 9)² + (10 – 9)² + (8 – 9)² + (7 – 9)²] = 3[4 + 1 + 1 + 4] = 30.

Degrees of Freedom
Before proceeding to the next step, we need the degrees of freedom. There are
12 data values and four samples. Thus the number of degrees of freedom of
treatment is 4 – 1 = 3. The number of degrees of freedom of error is 12 – 4 = 8.

Mean Squares
We now divide our sum of squares by the appropriate number of degrees of
freedom in order to obtain the mean squares.

• The mean square for treatment is 30 / 3 = 10.
• The mean square for error is 48 / 8 = 6.

The F-statistic
The final step of this is to divide the mean square for treatment by the mean
square for error. This is the F-statistic from the data. Thus for our example F =
10/6 = 5/3 = 1.667.

Tables of values or software can be used to determine how likely it is to obtain
a value of the F-statistic as extreme as this value by chance alone.
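The same worked example can be checked in Python, assuming SciPy is available; f_oneway reproduces the F-statistic computed by hand above.

# The worked example above, checked with SciPy's one-way ANOVA.
from scipy.stats import f_oneway

sample1 = [12, 9, 12]
sample2 = [7, 10, 13]
sample3 = [5, 8, 11]
sample4 = [5, 8, 8]

f_stat, p_value = f_oneway(sample1, sample2, sample3, sample4)
print(f"F = {f_stat:.3f}")   # ~1.667, matching 10 / 6
print(f"p = {p_value:.3f}")  # well above 0.05, so we fail to reject H0 here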

7. Data Analytics
What is Data Analysis? Process, Types, Methods and Techniques (simplilearn.com)
What is Data Analysis? - GeeksforGeeks
Data Analysis Examples (careerkarma.com)
Data analysis is the systematic process of acquiring data, evaluating it, and drawing
conclusions through visual tools like charts and graphs. It’s largely used in business,
manufacturing, and technological industries to help in their daily operations. Research
firms, universities, and laboratories also apply data analytics and statistical techniques
in their academic and scientific endeavors. Data analysis is important because of the
valuable insights that it provides through various data gathering techniques and
examination. This helps organizations improve their business performance and
provides an effective analysis of what should be their next move. Advanced analysis
can predict patterns and define phenomena that are crucial in creating business
strategies and making informed decisions.

Where Is Data Analysis Used?

• Business Processes
• Technology
• Healthcare
• Engineering
• Academics

At its very core, data analytics is an intersection of information technology,
statistics, and business. Further on, it is a multi-stage process that breaks down
into phases such as:

• Data grouping. In this phase, the data to be collected is categorized
according to parameters such as age, sex, and income.
• Data collection. This phase is conducted via different sources for data
collection, including computers, cameras, companies'/organizations'
personnel, and more.
• Data organization. This is done by organizing the data into spreadsheets or
other forms of software.
• Data cleaning and preparation. This process ensures that no errors or
duplicates remain and paves the way for analysts to begin the
actual data analysis.
• Specify Data Requirements
• Collect Data
• Clean and Process the Data
• Analyse the Data
• Interpretation
• Report
There are four types of data analysis: descriptive, diagnostic, predictive, and
prescriptive. Each helps businesses make smarter and safer decisions. Data
analysis has the ability to transform raw data into meaningful insights for your
business and your decision-making, and while there are several different ways of
collecting and interpreting this data, most data-analysis processes follow the
same general steps. Let's go through the four types in more detail.
Descriptive Analysis

• The main aim of descriptive analysis is to shed light on what happened over
the period being analyzed: for instance, how many sales of certain
products were realized in the previous week/month, whether they increased or
decreased, and so on. However, this type of analysis ends here and does not
elaborate on the root cause of what has happened; that is done through
diagnostic analysis, explained in the next section.
Diagnostic Analysis
• As already indicated, diagnostic analysis is interested in finding the root
cause of a particular outcome, for instance, an increase or decrease in
sales. This can be a specific season when the increase or decrease
happened, the latest marketing campaign of the company, or any other
reason.
Predictive Analysis
• After the descriptive and diagnostic analysis has taken place, the data feeds
into predictive analysis, through which data analysts try to predict what will
happen in the near future or how a process will develop. This analysis combines
statistics and data mining and usually ends with the creation of a visual
representation in order to make it understandable and useful.
Prescriptive Analysis
• Finally, a prescriptive analysis gives suggestions. Taking on board the
findings of predictive analysis, it suggests a particular course of action to be
taken and likewise assesses the potential implications that would come with
it.

5. Statistical Analysis
Statistical analysis is an approach for analyzing data sets in order to
summarize their important and main characteristics, generally with the
help of some visual aids. This approach can be used to gain knowledge
about the following aspects of the data:
1. Main characteristics or features of the data.
2. The variables and their relationships.
3. The important variables that can be used in our problem.

8. Probability Distribution
Probability Distribution - GeeksforGeeks
GRE Data Analysis | Distribution of Data, Random Variables, and Probability
Distributions - GeeksforGeeks
5 Probability distribution you should know as a data scientist | by Harsh Maheshwari
| Towards Data Science
Types Of Distribution In Statistics | Probability Distribution Explained | Statistics |
Simplilearn - YouTube
Probability Distribution - Definition, Types and Formulas (vedantu.com)
In the field of statistics, probability distributions play a major role in giving
the possibility of every outcome pertaining to a random experiment or event. A
probability distribution gives the probabilities of the various possible occurrences.
One is already aware that probability refers to the measure of the uncertainty
found in different phenomena.
Types of Probability Distribution:
There are two broad types of probability distribution, used for distinct purposes
and for different kinds of data-generation processes.

Normal Probability Distribution

In this distribution, the set of all possible outcomes can take values on a
continuous range, so it is also known as a continuous (or cumulative) probability
distribution.
Real-life quantities measured on a continuous scale of real numbers, such as the
temperature on a given day or a person's height, are examples of this kind of
distribution.
Binomial / Discrete Probability Distribution
The binomial distribution is a discrete probability distribution, where the
set of outcomes is discrete in nature. For example, if a die is rolled, all its
possible outcomes are discrete, and the distribution gives the probability mass of
each outcome. It is therefore also described by a probability mass function.

A probability distribution refers to the mathematical function that describes the
likelihood of different outcomes in a random event. There are many different types of
probability distributions, each of which has its own set of parameters and
characteristics. Here are a few examples:

1. Normal distribution: The normal distribution, also known as the Gaussian
distribution, is a probability distribution that is often used to describe real-world
phenomena whose values cluster symmetrically around a mean. The distribution is
characterized by a bell-shaped curve, with the majority of the data falling within
one standard deviation of the mean. For example, the heights of a large group of
people might be normally distributed.
2. Poisson distribution: The Poisson distribution is used to describe the probability
of a certain number of events occurring in a fixed interval of time or space. For
example, the number of customers who visit a store during a given hour might
be modeled using a Poisson distribution.
3. Binomial distribution: The binomial distribution is used to describe the
probability of a certain number of successes in a fixed number of trials, where
each trial has only two possible outcomes. For example, the number of heads in
ten coin tosses might be modeled using a binomial distribution.
4. Exponential distribution: The exponential distribution is used to describe the
probability of the time between events occurring in a Poisson process. For
example, the time between customer arrivals at a store might be modeled using
an exponential distribution.
5. Uniform distribution: The uniform distribution is the simplest probability
distribution, also known as the rectangular distribution. It assigns a constant
probability across its range of outcomes. The most common examples of this
type of distribution are tossing a fair coin or rolling a fair die.

These are just a few examples of the many different types of probability distributions
that are used in data analysis. By understanding the characteristics and parameters of
these distributions, analysts can make more informed decisions and predictions based
on the data.
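A short sketch of drawing samples from the distributions listed above in Python, assuming SciPy is available; all parameter values are illustrative.

# Drawing a few samples from each distribution named above (illustrative parameters).
from scipy import stats

normal_sample      = stats.norm.rvs(loc=170, scale=10, size=5, random_state=0)   # e.g., heights in cm
poisson_sample     = stats.poisson.rvs(mu=4, size=5, random_state=0)             # e.g., customers per hour
binomial_sample    = stats.binom.rvs(n=10, p=0.5, size=5, random_state=0)        # e.g., heads in 10 tosses
exponential_sample = stats.expon.rvs(scale=2.0, size=5, random_state=0)          # e.g., minutes between arrivals
uniform_sample     = stats.uniform.rvs(loc=1, scale=5, size=5, random_state=0)   # constant density on [1, 6]

print(normal_sample, poisson_sample, binomial_sample, exponential_sample, uniform_sample, sep="\n")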

9. Randomization and permutation

Permutation and Randomization Tests - STA442/2101 Fall 2012 (toronto.edu)
Permutation and randomization tests for network analysis - ScienceDirect
How to use Permutation Tests. A walkthrough of permutation tests and… | by Michael
Berk | Towards Data Science
Randomization in Statistics and Experimental Design - Statistics How To

Randomization and permutation tests are non-parametric statistical methods used to
analyze the significance of differences between groups. Here's how each test is
performed:

Randomization test:

1. Identify the null hypothesis and the alternative hypothesis.
2. Combine the data from the two groups.
3. Randomly assign the data into two new groups, with the same sample sizes as
the original groups.
4. Calculate the test statistic (e.g., the difference in means between the two
new groups).
5. Repeat steps 3 and 4 many times (e.g., 10,000 times).
6. Compare the distribution of test statistics generated by the randomization to
the test statistic calculated from the original data.
7. If the test statistic calculated from the original data falls outside of the range of
test statistics generated by the randomization, reject the null hypothesis and
conclude that there is a significant difference between the groups.

Randomization tests, also known as permutation tests, are a non-parametric statistical
technique used in data analytics to test the significance of an observed effect or
difference between two groups. This technique is particularly useful when the
assumptions of traditional parametric tests, such as normality or equal variances, are
not met or when the sample size is small.
The basic idea behind the randomization test is to randomly assign the observations
into two groups repeatedly and compute the test statistic for each permutation. This
creates a null distribution of the test statistic under the assumption that there is no
difference between the two groups. The observed test statistic is then compared to
this null distribution to determine the probability of obtaining such a result by chance.

Example: Suppose you want to test whether there is a significant difference in weight
between male and female students in a school. You randomly sample 50 male and 50
female students and record their weights. The null hypothesis is that there is no
significant difference in weight between the two groups. You can perform a
randomization test by randomly assigning the weights to two new groups, calculating
the difference in mean weight between the groups, and repeating the process many
times to generate a distribution of test statistics. If the difference in mean weight
between the original male and female groups falls outside the range of test statistics
generated by the randomization, you can reject the null hypothesis and conclude that
there is a significant difference in weight between male and female students.
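A minimal sketch of this randomization test in Python using NumPy; the weight data is simulated purely for illustration.

# Randomization test on hypothetical weight data for two groups of 50.
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(72, 10, 50)   # hypothetical male weights (kg)
group_b = rng.normal(65, 10, 50)   # hypothetical female weights (kg)

observed = group_a.mean() - group_b.mean()   # test statistic on the original data
combined = np.concatenate([group_a, group_b])

n_iter = 10_000
null_stats = np.empty(n_iter)
for i in range(n_iter):
    shuffled = rng.permutation(combined)     # randomly reassign observations to two groups
    null_stats[i] = shuffled[:50].mean() - shuffled[50:].mean()

# Two-sided p-value: how often a random reassignment is at least as extreme as observed.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")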

Permutation test:

1. Identify the null hypothesis and the alternative hypothesis.
2. Calculate the test statistic (e.g., the difference in means between the two
groups) for the original data.
3. Permute (shuffle) the labels of the data (e.g., randomly assign the data points
to one of the two groups).
4. Recalculate the test statistic for the permuted data.
5. Repeat steps 3 and 4 many times (e.g., 10,000 times).
6. Compare the distribution of test statistics generated by the permutations to the
test statistic calculated from the original data.
7. If the test statistic calculated from the original data falls outside of the range of
test statistics generated by the permutations, reject the null hypothesis and
conclude that there is a significant difference between the groups.

Example: Suppose you want to test whether there is a significant difference in IQ scores
between left-handed and right-handed individuals. You randomly sample 50 left-
handed and 50 right-handed individuals and record their IQ scores. The null hypothesis
is that there is no significant difference in IQ scores between the two groups. You can
perform a permutation test by permuting the labels of the IQ scores, recalculating the
difference in mean IQ between the groups, and repeating the process many times to
generate a distribution of test statistics. If the difference in mean IQ between the
original left-handed and right-handed groups falls outside the range of test statistics
generated by the permutations, you can reject the null hypothesis and conclude that
there is a significant difference in IQ scores between left-handed and right-handed
individuals.
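The same kind of test can also be run with SciPy's built-in permutation_test (available in SciPy 1.7 and later); the IQ data below is simulated for illustration.

# Permutation test via SciPy, on hypothetical IQ scores for two groups of 50.
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(1)
left_handed  = rng.normal(100, 15, 50)   # hypothetical IQ scores
right_handed = rng.normal(100, 15, 50)

def mean_diff(x, y):
    # Test statistic: difference in mean IQ between the two groups.
    return np.mean(x) - np.mean(y)

res = permutation_test((left_handed, right_handed), mean_diff,
                       permutation_type="independent",
                       n_resamples=10_000, alternative="two-sided")
print(f"difference in means = {res.statistic:.2f}, p = {res.pvalue:.4f}")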
10. Summary of modern analytics tools
What are Analytics Tools? | Anodot
Modern data analytic tools | Data Analytics - Infinity Lectures
Modern Digital Analytics Tools - Ambition Data
Data Analytics is an important aspect of many organizations nowadays. Real-
time data analytics is essential for the success of a major organization and
helps drive decision making. There are many tools that are used for deriving
useful insights from the given data. Some are programming based and others
are non-programming based. Some of the most popular tools are:
(1) Python - a powerful high-level programming language that is used for
general-purpose programming. Python supports both structured and
functional programming methods. Its extensive collection of
libraries makes it very useful in data analysis.
(2) Power BI - Microsoft's Power BI is one of the most widely used data analysis tools.
Power BI has been in the market since the very beginning of the data
revolution. While many data analysis tools faded out, Microsoft has
ensured Power BI kept evolving and catering to changing business
needs. Having started as a straightforward analytics tool, Power BI is now
equipped with machine learning capabilities.
(3) RapidMiner - a fully automated visual workflow design tool used for
data analytics. It's a no-code platform, and users aren't required to code
for segregating data. Today, it is heavily used in many industries
such as ed-tech, training, and research, and it is available as an open-source
platform.
(4) Qlik Sense - Qlik has been helping organizations harness the power of data
since the early '90s with its end-to-end data analytics tools.
(5) R - one of the leading programming languages for performing
complex statistical computations and graphics. It is a free and open-
source language that can be run on various UNIX platforms, Windows
and macOS. It also has a command-line interface which is easy to use.
(6) Excel - Microsoft Excel:
It is an important spreadsheet application that can be useful for
recording expenses, charting data, performing easy manipulation
and lookup, and generating pivot tables to provide summarized
reports of large datasets that contain significant findings. It is
written in C# and C++. It is relatively limited for complex analyses
of data compared to tools such as R or Python, but it is a common
tool among financial analysts and sales managers for solving
business problems.
(7) KNIME - KNIME, the Konstanz Information Miner, is a free and open-
source data analytics software. It is also used as a reporting and
integration platform. It integrates various components for machine
learning and data mining through modular data pipelining. It is
written in Java and developed by KNIME.com AG.
(8) Tableau - Tableau Public:
Tableau Public is free software developed by the public company
"Tableau Software" that allows users to connect to any spreadsheet or
file and create interactive data visualizations. It can also be used to
create maps and dashboards, with real-time updates, for easy
presentation on the web.
(9) SAS - SAS is a software suite and programming language developed by the
SAS Institute for performing advanced analytics, multivariate analyses,
business intelligence, data management and predictive analytics.
It is proprietary software written in C, and its software suite contains
more than 200 components. Its programming language is considered
to be high level.
(10) Apache
(a) Spark - Apache Spark is a framework that is used to process
data and perform numerous tasks on a large scale. It is also used
to process data across multiple computers with the help of
distributed computing tools. It is widely used among data analysts
as it offers easy-to-use APIs that provide easy data-pulling methods,
and it is capable of handling multiple petabytes of data as well.
(b) Hadoop - a Java-based open-source platform that is used to
store and process big data. It is built on a cluster system that allows
the system to process data efficiently and in parallel. It can process
both structured and unstructured data, distributed from one
server to multiple computers. Hadoop also offers cross-
platform support for its users. Today, it is one of the most widely
used big data analytics tools, popular with many tech giants such as
Amazon, Microsoft, IBM, etc.
(11) OpenRefine - a free, open-source tool for cleaning messy data and
transforming it between formats.
(12) Cassandra - Apache Cassandra is an open-source NoSQL
distributed database that is used to manage large amounts of data. It's one
of the most popular tools for data analytics and has been praised by
many tech companies due to its high scalability and availability without
compromising speed and performance.
(13) MongoDB - having come into the limelight in 2010, MongoDB is a free,
open-source platform and a document-oriented (NoSQL) database that is
used to store high volumes of data. It uses collections and documents for
storage, and its documents consist of key-value pairs, which are
considered the basic unit of MongoDB. It is popular among
developers due to its availability for multiple programming languages such
as Python, JavaScript, and Ruby.
(14) Qubole - an open-source big data tool that helps in fetching
data along the value chain using ad-hoc analysis and machine learning.
Qubole is a data lake platform that offers end-to-end service with
reduced time and effort required in moving data pipelines. It is
capable of configuring multi-cloud services.
(15) SAP Analytics Cloud
Since SAP has penetrated blue-chip companies for enterprise resource
planning (ERP), SAP Analytics Cloud (SAC) has become one of the natural
data analysis tools for gaining insights into data.
(16) Google Analytics - one of the most effective data analysis tools to
analyze website traffic and user behavior. Unlike other data analysis
tools that require data cleaning before finding insights, Google Analytics
can be used for streaming analytics without the need for data
engineers to create data pipelines. A simple JavaScript snippet pulls the
data from the website to analyze information specific to business
requirements.

11. How would you compose statistical concepts in inference?

Theory of estimation and testing of hypotheses

Statistical inference is a core concept in data analytics that involves using statistical
methods to draw conclusions about a population based on a sample of data. Here
are some key statistical concepts related to statistical inference in data analytics:

1. Population: The population is the entire group of individuals or objects that
you are interested in studying. For example, if you want to study the heights
of all humans in the world, the population would be all humans in the world.
2. Sample: A sample is a subset of the population that you have actually
collected data on. For example, if you want to study the heights of all humans
in the world, you might collect data on a sample of 1,000 people.
3. Sampling distribution: The sampling distribution is the distribution of a
statistic (such as the mean or standard deviation) for all possible samples of a
given size from a population. The sampling distribution is important because
it allows us to estimate the population parameter using the sample statistic.
4. Central Limit Theorem: The Central Limit Theorem states that as the sample
size increases, the sampling distribution of the sample mean approaches a
normal distribution regardless of the shape of the population distribution.
5. Confidence interval: A confidence interval is a range of values around a
sample statistic (such as the mean or proportion) that is likely to contain the
true population parameter with a certain level of confidence. For example, a
95% confidence interval means that we can be 95% confident that the true
population parameter falls within the range of values.
6. Hypothesis testing: Hypothesis testing involves testing a hypothesis about a
population parameter using sample data. The null hypothesis is the hypothesis
that there is no difference between the sample and the population, while the
alternative hypothesis is the hypothesis that there is a difference.
7. Type I and Type II errors: Type I error is rejecting the null hypothesis when it is
actually true. Type II error is failing to reject the null hypothesis when it is
actually false. These errors are important to consider when interpreting the
results of a hypothesis test.

Overall, understanding these statistical concepts is crucial for conducting meaningful
statistical inference in data analytics.
Using data analysis and statistics to make conclusions about
a population is called statistical inference.

Statistics from a sample are used to estimate population parameters.

The most likely value is called a point estimate.

There is always uncertainty when estimating.

The uncertainty is often expressed as confidence intervals, defined by a
likely lowest and highest value for the parameter.

An example could be a confidence interval for the number of bicycles a Dutch
person owns:

"The average number of bikes a Dutch person owns is between 3.5 and 6."

Importance of Statistical Inference


Statistical inference is significant for examining data properly. To reach
an effective solution, accurate data analysis is important for interpreting the
results of research. Inferential statistics is used for future prediction
of varied observations in different fields. It enables us to make inferences
about the data, and it also helps us deliver a probable range of values for
the true value of something in the population. Statistical inference is used in
different fields such as:

• Business Analysis
• Artificial Intelligence
• Financial Analysis
• Fraud Detection
• Machine Learning
• Pharmaceutical Sector
• Share market.
