
ANOVA

The one-way analysis of variance (ANOVA) is used to determine whether there are
any statistically significant differences between the means of two or more
independent (unrelated) groups (although you tend to only see it used when there
are a minimum of three, rather than two groups). For example, you could use a one-
way ANOVA to understand whether exam performance differed based on test
anxiety levels amongst students, dividing students into three independent groups
(e.g., low, medium and high-stressed students). Also, it is important to realize that
the one-way ANOVA is an omnibus test statistic and cannot tell you which specific
groups were statistically significantly different from each other; it only tells you that
at least two groups were different. Since you may have three, four, five or more
groups in your study design, determining which of these groups differ from each
other is important.
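As a minimal sketch (using R's built-in PlantGrowth data, which has plant weights for a control group and two treatment groups), the omnibus test and a follow-up post-hoc comparison look like:

```r
# One-way ANOVA on the built-in PlantGrowth data:
# weight (continuous dependent) across three groups (ctrl, trt1, trt2).
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)    # omnibus F-test: is at least one group mean different?

# The omnibus test cannot say WHICH groups differ;
# Tukey's HSD compares each pair of groups.
TukeyHSD(fit)
```

Here the F-test is significant at the 5% level, and TukeyHSD() then identifies which specific pair of groups drives the difference.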

Dependent variable
Your dependent variable should be measured at the interval or ratio level (i.e.,
it is continuous). Examples of variables that meet this criterion include
revision time (measured in hours), intelligence (measured using IQ score), exam
performance (measured from 0 to 100), weight (measured in kg), and so forth.

Independent variable
Your independent variable should consist of two or more
categorical, independent groups. Typically, a one-way ANOVA is used when you
have three or more categorical, independent groups, but it can be used for just two
groups (but an independent-samples t-test is more commonly used for two groups).
Example independent variables that meet this criterion include ethnicity (e.g., 3
groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4
groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon,
doctor, nurse, dentist, therapist), and so forth.

ANOVA
- What: a statistical tool applied on unrelated groups to find out whether they
  have a common mean.
- Dependent variable (outcome): continuous
- Independent variable: categorical and continuous (multiple variables)
- Example: to understand whether exam performance (continuous dependent)
  differed based on test anxiety levels amongst students, dividing students into
  three independent groups, e.g., low, medium and high-stressed students
  (categorical independent).
- R function: aov()

REGRESSION (linear)
- What: a statistical method to establish the relationship between sets of
  variables in order to make predictions of the dependent variable with the help
  of independent variables.
- Dependent variable (outcome): continuous
- Independent variable: continuous (all are continuous)
- Example: a paint company uses one of the derivatives of crude solvent &
  monomers as its raw material; we can run a regression analysis between the
  price of that raw material (continuous dependent) and the price of Brent crude
  (continuous independent).
- R function: lm()

LOGISTIC REGRESSION
- Dependent variable (outcome): categorical, binomial only (e.g., low or high)
- Independent variable: continuous/categorical (it can be any of these)
- R function: glm()
https://www.wallstreetmojo.com/regression-vs-anova/
So if you're trying to predict the duration of breastfeeding in weeks (continuous
dependent) using mother's age (continuous independent) as a predictor variable,
then you would use a regression model.

If you are trying to predict the duration of breastfeeding in weeks (continuous
dependent) using mother's marital status (single, married, divorced, widowed)
(categorical independent), then you would use an ANOVA model.

If you are trying to predict the duration of breastfeeding in weeks using prenatal
smoking status (smoked during pregnancy, did not smoke during pregnancy), then
you would use a two-sample t-test. If you added delivery type (vaginal/c-section) to
prenatal smoking status, then the two binary predictor variables would be analyzed
using an ANOVA model.
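A sketch of how the three choices above are expressed in R. Note that the data frame and all of its columns here are simulated stand-ins, not a real breastfeeding dataset:

```r
set.seed(1)
# Hypothetical data frame: all variable names and values are made up.
bf <- data.frame(
  weeks   = rnorm(120, mean = 20, sd = 6),   # duration of breastfeeding
  age     = rnorm(120, mean = 28, sd = 4),   # mother's age
  marital = factor(sample(c("single", "married", "divorced", "widowed"),
                          120, replace = TRUE)),
  smoked  = factor(sample(c("smoked", "did not smoke"), 120, replace = TRUE))
)

summary(lm(weeks ~ age, data = bf))       # continuous predictor -> regression
summary(aov(weeks ~ marital, data = bf))  # categorical predictor -> ANOVA
t.test(weeks ~ smoked, data = bf)         # one binary predictor -> two-sample t-test
```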

If what we want to predict is continuous (some percentage or number), check
whether the independent variable is categorical or continuous:
Categorical (e.g., low/medium/high) -> ANOVA
Continuous (numbers) -> Regression

If what we want to predict is categorical (yes/no, low/medium/high), use
logistic regression.
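A minimal logistic-regression sketch with glm(), using the built-in mtcars data as a stand-in binary outcome (am is a 0/1 transmission-type indicator):

```r
# Logistic regression: categorical (here binary) outcome, continuous predictor.
logit_fit <- glm(am ~ mpg, data = mtcars, family = binomial)
summary(logit_fit)

# Predicted probability of am = 1 for a car doing 25 mpg
predict(logit_fit, newdata = data.frame(mpg = 25), type = "response")
```

The family = binomial argument is what makes glm() fit a logistic model rather than ordinary least squares.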

Decision tree
A single decision tree does not use the bootstrap; bagging and random forest do
use bootstrap samples.
Data file: Hitters
Codes

# Loading data
d <- read.csv("D:/IIM K/Quarter 5/PA NEW/2. Linear model/Advertising (1).csv",
              header = T)
attach(d)

# Scatter plots: split the plotting panel into a 1 x 3 grid
# (for having three scatter plots in a single row)
par(mfrow = c(1, 3))

# Plotting Sales vs TV
plot(TV, Sales, col = "red", lwd = 2)

# Correlation coefficient
cor(TV, Sales)

# Variance (note: in R, var(x, y) returns the covariance;
# use var(x) for the variance of a single variable)
var(x)

# Covariance
cov(x, y)

# Fitting the simple linear regression model
mod_1 <- lm(Sales ~ TV, data = d)

# Model summary (coefficients, p-values, R-squared, etc.)
summary(mod_1)

# Omitting missing values
na.omit()

# To check dimensions
dim()

# Makes a variable categorical (e.g., Likert-scale codes)
as.factor()

# Regression tree
Hitters_tree <- tree(log(Salary) ~ Years + Hits + Runs + Walks,
                     data = Hitters, subset = train)
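The notes fit the tree with the tree package on ISLR's Hitters data, which requires installing both packages. An equivalent runnable sketch uses rpart (shipped with standard R installations) on the built-in mtcars data:

```r
library(rpart)

# Regression tree: predict mpg from a few predictors,
# analogous to the log(Salary) tree on Hitters in the notes.
fit_tree <- rpart(mpg ~ hp + wt + cyl, data = mtcars, method = "anova")
print(fit_tree)                            # the fitted splits
predict(fit_tree, newdata = mtcars[1, ])   # prediction for one car
```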

R-squared
R-squared = 0.90 (i.e., 90%)
implies that 90% of the variation in y is explained by variation in x.
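For example, with the built-in cars data (stopping distance vs speed), R-squared can be read off the model summary:

```r
# R-squared: proportion of variance in y explained by the model.
fit <- lm(dist ~ speed, data = cars)
summary(fit)$r.squared   # about 0.65: speed explains roughly 65% of the
                         # variation in stopping distance
```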

Exam point of view: how to write the code


Cleaning data
Step # 1
Load libraries
Eg: library(ISLR), library(MASS)

Step # 2
Reading of data
d<-read.csv("_________", header=T)

Step # 3
names(d)
This is to understand the headings in the data set
Step # 4
dim(d)
This is to understand the dimension of the data

Step # 5
str(d)
This will let us know what kind of data is being used or interpreted by R.
Eg: integer
A Likert scale uses numbers, but these are categorical; R might read them as
integers. This command shows how R interprets each variable. We then use
as.factor() where needed.

Step # 6
na.omit()
To omit missing data

Step # 7
dim()
Doing this again will let us know if any rows have been changed or not. We will now
know how many data points are left.
Eg: If there are 1000 data points before na.omit() and only 200 after, check
the data for any particular column that has led to this, then delete that
column.

colSums(is.na(d))
This shows the number of missing (NA) values in each column.
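The steps above can be run end-to-end on the built-in airquality data, which contains genuine NA values, so na.omit() visibly changes the row count:

```r
d <- airquality
names(d)                  # Step 3: column headings
dim(d)                    # Step 4: rows x columns (153 x 6)
str(d)                    # Step 5: how R interprets each column
colSums(is.na(d))         # missing values per column
d <- na.omit(d)           # Step 6: drop incomplete rows
dim(d)                    # Step 7: fewer rows remain
d$Month <- as.factor(d$Month)   # treat a numeric code as categorical
```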

Working on data - descriptive analytics


summary(d)
This gives a summary of the data: mean, median, quartiles, minimum, maximum, etc.

Then plot all data to understand if there are interaction effects etc.

Regression
status~age   This means: model status as a function of age
status~.     This means: model status as a function of all other variables

mod_1=lm()
summary(mod_1)
This will give us a summary of the model.
From this, keep only the predictors with low p-values (less than 0.05).

Using these ones, make a new model.


Eg:
Age 0.06
Gender 0.001
BP 0.1
Sugar 0.001
Here the p-value of BP is 0.1, which is greater than 0.05, so we remove it when
we make a new model. (Strictly, age at 0.06 is also above 0.05; it is kept here
as borderline.) To make the new model:

mod_2=lm(status~age+gender+sugar, data=d)
summary(mod_2)
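The status/age/gender/sugar model in the example is hypothetical, so as a runnable stand-in, here is the same drop-and-refit step on the built-in mtcars data:

```r
mod_1 <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(mod_1)   # inspect the p-values; suppose qsec's is above 0.05

# Refit without the high-p-value predictor
mod_2 <- lm(mpg ~ wt + hp, data = mtcars)
summary(mod_2)
```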
