The one-way analysis of variance (ANOVA) is used to determine whether there are
any statistically significant differences between the means of two or more
independent (unrelated) groups (although you tend to only see it used when there
are a minimum of three, rather than two groups). For example, you could use a one-
way ANOVA to understand whether exam performance differed based on test
anxiety levels amongst students, dividing students into three independent groups
(e.g., low, medium and high-stressed students). Also, it is important to realize that
the one-way ANOVA is an omnibus test statistic and cannot tell you which specific
groups were statistically significantly different from each other; it only tells you that
at least two groups were different. Since you may have three, four, five or more
groups in your study design, determining which of these groups differ from each
other is important; this is typically done with a post hoc test such as Tukey's HSD.
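As an illustration, a one-way ANOVA followed by a post hoc comparison in R might look like the sketch below (the data are simulated and the group names are made up for this example):

```r
# Sketch: one-way ANOVA on simulated exam scores for three anxiety groups
set.seed(1)
exam_score <- c(rnorm(30, mean = 75, sd = 8),   # low-anxiety students
               rnorm(30, mean = 70, sd = 8),   # medium-anxiety students
               rnorm(30, mean = 62, sd = 8))   # high-anxiety students
anxiety <- factor(rep(c("low", "medium", "high"), each = 30))

mod <- aov(exam_score ~ anxiety)
summary(mod)    # omnibus F-test: do ANY of the group means differ?

# The omnibus test cannot say WHICH groups differ; a post hoc test can:
TukeyHSD(mod)   # all pairwise comparisons, with adjusted p-values
```

The `summary()` call answers only the omnibus question; `TukeyHSD()` then identifies the specific pairs of groups that differ.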
Dependent variable
Your dependent variable should be measured at the interval or ratio level (i.e., it
is continuous). Examples of variables that meet this criterion include
revision time (measured in hours), intelligence (measured using IQ score), exam
performance (measured from 0 to 100), weight (measured in kg), and so forth.
Independent variable
Your independent variable should consist of two or more
categorical, independent groups. Typically, a one-way ANOVA is used when you
have three or more categorical, independent groups, but it can be used for just two
groups (but an independent-samples t-test is more commonly used for two groups).
Example independent variables that meet this criterion include ethnicity (e.g., 3
groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4
groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon,
doctor, nurse, dentist, therapist), and so forth.
lm (in R, both regression and ANOVA can be fit as linear models)
Source: https://www.wallstreetmojo.com/regression-vs-anova/
So if you're trying to predict the duration of breastfeeding in weeks (continuous
dependent) using mother's age (continuous independent) as a predictor variable,
then you would use a regression model.
If you are trying to predict the duration of breastfeeding in weeks using prenatal
smoking status (smoked during pregnancy, did not smoke during pregnancy), then
you would use a two-sample t-test. If you added delivery type (vaginal/c-section) to
prenatal smoking status, then the two binary predictor variables would be analyzed
using a (two-way) ANOVA model.
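The three cases above can be sketched in R as follows (the data here are simulated purely to make the code self-contained; the variable names mirror the breastfeeding example):

```r
# Sketch: how the predictor types drive the choice of model (simulated data)
set.seed(2)
n <- 100
mother_age <- rnorm(n, mean = 28, sd = 5)
smoked     <- factor(sample(c("smoked", "not_smoked"), n, replace = TRUE))
delivery   <- factor(sample(c("vaginal", "c_section"), n, replace = TRUE))
weeks      <- 20 + 0.3 * mother_age + rnorm(n, sd = 4)  # duration of breastfeeding

# Continuous predictor -> simple linear regression
summary(lm(weeks ~ mother_age))

# One binary predictor -> two-sample t-test
t.test(weeks ~ smoked)

# Two categorical predictors -> (two-way) ANOVA
summary(aov(weeks ~ smoked * delivery))
```

The `*` in the ANOVA formula also includes the smoking-by-delivery interaction; use `+` instead if only the main effects are wanted.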
Decision tree
A single decision tree does not use bootstrap samples; bagging and random forests
do.
Example data file: Hitters
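A minimal sketch of this distinction, assuming the `rpart` and `randomForest` packages are installed (they are not part of base R), using the built-in `mtcars` data rather than the Hitters file:

```r
# Sketch: single tree vs. bagging vs. random forest
library(rpart)         # single decision tree: no bootstrap
library(randomForest)  # bagging / random forest: bootstrap samples of rows

# One tree, fit to the full data as-is
tree_fit <- rpart(mpg ~ ., data = mtcars)

# Bagging = random forest with mtry set to ALL predictors;
# each tree is grown on a bootstrap sample of the rows
bag_fit <- randomForest(mpg ~ ., data = mtcars, mtry = ncol(mtcars) - 1)

# Random forest additionally samples mtry < p predictors at each split
rf_fit <- randomForest(mpg ~ ., data = mtcars)
```

Bagging is thus a special case of random forest in this implementation; the only difference is how many predictors are considered at each split.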
Codes

# Loading data
d <- read.csv("D:/IIM K/Quarter 5/PA NEW/2. Linear model/Advertising (1).csv",
              header = T)
attach(d)

# Scatter plots: split the plotting panel into a 1 x 3 grid
# (for having three scatter plots in a single row)
par(mfrow = c(1, 3))

# Plotting Sales vs TV
plot(TV, Sales, col = "red", lwd = 2)

# Variance
var(x)

# Covariance
cov(x, y)
R-square
R-square = 0.90 (i.e., 90%)
implies that 90% of the variation in y is explained by the variation in x.
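This can be checked directly from a fitted model; a small sketch on simulated data:

```r
# Sketch: extracting R-squared from a fitted linear model
set.seed(3)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.5)

mod <- lm(y ~ x)
summary(mod)$r.squared  # proportion of the variance in y explained by x
```

A value near 1 means x explains most of the variation in y; a value near 0 means it explains almost none.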
Step # 2
Reading the data
d <- read.csv(_________, header = T)
Step # 3
names(d)
This lists the column headings in the data set.
Step # 4
dim(d)
This gives the dimensions (number of rows and columns) of the data.
Step # 5
str(d)
This shows how each variable is stored and interpreted by R (e.g., integer).
On a Likert scale the responses are numbers, but they are categorical; R might read
them as integers. This command shows how R actually interprets them, and we then
convert such variables with as.factor().
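A small sketch of this conversion on a toy data frame (the column name is made up for illustration):

```r
# Sketch: a Likert item read as a number, then converted to a factor
d <- data.frame(satisfaction = c(1, 3, 5, 2, 4))
str(d)  # satisfaction shows up as a numeric variable

d$satisfaction <- as.factor(d$satisfaction)
str(d)  # now a factor, so models treat it as categorical
```

After the conversion, lm() and aov() will create dummy variables for the levels instead of treating the codes as a continuous scale.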
Step # 6
d <- na.omit(d)
This omits rows with missing data (na.omit() returns a copy, so assign the result
back to d).
Step # 7
dim(d)
Running this again shows whether any rows have been dropped, and how many data
points are left.
E.g., if there were 1000 data points before na.omit() and only 200 after, check
whether a particular column is responsible for most of the missingness, and then
delete that column instead.
colSums(is.na(d))
This counts the missing (NA) values in each column.
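The two missing-data steps can be sketched together on a toy data frame (column names are illustrative):

```r
# Sketch: counting NAs per column, then dropping incomplete rows
d <- data.frame(age   = c(25, NA, 40),
                sugar = c(90, 110, NA))

colSums(is.na(d))  # NA count per column: age has 1, sugar has 1

d2 <- na.omit(d)   # drops every row containing any NA
dim(d2)            # only the first row remains
```

Comparing dim() before and after shows how much data na.omit() removed, which is exactly the check described in Step # 7.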
Then plot all the data to check for interaction effects, etc.
Regression
status ~ age This models status as a function of age.
status ~ . This models status as a function of all other variables in the data.
mod_1 <- lm(status ~ ., data = d)
summary(mod_1)
This gives a summary of the model.
From this, keep only the predictors with a low p-value (less than 0.05).
mod_2 <- lm(status ~ age + gender + sugar, data = d)
summary(mod_2)
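The "keep the low p-value predictors" step can be done programmatically; a sketch on simulated data (the variable names status/age/gender/sugar are placeholders mirroring the notes, not a real data set):

```r
# Sketch: fit a full model, then inspect which terms have p < 0.05
set.seed(4)
d <- data.frame(age    = rnorm(100, mean = 50, sd = 10),
                gender = factor(sample(c("M", "F"), 100, replace = TRUE)),
                sugar  = rnorm(100, mean = 100, sd = 15))
d$status <- 0.5 * d$age + 0.2 * d$sugar + rnorm(100, sd = 5)

mod_1 <- lm(status ~ ., data = d)
coefs <- summary(mod_1)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]  # rows of the coefficient table with p < 0.05

# Refit using only the retained predictors
mod_2 <- lm(status ~ age + sugar, data = d)
summary(mod_2)
```

Note that dropping predictors purely on p-values is a rough heuristic; the p-values in the refitted model no longer account for the selection step.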