Types of Variables
2. Ordinal Variable
• Order is significant
• Size of the difference between categories is inconsistent
• For example, you might ask patients to express the amount of pain they are feeling on a scale of 1 to 10
3. Interval Variable
• Intervals between the values of the interval variable are equally spaced
• Zero point in an interval scale is arbitrary
4. Ratio Variable
• Clear definition of 0.0
• Cannot take negative values, unlike an interval scale
• For example, variables like height, weight and age
For example, the temperature in an air-conditioned room is 16 degrees Celsius and the temperature outside the room is 32 degrees Celsius. It is reasonable to say that the temperature outside is 16 degrees higher than inside the room.
But if you said that it is twice as hot outside as inside, you would be thermodynamically incorrect, because the zero point of the Celsius scale is arbitrary (Celsius is an interval scale, not a ratio scale).
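The temperature example can be checked numerically. The sketch below is only an illustration: the variable names are made up here, and the conversion to Kelvin (a scale with a true zero) is added to show what a meaningful ratio would look like.

```r
# Temperatures from the example, in degrees Celsius
inside_c  <- 16
outside_c <- 32

# Differences are meaningful on an interval scale
diff_c <- outside_c - inside_c

# Ratios need a true zero, so convert to Kelvin before dividing
inside_k  <- inside_c + 273.15
outside_k <- outside_c + 273.15

ratio_c <- outside_c / inside_c   # 2, but not physically meaningful
ratio_k <- outside_k / inside_k   # only slightly above 1

print(c(diff_c = diff_c, ratio_c = ratio_c, ratio_k = round(ratio_k, 3)))
```

On the Kelvin scale the outside air is only a few percent hotter, not twice as hot.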
hsb2 Dataset: a data frame with 200 observations and 11 variables
t.test(write, mu = 50)
t.test(write ~ female)
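The two calls above refer to columns by bare name, which works only if the data frame has been attached or the call is wrapped in with() or given a data argument. The sketch below shows one way to run them; hsb2_demo is a small simulated stand-in invented for this example, not the real hsb2 data.

```r
# Hypothetical stand-in for hsb2 with only the two columns these tests need
set.seed(1)
hsb2_demo <- data.frame(
  write  = round(rnorm(40, mean = 52, sd = 9)),
  female = rep(c("male", "female"), each = 20)
)

# One-sample t-test: is the mean of write different from 50?
one_sample <- with(hsb2_demo, t.test(write, mu = 50))

# Two-sample t-test: does mean write differ between the two groups?
two_sample <- t.test(write ~ female, data = hsb2_demo)

print(one_sample$p.value)
print(two_sample$p.value)
```

With the real dataset, replacing hsb2_demo by hsb2 should give the same analyses as the slide.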
Dependent \ Independent: Categorical, Continuous
T3: IS THERE A RELATIONSHIP BETWEEN
T4: DOES THE MEAN OF WRITE DIFFER BETWEEN THE THREE PROGRAM TYPES (PROG)
(when you have a categorical independent variable with more than two categories)
summary(aov(write ~ factor(prog)))   # factor() makes sure prog is treated as categorical
install.packages("data.table")
library(data.table)
dt <- data.table(hsb2)
dt[,list(mean=mean(write),sd=sd(write)),by=prog]
T5: DOES THE MEAN OF READ EQUAL THE MEAN OF WRITE
t.test(write, read, paired = TRUE)
For example: analysis of the performance of employees before and after a training program
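A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below illustrates this with made-up before/after scores for six employees; the numbers are assumptions for the example only.

```r
# Hypothetical scores for six employees before and after training
before <- c(62, 70, 55, 68, 74, 60)
after  <- c(66, 73, 59, 67, 80, 65)

# Paired t-test on the two measurements
paired <- t.test(after, before, paired = TRUE)

# Same test expressed as a one-sample test on the differences
on_diffs <- t.test(after - before, mu = 0)

print(paired$statistic)
print(on_diffs$statistic)
```

Both calls report the same t statistic and degrees of freedom, which is why pairing matters: the test looks only at each person's change.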
T6: TEST WHETHER THE OBSERVED PROPORTIONS FROM OUR SAMPLE DIFFER SIGNIFICANTLY FROM HYPOTHESIZED PROPORTIONS
A chi-square goodness of fit
table(race)
chisq.test(table(race), p = c(10, 10, 10, 70)/100)
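To see what the goodness-of-fit test computes, the statistic can be rebuilt by hand as the sum of (observed - expected)^2 / expected. The sketch below uses illustrative counts assumed for this example (they are not given in the slides) together with the hypothesized proportions from the call above.

```r
# Illustrative observed counts for four categories (200 cases in total)
observed <- c(24, 11, 20, 145)
p_null   <- c(10, 10, 10, 70) / 100   # hypothesized proportions

# Expected counts under the null hypothesis
expected <- sum(observed) * p_null

# Chi-square statistic and its p-value with (categories - 1) degrees of freedom
chi_sq  <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

# The built-in test should agree with the manual calculation
builtin <- chisq.test(observed, p = p_null)

print(c(manual = chi_sq, builtin = unname(builtin$statistic)))
```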
T7: HOW DO WRITE SCORES VARY WITH GENDER AND SES LEVEL
A factorial ANOVA: two or more categorical independent variables
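A factorial ANOVA of this kind could be run with aov() and an interaction term. The sketch below is an assumption about how such a call might look, not code from the slides; demo is a simulated data frame with invented female, ses and write columns.

```r
# Simulated stand-in data: two genders crossed with three SES levels
set.seed(2)
demo <- data.frame(
  female = factor(rep(c("male", "female"), times = 30)),
  ses    = factor(rep(c("low", "middle", "high"), each = 20))
)
demo$write <- round(rnorm(60, mean = 52, sd = 9))

# female * ses expands to both main effects plus their interaction
fit <- aov(write ~ female * ses, data = demo)
tab <- summary(fit)[[1]]

print(tab)
```

The table has one row per main effect, one for the interaction, and one for the residuals.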
Correlation
T8: VALIDATE RELATIONSHIP BETWEEN READ & WRITE
cor.test(read, write)
For % of variability: rounding the correlation of 0.597 to 0.6 and squaring it gives 0.36, which multiplied by 100 is 36%. Hence read shares about 36% of its variability with write.
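The same arithmetic without rounding first, as a quick check (the variable names are chosen for this example):

```r
r <- 0.597          # correlation between read and write, as reported in the text
shared <- r^2       # proportion of shared variability (r squared)

print(round(100 * shared, 1))   # percent of variability shared
```

The unrounded figure is a little under 36%, so the rounded value in the text is a fair approximation.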
Partial correlation is a measure of the strength and direction of a linear relationship between two continuous variables whilst controlling for the effect of one or more other continuous variables.
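cor.test(read, write)    # plain correlation, for comparison
mm1 = lm(read~female)    # regress read on the control variable
res1 = mm1$residuals     # part of read not explained by female
mm2 = lm(write~female)   # regress write on the control variable
res2 = mm2$residuals     # part of write not explained by female
cor.test(res1,res2)      # correlation of the residuals = partial correlation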
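The residual approach above gives the same coefficient as the closed-form first-order partial correlation, (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The sketch below checks this on simulated data; x, y and z are hypothetical stand-ins for read, write and female.

```r
set.seed(3)
n <- 100
z <- rbinom(n, 1, 0.5)                         # hypothetical binary control
x <- 50 + 3 * z + rnorm(n, sd = 8)             # hypothetical "read"
y <- 48 + 5 * z + 0.5 * x + rnorm(n, sd = 6)   # hypothetical "write"

# Residual approach: correlate what is left after removing the effect of z
r_resid <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Closed-form partial correlation from the three pairwise correlations
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)
r_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

print(c(residuals = r_resid, formula = r_formula))
```

One caveat: cor.test() on the residuals uses n - 2 degrees of freedom, whereas a partial correlation with one control variable has n - 3, so the reported p-value is slightly optimistic.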
Machine Learning
An application of AI that enables computers to learn automatically from data without being explicitly programmed to do so.
Deep Learning
Deep learning is a subset of machine learning in artificial intelligence (AI) that uses neural networks capable of learning, without supervision, from data that is unstructured or unlabeled.
Supervised Learning
Learning in which we train the machine using well-labeled data (each example is already mapped to its correct output). The supervised learning algorithm analyses the training data; when it is then given a new set of data, it uses what it learned from the labeled examples to predict the outcome.
The algorithm has a target / outcome variable (or dependent variable) that is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data.
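A minimal sketch of this train-then-predict workflow, using a linear model as the learned function. The data are simulated for the example, and lm() is just one possible choice of supervised learner.

```r
# Simulated labeled training data: y is the known outcome for each x
set.seed(4)
train <- data.frame(x = runif(50, 0, 10))
train$y <- 2 + 3 * train$x + rnorm(50, sd = 1)

# Learn a function that maps the predictor to the outcome
model <- lm(y ~ x, data = train)

# New, unlabeled inputs: the fitted model supplies the predicted outcome
new_data <- data.frame(x = c(2, 5, 8))
pred <- predict(model, newdata = new_data)

print(pred)
```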
Unsupervised Learning
Training a machine on information that is neither classified nor labeled, and allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences, without any prior training on labeled data.
With this type of algorithm, there is no target or outcome variable to predict / estimate. It is used for clustering a population into different groups, for example segmenting customers into groups for specific interventions.
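As an illustration of clustering without labels, the sketch below groups simulated customers with k-means. Both the data and the choice of k-means are assumptions for the example; the slides do not name a specific clustering algorithm.

```r
set.seed(5)
# Two hypothetical customer groups that differ in spend and visit frequency
g1 <- cbind(spend = rnorm(25, 20, 2), visits = rnorm(25, 2, 0.5))
g2 <- cbind(spend = rnorm(25, 80, 2), visits = rnorm(25, 10, 0.5))
customers <- rbind(g1, g2)

# No outcome variable is supplied: the algorithm only sees the features
km <- kmeans(customers, centers = 2, nstart = 10)

print(table(km$cluster))
```

Because the two simulated groups are far apart, each one ends up in its own cluster even though the algorithm never saw a label.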
Linear Regression
Random Forest
Steps
1. Create a bootstrapped dataset
2. Create a decision tree using bootstrapped dataset but only use a random subset of variables at each step
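3. Repeat steps 1 and 2 with a new bootstrapped dataset each time to build many trees
4. To classify a sample, run it through every tree and count the votes for each class; the class with the most votes is the prediction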
5. Measure the accuracy of random forest by the proportion of Out-of-Bag samples that it was able to classify correctly
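The sampling parts of these steps can be sketched in base R. This is not a full random forest (no trees are grown); it only illustrates how one bootstrapped dataset, its Out-of-Bag rows, and a random subset of variables might be drawn. The row count, predictor names, and the square-root rule for the subset size are assumptions for the example.

```r
set.seed(6)
n <- 150                                   # hypothetical number of training rows
predictors <- c("x1", "x2", "x3", "x4")    # hypothetical predictor names

# Step 1: a bootstrapped dataset is n row indices drawn with replacement
boot_idx <- sample(n, size = n, replace = TRUE)

# Out-of-Bag rows are the ones never drawn into this bootstrap sample
oob_idx <- setdiff(seq_len(n), boot_idx)

# Step 2 (in part): random subset of variables to consider at one split
mtry <- floor(sqrt(length(predictors)))
split_vars <- sample(predictors, mtry)

print(length(oob_idx) / n)   # share of rows left out of this bootstrap sample
print(split_vars)
```

Each tree would be grown on the rows in boot_idx and evaluated on the rows in oob_idx, which is what makes the Out-of-Bag accuracy in step 5 possible without a separate test set.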
Naive Bayes
The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c).
P(No|Rainy) = P(Rainy|No) * P(No) / P(Rainy)
P(Rainy|No) = 3/5
P(No) = 5/14
P(Rainy) = 5/14
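Plugging the three values into Bayes' theorem gives the posterior directly. The variable names below are chosen for this example.

```r
p_rainy_given_no <- 3 / 5    # likelihood
p_no             <- 5 / 14   # prior probability of the class "No"
p_rainy          <- 5 / 14   # probability of the evidence "Rainy"

# Bayes' theorem: posterior = likelihood * prior / evidence
p_no_given_rainy <- p_rainy_given_no * p_no / p_rainy

print(p_no_given_rainy)
```

Because the prior and the evidence happen to be equal here, the posterior equals the likelihood, 0.6.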
KNN
• It can be used for both classification and regression problems. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors: a new case is assigned to the class most common among its k nearest neighbors, as measured by a distance function.
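A minimal from-scratch sketch of KNN classification, assuming Euclidean distance as the distance function. The function name knn_predict and the toy data are invented for this example.

```r
# Classify one new case by majority vote among its k nearest training cases
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every stored training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training cases
  nearest <- order(d)[seq_len(k)]
  # Majority vote over their class labels
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data: three cases of class "A" and three of class "B"
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "A", "B", "B", "B")

print(knn_predict(train_x, train_y, c(1.5, 1.5)))   # "A"
print(knn_predict(train_x, train_y, c(6.5, 6.5)))   # "B"
```

For regression, the vote would be replaced by the mean of the neighbors' outcome values.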