
Text Analytics

Types of Variables

1. Nominal Variable / Categorical Variable


• No intrinsic ordering
• Two or more categories

2. Ordinal Variable
• Order is significant
• Size of the difference between categories is inconsistent
• For example, you might ask patients to express the amount of pain they are feeling on a scale of 1 to 10

3. Interval Variable
• Intervals between the values of the interval variable are equally spaced
• The zero point on an interval scale is arbitrary

4. Ratio Variable
• Has a clearly defined, true zero point
• Cannot take negative values, unlike an interval scale
• For example, variables like height, weight and age

The temperature in an air-conditioned room is 16 degrees Celsius and the temperature outside the AC room is 32 degrees Celsius. It is reasonable to say that the temperature outside is 16 degrees higher than inside the room. But if you said that it is twice as hot outside as inside, you would be thermodynamically incorrect, because the zero point of the Celsius scale is arbitrary.
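The point can be checked with a quick calculation: ratios are only meaningful on a scale with a true zero, such as Kelvin.

```r
# Celsius has an arbitrary zero, so ratios of Celsius values are meaningless.
inside_c  <- 16
outside_c <- 32
outside_c / inside_c            # 2, but "twice as hot" is wrong

# Convert to Kelvin (a ratio scale with a true zero) to compare actual heat.
inside_k  <- inside_c + 273.15  # 289.15 K
outside_k <- outside_c + 273.15 # 305.15 K
outside_k / inside_k            # about 1.055, only ~5.5% "hotter"
```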

DOES IT MATTER IF MY DEPENDENT VARIABLE IS NORMALLY DISTRIBUTED?

While doing a t-test or ANOVA, we assume that the distribution of the sample means is normally distributed. In general, though, we need not be too concerned about this. The central limit theorem shows that even when a population is non-normally distributed, the distribution of the sample means will be approximately normal when the sample size is 30 or more.
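The theorem is easy to see in a small simulation, for example drawing samples of size 30 from a strongly skewed (exponential) population and looking at the distribution of the sample means:

```r
# Simulate 5000 sample means, each from 30 draws of a skewed population.
set.seed(42)
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))

# The population is skewed, but the means cluster symmetrically around 1
# (the population mean), with spread close to 1/sqrt(30).
mean(sample_means)
sd(sample_means)
hist(sample_means, breaks = 50, main = "Distribution of sample means")
```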
Stats Refresher

hsb2 Dataset: a data frame with 200 observations and 11 variables.

The attach() function allows you to access the variables of a data frame by name, without having to prefix them with the data frame's name.
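A minimal sketch of how attach() works, using a tiny made-up stand-in data frame (the real hsb2 data frame has 200 rows and 11 variables and must be loaded separately):

```r
# Illustrative stand-in with a few hsb2-style columns (made up, not real hsb2 data).
hsb2_demo <- data.frame(
  write  = c(52, 59, 33, 44, 52),
  read   = c(57, 68, 44, 63, 47),
  female = c(1, 1, 0, 0, 1)
)

mean(hsb2_demo$write)   # without attach(): prefix every variable; 48 here

attach(hsb2_demo)       # make the columns visible by name
mean(write)             # now 'write' refers to hsb2_demo$write
detach(hsb2_demo)       # good practice: detach when done
```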

Table for choosing the correct statistical test

Dependent \ Independent   Categorical       Continuous
Categorical               Chi-Square        Logistic Regression
Continuous                ANOVA / T-Test    Correlation

Tests Performed on the Database

T1: IS THE MEAN OF WRITE SCORE = 50?

t.test(write, mu = 50)

T2: DOES THE MEAN OF WRITE SCORE VARY ACROSS GENDER?

t.test(write ~ female)
T3: IS THERE A RELATIONSHIP BETWEEN THE TYPE OF SCHOOL ATTENDED (SCHTYP) AND STUDENTS’ GENDER (FEMALE)?

mytable <- table(female, schtyp)
mytable
chisq.test(table(female, schtyp))

T4: DOES THE MEAN OF WRITE DIFFER BETWEEN THE THREE PROGRAM TYPES (PROG)?
(ANOVA, used when you have a categorical independent variable with more than two categories)

summary(aov(write ~ prog))

install.packages("data.table")
library(data.table)
dt <- data.table(hsb2)
dt[, list(mean = mean(write), sd = sd(write)), by = prog]
T5: DOES THE MEAN OF READ EQUAL THE MEAN OF WRITE?

t.test(write, read, paired = TRUE)

A paired t-test; for example, analysis of the performance of employees before and after a training program.
T6: TEST WHETHER THE OBSERVED PROPORTIONS FROM OUR SAMPLE DIFFER SIGNIFICANTLY FROM HYPOTHESIZED PROPORTIONS
(A chi-square goodness-of-fit test)

table(race)
chisq.test(table(race), p = c(10, 10, 10, 70)/100)

T7: HOW DO WRITE SCORES VARY WITH GENDER AND SES LEVEL?
(A factorial ANOVA: two or more categorical independent variables)

anova(lm(write ~ female * ses, data = hsb2))

Correlation

T8: VALIDATE THE RELATIONSHIP BETWEEN READ AND WRITE

cor.test(read, write)

For % of variability: round the correlation of 0.597 to 0.6; squared, that is 0.36, which multiplied by 100 is 36%. Hence read shares about 36% of its variability with write.
Partial correlation is a measure of the strength and direction of a linear relationship between two continuous variables while controlling for the effect of one or more other variables.

mm1 <- lm(read ~ female)
res1 <- mm1$residuals
mm2 <- lm(write ~ female)
res2 <- mm2$residuals
cor.test(res1, res2)

Linear regression allows us to look at the linear relationship between a dependent variable and an independent variable.

lm(write ~ read)

T9: HOW DOES WRITE (AS THE DEPENDENT VARIABLE) VARY WITH PROG AND READ (AS INDEPENDENT VARIABLES)?

In ANCOVA, in addition to the categorical predictors you also have continuous predictors.

summary(aov(write ~ prog + read))
Artificial Intelligence
Simulation of human intelligence by machines. This can include information acquisition, reasoning and self-correction.

Machine Learning
An application of AI which enables computers to learn automatically without being explicitly programmed to do so.

Deep Learning
Deep learning is a subset of machine learning in artificial intelligence (AI) that has networks capable of learning unsupervised from data that is unstructured or unlabeled.
Supervised Learning
A type of learning in which we train the machine using data that is well labeled (already mapped to the correct output). The machine is then provided with a new set of data, and the supervised learning algorithm, having analysed the training data, produces a correct outcome from the labeled data.

These algorithms have a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data.

Examples: Regression, Decision Tree, Random Forest, KNN, Logistic Regression

Unsupervised Learning
Training a machine using information that is neither classified nor labelled, and allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences, without any prior training on the data.

These algorithms have no target or outcome variable to predict or estimate. They are used for clustering a population into different groups, which is widely applied to segmenting customers into groups for specific interventions.

Examples: Apriori algorithm, K-means


Linear Regression
It is used to estimate real values. It helps us establish the relationship between independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b.

Linear Regression comes in two forms:
• Simple Linear Regression: one independent variable
• Multiple Linear Regression: multiple (more than 1) independent variables
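A minimal sketch of simple linear regression in R, on synthetic data (the true slope and intercept below are made up for illustration):

```r
# Synthetic data from a known linear relationship: Y = 2*X + 5 plus noise.
set.seed(1)
x <- 1:50
y <- 2 * x + 5 + rnorm(50, sd = 3)

# Fit the regression line; lm() estimates the slope (a) and intercept (b).
fit <- lm(y ~ x)
coef(fit)                 # intercept near 5, slope near 2
summary(fit)$r.squared    # proportion of variability in y explained by x
```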
Decision Tree

● Used for classification problems.
● Works for both categorical and continuous dependent variables.
● The population gets split into two or more homogeneous sets. This is done based on the most significant attributes/independent variables, so as to make the groups as distinct as possible.
● To split the population into different heterogeneous groups, it uses various techniques like the Gini index, information gain, chi-square and entropy.
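As a minimal sketch (not from the original slides), a classification tree can be fitted with the rpart package that ships with R, here on the built-in iris data:

```r
library(rpart)

# Fit a classification tree; rpart splits on the Gini index by default.
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)

# Predict the class of the training rows and check the accuracy.
pred <- predict(tree, iris, type = "class")
mean(pred == iris$Species)   # accuracy well above chance
```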
Random Forests
● It is a collection of decision trees.
● To classify a new object based on attributes, each tree gives a classification, called a vote. The forest chooses the classification having the most votes (over all the trees in the forest).
● Key concepts:
1. Random sampling of training data points when building trees
2. Random subsets of features considered when splitting nodes

Steps
1. Create a bootstrapped dataset.
2. Create a decision tree using the bootstrapped dataset, but only use a random subset of variables at each step.
3. Repeat steps 1 and 2, creating a new bootstrapped dataset each time.
4. Classify a new sample by tallying the votes across all the trees.
5. Measure the accuracy of the random forest by the proportion of Out-of-Bag samples that it classifies correctly.
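Steps 1 and 2 can be sketched in base R to show the two sources of randomness (in practice you would use a dedicated package such as randomForest, a separate CRAN install; this sketch uses the built-in iris data):

```r
set.seed(7)
n_rows     <- nrow(iris)
predictors <- setdiff(names(iris), "Species")

# Step 1: a bootstrapped dataset, sampling rows WITH replacement.
boot_idx  <- sample(n_rows, n_rows, replace = TRUE)
boot_data <- iris[boot_idx, ]

# Rows never drawn are the "Out-of-Bag" samples used to estimate accuracy.
oob_idx <- setdiff(seq_len(n_rows), boot_idx)

# Step 2: a random subset of variables considered at a split
# (a common default is the square root of the number of predictors).
m_try      <- floor(sqrt(length(predictors)))
split_vars <- sample(predictors, m_try)

length(oob_idx)   # roughly a third of the rows end up out-of-bag
split_vars        # the candidate variables for this split
```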
Naive Bayes
The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). Worked example (from a weather dataset with 14 observations):

P(No|Rainy) = P(Rainy|No) * P(No) / P(Rainy)
P(Rainy|No) = 3/5
P(No) = 5/14
P(Rainy) = 5/14
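Plugging those numbers into Bayes' theorem:

```r
p_rainy_given_no <- 3/5
p_no             <- 5/14
p_rainy          <- 5/14

# P(No) and P(Rainy) are both 5/14 here, so they cancel out.
p_no_given_rainy <- p_rainy_given_no * p_no / p_rainy
p_no_given_rainy   # 0.6
```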
KNN
• It can be used for both classification and regression problems. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. The case is assigned to the class most common amongst its K nearest neighbors, as measured by a distance function.
• These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. If K = 1, then the case is simply assigned to the class of its nearest neighbor.
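A minimal sketch of KNN classification with the class package that ships with R (Euclidean distance, majority vote of k = 3 neighbors), again on the built-in iris data:

```r
library(class)

# Hold out 50 rows as "new cases" and train on the remaining 100.
set.seed(3)
train_idx <- sample(nrow(iris), 100)
train_x   <- iris[train_idx, 1:4]
test_x    <- iris[-train_idx, 1:4]
train_y   <- iris$Species[train_idx]

# Classify each held-out row by the majority vote of its 3 nearest neighbors.
pred <- knn(train_x, test_x, cl = train_y, k = 3)
mean(pred == iris$Species[-train_idx])   # accuracy on the held-out rows
```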
K-means
It is a type of unsupervised algorithm used for clustering problems. Its procedure follows a simple and easy way to classify a given data set into a certain number of clusters. Data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters.
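A minimal sketch of k-means with base R's kmeans(), asking for 3 clusters of the built-in iris measurements:

```r
# Cluster the four numeric iris columns into 3 groups.
set.seed(2)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

km$size                           # number of points in each cluster
table(km$cluster, iris$Species)   # clusters align closely with the species
```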
THANK YOU!
