
Text Analytics

Types of Variables
1. Nominal Variable / Categorical Variable
• No intrinsic ordering
• Two or more categories

2. Ordinal Variable
• Order is significant
• Size of the difference between categories is inconsistent
• For example, you might ask patients to express the amount of pain they are feeling on a scale of 1 to 10

3. Interval Variable
• Intervals between the values of the interval variable are equally spaced
• Zero point in an interval scale is arbitrary

4. Ratio Variable
• Clear definition of 0.0
• Doesn’t have a negative number, unlike interval scale
• For example, variables like height, weight and age

The temperature in an air-conditioned room is 16 degrees Celsius and the temperature outside the AC room is 32 degrees
Celsius. It is reasonable to say that the temperature outside is 16 degrees higher than inside the room.
But if you said that it is twice as hot outside as inside, you would be thermodynamically incorrect.
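A quick arithmetic check in base R makes the point (the Kelvin conversion is standard; the variable names are just for illustration):

```r
# Celsius is an interval scale: differences are meaningful, ratios are not.
inside_c  <- 16
outside_c <- 32

# The difference is valid on either scale:
outside_c - inside_c               # 16 degrees warmer outside

# Ratios only make sense on a ratio scale such as Kelvin,
# where zero is a true (absolute) zero:
inside_k  <- inside_c + 273.15
outside_k <- outside_c + 273.15
outside_k / inside_k               # ~1.055, nowhere near "twice as hot"
```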

DOES IT MATTER IF MY DEPENDENT VARIABLE IS NORMALLY DISTRIBUTED?

While doing a t-test or ANOVA, we assume that the distribution of the sample means is normal. But in general we need
not be bothered about this very much.
This is due to the "central limit theorem", which shows that even when a population is non-normally distributed, the
distribution of the "sample means" will be approximately normal when the sample size is 30 or more.
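This is easy to see with a small simulation in R (a sketch; the exponential population and the seed are arbitrary choices, not from the slides):

```r
# Central limit theorem in action: draw 1000 samples of size 30 from a
# heavily skewed (exponential) population and look at the sample means.
set.seed(42)
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))

hist(sample_means)   # roughly bell-shaped despite the skewed population
mean(sample_means)   # close to the population mean of 1
```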
Stats Refresher

hsb2 Dataset: A data frame with 200 observations and 11 variables

The attach function allows access to the variables of a data.frame without prefixing them with the data.frame's name.
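A minimal sketch of the idea (using a toy two-column data frame standing in for the real hsb2, which would normally be loaded first):

```r
# Toy data frame standing in for hsb2
hsb2 <- data.frame(write = c(52, 59, 33), read = c(57, 68, 44))

attach(hsb2)
mean(write)    # 48, same as mean(hsb2$write)
detach(hsb2)   # detach when done, to avoid masking other objects
```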

Table for choosing the correct statistical test

Dependent \ Independent    Categorical       Continuous
Categorical                Chi-Square        Logistic Regression
Continuous                 ANOVA / t-test    Correlation

Tests Performed on the Dataset

T1: IS THE MEAN OF THE WRITE SCORE = 50?

t.test(write, mu = 50)

T2: DOES THE MEAN OF THE WRITE SCORE VARY ACROSS GENDER?

t.test(write ~ female)
T3: IS THERE A RELATIONSHIP BETWEEN THE TYPE OF SCHOOL ATTENDED (SCHTYP) AND STUDENTS' GENDER (FEMALE)?

mytable <- table(female, schtyp)
mytable
chisq.test(table(female, schtyp))

T4: DOES THE MEAN OF WRITE DIFFER BETWEEN THE THREE PROGRAM TYPES (PROG)?
(ANOVA applies when you have a categorical independent variable with more than two categories.)

summary(aov(write ~ prog))
install.packages("data.table")
library(data.table)
dt <- data.table(hsb2)
dt[, list(mean = mean(write), sd = sd(write)), by = prog]

T5: DOES THE MEAN OF READ EQUAL THE MEAN OF WRITE?

t.test(write, read, paired = TRUE)

For example: analysis of the performance of employees before and after a training program.
T6: TEST WHETHER THE OBSERVED PROPORTIONS FROM OUR SAMPLE DIFFER SIGNIFICANTLY FROM HYPOTHESIZED PROPORTIONS
(a chi-square goodness-of-fit test)

table(race)
chisq.test(table(race), p = c(10, 10, 10, 70)/100)

T7: HOW DO WRITE SCORES VARY WITH GENDER AND SES LEVEL?
(a factorial ANOVA: two or more categorical independent variables)

anova(lm(write ~ female * ses, data = hsb2))

Correlation

T8: VALIDATE THE RELATIONSHIP BETWEEN READ AND WRITE

cor.test(read, write)

For the % of variability: rounding the correlation of 0.597 to 0.6 and squaring gives 0.36, which multiplied by 100 is
36%. Hence read shares about 36% of its variability with write.
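The same arithmetic in R, using the correlation reported above:

```r
r <- 0.597          # correlation between read and write (from cor.test)
r^2                 # ~0.356: the proportion of shared variability
round(r^2 * 100)    # ~36, i.e. about 36%
```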
Partial correlation is a measure of the strength and direction of a linear relationship between two continuous
variables whilst controlling for the effect of one or more other continuous variables.

cor.test(read, write)
mm1 <- lm(read ~ female)
res1 <- mm1$residuals
mm2 <- lm(write ~ female)
res2 <- mm2$residuals
cor.test(res1, res2)

Linear regression allows us to look at the linear relationship between two variables.

lm(write ~ read)

T9: HOW DOES WRITE (AS THE DEPENDENT VARIABLE) VARY WITH PROG AND READ (AS INDEPENDENT VARIABLES)?

In ANCOVA, in addition to categorical predictors, you also have continuous predictors.

summary(aov(write ~ prog + read))
Artificial Intelligence
The simulation of human intelligence by machines. This can include
information acquisition, reasoning and
self-correction.

Machine Learning
An application of AI which enables
computers to learn automatically without
being explicitly programmed to do so.

Deep Learning
Deep learning is a subset of machine
learning in artificial intelligence (AI) that
has networks capable of learning
unsupervised from data that is
unstructured or unlabeled
Supervised Learning
Learning in which we train the machine using data that is well labelled (already mapped to the correct output). The machine is then
provided with a new set of data, so that the supervised learning algorithm analyses the training data and produces a correct outcome
from the labelled data.

These algorithms have a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors
(independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process
continues until the model achieves a desired level of accuracy on the training data.

Regression, Decision Tree, Random Forest, KNN, Logistic Regression

Unsupervised Learning
Training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information
without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences
without any prior training on the data.

In these algorithms, we do not have any target or outcome variable to predict or estimate. They are used for clustering a population
into different groups, which is widely applied for segmenting customers into groups for specific interventions.

Apriori algorithm, K‐means.


Linear Regression
It is used to estimate real values. It helps us establish the relationship between independent and dependent variables by fitting
a best-fit line. This line is known as the regression line and is represented by the linear equation Y = a*X + b.

• Simple Linear Regression: one independent variable
• Multiple Linear Regression: multiple (more than one) independent variables
Decision Tree

● Used for classification problems.


● Works for both categorical and continuous
dependent variables.
● The population gets split into two or more
homogeneous sets. This is done based on most
significant attributes/ independent variables to make
as distinct groups as possible.
● To split the population into distinct groups, it uses various
techniques like Gini, information gain, chi-square and entropy.
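A minimal sketch of growing a classification tree in R, using the rpart package and the built-in iris data (both are assumptions for illustration; they are not part of the hsb2 examples above). rpart uses the Gini index by default for classification splits:

```r
# Classification tree on the built-in iris data (Gini splits by default)
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                           # the chosen splits and node purity

pred <- predict(fit, iris, type = "class")
table(pred, iris$Species)            # confusion matrix on the training data
```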
Random Forests
● It's a collection of decision trees.
● To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class.
The forest chooses the classification having the most votes (over all the trees in the forest).
● Key concepts:
1. Random sampling of training data points when building trees
2. Random subsets of features considered when splitting nodes

Steps
1. Create a bootstrapped dataset
2. Create a decision tree using the bootstrapped dataset, but only use a random subset of variables at each step
3. Repeat steps 1 and 2 with a new bootstrapped dataset
4. Classify a new sample by tallying the votes across all the trees
5. Measure the accuracy of the random forest by the proportion of out-of-bag samples that it classifies correctly
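The steps above can be sketched with the randomForest package on the built-in iris data (both are assumptions for illustration; install.packages("randomForest") if it is not already available). The package handles the bootstrapping, random feature subsets and out-of-bag accounting internally:

```r
# Random forest on the built-in iris data
library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf)        # shows the out-of-bag (OOB) error estimate
rf$confusion     # per-class OOB confusion matrix
```

predict(rf) with no new data returns the out-of-bag predictions, which is how step 5's accuracy is measured.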
Naive Bayes
Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any
other feature.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c).
P(No|Rainy) = P(Rainy|No) * P(No) / P(Rainy)
P(Rainy|No) = 3/5
P(No) = 5/14
P(Rainy) = 5/14
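Plugging the numbers above into Bayes' theorem in R:

```r
# Posterior P(No | Rainy) from the slide's probabilities
p_rainy_given_no <- 3/5
p_no             <- 5/14
p_rainy          <- 5/14

p_no_given_rainy <- p_rainy_given_no * p_no / p_rainy
p_no_given_rainy    # 0.6
```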
KNN
• It can be used for both classification and regression problems. K-nearest neighbours is a simple algorithm that
stores all available cases and classifies new cases by a majority vote of their k neighbours. The case is assigned to
the class most common amongst its k nearest neighbours, measured by a distance function.

• These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. If k = 1, then the case is
simply assigned to the class of its nearest neighbour.
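A sketch of KNN classification in R, using knn() from the class package (which ships with R) on the built-in iris data; the 100/50 train/test split is an arbitrary choice for illustration:

```r
# k-nearest neighbours on the built-in iris data
library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)        # 100 training rows, 50 held out
train <- iris[train_idx, 1:4]               # numeric features only
test  <- iris[-train_idx, 1:4]

# Each test case is assigned the majority class of its 3 nearest
# (Euclidean-distance) training neighbours
pred <- knn(train, test, cl = iris$Species[train_idx], k = 3)
table(pred, iris$Species[-train_idx])       # confusion matrix on held-out rows
```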
K-means
It is a type of unsupervised algorithm used for clustering problems. Its procedure follows a simple and easy way to
classify a given data set into a certain number of clusters. Data points inside a cluster are homogeneous, and
heterogeneous with respect to peer clusters.
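A sketch with base R's kmeans() on the numeric columns of the built-in iris data (the choice of data and of 3 clusters is for illustration only):

```r
# k-means on the four numeric columns of iris, asking for 3 clusters;
# nstart = 20 reruns the algorithm from 20 random starts and keeps the best
set.seed(123)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

km$size                           # points per cluster
table(km$cluster, iris$Species)   # clusters line up loosely with species
```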
THANK YOU!
