
Text Analytics

Types of Variables
1. Nominal Variable / Categorical Variable
• No intrinsic ordering
• Two or more categories

2. Ordinal Variable
• Order is significant
• Size of the difference between categories is inconsistent
• For example, you might ask patients to express the amount of pain they are feeling on a scale of 1 to 10

3. Interval Variable
• Intervals between the values of the interval variable are equally spaced
• Zero point in an interval scale is arbitrary

4. Ratio Variable
• Clear definition of 0.0
• Doesn’t have a negative number, unlike interval scale
• For example, variables like height, weight and age

The temperature in an air-conditioned room is 16 degrees Celsius and the temperature outside the AC room is 32 degrees
Celsius. It is reasonable to say that the temperature outside is 16 degrees higher than inside the room.
But if you said that it is twice as hot outside as inside, you would be thermodynamically incorrect.
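A quick arithmetic check in base R makes the point (the Kelvin conversion is standard; the variable names are just for illustration):

```r
# Celsius is an interval scale: differences are meaningful, ratios are not.
inside_c  <- 16
outside_c <- 32

# The difference is valid on either scale:
outside_c - inside_c               # 16 degrees warmer outside

# Ratios only make sense on a ratio scale such as Kelvin,
# where zero is a true (absolute) zero:
inside_k  <- inside_c + 273.15
outside_k <- outside_c + 273.15
outside_k / inside_k               # ~1.055, nowhere near "twice as hot"
```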

DOES IT MATTER IF MY DEPENDENT VARIABLE IS NORMALLY DISTRIBUTED?

While doing a t-test or ANOVA, we assume that the distribution of the sample means is normal. But in general we need
not be bothered about this very much.
This is due to the "central limit theorem", which shows that even when a population is non-normally distributed, the
distribution of the "sample means" will be approximately normal when the sample size is 30 or more.
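This is easy to see with a small simulation in R (a sketch; the exponential population and the seed are arbitrary choices, not from the slides):

```r
# Central limit theorem in action: draw 1000 samples of size 30 from a
# heavily skewed (exponential) population and look at the sample means.
set.seed(42)
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))

hist(sample_means)   # roughly bell-shaped despite the skewed population
mean(sample_means)   # close to the population mean of 1
```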
Stats Refresher

hsb2 Dataset: A data frame with 200 observations and 11 variables

The attach function allows access to the variables of a data.frame without prefixing them with the data.frame's name.
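A minimal sketch of the idea (using a toy two-column data frame standing in for the real hsb2, which would normally be loaded first):

```r
# Toy data frame standing in for hsb2
hsb2 <- data.frame(write = c(52, 59, 33), read = c(57, 68, 44))

attach(hsb2)
mean(write)    # 48, same as mean(hsb2$write)
detach(hsb2)   # detach when done, to avoid masking other objects
```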

Table for choosing the correct statistical test

Dependent \ Independent    Categorical       Continuous
Categorical                Chi-Square        Logistic Regression
Continuous                 ANOVA / t-test    Correlation

Tests Performed on the Dataset

T1: IS THE MEAN OF THE WRITE SCORE = 50?

t.test(write, mu = 50)

T2: DOES THE MEAN OF THE WRITE SCORE VARY ACROSS GENDER?

t.test(write ~ female)
T3: IS THERE A RELATIONSHIP BETWEEN THE TYPE OF SCHOOL ATTENDED (SCHTYP) AND STUDENTS' GENDER (FEMALE)?

mytable <- table(female, schtyp)
mytable
chisq.test(table(female, schtyp))

T4: DOES THE MEAN OF WRITE DIFFER BETWEEN THE THREE PROGRAM TYPES (PROG)?
(ANOVA applies when you have a categorical independent variable with more than two categories.)

summary(aov(write ~ prog))
install.packages("data.table")
library(data.table)
dt <- data.table(hsb2)
dt[, list(mean = mean(write), sd = sd(write)), by = prog]

T5: DOES THE MEAN OF READ EQUAL THE MEAN OF WRITE?

t.test(write, read, paired = TRUE)

For example: analysis of the performance of employees before and after a training program.
T6: TEST WHETHER THE OBSERVED PROPORTIONS FROM OUR SAMPLE DIFFER SIGNIFICANTLY FROM HYPOTHESIZED PROPORTIONS
(a chi-square goodness-of-fit test)

table(race)
chisq.test(table(race), p = c(10, 10, 10, 70)/100)

T7: HOW DO WRITE SCORES VARY WITH GENDER AND SES LEVEL?
(a factorial ANOVA: two or more categorical independent variables)

anova(lm(write ~ female * ses, data = hsb2))

Correlation

T8: VALIDATE THE RELATIONSHIP BETWEEN READ AND WRITE

cor.test(read, write)

For the % of variability: rounding the correlation of 0.597 to 0.6 and squaring gives 0.36, which multiplied by 100 is
36%. Hence read shares about 36% of its variability with write.
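The same arithmetic in R, using the correlation reported above:

```r
r <- 0.597          # correlation between read and write (from cor.test)
r^2                 # ~0.356: the proportion of shared variability
round(r^2 * 100)    # ~36, i.e. about 36%
```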
Partial correlation is a measure of the strength and direction of a linear relationship between two continuous
variables whilst controlling for the effect of one or more other continuous variables.

cor.test(read, write)
mm1 <- lm(read ~ female)
res1 <- mm1$residuals
mm2 <- lm(write ~ female)
res2 <- mm2$residuals
cor.test(res1, res2)

Linear regression allows us to look at the linear relationship between two variables.

lm(write ~ read)

T9: HOW DOES WRITE (AS THE DEPENDENT VARIABLE) VARY WITH PROG AND READ (AS INDEPENDENT VARIABLES)?

In ANCOVA, in addition to categorical predictors, you also have continuous predictors.

summary(aov(write ~ prog + read))
Artificial Intelligence
The simulation of human intelligence by machines. This can include
information acquisition, reasoning and
self-correction.

Machine Learning
An application of AI which enables
computers to learn automatically without
being explicitly programmed to do so.

Deep Learning
Deep learning is a subset of machine
learning in artificial intelligence (AI) that
has networks capable of learning
unsupervised from data that is
unstructured or unlabeled
Supervised Learning
Learning in which we train the machine using data that is well labelled (already mapped to the correct output). The machine is then
provided with a new set of data, so that the supervised learning algorithm analyses the training data and produces a correct outcome
from the labelled data.

These algorithms have a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors
(independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process
continues until the model achieves a desired level of accuracy on the training data.

Regression, Decision Tree, Random Forest, KNN, Logistic Regression

Unsupervised Learning
Training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information
without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences
without any prior training on the data.

In these algorithms, we do not have any target or outcome variable to predict or estimate. They are used for clustering a population
into different groups, which is widely applied for segmenting customers into groups for specific interventions.

Apriori algorithm, K‐means.


Linear Regression
It is used to estimate real values. It helps us establish the relationship between independent and dependent variables by fitting
a best-fit line. This line is known as the regression line and is represented by the linear equation Y = a*X + b.

• Simple Linear Regression: one independent variable
• Multiple Linear Regression: multiple (more than one) independent variables
Decision Tree

● Used for classification problems.


● Works for both categorical and continuous
dependent variables.
● The population gets split into two or more
homogeneous sets. This is done based on most
significant attributes/ independent variables to make
as distinct groups as possible.
● To split the population into distinct groups, it uses various
techniques like Gini, information gain, chi-square and entropy.
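A minimal sketch of growing a classification tree in R, using the rpart package and the built-in iris data (both are assumptions for illustration; they are not part of the hsb2 examples above). rpart uses the Gini index by default for classification splits:

```r
# Classification tree on the built-in iris data (Gini splits by default)
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                           # the chosen splits and node purity

pred <- predict(fit, iris, type = "class")
table(pred, iris$Species)            # confusion matrix on the training data
```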
Random Forests
● It's a collection of decision trees.
● To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class.
The forest chooses the classification having the most votes (over all the trees in the forest).
● Key concepts:
1. Random sampling of training data points when building trees
2. Random subsets of features considered when splitting nodes

Steps
1. Create a bootstrapped dataset
2. Create a decision tree using the bootstrapped dataset, but only use a random subset of variables at each step
3. Repeat steps 1 and 2 with a new bootstrapped dataset
4. Classify a new sample by tallying the votes across all the trees
5. Measure the accuracy of the random forest by the proportion of out-of-bag samples that it classifies correctly
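The steps above can be sketched with the randomForest package on the built-in iris data (both are assumptions for illustration; install.packages("randomForest") if it is not already available). The package handles the bootstrapping, random feature subsets and out-of-bag accounting internally:

```r
# Random forest on the built-in iris data
library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf)        # shows the out-of-bag (OOB) error estimate
rf$confusion     # per-class OOB confusion matrix
```

predict(rf) with no new data returns the out-of-bag predictions, which is how step 5's accuracy is measured.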
Naive Bayes
Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any
other feature.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c).
P(No|Rainy) = P(Rainy|No) * P(No) / P(Rainy)
P(Rainy|No) = 3/5
P(No) = 5/14
P(Rainy) = 5/14
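Plugging the numbers above into Bayes' theorem in R:

```r
# Posterior P(No | Rainy) from the slide's probabilities
p_rainy_given_no <- 3/5
p_no             <- 5/14
p_rainy          <- 5/14

p_no_given_rainy <- p_rainy_given_no * p_no / p_rainy
p_no_given_rainy    # 0.6
```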
KNN
• It can be used for both classification and regression problems. K-nearest neighbours is a simple algorithm that
stores all available cases and classifies new cases by a majority vote of their k neighbours. The case is assigned to
the class most common amongst its k nearest neighbours, measured by a distance function.

• These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. If k = 1, then the case is
simply assigned to the class of its nearest neighbour.
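A sketch of KNN classification in R, using knn() from the class package (which ships with R) on the built-in iris data; the 100/50 train/test split is an arbitrary choice for illustration:

```r
# k-nearest neighbours on the built-in iris data
library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)        # 100 training rows, 50 held out
train <- iris[train_idx, 1:4]               # numeric features only
test  <- iris[-train_idx, 1:4]

# Each test case is assigned the majority class of its 3 nearest
# (Euclidean-distance) training neighbours
pred <- knn(train, test, cl = iris$Species[train_idx], k = 3)
table(pred, iris$Species[-train_idx])       # confusion matrix on held-out rows
```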
K-means
It is a type of unsupervised algorithm used for clustering problems. Its procedure follows a simple and easy way to
classify a given data set into a certain number of clusters. Data points inside a cluster are homogeneous, and
heterogeneous with respect to peer clusters.
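A sketch with base R's kmeans() on the numeric columns of the built-in iris data (the choice of data and of 3 clusters is for illustration only):

```r
# k-means on the four numeric columns of iris, asking for 3 clusters;
# nstart = 20 reruns the algorithm from 20 random starts and keeps the best
set.seed(123)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

km$size                           # points per cluster
table(km$cluster, iris$Species)   # clusters line up loosely with species
```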
THANK YOU!
