
Tweet Classification

GROUP NO-37

Manish Jindal (201305578)
Trilok Sharma (201206527)

Mentor: Romil Bansal
Guided by: Dr. Vasudeva Varma

Problem Statement: To automatically classify tweets from Twitter into various genres based on predefined Wikipedia categories.

Motivation:
o Twitter is a major social networking service, with over 200 million tweets posted every day.
o Twitter provides a list of Trending Topics in real time, but it is often hard to understand what these trending topics are about.
o It is important and necessary to classify these topics into general categories with high accuracy for better information retrieval.

Dataset:
o Input data is static / real-time data consisting of user tweets.
o Training dataset: fetched from Twitter with the Twitter4J API.

Final Deliverable:
o It will return the list of all categories to which the input tweet belongs.
o It will also report the accuracy of the algorithm used for classifying tweets.

Categories
We took the following categories into consideration for classifying Twitter data:
1) Business
2) Education
3) Entertainment
4) Health
5) Law
6) Lifestyle
7) Nature
8) Places
9) Politics
10) Sports
11) Technology

Concepts used for better performance

Outlier removal
To remove low-frequency and high-frequency words using a bag-of-words approach.
Stop word removal
To remove the most common words, such as "the", "is", "at", "which", and "on".
Keyword stemming
To reduce inflected words to their stem, base, or root form using Porter stemming.
Cleaning crawled data
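As a concrete illustration, the preprocessing steps above (cleaning, stop word removal, stemming, and frequency-based outlier removal) might be sketched as follows. The stemmer here is a simplified suffix stripper standing in for the Porter algorithm the project used, and the stop word list is abbreviated; function names are hypothetical.

```python
import re
from collections import Counter

# Abbreviated stop word list for illustration.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "and", "who", "for"}

def clean_and_tokenize(tweet):
    """Lowercase, strip URLs/mentions/hash signs, split into words."""
    tweet = re.sub(r"http\S+|@\w+|#", " ", tweet.lower())
    return re.findall(r"[a-z]+", tweet)

def stem(word):
    """Very simplified suffix stripping (the project used Porter stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tweets, min_freq=2, max_freq=1000):
    """Tokenize all tweets, drop stop words, stem, then remove outlier
    terms whose corpus frequency is below min_freq or above max_freq."""
    docs = [[stem(w) for w in clean_and_tokenize(t) if w not in STOP_WORDS]
            for t in tweets]
    freq = Counter(w for d in docs for w in d)
    return [[w for w in d if min_freq <= freq[w] <= max_freq] for d in docs]
```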

Other concepts used

Spelling correction
To correct spellings using the edit distance method.
Named Entity Recognition
For ranking result categories and finding the most appropriate one.
Synonym expansion
If a feature (word) of the test query is not found as a dimension in the feature space, replace that word with its synonym. Done using WordNet.
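The edit distance method above can be sketched as a plain dynamic-programming Levenshtein distance, with a hypothetical `correct` helper that snaps an out-of-vocabulary word to its nearest feature-space term:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary):
    """Replace an out-of-vocabulary word with its closest vocabulary term."""
    if word in vocabulary:
        return word
    return min(vocabulary, key=lambda v: edit_distance(word, v))
```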

Tweet Classification Algorithms

We used three algorithms for classification:
1) Naive Bayes based (supervised)
2) SVM based (supervised)
3) Rule based

Training pipeline:
Crawl Twitter data → Tweet cleaning, stop word removal → Remove outliers → Extract features (unique wordlist) → Create feature vector for each tweet → Create index file of feature vectors → Create model files

Testing pipeline:
Test query / tweet → Tweet cleaning, stop word removal → Edit distance, WordNet (synonyms) → Create feature vector for test tweet → Create index file of feature vector → Apply model files → Apply Named Entity Recognition → Rank result category → Output category
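The "extract features" and "create feature vector" steps in both pipelines can be sketched as: build the unique wordlist over the training tweets, then map any tweet onto that fixed feature space as a term-frequency vector. Helper names are hypothetical.

```python
def build_vocabulary(token_docs):
    """Unique word list over all training tweets; its order defines the
    dimensions of the feature space."""
    return sorted({w for doc in token_docs for w in doc})

def feature_vector(tokens, vocabulary):
    """Term-frequency vector of one tweet in the fixed feature space.
    Words outside the vocabulary are ignored (or, in the project, first
    replaced by a WordNet synonym)."""
    index = {w: i for i, w in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for w in tokens:
        if w in index:
            vec[index[w]] += 1
    return vec
```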

Main idea of Supervised Learning

Assumption: the training set consists of instances of different classes c_j, described as conjunctions of attribute values.
Task: classify a new instance d, given as a tuple of attribute values, into one of the classes c_j ∈ C.
Key idea: assign the most probable class using a supervised learning algorithm.
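The "most probable class" criterion can be written as the maximum a posteriori (MAP) decision rule over attribute values x_1, …, x_n:

```latex
c_{\mathrm{MAP}} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j \mid x_1, x_2, \ldots, x_n)
```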

Method 1: Bayes Classifier

Bayes' rule states:

P(c | d) = P(d | c) · P(c) / P(d)

where P(d | c) is the likelihood, P(c) is the prior, and P(d) is a normalization constant.

We used the WEKA machine learning library for the Bayes classifier in our project.
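The project used WEKA (a Java library) for this step; purely for illustration, a minimal multinomial naive Bayes with Laplace smoothing might look like the following Python sketch (names hypothetical):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count class priors and per-class word frequencies from token lists."""
    priors = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
        vocab.update(doc)
    return priors, word_counts, vocab

def classify_nb(doc, priors, word_counts, vocab):
    """argmax_c [ log P(c) + sum_w log P(w|c) ], Laplace-smoothed."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for c in priors:
        n_c = sum(word_counts[c].values())
        score = math.log(priors[c] / total)
        for w in doc:
            score += math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```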

Method 2: SVM Classifier (Support Vector Machine)

Given a new point x, we can score its projection onto the hyperplane normal:

i.e., compute the score: wᵀx + b = Σᵢ αᵢ yᵢ xᵢᵀx + b

Decide the class based on whether the score is < or > 0.
We can also set a confidence threshold t:
Score > t: yes
Score < −t: no
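The scoring rule above follows directly from the dual form; a minimal sketch, assuming a trained SVM supplies the multipliers αᵢ, labels yᵢ, support vectors xᵢ, and bias b (all names hypothetical):

```python
def svm_score(alphas, labels, support_vectors, b, x):
    """w.x + b expanded in the dual: sum_i alpha_i * y_i * (x_i . x) + b."""
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

def decide(score, t=0.0):
    """Positive class if score > t, negative if score < -t, else abstain."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "uncertain"
```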


Multi-class SVM


Multi-class SVM Approaches

1-against-all
Each of the k SVMs separates a single class from all remaining classes (Cortes and Vapnik, 1995).

1-against-1
Pairwise: k(k−1)/2 SVMs are trained, where k = |Y|. Each SVM separates a pair of classes (Friedman, 1996).
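Both reduction schemes can be sketched on top of any binary scorer; the callables below are hypothetical stand-ins for trained SVMs:

```python
from collections import Counter

def one_vs_all_predict(scorers, x):
    """1-against-all: one binary scorer per class; predict the class whose
    SVM returns the largest margin for x."""
    return max(scorers, key=lambda c: scorers[c](x))

def one_vs_one_predict(pair_classifiers, x):
    """1-against-1: k(k-1)/2 pairwise classifiers, each returning the
    winning class of its pair; the majority vote decides."""
    votes = Counter(clf(x) for clf in pair_classifiers)
    return votes.most_common(1)[0][0]
```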

Advantages of SVM
High dimensional input space
Few irrelevant features (dense concept)
Sparse document vectors (sparse instances)
Text categorization problems are often linearly separable
For linearly inseparable data we can use kernels to map the data into a high dimensional space, so that it becomes linearly separable with a hyperplane.

Method 3: Rule Based

We defined a set of rules to classify a tweet based on term frequency:

a. Extract the features of the tweet.
b. Count the term frequency of each feature; the category of the feature with the maximum term frequency, across all categories mentioned above, is our first classification.
c. As this cannot be right all the time, we also maintain a count of the categories in which the tweet's features fall; the category closest to the tweet is our next classification.
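Steps a–c can be sketched as follows, assuming a precomputed `term_freq` table (a hypothetical name) mapping (feature, category) pairs to their training-set frequencies:

```python
from collections import Counter

def rule_based_classify(features, term_freq):
    """First guess: the category of the single feature with the highest
    term frequency. Second guess: the category containing the most of the
    tweet's features. If the two agree, only one category is returned."""
    _, first, _ = max(
        ((f, c, n) for (f, c), n in term_freq.items() if f in features),
        key=lambda t: t[2])
    counts = Counter(c for (f, c) in term_freq if f in features)
    second = counts.most_common(1)[0][0]
    return (first,) if first == second else (first, second)
```

Running this on the worked example from the next slide (sachin/sports 2000, apple/technology 1000, etc.) yields Sports as the first classification and Health as the second.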

Example
Tweet = "sachin is a good player, who eats apple and banana which is good for health."

Features: sachin, player, eats, apple, health, banana
Stop words: is, a, good, he, was, for, which, and, who

Classification (feature → category, term frequency):
sachin → sports, 2000
player → sports, 900
eating → health, 500
apple → technology, 1000
health → health, 800
banana → health, 700

Max term frequency: sachin, so our first category is Sports.
2nd approximation: most of the features lie in Health (3 times), so our second approximation is Health.
If both of these point to the same category, then we have only one category; i.e., if the majority of features here also lay in Sports, the only result would be Sports.

Cross-validation (Accuracy)
Steps for k-fold cross-validation:
Step 1: split the data into k subsets of equal size.
Step 2: use each subset in turn for testing, and the remainder for training.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
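The two steps above can be sketched generically, with `train_fn` and `predict_fn` as hypothetical callables standing in for any of the three classifiers (this simple interleaved split is not stratified):

```python
def k_fold_accuracy(data, labels, train_fn, predict_fn, k=10):
    """Use each of k interleaved folds once for testing and the rest for
    training, then average the per-fold accuracies."""
    n = len(data)
    folds = [list(range(i, n, k)) for i in range(k)]
    accuracies = []
    for test_idx in folds:
        if not test_idx:
            continue
        held_out = set(test_idx)
        train_data = [data[i] for i in range(n) if i not in held_out]
        train_labels = [labels[i] for i in range(n) if i not in held_out]
        model = train_fn(train_data, train_labels)
        hits = sum(predict_fn(model, data[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)
```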

Accuracy Results (10 folds)

Accuracy of algorithm in %:

Categories \ Algo.   SVM     Naive   Rule
Business             86.6    81.44   98.30
Education            85.71   76.07   81.8
Entertainment        86.8    79.1    87.49
Health               95.67   84.62   90.93
Law                  81.17   73.38   75.25
Lifestyle            93.27   89.71   82.42
Nature               87.0    78.64   84.24
Places               81.01   75.35   80.73
Politics             81.91   81.88   76.31
Sports               87.11   83.57   81.87
Technology           83.64   82.44   77.05

Unique features
Worked on freshly crawled Twitter data using the Twitter4J API.
Worked on eleven different categories.
Applied three different classification methods across the different categories.
Achieved high performance speed, with accuracy in the range of 85 to 95%.
Performed tweet cleaning, stemming, and stop word removal.
Used edit distance for spelling correction.
Used Named Entity Recognition for ranking.
Used WordNet for query expansion and synonym finding.
Validated using 10-fold cross-validation.

Snapshot
(Screenshots of the classifier's result and accuracy output.)

Thank You!
