Professional Documents
Culture Documents
Manish
Jindal(201305578)
Trilok
Sharma(201206527)
Motivation:
o Twitter is a major social networking service with
Data
Dataset :
o Input Data is the static / real-time data consisting of
Final Deliverable:
o It will return list of all categories to which the input
tweet belongs.
o It will also give the accuracy of the algorithm used
for classifying tweets.
Categories
We took following categories into consideration
for classifying twitter data.
1)Business
2)Education
5)Law
9)Politics
6)Lifestyle 10)Sports
3)Entertainment 7)Nature11)Technology
4)Health
8)Places
Keyword Stemming
To reduce inflected words to their stem, base or
root form using porter stemming
Cleaning crawl data
Crawl
tweeter
data
Tweets
Cleaning,
Stop word
removal
Remove Outliers
Create feature
vector for each
tweet
Create Index file
Of feature
vectors
Training
Output
Category
Rank result
category
Apply Named
Entity
Recognition
Create
/ Apply
Model files
Testing
Extract Features
(Unique
wordlist)
Test
Query/
Tweet
Tweets
Cleaning,
Stop word
removal
Edit Distance,
WordNet
(synonyms)
Create feature
vector for test
tweet
Create Index
file
Of feature
vector
Prior
Normalization
Constant
Score > t:
yes
Score < -t:
no
0
-1
1
11
Multi-class SVM
12
13
Advantages of SVM
High dimensional input space
Few irrelevant features (dense concept)
Sparse document vectors (sparse instances)
Text categorization problems are linearly
separable
For linearly inseparable data we can use kernels
Cross-validation (Accuracy)
Steps for k-fold cross-validation :
cross-validation is performed
The error estimates are averaged to
yield an overall error estimate
SVM
Nave
Rule
Business
86.6
81.44
98.30
Education
85.71
76.07
81.8
Entertainment
86.8
79.1
87.49
Health
95.67
84.62
90.93
Law
81.17
73.38
75.25
Lifestyle
93.27
89.71
82.42
Nature
87.0
78.64
84.24
Places
81.01
75.35
80.73
Politics
81.91
81.88
76.31
Sports
87.11
83.57
81.87
Technology
83.64
82.44
77.05
Unique features
Worked on latest Crawled tweeter data using tweeter4j
api
Worked on Eleven different Categories.
Applied three different method of supervised learning
to classify in different categories.
Achieved high performance speed with accuracy in
range of 85 to 95 %
Done Tweets Cleaning , Stemming , Stop Word removal.
Used Edit distance for spelling correction.
Used Named entity recognition for ranking.
Used WordNet for Query Expansion and Synonyms
finding.
Validated using CrossFold (10 fold) validation.
Snapshot
Result
Accuracy
Thank You!