Why social media data is essential
Social media has achieved enormous reach and a huge user base (Facebook alone has more than 2.7 billion users).
The concept of eWOM
Organizations and their attitude to online discussion
What can we know from social media text
Types of Analysis
Prediction
Understanding
Quality Issues with Social Media Data
• Social data is dynamic in nature, and its sheer size and velocity pose significant computing (storage and analysis) challenges.
• It often contains unstructured data and metadata that are not readily handled by traditional analysis tools.
• How social media providers sample and filter their data streams is unknown.
Social Media Data Analytics Process
Corpus Identification
• A good corpus will usually have some metadata (or annotation), giving
information about the text.
Venue Identification
• The same user action can carry different meanings across platforms. We cannot assume that a Facebook like has the same value or impact as an Instagram like.
• Twitter is preferred for real-time dissemination of content, while Facebook and Yelp (similar to mouthshut.com) are preferred for non-real-time content dissemination.
• Customers actively choose and use social media in response to their specific needs and motivations.
• Managers need to monitor their customers’ preferred social media platforms.
• Selection of social media platforms depends on the impact of those platforms on the business (e.g., the number of customers who use them) and the accessibility of content on those platforms.
Attractiveness of Twitter as a Venue
• One of the largest social networks; users include heads of state, company CEOs, NGOs, celebrities, academics, and many ordinary people.
• In the USA, the majority of Twitter users are young (18–49 years of age); most earn more than $50,000/year, hold at least one college degree, and are evenly distributed among urban, suburban, and rural areas.
• Facebook is the preferred medium for connecting with friends and acquaintances and for self-presentation (Mosca & Quaranta, 2015; Grover et al., 2019).
• Twitter, on the other hand, acts more as a social broadcasting tool (Sundararajan et al., 2013).
• By allowing connection and information sharing with any user (i.e., non-personal engagement), Twitter can extend an individual’s network through effective opinion and information dissemination.
Descriptive Statistics
• User analysis
• Number of unique users
• User with the most followers
• Users with the most tweets
• Content analysis
• Most-liked tweet
• Distribution of tweets by day/week/month
• Percentage of tweets containing a URL
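The descriptive statistics above can be sketched in a few lines of plain Python. This is a minimal illustration over toy records; the field names (`user`, `followers`, `likes`, `text`) are illustrative assumptions, not any real API schema.

```python
from collections import Counter

# Toy tweet records; field names are assumed for illustration only.
tweets = [
    {"user": "alice", "followers": 120, "likes": 5,  "text": "hello http://x.co"},
    {"user": "bob",   "followers": 950, "likes": 12, "text": "great day"},
    {"user": "alice", "followers": 120, "likes": 2,  "text": "see http://y.co"},
]

unique_users = {t["user"] for t in tweets}                        # number of unique users
most_followed = max(tweets, key=lambda t: t["followers"])["user"] # user with most followers
tweets_per_user = Counter(t["user"] for t in tweets)              # users with most tweets
most_liked = max(tweets, key=lambda t: t["likes"])["text"]        # most-liked tweet
pct_with_url = 100 * sum("http" in t["text"] for t in tweets) / len(tweets)

print(len(unique_users), most_followed, tweets_per_user.most_common(1), most_liked)
```

On real data the same aggregations would typically be done with a dataframe library, but the logic is identical.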
Data Transformation
• Preprocessing: lowercase conversion, URL removal, HTML parsing
• Tokenization
• Normalization: stemming / lemmatization (e.g., Connect, Connected, Connecting → Connect)
• N-grams
• POS tagging
Transformation
• Parse HTML detects HTML tags and parses out the text only:
<a href...>Some text</a> → Some text
• Word & punctuation tokenization splits the text into words, keeping punctuation symbols:
This example. → (This), (example), (.)
• Sentence tokenization splits the text at full stops, retaining only full sentences:
This example. Another example. → (This example.), (Another example.)
• Tweet tokenization splits the text using a pre-trained Twitter model, which keeps hashtags, emoticons, and other special symbols:
This example. :-) #simple → (This), (example), (.), (:-)), (#simple)
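The three tokenization strategies can be approximated with regular expressions. This is a deliberately simplified sketch; real toolkits (e.g., NLTK's TweetTokenizer) use much richer rules and trained models, and the patterns below are assumptions for illustration.

```python
import re

# Word & punctuation: word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", "This example.")

# Sentence: naive split at full stops (real sentence splitters are smarter).
sentences = [s.strip() for s in re.findall(r"[^.]+\.", "This example. Another example.")]

# Tweet-aware: keep the :-) emoticon and hashtags as single tokens (toy rule).
tweet_tokens = re.findall(r":-\)|#\w+|\w+|[^\w\s]", "This example. :-) #simple")

print(words)        # ['This', 'example', '.']
print(sentences)    # ['This example.', 'Another example.']
print(tweet_tokens) # ['This', 'example', '.', ':-)', '#simple']
```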
Data Reduction
Stemming: works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, which is why this approach presents some limitations.
Lemmatization: takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma.
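The contrast can be made concrete with two toy functions. The suffix list and the lemma dictionary below are tiny illustrative assumptions, not Porter's real rules or a real morphological dictionary.

```python
SUFFIXES = ["ing", "ed", "s"]                                   # assumed toy rule list
LEMMAS = {"studies": "study", "connected": "connect"}           # assumed toy dictionary

def stem(word):
    # Indiscriminately cut a known suffix off the end of the word.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def lemmatize(word):
    # Look the surface form up to link it back to its lemma.
    return LEMMAS.get(word, word)

print(stem("connected"), stem("studies"))        # suffix cutting mangles 'studies' -> 'studie'
print(lemmatize("connected"), lemmatize("studies"))  # the dictionary gets both right
```

The mangled `studie` shows exactly the limitation the slide describes: blind suffix stripping works for regular inflections but fails where morphology matters.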
Porter’s Stemmer
• It reduces the size of the dictionary (the number of distinct words in the corpus) two- or three-fold.
• With the same corpus but fewer input dimensions, machine-learning models generally work better.
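The dictionary-shrinking effect is easy to demonstrate. The naive suffix stripper below is an assumption for illustration, not Porter's actual algorithm; it simply collapses inflected forms onto a shared stem.

```python
corpus = ["connect", "connected", "connecting", "connection",
          "jump", "jumps", "jumped", "jumping"]

def naive_stem(w):
    # Assumed toy rule list, far simpler than Porter's staged rules.
    for suf in ("ing", "ion", "ed", "s"):
        if w.endswith(suf):
            return w[: -len(suf)]
    return w

surface_forms = len(set(corpus))                  # 8 distinct surface forms
stems = len({naive_stem(w) for w in corpus})      # collapses to 2 stems
print(surface_forms, stems)
```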
Filtering
Document frequency keeps tokens that appear in no fewer and no more than the specified number/percentage of documents.
E.g., DF = (3, 5) keeps only tokens that appear in at least 3 and at most 5 documents.
If floats are given as parameters, only tokens that appear in the specified percentage of documents are kept. E.g., DF = (0.3, 0.5) keeps only tokens that appear in 30% to 50% of documents.
Most frequent tokens keeps only the specified number of most frequent tokens; the default is the 100 most frequent tokens.
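A minimal sketch of document-frequency filtering, following the (min, max) convention described above (ints as document counts, floats as fractions of the corpus). The function name and corpus are assumptions for illustration.

```python
from collections import Counter

docs = [["cat", "dog"], ["cat", "fish"], ["cat", "dog", "bird"], ["dog"]]

def df_filter(docs, df_range):
    lo, hi = df_range
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    if isinstance(lo, float):          # floats -> fractions of the corpus
        lo, hi = lo * n, hi * n
    keep = {tok for tok, c in df.items() if lo <= c <= hi}
    return [[tok for tok in doc if tok in keep] for doc in docs]

# 'fish' and 'bird' each appear in only one document, so DF = (2, 3) drops them.
print(df_filter(docs, (2, 3)))
```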
N-Gram
• An n-gram is a set of co-occurring words within a given window.
• Unigrams, bigrams, trigrams, and so on.
• For the sentence "The cow jumps over the moon":
Bigrams: the cow, cow jumps, jumps over, over the, the moon
Trigrams: the cow jumps, cow jumps over, jumps over the, over the moon
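Sliding a window of size n over the token list generates the n-grams above; a short sketch:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumps over the moon".split()
print(ngrams(tokens, 2))  # ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
print(ngrams(tokens, 3))  # ['the cow jumps', 'cow jumps over', 'jumps over the', 'over the moon']
```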
POS Tagging
• POS tags provide a linguistic signal about how a word is being used within the scope of a phrase, sentence, or document.
Word Cloud
Bag of Words
TF: Term Frequency, which measures how frequently a term occurs in a document:
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF: Inverse Document Frequency, which measures how important a term is. Certain terms, such as "is", "of", and "that", may appear many times yet carry little meaning, so IDF weighs down such common terms:
IDF(t) = log(total number of documents / number of documents containing term t)
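The definitions above can be implemented directly. The tiny corpus below is an assumption for illustration; it shows how a function word like "is", which occurs in every document, gets IDF 0 and therefore zero TF-IDF weight no matter how often it appears.

```python
import math

docs = [["the", "cat", "is", "here"],
        ["the", "dog", "is", "there"],
        ["the", "bird", "is", "gone"]]

def tf(term, doc):
    # Occurrences of the term divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Base-10 log of (corpus size / documents containing the term).
    n_containing = sum(term in doc for doc in docs)
    return math.log10(len(docs) / n_containing)

print(tf("is", docs[0]) * idf("is", docs))    # 0.0 -> common term carries no weight
print(tf("cat", docs[0]) * idf("cat", docs))  # positive -> distinctive term
```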
TF-IDF
• Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03.
• Now assume the corpus has 10,000,000 documents and cat appears in 1,000 of them. The inverse document frequency (idf) is then log(10,000,000 / 1,000) = 4.
• Thus, the tf-idf weight is the product of these quantities: 0.03 × 4 = 0.12.
For a particular word, the IDF value remains unchanged across the entire corpus.
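The worked example above checks out with a base-10 logarithm (which the result log(10,000,000 / 1,000) = 4 implies):

```python
import math

# 100-word document with "cat" appearing 3 times, in a corpus of
# 10,000,000 documents of which 1,000 contain "cat".
tf = 3 / 100
idf = math.log10(10_000_000 / 1_000)
tfidf = tf * idf
print(tf, idf, tfidf)  # 0.03 4.0 0.12
```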