
Varieties in Social Media

Why social media data is essential

• Unwillingness of consumers to fill in market research (MR) questionnaires

• Low cost of data collection from the internet

• Advancement in big data technologies
Why social media data is essential

• Social media has achieved huge reach and user base (more than 2.7 B users on Facebook)

• Consumers are spending long hours on the net (netizens!)

• Declining trust in institutions and organizations (Allsop et al., 2007). As a result, customers place more value on comments of other users and friends (referrals) than on ads

• Reviews and comments of buyers are mostly unsolicited (high reliability)
The concept of eWOM

An informal, noncommercial digital communication (primarily positive or negative statements) about a brand, product, service or company, made by actual or former customers (Harrison-Walker, 2001; Hennig-Thurau et al., 2004).
Organizations and their attitude to online discussion

What can we know from social media text

Text as a reflection of the producers:
• Insight about the individuals
• Their attitude towards other objects
• About social groups or institutions (e.g., how old and young people view happiness differently)

Text's impact on receivers:
• Reviews impact a customer's purchase decision
• The impact depends on context: platform, social norms, prior experience
• More impact on experiential goods compared to search goods
• Negative reviews have more impact
Types of Analysis

Prediction (use text as features or independent variables):
– Which movie will be popular?
– How well will the stock market perform?
– Can a customer be provided a loan or not?
– Will a new product launch be successful or not?

Understanding (use text to understand language use):
– What impact does the quality of eWOM have?
– Why do some users become more influential?
– Why are some posts shared more?
– Why do some rumors spread more than others?
Quality Issues with Social Media Data

• As per the classification of McLuhan (1964), social media can be termed a cool medium: it offers limited depth of information but high participation

• Analysis may misrepresent the real world:
  • Not all customers access social media platforms
  • Some participants are active, others prefer to be reticent
  • Presence of nonhumans: social bots and spammers
  • Professionally managed accounts of prominent individuals


Quality Issues with Social Media Data

• Problem of garbage in, garbage out
  • Fake reviews are often posted by unscrupulous companies, zealous competitors, disgruntled employees, and unhappy consumers
  • Almost 38% of total reviews on social media are paid or fake
  • In a normal data warehouse, ETL systems take care of noise; current social media analytics offer limited automated filtering and scrubbing capabilities
Computational Challenges

• Social data is dynamic in nature, and its sheer size and velocity pose significant computing (storage and analysis) challenges.

• It often contains unstructured data and metadata that are not readily treated using traditional analysis tools.

• Websites impose restrictions on data collection.

• How social media providers sample and filter their data streams is unknown.
Social Media Data Analytics Process

Corpus Identification

• A corpus is the equivalent of a “dataset” in a general machine learning task.

• A corpus represents a collection of texts (data) used for a particular analysis.

• A good corpus will usually have some metadata (or annotation) giving information about the text.

• Corpus identification is the process of identifying the subset of available data to focus on for an analysis.
Corpus Identification

The subset of text data is identified based on the following attributes:

• Author (who wrote the text?)
• Language
• Region
• Type of content (text, audio, video, photo)
• Venue (content is generated and shared in a variety of venues)
• Time (when did someone say something on social media?)
Venue Identification
• Similar human functions can have different meanings across platforms. We cannot assume that a Facebook like has the same value or impact as an Instagram like.
• Twitter is preferred for real-time dissemination of content, while Facebook and Yelp (similar to mouthshut.com) are preferred for non-real-time content dissemination.
• Customers actively choose and use social media in response to their specific needs and motivations.
• Managers need to monitor their customers’ preferred social media platforms.
• Selection of social media platforms depends upon the impact of those platforms on the business (e.g., the number of customers that use them) and the accessibility of content on those platforms.
Attractiveness of Twitter as a Venue

• One of the largest social networks. Users include heads of state, company CEOs, NGOs, celebrities, academics, and lots of ordinary people.

• In the USA, the majority of Twitter users are young (18-49 years of age); most of them earn more than $50,000/year, have at least one college degree, and are evenly distributed among urban, suburban, and rural areas.

• Users produce copious data.

• Due to the availability of an API, data extraction is easier.
Attractiveness of Twitter as a Venue

• Twitter is structured as a network, so Twitter data is an excellent source for network analysis.

• Facebook is the preferred medium for connecting with friends and acquaintances and for self-presentation (Mosca & Quaranta, 2015; Grover et al., 2019).
• Twitter, on the other hand, acts more as a social broadcasting tool (Sundararajan et al., 2013).
• By allowing connection and information sharing with any user (i.e., non-personal engagement), Twitter can extend one’s individual network through effective opinion and information dissemination.
Descriptive Statistics

• User analysis
  • Number of unique users
  • User with the most followers
  • Users with the maximum number of tweets

• Content analysis
  • Most liked tweet
  • Day/week/month-wise distribution
  • % of tweets having a URL
Data Preprocessing

• Cleaning
• Reduction
• Normalization

Steps:
• Transformation: lowercase / remove URL / parse HTML
• Stop word removal
• Tokenization
• Stemming / lemmatization (Connect, Connected, Connecting → Connect)
• N-gram
• POS tagging
Transformation

• Lowercase converts all text to lowercase.

• Remove accents will remove all diacritics/accents in text.
  naïve → naive

• Parse html will detect html tags and parse out text only.
  <a href. . . >Some text</a> → Some text

• Remove urls will remove urls from text.
  This is a http://orange.biolab.si/ url. → This is a url.
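The transformation steps above can be sketched in Python using only the standard library. This is a simplified illustration, not the implementation of any particular tool; the HTML and URL regexes are deliberately crude.

```python
import re
import unicodedata

def transform(text):
    """Apply the transformation steps above (a simplified sketch)."""
    # Lowercase.
    text = text.lower()
    # Remove accents/diacritics: decompose characters, drop combining marks.
    text = "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))
    # Parse html: strip tags, keep only the enclosed text (a crude regex,
    # adequate for simple markup such as <a href=...>Some text</a>).
    text = re.sub(r"<[^>]+>", "", text)
    # Remove urls.
    text = re.sub(r"https?://\S+", "", text)
    return text

transform("Naïve")  # 'naive'
```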
Tokenization

Breaking text into smaller components (words, sentences, bigrams).

• Word & punctuation splits the text by words and keeps punctuation symbols.
  This example. → (This), (example), (.)

• By whitespace: This example. → (This), (example.)

• Sentence will split the text by full stop, retaining only full sentences.
  This example. Another example. → (This example.), (Another example.)

• Tweet will split the text with a pre-trained Twitter model, which keeps hashtags, emoticons and other special symbols.
  This example. :-) #simple → (This), (example), (.), (:-)), (#simple)
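The first three strategies can be sketched as follows (assumed, simplified behaviour; the tweet tokenizer relies on a pre-trained model and is omitted here):

```python
import re

def tokenize(text, method="word_punct"):
    """Split text using one of the strategies above (assumed behaviour)."""
    if method == "word_punct":
        # Words plus punctuation symbols kept as separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)
    if method == "whitespace":
        return text.split()
    if method == "sentence":
        # Split on full stops, retaining only full sentences.
        return [s.strip() + "." for s in text.split(".") if s.strip()]
    raise ValueError(f"unknown method: {method}")

tokenize("This example.")                # ['This', 'example', '.']
tokenize("This example.", "whitespace")  # ['This', 'example.']
```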
Data Reduction

Stemming:
• Works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.
• This indiscriminate cutting is successful on some occasions, but not always; that is why this approach presents some limitations.

Lemmatization:
• Takes into consideration the morphological analysis of the words.
• To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma.
Porter’s Stemmer

• Developed in 1980 and the most widely used.

• It identifies 1200 suffixes and rules to handle them.

• Rules have the form <condition> <suffix> → <new suffix>.

• (m>0) EED → EE means “if the word has at least 1 vowel and 1 consonant plus EED, change the ending to EE”.
  Agreed → Agree; Feed → Feed (unchanged)
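The single rule above can be sketched in Python. This is a simplified illustration of Porter's measure m (the full algorithm treats 'y' contextually; here it is treated as a consonant), not a complete stemmer.

```python
def measure(stem):
    """Porter's m: the number of vowel-consonant (VC) sequences in a stem.

    Simplified sketch: 'y' is treated as a consonant, whereas the full
    algorithm treats it contextually.
    """
    pattern = "".join("V" if ch in "aeiou" else "C" for ch in stem.lower())
    collapsed = []
    for ch in pattern:          # collapse runs, e.g. "VCC" -> "VC"
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed).count("VC")

def step_eed(word):
    """Apply the single rule (m>0) EED -> EE from the slide."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]        # drop the final 'd'
    return word

step_eed("agreed")  # 'agree'  (m of 'agr' is 1, rule fires)
step_eed("feed")    # 'feed'   (m of 'f' is 0, rule does not fire)
```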
Comparison between Stemming and Lemmatization

• Porter stems both meanness and meaning to mean, creating a false equivalence.

• On the other hand, Porter stems goose to goos and geese to gees, when those two words should be equivalent.

• Lemmatization is expensive for large data sets.

• Lemmatization cannot handle unknown words.
Advantages

• Makes your data denser.

• It reduces the size of the dictionary (the number of words used in the corpus) two- or three-fold.

• With the same corpus but fewer input dimensions, ML will work better.
Filtering

• Stopwords removes pre-defined stopwords (e.g. removes ‘and’, ‘or’, ‘in’, …). You can also load your own list of stopwords provided in a simple *.txt file with one stopword per line.
• Lexicon keeps only words provided in the file. Load a *.txt file with one word per line to use as a lexicon. Click the ‘reload’ icon to reload the lexicon.
• Regexp removes words that match the regular expression. The default is set to remove punctuation.
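The three filters can be combined in one hypothetical helper; `filter_tokens` and its tiny stopword list are illustrative, not any tool's actual API.

```python
import re

STOPWORDS = {"and", "or", "in", "the", "a", "to"}   # tiny illustrative list

def filter_tokens(tokens, stopwords=STOPWORDS, lexicon=None,
                  regexp=r"[^\w\s]+"):
    """Drop stopwords, keep only lexicon words (if a lexicon is given),
    and drop tokens matching the regexp (default: punctuation)."""
    kept = []
    for tok in tokens:
        if tok.lower() in stopwords:
            continue
        if lexicon is not None and tok.lower() not in lexicon:
            continue
        if re.fullmatch(regexp, tok):
            continue
        kept.append(tok)
    return kept

filter_tokens(["This", "example", ".", "and", "more"])
# ['This', 'example', 'more']
```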
Filtering

• Document frequency keeps tokens that appear in not less than and not more than the specified number/percentage of documents. E.g. DF = (3, 5) keeps only tokens that appear in 3 or more and 5 or fewer documents.

• If you provide floats as parameters, it keeps only tokens that appear in the specified percentage of documents. E.g. DF = (0.3, 0.5) keeps only tokens that appear in 30% to 50% of documents.

• Most frequent tokens keeps only the specified number of most frequent tokens. The default is the 100 most frequent tokens.
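Document-frequency filtering can be sketched as below; `df_filter` is a hypothetical helper, with floats interpreted as corpus fractions as described above.

```python
def df_filter(documents, min_df, max_df):
    """Keep tokens whose document frequency lies in [min_df, max_df].

    documents is a list of token lists; float bounds are interpreted as
    fractions of the corpus, int bounds as absolute document counts.
    """
    n = len(documents)
    df = {}
    for doc in documents:
        for tok in set(doc):            # count each token once per document
            df[tok] = df.get(tok, 0) + 1
    lo = min_df * n if isinstance(min_df, float) else min_df
    hi = max_df * n if isinstance(max_df, float) else max_df
    keep = {t for t, c in df.items() if lo <= c <= hi}
    return [[t for t in doc if t in keep] for doc in documents]

docs = [["a", "b"], ["a", "c"], ["a", "b"]]
df_filter(docs, 2, 3)   # drops "c", which appears in only one document
```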
N-Gram
• N-grams are basically sets of co-occurring words within a given window.
• Unigram, bigram, trigram, and so on.
• Example: "The cow jumps over the moon"

Bigrams: the cow, cow jumps, jumps over, over the, the moon
Trigrams: the cow jumps, cow jumps over, jumps over the, over the moon
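Generating these windows is a one-liner over whitespace tokens; a minimal sketch:

```python
def ngrams(text, n):
    """All co-occurring word windows of size n over whitespace tokens."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"
bigrams = ngrams(sentence, 2)    # 5 bigrams, starting with 'the cow'
trigrams = ngrams(sentence, 3)   # 4 trigrams, starting with 'the cow jumps'
```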
N-Gram

• A phrase gives more meaning than a single word.

• Constructing 2-grams and 3-grams and comparing them across documents provides a better measure of similarity.
POS Tagging

• A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech to each word (and other tokens), such as noun, verb, adjective, etc.

• Generally, computational applications use more fine-grained POS tags like ‘noun-plural’.

• POS tags provide a linguistic signal on how a word is being used within the scope of a phrase, sentence, or document.

• Example: “run” can be used as a verb in “I run 5 miles every day” or as a noun in “I went for a run”.
POS Tags in Penn Treebank Project

Word Cloud

• A visual representation based on the count of each word or n-gram.

• It is a static representation (time is not considered).
• Language/context is not taken into account.

A useful representation, but it needs to be supported by other analyses.
Bag of Words

• The bag-of-words model is a simplifying representation of text.

• In bag-of-words, the grammar is disregarded but the multiplicity is maintained.

• It is commonly used in document classification, where the frequency of occurrence of each word is used as a feature.

• Calculate the bag of words for the sentences:
  Doc1: John likes to watch movies.
  Doc2: Mary likes movies too.
  Doc3: Sheetal likes movies. Sheetal likes football also.
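The exercise above can be sketched with a counter per document (texts lowercased and punctuation dropped for simplicity):

```python
from collections import Counter

# The three documents, lowercased with punctuation dropped.
docs = {
    "Doc1": "john likes to watch movies",
    "Doc2": "mary likes movies too",
    "Doc3": "sheetal likes movies sheetal likes football also",
}

# Grammar and word order are discarded; only word multiplicity is kept.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Shared vocabulary across the corpus, and one count vector per document.
vocab = sorted(set().union(*counts.values()))
vectors = {name: [c[w] for w in vocab] for name, c in counts.items()}

counts["Doc3"]["likes"]   # 2
```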
TF-IDF

TF: Term frequency, which measures how frequently a term occurs in a document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF: Inverse document frequency, which measures how important a term is. It is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance.
IDF(t) = log10(Total number of documents / Number of documents with term t in it)
TF-IDF

• Consider a corpus of 10 million documents.
• The word cat appears in 1,000 of them.
• In a particular document containing 100 words, the word cat appears 3 times.

• Calculate the TF-IDF of cat for this document:

• The term frequency (tf) for cat is (3 / 100) = 0.03.
• The inverse document frequency (idf) for cat is log10(10,000,000 / 1,000) = 4.
• Thus, the tf-idf weight is the product of these quantities: 0.03 × 4 = 0.12.

• For a particular word, the IDF value remains unchanged across the entire corpus.
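The worked example above maps directly to a small function (a sketch using the slide's definitions; real toolkits often use smoothed variants of IDF):

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """TF-IDF with the definitions above:
    tf = count / length, idf = log10(N / df)."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf

# Worked example from the slide: a 10-million-document corpus, "cat"
# appearing in 1,000 documents and 3 times in a 100-word document.
weight = tf_idf(3, 100, 10_000_000, 1_000)   # 0.03 * 4 = 0.12
```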
