Aditi Vaity-02
Aditi Bhide-03
Aneri Adani-06
Avinash Kukreja-13
Dharmik Siroya-15
Introduction
● With digitization, there has been a drastic increase in the usage of popular
social media sites such as Twitter, Facebook, Yahoo and YouTube, as well as
e-commerce sites such as Flipkart and Amazon, which has resulted in the generation
of large sets of data.
● If the data set is small, it is easy to extract useful information, but if the
data set is huge, it is quite difficult to analyze what that data actually
means.
● Twitter, a social networking site launched in 2006, is undoubtedly one of the most
popular social media platforms available today, with 100 million daily active users
and 500 million tweets sent daily.
● It receives a huge number of tweets each day. This information can be used for
economic, industrial, social or government purposes by organizing and analyzing
the tweets as required.
HOW IS TWITTER BUILT?
Initially, Twitter was built using Ruby on Rails, a specialized Web-application
framework for the Ruby computer programming language. Its interface allows open
adaptation and integration with other online services.
● FlockDB, an in-house graph database they wrote, Cassandra and a custom
MySQL fork.
● Hadoop technology uses a divide-and-conquer methodology for
processing, by handling large complex unstructured data that usually does not
fit into regular relational tables.
● The second main component of Hadoop
is its map-reduce framework, which
provides a simple way to break analyses
over large sets of data into small chunks
which can be run in parallel across
a cluster of machines (100, for example).
● Hadoop is designed for data with the
3Vs, i.e. data with huge volume, variety
and velocity. It is a framework which
deals with Big data, and it has its own
family of tools for different processing
tasks, grouped under one umbrella
called the Hadoop Ecosystem.
● The main core components of Hadoop
framework are MapReduce and HDFS.
● The two key data access components
of Hadoop Ecosystem are Apache Pig
and Apache Hive.
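The map-reduce idea described above can be illustrated with a minimal single-machine sketch in plain Python. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample tweets are assumptions made for illustration. On a real cluster, the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(tweet):
    """Map step: emit a (word, 1) pair for every word in one tweet."""
    return [(word.lower(), 1) for word in tweet.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(tweets):
    """Run map, shuffle and reduce in sequence over a list of tweets."""
    mapped = chain.from_iterable(map_phase(t) for t in tweets)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample data, not real tweets.
sample_tweets = [
    "Hadoop handles big data",
    "big data needs Hadoop",
    "data data data",
]
counts = word_count(sample_tweets)
```

Each tweet is mapped independently of the others, which is what allows the work to be split into small chunks.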
Twitter recommendation products:
Twitter offers three main recommendation products to its users:
1. Users to follow: The total number of recommendations that could be generated for this product
could be as high as 1 Billion and the recommendations can be valid for months to years. In
2. Tweets: The tweets recommended to users on their feed can number in the
hundreds of millions, with a shelf life of a few hours. The shelf life of information
is very short nowadays, as news changes rapidly even within a single day.
made because most of the users might be part of similar trends and their shelf life is also
1. Coverage: With the increasing catalog of items, it is always important to get high coverage
3. Adaptability: The recommender system should adapt quickly to the fast changing world of
content
4. Scalability: It should be scalable to billions of users with different habits and preferences
5. User preferences: The framework should be able to handle varied user interests in one
ranking framework
Twitter employs collaborative and content-based recommendation systems, or sometimes a hybrid of the two,
depending on the type of recommendation being made.
Collaborative filtering:
With the collaborative filtering approach, Twitter has a unique advantage due to its user
follows concept. It makes it easier for Twitter to calculate user similarity, as the information is directly available. This also
increases the feasibility of creating graph-based models and using community detection techniques on top of them
to find similarities among users.
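As a minimal sketch of how follow information could feed a user similarity score, the example below compares the sets of accounts two users follow using Jaccard similarity. The choice of Jaccard, the function names and the sample follow lists are assumptions for illustration; the source does not state which similarity measure Twitter uses.

```python
def jaccard(a, b):
    """Jaccard similarity: shared follows divided by all follows of either user."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical follow lists: user -> set of accounts that user follows.
follows = {
    "alice": {"ml_news", "python_tips", "hadoop_dev"},
    "bob": {"ml_news", "python_tips", "cricket"},
    "carol": {"cricket", "bollywood"},
}


def most_similar(user, follows):
    """Return (other_user, score) for the user with the most similar follows."""
    scored = [
        (other, jaccard(follows[user], follows[other]))
        for other in follows
        if other != user
    ]
    return max(scored, key=lambda pair: pair[1])


def suggest_accounts(user, follows):
    """Suggest accounts the nearest neighbour follows that the user does not."""
    neighbour, _ = most_similar(user, follows)
    return sorted(follows[neighbour] - follows[user])


neighbour, score = most_similar("alice", follows)
suggestions = suggest_accounts("alice", follows)
```

Here "alice" and "bob" share two of four distinct follows, so their similarity is 0.5, and "cricket" is suggested to "alice".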
When it comes to content based filtering, it becomes a bit more complicated at Twitter scale.
1. With the 280-character limit, a tweet can hold at most about 46 words, assuming an average of 5 characters per
English word plus one separating space (280 / 6 ≈ 46). Across all tweets, this leads to a lot of content that has to be processed.
2. A Twitter post might contain multilingual text, and it is difficult to put words from different languages into context.
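A minimal sketch of content-based filtering on short texts is shown below. It builds a word-count profile from tweets a user liked and ranks candidate tweets by cosine similarity to that profile. The tokenizer, the function names and the sample tweets are assumptions for illustration, and the sketch ignores the multilingual problem mentioned above. It also reproduces the word-budget arithmetic from the first point.

```python
import math
import re
from collections import Counter

# Word budget per tweet: 280 characters, 5 letters per word plus one space.
TWEET_LIMIT = 280
AVG_WORD_LENGTH = 5
max_words = TWEET_LIMIT // (AVG_WORD_LENGTH + 1)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(liked, candidates):
    """Score each candidate tweet against a profile built from liked tweets."""
    profile = Counter()
    for text in liked:
        profile.update(tokenize(text))
    scored = [(cosine(profile, Counter(tokenize(c))), c) for c in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical sample data.
liked = [
    "Training neural networks with Python",
    "New machine learning paper on neural networks",
]
candidates = [
    "Election results are out tonight",
    "Neural networks explained in Python",
]
ranking = rank_candidates(liked, candidates)
```

The candidate that shares words with the liked tweets ranks first, while the unrelated one scores zero.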
Challenges with User-User and User-item recommendations:
User-User:
Finding similarities based on user follows and recommending every tweet made by a particular person to
the users who follow that person might not be a robust approach. For example, you might follow a
particular person on Twitter for their views on machine learning. Their tweets on politics might not be of
interest to you and should be kept out of your feed.
User-interests:
User interests vary all the time: Users have long-term interests, like health preferences, and short-term
interests, like trends and events, and these interests keep changing. For example, during November millions
of users were interested in the US midterm elections and would have liked to see politics-related content.
But the same will not be true of the same person in a different time period.
Geo-dependent interests: Geo-dependent interests change for users based on what is happening
at a particular point in time. “Trends for you” should always keep up with this change in the geo-dependent
interests of the user.
Twitter Mining
Twitter is a gold mine of data. Unlike other social platforms, almost every user’s tweets are completely public
and pullable. This is a huge plus if you’re trying to get a large amount of data to run analytics on. Twitter data
is also pretty specific. Twitter’s API allows you to do complex queries like pulling every tweet about a certain
topic within the last twenty minutes, or pulling a certain user’s non-retweeted tweets.
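The two example queries can be sketched as filters over tweets that have already been downloaded. This sketch does not call the Twitter API. The dictionary fields (`user`, `text`, `created_at`, `is_retweet`), the function names and the sample records are hypothetical and do not reflect the API's actual response format.

```python
from datetime import datetime, timedelta, timezone


def recent_on_topic(tweets, topic, now, window=timedelta(minutes=20)):
    """Keep tweets that mention the topic and were posted within the window."""
    cutoff = now - window
    return [
        t
        for t in tweets
        if topic.lower() in t["text"].lower() and t["created_at"] >= cutoff
    ]


def original_tweets(tweets, user):
    """Keep one user's tweets, leaving out retweets."""
    return [t for t in tweets if t["user"] == user and not t["is_retweet"]]


# Hypothetical records; the reference time is fixed so results are repeatable.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tweets = [
    {"user": "alice", "text": "Hadoop cluster is up",
     "created_at": now - timedelta(minutes=5), "is_retweet": False},
    {"user": "bob", "text": "RT Hadoop cluster is up",
     "created_at": now - timedelta(minutes=10), "is_retweet": True},
    {"user": "alice", "text": "Old Hadoop news",
     "created_at": now - timedelta(hours=2), "is_retweet": False},
]

recent = recent_on_topic(tweets, "hadoop", now)
alice_originals = original_tweets(tweets, "alice")
```

In practice the filtering would be expressed in the query sent to the API, so that only matching tweets are returned.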
Twitter mining is analysing Twitter message information to predict, discover or investigate potential causation.
Twitter mining includes text mining designed to specifically leverage Twitter tweet content and contexts.
Twitter mining can include analysing additional information associated with tweets, including names, hashtags
and other characteristics. Twitter mining also employs the substantial quantitative information (numbers of
tweets, retweets, likes, favorites, etc.) to try to better understand the phenomena under consideration.
Finally, Twitter mining can examine how Twitter tweets, retweets, etc., capture and reflect different events or
even how Twitter relates to other social and conventional media.
Data can be mined after creating a Twitter Developer Account. Twitter mining uses the frequency mining concept,
among other techniques.
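As one simple reading of frequency mining, the sketch below counts how many tweets each hashtag appears in and keeps the hashtags that reach a minimum count. The `min_count` threshold, the function name and the sample tweets are assumptions for illustration; the source does not specify which frequency mining algorithm is used.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def frequent_hashtags(tweets, min_count=2):
    """Return hashtags that appear in at least min_count tweets."""
    counts = Counter()
    for text in tweets:
        # A set ensures each hashtag is counted once per tweet.
        counts.update({tag.lower() for tag in HASHTAG.findall(text)})
    return {tag: n for tag, n in counts.items() if n >= min_count}


# Hypothetical sample data.
sample = [
    "Loving #Hadoop and #BigData",
    "#hadoop tutorial #hadoop",
    "Match day #cricket",
    "#bigdata trends",
]
frequent = frequent_hashtags(sample)
```

With the sample above, "#hadoop" and "#bigdata" each appear in two tweets and are kept, while "#cricket" appears once and is dropped.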
Algorithm used
Hadoop is a big data storage and processing tool for analyzing data with the 3Vs, i.e. data
with huge volume, variety and velocity. It is a framework which deals with Big
data, and it has its own family of tools for different processing tasks, grouped
under one umbrella called the Hadoop Ecosystem.
Hadoop Ecosystem
Twitter Implementation of Sentiment Analysis in Hadoop
APPLICATION