
BIG DATA ANALYTICS

Aditi Vaity-02
Aditi Bhide-03
Aneri Adani-06
Avinash Kukreja-13
Dharmik Siroya-15
Introduction
● With digitization, there has been a drastic increase in the usage of popular social
media sites such as Twitter, Facebook, Yahoo and YouTube, as well as e-commerce
sites such as Flipkart and Amazon, which has resulted in the generation of very
large data sets.
● If the data is small, it is easy to extract useful information from it, but if the
data is huge, it becomes quite difficult to analyze and understand what it actually
conveys.
● Twitter, a social networking site launched in 2006, is undoubtedly one of the most
popular social media platforms available today, with 100 million daily active users
and 500 million tweets sent daily.
● Because Twitter receives such a huge volume of tweets each day, this information
can be used for economic, industrial, social or government purposes by arranging and
analyzing the tweets as per our requirements.
HOW IS TWITTER BUILT?
● Initially, Twitter was built using Ruby on Rails, a specialized web-application
framework for the Ruby programming language. Its interface allows open adaptation
and integration with other online services.
● Twitter's data stores include FlockDB, an in-house graph database they wrote,
Cassandra, and a custom MySQL fork.
● Hadoop technology uses a divide-and-conquer methodology for processing, handling
large, complex unstructured data that usually does not fit into regular relational
tables.
● The second main component of Hadoop is its MapReduce framework, which
provides a simple way to break an analysis over a large data set into small chunks
that can be processed in parallel across a cluster of many machines (a minimal
word-count sketch is shown after this list).
● Hadoop follows the 3Vs, i.e. it deals with data of huge volume, variety and
velocity. Hadoop is a framework for Big Data, and it has its own family of projects
that support different kinds of processing, all tied up under one umbrella called
the Hadoop Ecosystem.
● The core components of the Hadoop framework are MapReduce and HDFS.
● The two key data access components
of Hadoop Ecosystem are Apache Pig
and Apache Hive.
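
To make the map-reduce idea above concrete, here is a minimal word-count sketch in plain Python (illustrative only; Hadoop's real MapReduce is written in Java and distributes these phases across a cluster). The map step emits a (word, 1) pair per word in each tweet, and the reduce step sums the counts collected for each word.

from collections import defaultdict

def map_phase(tweet):
    # Map step: emit a (word, 1) pair for every word in one tweet.
    for word in tweet.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce step: sum all the counts collected for a single word.
    return word, sum(counts)

def run_job(tweets):
    # Simulate the shuffle: group mapper output by key, then reduce each group.
    grouped = defaultdict(list)
    for tweet in tweets:                       # Hadoop would run mappers in parallel
        for word, count in map_phase(tweet):
            grouped[word].append(count)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(run_job(["big data analytics with hadoop", "hadoop handles big data"]))
# {'big': 2, 'data': 2, 'analytics': 1, 'with': 1, 'hadoop': 2, 'handles': 1}
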
Twitter recommendation products:

There are three main recommendation products that Twitter recommends to its users:

1. Users to follow: The total number of recommendations that could be generated for this product can be as high as 1 billion, and the recommendations stay valid for months to years. In other words, these recommendations have a long shelf life.
2. Tweets: Tweets recommended to users on their feed can be in the order of hundreds of millions, with a shelf life of a few hours. The shelf life of this information is very short nowadays, as news changes rapidly even within a single day.
3. Trends/Events: Trends have the smallest number of recommendations to be made, because most users tend to be part of similar trends, and their shelf life is also short since trends do not last long.

Key aspects while deploying a recommender system:

1. Coverage: With the increasing catalog of items, it is always important to get high coverage while maintaining low latency.
2. Diversity: It is important to give diverse recommendations to the users.
3. Adaptability: The recommender system should adapt quickly to the fast-changing world of content.
4. Scalability: It should be scalable to billions of users with different habits and preferences.
5. User preferences: The framework should be able to handle varied user interests in one ranking framework.
Twitter employs both collaborative and content-based recommendation systems, or sometimes a hybrid of the two, depending on the type of recommendation being made.

Collaborative filtering:

When it comes to the collaborative filtering approach, Twitter has a unique advantage due to its user-follows concept. It is easier for Twitter to calculate user similarity because the follow information is directly available. This also makes it feasible to build graph-based models and apply community detection techniques on top of them to find similarities among users.
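
As a small illustration of how the follow graph lends itself to user similarity (a hypothetical sketch with toy data, not Twitter's actual implementation), the snippet below scores users by the Jaccard overlap of their follow sets and recommends accounts followed by the most similar users.

def jaccard(a, b):
    # Jaccard similarity between two sets of followed accounts.
    return len(a & b) / len(a | b) if (a or b) else 0.0

def recommend_follows(user, follows, top_k=3):
    # Score every other user by follow-set overlap, then suggest accounts
    # they follow that `user` does not, weighted by that similarity.
    candidates = {}
    for other, followed in follows.items():
        if other == user:
            continue
        similarity = jaccard(follows[user], followed)
        for account in followed - follows[user]:
            candidates[account] = candidates.get(account, 0.0) + similarity
    return sorted(candidates, key=candidates.get, reverse=True)[:top_k]

# Toy follow graph: user -> set of accounts they follow (made-up data).
follows = {
    "alice": {"nasa", "espn", "bbc"},
    "bob":   {"nasa", "bbc", "natgeo"},
    "carol": {"espn", "nba"},
}
print(recommend_follows("alice", follows))  # ['natgeo', 'nba']
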

Content based approach:

When it comes to content-based filtering, things become a bit more complicated at Twitter scale.

1. With the 280-character limit, a tweet can contain at most about 46 words, assuming an average of 5 characters per English word. This still leads to a lot of content that has to be processed.
2. A Twitter post might contain multilingual text, and it is difficult to put words from different languages into context.
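
For a rough sense of the content-based approach, the sketch below assumes scikit-learn is available and ranks candidate tweets by TF-IDF cosine similarity against a tweet the user previously engaged with; the data and pipeline are illustrative, not Twitter's production system.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical candidate tweets and a tweet the user previously liked.
candidates = [
    "new transformer model tops the machine learning benchmarks",
    "election results announced for the senate race tonight",
    "gradient descent explained with simple machine learning examples",
]
liked = "great thread on machine learning optimizers"

# Vectorize everything with TF-IDF; the last row acts as the user's profile.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(candidates + [liked])

# Rank candidate tweets by cosine similarity to the liked content.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for score, tweet in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {tweet}")
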
Challenges with User-User and User-Item recommendations:

User-User:

Finding similarities based on user follows and recommending all the tweets made by a particular person to the
users who follow him/her might not be a robust approach. For example, you might follow a particular person
on Twitter for their views on machine learning; their tweets on politics might not be of interest to you and
should be kept out of your feed.

User-interests:

User interests vary all the time: users have long-term interests, like health preferences, and short-term
interests, like trends/events, and these interests change constantly. For example, during November millions
of users were interested in the US midterm elections and would have liked to see politics-related content,
but the same will not be true for the same person in a different time period.

Geo-dependent interests: Geo-dependent interests change for users based on what is happening at a
particular point in time. "Trends for you" should always keep up with this change in the user's geo-dependent
interests.
Twitter Mining

Twitter is a gold mine of data. Unlike other social platforms, almost every user’s tweets are completely public
and pullable. This is a huge plus if you’re trying to get a large amount of data to run analytics on. Twitter data
is also pretty specific. Twitter’s API allows you to do complex queries like pulling every tweet about a certain
topic within the last twenty minutes, or pulling a certain user's non-retweeted tweets.
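
As one possible way to pull such data, the sketch below uses the Tweepy library against the Twitter API v2 recent-search endpoint; the query string, fields and token are placeholders, and the exact client calls may differ depending on API and library versions.

import tweepy

# Placeholder credentials: requires a Twitter developer account and bearer token.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Pull recent English-language tweets about a topic, excluding retweets.
response = client.search_recent_tweets(
    query="big data -is:retweet lang:en",
    max_results=50,
    tweet_fields=["created_at", "public_metrics"],
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.text[:80])
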

Twitter mining is analysing Twitter message information to predict, discover or investigate potential causation.
Twitter mining includes text mining designed to specifically leverage Twitter tweet content and contexts.
Twitter mining can include analysing additional information associated with tweets, including names, hashtags
and other characteristics. Twitter mining also employs the substantial quantitative information (numbers of
tweets, retweets, likes, favorites, etc.) to try to better understand the phenomena under consideration.
Finally, Twitter mining can examine how Twitter tweets, retweets, etc., capture and reflect different events or
even how Twitter relates to other social and conventional media.

People can mine Twitter data after creating a Twitter Developer Account. The analysis typically relies on the
frequency-mining concept, along with other text-mining techniques.
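
As a simple illustration of the frequency-mining idea, the plain-Python sketch below counts hashtag occurrences in a batch of pulled tweets to surface the most frequent terms (the input tweets are made up for the example).

import re
from collections import Counter

def top_hashtags(tweets, n=5):
    # Count hashtag frequencies across a batch of tweet texts.
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in re.findall(r"#\w+", text))
    return counts.most_common(n)

tweets = [
    "Loving the new #Hadoop release! #BigData",
    "#BigData pipelines with Pig and Hive",
    "Sentiment analysis on #Twitter data #BigData",
]
print(top_hashtags(tweets))
# [('#bigdata', 3), ('#hadoop', 1), ('#twitter', 1)]
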
Algorithm used


Hadoop is a big data storage and processing tool for analyzing data characterized by the 3Vs, i.e. data
with huge volume, variety and velocity. Hadoop is a framework that deals with Big Data, and it has its own
family of projects that support different kinds of processing, all tied together under one umbrella called
the Hadoop Ecosystem.
The Hadoop Ecosystem
Twitter Implementation of Sentiment Analysis in Hadoop
APPLICATION

● Reviews from Websites
● Applications as a Sub-component Technology
● Business Intelligence
● Applications across Domains
CONCLUSION

● Social media data such as Twitter data, e-commerce data, etc. has gained much
attention in the area of sentiment analysis.
● Hadoop proves to be an efficient framework for huge data analysis since
Hadoop operates in a fault-tolerant manner. In addition to this, Hadoop can
be integrated with Apache Pig, Hive, Zookeeper, Sqoop, etc., which promises
improved efficiency and performance of Hadoop.
● The Hadoop framework, integrated with Apache Flume, has been used to fetch
data from Twitter, and Apache Pig and Hive are used to perform analysis on the
extracted Twitter data. First, recent trends in the extracted tweets were
determined, and then sentiment analysis was performed on the retrieved data; a
minimal sketch of this per-tweet scoring step is shown below.
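
To make the per-tweet sentiment step concrete, the sketch below shows a minimal dictionary-based scorer in plain Python; in the pipeline above this kind of logic would typically be expressed as a Pig or Hive step (or a UDF) over the tweets landed by Flume, and the word lists here are purely illustrative.

# Illustrative sentiment word lists; real pipelines use much larger lexicons.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def score_tweet(text):
    # Label a tweet positive/negative/neutral by counting lexicon hits.
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for tweet in [
    "I love the new update, it is great",
    "terrible outage today, really bad experience",
]:
    print(score_tweet(tweet), "-", tweet)
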
THANK YOU
