Mining Big Data for a Ranking System
Mentor:
Dr. Shirshu Verma
Associate Professor
Team:
Aman Kr. Raj (IIT2011041)
Khushal Gautam (IIT2011054)
Neelesh Kr. Nirmal (IIT2011033)
Nikhil Passey (IIT2011159)
Shivam Chaudhary (IIT2011047)
Sudheer Singh (IIT2011064)
Vishal Chaudhary (IIT2011042)
Introduction
• Ranking is required whenever there is a need to compare items by relevance.

• The complexity of modern analytics needs is outstripping the available computing power of legacy systems.


Introduction (contd…)
• In legacy environments, traditional tools and batch processes can take hours, days, or even weeks, in a world where businesses require access to data in minutes, seconds, or even sub-seconds.
• For example, to rank the users of Stack Overflow by their level of expertise in various fields, we need to analyze a huge amount of data.
Introduction (contd…)
• Hadoop is a great choice for this kind of problem.

• Hadoop is used not only for handling historically grown big data, but also for meeting the high-performance needs of an application.
What is Hadoop?
• Hadoop is an open-source Apache project. The Hadoop framework is written in Java. It is scalable and can therefore support demanding, high-performance applications. The framework makes it possible to store very large amounts of data across the file systems of multiple computers. It is designed to scale from a single node to thousands of independent nodes, with each node contributing its own local storage, CPU, memory, and processing power.
Problem definition
• This project addresses ranking the users of stackoverflow.com for every topic. Stack Overflow is a public Q&A website where anybody can ask or answer questions, either with a registered user ID and password or as a guest. Questions and answers are categorized by the area in which they are most relevant: tags are assigned to every question and answer.
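For reference, the public Stack Overflow data dump stores posts as XML rows (in Posts.xml), with the tags encoded in a `Tags` attribute. A minimal sketch of extracting the owner, score, and tags from one such row (the sample row and helper function here are illustrative, not code from the project):

```python
import xml.etree.ElementTree as ET

# One <row> in the style of the Stack Overflow dump's Posts.xml (illustrative
# sample; attribute names follow the public dump: PostTypeId 1 = question).
sample_row = ('<row Id="4" PostTypeId="1" Score="12" OwnerUserId="8" '
              'Tags="&lt;java&gt;&lt;hadoop&gt;" />')

def parse_post(row_xml):
    """Extract (owner, score, tags) from a single Posts.xml row."""
    attrs = ET.fromstring(row_xml).attrib
    # Tags are stored as "<java><hadoop>"; strip the surrounding angle
    # brackets and split on the "><" separators.
    raw = attrs.get("Tags")
    tags = raw.strip("<>").split("><") if raw else []
    return attrs["OwnerUserId"], int(attrs["Score"]), tags

owner, score, tags = parse_post(sample_row)
print(owner, score, tags)  # 8 12 ['java', 'hadoop']
```

Each mapper in the cluster would apply a function like this to the rows in its input split.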
Problem definition (contd…)
• A tag tells us the area in which a question or answer is relevant. Our main objective is to assign a level of expertise to each user in every field, based on the responses to the user's questions and the positive and negative votes that other users give to his or her answers.
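The per-field expertise level described above can be sketched as a weighted combination of a user's feedback signals. The weights and the notion of an "accepted answer" bonus below are illustrative assumptions, not values taken from the project:

```python
# Hypothetical scoring rule: weight upvotes, downvotes, and accepted answers.
# These weights are illustrative assumptions, not values from the project.
WEIGHTS = {"upvote": 10, "downvote": -2, "accepted": 15}

def expertise_score(upvotes, downvotes, accepted_answers):
    """Combine one user's feedback within one tag into a single score."""
    return (upvotes * WEIGHTS["upvote"]
            + downvotes * WEIGHTS["downvote"]
            + accepted_answers * WEIGHTS["accepted"])

# A user with 30 upvotes, 4 downvotes, and 2 accepted answers in one tag:
print(expertise_score(30, 4, 2))  # 322
```

Ranking the users of a tag then reduces to sorting them by this score.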

Problem definition…
 We take all the features of the user's
response like positive/negative
response, location preference, student
or professional which are valid users of
the website. We have taken the data-
dump of the stackoverflow, which is
trememdous amount of data(20 GB).

Problem definition (contd…)
• We will analyze all of this data for every user and develop a recommendation system based on the users' levels of expertise in various areas. A naive, single-machine approach to analyzing this huge volume of data would take an enormous amount of time, possibly months or even years. So we need a special technique.
Problem definition (contd…)
• To achieve this, we bring "computation to the data instead of data to the computation": bringing the data to the computation requires extra I/O operations to load this huge volume of data into memory, and if memory is limited we need extra memory and resources to process it.

Why Hadoop?
• You can't have a conversation about big data for very long without talking about Hadoop. But what exactly is Hadoop, and what makes it so special? Basically, it is a way of storing enormous data sets across distributed clusters of servers and then running "distributed" analysis applications on each cluster. It is designed to be robust, in that your big-data applications will continue to run even when individual servers, or whole clusters, fail. And it is also designed to be efficient, because it does not require your applications to shuttle huge volumes of data across your network.
Hadoop vs. RDBMS
• RDBMS and Hadoop are different approaches to storing, processing, and retrieving information. DBMS and RDBMS have been in the literature for a long time, whereas Hadoop is comparatively new. As storage capacities and customer data sizes increase enormously, processing this information within a reasonable amount of time becomes crucial.
Hadoop vs. RDBMS (contd…)
• Especially for data warehousing applications, business intelligence reporting, and analytical processing, performing complex reporting within a reasonable amount of time becomes very challenging as the size of the data grows exponentially and customers demand ever more complex analysis and reporting.

Hadoop vs. RDBMS (contd…)
• The Hadoop framework works well with both structured and unstructured data, and supports a variety of data formats such as XML, JSON, and text-based flat files. An RDBMS, by contrast, works well only when an entity-relationship (ER) model is defined properly, so that the database schema can grow in a managed way; i.e., an RDBMS works well with structured data.
Hadoop VS Rdbms(contd…)
 Hadoop will be a choice in environments
such as when there are needs for BIG
data processing on which the data being
processed does not have consistent
relationships. Where the data size is too
BIG for complex processing, or not easy
to define the relationships between the
data, then it becomes difficult to save the
extracted information in an RDBMS with a
coherent relationship
Points to consider:
• An RDBMS is a relational database management system; Hadoop uses a node-based, flat structure.
• An RDBMS is generally used for OLTP processing, whereas Hadoop is currently used for analytical and especially big-data processing.

Points contd..
 Any maintenance on storage, or data files, a
downtime is needed for any available RDBMS.
In standalone database systems, to add
processing power such as more CPU, physical
memory in non-virtualized environment, a
downtime is needed for RDBMS such as DB2,
Oracle, and SQL Server. However, Hadoop
systems are individual independent nodes
that can be added in an as needed basis.

Points to consider (contd…)
• In RDBMS systems the database cluster uses the same data files stored on shared storage, whereas in Hadoop the data is stored independently on each processing node.
• Performance tuning of an RDBMS can become a nightmare, even in a proven environment. Hadoop, however, enables "hot" tuning: extra nodes can be added and are self-managed.

Overview of Hadoop
Implementations of Hadoop
HDFS
• Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault-tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability to a large set of distributed applications, is an integral part of Hadoop.
• HDFS is highly configurable, with a default configuration well suited for many installations. Most of the time, the configuration needs to be tuned only for very large clusters.
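The MapReduce computation behind the per-tag user ranking can be sketched as a Hadoop-Streaming-style mapper and reducer, wired together here in memory in plain Python (the function names and sample posts are illustrative assumptions; on a real cluster each phase would read stdin and write stdout):

```python
from itertools import groupby
from operator import itemgetter

# A Hadoop-Streaming-style sketch of the per-tag user ranking job.
# Input records are (user, answer_score, tags) tuples parsed from the dump.

def map_phase(posts):
    """Emit ((tag, user), score) for every tag on every post."""
    for user, score, tags in posts:
        for tag in tags:
            yield (tag, user), score

def reduce_phase(pairs):
    """Sum the scores for each (tag, user) key; Hadoop sorts keys between
    the phases, which sorted() simulates here."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(score for _, score in group)

posts = [("u1", 5, ["java"]), ("u2", 3, ["java"]), ("u1", 2, ["java", "hadoop"])]
print(dict(reduce_phase(map_phase(posts))))
# {('hadoop', 'u1'): 2, ('java', 'u1'): 7, ('java', 'u2'): 3}
```

Sorting each tag's (user, total) pairs by total then yields the per-topic ranking; Hadoop parallelizes exactly this data flow across the cluster's nodes.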

HDFS (contd…)
• Hadoop is written in Java and is supported on all major platforms.
• Hadoop supports shell-like commands for interacting with HDFS directly.
• The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.