
Commercial Analytics

of Clickstream Data using
Hadoop

June 2014

Submitted by:
Kartik Gupta
201100048
M.C.A
Thapar University

Submitted to:
School of Mathematics and
Computer Application
Department,
Thapar University,
Patiala.

Outline
 Overview
 Big Data
 Hadoop
 Major Steps
 Results and Analysis
 Conclusion and Future Scope

Overview
 This project produces an analytic report on the behavior and location of website visitors using Hadoop.
 MapReduce is implemented to refine and sort the raw data.
 Searching is done by country, IP address, postal code and category.
 Hadoop is a tool which converts unstructured, structured and semi-structured data into key-value pairs, which are then reduced to a single value represented in binary format.
 The MapReduce framework is used for parallel implementation.

Big Data
 Big Data is a term used to describe large collections of data, which may be unstructured, that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.
 The 3 V's of Big Data: volume, velocity and variety.

Hadoop
 Open source project started by Doug Cutting
 A platform to manage Big Data
 Helps in distributed computing
 Runs on commodity hardware
 Data storage (HDFS)
    Runs on commodity hardware (usually Linux)
    Horizontally scalable
 Processing (MapReduce)
    Parallelized (scalable) processing
    Fault tolerant

CORE PARTS OF HADOOP

Hadoop Distributed File System (HDFS)
 Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
 Some specific features ensure that Hadoop clusters are highly functional:
    Rack awareness
    Minimal data motion
    Utilities
    Rollback
    Highly operable

How HDFS works

MapReduce
 MapReduce is a programming model and an associated implementation for processing large data sets.
 MapReduce usually splits the input data set into independent chunks which are processed in a completely parallel manner.
 The run-time system takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
 This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
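As a rough illustration of the model described above, the following Python sketch imitates the map, shuffle and reduce phases inside a single process. It is not Hadoop code: the sample records, the record layout and the function names (`map_record`, `shuffle`, `reduce_counts`, `run_job`) are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sample records: (ip address, country code).
# These values are invented and are not taken from the Acme dataset.
RECORDS = [
    ("10.0.0.1", "IN"),
    ("10.0.0.2", "US"),
    ("10.0.0.3", "IN"),
    ("10.0.0.4", "GB"),
    ("10.0.0.5", "US"),
    ("10.0.0.6", "IN"),
]


def map_record(record):
    """Map phase: emit one (country, 1) pair per input record."""
    _ip, country = record
    yield country, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse the values of one key into a single total."""
    return key, sum(values)


def run_job(records, chunk_size=2):
    """Split the input into independent chunks, map each chunk, then reduce."""
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    pairs = []
    for chunk in chunks:  # on a cluster, each chunk would go to a separate map task
        for record in chunk:
            pairs.extend(map_record(record))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())
```

Calling `run_job(RECORDS)` returns the number of records per country. On a real cluster the chunks would be processed in parallel on different machines; here they are processed one after another to keep the sketch short.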

Execution flow in MapReduce
1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.
2. This sends a message to the JobTracker, which produces a unique ID for the job.
3. The JobClient copies job resources, such as the jar file.
4. Once the resources are in the distributed file system, the JobClient can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It retrieves the input splits from the distributed file system.
6. Now that the JobTracker has work for the TaskTrackers, it will return a map task or reduce task as the response to the heartbeat.
7. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.
8. The TaskTracker now runs the task.
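The execution flow can be pictured with a toy, single-process simulation. This is only a sketch under strong simplifications, not Hadoop's actual API: the class `JobTracker` and the functions `submit_job` and `run_task_tracker` are hypothetical, a plain dictionary stands in for the distributed file system, and only map tasks are modelled.

```python
from collections import deque


class JobTracker:
    """Toy stand-in for the JobTracker: hands out job IDs and queues tasks."""

    def __init__(self):
        self.next_id = 1
        self.pending = deque()  # tasks waiting for a TaskTracker
        self.shared_fs = {}     # stands in for the distributed file system

    def new_job_id(self):
        # Produce a unique ID for each submitted job.
        job_id = f"job_{self.next_id:04d}"
        self.next_id += 1
        return job_id

    def start_job(self, job_id):
        # Job initialization: read the input splits, create one map task per split.
        for split in self.shared_fs[job_id]["splits"]:
            self.pending.append((job_id, "map", split))

    def heartbeat(self):
        # Answer a TaskTracker heartbeat with a task, or None when no work is left.
        return self.pending.popleft() if self.pending else None


def submit_job(tracker, code, splits):
    """JobClient side: get an ID, copy the resources, then start the job."""
    job_id = tracker.new_job_id()
    tracker.shared_fs[job_id] = {"code": code, "splits": splits}
    tracker.start_job(job_id)
    return job_id


def run_task_tracker(tracker):
    """TaskTracker side: send heartbeats, fetch the code, run each task."""
    results = []
    while True:
        task = tracker.heartbeat()
        if task is None:
            break
        job_id, _kind, split = task
        code = tracker.shared_fs[job_id]["code"]  # obtain the code to execute
        results.append(code(split))
    return results
```

For example, `submit_job(JobTracker(), len, ["ab", "cde"])` registers a job whose "code" is the built-in `len`, and a subsequent `run_task_tracker` call on the same tracker applies it to every split.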

OTHER TECHNOLOGICAL TERMS
Clickstream Data
 Clickstream data is an information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.
Potential Uses of Clickstream Data
 What is the most efficient path for a site visitor to research a product, and then buy it?
 What products do visitors tend to buy together, and what are they most likely to buy in the future?
 Where should I spend resources on fixing or enhancing the user experience on my website?
Basically we will focus on the "path optimization" use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?

STEP I
Upload the Acme website log dataset, which contains about 4 million rows of data and represents five days of clickstream data.

STEP II
Represent the dataset in unstructured format, i.e. timestamp, IP address, geocoded IP address, URL, registered user SWID.

STEP III
Represent the users data from the unstructured loaded dataset.

STEP IV
Represent the products category-wise from the dataset.

STEP V
Show the refined dataset of the Acme log files.

STEP VI
Combine all the tables, i.e. Acme log, products and users.
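The steps above can be sketched in plain Python. The sketch below is illustrative only: the real log layout is not shown in these slides, so the pipe delimiter, the five-column order, the sample lines, the lookup tables and the names `LogRow`, `parse_row` and `combine` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRow:
    timestamp: str
    ip: str
    geo: str   # geocoded location of the IP address, kept as raw text
    url: str
    swid: str  # registered user ID


def parse_row(line, sep="|"):
    """Refine one raw log line; return None if it does not have five fields."""
    parts = [part.strip() for part in line.split(sep)]
    if len(parts) != 5:
        return None
    return LogRow(*parts)


def combine(log_rows, products, users):
    """Join log rows with product categories (by URL) and users (by SWID)."""
    combined = []
    for row in log_rows:
        user = users.get(row.swid, {})  # unknown users yield empty attributes
        combined.append({
            "timestamp": row.timestamp,
            "ip": row.ip,
            "geo": row.geo,
            "url": row.url,
            "swid": row.swid,
            "category": products.get(row.url),
            "gender": user.get("gender"),
            "age": user.get("age"),
        })
    return combined


# Hypothetical sample data, invented for this example.
RAW_LINES = [
    "2014-01-01 01:34:46|10.0.0.1|usa,fl,homestead|http://www.example.com/shoes/1|u1",
    "2014-01-01 01:35:02|10.0.0.2|usa,ny,albany|http://www.example.com/handbags/7|u9",
    "malformed line",
]
PRODUCTS = {
    "http://www.example.com/shoes/1": "shoes",
    "http://www.example.com/handbags/7": "handbags",
}
USERS = {"u1": {"gender": "F", "age": 31}}
```

Malformed lines are dropped during parsing, and a log row whose user is not in the users table keeps `None` for gender and age, which mirrors a left join from the log table.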

Results and Analysis
 Configuration of Hadoop
 Count of the number of visitors from any country
 Retrieving the IP address and displaying the state of visitors
 Number of IP addresses that accessed this category at a time
 Initial stage of mapping and reduction
 Category accessed, by total number of IP addresses
 Shoes category by state, accessed by total number of IP addresses
 Details of IP addresses accessed by visitors, gender-wise
 Number of females who accessed this page
 Total number of IP addresses that accessed a particular webpage
 Sum of the ages of all the visitors
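The kinds of aggregation listed above can be approximated in a few lines of Python. This is a minimal sketch and not the queries actually run in the project: the combined rows, their column names and the helper functions (`distinct_ips_by`, `distinct_users`, `count_gender`, `total_age`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical combined rows (log + products + users), invented for this example.
ROWS = [
    {"ip": "10.0.0.1", "country": "usa", "state": "fl", "category": "shoes",
     "swid": "u1", "gender": "F", "age": 31},
    {"ip": "10.0.0.1", "country": "usa", "state": "fl", "category": "shoes",
     "swid": "u1", "gender": "F", "age": 31},
    {"ip": "10.0.0.2", "country": "usa", "state": "ny", "category": "shoes",
     "swid": "u2", "gender": "M", "age": 45},
    {"ip": "10.0.0.3", "country": "ind", "state": "pb", "category": "clothing",
     "swid": "u3", "gender": "F", "age": 24},
    {"ip": "10.0.0.4", "country": "usa", "state": "fl", "category": "shoes",
     "swid": "u4", "gender": "M", "age": 52},
]


def distinct_ips_by(rows, key, category=None):
    """Count distinct IP addresses per value of `key`, optionally for one category."""
    seen = defaultdict(set)
    for row in rows:
        if category is None or row["category"] == category:
            seen[row[key]].add(row["ip"])
    return {value: len(ips) for value, ips in seen.items()}


def distinct_users(rows):
    """Collapse the rows to one record per registered user (SWID)."""
    return {row["swid"]: row for row in rows}


def count_gender(rows, gender):
    """Number of distinct registered users with the given gender."""
    return sum(1 for user in distinct_users(rows).values() if user["gender"] == gender)


def total_age(rows):
    """Sum of ages, counting each registered user once."""
    return sum(user["age"] for user in distinct_users(rows).values())
```

For instance, `distinct_ips_by(ROWS, "state", category="shoes")` counts the IP addresses per state that accessed the shoes category. Counting distinct IP addresses and distinct users, rather than raw rows, is a design choice of this sketch so that repeat visits are not counted twice.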

Conclusion
 The amount of clickstream data is rapidly growing, and with it the demand for accessing information over the web has increased significantly.
 It is inefficient to process large data using the traditional sequential method.
 Therefore MapReduce is used for processing large datasets.
 This makes it possible to analyze the behavior and location of the visitor.

Future Scope
 Clickstream information plays an important role in a wide variety of applications such as decision support systems, event detection, profile-based marketing and the e-commerce industry.
 Location search is used by various industries like telecom.
 The nearest-location method can be fused with any other method to support better decision making.
 A tradeoff would then be made between distance and the other factors that are fused.

Thank you!