
COMMERCIAL ANALYTICS OF

CLICKSTREAM DATA
USING HADOOP
Project Report submitted in partial fulfillment of the requirements for the
award of degree of

Master of Computer Applications

Submitted By
Kartik Gupta
(Roll No. 201100048)
Under the supervision of:
Mr. Pritpal Singh
Assistant Professor

SCHOOL OF MATHEMATICS AND COMPUTER APPLICATIONS
THAPAR UNIVERSITY
PATIALA – 147004

June 2014

Certificate
I hereby certify that the work presented in the report "Commercial Analytics of
Clickstream Data using Hadoop", in partial fulfillment of the requirements for the
award of the degree of Master of Computer Applications submitted to the School of
Mathematics and Computer Applications, Thapar University, Patiala, is an authentic
record of my own work carried out under the supervision of Mr. Abhinandan Garg
and Mr. Pritpal Singh.

Kartik Gupta
201100048
Date:

Certified that the above statement made by the student is correct to the best of our
knowledge and belief.

Mr. Pritpal Singh
Faculty Coordinator

Mr. Abhinandan Garg
Industry Coordinator


Acknowledgement
I would like to express my deep sense of gratitude to my supervisor, Mr. Pritpal
Singh, Assistant Professor, School of Mathematics and Computer Applications, Thapar
University, Patiala, for his invaluable help and guidance during the course of this thesis. I
am highly indebted to him for constantly encouraging me with his critiques of my
work. I am grateful to him for giving me the support and confidence that helped me a
lot in carrying out the project work in its present form. For me, it is an honor to
work under him.
I also take the opportunity to thank Dr. Rajiv Kumar (SDP Coordinator), Associate
Professor, School of Mathematics and Computer Applications, Thapar University,
Patiala, for providing us with great infrastructure and facilities, which have helped me
become a complete software engineer.
I would also like to thank my parents and friends for their inspiration and ever
encouraging moral support, which went a long way in the successful completion of my
training.
Above all, I would like to thank almighty God for his blessings and for driving me
with faith, hope and courage in the most difficult of times.

Kartik Gupta


Abstract
Hadoop plays an important role today in solving big data problems, because everyone
is surrounded by data; in fact, it would not be wrong to say that we live in a data
age. As data keeps increasing, it is becoming more and more challenging for IT
companies to maintain and analyze huge datasets. Nowadays many companies face
similar scaling challenges, and it is not feasible for each of them to reinvent their own
proprietary tools. Doug Cutting saw an opportunity and led the effort to develop an
open source version of the MapReduce system, known as Hadoop.

Hadoop is best known for MapReduce and its distributed file system (HDFS). Its
distributed file system runs on large clusters of commodity machines. Basically, it
distributes a large dataset (in terabytes or petabytes) across thousands of machines
so that they can work on it in parallel and manage the data easily in seconds. HDFS
was inspired by GFS, which provides a fault-tolerant way to store data on commodity
hardware and delivers high aggregate performance to its clients; HDFS follows the
same approach, storing data fault-tolerantly by replicating its blocks across the cluster.

MapReduce is a framework that is used for parallel implementation. MapReduce
splits the input into independent chunks and executes them in parallel over different
mappers. When using MapReduce, the developer need not worry about factors such
as fault tolerance and load balancing; all these factors are handled by MapReduce, so
he can concentrate on his programming. Without MapReduce it would not be
efficient to process such large spatial data. Spatial data may contain knowledge
about a lot of locations, so it is very important to process it efficiently.


Table of Contents
Certificate
Acknowledgement
Abstract
Table of Contents
List of Figures

Chapter 1 Introduction
1.1 Big Data
1.1.1 Uses of big data
1.2 Clickstream Data
1.2.1 Potential Uses of Clickstream Data
1.3 Point of Interest
1.4 Tableau (Analytic Tool)
1.5 MapReduce
1.5.1 Types of nodes in MapReduce
1.5.2 Execution flow in MapReduce
1.5.3 Inputs and outputs to MapReduce

Chapter 2 Hadoop
2.1 Hadoop Overview
2.2 HDFS
2.3 Hadoop Cluster
2.4 Data Replication

Chapter 3 Literature Review
3.1 Clickstream query in event detection
3.2 Optimization of MapReduce Jobs to Improve the Large-Scale Data Analysis Process
3.3 Scalable data compression with MapReduce
3.4 Finding locations using website log files
3.5 Finding the sales using pie charts
3.6 Finding the growth rate of a website using a bar graph

Chapter 4 Problem Statement
4.1 Proposed Objective
4.2 Problem Analysis
4.3 MapReduce Solution

Chapter 5 Implementation
5.1 Overview
5.2 Clickstream Data
5.3 View and Refine the Website Clickstream Data

Chapter 6 Results and Analysis
6.1 Analysis of the Website Clickstream Data

Chapter 7 Conclusion and Future Scope
7.1 Conclusion
7.2 Future Scope

Bibliography

List of Figures
Figure 1.1 Gaps in zip code method
Figure 1.2 Locations across a region in zip code method
Figure 1.3 Nearest locations found by zip code method
Figure 1.4 Various nodes in MapReduce
Figure 1.5 Execution flow in MapReduce
Figure 1.6 Architecture of Hadoop
Figure 1.7 Replication in Hadoop
Figure 2.1 Subdividing earth's surface up to one bit
Figure 2.2 Subdividing earth's surface up to two bits
Figure 2.3 Subdividing earth's surface up to three bits
Figure 2.4 Longitude and latitudes of Earth
Figure 2.5 Gaps in zip code method
Figure 2.6 Locations across a region in polygon method
Figure 2.7 Nearest locations found by polygon method
Figure 2.8 Locations inside a small polygon
Figure 2.9 Voronoi diagram
Figure 2.10 Nearest locations found through Voronoi method
Figure 2.11 Procedure to make a k-d tree
Figure 2.12 k-d tree
Figure 3.1 Compression and decompression of RDF
Figure 4.1 Flowchart of Mapper
Figure 4.2 Flowchart of Reducer
Figure 5.1 Configuration of Hadoop
Figure 5.2 Types of input files
Figure 5.3 Input file
Figure 5.4 Interface to find nearest location
Figure 5.5 Initial stage of mapping and reduction
Figure 5.6 Middle stage of mapping and reduction
Figure 5.7 Final stage of mapping and reduction
Figure 5.8 Output

Chapter 1 Introduction

1.1 Big Data
Big data is a collection of large data sets. Due to their giant size it is not efficient to process them using traditional methods, and due to their correlated behavior it becomes difficult to query them. The trend toward giant data sets exists because extra information can be derived from the analysis of a single large set of correlated data, as compared to separate smaller sets with the same total amount of data.

There has been a tremendous increase in user-generated data since 2005. According to Google, the online data on the web today is 281 exabytes, up from 5 exabytes in 2002. This explosion of data is seen in every sector of the computing industry. Internet companies such as Google, Yahoo and Facebook have to deal with large amounts of data generated by their users in the form of blog posts, photographs, status messages, and audio/video files. There is also a huge quantity of data generated indirectly by websites in the form of access log files, click-through events and so on. Most of this data is generated frequently; the data sets are stored temporarily for a fixed period and are discarded once they have been used. Professionals are trying to derive results from this huge amount of data. The main problems that have to be faced in processing big data are capturing, storage, search, sharing, transferring and analysis of the data.

1.1.1 Uses of big data
 People are sharing more and more data online: their personal lives, their opinions on people and products. Analysis of this data can reveal useful patterns about the behavior of users. This big data helps in finding relations between various fields, which may help in various ways, such as decision making, long-term planning, understanding business trends, fighting crime, and getting real-time roadway traffic conditions.
 Companies are trying very hard to use this data to learn user preferences, to generate better options for users and simply to get to know their users more and more.

 Big data is beneficial in situations where one wants minute-to-minute or even second-to-second information about something. If minute-to-minute data about the weather is kept, it can help in avoiding situations like floods or other calamities.
 Big data helps in keeping information about real-time traffic conditions; it may help in understanding traffic better and avoiding jams.
 In capital markets, big data is used to study the trend of the market.

1.2 Clickstream Data
Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files. These website log files contain data elements such as a date and time stamp, the visitor's IP address, the destination URLs of the pages visited, and a user ID that uniquely identifies the website visitor.

1.2.1 Potential Uses of Clickstream Data
One of the original uses of Hadoop at IBM, Yahoo and Facebook was to store and process their massive volumes of clickstream data. Now enterprises of all types can use Hadoop and the Hortonworks Data Platform (HDP) to refine and analyze clickstream data. They can then answer business questions such as:
 What is the most efficient path for a site visitor to research a product, and then buy it?
 What products do visitors tend to buy together, and what are they most likely to buy in the future?
 Where should I spend resources on fixing or enhancing the user experience on my website?

1.3 Point of Interest
A point of interest is how an individual user can grow their business with the help of clickstream information. An important use of clickstream information is that a user can improve the delivery of services and advertisements based on user interest, thereby improving the quality of user interaction and leading to higher customer loyalty.

It is important to identify visitors' interests so that the site owner can fulfill their needs accordingly and improve growth rates, because not all visitors are equal. Some are more profitable to the organization than others, because they are more likely to buy high-profit products. Thus it is necessary to make sure that the site serves the wanted users. Using such groups, it is desirable to analyze banner efficiency for each user group and to optimize this by presenting the right banners to the right people. Analysis of customer interest for product development purposes is also very interesting.

1.4 Tableau (Analytic Tool)
Tableau has optimized direct connections for many high-performance databases, cubes, Hadoop, and cloud data sources such as salesforce.com and Google Analytics. We can work directly with the data to create reports and dashboards. It is easy to get started: just connect to one of the more than 30 databases and formats supported by Tableau, enter your credentials and begin, with a single click to select a live connection or in-memory analytics. Tableau connects live or brings your data into its fast in-memory analytical engine. Once connected, you can bring data in memory and do fast ad hoc visualization.

Tableau connects to multiple flavors of Hadoop. Hadoop's distributed file system (HDFS) and the MapReduce algorithm support parallel processing across massive data. This lets you work with data that traditional databases find extremely difficult to process: it is probably huge, unstructured (for example XML data), nested, or all three. You can see patterns and outliers in all the data stored in your Hadoop cluster. The bottom line is that your Hadoop reporting is faster, easier and more efficient with Tableau.

Expressive Visualization
A bar chart doesn't work for everything. Data visualization tools must be flexible enough for you to tell any story. Getting to the best visual representation of your data is rarely a journey along a straight line: as you navigate between different perspectives, you can find the trends and outliers that are the most important.

There are many ways to look at simple sales data, from a crosstab report to a daily sales graph to sales by segment by month, each representing it one way. Tableau offers the ability to create different views of your data and change them as your needs evolve, so you don't have to wait on an IT change request to understand your data. Different views answer different questions, and switching between views is as easy as a click. Looking at data with a geographic element on a map brings in an entirely new dimension: in one example it becomes clear that sales are clustered in a few metro areas.

Figure 1.1: Different types of visualization

Interactive Visualization
The cycle of visual analysis is a process of getting data, representing it one way, noticing results and asking follow-on questions. The follow-on questions might lead to a need to drill down, drill up, filter, bring in new data, or create another view of your data. In a dashboard showing crime activity in the District of Columbia, for example, three filters acting on two different visualizations let you quickly understand trends by day of week and location. You can filter by crime type, district, or date. You can also highlight related data in multiple views at once to investigate patterns across time and location. Without interactivity, the analyst is left with unanswered questions; with the right interactivity, the data visualization becomes a natural extension of the analyst's thought process.

Figure 1.2: Interactive visualization

Access Any Data
Tableau is a powerful analytic tool because it lets you relate different views of information visually. While some dashboards are used over and over again, it is often useful to create a dashboard on the fly to investigate an issue or provide background for a strategic decision. The key to providing this kind of valuable decision support is to allow the business user to quickly and easily create or modify a dashboard. If every new dashboard or change request requires IT support and takes weeks, then you can't get the most out of your data.

1.5 MapReduce
MapReduce is a parallel programming technique which is used for processing very large amounts of data. Processing such a large amount of data can be done efficiently only if it is done in a parallel way, with each machine handling a small part of the data. MapReduce is a programming model that allows the user to concentrate on writing code. He need not worry about the concepts of parallel programming, such as how the data will be distributed to different machines or what will happen if any machine fails.

The developer must only know how to represent his logic in the form of Map and Reduce functions; all the coding is done in these two functions. Therefore it can be said that MapReduce is a software framework which makes it easy to write programs that involve large amounts of data, and the developer need not worry about anything else while his data is being processed in parallel.

In MapReduce, the input work is divided into various independent chunks. These chunks are processed by the mappers in a totally parallel way. The mappers process the chunks and produce output; this output is sorted and fed as input to the reducers. The input to the framework is provided in the form of key/value pairs and it also produces its output in the form of key/value pairs. MapReduce can be run on commodity hardware reliably and in a fault-tolerant way. MapReduce is most suitable when a large amount of data is involved; for small amounts of data it might not be that useful. MapReduce allows a very high aggregate bandwidth across the cluster because the nodes that store the data and the nodes that compute on it are typically the same. The framework itself handles scheduling of tasks, monitoring of tasks and re-executing tasks that have failed.

Before MapReduce, large-scale data processing was difficult because one had to look after the following factors oneself; now they are all handled by the framework:
 Managing hundreds or thousands of processors
 I/O scheduling
 Status and monitoring
 Managing parallelization and distribution
 Fault/crash tolerance

The user can concentrate only on his own implementation. MapReduce provides all of this easily because it is all built into the MapReduce framework. Before MapReduce, one needed complete knowledge of all these factors and had to write separate code for them, such as how to redistribute the data if a node fails or how to keep a check on the processors; now all of that is inbuilt.

1.5.1 Types of nodes in MapReduce
 The JobTracker node manages the MapReduce jobs. There is only one of these per cluster. It receives jobs submitted by the clients, schedules the map tasks and reduce tasks on the appropriate TaskTrackers in a rack-aware manner, and monitors for any failing tasks that need to be rescheduled on a different TaskTracker.
 TaskTracker nodes exist to achieve parallelism for map and reduce tasks; there are many TaskTrackers per cluster. They perform the map and reduce operations.

Figure 1.4: Various nodes in MapReduce

There are also a NameNode and DataNodes, and they are part of the Hadoop file system.
 There is only one NameNode per cluster. It manages the filesystem namespace and metadata. Expensive commodity hardware is used for this node.
 There are many DataNodes per cluster. They manage the blocks of data and serve them to clients, and they periodically report the list of blocks they store to the NameNode. Inexpensive commodity hardware is used for these nodes.

1.5.2 MapReduce Execution Overview

Figure 1.5: Execution flow in MapReduce

The process of running a MapReduce job consists of eight major steps:
 The first step is that the MapReduce program that has been written tells the JobClient to run a MapReduce job.
 This sends a message to the JobTracker, which produces a unique ID for the job.
 The JobClient copies job resources, such as the jar file containing the Java code written to implement the map and reduce tasks, to the shared file system, usually HDFS.
 As soon as the resources are in HDFS, the JobClient tells the JobTracker to start the job.
 The JobTracker does its own initialization for the job. It calculates how to split the data so that it can send each split to a different mapper process to maximize throughput, and it retrieves these input splits from the distributed file system.
 The TaskTrackers continuously send heartbeat messages to the JobTracker.

 Now that the JobTracker has work for them, it returns a map task or a reduce task as the response to the heartbeat.
 The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.

1.5.3 Inputs and Outputs to MapReduce
The MapReduce framework works only on <key, value> pairs. That is, the framework accepts the input to the MapReduce job as a set of <key, value> pairs and produces the output also as a set of <key, value> pairs, though the two may be of different types. The inputs and outputs of the MapReduce framework can be shown as:

map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)

The first line represents the map function, which takes its input in the form of key/value pairs represented as <k1, v1>. After processing those pairs, the map function produces a new set of key/value pairs. These pairs can be of a different type, so they are represented as <k2, v2>. This set of <k2, v2> pairs is fed to the reducer. The work of the reducer is to reduce the input data; it can do so according to various factors, for example by combining all the values which had the same key. It then emits the list of values, that is, list(v2).
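As an illustration of these signatures, the following is a minimal sketch using the standard Hadoop Java API. The word-count logic and the class names are illustrative only and are not the program developed in this project.

// A minimal sketch of the map/reduce signatures described above, using the
// standard Hadoop Java API (org.apache.hadoop.mapreduce). The word-count
// logic is only an illustration.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map (k1, v1) -> list(k2, v2): here k1 is the byte offset of a line, v1 the
// line itself, k2 a word and v2 a count of 1.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), ONE);  // emit (k2, v2)
            }
        }
    }
}

// reduce (k2, list(v2)) -> list(v2): all values that share a key arrive
// together, so the reducer can simply combine (sum) them.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}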

Chapter 2 Hadoop

2.1 Hadoop Overview
Hadoop is the Apache Software Foundation's open source, Java-based implementation of the Map/Reduce framework. Hadoop was created by Doug Cutting and originated from Apache Nutch, an open source web search engine that was part of the Lucene project. Nutch was an ambitious project started in 2002, and it soon ran into problems, with its creators realizing that the architecture they had developed would not scale to the billions of web pages on the Internet. But in 2003 a paper was published that described Google's distributed file system, the Google File System (GFS). Hadoop was born in February 2006, when it was decided to move NDFS and Nutch into a separate subproject under Lucene. In January 2008, Hadoop was made its own top-level project under Apache, and the name HDFS, or Hadoop Distributed File System, was kept instead of NDFS.

2.2 Hadoop Distributed File System (HDFS)
Hadoop provides various tools which help in processing vast amounts of data using the Map/Reduce framework and, additionally, implements the Hadoop Distributed File System (HDFS). HDFS is a file system designed for storing very large files with streaming data access patterns, and it runs on clusters of commodity hardware. This implies that it is capable of handling datasets of much bigger size than conventional file systems (even petabytes). Once the data is generated and loaded onto HDFS, the datasets are divided into blocks and stored across a cluster of machines which run the Map/Reduce or Hadoop jobs. This helps the Hadoop framework to partition the work in such a way that data access is local as much as possible.

A very important feature of HDFS is its "streaming access". HDFS assumes that each analysis will involve a large proportion of the dataset, so the time taken to read the whole dataset is more important than the latency in reading the first record. This has its advantages and disadvantages: on one hand HDFS can read big chunks of contiguous data locations very fast, but on the other hand random seeks turn out to be so slow that it is highly advisable to avoid them. Hence, applications for which low-latency access to data is critical will not perform well with HDFS.

Figure 1.6: Architecture of HDFS

2.3 Hadoop Cluster
A Hadoop cluster consists of the following main components, all of which are implemented as JVM daemons.
 JobTracker: the master node controlling the distribution of a Hadoop (MapReduce) job across free nodes on the cluster. It is responsible for scheduling the jobs on the various TaskTracker nodes. In case of a node failure, the JobTracker starts the work scheduled on the failed node on another free node. The simplicity of Map/Reduce ensures that such restarts are easily achievable.
 NameNode: the node controlling the HDFS. It is responsible for serving any component that needs access to files on the HDFS, and it is the NameNode that is responsible for ensuring that HDFS is fault tolerant. In order to achieve fault tolerance, files are replicated over three different nodes, of which two are on the same rack and one is on a different rack.
 TaskTracker: the node actually running the Hadoop job. It requests work from the JobTracker and reports back on updates to the work allocated to it. The TaskTracker daemon does not run the job on its own, but forks a separate daemon for each task instance; this ensures that if the user code is malicious it does not bring down the TaskTracker.
 DataNode: this node is part of the HDFS and holds the files that are put on the HDFS. Usually these nodes also work as TaskTrackers, and the JobTracker tries to allocate work to nodes such that file accesses are local, as much as possible.

2.4 Data Replication
HDFS is designed to store very large files across machines in large clusters. Files are stored as a sequence of blocks, and the blocks of a file are replicated to provide fault tolerance. The block size and the number of replications can be configured per file; the replication factor can be specified at file creation time and changed later, and based on the application it can be decided how many replicas of a file there should be. All decisions regarding the replication of blocks are made by the NameNode. It keeps receiving heartbeat and blockreport messages periodically from every DataNode in the cluster: as long as it receives the heartbeat messages, it means that the DataNode is up and working properly, while the blockreport tells it about all the blocks present on that DataNode.

Figure 1.7: Replication in HDFS

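As an illustration of how a client program interacts with HDFS and its replication factor, the following is a small sketch using the standard org.apache.hadoop.fs API. The NameNode URI, the paths and the sample log line are assumed placeholders, not values taken from this project.

// Minimal sketch of writing to and reading from HDFS with a chosen replication
// factor, using the standard Hadoop FileSystem API. The URI and paths are
// illustrative placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this comes from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/user/demo/clickstream/sample.log");
        short replication = 3;  // e.g. two copies on one rack, one on another

        // Write a small file with the requested replication factor.
        try (FSDataOutputStream out = fs.create(file, replication)) {
            out.writeBytes("127.0.0.1 - - [01/Jun/2014:10:15:00 -0600] \"GET /index.html HTTP/1.1\" 200 1043\n");
        }

        // Read it back; the client does not need to know which DataNodes hold the blocks.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}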
Chapter 3 Literature Review

3.1 Clickstream query in event detection
The rise of social media platforms in recent years has brought up huge information streams which require new approaches to analyze the respective data. At the time of writing, more than 500 million posts are issued every day. A large part of these information streams comes from private users who describe everything on social media: how they are feeling, what is happening around them, or what they are doing currently. Many people share the information they are interested in, and when something happens, some people like to pinpoint it on a map, so that others can know about it and the information becomes more actionable. Based on all those streams, event detection can take place. For example, if there is an earthquake in the area one monitors, one wants to know where it caused what kind of casualties or damage. Such a system can be useful in very different scenarios, independent of the event type. In particular, the following customer groups and use cases can be seen:
 Police forces, governmental organizations and fire departments can increase their situational awareness picture of the area they are responsible for. Some are interested in house fires, some in bomb threats, traffic jams, gatherings, parties, on-going baseball games, Broadway premieres, conferences and demonstrations in the area one monitors.
 Journalists and news agencies, to instantly be informed about breaking events.
 Private customers that have an interest in what is going on in their area. Furthermore, the particular nature of Twitter and its adoption by a younger, "trendy" crowd suggests applications along the lines of, to name just one possibility, a real-time New York City party finder.

3.2 Optimization of MapReduce Jobs to Improve the Large-Scale Data Analysis Process
Data-intensive applications work on large amounts of data using some parallel processing paradigm. MapReduce is a programming model made for data-intensive applications with massive data sets, as well as an execution framework for processing large-scale data on clusters, and extended methods added to MapReduce have become widespread in recent years. While MapReduce has many strong points, such as its easy programming structure, fault tolerance and high scalability, its performance depends directly on the various configuration parameters of Hadoop under various conditions, and these parameters need to be finely and efficiently tuned for the specific deployment. In order to achieve maximum performance, all these configuration parameters should be tuned very carefully. A main objective of such work is to improve the performance of complex data analytical tasks; extending and improving the MapReduce runtime framework is one of the key technologies for efficiently processing complex data analysis tasks on large clusters. Tuning mainly requires three steps:
 The data redistribution technique should be extended so as to find the nodes with high performance.
 A new routing schedule for the shuffle phase should be made, so as to define the scheduler task while reducing the memory management overhead.
 The number of map/reduce slots should be utilized so as to make the execution time more efficient.

As an example, suppose a data analysis task requires the joining of multiple data sets to compute certain aggregates. The main technologies that are used for such tasks are the following.

 Filtering–join–aggregation: Suppose that the approach requires a filtering–join–aggregation task in two successive MapReduce jobs. Filtering logic is applied by the first job to all the data sets in parallel; the second job joins the qualified tuples and sends the join results to the reducers for partial aggregation, improving the performance of complex analysis queries.
 Map–reduce–merge: Map–reduce–merge adds a merge phase on top of MapReduce to increase the efficiency of merging data which has already been partitioned and sorted (or hashed) by the map and reduce modules.
 Pipelined MapReduce: Pipelined MapReduce allows data transfer through a pipeline between the operations; it expands the batched MapReduce programming model, reduces the completion time of tasks and improves the utilization rate.
 MapReduce indexing: MapReduce indexing strategies provide a detailed analysis of four indexing strategies of varying complexity, examined for the deployment of large-scale indexing.

All of these studies demonstrate the usefulness of performance tuning of MapReduce jobs. Therefore the following question naturally arises: are there good approaches for optimizing all the problems of MapReduce? One of the most important requirements for effective performance tuning is to discover the important parameters that are related to tuning a job across all features, so these parameters need to be looked after first of all. These are the main parameters which need to be tuned in any MapReduce job so as to improve its efficiency.

3.3 Scalable data compression with MapReduce
The Semantic Web consists of billions of statements, which are published using the RDF (Resource Description Framework) data model. The Semantic Web is an extension of the current World Wide Web in which the semantics of information can be interpreted by machines. Information is represented as a set of RDF statements, where each statement is made of three different terms: a subject, a predicate, and an object. An example statement is <http://www.vu.nl> <rdf:type> <dbpedia:University>; this example states that the concept identified by the uniform resource identifier (URI) http://www.vu.nl is a university.

In order to handle such a big amount of data, a compression technique is required to be applied to the RDF data. It makes use of a dictionary encoding technique, which maintains the structure of the data. But the data is so big that even applying compression to it is difficult; therefore, in order to do efficient compression and decompression of such big data, MapReduce algorithms are used.

(a) Compression algorithm (b) Decompression algorithm
Figure 2.1: Compression and decompression of RDF

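To make the idea of dictionary encoding concrete, the following is a small single-machine sketch of the technique. It is only an illustration of the encoding idea, not the MapReduce implementation from the cited work, and all class and variable names are assumptions.

// Illustrative sketch of dictionary encoding for RDF terms: every distinct
// term is mapped to a compact numeric id, and each triple is stored as three
// ids. The cited work performs the same kind of encoding at scale with
// MapReduce jobs.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RdfDictionaryEncoder {
    private final Map<String, Long> dictionary = new HashMap<>();
    private final List<String> reverse = new ArrayList<>();

    // Return the id for a term, assigning a new one if the term is unseen.
    private long encodeTerm(String term) {
        Long id = dictionary.get(term);
        if (id == null) {
            id = (long) reverse.size();
            dictionary.put(term, id);
            reverse.add(term);
        }
        return id;
    }

    // Encode a triple <subject, predicate, object> as three numeric ids.
    public long[] encode(String subject, String predicate, String object) {
        return new long[] { encodeTerm(subject), encodeTerm(predicate), encodeTerm(object) };
    }

    // Decoding simply looks the ids up in the reverse table.
    public String[] decode(long[] triple) {
        return new String[] {
            reverse.get((int) triple[0]),
            reverse.get((int) triple[1]),
            reverse.get((int) triple[2])
        };
    }

    public static void main(String[] args) {
        RdfDictionaryEncoder enc = new RdfDictionaryEncoder();
        long[] encoded = enc.encode("<http://www.vu.nl>", "<rdf:type>", "<dbpedia:University>");
        System.out.println(encoded[0] + " " + encoded[1] + " " + encoded[2]);
        System.out.println(String.join(" ", enc.decode(encoded)));
    }
}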
3.4 Finding locations using website log files
Nowadays almost every business has a "Find Locations" functionality for its website visitors, based on a given address, so that it can improve its service in the areas from which it gets more visitors and also target other areas. For example, a businessman wants to find the visitors who are looking at the products on his website. The merchant wants to know from where a visitor found the path to his website, how much time he or she spends on the website, and on which page the visitor spends the most time. The method used behind this query is to match the geocoded IP address from the website log files against the geocoded codes of places in a database, one by one; the places which fall under the same geocoded IP address are given as the closest locations, so that the merchant knows the area of the visitor and can promote his business well there.

Figure 2.2 shows an encircled dot (the current location of a visitor) surrounded by various other dots (different places), which can be compared to a real-life scenario. The circles depict the different IP address ranges of different areas. It is obvious that places which fall under the same code belong to the same region and are near to each other, so those places whose geocoded code matches the visitor's are displayed.

Figure 2.2: Locations across a certain region in the geocoded code method

3.5 Finding the sales using pie charts
In some of the Tableau frameworks, a merchant can find the locations from which he got the most users and see how he can improve sales in the other areas.

This approach is adopted by most of the sites, and it is easier than matching by address because just a polygon has to be created. The chart corresponds to an area defined on the map with clear boundaries, for example a circle around one point. All the locations that fall within the specified boundary are displayed, and queries can then be run over all the points defined inside the new polygon. Within each state there are a lot of different cities with different postal codes; Figure 2.3 shows the sales in a particular state broken down by postal code. Based on those regions one can analyze the sales in a particular region: the analytics show from which regions sales and visitors are increasing, so that the merchant can focus on those regions, improve the website information, and concentrate on the areas which are more profitable than others. Through this kind of analytics any merchant can grow his business easily by finding such areas.

Figure 2.4: Locations across a region in the polygon method

3.6 Finding the growth rate of a website using a bar graph
A bar graph describes how the number of visitors to the website increases or decreases per year across the different regions. To make the bar graph, first the number of visitors in a particular year is specified.

Figure 2.7: Bar graph diagram

Chapter 4 Problem Statement

4.1 Proposed Objective
This project will find the visitor's location and the path by which he arrives at the website (the point of interest). The location can be of any category, like a school, hotel, library, and so on; those locations are points of interest. Anyone would be able to find the nearest location over a whole country or more. There are mainly two types of queries: one solves the bounded-box problem, which involves limited data, and the other is the unbounded or open-box query, which covers a wide range of data. It is desired that one could query the unbounded data, meaning there is no limitation on the range of data, so the algorithm is not restricted to finding locations within, say, a 10 km range. Many methods already exist to find the nearest location, but all of them have a few flaws which make them not well suited for our application. MapReduce is mainly useful in those cases where a large amount of data is involved, and its execution takes place in parallel, so it will also take less time in processing a large database; it therefore suits our work well. The main concern here is to arrive at a more accurate and efficient solution to this problem, and we provide a solution for it using MapReduce.

4.2 Problem Analysis
Many methods are already there to find the visitor location, but all of them take more time to analyze because of the size of the data: detailed analysis of website logs is a common big data task. A typical web log is on the order of 100 bytes per click, so large websites handling millions of simultaneous users can generate hundreds of gigabytes or even terabytes of weblogs per day. Searching out the bits of valuable information from this mass of data can require very sophisticated programs. Depending on the type of website, these logs may contain information about customer shopping habits, social media networks, or web advertisement effectiveness.

The solution verified in this report tackles a simpler weblog analysis task: using the remote IP address and timestamp collected with each weblog entry to measure the amount of traffic coming to the website by country of origin on an hour-by-hour basis during the average day. The remote IP address is the first component of the standard Apache weblog, and the time may be extracted from the timestamp, which is the second component of most weblogs (see Figure 3.1). Our solution needs to extract these items and look the IP address up in a table mapping IP addresses to host countries (for simplicity we look at only the first two octets of the IP address and look them up in a table listing all the two-octet, or Class B, addresses that are used solely by a single country).

Figure 3.1: A standard Apache web log and its components

The data used in these tests was generated by a Hadoop program. This program produces realistic sequential Apache web logs for a specified month, day, year and number of clicks per day from the unstructured data. The remote hosts are distributed geographically among the top 20 Internet-using countries, as shown in Table 1, and temporally so that each region is most active during its local evening hours (simulating a consumer or social website), as shown in Table 2. The website is assumed to be in the Central US time zone and each of the countries is assigned a single offset from that for simplicity.

Table 1: Top Twenty Internet-Using Countries
Country    Percent      Country        Percent
China      31           Iran           3
US         13           Korea          3
India      7            Mexico         3
Brazil     5            Nigeria        3
Japan      5            Turkey         2
Germany    4            Italy          2
Russia     4            Philippines    1
France     3            Pakistan       1
UK         3            Vietnam        1

Writing a MapReduce program is the most general way to exploit the capabilities of this framework for data manipulation and analysis. Hourly Distribution of Web Accesses in Local Time The MapReduce Solution The Hadoop MapReduce framework provides a flexible. The design of the MapReduce program to solve the geographical web analysis is fairly straightforward: the Mapper will read log files from HDFS one line at a time. combine tasks that aggregate the values for each key being emitted by a mapper. and reduce tasks that process the values for each key. with a value of 1. resilient and efficient mechanism for distributed computing over large clusters of servers. each with a value corresponding to the total number of hits coming from that country in that hour across the whole set of log files. look up the country corresponding to that IP address in a table.01 3 12 3 02 1 13 2 03 1 14 2 04 1 15 4 05 1 16 6 06 2 17 5 07 2 18 6 08 2 19 8 09 2 20 12 10 2 21 12 11 2 22 12 Table 2. The framework supports map tasks that read data stored in the Hadoop Distributed File System (HDFS) and emit key-value pairs. parse the first two octets of the remote IP address as well as the hour of the web access. and emit a key composed of the country code and hour. So it can be seen that none of the methods have been implemented Page | 24 . The Combiner and Reducer (actually the same program) will add up all the values per key and write 24 keys per country detected to HDFS. The flow is shown in Figure 2.

So it can be seen that none of the existing methods has been implemented with MapReduce yet, and moreover they do not solve the problem in an accurate way. The proposed solution will therefore be much more efficient with respect to both time and accuracy.

Chapter 5 Implementation

5.1 Overview
The solution to this problem is provided using Hadoop. Hadoop (MapReduce) executes in parallel, so our large data set will be processed in parallel all at once. MapReduce needs two functions to be written, namely Map and Reduce, and all the work is carried out in these two functions only. The Map function takes its input in the form of <key, value> pairs and provides new <key, value> pairs as output. This output then goes to the Reducer, which processes the data and gives the final output.

5.2 Clickstream Data
Clickstreams, also known as clickpaths, are the routes that visitors choose when clicking or navigating through a website. A clickstream is a list of all the pages viewed by a visitor, presented in the order the pages are viewed, also defined as the "succession of mouse clicks" that each visitor makes. An interactive clickstream is a graphic representation of a clickstream, hence the label "interactive"; the graphic allows you to click on the pages and see what the visitor saw.

The most obvious reason for examining clickstreams is to extract specific information about what people are doing on your site. A clickstream will show you when and where a person came in to the website, all the pages viewed, the time spent on each page, and when and where they left. It will also tell you which pages are most frequently viewed, how long people spend on your site on average, and how often they return. There is a wealth of information to be analyzed as aggregated statistics: you can examine visitor clickstreams in conjunction with any of the information provided by a good stats program, such as visit durations, search terms, countries, ISPs and browsers. Taken all together, this process will give you insight into what your visitors are thinking, and examining individual clickstreams will give you the information you need to make content-related decisions without guessing.

5.3 View and Refine the Website Clickstream Data
In the "Loading Data into Hadoop" step, we loaded website data files into the Hadoop platform and then used Hive queries to refine the data. Often, organizations will process weeks, months, or even years of data. In the Hortonworks Sandbox, we are working with:
 Acme website log files, containing information such as URL, IP address, geocoded IP address, timestamp, and user ID (SWID).
 users: CRM (customer relationship management) user data listing SWIDs (software user IDs) along with date of birth and gender.
 products: CMS (content management system) data that maps product categories to website URLs.

To upload the Acme log files, click the File Browser icon in the toolbar at the top of the page, then select the Acmelogfiles.tsv.gz file. The raw data file appears in the File Browser, and you can see that it contains information such as URL, timestamp, IP address, geocoded IP address, and user ID (SWID). The Acme log dataset contains about 4 million rows of data, which represents five days of clickstream data.

Fig 4.1: Uploading the Acme log files

The data is in unstructured form, and it can be structured by the MapReduce method using Hive commands.

Fig 4.2: Shows the unstructured data in the Acme log files

Now look at the users table using HCatalog. In HCatalog we browse the data of the users who visit the site through the different paths.

Select the users table to show its content. The users table appears, and we can see the registered user ID, birth date, and gender columns. There are more than 34,000 user IDs in the users table.

Fig 4.3: Select the users table to show its content

Fig 4.4: Shows the user information of visitors to the site

We can also use HCatalog to view the data in the products table, which maps product categories to website URLs; after that we show which category a user has chosen and how much time has been spent on each category.

Fig 4.5: Shows the URL of the page for each category

In the "Loading Data into Hadoop" step, we used Apache Hive to join the three data sets into one master set. This is easily accomplished in Hadoop, even when working with millions or billions of rows of data. First, we used a Hive script to generate an "Acmelogfiles" view that contained a subset of the data in the Acme log table. To view the Acme data in Hive, click Table, then click Browse Data in the Acme log row.

Fig 4.6: Shows the Hive query to connect the data

This Hive query executed a join to create a unified dataset across our data sources.

Fig 4.7: Shows the structured Acme log file data

Next, we created a "webloganalytics" script to join the Acme website log data to the CRM data (registered users) and the CMS data (products).

Fig 4.8: Shows the Hive script to join the Acme log, CRM and CMS data

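Since the Hive scripts themselves are not reproduced in this report, the following sketch shows how such a join could be submitted from Java over JDBC. The connection URL, table names and column names (acme_logs, users, products, swid, and so on) are hypothetical placeholders, not the exact schema used in this project.

// A sketch of submitting a join like the one described above to Hive from
// Java over JDBC. It assumes a local HiveServer2 instance and illustrative
// table and column names.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WeblogAnalyticsJoin {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {

            // Join website logs with registered users (CRM) and product categories (CMS).
            String query =
                "SELECT l.ip, l.url, l.ts, u.birth_dt, u.gender_cd, p.category " +
                "FROM acme_logs l " +
                "JOIN users u ON (l.swid = u.swid) " +
                "JOIN products p ON (l.url = p.url)";

            try (ResultSet rs = stmt.executeQuery(query)) {
                int shown = 0;
                while (rs.next() && shown++ < 10) {  // print a small sample
                    System.out.println(rs.getString("ip") + "\t"
                            + rs.getString("category") + "\t"
                            + rs.getString("gender_cd"));
                }
            }
        }
    }
}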
You can view the data generated by the webloganalytics script in Hive as described in the preceding steps.

Fig 4.9: Shows the webloganalytics data through which we can analyze the clickstream

Chapter 6 Results and Analysis

Once Hadoop is installed, it is started by using the command "start-all.sh". After this command, Hadoop is started on the Linux machine. This is confirmed by using the command "jps", a tool which checks the status of the JVM processes; this command shows whether all the nodes are up and working. Figure 6.1 shows this pre-MapReduce processing, and here it shows all the nodes to be working. HDFS is used for the implementation of MapReduce.

Figure 6.1: Configuration of Hadoop

Figure 6. reducer starts to reduce it simultaneously. Page | 35 . It can be seen that they both start simultaneously the data which has been mapped by the mapper.2: Categories of input files Figure 6. It means the unstructured big data which have been converted by mapper in to the corresponding structured data are accepted by the reducer simultaneously.2 shows the first phase when MapReduce job has started. In the starting both mapping and reduction are 0 percent.

6.1 Analysis of the Website Clickstream Data
Data visualization can help you optimize your website and convert more visits into sales and revenue. From the webloganalytics data we can analyze our data and make changes accordingly, for example:
 Analyze the clickstream data by location
 Filter the data by product category
 Graph the website user data by age and gender
 Pick a target customer segment
 Identify a few web pages with the highest bounce rates

Figure 6.3 represents the webloganalytics data after the combination of the Acme website log data with the CRM data (registered users) and the CMS data (products); this Hive query executed a join to create a unified dataset across our data sources.

Fig 6.3: Shows the combined data in webloganalytics

Now let's take a look at a count of IP addresses by state. The map view displays a global view of the data. First, drag the ip field into the SIZE box.

Fig-6.5 Shows the visitor’s country.4: Show’s visitors Country Figure 6.Figure 6. state and its ip address Page | 37 . Each blue dot represents the different country.4 shows the map where all the countries are marked up by the blue dots represents the country of visitors. In this we can see there is a country marked up country code”fra” name as France.

Figure 6.5 shows the country and state that visitors belong to, or from where they access the website and search for something. In this map the visitors are from the USA (New York) and the IP address is 4.133.96.30.

Figure 6.6 shows which city is searching for which type of product. Here a visitor from the city of Atlanta searches for accessories, and there are 692 IP addresses which are searching. From this type of information a merchant can increase his growth rate accordingly, by providing those services from which visitors are getting the most benefit.

Fig 6.6: Shows the category and visitor location

Figure 6.7 shows that how many products are searched from each country so the merchant can focus on supplying particular products in specific areas.8: Shows the growth rate Figure 6. In this fig using pie chart we can see 47% shoes are buying in usa.7: Category wise searching country Figure 6.Figure 6. Page | 39 .8 shows the growth rate of product with respective of user who are buyiere are 391770 total ip addresses are searched and buy handbags by the females and searched a particular url.

Fig-6.1 Conclusion The amount of data is rapidly growing on a social websites. profile-based marketing. With the advancement in web technology and availability of raw data infrastructure. e-commerece and many other sites. In this report a method to tackle the behavior and location problem of the visitor has been derived. the demand for analyze of Clickstream information over web has increased significantly. Chapter 6 Conclusion and Future Scope 6. In this bar graph shows the given url is searched by mens in Canada. So a more refined method has been presented which uses Mapreduce method to refine the raw data to find the locations of Page | 40 . There are already existing methods to find the path but they were somehow not so accurate and efficient according to this problem.7 shows the final phase of the Clickstream data Analyze job. Clickstream information Analytics play an important role in a wide variety of applications such as decision support systems. A lot of frameworks and techniques are there to handle Clickstream info.9 Shows the visitor who searches url in Canada Figure 5. to know about the visitor and path where he comes from.

For example, an e-commerce business may be interested in finding the path of a visitor so that it can target the customer and know what he or she wants to buy from the website. Finding the behavior and location of visitors helps in decision making, i.e. which kinds of supplies are needed in which areas. It would be of great help for various organizations: for a telecom or e-commerce industry that wants to find the behavior and location of users around an area where it can start its service and which at the same time has an adequate population, and similarly for other organizations. A variety of clickstream data can be efficiently modeled using MapReduce.

The size of the data involved becomes too big, and it would be inefficient to access that data sequentially, so MapReduce has been used to process the data in parallel. MapReduce is a programming model that lets developers focus on writing the code that processes their data without having to worry about the details of parallel execution: a MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner, and the framework sorts the outputs of the maps, which are then input to the reduce tasks. MapReduce is therefore very well suited to complex clickstream data, and it improves the solution in terms of time complexity to a great extent because the data is processed in parallel.

7.2 Future Scope
Clickstream information plays an important role in the modern day, where data is growing rapidly. Finding the behavior and location of visitors is one of the most important parts of an online business, and clickstream analysis can be helpful in day-to-day life for business growth.

Bibliography

[1] Pascal Hitzler and Krzysztof Janowicz, "Linked Data, Big Data, and the 4th Paradigm", Semantic Web, Vol. 4, IOS Press, 2013.
[2] Nathan Eagle, "Big data, global development, and complex social systems", ACM, New York, NY, USA, 2010.
[3] Wenfei Fan, "Querying Big Social Data", BNCOD 2013, Springer Berlin Heidelberg, 2013.
[4] Marissa Mayer, "The Coming Data Explosion", cited by Richard MacManus, http://www.readwriteweb.com/archives/the_coming_data_explosion.html, 2010.
[5] Alex Nazaruk and Michael Rauchman, "Big Data in Capital Markets", SIGMOD '13, June 22-27, 2013, New York.
[6] Afsin Akdogan, Ugur Demiryurek, Farnoush Banaei Kashani and Cyrus Shahabi, "Voronoi-Based Geospatial Query Processing with MapReduce", CloudCom, pp. 9-16, 2010.
[7] J. Patel, "Building a Scalable Geospatial Database System", in SIGMOD, 1997.
[8] "Geohash and its format" [online], http://geohash.org/site/tips.html
[9] Jens Dittrich and Jorge-Arnulfo Quiane-Ruiz, "Efficient Big Data Processing in Hadoop MapReduce", Proceedings of the VLDB Endowment, Vol. 5, No. 12, Istanbul, Turkey, August 27-31, 2012.
[10] Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", Proceedings of the VLDB Endowment, Vol. 5, No. 12, 2012.
[11] Berners-Lee T., Hendler J. and Lassila O., "The Semantic Web", Scientific American, May 2001.
[12] "W3C recommendation: RDF primer", http://www.w3.org/TR/rdf-primer/
[13] Jacopo Urbani, Jason Maassen, Niels Drost, Frank J. Seinstra and Henri E. Bal, "Scalable RDF data compression with MapReduce", Concurrency and Computation: Practice and Experience, Vol. 25, No. 1, pp. 24-39, 2013.
[14] Der-Tsai Lee, "On k-Nearest Neighbor Voronoi Diagrams in the Plane", IEEE Transactions on Computers, 1982.
[15] Tomoyuki Shibata and Toshikazu Wada, "K-D Decision Tree: An Accelerated and Memory Efficient Nearest Neighbor Classifier", IEICE Transactions, Vol. 93-D, No. 7, pp. 1670-1681, 2010.