
An Effective Forum Crawler

CHAPTER 1

INTRODUCTION

INTERNET forums (also called web forums) are important services where users can
request and exchange information with others. For example, the Trip Advisor Travel Board is a
place where people can ask and share travel tips. Due to the richness of information in forums,
researchers are increasingly interested in mining knowledge from them.
The World Wide Web consists of billions of documents spread across a huge number of web pages. As the number of pages grows, their content also changes dynamically, covering news, financial information, entertainment, schedules, and much more. It is therefore very difficult for a user to obtain the relevant information requested through a particular search engine. For example, even a major search engine such as Google, which crawls an enormous number of pages per day, takes weeks to crawl the whole web. A crawler is therefore used to find relevant information.
A breadth-first crawler crawls the web, stores all the data it finds, and returns hyperlinks as results, so its database quickly becomes too large to handle. If this drawback is addressed, it becomes simpler for the user to obtain the desired data and the size of the database is reduced to a large extent. To avoid this drawback, a crawler is needed that searches only a subset of the World Wide Web rather than the whole web. Such a crawler has to address two problems: first, it should have a good strategy for deciding which pages to download next; second, it must have a reliable system that manages and recovers from the harmful effects of system crashes.
A crawler is a program that downloads and stores Web pages, often for a Web search engine. A crawler traverses the World Wide Web in a systematic manner with the intention of gathering data or knowledge, or for the purpose of web indexing; it is also referred to as a robot or a spider. In essence, a web crawler is a system for the bulk downloading of web pages. A crawler starts by placing an initial set of URLs in a queue, where all URLs to be retrieved are kept and prioritized. It then repeatedly takes a URL from this queue in some order, downloads the page, extracts any URLs contained in the downloaded page, and puts the new URLs into the queue. This process continues until a stopping condition is reached. The collected pages are later used by other applications, such as a Web search engine or a Web cache.
Web crawlers are used for many purposes. Most prominently, they are a core component of web search engines: systems that assemble a corpus of web pages, index them, and let users issue queries against the index to find the web pages that match those queries.

Although web crawlers are an essential component of search engines, running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is also a fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites being crawled. Especially when pages are fetched from many different servers, the total crawling time can be reduced significantly if many downloads are done in parallel.
Despite the numerous applications for Web crawlers, at the core they are all
fundamentally the same.

Following is the process by which Web crawlers work:


1. Download the Web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.
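
To make this loop concrete, the following is a minimal, illustrative Java sketch of the three steps; it is not the proposed crawler itself. It assumes Java 11+ for java.net.http.HttpClient, extracts links with a crude regular expression rather than a real HTML parser, and uses the gardenstew.com address from the later examples purely as a sample seed URL.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.*;

public class SimpleCrawler {
    // Very rough href extractor; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> frontier = new ArrayDeque<>(List.of("http://www.gardenstew.com/"));
        Set<String> visited = new HashSet<>();
        int limit = 50; // stop after a fixed number of pages

        while (!frontier.isEmpty() && visited.size() < limit) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;       // skip already-seen URLs

            // Step 1: download the Web page
            HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
            String html = client.send(req, HttpResponse.BodyHandlers.ofString()).body();

            // Step 2: parse the downloaded page and retrieve all the links
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                // Step 3: queue each retrieved link so the process repeats
                frontier.add(m.group(1));
            }
        }
        System.out.println("Crawled " + visited.size() + " pages");
    }
}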

The Web crawler can be used to crawl an entire site on the Internet or an intranet. You specify a start URL and the crawler follows all links found in that HTML page. This usually leads to more links, which are followed in turn, and so on. A site can therefore be seen as a tree structure: the root is the start URL, all links in the root HTML page are direct children of the root, and subsequent links are children of the previous children.


CHAPTER 2

SYSTEM ANALYSIS

2.1 EXISTING SYSTEM

Earlier work proposed a method for learning regular expression patterns of URLs that lead a crawler from an entry page to target pages. It is very effective, but it works only for the specific site from which the sample page is drawn, and the same process must be repeated for every new site; it is therefore not suitable for large-scale crawling. In contrast, FoCUS learns URL patterns across multiple sites and automatically finds the forum entry page given any page of a forum. Experimental results indicate that FoCUS is effective in large-scale forum crawling by leveraging crawling knowledge learned from a few annotated forum sites.
The majority of existing Web crawlers can be categorized into two types: Generic Deep Web Crawlers and Focused Crawlers. A Generic Deep Web Crawler retrieves all the Web pages it encounters as it crawls by following hyperlinks. The basic procedure followed by such a crawler is shown in Figure 1.

2.1.1 Semantic focused crawler

A Semantic Focused Crawler is a Focused Crawler that uses Semantic Web technologies to perform the crawl. An Ontology-based Semantic Focused Crawler links Web documents with related ontology concepts in order to categorize them; it uses ontologies to analyze the semantic similarity between the URLs of Web pages and the topics of interest. The limitation of this type of crawler is that most of them fetch the text surrounding a URL as its descriptive text and compute the similarity between the URL and the ontology concepts based on that text, but the surrounding text often cannot correctly or sufficiently describe the URL. A Metadata Abstraction based Semantic Focused Crawler is a Focused Crawler that extracts meaningful information or metadata from relevant Web pages and annotates the metadata with ontology mark-up languages. Many of these supervised classification models use predefined classifiers based on plain text without enough semantic support, which decreases the performance of document classification.

2.1.2. Deep generic web crawler

A deep generic Web crawler is a type of bot, or software agent, that starts with a list of URLs to visit, called the seeds. Some content is accessible only by filling in HTML forms and cannot be reached by conventional crawlers that simply follow hyperlinks; crawlers that automatically fill in forms to reach the content behind them are called hidden web or deep web crawlers.

As the crawler visits the seed URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is archiving websites, it copies and saves the information as it goes. Because of the large volume of the Web, the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. Because of the high rate of change, pages may already have been updated or even deleted by the time they are visited.

Generic Web crawlers, which adopt the breadth-first strategy, are usually inefficient at crawling Web forums. A Web crawler must make a trade-off between performance and cost, balancing content quality against the costs of bandwidth and storage. A shallow (breadth-first) crawl cannot ensure access to all valuable content, whereas a deep (depth-first) crawl may fetch too many duplicate and invalid pages (usually caused by login failures). Experiments with a breadth-first, depth-unlimited crawler have shown that, on average, more than 40% of the crawled forum pages are invalid or duplicates. Moreover, a generic crawler usually ignores the content relationships among pages and stores each page individually, whereas a forum crawler should preserve such relationships to facilitate various data mining tasks.

In brief, neither the breadth-first nor the depth-first strategy alone can satisfy the requirements of forum crawling. An ideal forum crawler should answer two questions:
1) Which pages should be crawled? Obviously, duplicate and invalid pages should be skipped to save network bandwidth and reduce redundancy.
2) Which outlinks in a page should be followed, and how should they be followed? By nature, these two questions are coupled with each other. To verify whether a page is valuable, the crawler should find out where it comes from (i.e., which links lead to it); to judge whether a link should be followed, the crawler must evaluate the informativeness of the target page.

Crawling policies

There are three important characteristics of the Web that make crawling very difficult:

1. Its large volume: the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads.

2. Its fast rate of change: by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

3. Dynamic page generation: the number of possible URLs generated by server-side software makes it difficult for web crawlers to avoid retrieving duplicate content.


Figure 1. Generic Deep Web Crawler


2.1.3. Focused crawler


A focused crawler is a web crawler that collects Web pages satisfying some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. FoCUS learns URL patterns across multiple sites and automatically finds the forum entry page given a page from a forum. Experimental results show that FoCUS is effective in large-scale forum crawling by leveraging crawling knowledge learned from a few annotated forum sites.
FoCUS overcomes the limitation of the earlier single-site approach described above, in which regular expression patterns of URLs leading from an entry page to target pages must be relearned for every new site. Because FoCUS learns URL patterns across multiple sites, this per-site effort is avoided and the approach scales to large collections of forums.
FoCUS uses a weak page classifier (SVMlight) together with a majority voting method for index/thread URL detection. The issue with this combination is that the outcome of a weak classifier may be erroneous: if a URL group contains very few URLs (say, two or four) and the majority of them are misclassified, the accuracy of the crawling process suffers. A further drawback is that if a group contains an even number of URLs, majority voting fails when exactly half of the URLs in the group are misclassified. Another drawback is that FoCUS cannot detect JavaScript-based URLs; if a page-flipping URL is generated by JavaScript, only the first page is retrieved and the remaining pages are not crawled, which hurts the precision and coverage metrics. Finally, FoCUS uses a Breadth First Strategy (BFS) for crawling, which has the drawbacks already discussed.
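
The tie problem described above can be seen in a small, hypothetical Java sketch of majority voting over a URL group; the class and the labels below are illustrative and are not taken from the FoCUS implementation.

import java.util.List;

public class MajorityVote {
    /** Returns "INDEX", "THREAD", or "TIE" for a group of per-URL predictions. */
    static String vote(List<String> predictions) {
        long index = predictions.stream().filter("INDEX"::equals).count();
        long thread = predictions.size() - index;
        if (index == thread) return "TIE";       // even-sized group, half misclassified
        return index > thread ? "INDEX" : "THREAD";
    }

    public static void main(String[] args) {
        // With only four URLs in a group, two misclassifications already break the vote.
        System.out.println(vote(List.of("INDEX", "INDEX", "THREAD", "THREAD")));  // TIE
        System.out.println(vote(List.of("INDEX", "THREAD", "THREAD", "THREAD"))); // THREAD
    }
}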


2.1.4. iRobot
iRobot is another forum crawler; it automatically rebuilds the sitemap of the target Web forum site and then selects an optimal traversal path that visits only informative pages and skips invalid and duplicate ones. This crawler does not have the capability to detect entry URLs, which affects its coverage and overall performance. FoCUS (Forum Crawler Under Supervision) is a forum crawler that integrates a learning module and an online crawling module: the learning module classifies pages, detects the URLs present in those pages, and forms regular expressions based on the detected URLs, while the online crawling module uses these regular expressions to detect URLs and perform the crawl.

iRobot aims to learn a forum crawler automatically with minimum human intervention by sampling pages, clustering them, selecting informative clusters via an informativeness measure, and finding a traversal path with a spanning tree algorithm. However, iRobot only takes the first path (entry -> board -> thread). iRobot learns URL location information to discover new URLs during crawling, but a URL location may become invalid when the page structure changes. In contrast to iRobot, FoCUS explicitly defines entry-index-thread paths and leverages page layouts to identify index pages and thread pages.

2.2 Disadvantages of Existing System

 After a large number of Web pages have been fetched, the crawler starts losing its focus, which introduces a lot of noise into the final collection.

 It may crawl many redundant and duplicate pages and often misses useful pages.

 Downloading a large number of useless pages wastes network bandwidth and negatively affects the quality of the repository.

 Most of these crawlers fetch the surrounding texts of URLs as the descriptive texts of the URLs
and compute the similarity between the URLs and ontology concepts based on these texts. But,
the surrounding texts cannot be used to correctly or sufficiently describe the URLs.


 It crawls the pages without understanding the correlation among them and so it cannot be used to
crawl Web Forums.

2.3 PROPOSED SYSTEM

In this work, we propose an Effective Forum Crawler which overcomes the above disadvantages. The crawler has two parts: a learning part and a crawling part. The learning part detects several kinds of URLs and forms regular expressions based on them, and the crawling part crawls the Web forum using these learned regular expressions.

The main contributions of our work are the following:


 Instead of using a linear kernel setting for the SVM, we are using a Gaussian Kernel
which gives better results in terms of accuracy and convergence time.
 The weak classifier combined with majority voting method is removed and a strong page
classifier is used which gives more accurate results.
 A browser emulator toolkit is used, which loads the web page and executes any JavaScript code in it, thus helping to detect JavaScript-based URLs.
 A new feature, HasReplyBtn, is added to the list of features used for index/thread page classification.

A Freshness First Strategy (FFS) is used for crawling


 FFS helps to retrieve the recently updated pages before other pages.
 A Web crawler has a limited amount of network bandwidth and disk space, so it is important to retrieve recently updated pages before other pages.
 Web pages change rapidly over time and the crawler’s copy may become obsolete soon.


An example set of URLs is given below.


http://www.gardenstew.com/about20152.html
http://www.gardenstew.com/about18382.html
http://www.gardenstew.com/about19741.html
http://www.gardenstew.com/about20142.html
http://www.gardenstew.com/user-34.html
http://www.gardenstew.com/post-180803.html

In the case of the above example, the generic pattern "*" is first refined to the more specific pattern http://www.gardenstew.com/\w+\W*\d+.html, which matches all of the URLs. This pattern is then refined into three more specific patterns:
http://www.gardenstew.com/about\d+.html
http://www.gardenstew.com/user-\d+.html
http://www.gardenstew.com/post-\d+.html
Each pattern matches a subset of the URLs. The patterns are refined recursively until no more specific patterns can be formed; these three patterns become the final output.
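
As a quick illustration (not part of the learning algorithm itself), the three final patterns can be checked against the example URLs with standard java.util.regex calls:

import java.util.List;
import java.util.regex.Pattern;

public class UrlPatterns {
    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://www.gardenstew.com/about20152.html",
            "http://www.gardenstew.com/user-34.html",
            "http://www.gardenstew.com/post-180803.html");

        // The three specific patterns refined from the generic "*" pattern.
        List<Pattern> patterns = List.of(
            Pattern.compile("http://www\\.gardenstew\\.com/about\\d+\\.html"),
            Pattern.compile("http://www\\.gardenstew\\.com/user-\\d+\\.html"),
            Pattern.compile("http://www\\.gardenstew\\.com/post-\\d+\\.html"));

        for (String url : urls) {
            for (Pattern p : patterns) {
                if (p.matcher(url).matches()) {
                    System.out.println(url + "  matches  " + p.pattern());
                }
            }
        }
    }
}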

2.4 Advantages of Proposed System

 Instead of using a linear kernel setting for the SVM, we are using a Gaussian Kernel
which gives better results in terms of accuracy and convergence time.
 The weak classifier combined with majority voting method is removed and a strong page
classifier is used which gives more accurate results.
 A new feature, HasReplyBtn, is added to the list of features used for index/thread page classification.
 A Freshness First Strategy (FFS) is used for crawling.


CHAPTER 3

SYSTEM SPECIFICATION

3.1 Hardware Requirements

 1.2 GHz CPU


 80GB hard disk
 2 GB RAM

3.2 Software Requirements

 jQuery, Ajax, and Dojo libraries


 Eclipse LUNA Integrated Development Environment
 Apache Tomcat web server
 JBoss AS 7 Application server
 Firebug Debugging tool
 Windows/ Linux Operating System


CHAPTER 4

SYSTEM ARCHITECTURE


Fig. 2: Architecture of the Web Crawler


The architecture of the proposed forum crawler is shown in Fig. 2. The crawler consists of two major modules: a learning module and a crawling module. The learning module forms regular expressions by analyzing the URLs found on different kinds of pages, and the crawling module performs the online crawl using these regular expressions.

4.1 Learning Module

The learning module in turn contains several sub-modules. Given any page of a Web forum, the crawler detects the forum's entry URL; this task is performed by the Entry URL Detection sub-module, and the detected entry URL is stored in the entry URL repository. The next sub-module is the Index/Thread URL Classification module. All the URLs from the index and thread pages are collected and classified as either index URLs or thread URLs, according to the type of page each URL points to. The page classification is done by an SVM with a Gaussian kernel. Several page features are given as input to the classifier in the baseline system; in addition to these, our system uses a HasReplyBtn feature, which indicates the presence of a Reply button on post pages (almost 95% of forums contain a Reply button on the post page). The classified index/thread URLs are stored in an index/thread URL repository.
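
As a rough illustration of the HasReplyBtn feature, the sketch below scans page markup for a reply button or link and turns the result into a 0/1 feature value; the regular expression and the sample markup are assumptions for demonstration only, not the actual feature extractor.

import java.util.regex.Pattern;

public class PageFeatures {
    // Rough check for a reply button/link in the page markup.
    private static final Pattern REPLY_BTN = Pattern.compile(
        "(?i)<(?:a|button|input)[^>]*(?:>\\s*reply\\s*<|value=\"reply\")");

    /** 1.0 if the page appears to contain a Reply button, otherwise 0.0. */
    static double hasReplyBtn(String html) {
        return REPLY_BTN.matcher(html).find() ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        String threadPage = "<div class=\"post\"><a class=\"btn\">Reply</a> ...</div>";
        String indexPage  = "<table><tr><td><a href=\"about20152.html\">Roses</a></td></tr></table>";
        System.out.println(hasReplyBtn(threadPage)); // 1.0, passed to the SVM with the other page features
        System.out.println(hasReplyBtn(indexPage));  // 0.0
    }
}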

4.2 Crawling Module

When the learning module finishes its task, the actual crawl starts. Unlike FoCUS, this crawler adopts a Freshness First Strategy: the freshest page is crawled before older ones, which matters especially when the available resources are limited.

The freshness of a page can be calculated as follows:

F(page, t_update) = t_update - T0        (1)

where t_update is the last update time of the page and T0 is a static reference timestamp. This score gives a higher value to a fresher page and a lower value to an older one. The idea is to detect the last update times of URLs before retrieving them; in forums, thread pages contain the last update time in most cases, and this time is used to determine the page's freshness. Using this strategy helps to utilize the bandwidth more efficiently.
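
A minimal sketch of how FFS scheduling could be realized is shown below: pending URLs are kept in a priority queue ordered by the freshness score of equation (1). The class, the value of T0, and the timestamps are illustrative assumptions, not part of the described system.

import java.util.Comparator;
import java.util.PriorityQueue;

public class FreshnessFirstQueue {
    record PendingUrl(String url, long lastUpdateEpochSeconds) {}

    public static void main(String[] args) {
        long t0 = 1_388_534_400L; // static reference timestamp T0 (2014-01-01 UTC), chosen arbitrarily

        // F(page, t_update) = t_update - T0: the freshest page has the highest score,
        // so the queue is ordered by descending freshness.
        PriorityQueue<PendingUrl> queue = new PriorityQueue<>(
            Comparator.comparingLong((PendingUrl p) -> p.lastUpdateEpochSeconds() - t0).reversed());

        queue.add(new PendingUrl("http://www.gardenstew.com/about20152.html", 1_420_000_000L));
        queue.add(new PendingUrl("http://www.gardenstew.com/about18382.html", 1_400_000_000L));
        queue.add(new PendingUrl("http://www.gardenstew.com/about19741.html", 1_410_000_000L));

        while (!queue.isEmpty()) {
            System.out.println("Crawl next: " + queue.poll().url()); // freshest page first
        }
    }
}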


CHAPTER 5

IMPLEMENTATION

a. READ URL:
We concentrate on a focused, ontology-style crawl that searches for web pages relevant to a given keyword; in effect it builds a hierarchy of links. Starting from a seed URL, the crawler searches that page for links, switches to each matching link, and looks for further links on the resulting pages, always checking that they match the keyword, until a preset limit on the number of links is reached. It is possible that the crawler does not find as many links as the limit allows; this simply means that the pages have no further links for that particular keyword. While fetching links, the crawler also ensures that only unique links are collected, i.e., the same link is never revisited. Finally, once the links have been collected, a text file is given as input and the three pattern-matching algorithms are run on it.
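
The link selection described above (keyword match, uniqueness, and a fixed limit) can be sketched as follows; the class name, the sample anchor texts, and the data structures are illustrative assumptions rather than the actual implementation.

import java.util.*;

public class KeywordLinkFilter {
    /** Keeps only unseen links whose anchor text contains the keyword, up to a fixed limit. */
    static List<String> selectLinks(Map<String, String> anchorTextByUrl,
                                    String keyword, Set<String> visited, int limit) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> link : anchorTextByUrl.entrySet()) {
            if (selected.size() >= limit) break;            // stop at the configured limit
            if (visited.contains(link.getKey())) continue;  // never revisit the same link
            if (link.getValue().toLowerCase().contains(keyword.toLowerCase())) {
                selected.add(link.getKey());
                visited.add(link.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> links = new LinkedHashMap<>();
        links.put("http://www.gardenstew.com/about20152.html", "Rose gardening tips");
        links.put("http://www.gardenstew.com/user-34.html", "Member profile");
        links.put("http://www.gardenstew.com/post-180803.html", "Gardening tools thread");

        Set<String> visited = new HashSet<>();
        System.out.println(selectLinks(links, "gardening", visited, 10));
    }
}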

b. PATTERN RECOGNITION:
Here, a pattern means text only. Pattern matching is used for syntax analysis; compared with regular expressions, patterns are more powerful but slower to match. A pattern is a character string: keywords can be written in upper or lower case, a pattern expression consists of atoms bound by unary and binary operators, and spaces and tabs can be used to separate keywords. Text mining is an important step of the knowledge discovery process: it extracts hidden information from unstructured or semi-structured data. This matters because much of the web's information is semi-structured due to the nested structure of HTML, much of it is linked, and much of it is redundant. Web text mining supports the whole knowledge mining process of extracting and integrating useful data, information, and knowledge from web page content. Pattern recognition is applied to the web information as follows: when retrieval starts, the crawler returns the links related to the keyword; it then reads the web pages extracted from those links and, while reading each page, extracts only its content. Here, content means only the text available on the web page; it must not include images, tags, or buttons. The extracted content is stored in a file and must not contain any HTML tags.
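
A very rough sketch of such tag-free content extraction is given below. It strips scripts, styles, buttons, and remaining tags with regular expressions and writes the plain text to a file; a production system would rather use a proper HTML parser, and the sample markup here is invented for illustration.

import java.nio.file.Files;
import java.nio.file.Path;

public class TextExtractor {
    /** Strips scripts, styles, buttons, and tags, keeping only the visible text of a page. */
    static String extractText(String html) {
        String noBlocks = html.replaceAll("(?is)<(script|style|button)[^>]*>.*?</\\1>", " ");
        String noTags = noBlocks.replaceAll("(?s)<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>Gardening tips</h1>"
                    + "<img src=\"x.png\"><button>Reply</button>"
                    + "<p>Water the roses weekly.</p></body></html>";
        String text = extractText(html);
        Files.writeString(Path.of("page-content.txt"), text); // store the extracted content
        System.out.println(text); // "Gardening tips Water the roses weekly."
    }
}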

c. IDENTIFICATION PROCESS:
This process identifies whether a given URL is the right kind of link or the wrong kind. It checks the URL and its protocol before the relevant web page is retrieved for the user's request, and it is used to omit bad URLs when the user requests web pages. Bad URLs are identified from the protocol pattern occurring in the relevant web pages on the server side.
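
One simple way to realize such URL identification, sketched here as an assumption rather than the actual checking logic, is to accept only well-formed absolute http/https URLs and treat everything else as a bad URL:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;

public class UrlFilter {
    /** Accepts only well-formed absolute http/https URLs; everything else is treated as a bad URL. */
    static boolean isGoodUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            return uri.isAbsolute()
                && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme))
                && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // malformed URL
        }
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
            "http://www.gardenstew.com/post-180803.html",
            "javascript:void(0)",
            "mailto:admin@example.com",
            "not a url at all");
        candidates.forEach(u -> System.out.println(u + " -> " + (isGoodUrl(u) ? "crawl" : "skip")));
    }
}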

d. DOWNLOADING PROCESS:
After all of the above processes are completed, downloading starts. The URL requested by the user is downloaded only after the three checking processes succeed, so that only the relevant links are retrieved efficiently for the user's request.

e. INDEX URL AND THREAD URL TRAINING SETS:


Recall that an index URL is a URL that appears on an entry page or an index page, whose destination page is another index page and whose anchor text is the board title of its destination page. A thread URL is a URL that appears on an index page, whose destination page is a thread page and whose anchor text is the thread title of its destination page. The only way to distinguish index URLs from thread URLs is the type of their destination pages, so we need a method to decide the page type of a destination page. Index pages and thread pages each have their own typical layouts. An index page usually has many narrow records, relatively long anchor texts, and short plain text, while a thread page has a few large records, each post consisting of a very long text block and relatively short anchor text. Both page types usually have a timestamp field in each record, but the timestamp order is reversed: timestamps are typically in descending order on an index page and in ascending order on a thread page.
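
The timestamp-order cue can be turned into a simple heuristic, sketched below with invented timestamps; a real classifier would combine it with the other layout features rather than use it alone.

import java.time.LocalDateTime;
import java.util.List;

public class TimestampOrderHeuristic {
    /** Guesses the page type from the order of record timestamps on the page. */
    static String guessPageType(List<LocalDateTime> recordTimestamps) {
        boolean descending = true, ascending = true;
        for (int i = 1; i < recordTimestamps.size(); i++) {
            if (recordTimestamps.get(i).isAfter(recordTimestamps.get(i - 1))) descending = false;
            if (recordTimestamps.get(i).isBefore(recordTimestamps.get(i - 1))) ascending = false;
        }
        // Index pages list boards/threads newest-first; thread pages list posts oldest-first.
        if (descending && !ascending) return "INDEX";
        if (ascending && !descending) return "THREAD";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        List<LocalDateTime> threadPage = List.of(
            LocalDateTime.of(2014, 1, 1, 9, 0),
            LocalDateTime.of(2014, 1, 2, 10, 0),
            LocalDateTime.of(2014, 1, 3, 11, 0));
        System.out.println(guessPageType(threadPage)); // THREAD
    }
}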


CHAPTER 6

PERFORMANCE EVALUATION

The performance of the proposed Effective Forum Crawler will be evaluated against the performance of the Board Forum Crawler (BFC). The parameters used for comparing the performances are the following:

 Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search, i.e. the fraction of retrieved documents that are relevant to the search:

precision = |relevant pages ∩ retrieved pages| / |retrieved pages|

 Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, i.e. the fraction of the relevant documents that are successfully retrieved:

recall = |relevant pages ∩ retrieved pages| / |relevant pages|

 Crawling Time: the time taken by the Web crawler to crawl a pre-defined number of pages.

BFC yields a precision of 90% and a recall value greater than 70% (a worked example with comparable numbers follows below).
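
For illustration, the small sketch below computes precision and recall for a hypothetical crawl whose counts are chosen to roughly match the BFC figures quoted above; the numbers are invented, not measured results.

public class RetrievalMetrics {
    public static void main(String[] args) {
        // Hypothetical crawl: 1000 pages retrieved, 900 of them relevant,
        // out of 1200 relevant pages that exist on the forum.
        double retrieved = 1000, relevantRetrieved = 900, relevantExisting = 1200;

        double precision = relevantRetrieved / retrieved;        // 0.90
        double recall    = relevantRetrieved / relevantExisting; // 0.75

        System.out.printf("precision = %.2f, recall = %.2f%n", precision, recall);
    }
}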


CHAPTER 7

CONCLUSION AND FUTURE SCOPE

 The web crawler collects detailed information about a website and its links, including the website URL, the web page title, the meta tag information, the web page content, and the links on each page.
 In this report, the basics of web crawling and a survey of different web forum crawling techniques are discussed. The crawler automatically crawls the forum data and cleans up the unwanted data; after cleaning, it allocates that space to new queries posted by the user.
 Compared with other web forum crawling techniques, the proposed crawler outperforms them in terms of effectiveness and coverage.
 The results show that the learned patterns are effective and the resulting crawler is efficient.

FUTURE SCOPE
 In future, we would like to extend the crawler to other kinds of sites, such as Q&A sites, blog sites, and other social media sites.

 We would also like to experiment with the crawler on many more forum sites and to improve its efficiency.


REFERENCES

[1] Leng, Alex Goh Kwang, Ravi Kumar P, Ashutosh Kumar Singh, and Rajendra Kumar Dash. "PyBot: An Algorithm for Web Crawling." Nanoscience, Technology and Societal Implications (NSTSI), 2011 International Conference on. IEEE, 2011.

[2] Ding, Li, et al. "Swoogle: a search and metadata engine for the semantic web." Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM, 2004.

[3] Luong, Hiep Phuc, Susan Gauch, and Qiang Wang. "Ontology-based focused crawling."
Information, Process, and Knowledge Management, 2009. eKNOW'09. International Conference
on. IEEE, 2009.

[4] Dong, Hai, and Farookh Khadeer Hussain. "Focused crawling for automatic service
discovery, annotation, and classification in industrial digital ecosystems." Industrial Electronics,
IEEE Transactions on 58.6 (2011): 2106-2116.

[5] Guo, Yan, Kui Li, Kai Zhang, and Gang Zhang. "Board forum crawling: a Web crawling method for Web forum." Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, 2006.

[6] Mohsen Jamali, Hassan Sayyadi, Babak Bagheri, Hassan Abolhassani. "A method for
focused crawling using combination of link structure and content similarity." Web Intelligence,
2006. WI 2006. IEEE/WIC/ACM International Conference on. IEEE, 2006.

[7] Sachan, Amit, Wee-Yong Lim, and Vrizlynn LL Thing. "A Generalized Links and Text
Properties Based Forum Crawler." Proceedings of the The 2012 IEEE/WIC/ACM International
Joint Conferences on Web Intelligence and Intelligent Agent Technology-Volume 01. IEEE
Computer Society, 2012.


[8] Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang, "iRobot: An intelligent
crawler for Web forums." Proceedings of the 17th international conference on World Wide Web.
ACM, 2008.

[9] Jingtian Jiang, Xinying Song, Nenghai Yu, and Chin-Yew Lin. "FoCUS: Learning to Crawl Web Forums." IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, June 2013.

[10] Jiang, Jingtian, and Nenghai Yu. "Schedule web forum crawling with a freshness-first strategy." Computer Science and Network Technology (ICCSNT), 2011 International Conference on. Vol. 3. IEEE, 2011.

