
Web-Crawling

What is a Web Crawler?


 A program that browses the WWW in a methodical, automated manner or in an orderly fashion.
 Also known as ants, automatic indexers, bots, Web indexers, spiders, Web robots, or Web scutters.
 The process is called Web crawling or spidering.

Common Usage of Web Crawler




 Search engines
 Presenting information in a manageable format
 Testing / validation / maintenance of websites

Business usage of WebCrawler




 Collecting social media / networking data to identify potential customers
 Product listings & consumer reviews
 Custom tariff data collection
 E-mail address harvesting

How does a Web Crawler work?
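A crawler, at its core, keeps a frontier of URLs: fetch a page, extract its links, enqueue the unvisited ones, repeat. Here is a minimal sketch in plain JDK Java; the class name and the naive regex-based link extraction are illustrative only, and a production crawler would use a real HTML parser and respect robots.txt:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

public class MiniCrawler {
    // Naive link pattern; a real crawler would use an HTML parser instead
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void crawl(String seed, int maxPages) {
        Queue<String> frontier = new LinkedList<>();
        Set<String> visited = new HashSet<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // skip already-visited URLs
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) html.append(line);
            } catch (Exception e) { continue; }        // unreachable page: move on
            Matcher m = LINK.matcher(html);
            while (m.find()) frontier.add(m.group(1)); // enqueue discovered links
        }
    }
}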

WebCrawler Examples


 Yahoo! Slurp -- Yahoo Search crawler
 Bingbot -- Microsoft's Bing web crawler
 Googlebot -- Google web crawler

How does Google work?




 Googlebot, a web crawler that finds and fetches web pages.
 The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
 The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
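To make the indexer / query-processor split concrete, here is a toy inverted index in Java. This is purely illustrative; Google's real index and ranking are vastly more sophisticated:

import java.util.*;

public class ToyIndex {
    // word -> set of document ids containing that word
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Indexer: record every word of a document
    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(word, k -> new HashSet<>()).add(docId);
        }
    }

    // Query processor: documents containing all query words
    public Set<Integer> query(String q) {
        Set<Integer> result = null;
        for (String word : q.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            Set<Integer> docs = index.getOrDefault(word, Collections.emptySet());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}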

Web-Crawling Techniques


 HttpUnit
 HtmlUnit
 HttpURLConnection

HttpUnit


 The center of HttpUnit is the WebConversation class:

WebConversation wc = new WebConversation();
WebRequest req = new GetMethodWebRequest("http://www.metacube.com/productengg.asp");
WebResponse resp = wc.getResponse(req);

Navigating a web page link using the response:


WebLink link = resp.getLinkWith("linkId");   // first link whose text contains "linkId"
link.click();
WebResponse secondResponse = wc.getCurrentPage();

Retrieving the table structure of a web page:


WebTable table = resp.getTables()[0];
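The WebTable object also exposes its cell contents directly; assuming the table variable from the line above, the rows can be dumped like this (getRowCount(), getColumnCount(), and getCellAsText() are HttpUnit API methods):

for (int row = 0; row < table.getRowCount(); row++) {
    for (int col = 0; col < table.getColumnCount(); col++) {
        System.out.print(table.getCellAsText(row, col) + "\t");  // text of one cell
    }
    System.out.println();
}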

Advantages of HttpUnit
 A simple way to crawl a website.
 A powerful tool for creating test suites to ensure the end-to-end functionality of your Web applications.
 Retrieves the forms, table structure, or frames associated with a web page, and can also navigate to a particular web link (see the form sketch below).
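As an example of the form handling mentioned above, a short sketch; the form index and the "q" field name are hypothetical:

WebForm form = resp.getForms()[0];        // first form on the page
form.setParameter("q", "web crawling");   // fill a text field (hypothetical field name)
WebResponse result = form.submit();       // submit and receive the resulting page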

Limitations of HttpUnit:


 Not able to parse all the JavaScript associated with a web page.
 HttpUnitOptions.setScriptingEnabled(false) can disable JavaScript, but this can result in wrong page validations (see the snippet below).
 Everything centers around WebConversation, WebRequest, and WebResponse.
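The scripting switch is a single static call, made before any requests are issued:

// Disable HttpUnit's (partial) JavaScript support for all subsequent requests;
// pages that rely on scripts for validation may then validate incorrectly
HttpUnitOptions.setScriptingEnabled(false);
WebConversation wc = new WebConversation();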

HtmlUnit
 Open source Java library for creating HTTP calls which imitate browser functionality.
 Higher-level than HttpUnit, modeling web interaction in terms of the documents and interface elements which the user interacts with.
 Can be used for web application testing.

Web Client
 The starting point; works as a browser simulator.
 WebClient.getPage() is just like typing an address in the browser. It returns an HtmlPage object.
 HtmlPage gives you access to much of a web page's content.

@Test
public void testGoogle() throws Exception {
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    assertEquals("Google", currentPage.getTitleText());
}

HTML Elements
 HtmlPage gives you the ability to access any of the page's HTML elements and all of their attributes and sub-elements. This includes forms, tables, images, input fields, divs, or any other HTML element you may imagine.
 Can also access any of the DOM elements by using XPath.

// Using XPath to get the first result in the Google query
HtmlElement element = (HtmlElement) currentPage.getByXPath("//h3").get(0);
DomNode result = element.getChildNodes().get(0);

HtmlUnit JavaScript Support


 Uses the Mozilla Rhino JavaScript engine.
 Gives you the ability to run pages with JavaScript, or even run JavaScript code on demand:
ScriptResult result = currentPage.executeJavaScript(javaScriptCode);
 Turn off JavaScript altogether using:
currentPage.getWebClient().setJavaScriptEnabled(false);

Advantages of HtmlUnit:


 JavaScript code is executed just like in normal browsers, when the page loads or when a handler is triggered.
 HtmlUnit provides the ability to inject code into an existing page via HtmlPage.executeJavaScript(String yourJsCode), as shown below.
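For instance, injecting a script into a loaded page might look like this (the page URL and the expression are placeholders; getJavaScriptResult() is part of HtmlUnit's ScriptResult API):

WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://www.example.com/");
// Run an arbitrary script in the context of the loaded page
ScriptResult result = page.executeJavaScript("document.title");
Object value = result.getJavaScriptResult();   // the script's return value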

Limitations of HtmlUnit
 Pages which use third-party libraries might not work when tested via HtmlUnit, like the following webpage for crawling Iran: http://portal.irica.gov.ir/Portal/Home/Default.aspx?CategoryID=82cebca0-58cc-4be6-bef5-a94e2242140a
 Turning off JavaScript can result in wrong page validations.
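A common mitigation when third-party scripts fail is to keep JavaScript enabled but stop script errors from aborting the request; in older HtmlUnit releases this is a WebClient setter (newer versions move it onto webClient.getOptions()):

WebClient webClient = new WebClient();
// Log script errors instead of throwing, so broken 3rd-party JS doesn't kill the crawl
webClient.setThrowExceptionOnScriptError(false);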

HttpURLConnection:


 Crawl site data using the standard JDK API only.
 Ensure that the HTTP request we are forming is bit-for-bit identical to the browser's.
 Analyze the HTTP request using any 3rd-party tool, then identify the HTTP request headers and post data.

HttpURLConnection Example
Access the URL object with the specified URL string:

URL urlObj = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) urlObj.openConnection();
conn.setDoOutput(true);   // we will write a POST body to this connection
// Add header information to the particular connection
conn.setRequestMethod("POST");
conn.addRequestProperty("Keep-Alive", "115");
conn.addRequestProperty("Accept-Language", "en-us,en;q=0.5");
conn.addRequestProperty("Connection", "keep-alive");
conn.addRequestProperty("User-Agent", "Mozilla/5.0");
conn.addRequestProperty("Host", "portal.irica.gov.ir");
conn.setRequestProperty("Cookie", iranHomePageInfo.getCookie());
conn.addRequestProperty("Accept", "text/html");


HttpURLConnection Example Continued..


// Build the post data for navigation to a particular page
String postData = URLEncoder.encode("__EVENTTARGET", "UTF-8") + "="
    + URLEncoder.encode("WebPart_c91c1c00_0fcc_4fea_8582_3576b443d5f2$LF0$PagerObject", "UTF-8");
postData += "&" + URLEncoder.encode("__EVENTARGUMENT", "UTF-8") + "="
    + URLEncoder.encode(pageNumberStr, "UTF-8");
postData += "&" + URLEncoder.encode("__VIEWSTATE", "UTF-8") + "="
    + URLEncoder.encode(iranHomePageInfo.getViewState(), "UTF-8");

// Then write the post data to the connection
OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream());
out.write(postData);
out.flush();
out.close();
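After the post data is written, the response still has to be read back from the same connection, e.g. with standard JDK streams:

// Read the response HTML back from the connection
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
StringBuilder html = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    html.append(line).append('\n');
}
in.close();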

HttpURLConnection Pros:


 Uses only the basic/standard JDK API for crawling.
 Can handle complex JavaScript and 3rd-party JavaScript libraries as well.
 More focus on HTTP request formation.

HttpURLConnection Cons


 More cumbersome, since we have to write complex code to build a bit-for-bit identical request.
 Have to match cookies and all header information exactly as displayed by HttpAnalyzer or another 3rd-party tool.

Conclusions
Use HttpUnit only for:
 Simple websites
 Web pages that don't have much JavaScript
Use HtmlUnit when:
 HttpUnit is not able to crawl the web page
 We require GUI-less browser simulation for Java programs
 We want to model HTML documents with an API that lets us invoke pages, fill out forms, click links, etc., just like in our "normal" browser

Conclusions Continued..
Use HttpURLConnection when:
 HttpUnit and HtmlUnit both fail to crawl the page
 Web pages have a fair amount of JavaScript and 3rd-party JavaScript libraries

Questions ??
