
Web-Crawling

What is a Web Crawler?


 A program that browses the WWW in a methodical, automated manner or in an orderly fashion.
 Also known as ants, automatic indexers, bots, Web indexers, spiders, Web robots, or Web scutters.
 The process is called Web crawling or spidering.

Common Usage of Web Crawler




 Search engines
 Presenting information in a manageable format
 Testing / validation / maintenance of websites

Business usage of WebCrawler




 Collecting social media / networking data to identify potential customers
 Product listings & consumer reviews
 Custom tariff data collection
 E-mail address harvesting

How does a Web Crawler work?
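A crawler, at its core, keeps a frontier of URLs: fetch a page, extract its links, enqueue the unvisited ones, repeat. Here is a minimal sketch in plain JDK Java; the class name and the naive regex-based link extraction are illustrative only, and a production crawler would use a real HTML parser and respect robots.txt:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

public class MiniCrawler {
    // Naive link pattern; a real crawler would use an HTML parser instead
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void crawl(String seed, int maxPages) {
        Queue<String> frontier = new LinkedList<>();
        Set<String> visited = new HashSet<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // skip already-visited URLs
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) html.append(line);
            } catch (Exception e) { continue; }        // unreachable page: move on
            Matcher m = LINK.matcher(html);
            while (m.find()) frontier.add(m.group(1)); // enqueue discovered links
        }
    }
}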

WebCrawler Examples


 Yahoo! Slurp -- Yahoo Search crawler
 Bingbot -- Microsoft's Bing web crawler
 Googlebot -- Google web crawler

How does Google work?




 Googlebot, a web crawler that finds and fetches web pages.
 The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
 The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
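To make the indexer / query-processor split concrete, here is a toy inverted index in Java. This is purely illustrative; Google's real index and ranking are vastly more sophisticated:

import java.util.*;

public class ToyIndex {
    // word -> set of document ids containing that word
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Indexer: record every word of a document
    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(word, k -> new HashSet<>()).add(docId);
        }
    }

    // Query processor: documents containing all query words
    public Set<Integer> query(String q) {
        Set<Integer> result = null;
        for (String word : q.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            Set<Integer> docs = index.getOrDefault(word, Collections.emptySet());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}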

Web-Crawling Techniques


 HttpUnit
 HtmlUnit
 HttpURLConnection

HttpUnit


 The center of HttpUnit is the WebConversation class:

WebConversation wc = new WebConversation();
WebRequest req = new GetMethodWebRequest("http://www.metacube.com/productengg.asp");
WebResponse resp = wc.getResponse(req);

Navigating a web page link using the response:


WebLink link = resp.getLinkWith("linkId");   // first link whose text contains "linkId"
link.click();
WebResponse secondResponse = wc.getCurrentPage();

Retrieving the table structure of a web page:


WebTable table = resp.getTables()[0];
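The WebTable object also exposes its cell contents directly; assuming the table variable from the line above, the rows can be dumped like this (getRowCount(), getColumnCount(), and getCellAsText() are HttpUnit API methods):

for (int row = 0; row < table.getRowCount(); row++) {
    for (int col = 0; col < table.getColumnCount(); col++) {
        System.out.print(table.getCellAsText(row, col) + "\t");  // text of one cell
    }
    System.out.println();
}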

Advantages of HttpUnit
 A simple way to crawl a website.
 A powerful tool for creating test suites to ensure the end-to-end functionality of your Web applications.
 Retrieves the forms, table structure, or frames associated with a web page, and can also navigate to a particular web link (see the form sketch below).
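As an example of the form handling mentioned above, a short sketch; the form index and the "q" field name are hypothetical:

WebForm form = resp.getForms()[0];        // first form on the page
form.setParameter("q", "web crawling");   // fill a text field (hypothetical field name)
WebResponse result = form.submit();       // submit and receive the resulting page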

Limitations of HttpUnit:


 Not able to parse all the JavaScript associated with a web page.
 HttpUnitOptions.setScriptingEnabled(false) can disable JavaScript, but this can result in wrong page validations (see the snippet below).
 Everything centers around WebConversation, WebRequest, and WebResponse.
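The scripting switch is a single static call, made before any requests are issued:

// Disable HttpUnit's (partial) JavaScript support for all subsequent requests;
// pages that rely on scripts for validation may then validate incorrectly
HttpUnitOptions.setScriptingEnabled(false);
WebConversation wc = new WebConversation();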

HtmlUnit
 Open source Java library for creating HTTP calls which imitate browser functionality.
 Higher-level than HttpUnit, modeling web interaction in terms of the documents and interface elements which the user interacts with.
 Can be used for web application testing.

Web Client
 The starting point; works as a browser simulator.
 WebClient.getPage() is just like typing an address in the browser. It returns an HtmlPage object.
 HtmlPage gives you access to much of a web page's content.

@Test
public void testGoogle() throws Exception {
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    assertEquals("Google", currentPage.getTitleText());
}

HTML Elements
 HtmlPage gives you the ability to access any of the page's HTML elements and all of their attributes and sub-elements. This includes forms, tables, images, input fields, divs, or any other HTML element you may imagine.
 Can also access any of the DOM elements by using XPath.

// Using XPath to get the first result in the Google query
HtmlElement element = (HtmlElement) currentPage.getByXPath("//h3").get(0);
DomNode result = element.getChildNodes().get(0);

HtmlUnit JavaScript Support


 Uses the Mozilla Rhino JavaScript engine.
 Gives you the ability to run pages with JavaScript, or even run JavaScript code on demand:
ScriptResult result = currentPage.executeJavaScript(javaScriptCode);
 Turn off JavaScript altogether using:
currentPage.getWebClient().setJavaScriptEnabled(false);

Advantages of HtmlUnit:


 JavaScript code is executed just like in normal browsers, when the page loads or when a handler is triggered.
 HtmlUnit provides the ability to inject code into an existing page via HtmlPage.executeJavaScript(String yourJsCode), as shown below.
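For instance, injecting a script into a loaded page might look like this (the page URL and the expression are placeholders; getJavaScriptResult() is part of HtmlUnit's ScriptResult API):

WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://www.example.com/");
// Run an arbitrary script in the context of the loaded page
ScriptResult result = page.executeJavaScript("document.title");
Object value = result.getJavaScriptResult();   // the script's return value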

Limitations of HtmlUnit
 Pages which use third-party libraries might not work when tested via HtmlUnit, like the following webpage for crawling Iran: http://portal.irica.gov.ir/Portal/Home/Default.aspx?CategoryID=82cebca0-58cc-4be6-bef5-a94e2242140a
 Turning off JavaScript can result in wrong page validations.
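A common mitigation when third-party scripts fail is to keep JavaScript enabled but stop script errors from aborting the request; in older HtmlUnit releases this is a WebClient setter (newer versions move it onto webClient.getOptions()):

WebClient webClient = new WebClient();
// Log script errors instead of throwing, so broken 3rd-party JS doesn't kill the crawl
webClient.setThrowExceptionOnScriptError(false);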

HttpURLConnection:


 Crawl site data using the standard JDK API only.
 Ensure that the HTTP request we are forming is bit-for-bit identical to the browser's.
 Analyze the HTTP request using any 3rd-party tool, then identify the HTTP request headers and post data.

HttpURLConnection Example
Access the URL object with the specified URL string:

URL urlObj = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) urlObj.openConnection();
conn.setDoOutput(true);   // we will write a POST body to this connection
// Add header information to the particular connection
conn.setRequestMethod("POST");
conn.addRequestProperty("Keep-Alive", "115");
conn.addRequestProperty("Accept-Language", "en-us,en;q=0.5");
conn.addRequestProperty("Connection", "keep-alive");
conn.addRequestProperty("User-Agent", "Mozilla/5.0");
conn.addRequestProperty("Host", "portal.irica.gov.ir");
conn.setRequestProperty("Cookie", iranHomePageInfo.getCookie());
conn.addRequestProperty("Accept", "text/html");


HttpURLConnection Example Continued..


// Build the post data for navigation to a particular page
String postData = URLEncoder.encode("__EVENTTARGET", "UTF-8") + "="
    + URLEncoder.encode("WebPart_c91c1c00_0fcc_4fea_8582_3576b443d5f2$LF0$PagerObject", "UTF-8");
postData += "&" + URLEncoder.encode("__EVENTARGUMENT", "UTF-8") + "="
    + URLEncoder.encode(pageNumberStr, "UTF-8");
postData += "&" + URLEncoder.encode("__VIEWSTATE", "UTF-8") + "="
    + URLEncoder.encode(iranHomePageInfo.getViewState(), "UTF-8");

// Then write the post data to the connection
OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream());
out.write(postData);
out.flush();
out.close();
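After the post data is written, the response still has to be read back from the same connection, e.g. with standard JDK streams:

// Read the response HTML back from the connection
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
StringBuilder html = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    html.append(line).append('\n');
}
in.close();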

HttpURLConnection Pros:


 Uses only the basic/standard JDK API for crawling.
 Can handle complex JavaScript and 3rd-party JavaScript libraries as well.
 More focus on HTTP request formation.

HttpURLConnection Cons


 More cumbersome, since we have to write complex code to build a bit-for-bit identical request.
 Have to match cookies and all header information exactly as displayed by HttpAnalyzer or another 3rd-party tool.

Conclusions
Use HttpUnit only for:
 Simple websites
 Web pages that don't have much JavaScript
Use HtmlUnit when:
 HttpUnit is not able to crawl the web page
 We require GUI-less browser simulation for Java programs
 We want to model HTML documents with an API that lets us invoke pages, fill out forms, click links, etc., just like in our "normal" browser

Conclusions Continued..
Use HttpURLConnection when:
 HttpUnit and HtmlUnit both fail to crawl the page
 Web pages have a fair amount of JavaScript and 3rd-party JavaScript libraries

Questions ??
