What is a Web Crawler?
A program that browses the World Wide Web in a methodical, automated manner. Crawlers are also known as ants, automatic indexers, bots, Web indexers, spiders, Web robots, or Web scutters; the process itself is called Web crawling or spidering.
Common uses:
- Search engines: collecting and presenting information in a manageable format
- Testing, validation, and maintenance of websites
- Collecting social media / networking data to identify potential customers
- Product listings and consumer reviews
- Customs tariff data collection
- E-mail address harvesting
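The crawl loop behind all of these uses is the same: fetch a page, extract its links, and enqueue the unseen ones. A minimal JDK-only sketch of the link-extraction step (the regex and class name are illustrative, not from the original; real crawlers should use an HTML parser):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Naive href matcher; good enough to illustrate the idea.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the distinct href values found in an HTML string, in document order. */
    public static List<String> extractLinks(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example\">A</a> <a href=\"/b\">B</a>";
        System.out.println(extractLinks(html)); // [http://a.example, /b]
    }
}
```

A crawler would resolve each extracted link against the page's base URL, skip already-visited ones, and add the rest to its frontier queue.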
Web Crawler Examples
- Yahoo! Slurp -- the Yahoo! Search crawler
- Bingbot -- Microsoft's Bing web crawler
- Googlebot -- Google's web crawler
A search engine such as Google has three main components: Googlebot, the web crawler that finds and fetches web pages; the indexer, which sorts every word on every page and stores the resulting index of words in a huge database; and the query processor, which compares your search query to the index and recommends the documents it considers most relevant.
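The indexer's core data structure is an inverted index mapping each word to the documents that contain it; the query processor then intersects the postings for the query words. A toy sketch (the class name, document IDs, and tokenization are simplified assumptions):

```java
import java.util.*;

public class TinyIndex {
    // word -> ids of documents containing it (the inverted index)
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Indexer: record every word of a document under its id. */
    public void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Query processor: documents containing all query words. */
    public Set<Integer> query(String q) {
        Set<Integer> result = null;
        for (String word : q.toLowerCase().split("\\W+")) {
            Set<Integer> docs = index.getOrDefault(word, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        idx.add(1, "web crawler fetches pages");
        idx.add(2, "query processor ranks pages");
        System.out.println(idx.query("pages")); // [1, 2]
    }
}
```

A real query processor would also rank the matching documents; this sketch only shows the lookup side.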
Web-Crawling Techniques
HttpUnit
WebConversation wc = new WebConversation();
WebRequest req = new GetMethodWebRequest("http://www.metacube.com/productengg.asp");
WebResponse resp = wc.getResponse(req);
Advantages of HttpUnit
- Simple way to crawl a website
- Powerful tool for creating test suites to ensure the end-to-end functionality of your web applications
- Can retrieve the forms, tables, or frames associated with a web page, and can navigate to a particular web link
Limitations of HttpUnit:
- Not able to parse all of the JavaScript associated with a web page
- HttpUnitOptions.setScriptingEnabled(false) can disable JavaScript, but this may result in wrong page validations
- Everything centers around WebConversation, WebRequest, and WebResponse objects
HtmlUnit
- Open-source Java library for making HTTP calls that imitate browser functionality
- Higher-level than HttpUnit, modeling web interaction in terms of the documents and interface elements the user interacts with
- Can be used for web application testing
Web Client
- The starting point; works as a browser simulator
- WebClient.getPage() is just like typing an address into the browser; it returns an HtmlPage object
- HtmlPage gives you access to much of a web page's content

@Test
public void testGoogle() throws Exception {
    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("http://www.google.com/");
    assertEquals("Google", currentPage.getTitleText());
}
HTML Elements
- HtmlPage gives you the ability to access any of the page's HTML elements and all of their attributes and sub-elements: forms, tables, images, input fields, divs, or any other HTML element you can imagine
- You can also access any of the DOM elements by using XPath

// Using XPath to get the first result in a Google query
HtmlElement element = (HtmlElement) currentPage.getByXPath("//h3").get(0);
DomNode result = element.getChildNodes().get(0);
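XPath itself ships with the JDK, so the selection step can be sketched without HtmlUnit. A minimal example evaluating an expression against a plain XML string (the markup, class name, and expression are illustrative assumptions, not HtmlUnit API):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathDemo {
    /** Evaluates an XPath expression against an XML string, returning its string result. */
    public static String evaluate(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<results><h3>First hit</h3><h3>Second hit</h3></results>";
        System.out.println(evaluate(xml, "//h3[1]")); // First hit
    }
}
```

HtmlUnit's getByXPath works the same way conceptually, but evaluates against the browser-built DOM of the live page rather than a static XML string.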
Advantages of HtmlUnit:
JavaScript code is executed just as in normal browsers, when the page loads or when a handler is triggered. HtmlUnit also provides the ability to inject code into an existing page via HtmlPage.executeJavaScript(String yourJsCode).
Limitations of HtmlUnit
- Pages that use third-party libraries might not work when tested via HtmlUnit, for example the following page: http://portal.irica.gov.ir/Portal/Home/Default.aspx?CategoryID=82cebca0-58cc-4be6-bef5-a94e2242140a
- Turning off JavaScript can result in wrong page validations
HttpURLConnection:
- Crawl site data using the standard JDK API only
- Ensure the HTTP request we form is bit-for-bit identical to the browser's
- Analyze the HTTP request with a third-party tool to identify the request headers and POST data
HttpURLConnection Example
// Create the URL object from the URL string
URL urlObj = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) urlObj.openConnection();
// Add header information to this connection
conn.setRequestMethod("POST");
conn.addRequestProperty("Keep-Alive", "115");
conn.addRequestProperty("Accept-Language", "en-us,en;q=0.5");
conn.addRequestProperty("Connection", "keep-alive");
conn.addRequestProperty("User-Agent", "Mozilla/5.0");
conn.addRequestProperty("Host", "portal.irica.gov.ir");
conn.setRequestProperty("Cookie", iranHomePageInfo.getCookie());
conn.addRequestProperty("Accept", "text/html");
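Setting the request method and headers does not open a socket; nothing is sent until connect() (or the first read) is called, so the prepared request can be inspected beforehand. A self-contained sketch of the preparation step (the helper name, URL, and header values are illustrative assumptions; note that the JDK silently ignores restricted headers such as Host and Connection unless explicitly allowed):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class RequestBuilder {
    /** Prepares a POST request with browser-like headers; nothing is sent until connect(). */
    public static HttpURLConnection prepare(String urlString) throws Exception {
        URL urlObj = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) urlObj.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we intend to write a POST body
        conn.addRequestProperty("User-Agent", "Mozilla/5.0");
        conn.addRequestProperty("Accept-Language", "en-us,en;q=0.5");
        conn.addRequestProperty("Accept", "text/html");
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = prepare("http://example.com/form");
        System.out.println(conn.getRequestMethod());               // POST
        System.out.println(conn.getRequestProperty("User-Agent")); // Mozilla/5.0
        // To actually send: write the body to conn.getOutputStream(),
        // then read conn.getResponseCode() / conn.getInputStream().
    }
}
```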
HttpUrlConnection Pros:
- Basic, standard JDK API used for crawling
- Can handle complex JavaScript and third-party JavaScript libraries
- Focus is on forming the HTTP request correctly
HttpUrlConnection Cons:
- More cumbersome, since we have to write complex code to build a bit-for-bit identical request
- Have to match the cookies and all header information exactly as shown by HttpAnalyzer or another third-party tool
Conclusions
Use HttpUnit only when:
- The website is simple and its pages don't have much JavaScript
Use HtmlUnit when:
- HttpUnit is not able to crawl the web page
- We require GUI-less browser simulation for Java programs
- It models HTML documents and provides an API to invoke pages, fill out forms, click links, etc., just as we do in a "normal" browser
Conclusions Continued..
Use HttpUrlConnection when:
- Neither HttpUnit nor HtmlUnit is working
- Web pages have a fair amount of JavaScript and third-party JavaScript libraries
Questions?