• Only search web pages about a specific topic (e.g., cricket), thus reducing the amount of network traffic and downloads.
• Objective: selectively seek out pages that are relevant to a predefined set of topics and discard all unrelated pages.
• It assumes that some labeled examples of relevant and non-relevant pages are available.

Algorithms of Focused Crawling
1. FISH SEARCH
2. SHARK SEARCH
3. INFOSPIDERS
4. N-BEST FIRST
5. INTELLIGENT CRAWLING

Fish Search Algorithm
• Based on the fish school metaphor.
• Fetches documents according to relevance.
• Relevance score is binary: 1 if the document is relevant, 0 if it is irrelevant.

Shark Search Algorithm
• Improved version of Fish Search.
• Scores relevance between 0 and 1 using the vector space model.
• Child relevance depends on:
  • Inherited score
  • Metadata

InfoSpiders Algorithm
• Uses a neural network trained with backpropagation.
• It is a multi-agent system for information mining.
• Crawls only its current surroundings.
• Does not return stale information.

N-Best First
• Generalization of Best First.
• At each step, N documents are picked for crawling instead of one.
• Uses a ranking algorithm to choose the best documents to crawl.

Intelligent Crawling
• Prioritizes documents on the basis of their characteristics.
• Characteristics include page content, URL data, or sibling pages.
• It has the potential for self-learning.

Conclusion
• Many algorithms exist. Which one to use? It depends on the strengths and weaknesses of each algorithm.
• For example, Fish Search is slow and resource-intensive, while Shark Search is more effective than Fish Search.
• InfoSpiders is more scalable.
• N-Best First performs better than InfoSpiders and Shark Search.
• Intelligent Crawling is highly effective and learns to crawl without user training.
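
Sketch: scoring and frontier selection
To make the ideas above concrete, here is a minimal Python sketch that combines two of them: a relevance score between 0 and 1 from a vector space model (as in Shark Search) and a frontier from which the N best candidates are taken at each step (as in N-Best First). It is a simplified illustration, not the published algorithms. The toy in-memory `WEB` graph, the function names `cosine` and `n_best_first`, and the rule that a child simply inherits its parent's score as its priority are all assumptions made for this example.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy "web": url -> (page text, outgoing links).
WEB = {
    "a": ("cricket news and scores", ["b", "c"]),
    "b": ("cricket bat reviews", ["d"]),
    "c": ("stock market report", ["e"]),
    "d": ("cricket world cup schedule", []),
    "e": ("bond yields today", []),
}

def cosine(text: str, topic: str) -> float:
    """Vector space relevance: cosine similarity of term-count vectors (0..1)."""
    va, vb = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def n_best_first(web, seeds, topic, n=2, max_pages=10):
    """Crawl the toy web, taking the n best-scored frontier URLs per step."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        batch = [heapq.heappop(frontier) for _ in range(min(n, len(frontier)))]
        for _, url in batch:
            if len(visited) >= max_pages:
                break
            text, links = web[url]
            score = cosine(text, topic)
            visited.append((url, score))
            for child in links:
                if child in web and child not in seen:
                    seen.add(child)
                    # Assumption: the child inherits the parent's score as priority.
                    heapq.heappush(frontier, (-score, child))
    return visited

if __name__ == "__main__":
    for url, score in n_best_first(WEB, ["a"], "cricket", n=1, max_pages=3):
        print(url, round(score, 2))
```

With `n=1` this behaves like plain Best First. On the toy graph the crawler follows the links found on cricket pages before those found on the finance page, which is the traffic-saving behavior a focused crawler aims for. A real crawler would replace `WEB` with HTTP fetching and would typically mix the inherited score with anchor-text or other metadata evidence.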