
Data Structure and Files

SCE
Name Roll No. Gr No.
Siddhant Jain 221027 21910811
Anish Kataria 221034 21911105
Anjali More 221082 22020114
Shraddha Mulay 221083 22020260
Web Crawler
A Web crawler, sometimes called a spider or spiderbot, is an
Internet bot that systematically browses the World Wide Web,
typically operated by search engines for the purpose of Web
indexing.

Basic Crawler Operation


1. Begin with known “seed” pages
2. Fetch and parse them
3. Extract URLs they point to
4. Place the extracted URLs on an ArrayList
5. Fetch each URL on the ArrayList and repeat
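The five steps above can be sketched as a simple loop. This is a minimal sketch, not the full crawler shown later: the WEB map is a stand-in for real fetching and parsing (an assumption made so the sketch runs without network access), and the ArrayList plays the role of the URL frontier from steps 4 and 5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CrawlSketch {
    // In-memory "web": page -> pages it links to (stands in for fetch + parse).
    static final Map<String, List<String>> WEB = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("a", "c"),
        "c", List.of("d"),
        "d", List.of()
    );

    static List<String> crawl(String seed) {
        ArrayList<String> frontier = new ArrayList<>();
        ArrayList<String> visited = new ArrayList<>();
        frontier.add(seed);                              // 1. begin with a known seed
        while (!frontier.isEmpty()) {
            String url = frontier.remove(0);             // 5. fetch each URL on the list
            if (visited.contains(url)) continue;         //    skip pages already crawled
            visited.add(url);
            // 2-3. "fetch and parse" the page and extract the URLs it points to
            for (String next : WEB.getOrDefault(url, List.of())) {
                if (!visited.contains(next)) {
                    frontier.add(next);                  // 4. place extracted URLs on the list
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a")); // prints [a, b, c, d]
    }
}
```

Because new links are appended to the end of the list, pages are visited in breadth-first order from the seed.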
How Does the Google Crawler Work?

• Google uses software known as Web Crawlers to discover
publicly available webpages. The most well-known crawler is
called Googlebot.
• Crawlers look at webpages, follow the links on those pages
from link to link, and bring data about those webpages back
to Google's servers.
Code

import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Crawler {

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        System.out.print("Enter a website: ");
        String url = sc.nextLine(); // e.g. "https://en.wikipedia.org/"
        crawl(1, url, new ArrayList<String>());
        sc.close();
    }

    // Recursively follows links up to a depth of 5, skipping URLs already visited.
    private static void crawl(int level, String url, ArrayList<String> visited) {
        if (level <= 5) {
            Document doc = request(url, visited);
            if (doc != null) {
                for (Element link : doc.select("a[href]")) {
                    String nextLink = link.absUrl("href");
                    if (!visited.contains(nextLink)) {
                        // level + 1, not level++: level++ passes the old value,
                        // so the depth limit would never be reached
                        crawl(level + 1, nextLink, visited);
                    }
                }
            }
        }
    }

    // Fetches a page; on HTTP 200, prints its URL and title, records it as
    // visited, and returns the parsed document. Returns null on any failure.
    private static Document request(String url, ArrayList<String> visited) {
        try {
            Connection con = Jsoup.connect(url);
            Document doc = con.get();
            if (con.response().statusCode() == 200) {
                System.out.println("Link: " + url);
                System.out.println(doc.title());
                visited.add(url);
                return doc;
            }
            return null;
        } catch (IOException e) {
            return null;
        }
    }
}
Output

(screenshot of the crawled links and page titles)
Time Complexity

Function name   Lines of code   Time complexity
main            6               O(1)
crawl           15              O(5)
request         16              O(1+1+1) = O(3)

Thank you
