
6/1/23, 6:26 PM webScraping.ipynb - Colaboratory

import requests 
import pandas as pd
from bs4 import BeautifulSoup

# 1) Scrape header tags from wikipedia.org and create a DataFrame
url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
headers = soup.find_all(['h1','h2','h3','h4','h5','h6'])
header_text = [header.get_text() for header in headers]
df_headers = pd.DataFrame({'Headers':header_text})
df_headers = df_headers.style.set_caption('Header tags from wikipedia:')
df_headers

Header tags from wikipedia:


  Headers

0 Main Page

1 Welcome to Wikipedia

2 From today's featured article

3 Did you know ...

4 In the news

5 On this day

6 Today's featured picture

7 Other areas of Wikipedia

8 Wikipedia's sister projects

9 Wikipedia languages
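
The cells in this notebook call `requests.get(url)` and parse whatever comes back, so a timeout, a 403, or an error page would flow silently into BeautifulSoup. A minimal sketch of a more defensive fetch follows. The helper name `fetch_soup`, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical default header; adjust it to the target site's terms of use.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; scraping-exercise)"}


def fetch_soup(url, session=None, timeout=10):
    """Download `url` and return a BeautifulSoup tree, failing loudly on HTTP errors."""
    http = session if session is not None else requests.Session()
    response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

With this in place, the first cell could start with `soup = fetch_soup('https://en.wikipedia.org/wiki/Main_Page')` and would stop with an exception instead of building a DataFrame from an error page.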

# 2) scrape IMDB's top rated 50 movies data and create a data frame 
url = 'https://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
movies = soup.select('td.titleColumn')
ratings = soup.select('td.imdbRating strong')
movie_data = []

for movie, rating in zip(movies, ratings):
    title = movie.find('a').text
    year = movie.find('span').text.strip('()')
    rating = rating.text
    movie_data.append({'Title': title, 'Year': year, 'Rating': rating})

df_movies = pd.DataFrame(movie_data)
print('Top 50 movies on IMDB:')
df_movies.head(50)

2 The Dark Knight 2008 9.0

3 The Godfather Part II 1974 9.0

4 12 Angry Men 1957 9.0

5 Schindler's List 1993 8.9

6 The Lord of the Rings: The Return of the King 2003 8.9

7 Pulp Fiction 1994 8.8

8 The Lord of the Rings: The Fellowship of the Ring 2001 8.8

9 The Good, the Bad and the Ugly 1966 8.8

10 Forrest Gump 1994 8.8

11 Fight Club 1999 8.7

12 The Lord of the Rings: The Two Towers 2002 8.7

13 Inception 2010 8.7

14 Star Wars: Episode V - The Empire Strikes Back 1980 8.7

15 The Matrix 1999 8.7

16 Goodfellas 1990 8.7

17 One Flew Over the Cuckoo's Nest 1975 8.6

18 Se7en 1995 8.6

19 It's a Wonderful Life 1946 8.6

20 Seven Samurai 1954 8.6

21 The Silence of the Lambs 1991 8.6

22 Saving Private Ryan 1998 8.6

23 City of God 2002 8.6

24 Interstellar 2014 8.6

25 Life Is Beautiful 1997 8.6

26 The Green Mile 1999 8.6

27 Star Wars: Episode IV - A New Hope 1977 8.5

28 Terminator 2: Judgment Day 1991 8.5

29 Back to the Future 1985 8.5

30 Spirited Away 2001 8.5

31 The Pianist 2002 8.5

32 Psycho 1960 8.5

33 Parasite 2019 8.5

34 Léon: The Professional 1994 8.5

35 Gladiator 2000 8.5

36 The Lion King 1994 8.5

37 American History X 1998 8.5


38 The Departed 2006 8.5

39 Whiplash 2014 8.5

40 The Prestige 2006 8.5

41 The Usual Suspects 1995 8.5

42 Casablanca 1942 8.5

43 Grave of the Fireflies 1988 8.5

44 Harakiri 1962 8.5

45 The Intouchables 2011 8.5

46 Modern Times 1936 8.4

47 Once Upon a Time in the West 1968 8.4

48 Cinema Paradiso 1988 8.4

49 Rear Window 1954 8.4

import requests
import pandas as pd
from bs4 import BeautifulSoup

# 3) Scrape IMDB's Top rated 50 Indian movies' data and create a data frame
url = 'https://www.imdb.com/list/ls056092300/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('h3', class_='lister-item-header')
ratings = soup.find_all('span', class_='ipl-rating-star__rating')
movie_data = []

for movie, rating in zip(movies, ratings):
    title = movie.find('a').text
    year = movie.find('span', class_='lister-item-year').text.strip('()')
    rating = rating.text
    movie_data.append({'Title': title, 'Year': year, 'Rating': rating})

df_indian_movies = pd.DataFrame(movie_data)
print("Top 50 Indian Movies on IMDB:")
df_indian_movies.head(50)

2 Paper Flowers 1959 Rate

3 Lagaan: Once Upon a Time in India 2001 1

4 Pather Panchali 1955 Rate

5 Charulata 1964 2

6 Rang De Basanti 2006 Rate

7 Dev.D 2009 3

8 3 Idiots 2009 Rate

9 Awaara 1951 4

10 Nayakan 1987 Rate

11 Aparajito 1956 5

12 Pushpaka Vimana 1987 Rate

13 Thirst 1957 6

14 The Ritual 1977 Rate

15 Sholay 1975 7

16 Aradhana 1969 Rate

17 Do Ankhen Barah Haath 1957 8

18 Bombay 1995 Rate

19 Neecha Nagar 1946 9

20 Do Bigha Zamin 1953 Rate

21 Garm Hava 1974 10

22 Piravi 1989 Rate

23 Mughal-E-Azam 1960 8.4

24 Report to Mother 1986 0

25 Madhumati 1958 Rate

26 Goopy Gyne Bagha Byne 1969 1

27 Gangs of Wasseypur 2012 Rate

28 Guide 1965 2

29 Satya 1998 Rate

30 Roja 1992 3

31 Mr. India 1987 Rate

32 The Cloud-Capped Star 1960 4

33 Harishchandrachi Factory 2009 Rate

34 Masoom 1983 5

35 Agneepath 1990 Rate

36 Tabarana Kathe 1986 6

37 Zakhm 1998 Rate

38 Dil Chahta Hai 2001 7

39 Bhaag Milkha Bhaag 2013 Rate

40 Chupke Chupke 1975 8

41 Dilwale Dulhania Le Jayenge 1995 Rate

42 Like Stars on Earth 2007 9

43 Ardh Satya 1983 Rate

44 Bhumika 1977 10

45 Enthiran 2010 Rate

46 Sadma 1983 7.8

47 A Breath 2004 0

48 Lamhe 1991 Rate

49 Haqeeqat 1964 1
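
The `Rating` column above is mostly `Rate` and the digits 0 to 10, with a plausible score (8.4, 7.8) appearing only every 23 rows, at rows 23 and 46. That pattern suggests `soup.find_all('span', class_='ipl-rating-star__rating')` also matches the spans of the interactive rating widget, so `zip` pairs each title with an unrelated span. One possible fix is to look up the rating inside each movie's own container. The sketch below runs on a small hand-written HTML fragment with placeholder titles. The `lister-item` container class, and the assumption that the first matching span in each container holds the score, are guesses about the page layout rather than something the notebook confirms.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the list page; structure and values are made up.
SAMPLE_HTML = """
<div class="lister-item">
  <h3 class="lister-item-header">
    <a href="#">Movie A</a>
    <span class="lister-item-year">(1960)</span>
  </h3>
  <span class="ipl-rating-star__rating">8.4</span>
  <span class="ipl-rating-star__rating">0</span>
  <span class="ipl-rating-star__rating">Rate</span>
  <span class="ipl-rating-star__rating">1</span>
</div>
<div class="lister-item">
  <h3 class="lister-item-header">
    <a href="#">Movie B</a>
    <span class="lister-item-year">(1983)</span>
  </h3>
  <span class="ipl-rating-star__rating">7.8</span>
  <span class="ipl-rating-star__rating">0</span>
  <span class="ipl-rating-star__rating">Rate</span>
</div>
"""


def parse_movies(html):
    """Return one dict per movie, reading the rating from inside each item."""
    soup = BeautifulSoup(html, "html.parser")
    movie_data = []
    for item in soup.select("div.lister-item"):
        header = item.find("h3", class_="lister-item-header")
        year_tag = header.find("span", class_="lister-item-year")
        # select_one returns only the first matching span within this item
        rating_tag = item.select_one("span.ipl-rating-star__rating")
        movie_data.append({
            "Title": header.find("a").get_text(strip=True),
            "Year": year_tag.get_text(strip=True).strip("()"),
            "Rating": rating_tag.get_text(strip=True) if rating_tag else None,
        })
    return movie_data


movies = parse_movies(SAMPLE_HTML)
print(movies)
```

Because every field is read from the same container, a movie with a missing rating yields `None` for that row instead of shifting all later rows.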


# 4) Scrape list of respected former presidents of India and create a data frame
url = 'https://presidentofindia.nic.in/former-presidents.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
president_table = soup.find('table', class_='views-table')
president_data = []

if president_table is not None:
    presidents = president_table.find_all('tr')[1:]
    for president in presidents:
        columns = president.find_all('td')
        if len(columns) > 1:
            name = columns[0].text.strip()
            term = columns[1].text.strip()
            president_data.append({'Name': name, 'Term of Office': term})

df_presidents = pd.DataFrame(president_data)
print("Former Presidents of India:")
print(df_presidents)

Former Presidents of India:


Empty DataFrame
Columns: []
Index: []
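
The empty result means no rows were appended. The most likely reason is that `soup.find('table', class_='views-table')` returned `None` (the page may have moved, changed its markup, or be rendered by JavaScript), and the `if` guard then skipped the loop without any message. One option is to make that case visible. The sketch below uses a hypothetical helper, `require_tag`, and a made-up HTML fragment to show the idea.

```python
from bs4 import BeautifulSoup


def require_tag(soup, name, **attrs):
    """Like soup.find(), but raise instead of returning None when nothing matches."""
    tag = soup.find(name, **attrs)
    if tag is None:
        raise LookupError(f"no <{name}> matching {attrs} in the downloaded page")
    return tag


# Made-up fragment standing in for a page that lacks the expected table.
page = BeautifulSoup("<div><p>Page moved</p></div>", "html.parser")

try:
    require_tag(page, "table", class_="views-table")
    message = "table found"
except LookupError as err:
    message = str(err)

print(message)
```

When the lookup fails, printing `response.status_code` and the first few hundred characters of `response.text` usually shows whether the server returned the expected page at all.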

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_odi_teams_men():
    url = 'https://www.icc-cricket.com/rankings/mens/team-rankings/odi'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', class_='table')
    rows = table.find_all('tr')
    
    team_data = []
    for row in rows[1:11]:  # Exclude the header row and consider only the top 10 teams
        cells = row.find_all('td')
        rank = cells[0].text.strip()
        team = cells[1].text.strip()
        matches = cells[2].text.strip()
        points = cells[3].text.strip()
        rating = cells[4].text.strip()
        team_data.append({'Rank': rank, 'Team': team, 'Matches': matches, 'Points': points, 'Rating': rating})
    
    df = pd.DataFrame(team_data)
    return df

def scrape_odi_batsmen_men():
    url = 'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', class_='table')
    rows = table.find_all('tr')
    
    player_data = []
    for row in rows[1:11]:  # Exclude the header row and consider only the top 10 batsmen
        cells = row.find_all('td')
        rank = cells[0].text.strip()
        player = cells[1].text.strip()
        team = cells[2].text.strip()
        rating = cells[3].text.strip()
        player_data.append({'Rank': rank, 'Player': player, 'Team': team, 'Rating': rating})
    
    df = pd.DataFrame(player_data)
    return df

def scrape_odi_bowlers_men():
    url = 'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/bowling'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', class_='table')
    rows = table.find_all('tr')
    
    player_data = []
    for row in rows[1:11]:  # Exclude the header row and consider only the top 10 bowlers
        cells = row.find_all('td')
        rank = cells[0].text.strip()

        player = cells[1].text.strip()
        team = cells[2].text.strip()
        rating = cells[3].text.strip()
        player_data.append({'Rank': rank, 'Player': player, 'Team': team, 'Rating': rating})
    
    df = pd.DataFrame(player_data)
    return df

scrape_odi_teams_men()
scrape_odi_batsmen_men()
scrape_odi_bowlers_men()

Rank Player Team Rating

0 1\n \n\n\n(0) Josh Hazlewood AUS 705

1 2\n \n\n\n(0) Mohammed Siraj IND 691

2 3\n \n\n\n(0) Mitchell Starc AUS 686

3 4\n \n\n\n(0) Matt Henry NZ 667

4 5\n \n\n\n(0) Trent Boult NZ 660

5 6\n \n\n\n(0) Rashid Khan AFG 659

6 7\n \n\n\n(0) Adam Zampa AUS 652

7 8\n \n\n\n(0) Mujeeb Ur Rahman AFG 637

8 9\n \n\n\n(0) Mohammad Nabi AFG 631

9 10\n \n\n\n(0) Shaheen Afridi PAK 630
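
Only the bowlers table is shown because a notebook cell displays just its last expression; the teams and batsmen frames are computed and discarded. The `Rank` column also still carries the cell's inner whitespace and a trailing `(0)`, for example `1\n \n\n\n(0)`, because `.text.strip()` trims only the ends of the string. Below is a small sketch of one way to keep just the leading position. The helper name `clean_rank` is illustrative, and treating `(0)` as a rank-movement indicator is an assumption.

```python
import re


def clean_rank(raw):
    """Return the leading rank token (digits or '=') from a raw table cell."""
    match = re.match(r"\s*(\d+|=)", raw)
    return match.group(1) if match else raw.strip()


raw_cells = ["1\n \n\n\n(0)", "10\n \n\n\n(0)", "=\n \n\n\n\n\n"]
cleaned = [clean_rank(cell) for cell in raw_cells]
print(cleaned)  # ['1', '10', '=']
```

In the scraping functions this would replace `cells[0].text.strip()` with `clean_rank(cells[0].text)`.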

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape and create data frame
def scrape_icc_cricket(url):
    # Send a GET request to the specified URL
    response = requests.get(url)
    
    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the table containing the required data
    table = soup.find('table')
    
    # Initialize empty lists to store the data
    teams = []
    matches = []
    points = []
    ratings = []
    
    # Iterate over each row in the table
    for row in table.find_all('tr')[1:]:
        # Extract the data from each column in the row
        cols = row.find_all('td')
        team = cols[0].text.strip()
        match = cols[1].text.strip()
        point = cols[2].text.strip()
        rating = cols[3].text.strip()
        
        # Append the data to the respective lists
        teams.append(team)
        matches.append(match)
        points.append(point)
        ratings.append(rating)
    
    # Create a data frame using the scraped data
    df = pd.DataFrame({
        'Team': teams,
        'Matches': matches,
        'Points': points,
        'Rating': ratings
    })
    
    return df

# Scrape and create data frame for top 10 ODI teams in women's cricket
url_a = 'https://www.icc-cricket.com/rankings/womens/team-rankings/odi'
df_a = scrape_icc_cricket(url_a)

# Scrape and create data frame for top 10 women's ODI batting players
url_b = 'https://www.icc-cricket.com/rankings/womens/player-rankings/odi/batting'
df_b = scrape_icc_cricket(url_b)

# Scrape and create data frame for top 10 women's ODI all-rounders
url_c = 'https://www.icc-cricket.com/rankings/womens/player-rankings/odi/all-rounder'
df_c = scrape_icc_cricket(url_c)

# Print the data frames
print("Top 10 ODI Teams in Women's Cricket:")
print(df_a)
print("\nTop 10 Women's ODI Batting Players:")
print(df_b)
print("\nTop 10 Women's ODI All-rounders:")
print(df_c)

Top 10 ODI Teams in Women's Cricket:


Team Matches Points Rating
0 1 Australia\nAUS 21 3,603
1 2 England\nENG 28 3,342
2 3 South Africa\nSA 26 3,098
3 4 India\nIND 27 2,820
4 5 New Zealand\nNZ 25 2,553
5 6 West Indies\nWI 27 2,535
6 7 Thailand\nTHA 11 821
7 8 Bangladesh\nBAN 14 977
8 9 Pakistan\nPAK 27 1,678
9 10 Sri Lanka\nSL 9 479
10 11 Ireland\nIRE 14 548
11 12 Netherlands\nNED 9 0
12 13 Zimbabwe\nZIM 11 0

Top 10 Women's ODI Batting Players:


Team Matches \
0 1\n \n\n\n\n\n(1)\nThis... Beth Mooney
1 2\n \n\n\n\n\n(... Laura Wolvaardt
2 3\n \n\n\n\n\n(... Natalie Sciver
3 4\n \n\n\n\n\n(... Meg Lanning
4 5\n \n\n\n\n\n(... Harmanpreet Kaur
.. ... ...
95 96\n \n\n\n\n\n... Kim Garth
96 97\n \n\n\n\n\n... Arlene Kelly
97 98\n \n\n\n\n\n... Fatima Sana
98 99\n \n\n\n\n\n... Mary-Anne Musonda
99 =\n \n\n\n\n\n\... Lea Tahuhu

Points Rating
0 AUS 754
1 SA 732
2 ENG 731
3 AUS 717
4 IND 716
.. ... ...
95 AUS 202
96 IRE 199
97 PAK 197
98 ZIM 196
99 NZ 196

[100 rows x 4 columns]

Top 10 Women's ODI All-rounders:


Team Matches \
0 1\n \n\n\n(0) Hayley Matthews
1 2\n \n\n\n(0) Natalie Sciver
2 3\n \n\n\n(0) Ellyse Perry
3 4\n \n\n\n(0) Marizanne Kapp
4 5\n \n\n\n(0) Amelia Kerr
5 6\n \n\n\n(0) Deepti Sharma
6 7\n \n\n\n(0) Ashleigh Gardner
7 8\n \n\n\n(0) Jess Jonassen
8 9\n \n\n\n(0) Nida Dar
9 10\n \n\n\n(0) Sophie Ecclestone
10 11\n \n\n\n(0) Sophie Devine
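
In the outputs above the labels are shifted: the `Team` column holds the rank, `Matches` holds the team or player name, and the last table column is dropped, because `scrape_icc_cricket` hard-codes four names for tables that begin with a position column. The frames also contain every row (13 teams, 100 batters) rather than the top 10. One alternative, sketched below, takes the column names from the table's own header cells and stops after `limit` rows. The HTML fragment, its header labels, and its values are made up for illustration; the real ICC markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up fragment; header labels, layout, and numbers are assumptions.
SAMPLE_TABLE = """
<table>
  <tr><th>Pos</th><th>Team</th><th>Matches</th><th>Points</th><th>Rating</th></tr>
  <tr><td>1</td><td>Team A\nAAA</td><td>20</td><td>3,000</td><td>150</td></tr>
  <tr><td>2</td><td>Team B\nBBB</td><td>25</td><td>3,000</td><td>120</td></tr>
  <tr><td>3</td><td>Team C\nCCC</td><td>10</td><td>1,000</td><td>100</td></tr>
</table>
"""


def table_to_records(html, limit=10):
    """Parse the first <table> into dicts keyed by its own header cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = table.find_all("tr")
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:limit + 1]:
        # Collapse internal newlines and runs of spaces inside each cell
        cells = [" ".join(td.get_text().split()) for td in row.find_all("td")]
        if len(cells) == len(columns):  # skip malformed or spacer rows
            records.append(dict(zip(columns, cells)))
    return records


top_two = table_to_records(SAMPLE_TABLE, limit=2)
print(top_two)
```

The resulting list can be passed straight to `pd.DataFrame(...)`, and the same function then serves team and player pages without a separate set of hard-coded labels for each.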