Piyush Shivhare CS201083 WebScraping - Ipynb - Colaboratory

6/1/23, 6:26 PM webScraping.ipynb - Colaboratory
import requests
import pandas as pd
from bs4 import BeautifulSoup
# 1) Scrape header tags from wikipedia.org and create a DataFrame
url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
headers = soup.find_all(['h1','h2','h3','h4','h5','h6'])
header_text = [header.get_text() for header in headers]
df_headers = pd.DataFrame({'Headers':header_text})
df_headers = df_headers.style.set_caption('Header tags from wikipedia:')
df_headers
Header tags from wikipedia:

Headers
0 Main Page
1 Welcome to Wikipedia
2 From today's featured article
3 Did you know ...
4 In the news
5 On this day
6 Today's featured picture
7 Other areas of Wikipedia
8 Wikipedia's sister projects
9 Wikipedia languages
# 2) Scrape IMDB's top rated 50 movies data and create a data frame
url = 'https://www.imdb.com/chart/top'
response = requests.get(url)  # fetch the chart page before parsing it
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.select('td.titleColumn')
ratings = soup.select('td.imdbRating strong')
movie_data = []
for movie, rating in zip(movies, ratings):
    title = movie.find('a').text
    year = movie.find('span').text.strip('()')
    rating = rating.text
    movie_data.append({'Title': title, 'Year': year, 'Rating': rating})
df_movies = pd.DataFrame(movie_data)
print('Top 50 movies on IMDB:')
df_movies.head(50)
2 The Dark Knight 2008 9.0
3 The Godfather Part II 1974 9.0
4 12 Angry Men 1957 9.0
5 Schindler's List 1993 8.9
6 The Lord of the Rings: The Return of the King 2003 8.9
7 Pulp Fiction 1994 8.8
8 The Lord of the Rings: The Fellowship of the Ring 2001 8.8
9 The Good, the Bad and the Ugly 1966 8.8
10 Forrest Gump 1994 8.8
11 Fight Club 1999 8.7
12 The Lord of the Rings: The Two Towers 2002 8.7
13 Inception 2010 8.7
14 Star Wars: Episode V - The Empire Strikes Back 1980 8.7
15 The Matrix 1999 8.7
16 Goodfellas 1990 8.7
17 One Flew Over the Cuckoo's Nest 1975 8.6
18 Se7en 1995 8.6
19 It's a Wonderful Life 1946 8.6
20 Seven Samurai 1954 8.6
21 The Silence of the Lambs 1991 8.6
22 Saving Private Ryan 1998 8.6
23 City of God 2002 8.6
24 Interstellar 2014 8.6
25 Life Is Beautiful 1997 8.6
26 The Green Mile 1999 8.6
27 Star Wars: Episode IV - A New Hope 1977 8.5
28 Terminator 2: Judgment Day 1991 8.5
29 Back to the Future 1985 8.5
30 Spirited Away 2001 8.5
31 The Pianist 2002 8.5
32 Psycho 1960 8.5
33 Parasite 2019 8.5
34 Léon: The Professional 1994 8.5
35 Gladiator 2000 8.5
36 The Lion King 1994 8.5
37 American History X 1998 8.5
38 The Departed 2006 8.5
39 Whiplash 2014 8.5
40 The Prestige 2006 8.5
41 The Usual Suspects 1995 8.5
42 Casablanca 1942 8.5
43 Grave of the Fireflies 1988 8.5
44 Harakiri 1962 8.5
45 The Intouchables 2011 8.5
46 Modern Times 1936 8.4
47 Once Upon a Time in the West 1968 8.4
48 Cinema Paradiso 1988 8.4
49 Rear Window 1954 8.4

import requests
import pandas as pd
# 3) Scrape IMDB's top rated 50 Indian movies' data and create a data frame
url = 'https://www.imdb.com/list/ls056092300/'
response = requests.get(url)  # fetch the list page before parsing it
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('h3', class_='lister-item-header')
ratings = soup.find_all('span', class_='ipl-rating-star__rating')
movie_data = []
for movie, rating in zip(movies, ratings):
    title = movie.find('a').text
    year = movie.find('span', class_='lister-item-year').text.strip('()')
    rating = rating.text
    movie_data.append({'Title': title, 'Year': year, 'Rating': rating})
df_indian_movies = pd.DataFrame(movie_data)
print("Top 50 Indian Movies on IMDB:")
df_indian_movies.head(50)
2 Paper Flowers 1959 Rate
3 Lagaan: Once Upon a Time in India 2001 1
4 Pather Panchali 1955 Rate
5 Charulata 1964 2
6 Rang De Basanti 2006 Rate
7 Dev.D 2009 3
8 3 Idiots 2009 Rate
9 Awaara 1951 4
10 Nayakan 1987 Rate
11 Aparajito 1956 5
12 Pushpaka Vimana 1987 Rate
13 Thirst 1957 6
14 The Ritual 1977 Rate
15 Sholay 1975 7
16 Aradhana 1969 Rate
17 Do Ankhen Barah Haath 1957 8
18 Bombay 1995 Rate
19 Neecha Nagar 1946 9
20 Do Bigha Zamin 1953 Rate
21 Garm Hava 1974 10
22 Piravi 1989 Rate
23 Mughal-E-Azam 1960 8.4
24 Report to Mother 1986 0
25 Madhumati 1958 Rate
26 Goopy Gyne Bagha Byne 1969 1
27 Gangs of Wasseypur 2012 Rate
28 Guide 1965 2
29 Satya 1998 Rate
30 Roja 1992 3
31 Mr. India 1987 Rate
32 The Cloud-Capped Star 1960 4
33 Harishchandrachi Factory 2009 Rate
34 Masoom 1983 5
35 Agneepath 1990 Rate
36 Tabarana Kathe 1986 6
37 Zakhm 1998 Rate
38 Dil Chahta Hai 2001 7
39 Bhaag Milkha Bhaag 2013 Rate
40 Chupke Chupke 1975 8
41 Dilwale Dulhania Le Jayenge 1995 Rate
42 Like Stars on Earth 2007 9
43 Ardh Satya 1983 Rate
44 Bhumika 1977 10
45 Enthiran 2010 Rate
46 Sadma 1983 7.8
47 A Breath 2004 0
48 Lamhe 1991 Rate
49 Haqeeqat 1964 1
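The Rating column above mostly shows 'Rate' and small integers rather than scores. A likely cause is that the class ipl-rating-star__rating also appears on the spans of the page's rate-this-title widget, so pairing every such span with the titles via zip drifts out of step. The sketch below shows one possible way around this: look up the rating inside each movie's own container. The HTML is a made-up fragment, and the container classes (lister-item, ipl-rating-star small, ipl-rating-interactive) are assumptions about the page's markup, not something the notebook confirms.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one list entry plus its rating widget
html = """
<div class="lister-item">
  <h3 class="lister-item-header"><a href="#">Sample Movie</a>
    <span class="lister-item-year">(1999)</span></h3>
  <div class="ipl-rating-star small">
    <span class="ipl-rating-star__rating">8.4</span>
  </div>
  <div class="ipl-rating-interactive">
    <span class="ipl-rating-star__rating">Rate</span>
    <span class="ipl-rating-star__rating">1</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
movie_data = []
for item in soup.select('div.lister-item'):
    title = item.select_one('h3.lister-item-header a').get_text(strip=True)
    year = item.select_one('span.lister-item-year').get_text(strip=True).strip('()')
    # Take only the span inside the static star block, not the widget spans
    star = item.select_one('div.ipl-rating-star.small span.ipl-rating-star__rating')
    rating = star.get_text(strip=True) if star else None
    movie_data.append({'Title': title, 'Year': year, 'Rating': rating})

print(movie_data)
```

Because title, year and rating are all read from the same container, a missing rating yields None for that movie instead of shifting every later row.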
# 4) Scrape list of respected former presidents of India and create a data frame
url = 'https://presidentofindia.nic.in/former-presidents.htm'
response = requests.get(url)  # fetch the page before parsing it
soup = BeautifulSoup(response.text, 'html.parser')
president_table = soup.find('table', class_='views-table')
president_data = []
if president_table is not None:
    presidents = president_table.find_all('tr')[1:]  # skip the header row
    for president in presidents:
        columns = president.find_all('td')
        if len(columns) > 1:
            name = columns[0].text.strip()
            term = columns[1].text.strip()
            president_data.append({'Name': name, 'Term of Office': term})
df_presidents = pd.DataFrame(president_data)
print("Former Presidents of India:")
print(df_presidents)
Former Presidents of India:

Empty DataFrame
Columns: []
Index: []
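import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_odi_teams_men():
    url = 'https://www.icc-cricket.com/rankings/mens/team-rankings/odi'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', class_='table')
    rows = table.find_all('tr')

    team_data = []
    for row in rows[1:11]:  # Exclude the header row and consider only the top 10 teams
        cells = row.find_all('td')
        rank = cells[0].text.strip()
        team = cells[1].text.strip()
        matches = cells[2].text.strip()
        points = cells[3].text.strip()
        rating = cells[4].text.strip()
        team_data.append({'Rank': rank, 'Team': team, 'Matches': matches, 'Points': points, 'Rating': rating})

    df = pd.DataFrame(team_data)
    return df

def scrape_odi_batsmen_men():
    url = 'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', class_='table')
    rows = table.find_all('tr')

    player_data = []
    for row in rows[1:11]:  # Exclude the header row and consider only the top 10 batsmen
        cells = row.find_all('td')
        # Column order (rank, player, team, rating) inferred from the bowlers output after this cell
        rank = cells[0].text.strip()
        player = cells[1].text.strip()
        team = cells[2].text.strip()
        rating = cells[3].text.strip()
        player_data.append({'Rank': rank, 'Player': player, 'Team': team, 'Rating': rating})

    df = pd.DataFrame(player_data)
    return df

def scrape_odi_bowlers_men():
    url = 'https://www.icc-cricket.com/rankings/mens/player-rankings/odi/bowling'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', class_='table')
    rows = table.find_all('tr')

    player_data = []
    for row in rows[1:11]:  # Exclude the header row and consider only the top 10 bowlers
        cells = row.find_all('td')
        # Column order (rank, player, team, rating) inferred from the output after this cell
        rank = cells[0].text.strip()
        player = cells[1].text.strip()
        team = cells[2].text.strip()
        rating = cells[3].text.strip()
        player_data.append({'Rank': rank, 'Player': player, 'Team': team, 'Rating': rating})

    df = pd.DataFrame(player_data)
    return df

scrape_odi_teams_men()
scrape_odi_batsmen_men()
scrape_odi_bowlers_men()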
Rank Player Team Rating
0 1\n \n\n\n(0) Josh Hazlewood AUS 705
1 2\n \n\n\n(0) Mohammed Siraj IND 691
2 3\n \n\n\n(0) Mitchell Starc AUS 686
3 4\n \n\n\n(0) Matt Henry NZ 667
4 5\n \n\n\n(0) Trent Boult NZ 660
5 6\n \n\n\n(0) Rashid Khan AFG 659
6 7\n \n\n\n(0) Adam Zampa AUS 652
7 8\n \n\n\n(0) Mujeeb Ur Rahman AFG 637
8 9\n \n\n\n(0) Mohammad Nabi AFG 631
9 10\n \n\n\n(0) Shaheen Afridi PAK 630
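The Rank column above still holds the raw cell text, for example '1\n \n\n\n(0)': the position, some whitespace, and a number in parentheses. A minimal sketch of a cleanup step, assuming every rank cell starts with the position when it has one; clean_rank is a hypothetical helper, not part of the notebook.

```python
import re

def clean_rank(cell_text):
    """Return the leading position as an int, or None when the cell has no leading number."""
    match = re.match(r'\s*(\d+)', cell_text)
    return int(match.group(1)) if match else None

# Sample cells in the shape printed by the notebook, including a tied position ('=')
raw_ranks = ['1\n \n\n\n(0)', '10\n \n\n\n(0)', '=\n \n\n\n(1)']
print([clean_rank(r) for r in raw_ranks])  # [1, 10, None]
```

Applied to a data frame, this could be used as df['Rank'] = df['Rank'].map(clean_rank).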
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Function to scrape and create data frame
def scrape_icc_cricket(url):
    # Send a GET request to the specified URL
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the table containing the required data
    table = soup.find('table')

    # Initialize empty lists to store the data
    teams = []
    matches = []
    points = []
    ratings = []

    # Iterate over each row in the table, skipping the header row
    for row in table.find_all('tr')[1:]:
        # Extract the data from each column in the row
        cols = row.find_all('td')
        team = cols[0].text.strip()
        match = cols[1].text.strip()
        point = cols[2].text.strip()
        rating = cols[3].text.strip()

        # Append the data to the respective lists
        teams.append(team)
        matches.append(match)
        points.append(point)
        ratings.append(rating)

    # Create a data frame using the scraped data
    df = pd.DataFrame({
        'Team': teams,
        'Matches': matches,
        'Points': points,
        'Rating': ratings
    })

    return df
# Scrape and create data frame for top 10 ODI teams in women's cricket
url_a = 'https://www.icc-cricket.com/rankings/womens/team-rankings/odi'
df_a = scrape_icc_cricket(url_a)
# Scrape and create data frame for top 10 women's ODI batting players
url_b = 'https://www.icc-cricket.com/rankings/womens/player-rankings/odi/batting'
df_b = scrape_icc_cricket(url_b)
# Scrape and create data frame for top 10 women's ODI all-rounders
url_c = 'https://www.icc-cricket.com/rankings/womens/player-rankings/odi/all-rounder'
df_c = scrape_icc_cricket(url_c)
# Print the data frames
print("Top 10 ODI Teams in Women's Cricket:")
print(df_a)
print("\nTop 10 Women's ODI Batting Players:")
print(df_b)
print("\nTop 10 Women's ODI All-rounders:")
print(df_c)
Top 10 ODI Teams in Women's Cricket:

Team Matches Points Rating
0 1 Australia\nAUS 21 3,603
1 2 England\nENG 28 3,342
2 3 South Africa\nSA 26 3,098
3 4 India\nIND 27 2,820
4 5 New Zealand\nNZ 25 2,553
5 6 West Indies\nWI 27 2,535
6 7 Thailand\nTHA 11 821
7 8 Bangladesh\nBAN 14 977
8 9 Pakistan\nPAK 27 1,678
9 10 Sri Lanka\nSL 9 479
10 11 Ireland\nIRE 14 548
11 12 Netherlands\nNED 9 0
12 13 Zimbabwe\nZIM 11 0
Top 10 Women's ODI Batting Players:

Team Matches \
0 1\n \n\n\n\n\n(1)\nThis... Beth Mooney
1 2\n \n\n\n\n\n(... Laura Wolvaardt
2 3\n \n\n\n\n\n(... Natalie Sciver
3 4\n \n\n\n\n\n(... Meg Lanning
4 5\n \n\n\n\n\n(... Harmanpreet Kaur
.. ... ...
95 96\n \n\n\n\n\n... Kim Garth
96 97\n \n\n\n\n\n... Arlene Kelly
97 98\n \n\n\n\n\n... Fatima Sana
98 99\n \n\n\n\n\n... Mary-Anne Musonda
99 =\n \n\n\n\n\n\... Lea Tahuhu
Points Rating
0 AUS 754
1 SA 732
2 ENG 731
3 AUS 717
4 IND 716
.. ... ...
95 AUS 202
96 IRE 199
97 PAK 197
98 ZIM 196
99 NZ 196
[100 rows x 4 columns]
Top 10 Women's ODI All-rounders:

Team Matches \
0 1\n \n\n\n(0) Hayley Matthews
1 2\n \n\n\n(0) Natalie Sciver
2 3\n \n\n\n(0) Ellyse Perry
3 4\n \n\n\n(0) Marizanne Kapp
4 5\n \n\n\n(0) Amelia Kerr
5 6\n \n\n\n(0) Deepti Sharma
6 7\n \n\n\n(0) Ashleigh Gardner
7 8\n \n\n\n(0) Jess Jonassen
8 9\n \n\n\n(0) Nida Dar
9 10\n \n\n\n(0) Sophie Ecclestone
10 11\n \n\n\n(0) Sophie Devine
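In the team table printed above, each label holds the value of the preceding field: the Team column contains the position, Matches contains the team name and code joined by a newline ('Australia\nAUS'), and the next two columns contain what are presumably the match count and the points, the latter as strings with thousands separators. The sketch below tidies one such row after scraping, under those assumptions; split_team and to_int are hypothetical helpers, not part of the notebook.

```python
def split_team(cell_text):
    """Split a cell such as 'Australia\\nAUS' into (name, code)."""
    parts = [p.strip() for p in cell_text.split('\n') if p.strip()]
    name = parts[0] if parts else ''
    code = parts[1] if len(parts) > 1 else ''
    return name, code

def to_int(number_text):
    """Convert a string such as '3,603' to an int."""
    return int(number_text.replace(',', ''))

# One row in the shape scraped above: position, team cell, matches, points
row = ['1', 'Australia\nAUS', '21', '3,603']
name, code = split_team(row[1])
record = {
    'Position': to_int(row[0]),
    'Team': name,
    'Code': code,
    'Matches': to_int(row[2]),
    'Points': to_int(row[3]),
}
print(record)
```

Naming the fields by what they contain, rather than by column position alone, keeps the labels and the values in agreement when the table has a leading position column.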