Welcome to Scribd!

Skip carousel

A Python Web Scraping How-To Guide: Devbyexample

Uploaded by

Euler Pi

0% found this document useful (0 votes)

12 views6 pages

Original Title

Web scraping

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

0% found this document useful (0 votes)

12 views6 pages

A Python Web Scraping How-To Guide: Devbyexample

Uploaded by

Euler Pi

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Jump to Page

You are on page 1of 6

Search inside document

02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

>DevByExample_

A Python Web Scraping How-To

Guide
Web Scraping with Python/BeautifulSoup/Requests

Install
$ pip install requests beautifulsoup4

BeautifulSoup on Text
from bs4 import BeautifulSoup

text = '''<div><h1>My Header</h1></div>'''

soup = BeautifulSoup(text, 'html.parser')

print(soup.prettify())

<div>
<h1>
My Header

</h1>

</div>

Fetch Webpage and Create Soup

import requests
from bs4 import BeautifulSoup

url = 'https://devbyexample.com/test-scraping'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

https://www.devbyexample.com/web-scraping-cheat-sheet 1/6
02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

Find By ID
<h1 id="article-title">Hello Everyone</h1>

header = soup.find(id="article-id")

print(header)

<h1 id="article-title">Hello Everyone</h1>

print(header.string)

Hello Everyone

Find By Class
<div id="articles">

<div class='end'><button>Next Page</button></div>

</div>

articles = soup.select('.article')

print(articles)

[ <div class="article">...</div>,

<div class="article">...</div>,

<div class="article">...</div>]

https://www.devbyexample.com/web-scraping-cheat-sheet 2/6
02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

Navigating Elements in Tree

<ul>

<li><a href="https://google.com">Google</a></li>

<li><a href="https://apple.com">Apple</a></li>

</ul>

# Get First Link

print(soup.a)

<a href="https://google.com">Google</a>

# Get all Link elements on page

print(soup.find_all("a"))

[ <a href="https://google.com">Google</a>,

<a href="https://bing.com">Bing</a>,

<a href="https://apple.com">Apple</a>]

# Print all hrefs on page

for link in soup.find_all("a"):

print(link['href'])

https://google.com

https://bing.com

https://apple.com

https://www.devbyexample.com/web-scraping-cheat-sheet 3/6
02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

Element Attributes
<div id="article-10" class="article">

<h3>Header</h3>

<p>First Paragraph</p>

<p>Second Paragraph</p>

</div>

print(soup.div.name)

div

print(soup.div.contents)

[ '\n',
<h3>Header</h3>,

'\n',
<p>First Paragraph</p>,

'\n',
<p>Second Paragraph</p>,

'\n']

for strings in div.strings:

print(repr(strings))

'\n'

'Header'

'\n'

'First Paragraph'

'\n'

'Second Paragraph'

'\n'

for strings in soup.div.stripped_strings:

print(repr(strings))

'Header'

'First Paragraph'

'Second Paragraph'

https://www.devbyexample.com/web-scraping-cheat-sheet 4/6
02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

Find By Regex
<div>

<head><title>Sample Title</title></head>

<h1>Title Header</h1>

<hr>

<div>A description of something</div>

<h2>Section Header</h2>

<h2>Another Header</h2>

</div>

import re

headers = soup.find_all(re.compile('^h[1-6]'))

print(headers)

[ <h1>Title Header</h1>,

<h2>Section Header</h2>,

<h2>Another Header</h2>]

Search with CSS Select

<div>

<h3><a href="/sites">Sites</a></h3>

<li><a href="https://google.com">Google</a></li>

<li><a href="https://apple.com">Apple</a></li>

</ul>

</div>

print(soup.select('div a'))

[ <a href="/sites">Sites</a>,

<a href="https://google.com">Google</a>,

<a href="https://bing.com">Bing</a>,

<a href="https://apple.com">Apple</a>]

print(soup.select('div > h3 > a'))

[<a href="/sites">Sites</a>]

print(soup.select('li:nth-child(odd)'))

[ <li><a href="https://google.com">Google</a></li>,

<li><a href="https://apple.com">Apple</a></li>]

print(soup.select('a[href*="http"]'))

[ <a href="https://google.com">Google</a>,

<a href="https://bing.com">Bing</a>,

<a href="https://apple.com">Apple</a>]

https://www.devbyexample.com/web-scraping-cheat-sheet 5/6
02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

Parent, Children and Siblings

<div>

<ul>

<li><a href="https://google.com">Google</a></li>

<li><a href="https://apple.com">Apple</a></li>

</ul>

</div>

# Get Parent Name

ul_element = soup.find('ul')

print(ul_element.parent.name)

div

# Print all text in children

for child in ul_element.children:

print(child.string)

Google

Bing

Apple

# Siblings
first_li_element = soup.find('li')

print(first_li_element)

for sibling in first_li_element.next_siblings:

print(sibling)

<li><a href="https://google.com">Google</a></li>

<li><a href="https://apple.com">Apple</a></li>

Interested in Learning Dev with Deep Dives into Real World Examples?

https://www.devbyexample.com/web-scraping-cheat-sheet 6/6

2016 KTM SXF XCF 250 Service Repair Manual PDF
Document306 pages
2016 KTM SXF XCF 250 Service Repair Manual PDF
ja 787
No ratings yet
Userbouquet - Ilu XXX Adult
Document9 pages
Userbouquet - Ilu XXX Adult
j k
No ratings yet
Portfolio Code
Document12 pages
Portfolio Code
Cherry Charan Tej
No ratings yet
WDC Assignment (Sakshi)
Document34 pages
WDC Assignment (Sakshi)
priyanshi srivastava
No ratings yet
Untitled Document
Document2 pages
Untitled Document
Ceajhie Peralta
No ratings yet
Top 5 Popular Website
Document23 pages
Top 5 Popular Website
James Monsanto
100% (1)
Submitted By:: Lab Task Subject: Web Engineering
Document5 pages
Submitted By:: Lab Task Subject: Web Engineering
M Nabil
No ratings yet
Master Stephen Co
Document9 pages
Master Stephen Co
Sulaiman Hasan
No ratings yet
Musha HTML
Document5 pages
Musha HTML
gio giorga
No ratings yet
Creds Crawl
Document6 pages
Creds Crawl
Zipiddy Dooda
No ratings yet
Wa0015.
Document61 pages
Wa0015.
log2namita
No ratings yet
Restaurant
Document10 pages
Restaurant
Aman ranjan
No ratings yet
Backend Web Development Project Files
Document20 pages
Backend Web Development Project Files
Ram
No ratings yet
Exemplu Laborator2 PDF
Document11 pages
Exemplu Laborator2 PDF
DanuIepuras
No ratings yet
BOOTSTRAB (14 To 21)
Document29 pages
BOOTSTRAB (14 To 21)
shivsaicable67
No ratings yet
AI Practicals
Document108 pages
AI Practicals
Kajal Goud
No ratings yet
Week 1
Document26 pages
Week 1
21131a0578
No ratings yet
Bulma and Others
Document5 pages
Bulma and Others
Aquedudct08090
No ratings yet
Build A Personal Portfolio Webpage
Document10 pages
Build A Personal Portfolio Webpage
John Kesh Mahugu
No ratings yet
Web Pract
Document38 pages
Web Pract
Niraj Kumar Gupta
No ratings yet
Assessment 2 Web
Document30 pages
Assessment 2 Web
prathampalgandhi10
No ratings yet
Sample Report
Document26 pages
Sample Report
34Shreya Chauhan
No ratings yet
4abd678e92b2763e40f27e4a40cb2f71
Document19 pages
4abd678e92b2763e40f27e4a40cb2f71
Ina Ina
No ratings yet
Index
Document8 pages
Index
Host Galaxy
No ratings yet
MODEL
Document8 pages
MODEL
Lalit Purohit
No ratings yet
19bce0722 VL2021220104488 Ast03
Document17 pages
19bce0722 VL2021220104488 Ast03
Rajvansh Singh
No ratings yet
Prectical 11 To Prectical 22 RWPD
Document77 pages
Prectical 11 To Prectical 22 RWPD
KUSHAL TRIVEDI
No ratings yet
Materi Uts Prak PWL
Document13 pages
Materi Uts Prak PWL
Kellin Kel
No ratings yet
Youtubedownloadersite Com
Document16 pages
Youtubedownloadersite Com
Anu Agrawal
No ratings yet
SESION 10 (Pandas 2)
Document120 pages
SESION 10 (Pandas 2)
2marlenehh2003
No ratings yet
Forkamuajaa
Document3 pages
Forkamuajaa
Yoga Ku
No ratings yet
Web Lab File
Document34 pages
Web Lab File
saloniaggarwal0304
No ratings yet
Designschool Coded, Free Css Template With PSD To HTML Tutorial
Document15 pages
Designschool Coded, Free Css Template With PSD To HTML Tutorial
Саша Крстић
No ratings yet
Resume Code
Document3 pages
Resume Code
abhimittal309
No ratings yet
HTML5 Essentials
Document62 pages
HTML5 Essentials
mani weber
No ratings yet
Web Technology Lab
Document43 pages
Web Technology Lab
venkata rama krishna rao junnu
No ratings yet
Skrip 123
Document12 pages
Skrip 123
Haji Bimi
No ratings yet
AIT Practical (WORD)
Document37 pages
AIT Practical (WORD)
Vaibhav Shinde
No ratings yet
Koding Blog Lama
Document71 pages
Koding Blog Lama
Anonymous ciLHcA5N
No ratings yet
Kajian
Document112 pages
Kajian
kaliber ilmu
No ratings yet
HTML Footer-With-Button-Logo
Document2 pages
HTML Footer-With-Button-Logo
josieleq
No ratings yet
Portfolio
Document6 pages
Portfolio
samirmonal2005
No ratings yet
HTML 6
Document2 pages
HTML 6
Riyan Casino
No ratings yet
App Landing Page Source Code
Document6 pages
App Landing Page Source Code
malik
No ratings yet
4 AppLandingPagee PDF
Document6 pages
4 AppLandingPagee PDF
VictorBotez
No ratings yet
WOT Practicle File: Manjotpal Singh UE188058 4 SEM It Section-1 (Gp-3)
Document36 pages
WOT Practicle File: Manjotpal Singh UE188058 4 SEM It Section-1 (Gp-3)
e5y4uh
No ratings yet
New Projectcode
Document4 pages
New Projectcode
DAOUD
No ratings yet
SR No. List of Practical Date Sign
Document40 pages
SR No. List of Practical Date Sign
Chintan Tripathi
No ratings yet
Web Development Internship: Learnovate Ecommerce
Document22 pages
Web Development Internship: Learnovate Ecommerce
Mihir Chopra
No ratings yet
Web Tech Lab
Document11 pages
Web Tech Lab
Ravi Sharma
No ratings yet
Web Design in 4 Minutes
Document8 pages
Web Design in 4 Minutes
Akash Katyayana
100% (1)
Filters
Document268 pages
Filters
Antonio Sanchez
No ratings yet
Question One
Document28 pages
Question One
MySticaL Me:
No ratings yet
10
Document30 pages
10
jesudoss
No ratings yet
1
Document27 pages
1
jesudoss
No ratings yet
Pip Install Pip Install Folium: Django
Document8 pages
Pip Install Pip Install Folium: Django
Dirga Daniel
No ratings yet
Experiment Number 1:: What Is Bootstrap?
Document45 pages
Experiment Number 1:: What Is Bootstrap?
Kirandeep Kaur061
No ratings yet
Practical: 5: AIM: Design A Django Project For "Django Blog"
Document8 pages
Practical: 5: AIM: Design A Django Project For "Django Blog"
Shubham waykar
No ratings yet
Google Voice - Ant
Document307 pages
Google Voice - Ant
Dig81767
No ratings yet
Spring Boot - H2 DB Spring Boot - JPA+H2 DB: Page 1 of 15
Document15 pages
Spring Boot - H2 DB Spring Boot - JPA+H2 DB: Page 1 of 15
Prem Vinodh
No ratings yet
Program:1
Document39 pages
Program:1
meghna0827
No ratings yet
CSS Grid Layout: 5 Practical Projects
From Everand
CSS Grid Layout: 5 Practical Projects
Craig Buckler
No ratings yet
Cross Browser Testing
Document17 pages
Cross Browser Testing
Kaushik Mukherjee
No ratings yet
Hosts
Document245 pages
Hosts
leonel Umbrella
No ratings yet
Zopener Zoom
Document4 pages
Zopener Zoom
Anubrata
No ratings yet
Install MongoDB On AMAZON Linux, CentOS7 and Redhat.
Document5 pages
Install MongoDB On AMAZON Linux, CentOS7 and Redhat.
Atulr
No ratings yet
Educational Backlinks
Document9 pages
Educational Backlinks
Shahzad Hashmi
No ratings yet
Auto Page Reload On Error
Document1 page
Auto Page Reload On Error
Showunmi Oludotun
No ratings yet
List of Some Free SEO Friendly Directories
Document28 pages
List of Some Free SEO Friendly Directories
shaitansingh
100% (4)
Cookie Test
Document7 pages
Cookie Test
amresh kumar
No ratings yet
Ew B Clear Cache
Document13 pages
Ew B Clear Cache
Magdum Engineering
No ratings yet
+5213311522056 20180622
Document229 pages
+5213311522056 20180622
David Lopez Bernal
No ratings yet
Nuevo Documento de Texto
Document2 pages
Nuevo Documento de Texto
Miguel Angel Mendoza Castro
No ratings yet
Scribd Cookie
Document4 pages
Scribd Cookie
micheal
No ratings yet
WEB2000-Building A Web Service
Document22 pages
WEB2000-Building A Web Service
Vicente RJ
No ratings yet
Data
Document226 pages
Data
Samuel Terrell
No ratings yet
URL Trasadfdfgsdfgh
Document11 pages
URL Trasadfdfgsdfgh
gabriel
No ratings yet
Tampermonkey Autorefresh-1
Document2 pages
Tampermonkey Autorefresh-1
JOSE
No ratings yet
Fblog2021 09 01
Document1 page
Fblog2021 09 01
Kipas Angin
No ratings yet
Optimizing SharePoint Server 2013 Websites For Internet Search Engines
Document33 pages
Optimizing SharePoint Server 2013 Websites For Internet Search Engines
mharakeh1
No ratings yet
Script X Ss Page
Document545 pages
Script X Ss Page
md alam
No ratings yet
URL Vs URI: Most Important Differences You Must Know What Is The URL?
Document7 pages
URL Vs URI: Most Important Differences You Must Know What Is The URL?
Sudheer Kumar
No ratings yet
Semrush-Domain Overview (Desktop) - Urbansole Com pk-15th Mar 2023
Document8 pages
Semrush-Domain Overview (Desktop) - Urbansole Com pk-15th Mar 2023
moosa novaleathers
No ratings yet
Stuti
Document17 pages
Stuti
Nimit Hirpara
No ratings yet
Keyword Research Checklist: Step 1 Status
Document3 pages
Keyword Research Checklist: Step 1 Status
Corona TV
100% (6)
CC Dorks
Document6 pages
CC Dorks
Smith Rels
No ratings yet
Seminar Report: Submitted By: Aanchal Garg CSE
Document22 pages
Seminar Report: Submitted By: Aanchal Garg CSE
Abhijit Singh Dahiya
No ratings yet
Auto Followers FB 100 % Working
Document15 pages
Auto Followers FB 100 % Working
Kevin Blevins
No ratings yet
Advanced SEO Interview Questions and Answers
Document41 pages
Advanced SEO Interview Questions and Answers
Nirav Thakkar
No ratings yet
XML Document For Huawei m860 Android Smart Phone Metro Pcs Versin
Document136 pages
XML Document For Huawei m860 Android Smart Phone Metro Pcs Versin
Jimmy Green
No ratings yet