Manipulating HTML Using Nokogiri

Uploaded by rdpoor

# Extracting HTML links using Nokogiri

Here are some common operations you might do when parsing links in HTML, shown in both
`css` and `xpath` syntax.

Starting with this snippet:

    require 'rubygems'
    require 'nokogiri'

    html = <<HTML
    <div id="block1">
    <a href="http://google.com">link1</a>
    </div>
    <div id="block2">
    <a href="http://stackoverflow.com">link2</a>
    <a id="tips">just a bookmark</a>
    </div>
    HTML

    doc = Nokogiri::HTML(html)

## extracting all the links

We can use xpath or css to find all the `<a>` elements and then keep only the ones that have
an `href` attribute:

    nodeset = doc.xpath('//a') # Get all anchors via xpath
    nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

    nodeset = doc.css('a') # Get all anchors via css
    nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, `.compact` is necessary because the search for `<a>` elements also
returns the "just a bookmark" anchor, which has no `href` attribute, so `element["href"]`
is `nil` for it.

But we can use a more refined search to find just the elements that contain an `href`
attribute:

    attrs = doc.xpath('//a/@href') # Get the href attributes of anchors via xpath
    attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]

    nodeset = doc.css('a[href]') # Get anchors with an href attribute via css
    nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]

## finding a specific link

To find a link within the `<div id="block2">`:

    nodeset = doc.xpath('//div[@id="block2"]/a/@href')
    nodeset.first.value # => "http://stackoverflow.com"

    nodeset = doc.css('div#block2 a[href]')
    nodeset.first['href'] # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use `at_xpath` or `at_css` instead:

    attr = doc.at_xpath('//div[@id="block2"]/a/@href')
    attr.value # => "http://stackoverflow.com"

    element = doc.at_css('div#block2 a[href]')
    element['href'] # => "http://stackoverflow.com"

## find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or
css-fu) comes in handy:

    element = doc.at_xpath('//a[text()="link2"]')
    element["href"] # => "http://stackoverflow.com"

    element = doc.at_css('a:contains("link2")')
    element["href"] # => "http://stackoverflow.com"

## find text from a link

For completeness, here's how you'd get the text associated with a particular link:

    element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
    element.text # => "link2"

    element = doc.at_css('a[href="http://stackoverflow.com"]')
    element.text # => "link2"

## useful references

In addition to the extensive [Nokogiri documentation][1], I came across some useful links
while writing this up:

* [a handy Nokogiri cheat sheet][2]
* [a tutorial on parsing HTML with Nokogiri][3]
* [interactively test CSS selector queries][4]

[1]: http://nokogiri.org/
[2]: https://github.com/sparklemotion/nokogiri/wiki/Cheat-sheet
[3]: http://ruby.bastardsbook.com/chapters/html-parsing/
[4]: http://try.jsoup.org/
