# Extracting HTML links using Nokogiri

Here are some common operations you might perform when parsing links in HTML, shown in both
`css` and `xpath` syntax.

Starting with this snippet:

require 'rubygems'
require 'nokogiri'

html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML

doc = Nokogiri::HTML(html)

## extracting all the links

We can use xpath or css to find all the `<a>` elements and then keep only the ones that have
an `href` attribute:

nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, the `.compact` is necessary because the search for `<a>` elements
also returns the "just a bookmark" element. That element has no `href` attribute, so
`element["href"]` evaluates to `nil` for it.

But we can use a more refined search to find just the elements that contain an `href`
attribute:

attrs = doc.xpath('//a/@href') # Get anchors with href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a[href]') # Get anchors with href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]

## finding a specific link

To find a link within the `<div id="block2">`:

nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"

nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use `at_xpath` or `at_css` instead:

attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"

element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"

## find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or
css-fu) comes in handy:

element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"

element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"

## find text from a link

For completeness, here's how you'd get the text associated with a particular link:

element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"

element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"

## useful references
In addition to the extensive [Nokogiri documentation][1], I came across some useful links
while writing this up:

* [a handy Nokogiri cheat sheet][2]
* [a tutorial on parsing HTML with Nokogiri][3]
* [interactively test CSS selector queries][4]

[1]: http://nokogiri.org/
[2]: https://github.com/sparklemotion/nokogiri/wiki/Cheat-sheet
[3]: http://ruby.bastardsbook.com/chapters/html-parsing/
[4]: http://try.jsoup.org/
