Manipulating HTML Using Nokogiri
Here are some common operations you might perform when parsing links in HTML, shown in both
`css` and `xpath` syntax.

    require 'rubygems'
    require 'nokogiri'
    html = <<HTML
    <div id="block1">
    <a href="http://google.com">link1</a>
    </div>
    <div id="block2">
    <a href="http://stackoverflow.com">link2</a>
    <a id="tips">just a bookmark</a>
    </div>
    HTML
    doc = Nokogiri::HTML(html)

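When you collect `href` values from every `<a>` element, for example with `map { |a| a["href"] }`,
a trailing `.compact` is necessary because the search also returns the "just a bookmark" element,
which has no `href` attribute and therefore yields `nil`.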
But we can use a more refined search to find just the elements that contain an `href`
attribute:

    nodeset = doc.xpath('//div[@id="block2"]/a/@href')
    nodeset.first.value # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use `at_xpath` or `at_css` instead:

    attr = doc.at_xpath('//div[@id="block2"]/a/@href')
    attr.value # => "http://stackoverflow.com"

    element = doc.at_xpath('//a[text()="link2"]')
    element["href"] # => "http://stackoverflow.com"

    element = doc.at_css('a:contains("link2")')
    element["href"] # => "http://stackoverflow.com"

    element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
    element.text # => "link2"

    element = doc.at_css('a[href="http://stackoverflow.com"]')
    element.text # => "link2"

## useful references
In addition to the extensive [Nokogiri documentation][1], I came across some useful links
while writing this up:
[1]: http://nokogiri.org/
[2]: https://github.com/sparklemotion/nokogiri/wiki/Cheat-sheet
[3]: http://ruby.bastardsbook.com/chapters/html-parsing/
[4]: http://try.jsoup.org/