
$x("//table[@class='maintable']/tbody/tr/td/table[1]/tbody/tr[2]/td/p/a/child::text()")
xpath for link text
//span[contains(@class, 'myclass') and text() = 'qwerty']
//p[@class="main-content"]//text()
This returns three text nodes: This is sample paragraph with, link and inside.
<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>
I have the same goal, namely, extracting:
Central Intelligence Agency
Culinary Institute of America
Can I selectively choose which tags are excluded?
I've tried things like (for removing 'Military'):
id('mw-content-text')/ol/li[not(self::small)]
but that predicate is tested against each 'li' element itself, and an 'li' is never a 'small', so nothing is filtered out.
And if I do something similar
id('mw-content-text')/ol/li/*[not(self::small)]
then I'm only filtering on the children, and even though I successfully throw away 'Military', I've also thrown away 'Central', 'Culinary', i.e. text from the parent.
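A minimal Python sketch of why the child filter loses the parent's text. It uses the standard library's xml.etree.ElementTree, which is a choice made for this illustration and not part of the attempts above. ElementTree supports only a small XPath subset, so the not(self::small) test is done in plain Python here.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question, with the wrapped lines joined.
HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""

root = ET.fromstring(HTML)

# Equivalent of li/*[not(self::small)]: element children of each li, minus small.
children = [e for e in root.findall("./ol/li/*") if e.tag != "small"]
child_tags = [e.tag for e in children]

# Text contained in those elements only. 'Central', 'Culinary' and 'of' are
# text nodes of the li itself, so they never appear here.
child_texts = ["".join(e.itertext()) for e in children]

print(child_tags)   # ['a', 'a', 'a', 'br']
print(child_texts)  # ['Intelligence Agency', 'Institute', 'America', '']
```

The `*` step selects elements only, so any text sitting directly inside the li is never a candidate for the filter in the first place.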
I had understood the tree to be something like:
div -- li -- small -- Military
          -- Central
          -- a -- Intelligence Agency
    -- li -- Culinary
          -- a -- Institute
          -- of
          -- a -- America
          -- br
          -- Renowned cooking school.
Is that correct? Is there a way to say 'text nodes of li and li's descendants EXCEPT descendants of small'? How about '... EXCEPT a br element and all following text nodes'?
Again, a (partial) Pythonic solution is also acceptable, though XPath is preferred.
After sitting down to read Chapter 6, 'XPath and XPointer', of 'Learning XML, Second Edition' by Erik Ray, I think I've got a grasp on it. I came up with the following formulation:
id('mw-content-text')/ol/li//text()[not(parent::small) and not(preceding-sibling::br)]
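For the Python route, here is a rough standard-library sketch that mimics the two predicates in the expression above. ElementTree cannot evaluate text() steps, so the helper kept_text (a hypothetical name made up for this example) walks the tree by hand. It is an emulation under that assumption, not the XPath engine itself.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question, with the wrapped lines joined.
HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""


def kept_text(elem):
    """Yield descendant text nodes of elem, skipping any whose parent is
    <small> or that has a preceding-sibling <br>."""
    # elem.text is the first text node inside elem; it has no preceding siblings.
    if elem.tag != "small" and elem.text:
        yield elem.text
    seen_br = False
    for child in elem:
        yield from kept_text(child)
        seen_br = seen_br or child.tag == "br"
        # child.tail is a text node whose parent is elem, located after child.
        if elem.tag != "small" and not seen_br and child.tail:
            yield child.tail


root = ET.fromstring(HTML)
items = [
    " ".join("".join(kept_text(li)).split())  # join pieces, normalise whitespace
    for li in root.findall("./ol/li")
]
print(items)  # ['Central Intelligence Agency.', 'Culinary Institute of America.']
```

Note that the trailing period survives, because '.' is a text node of the li that passes both predicates. Strip it afterwards if it is not wanted.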
