Published by admin on 08 Aug 2008
Scrape site content with PHP5 DomXPath + Firebug
This is one of those “WoW” moments that make this game worth the time.
If you need to scrape information off another site, forget regular expressions and string parsing. PHP5 has wonderful DOM XPath functions that you can use to traverse the scraped page's DOM and retrieve your information. To make matters even easier for you, the aspiring, freebie-loving PHP enthusiast (that’s me!), you can get the XPath easily via the Firebug extension in Firefox.
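Now that we have our XPath, we are ready for some PHP magic. But before going forward, NOTE: Firefox automatically fixes invalid HTML, and the XPath that Firebug gives you reflects the fixed-up page rather than the raw source. For example, it adds tbody to every table that does not have it. Examine the actual page code and take out of your XPath any steps for this extra markup.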
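To see why the extra tbody matters, here is a minimal sketch. The inline table and both XPath strings are made up for illustration and are not taken from any real page. It runs the path as Firebug would report it and the same path with tbody removed against markup that has no tbody in its source.

```php
<?php
// Hypothetical page source: note there is no <tbody> in the raw markup.
$html = '<html><body><table>'
      . '<tr><td>first</td></tr>'
      . '<tr><td>second</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Path as Firebug would show it, including the tbody that Firefox inserted.
$fromFirebug = $xpath->query('/html/body/table/tbody/tr[2]/td');

// Same path with the tbody step removed, matching the actual source.
$fromSource = $xpath->query('/html/body/table/tr[2]/td');

echo "with tbody: ", $fromFirebug->length, " match(es)\n";
echo "without tbody: ", $fromSource->length, " match(es)\n";
```

PHP's DOMDocument::loadHTML does not add tbody the way Firefox does, so the first query should come back empty while the second finds the cell.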
And here is some sweet PHP5 goodness to make it all work. (In this example I’ll print out the href and anchor text of every link in a certain table row.)
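// Load the page to scrape (a local copy in this example)
$html = file_get_contents('dummy.html');
$dom = new DOMDocument();
// Real-world HTML is rarely valid; @ silences the parser warnings
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// XPath from Firebug, with the tbody steps that Firefox added taken out
$hrefs = $xpath->evaluate("/html/body/table/tr[2]/td[2]/table/tr/td/table/tr[2]//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href'); // link target
    $value = $href->nodeValue;          // anchor text
    echo "$url => $value<br />";
}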
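If you want to reuse this, one possible variation is to wrap it in a function. This is only a sketch: the helper name extract_links, the sample markup, and the shorter XPath are my own assumptions and are not part of the code above. It swaps the @ operator for libxml_use_internal_errors(), which keeps parser warnings quiet without hiding other errors, and loops with foreach instead of item().

```php
<?php
// Hypothetical helper (name and signature are assumptions for illustration):
// returns an array of href => anchor text for every node the XPath matches.
function extract_links($html, $query)
{
    // Collect HTML parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);

    $links = array();
    if ($nodes === false) { // the XPath expression itself was malformed
        return $links;
    }
    foreach ($nodes as $node) {
        $links[$node->getAttribute('href')] = trim($node->nodeValue);
    }
    return $links;
}

// Made-up sample markup standing in for dummy.html.
$html = '<html><body><table>'
      . '<tr><td><a href="/skip">Skip me</a></td></tr>'
      . '<tr><td><a href="/one">One</a> <a href="/two">Two</a></td></tr>'
      . '</table></body></html>';

foreach (extract_links($html, '/html/body/table/tr[2]//a') as $url => $text) {
    echo "$url => $text\n";
}
```

Because the result is keyed by href, two links pointing at the same URL collapse into one entry. Collect pairs in a plain list instead if you need every occurrence.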
Easy. No regular expressions, no string parsing. Just a couple of lines of PHP5 code.
