This is one of those “wow” moments that make this game worth the time.

If you need to scrape information off another site, forget regular expressions and string parsing. PHP5 has wonderful DOM and XPath classes that you can use to traverse the scraped page's DOM and retrieve your information. To make matters even easier for you, aspiring, freebie-loving PHP enthusiast (that’s me!), you can get the XPath easily via the Firebug extension in Firefox.

[Image: Firebug XPath information]

Now that we have our XPath, we are ready for some PHP magic. But before going forward, NOTE: Firefox automatically fixes invalid HTML. For example, it adds tbody to every table that does not have it, so the XPath Firebug shows you may contain elements that are not in the page source. Examine the raw page code and take this extra markup out of the XPath.
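As a rough sketch of that cleanup step: the path below is a hypothetical one, made up to look like what Firebug might hand you, and the snippet assumes the raw page source contains no tbody elements at all (check this first, otherwise you would strip steps that really exist). It also assumes the tbody steps carry no index such as tbody[1].

```php
<?php
// Hypothetical XPath as copied from Firebug; the tbody steps were added by Firefox.
$firebugPath = '/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/table/tbody/tr[2]';

// Assumption: the raw HTML has no <tbody> anywhere, so every tbody step can go.
$sourcePath = str_replace('/tbody', '', $firebugPath);

echo $sourcePath, "\n";
```

If the page source does include tbody in some tables, edit the path by hand instead and only remove the steps Firefox invented.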

And here is some sweet PHP5 goodness to make it all work. (In this example I’ll print out the hrefs and anchor text for all links in a certain table row.)


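// Load the scraped page (a local copy here).
$html = file_get_contents('dummy.html');

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Every link anywhere inside the second row of the innermost table.
$hrefs = $xpath->query("/html/body/table/tr[2]/td[2]/table/tr/td/table/tr[2]//a");

foreach ($hrefs as $href) {
    $url = $href->getAttribute('href'); // link target
    $value = $href->nodeValue;          // anchor text
    echo "$url  => $value<br />";
}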
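If you want to try the approach without a dummy.html on disk, here is a small self-contained sketch. The inline markup, the shortened XPath, and the $links array are all made up for illustration; only the DOMDocument and DOMXPath calls are the same ones used above.

```php
<?php
// Hypothetical markup standing in for a scraped page.
$html = <<<HTML
<html><body>
<table>
  <tr><td><a href="/skip">Not in row 2</a></td></tr>
  <tr>
    <td><a href="/one">First</a></td>
    <td><span><a href="/two">Second</a></span></td>
  </tr>
</table>
</body></html>
HTML;

// Collect parser warnings quietly instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// href => anchor text for every link inside the second table row.
$links = array();
foreach ($xpath->query('/html/body/table/tr[2]//a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->nodeValue);
}

print_r($links);
```

Note that the path has no tbody in it, because the markup has none. The trailing //a picks up links at any depth inside the row, which is why the link wrapped in a span is found as well.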

Easy. No regular expressions, no string parsing. Just a couple of lines of PHP5 code.