This article was inspired by a recent StackOverflow thread.
If you’re dealing with a third-party and/or unmodifiable library that returns a
DOMDocument object and you’re not sure how to get at a certain node, there’s really only one option:
    $dom = some_function_that_returns_a_dom_document();
    $xml = simplexml_load_string($dom->saveXML());
    $nodes = $xml->xpath('/my/expression/to/nodes');
The DOMDocument library came into existence during the transition from PHP 4 to PHP 5, and was one of the first to provide a Java-style OO design. It’s reviled by many day-to-day PHP developers because:
- It’s poorly documented
- It attempts to encompass the entire world of XML
- It doesn’t work like you’d think it would
- It wasn’t designed to solve the problems of most PHP developers
To be clear, there are some things you can only do with DOMDocument, and its existence is important to people who read, understand, and have strong opinions about the minutiae of W3C specifications. For everyone else, it’s a confusing bottleneck that’s better avoided completely.
Consider the following XML snippet:
    <foo>
        <bar id="main_bar">The node we're after</bar>
        <bar>The node we despise.</bar>
    </foo>
and the PHP code you might write to fetch the first bar node from it:
    $xml = '<?xml version="1.0" ?>
    <foo>
        <bar id="main_bar">The node we\'re after</bar>
        <bar>The node we despise.</bar>
    </foo>';

    // Load the XML string into a DOMDocument
    $dom = new DOMDocument();
    $dom->loadXML($xml);

    // Try to fetch the node whose id attribute is "main_bar"
    $node = $dom->getElementById('main_bar');
    var_dump($node);
The above code ends up dumping a null value. This usually sends the developer writing the code on a quest to find out why their data is being loaded into the document incorrectly. Hours or days later, they finally realize that this bit of cryptic instruction from the documentation
For this function to work, you will need either to set some ID attributes with DOMElement::setIdAttribute or a DTD which defines an attribute to be of type ID. In the later case, you will need to validate your document with DOMDocument::validate or DOMDocument->validateOnParse before using this function.
actually applies to them. In an XML document, the idea of an ID is abstract. An ID attribute is NOT simply the attribute named id. The document type author needs to specify, in the DTD, which attribute should be interpreted as an ID.
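To make the documentation's requirement concrete, here is a minimal sketch of the DTD route. The inline DTD is an assumption written for this example (it is not part of the original snippet); it declares the id attribute of bar to be of type ID, and validateOnParse asks the parser to validate against it while loading.

```php
<?php
// Assumption: this inline DTD was written for the example. It declares that
// the "id" attribute on <bar> is of type ID.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE foo [
  <!ELEMENT foo (bar+)>
  <!ELEMENT bar (#PCDATA)>
  <!ATTLIST bar id ID #IMPLIED>
]>
<foo>
  <bar id="main_bar">The node we're after</bar>
  <bar>The node we despise.</bar>
</foo>
XML;

$dom = new DOMDocument();
// Must be set before loading so the DTD is applied during the parse.
$dom->validateOnParse = true;
$dom->loadXML($xml);

// With the ID declared, the lookup should return a DOMElement, not null.
$node = $dom->getElementById('main_bar');
var_dump($node instanceof DOMElement);
```

This only helps if you control the XML, or are willing to bolt a DTD onto someone else's document.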
So, as a client programmer, we’ve ended up with a contextless bit of XML that has no DTD, and we’re trying to get at a specific element using getElementById. Our only options are to
- Find or write an entire DTD for the document
- Give up on getElementById and find another way at our data
- Get a reference to the element so we can use setIdAttribute to manually set which attribute should act as an ID
Let the last option sink in. Following a set of logical steps, we’ve reached the conclusion that in order to get a reference to an element we first need to get a reference to the element.
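The circularity is easier to see in code. The sketch below is one way to take the setIdAttribute route, assuming you are willing to use DOMXPath to find the elements first. Note that the XPath query that locates them could just as easily have returned the node we wanted in the first place.

```php
<?php
$xml = '<?xml version="1.0" ?>
<foo>
    <bar id="main_bar">The node we\'re after</bar>
    <bar>The node we despise.</bar>
</foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// Before any ID attributes are registered, the lookup fails.
$before = $dom->getElementById('main_bar');

// Find every element that carries an attribute named "id" ...
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@id]') as $element) {
    // ... and tell DOMDocument to treat that attribute as an ID.
    $element->setIdAttribute('id', true);
}

// Only now does getElementById have something to look up.
$after = $dom->getElementById('main_bar');

var_dump($before);                       // NULL
var_dump($after instanceof DOMElement);
```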
Adding to the confusion, give this code snippet a try:
    $html = '<html>
    <head>
        <title></title>
    </head>
    <body>
        <div id="the_man">
            Foo, Baz, Bar
        </div>
    </body>
    </html>';

    // Same lookup as before, but the document is loaded as HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $node = $dom->getElementById('the_man');
    var_dump($node);
Here we have almost identical code, but this time getElementById works. What gives? The difference is the loadHTML method. Behind the scenes, DOMDocument will load this document with an HTML DTD, and the HTML DTD assigns the attribute named id to be the ID attribute for all the nodes. (At least, I think that’s what’s going on.)
Perfectly logical if you’re an XML wonk working in 2005. Totally confusing if you’re a middleware developer in 2010 who has used getElementById with a document loaded via loadHTML and now has no idea why the same code won’t work with an XML document.
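To see the difference side by side, here is a small sketch that feeds the same markup to both loaders. The variable names are made up for the example; the only thing that changes between the two documents is the load method.

```php
<?php
// The same well-formed markup, loaded two different ways.
$markup = '<html><head><title></title></head>'
        . '<body><div id="the_man">Foo, Baz, Bar</div></body></html>';

$asXml = new DOMDocument();
$asXml->loadXML($markup);   // plain XML: nothing says "id" is an ID

$asHtml = new DOMDocument();
$asHtml->loadHTML($markup); // HTML: "id" is treated as an ID

$fromXml  = $asXml->getElementById('the_man');
$fromHtml = $asHtml->getElementById('the_man');

var_dump($fromXml);                         // NULL
var_dump($fromHtml instanceof DOMElement);
```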
My larger point isn’t to dump all over the existence of DOMDocument. It’s an important library. When you need that deep level of XML handling it has few peers, but if you found this page via a Google search because you can’t parse your document, there’s a 99% chance that isn’t you.
Instead, dump your DOMDocument object out as an XML string, import it into a simpler client library like SimpleXML, and then use its xpath method to get at what you want. You’ll spend less time wrestling with the esoteric, and more time getting your job done.
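Applied to the bar example from earlier, that advice might look something like the sketch below. The DOMDocument built at the top is a stand-in for whatever your third-party library returns, and the XPath expression is just one way of asking for the node.

```php
<?php
// Stand-in for the third-party call that hands back a DOMDocument.
$dom = new DOMDocument();
$dom->loadXML(
    '<foo><bar id="main_bar">The node we\'re after</bar>'
    . '<bar>The node we despise.</bar></foo>'
);

// Serialize the document and re-parse it with SimpleXML.
$xml = simplexml_load_string($dom->saveXML());

// Match on the attribute named "id" directly; no DTD required.
$nodes = $xml->xpath('//bar[@id="main_bar"]');
foreach ($nodes as $node) {
    echo (string) $node, "\n";
}

// simplexml_import_dom() converts without the round trip through a string,
// if you would rather skip the serialization step.
$sameXml = simplexml_import_dom($dom);
```

Because XPath compares attribute names and values as plain strings, it never needs to know whether id is a "real" ID.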