Stop Using DOMDocument Unless You Need It

astorm

This article was inspired by a recent StackOverflow thread.

If you’re dealing with a third-party and/or unmodifiable library that returns a DOMDocument object and you’re not sure how to get at a certain node, there’s really only one option:

$dom = some_function_that_returns_a_dom_document();
$xml = simplexml_load_string($dom->saveXML());
$nodes = $xml->xpath('/my/expression/to/nodes');

The DOMDocument library came into existence during the transition from PHP4 to PHP5, and was one of the first to provide a Java-style OO design. It’s reviled by many day-to-day PHP developers because:

  1. It’s poorly documented
  2. It attempts to encompass the entire world of XML
  3. It doesn’t work like you’d think it would
  4. It wasn’t designed to solve the problems of most PHP developers

To be clear, there are some things you can only do with DOMDocument, and its existence is important to people who read, understand, and have strong opinions about the minutiae of W3C specifications. For everyone else, it’s a confusing bottleneck that’s better avoided completely.

Example

Consider the following XML snippet:

<foo>
    <bar id="main_bar">The node we're after</bar>
    <bar>The node we despise.</bar>
</foo>

To anyone who’s used JavaScript’s DOM implementation, the following code looks like it should parse out the node we’re after:

$xml = '<?xml version="1.0" ?>
<foo>
    <bar id="main_bar">The node we\'re after</bar>
    <bar>The node we despise.</bar>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml);
// Looks like it should find the first <bar>, but no attribute has been declared as an ID
$node = $dom->getElementById('main_bar');
var_dump($node);

The above code ends up dumping a null value. This usually sends the developer on a quest to find out why their data is being loaded into the document incorrectly. Hours or days later, they finally realize that this cryptic bit of instruction from the documentation

For this function to work, you will need either to set some ID attributes with DOMElement::setIdAttribute or a DTD which defines an attribute to be of type ID. In the later case, you will need to validate your document with DOMDocument::validate or DOMDocument->validateOnParse before using this function.

actually applies to them. In an XML document, the idea of an id is abstract. An ID attribute is NOT simply the attribute named id. The document type author needs to declare, in the DTD, which attribute should be interpreted as an ID.
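To make the DTD route concrete, here’s a minimal sketch. It assumes you’re able to change the XML itself; the internal DTD subset below is my own illustration and isn’t part of the snippet above. It declares the id attribute on bar to be of type ID and switches on validateOnParse before loading, which is what the documentation asks for.

```php
<?php
// Hypothetical DTD (an assumption for this sketch): it declares that the
// attribute named "id" on <bar> is of type ID.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE foo [
    <!ELEMENT foo (bar+)>
    <!ELEMENT bar (#PCDATA)>
    <!ATTLIST bar id ID #IMPLIED>
]>
<foo>
    <bar id="main_bar">The node we're after</bar>
    <bar>The node we despise.</bar>
</foo>
XML;

$dom = new DOMDocument();
// Validate against the DTD while parsing, so ID attributes get registered.
$dom->validateOnParse = true;
$dom->loadXML($xml);

$node = $dom->getElementById('main_bar');
var_dump($node instanceof DOMElement);
```

With the declaration in place, getElementById should hand back the first bar element. Of course, this only helps if you’re able to add a DTD to the document in the first place.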

So, as client programmers, we’ve ended up with a contextless bit of XML that has no DTD, and we’re trying to get at a specific element using getElementById. Our only options are:

  1. Find or write an entire DTD for the document
  2. Give up on getElementById and find another way at our data
  3. Get a reference to the element so we can use the setIdAttribute method to manually set which attribute should act as an ID.

Let the last option sink in. Following a set of logical steps, we’ve reached the conclusion that in order to get a reference to an element we first need to get a reference to the element.
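Here’s a rough sketch of what options two and three look like in practice, using the same XML as before. The variable names are mine, and leaning on DOMXPath to find the elements is one possible way around the chicken-and-egg problem rather than the only one.

```php
<?php
$xml = '<?xml version="1.0" ?>
<foo>
    <bar id="main_bar">The node we\'re after</bar>
    <bar>The node we despise.</bar>
</foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// Option 2: give up on getElementById and query with XPath instead.
$xpath    = new DOMXPath($dom);
$viaXpath = $xpath->query('//bar[@id="main_bar"]')->item(0);

// Option 3: find every element that carries an "id" attribute (via XPath,
// since getElementById can't help yet) and flag that attribute as an ID.
foreach ($xpath->query('//*[@id]') as $element) {
    $element->setIdAttribute('id', true);
}
$viaId = $dom->getElementById('main_bar');

var_dump($viaXpath->textContent);
var_dump($viaId->isSameNode($viaXpath));
```

Note that by the time option three works, the XPath query has already found the node, which rather proves the point.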

Adding to the confusion, give this code snippet a try:

$html = '<html>
    <head>
        <title></title>
    </head>
    <body>
        <div id="the_man">
            Foo, Baz, Bar
        </div>
    </body>
</html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$node = $dom->getElementById('the_man');
var_dump($node);

Here we have almost identical code, but this time getElementById works. What gives? The difference is the loadHTML method. Behind the scenes, DOMDocument loads this document with an HTML DTD, and the HTML DTD declares the attribute named id to be an ID for every element. (At least, I think that’s what’s going on.)
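To see the difference side by side, the sketch below feeds one string to both loaders. The markup is a trimmed-down variation of the example above, chosen (by me, as an assumption for the demo) so that it’s both well-formed XML and acceptable HTML.

```php
<?php
// One string, two loaders.
$markup = '<html><head><title>t</title></head>'
        . '<body><div id="the_man">Foo, Baz, Bar</div></body></html>';

$asXml = new DOMDocument();
$asXml->loadXML($markup);   // plain XML: no attribute is registered as an ID

$asHtml = new DOMDocument();
$asHtml->loadHTML($markup); // HTML parser: id attributes are treated as IDs

var_dump($asXml->getElementById('the_man'));
var_dump($asHtml->getElementById('the_man') instanceof DOMElement);
```

Same string, same method call, different answers, depending entirely on which load method was used.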

Perfectly logical if you’re an XML wonk working in 2005. Totally confusing if you’re a middleware developer in 2010 who has used getElementById with a document loaded via loadHtml and now has no idea why the same code won’t work with an XML document.

My larger point isn’t to dump all over the existence of DOMDocument. It’s an important library. When you need that deep level of XML handling it has few peers, but if you found this page via a Google search because you can’t parse your document, there’s a 99% chance that isn’t you.

Save your DOMDocument object out as an XML string, import it into a simpler client library like SimpleXML, and then use its XPath methods to get at what you want. You’ll spend less time wrestling with the esoteric, and more time getting your job done.
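Put together, that advice looks something like the sketch below. The function is a hypothetical stand-in for whatever library hands you the DOMDocument, defined here only so the example runs on its own. The second half uses simplexml_import_dom, which wraps the existing tree instead of re-parsing a string; treat it as an alternative worth trying, not something the approach above depends on.

```php
<?php
// Hypothetical stand-in for a third-party call that returns a DOMDocument.
function some_function_that_returns_a_dom_document(): DOMDocument
{
    $dom = new DOMDocument();
    $dom->loadXML(
        '<foo><bar id="main_bar">The node we\'re after</bar>'
        . '<bar>The node we despise.</bar></foo>'
    );
    return $dom;
}

$dom = some_function_that_returns_a_dom_document();

// Round-trip through a string, then query with XPath.
$xml   = simplexml_load_string($dom->saveXML());
$nodes = $xml->xpath('/foo/bar[@id="main_bar"]');

// Alternative: wrap the existing tree without serializing it first.
$imported = simplexml_import_dom($dom);
$same     = $imported->xpath('/foo/bar[@id="main_bar"]');

echo (string) $nodes[0], PHP_EOL;
echo (string) $same[0], PHP_EOL;
```

Either way, an attribute named id is just an attribute as far as the XPath expression is concerned. No DTD required.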

Originally published October 27, 2010

Copyright © Alan Storm 1975 – 2017 All Rights Reserved
