Parsing HTML with PHP

There’s a long-held bit of conventional wisdom that says you don’t want to parse HTML with regular expressions. While it’s a great way to learn regular expressions, it’s trivially easy to write something that’s only going to work correctly on a subset of all possible inputs.

For example, the first time you try to parse out an attribute you might do something like this:

preg_match('%<td colspan="([0-9])"%', '<td colspan="5"', $matches);

That works great for the simple case, but it’s going to fail on things that look like this

<td border="0" colspan="5"
<td colspan = "5"
<td colspan="idontknowhtmlbutthisworks"

It’s not that it’s impossible to write a single regular expression that handles all these cases, but that doing so quickly becomes a full-time job. Rather than working on your application or system, you’re working on your regular expression parsing, and writing code that’s fragile when it doesn’t have to be.
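
To make the failure concrete, here is a small sketch that runs the pattern from above against the original input and the three variants. The variable names ($pattern, $inputs, $results) are only for illustration.

```php
<?php
// The pattern from the example above, applied to each variant in turn.
$pattern = '%<td colspan="([0-9])"%';
$inputs = array(
    '<td colspan="5"',
    '<td border="0" colspan="5"',
    '<td colspan = "5"',
    '<td colspan="idontknowhtmlbutthisworks"',
);

$results = array();
foreach ($inputs as $input) {
    // preg_match returns 1 when the pattern matches and 0 when it does not.
    $matched = preg_match($pattern, $input, $matches) === 1;
    $results[$input] = $matched ? $matches[1] : null;
}

foreach ($results as $input => $value) {
    echo $input, ' => ', ($value === null ? 'no match' : $value), "\n";
}
```

Only the first input yields a value. The other three are things a browser would happily accept, and the pattern misses all of them.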

This is one of the problems that XHTML was going to solve. With XHTML, all HTML is also XML, and XML is well-formed by definition (or else it’s not XML). Because XML is a well-specified format, it’s possible to write a general parser that will handle all XML documents. As a programmer, you never have to answer the question “what should I do when I get unspecified input?”

The problem with XHTML as an authoring format is the same draconian error handling that certain programming types love. If you, as an author, make a single mistake when creating your document (or more commonly inserting content provided by others) the rendering of the document is supposed to fail.

This is unacceptable in a format meant to render to something that humans are going to read. It’s always better to see a partially mangled page that still has the information you’re looking for than it is to see an error saying the document can’t be displayed.

When the Mozilla team was faced with this dilemma they chose to take the following approach.

  1. If a document was delivered with a content-type of text/html, do NOT treat it as XML. Instead, pass it through the HTML parsing routines, even if it uses an XHTML DOCTYPE.

  2. If a document was delivered with an explicit XHTML content-type (application/xhtml+xml), then treat it as XML, with all the draconian error handling.

Internet Explorer, on the other hand, decided early on that “application/xhtml+xml” content should prompt the user with a download dialog. Microsoft stuck with this decision in the name of backwards compatibility, which means actual XHTML parsed as XML in the browser was dead on arrival.

If you’re interested in learning more about how browsers handle HTML vs. XHTML and why, the first part of this article is a great place to start.

Parsing for Mortals

That still leaves us with the problem of parsing HTML documents with PHP. Regular expressions are unreliable, and XML parsers won’t work reliably because the document may not be valid XML.

One common approach is to take an HTML document and clean it up so it’s valid XHTML (and therefore valid XML) using the tidy extension. Tidy is a library that’s meant to clean up poorly formatted HTML, and can be used to transform an HTML document into valid XML.

$string_with_my_file = '<html><body><foo></foo></body>';
$opt = array("output-xhtml" => true, "clean" => true);
// tidy_parse_string() takes markup; tidy_parse_file() expects a filename
$tidy = tidy_parse_string($string_with_my_file, $opt);
tidy_clean_repair($tidy);
echo $tidy;

Once passed through tidy, you can then use DOMDocument, SimpleXML, or any other XML parsing tool.
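
As a rough sketch of that hand-off, the helper below runs a string through tidy and loads the result with DOMDocument::loadXML. The function name html_to_dom_via_tidy is hypothetical, the sample markup is made up, and the numeric-entities option is my own addition so that named entities such as &nbsp; don’t trip up the XML parser. It assumes PHP 7.1 or later for the type hints, and it returns null when the tidy extension isn’t loaded.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null when tidy is missing
// or its output cannot be loaded as XML.
function html_to_dom_via_tidy(string $html): ?DOMDocument
{
    if (!extension_loaded('tidy')) {
        return null;
    }

    $options = array(
        'output-xhtml'     => true,
        'numeric-entities' => true, // write &#160; rather than &nbsp;
    );

    $tidy = tidy_parse_string($html, $options, 'utf8');
    if ($tidy === false) {
        return null;
    }
    tidy_clean_repair($tidy);

    $xhtml = tidy_get_output($tidy);
    if ($xhtml === '') {
        return null; // tidy produced nothing usable
    }

    $dom = new DOMDocument();
    return $dom->loadXML($xhtml) ? $dom : null;
}

// Usage: two unclosed paragraphs, which are not well-formed XML as written.
$dom = html_to_dom_via_tidy('<p>One<p>Two');
if ($dom !== null) {
    foreach ($dom->getElementsByTagName('p') as $paragraph) {
        echo trim($paragraph->textContent), "\n";
    }
}
```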

Dodgy Bachelor Servers

While useful, one problem with the tidy technique is that tidy’s not part of PHP’s default install/compile. The owner of a PHP system needs to explicitly enable this extension. It’s not a difficult technical challenge to enable tidy, but it’s often a difficult political one. The people making the decisions as to what extensions get installed may have valid (or invalid) reasons for not wanting tidy (or in general, any non-standard extensions) on their systems. This means the tidy approach isn’t fully portable in the real world.

Fortunately, there’s a second option. While the aforementioned DOMDocument extension is meant for parsing XML, it has a method named loadHTML. The loadHTML method accepts documents that aren’t well formed and does the hard work of parsing them into an XML tree that DOMDocument understands.

The PHP team achieved this magic by linking against the fantastic libxml library, which supports parsing non-well formed documents. Unfortunately, this presents a small problem.

In some cases, the loadHTML method will produce warnings and/or errors (not catchable Exceptions) when it’s parsing non-well-formed or non-valid documents. Consider the following

$string_with_my_file = '<html><body><foo></foo></body>';
$dom = DOMDocument::loadHTML($string_with_my_file);
echo $dom->saveXML();

On my MacBook running PHP 5.2.9, the above gives me

PHP Warning:  DOMDocument::loadHTML(): Tag foo invalid in Entity, line: 1 in /Users/alanstorm/Desktop/test.php on line 4

PHP Stack trace:
PHP   1. {main}() /Users/alanstorm/Desktop/test.php:0
PHP   2. DOMDocument::loadHTML() /Users/alanstorm/Desktop/test.php:4
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><foo/></body></html>

So DOMDocument consumes the document, but you get warnings. PHP has the capacity to intercept and/or suppress these kinds of warnings, but doing so has consequences. As these are Warnings and not Exceptions, the most direct choice from a client coder’s perspective is to use the error control operator.

$string_with_my_file = '<html><body><foo></foo></body>';
$dom = new DOMDocument();
// the @ operator silences any warnings raised by this one call
@$dom->loadHTML($string_with_my_file);
echo $dom->saveXML();

This will prevent the display of the Warning but you also lose any reporting on critical errors. Also, error suppression comes with performance implications. Your other choice is to override the handling of errors system/application wide, which can have far reaching consequences.
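
There is also a narrower option, sketched here on the assumption that the libxml functions bundled with PHP 5.1 and later are available. Calling libxml_use_internal_errors(true) asks libxml to collect its parse errors rather than raise PHP warnings, so they can be inspected afterwards instead of hidden. The helper name load_html_quietly is hypothetical, and the sketch calls loadHTML on a DOMDocument instance, which is the form current PHP versions expect.

```php
<?php
// Hypothetical helper: parses HTML without emitting PHP warnings and
// returns array(DOMDocument|null, LibXMLError[]).
function load_html_quietly(string $html): array
{
    // Collect libxml errors internally; remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    // Copy out whatever libxml recorded, then restore the old behaviour.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($loaded ? $dom : null, $errors);
}

list($dom, $errors) = load_html_quietly('<html><body><foo></foo></body>');

foreach ($errors as $error) {
    // Each entry is a LibXMLError with message, line, and level properties.
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

if ($dom !== null) {
    echo $dom->saveHTML();
}
```

This only changes how the problems are reported. The parsing itself is still libxml’s, and which errors it records can vary with the libxml version PHP was built against.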

Once again we’ve reached a point where the PHP developer has a method for parsing their HTML, but it’s a non-portable solution that isn’t adequate in certain environments.

HTML5 to the Rescue

A core belief of the folks who started the HTML5 Working Group is to deal with the reality of HTML as it exists in the real world. One of the fruits of their labor has been html5lib, “A ruby/python based HTML parser/tokenizer based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.”

The intent of the project is to provide the same level of robustness in parsing that a web browser will give you. No matter how malformed, if it shows up in a browser, html5lib’s intent is to read it. Despite the ruby/python mentioned in their project description, in July of 2009 a PHP version of the parser was released.

If you download and unzip the archive, you’ll find six core files

TreeBuilder.php
Tokenizer.php
Parser.php
named-character-references.ser
InputStream.php
Data.php

Parser.php is the one you care about. Using code something like the following

$document = HTML5_Parser::parse('<html><body><foo /></body>');
var_dump($document->saveHTML());

will give you a DOMDocument object with no warnings, and the original content preserved.

You can also use the library to parse partial documents into node fragments (represented as a DOMNodeList object)

$nodelist = HTML5_Parser::parseFragment('<b>I want to love semantic HTML, but it keeps filing restraining orders.</b><br>');    
echo get_class($nodelist)."\n";

Caveats, realities, etcetera

Like anything that sounds too good to be true, it is. From the README

Warning: This is a pre-alpha release, and as such, certain parts of this code are not up-to-snuff (e.g. error reporting and performance). However, the code is very close to spec and passes 100% of tests not related to parse errors. Nevertheless, expect to have to update your code on the next upgrade.

and

We don’t want to ultimately use PHP’s DOM because it is not tolerant of certain types of errors that HTML 5 allows (for example, an element “foo@bar”). But the current implementation uses it, since it’s easy. Eventually, this html5lib implementation will get a version of SimpleTree; and may possibly start using that by default.

That said, it’s certainly another tool in your box. Any parsing routine you come up with on your own is going to be pre-pre-alpha, and html5lib just may save you when other solutions fall short.

Originally published November 21, 2009