Parsing Html The Cthulhu Way

Among programmers of any experience, it is generally regarded as A Bad Idea™ to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Why would someone use a home-grown parser? Because it works in 99% of cases. Why not use a full-blown parser engine? Because it doesn’t win in 100% of cases: in terms of cost and performance, it doesn’t work. It saves you the 1% but loses you the 99%. That’s why.

Is this considered yet another awesome comment?
http://www.codinghorror.com/blog/archives/001130.html

Jeff… ummm ahhhh well, didn’t you build an HTML sanitizer that uses regular expressions? http://refactormycode.com/codes/333-sanitize-html

I think it’s more of a reference to House of Leaves than to Lovecraft. It’s a good thing Mark Z. Danielewski didn’t know about the wonders of Unicode, though…

Hey, thanks for outing me, you ass.

I am reminded that every time you try to solve a problem with regular expressions, you now have two problems: the original problem, and the regular expressions used to solve it.

I think it should be mentioned that if you can create a fully valid HTML parser, you’ve just created the core of a web browser.

Not a small project…

Jeff, are you high? Tell me you are not still trying to justify using regexes on HTML. On ANY HTML…

sigh

Look, even though parsing libraries have a lot of code to them, so does any regex implementation. In other words, the code paths are similar in terms of complexity and execution cycles. You are just choosing the wrong method, the one that is going to be incomplete, buggy, and difficult to maintain, because it is what you know.

It’s just wrong. Think of the children.

The ignorance problem is that many, many developers don’t know, or don’t believe, that HTML is not a regular language.
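A concrete Python sketch of what that means in practice (the HTML string here is invented): a regex has no memory of nesting depth, so a non-greedy match simply stops at the first closing tag it sees.

import re

html = '<div class="outer">a <div class="inner">b</div> c</div>'

# The non-greedy match ends at the FIRST </div>, silently dropping ' c'
m = re.search(r'<div class="outer">(.*?)</div>', html, re.S)
print(m.group(1))  # => 'a <div class="inner">b'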

And when you do accept that you’re going for a 98% solution and using regular expressions on HTML, you have to be very aware of potentially creating cross-site scripting vulnerabilities.
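For instance, a hedged Python sketch (the payloads are the classic illustrative ones, nothing exotic) of how a naive regex sanitizer gets bypassed:

import re

def naive_sanitize(markup):
    # Strips only lowercase, well-formed <script> blocks
    return re.sub(r'<script>.*?</script>', '', markup, flags=re.S)

print(naive_sanitize('<ScRiPt>alert(1)</ScRiPt>'))          # case variation sails straight through
print(naive_sanitize('<scr<script></script>ipt>alert(1)'))  # stripping the inner tag reassembles '<script>alert(1)'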

The saddest thing is that in Ruby, parsing HTML with a real parser like Hpricot is easier than using a regexp, but that doesn’t stop some people from writing

article_contents = string.scan(/<div class="article">(.*?)<\/div>/m)

instead of

require 'hpricot'
(Hpricot(string)/'div.article').inner_html

And then they get all confused when a nested <div>blahblah</div> breaks everything.

Bah, if you’re going to declare no HTML in your comments, you could at least escape the HTML for people.

Jeff is happy to talk about how nasty some practice is, as long as he still gets to justify when he did it himself. Jeff does not admit mistakes easily.

Simple things like finding all the href attributes in a document are easily accomplished with a regex. But once you get into trying to match opening and closing tags, yeah, it becomes hopeless.
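Something like this Python sketch, say (deliberately loose; it still misses single-quoted and unquoted attribute values, which is the 98% point all over again):

import re

html = '<a href="/home">Home</a> <a href="https://example.com">Out</a>'
print(re.findall(r'href="([^"]*)"', html))  # => ['/home', 'https://example.com']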

XPath > RE
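For comparison, the XPath version of the href extraction above, sketched with the lxml bindings (not mentioned elsewhere in this thread; they wrap libxml2):

from lxml import html

doc = html.fromstring('<p><a href="/home">Home</a> <a href=about>About</a></p>')
print(doc.xpath('//a/@href'))  # => ['/home', 'about'] -- even the unquoted attribute is handled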

The captcha is ridiculous!

Seriously, the demonic sounds from hell when I click the audio help are more blatantly evil and stupid than suggesting HTML should be parsed with a regex!

"You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.

Neither does this parser."

http://www.crummy.com/software/BeautifulSoup/
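A minimal sketch of that forgiveness, using the modern bs4 package (the tag soup is made up):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Unclosed <b>tags <p>everywhere', 'html.parser')
print([p.get_text() for p in soup.find_all('p')])  # the parser closes the tags for you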

Kevin Peterson shares my concern: Where are the parsers that don’t barf on bad HTML?

Matching nested parentheses (from Mastering Regular Expressions, 2nd edition, pages 330-331)

my $LevelN; # This must be predeclared because it's used in its own definition
$LevelN = qr/ \(( [^()] | (??{ $LevelN }) )* \) /x;

This matches arbitrarily nested parenthesized text…

So I think, given that construct, it should be possible to generalize this to parse arbitrary nested HTML tags, including arbitrary JavaScript &c.
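For anyone who wants to poke at that construct without Perl, here is the same recursive trick ported to Python's third-party regex module (a sketch; the stdlib re module cannot recurse):

import regex  # pip install regex

# (?R) recurses into the whole pattern, so the nesting depth is unbounded
balanced = regex.compile(r'\( (?: [^()] | (?R) )* \)', regex.VERBOSE)
print(balanced.findall('f(a(b)c) + g(d)'))  # => ['(a(b)c)', '(d)']

Which rather proves the point made above: once you need recursion, you have left regular languages behind and are writing a recursive-descent parser in regex clothing.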

Also, anyone thinking about letting users put up HTML content from their Rich Text Editor that Generates HTML Output should check out rsnake’s XSS page first (http://ha.ckers.org/xss.html). It becomes apparent that the problem is all the weird quirks of all the different versions of all the browsers out there. And remember, the hackers aren’t going to be using your Rich Text Editor; they’re just going to be submitting Evil HTML of their Own Construction directly, probably using curl or something. So you’ll be trying to sanitize arbitrary HTML snippets such that they can’t cause problems on any browser, most of which are not installed on your system right now, and until you went to that page you probably didn’t even know about all those possible ways of getting scripts to run. And that list can only get longer, not shorter. So save yourself some headache and use some other kind of markup that you translate to HTML very carefully.
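In that spirit, a minimal Python sketch of the escape-first approach (the one-star mini-markup is invented purely for illustration): escape everything the user sent, then translate only your own tiny whitelist of syntax back into HTML.

import html
import re

def render_comment(text):
    safe = html.escape(text)  # step 1: neutralize ALL user-supplied HTML
    return re.sub(r'\*(.+?)\*', r'<b>\1</b>', safe)  # step 2: reintroduce only our own markup

print(render_comment('*bold* but <script>alert(1)</script> stays inert'))
# => <b>bold</b> but &lt;script&gt;alert(1)&lt;/script&gt; stays inert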

To add another HTML parser to the list, there’s also libxml2’s HTMLParser. It’s probably the best open source HTML parser in C.

http://www.xmlsoft.org/html/libxml-HTMLparser.html
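From Python, that same parser is reachable through the lxml bindings; a small sketch of its error recovery (recover=True is in fact the default for HTML input):

from lxml import etree

parser = etree.HTMLParser(recover=True)
tree = etree.fromstring('<p>broken <b>markup', parser)
print(etree.tostring(tree))  # libxml2 wraps it in <html><body> and closes both tags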