a companion discussion area for blog.codinghorror.com

Parsing Html The Cthulhu Way


Can you please stop using PHP in a derogatory manner? After all, you’re the one actually advocating writing enterprise apps on Windows.


Parsing HTML with regular expressions = bad idea, granted, but locating something specific within HTML with regex = good idea. Want to find A tags: use regex. Want to locate images: use regex. Want to apply XSLT to HTML for the purpose of converting it to an RSS feed: use a dedicated parser. Html Agility Pack, Tidy, System.Html: all fine parsers, all easy to use, 99% of the result with 1% of the effort.

“A good artist copies. A great artist steals”. Leverage an API!
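To make the "leverage an API" point concrete, here is a minimal sketch in Python using only the standard-library `html.parser` module (not one of the parsers named above). The class name `LinkAndImageCollector`, the helper `collect`, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def collect(html):
    """Hypothetical helper: return (links, images) found in the markup."""
    collector = LinkAndImageCollector()
    collector.feed(html)
    collector.close()
    return collector.links, collector.images


# Illustrative input only; note the upper-case tag is still recognised.
sample = '<p>See <a href="/about">about</a> <IMG SRC="logo.png"></p>'
links, images = collect(sample)
print(links, images)
```

The parser deals with tag case and attribute quoting, which is most of what a hand-written pattern tends to get wrong.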


It’s not really about using regex vs. some other parsing method; it’s really just about the cohesion between the search mechanism and the rest of the software.

Regex has its place as a simple search mechanism. It’s easy to implement and generally gets the job done in a productive fashion. If the searching is complex, then a different mechanism should be used.

The only thing that would irk me is if the searching function call was located deep within a 1000 line module. I wouldn’t care at all if I had to replace a single search class.

If the project had unit tests, that makes replacing the algorithm even easier.
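A rough sketch of what I mean by a replaceable search class, in Python. Everything here (`RegexLinkFinder`, `ParserLinkFinder`, `count_links`, `check_finder`) is a hypothetical name invented for the example, and the regex is deliberately naive.

```python
import re
from html.parser import HTMLParser


class RegexLinkFinder:
    """Quick regex-based finder; fine for simple, predictable input."""

    _HREF = re.compile(r'<a\b[^>]*\bhref="([^"]*)"', re.IGNORECASE)

    def find_links(self, html):
        return self._HREF.findall(html)


class ParserLinkFinder:
    """Drop-in replacement backed by the standard-library HTML parser."""

    def find_links(self, html):
        found = []

        class _Collector(HTMLParser):
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    found.extend(v for k, v in attrs if k == "href" and v)

        collector = _Collector()
        collector.feed(html)
        collector.close()
        return found


def count_links(html, finder):
    # Callers depend only on find_links(), so the finder can be swapped.
    return len(finder.find_links(html))


def check_finder(finder):
    # A tiny shared test that any implementation has to pass.
    html = '<p><a href="/one">1</a> and <a class="x" href="/two">2</a></p>'
    assert finder.find_links(html) == ["/one", "/two"]


check_finder(RegexLinkFinder())
check_finder(ParserLinkFinder())
```

Because the rest of the code only ever calls `find_links()`, replacing the algorithm is a one-line change, and the shared check tells you whether the new one behaves the same on the cases you care about.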

I’m posting the Cthulhu picture on my wall at work anyway.

Great post.


“Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.”


Nice post. Good to see other developers considering the wrath of the Old Ones whilst they are working. Having Shub-Niggurath show up due to faulty error trapping is in no one’s interests.


It’s not Cthulhu. It’s ZA̡͊͠͝LGΌ.


Instead of HTML::Sanitizer, just point to http://search.cpan.org/search?query=html+parser&mode=all and let people pick one.


Meh… use the right tool for the job. And sometimes, that means using regexes - if you’re dealing with a consistently formed XML or HTML file, a simple regex may be a lot less effort than using a dedicated parser…


Didn’t we argue about this a year ago, and you dismissed me with “programming is hard, let’s go shopping”?

Yes, that’s right, you did: http://www.codinghorror.com/blog/archives/001172.html

Instead of putting your time into improving a working, open source HTML parser (which just recently added a selector engine), you wrote a bunch of hacky regex. Now you have 2 problems, wasted valuable development hours, and you deserve the pain for taunting my warnings.


Also, your wack-ass busted old moveeeablee typee cobol blogg enginne hath wacked my comment formatting. Bah.


I got downvoted on StackOverflow for saying that Regex is not the right solution for parsing HTML. It was offset by 11 upvotes, but some people will just never get it. It’s one thing to use a regex to tokenize HTML, but another thing entirely to use them as if HTML were a regular grammar.
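To illustrate the distinction, a small Python sketch: the regex only splits the input into tokens, and a plain stack (which a regular expression does not have) tracks the nesting. The names `TAG`, `VOID`, `tokenize` and `is_well_nested` are invented for this example, and the tokenizer is simplified on purpose: it ignores comments, script blocks, and attribute values containing `>`.

```python
import re

# The regex recognises individual tags; it never tries to match nesting.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")
# Small, incomplete set of elements that never take a closing tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


def tokenize(html):
    """Yield (kind, value) tokens: 'open', 'close', 'void' or 'text'."""
    pos = 0
    for match in TAG.finditer(html):
        if match.start() > pos:
            yield ("text", html[pos:match.start()])
        closing, name, self_closing = match.groups()
        name = name.lower()
        if closing:
            yield ("close", name)
        elif self_closing or name in VOID:
            yield ("void", name)
        else:
            yield ("open", name)
        pos = match.end()
    if pos < len(html):
        yield ("text", html[pos:])


def is_well_nested(html):
    """Check tag nesting with a stack, the part a regex alone cannot do."""
    stack = []
    for kind, value in tokenize(html):
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            if not stack or stack.pop() != value:
                return False
    return not stack


print(is_well_nested("<div><p>hi<br/></p></div>"))
print(is_well_nested("<div><p>hi</div></p>"))
```

The regex half is fine as a lexer; it is the stack that handles arbitrarily deep nesting.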


Jeff, didn’t you spend a considerable amount of time in one of the StackOverflow podcasts trying to convince Joel that it was OK for you to try and parse Markup with a bunch of regular expressions, despite the fact that it’s not a regular language and runs into a bunch of the same types of problems?


Whoops heh, that’s what I get for not looking at the date of the post… for some reason this just popped up in my rss reader again.


Back in the day I wrote my own C HTML parser, back before it was a solved problem. I even had my own version of xpath for it.


I was seduced by the RegExHtmlMonster. I woke up screaming and decided it was time to parse the nightmares away.


Jeff, how do you explain the popularity of syntax highlighters that use regular expressions?



This guy slaps Cthulhu across the face and laughs heartily!


If the ideology of this post and most of the comments is to be believed as gospel, then the following book will certainly make the baby Jesus cry…


LOL… now I understand bobince’s persistence in MY post about regex vs HTML: http://stackoverflow.com/questions/3951485/regex-extracting-only-the-visible-page-text-from-a-html-source-document

(…and maybe some of you would be amused by my own persistence, too 🙂)

However, as I stated numerous times in my comments, I wasn’t out to parse the HTML per se, but “merely” interested in a much coarser extraction. And for my purposes, the regex approach works - it’s a tradeoff between efficiency and total robustness. But the outcome is surprisingly solid. The final implementation can be found here: http://www.martinwardener.com/regex/

Mind you, regarding the “secondary” issue (extracting all links/URLs from an HTML document), it is of no concern that this implementation is over-eager (by design, btw) and picks out a few invalid URLs (mostly pertaining to script blocks) - those will be filtered out during the subsequent URL validation anyway.
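The general shape of that "over-eager extraction, then validation" idea, as a Python sketch. This is not the implementation linked above; the pattern `CANDIDATE`, the function names, and the validation rule (http/https scheme plus a host) are assumptions made up for the example.

```python
import re
from urllib.parse import urlsplit

# Deliberately greedy: grab any quoted href/src value, wherever it appears.
CANDIDATE = re.compile(r"""(?:href|src)\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def extract_candidates(html):
    """Return every href/src value the pattern finds, valid or not."""
    return CANDIDATE.findall(html)


def is_valid_url(candidate):
    """Assumed rule for this sketch: absolute http(s) URL with a host."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def extract_urls(html):
    """Over-eager extraction followed by a validation pass."""
    return [c for c in extract_candidates(html) if is_valid_url(c)]


# Illustrative page: the script block produces a bogus candidate.
page = """
<a href="http://example.com/a">A</a>
<img src="https://example.com/logo.png">
<script>var tpl = 'href="{placeholder}"';</script>
"""

print(extract_candidates(page))
print(extract_urls(page))
```

The extraction step happily picks up `{placeholder}` from the script block, and the validation step throws it away, so the regex does not need to know anything about scripts.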