Parsing Html The Cthulhu Way

Find me a good HTML parsing engine and I will gladly use it. Tidy is the best I have found so far, and you still have to do quite a few rounds of additional cleaning with regexes after Tidy is done.
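
A minimal sketch of that "Tidy first, regex cleanup after" workflow in Python, assuming the tidy command-line tool is installed (-q suppresses chatter, -asxhtml asks for well-formed XHTML; the regex passes afterward depend entirely on your particular mess):

```python
import subprocess

def tidy_then_clean(raw_html: str) -> str:
    """Run HTML through Tidy, then do whatever regex cleanup remains."""
    result = subprocess.run(
        ["tidy", "-q", "-asxhtml"],   # quiet; emit well-formed XHTML
        input=raw_html,
        capture_output=True,
        text=True,
    )
    # Tidy exits nonzero even for mere warnings, so don't use check=True.
    cleaned = result.stdout
    # ...additional regex passes would go here...
    return cleaned
```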

I’ve done a lot of HTML parsing with regular expressions and rarely had a problem. That’s because I’m usually working on one file or a small set and I’m doing the regular expressions in my text editor, so I get immediate feedback when it doesn’t do what I want.

I tend to think there should be an inverse correlation between code elegance and the number of times the code gets run. If you’re only going to run your code once, feel free to throw in the most Lovecraftian regex you can concoct. Just make sure to comment your tome^H^H^H^Hfile “This code was not meant for mere mortals to understand. If you value your sanity, make sure you can roll well on your Refactoring skill check.”

I have this feeling about Finite State Automata vs. Pushdown Automata…

@Joe:

You might want to take a look at HTMLPurifier for PHP. It’s a whitelist-based approach to HTML filtering. Their comparison page also lists a few other libraries, although, as you can imagine, they are in favor of their own approach.

http://htmlpurifier.org/
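
The whitelist idea itself is simple to sketch. This is not HTMLPurifier's API, just the concept, done here in Python with the standard library's HTMLParser; a real sanitizer also validates attributes and URLs, drops the text inside disallowed tags like script, and much more:

```python
from html import escape
from html.parser import HTMLParser

# Assumption for illustration: a tiny whitelist of harmless tags.
ALLOWED = {"b", "i", "em", "strong", "p", "a"}

class WhitelistFilter(HTMLParser):
    """Keep whitelisted tags; escape everything else as plain text."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")   # attributes deliberately dropped

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

def clean(dirty: str) -> str:
    f = WhitelistFilter()
    f.feed(dirty)
    f.close()
    return "".join(f.out)

print(clean('<p onclick="evil()">hi <script>alert(1)</script></p>'))
# -> <p>hi alert(1)</p>
```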

I worked on a large, complex scraping system that scraped tens of thousands of pages each day to extract structured data from them. 90% of the extraction was regex, and it has been working fine for many years. In some places we would use HTML parsing libraries and sanitizers, but often regexes worked great and were simpler to code. As a side note, we would also often run into invalid HTML that broke the parsers we tried.
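
A minimal sketch of that style in Python; the page structure, class names, and pattern here are invented for illustration, not taken from the system described above:

```python
import re

# Hypothetical page structure: <span class="product-name">...</span>
# followed somewhere later by <span class="price">$12.34</span>.
PRODUCT_RE = re.compile(
    r'class="product-name"[^>]*>([^<]+)<'   # capture the name text
    r'.*?'                                   # lazily skip intervening markup
    r'class="price"[^>]*>\s*\$([\d.]+)',     # capture the dollar amount
    re.DOTALL,
)

def extract_products(page: str):
    """Yield (name, price) pairs; indifferent to broken markup elsewhere."""
    for m in PRODUCT_RE.finditer(page):
        yield m.group(1).strip(), float(m.group(2))
```

The trade-off is exactly the one described: it keeps working through invalid HTML that kills a parser, and it breaks silently the day the class names change.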

Using regex to parse HTML is a temptation of the devil.

Though I have no quarrel with the statement that HTML should not be parsed with REs in general, there are cases where it makes sense. An internal tool I wrote for my company needs to parse an HTML page. The format of that page never changes, and the tool does not need to parse any other pages. A couple of REs made quick work of the parsing, and they work well. They got the job done. Of course, if that page ever changes the REs will necessarily change along with it, but that’s not really a big deal.

In sum, the solution isn’t robust, but it works, it was easy, it will be easy to modify / fix, and it saves my coworkers and me untold frustration and time every week.

Not sure how there could be anything wrong with that.

Now, that was actually funny.
Thank god I still haven’t been asked to parse HTML to pull stuff out of it.

It’s somewhat ironic that this post doesn’t allow HTML (no HTML in red) either!

Again with the PHP sass. It seems like no matter how many times you say “it’s the programmer, not the language,” you just can’t forgive PHP for having such a low barrier to entry.

To contribute, though: when scraping an 80k+ file (they exist, trust me [[shudder]]), the regexes were significantly less awful than loading up the whole DOM parser and praying that there wasn’t an “unrecoverable error” in there somewhere.

Then again, had the task been much more complex I may have had to start eating 5 babies a week (only up to 3 right now). This post does strike me as arguing both sides of the coin, but at least it hits them in the right way: if you’re going in with regexes in hand, know that you carry your sanity in those same hands.

And please, Jeff, lay off PHP; it’s not funny or clever anymore, and just as many horrible, horrible things can be said of VB developers as of PHP developers. It’s just that a much higher percentage of the VB developers are not “hobbyists” :wink:

I think that to contend HTML parsing is a solved problem, there should be a few more examples given. I’d caveat this by saying that the approach of converting HTML to well-formed XML (or XHTML) does not work for everyone, and I would expect an HTML parser that qualifies as an established solution to be robust enough to handle the flexibility that HTML allows.

While I agree that regex is not the right approach I disagree that this is a solved problem.

Okay, sure, you can’t parse HTML with regex, and you shouldn’t try. But there is a problem not served by any of the available libraries: parsing the garbage that sort of looks like HTML if you don’t look too closely, and that’s littered all over the web. For those pages, using a regular expression to look for what you need will work better and more reliably than trying to figure out how to get your parser not to blow up when it discovers that what it’s parsing doesn’t actually validate.
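
A minimal sketch of that approach in Python: pulling link targets out of tag soup that would make a validating parser give up. The pattern encodes assumptions about the garbage, not a grammar for HTML:

```python
import re

# Tolerant of unquoted, half-quoted, and oddly cased href attributes.
HREF_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)',
    re.IGNORECASE,
)

def links(soup: str) -> list[str]:
    """Return every href value the pattern can salvage from the soup."""
    return HREF_RE.findall(soup)

print(links('<a href=/home><A HREF="page.html>broken quote</a>'))
# -> ['/home', 'page.html']
```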

You know, there was actually a question just about that on SO a couple of days ago.

Nary a small snippet of code was sighted in this post.

Ah well, we can still hope.

I think the biggest problems arise when the HTML (or XML) is not well formed. But then you get in trouble with most of the libraries and pre-built parsers I know, too.

For the brave of heart: write a regular expression to recognize all strings of balanced parens.

Actually, don’t, because this is provably impossible.

Why? Because regular expressions recognize regular languages, a specific, well-defined class of languages. Recognizing balanced parens requires counting arbitrarily deep nesting, which a finite automaton cannot do, and HTML, with its nested tags, has exactly the same structure.
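
To make that concrete, here is what the non-regular solution needs: a counter, which is a degenerate stack, and precisely the unbounded memory a finite automaton lacks. A minimal Python sketch:

```python
def balanced(s: str) -> bool:
    """Recognize balanced parens with a counter -- a one-symbol stack."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ')' arrived before its matching '('
                return False
    return depth == 0           # every '(' was eventually closed

assert balanced("(()(()))")
assert not balanced("(()")
assert not balanced(")(")
```

A regex can be written for any fixed maximum nesting depth, but no single pattern handles all depths at once; that is the pumping-lemma argument in practice.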

Every once in a while, I’m reminded of why studying bona fide computer science in college was the right idea. It won’t necessarily make you a better programmer, but it has saved me from doing really stupid things from time to time, like trying to parse HTML with a regex.

Write a regex to identify all balanced parenthetical strings. I dare you.

I didn’t study computer science because it was easy; I studied it because it has some nice ass-saving properties that prevent you from doing stupid things. Like parsing HTML with a regex.

Jeff, sorry about the double post above; something is amiss with your website. I tried submitting the first one 3 times. The first time, I got an error about a temp file; the second time, I got a CAPTCHA error; and the third time (I hit refresh each time), the comment somehow went through. Weird.

Also, your captchas are kind of hard. Just saying.

Thank you VERY much! Now we can link to this article when explaining to SO posters why this is a bad idea.

BTW, .NET users can use http://htmlagilitypack.codeplex.com/

-- TrueWill

I worked at one company in the 1990s (before the days of CMSes) where I maintained web pages for a knowledge base about the product I supported. The official website team at this company periodically changed the design of the website, and then they had a huge task editing hundreds of pages one by one to match the new design.

Of course, to update the pages I was responsible for, I wrote a Perl script as a crude form of HTML templates, and my pages were done in five minutes. I offered my script to them to help them get their work done. They refused, saying, “we don’t have time to learn new tools, we have hundreds of pages to edit!”

I was appalled at the time, but I’ve learned something since then: There are all sorts of people working with data, with HTML, and with code. To some people, it doesn’t make a task easier to learn a new library – it makes the task HARDER. To them, using a tool they know how to use already is a huge win, even if that tool solves the task inefficiently.

Eventually, a person trying to manipulate HTML with a regular expression hits a wall, where their tool simply can’t solve the task. Some people will simply not be able to do some things. That’s why they need to hire someone who has more tools.

That is why, a long time ago, I proposed a feature on meta.stackoverflow to support question templates (like Google Code does) that would avoid such common cases.