The HTML::Sanitizer 0.04 module is available on BackPAN at http://backpan.perl.org/authors/id/N/NE/NESTING/. However, it does not appear to pass its own test suite (2 of 4 tests fail in t/03security.t) using Perl 5.10.1 on Mac OS X 10.5.8. Sadly, that makes it of limited relevance.
Many programmers have a RegEx hammer and don't want to learn a DOM/XPath based screwdriver and ratchet set.
Sadly, (X)HTML is mostly nuts, bolts and screws. Yeah, you can hammer it together, but it will fall back apart soon enough.
Personally, I always use an HTML parser whenever possible.
As a beginner in regular expressions, it's a huge pain in the arse to write a regular expression - let alone one to parse HTML.
Good points, but I think you left one important piece of advice out: don't do it at all. Both a library and a regex approach are broken solutions if your source HTML isn't up to the standard. Therefore, it is much better to tap into a structured data source, like XML, RSS, JSON, or an RDBMS. The HTML has to come from somewhere, right?
Of course, there are scenarios where you do not have that kind of access to the original data source, like when you write your own search engine.
Ie! Ie! Microsoft Fhtagn!
What a timely post. You've just convinced me to abandon my RegEx parsing hack and try to find a more "stable" approach.
Found Html Agility Pack on codeplex - http://htmlagilitypack.codeplex.com/ Had working code in 10 minutes. Hmm, maybe there's a lesson to be learned here…
lol. See a much better, more sophisticated treatment over at esr's blog.
You say becoming a follower of Cthulhu like it's a bad thing?
I really enjoyed this article today. You really nailed being a good developer.
I almost always use regular expressions to sanitize scraped content (add missing quotes, remove attributes that my parser of choice chokes on, etc.) and then run it through the parser. So far, so good.
I don't waste time debating how to parse HTML since finding BeautifulSoup.
I scrape HTML that is purposefully malformed to muck up the scraping process, using Regex. Had been using the DOM structure, but that has its own problems.
If it works…
There are no definitives really to this. The thing is, most people parsing HTML are doing it for a specific set of pages, usually in the same format. No, RegEx could not perfectly parse HTML, but it can parse it when you know the exact form of the HTML.
I started a project intending to use a library to parse the HTML, but it became more trouble than it was worth. I knew the sections of information I wanted to pull out, and I knew the WYSIWYG editor only allowed a small set of HTML for formatting and links, e.g. strong, italic, underline, a link, bullets, numbers… In the end it was not using anything more than a simple bit of code to pull out the same content in plain text.
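To make the "exact form" caveat concrete, here is a minimal sketch in Python (my choice of language, not the commenter's). The pattern `LINK_RE` and both sample snippets are hypothetical; the point is only that a regex tuned to one known layout silently returns nothing when the markup varies, while the standard library's `html.parser` copes with both.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: assumes a lowercase tag with href as the first,
# double-quoted attribute. That is the "known exact form".
LINK_RE = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whatever the attribute order or quoting."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_regex(html):
    return LINK_RE.findall(html)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


known_form = '<a href="/a">A</a>'          # matches the assumed layout
varied_form = "<A class=x HREF='/b'>B</A>"  # same meaning, different spelling
```

On `known_form` both functions find the link. On `varied_form` the regex finds nothing and raises no error, which is the failure mode to watch for when the pages you scrape change.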
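For what it's worth, "a simple bit of code to pull out the same content in plain text" does not have to mean regex. The sketch below is not the commenter's code; it is one way to do it with Python's standard `html.parser`, and the names `TextExtractor`, `to_plain_text`, and the `BLOCK_TAGS` set are made up for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that should act as word separators in the output.
BLOCK_TAGS = {"p", "br", "ul", "ol", "li"}


class TextExtractor(HTMLParser):
    """Keeps text content, drops tags, and separates block-level elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered after the last tag
    # Collapse the whitespace left behind where tags used to be.
    return " ".join("".join(parser.parts).split())
```

For example, `to_plain_text("<ul><li>one</li><li>two</li></ul>")` yields `one two`, and unclosed tags such as `<p>unclosed <b>bold` still come out as readable text.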
@craigybear
The problem is that (x)html is not a markup language, it's an ad hoc, hacked-together AST notation, and malformed html in particular is difficult because the rules for properly resolving html into its requisite tree structure are complicated and obtuse, and involve painful reverse engineering of multiple browsers. (it works in IE, so my markup must be correct!)
And so, if all you wanted to do was build a simple markup language, and a simple stylesheet language for sending your technical manual to the printers, yes, that's drop dead simple for any slightly "competent" programmer. But if you're Donald Knuth (You've heard of him, right?!), it takes about 10-20 years.
However, then using that markup language to extract useful information is an entirely different task, one for which a markup language is not really designed. html was hacked into doing that task in the form of xml, but malformed tag soup, the sort of html you'd find out in the wild, well, let's just look at the facts: It takes a team of hundreds of developers several years to make a tolerably compatible html parser/renderer. And you're just gonna hack one up in a day, are you?
So the conclusion that can be drawn here is that dogmatism is always bad? And yes, I realize the irony in that statement.
You forgot to close the tag! Luckily, I think I got in there before all hell was unleashed.
What an awesome painting of Cthulhu.
I bet Chuck Norris can parse HTML using RegEx.
<ElderSign>
I bet Chuck Norris can parse HTML using RegEx.
</ElderSign>