Parsing Html The Cthulhu Way

Hi, Jeff.

You say “It’s considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that’s just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.” Well, let me tell you my story.

Once, I had a rep of 1486 on Stack Overflow. I was so excited because finally, FINALLY, I could create my own tags. This was the objective of my life. I got 616 rep points in one month. I deleted my Twitter and Google+ accounts so as not to lose a second. I needed a mere fourteen points! My question at http://stackoverflow.com/q/6873945 would finally have a “mozmill” tag; http://stackoverflow.com/q/6797631 and http://stackoverflow.com/q/6797779 would have the “rhinounit” tag; I could solve problems such as http://meta.stackoverflow.com/q/98584 by myself whenever I found them. I rejoiced in anticipation.

Then, I found a quite innocent question about extracting some data from HTML. The document seemed to have a pretty stable structure, so I answered with a regex that could solve the problem (http://stackoverflow.com/q/6878032#6878203). Note that I emphasized that the solution was quick’n’dirty, and that a less stable document would require a more sophisticated tool.

And I got a downvote. I could see the tags I had dreamt of slipping away. I had just taken two steps back; my journey would be longer. What if more people found my answer and downvoted it too? What if I lost hundreds of rep points?! My tags! MY TAGS! I panicked. I barely managed to hold back my mourning long enough to give, between hiccups, my testimony here.

There is a clear lesson here: do not parse HTML with regular expressions in any way. It can destroy your dreams, your soul, your life. If you do it, you’ll end up smoking crack. I learned the lesson and am trying to rebuild my life, maybe - MAYBE - with the ability to create tags on SO. Do not make my mistake. It is not worth it.

Jeff, I really enjoyed your article. I posted an answer to the question on SO you referred to in this article here http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/7564061#7564061. Seeing as there are so many answers, it may never be read, but what do you think about Balancing Group Definitions? I just find it interesting b/c it allows a regex engine to have state and act as a PDA.

Holler if you find my response interesting.
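
To make the PDA remark concrete: balancing group definitions are a .NET regex feature, and Python's `re` module has nothing equivalent, so the sketch below is not that feature. It only illustrates the kind of state (a stack of open names) that balancing groups let a regex engine carry. The `TAG_RE` pattern and the `is_balanced` helper are hypothetical names, and the pattern deliberately handles only bare tags with no attributes, comments, or self-closing forms.

```python
import re

# Hypothetical, deliberately narrow pattern: bare tags such as <div> or </div>.
# It does not handle attributes, comments, or self-closing tags.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def is_balanced(markup):
    """Return True if every bare opening tag is closed in properly nested order."""
    stack = []  # the explicit state a plain regular expression lacks
    for match in TAG_RE.finditer(markup):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)  # push on an opening tag
        elif not stack or stack.pop() != name:
            return False  # closing tag with no matching opener
    return not stack  # leftover openers mean the input is unbalanced


if __name__ == "__main__":
    print(is_balanced("<a><b></b></a>"))
    print(is_balanced("<a><b></a></b>"))
```

The regex here only tokenizes; the nesting check lives in the stack, which is roughly the division of labor a balancing group folds back into the pattern itself.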

I see all the discussion about parsing HTML, but I still haven't been able to find an example that would parse:

<CLIOutput>
  <Results>
    <ReturnCode>0</ReturnCode>
    <EventCode>23000</EventCode>
    <EventSummary>CLI command completed successfully.</EventSummary>
  </Results>
  <Data>
    <Row>
      <Group>DNS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>GBS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>CSCi_7PM_Group</Group>
      <Domain>/</Domain>
      <Type>Normal</Type>
    </Row>
....   Etc
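
For output like this, one option is an XML parser rather than a regex, provided the text really is well-formed XML. Below is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The sample keeps only the first two rows shown above, and the closing `</Data>` and `</CLIOutput>` tags are an assumption about the truncated part. `parse_cli_output` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: the truncated output ends with </Data></CLIOutput>.
SAMPLE = """<CLIOutput>
  <Results>
    <ReturnCode>0</ReturnCode>
    <EventCode>23000</EventCode>
    <EventSummary>CLI command completed successfully.</EventSummary>
  </Results>
  <Data>
    <Row>
      <Group>DNS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
    <Row>
      <Group>GBS</Group>
      <Domain>/CSCi</Domain>
      <Type>Normal</Type>
    </Row>
  </Data>
</CLIOutput>"""


def parse_cli_output(text):
    """Return (return_code, rows); rows is a list of dicts, one per <Row>."""
    root = ET.fromstring(text)
    return_code = int(root.findtext("Results/ReturnCode"))
    # Each child element of a <Row> becomes a key/value pair.
    rows = [
        {child.tag: child.text for child in row}
        for row in root.iterfind("Data/Row")
    ]
    return return_code, rows


if __name__ == "__main__":
    code, rows = parse_cli_output(SAMPLE)
    print(code)
    for row in rows:
        print(row["Group"], row["Domain"], row["Type"])
```

If the real output is not well-formed, `ET.fromstring` raises `ParseError`, which at least tells you so up front instead of silently returning partial matches.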

HtmlCleaner is a parser library that I have used in the past to handle malformed HTML; it also provides limited XPath support.

I think we should use the time saved by using regex to argue about why not to use regex.

Someone said:

“Simple things like finding all the href attributes in a document are easily accomplished with a regex.”

Not even that is true.

Say I have a document that includes the following:

...
<script>
const pwn='<a href="example.com">fail</a>';
</script>
...

The Cthulhu regexp will most likely extract the “example.com” contained in the script, which was probably not the intent.
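
Here is a small Python sketch of that difference, comparing a naive pattern with the standard library's `html.parser`. The pattern is only a stand-in for the kind of regex being discussed, not the one from the article. The `real.example/page` link is a made-up addition so that there is one genuine anchor to find, and `HrefCollector`, `hrefs_by_regex`, and `hrefs_by_parser` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# The script snippet from above, plus one genuine link (a made-up URL).
DOC = """<html><body>
<a href="real.example/page">ok</a>
<script>
const pwn='<a href="example.com">fail</a>';
</script>
</body></html>"""

# Naive stand-in pattern: any double-quoted href value, anywhere in the text.
HREF_RE = re.compile(r'href="([^"]*)"')


class HrefCollector(HTMLParser):
    """Collects href values from <a> start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def hrefs_by_regex(html):
    return HREF_RE.findall(html)


def hrefs_by_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print(hrefs_by_regex(DOC))   # includes the string inside <script>
    print(hrefs_by_parser(DOC))  # only the real anchor
```

The parser treats the contents of `<script>` as raw text, so the anchor inside the JavaScript string never reaches `handle_starttag`; the regex has no notion of that context.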
