Parsing Html The Cthulhu Way

Hi, Jeff.

You say “It’s considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that’s just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine.” Well, let me tell you my story.

Once, I had a rep of 1486 on Stack Overflow. I was so excited because finally, FINALLY, I would be able to create my own tags. This was the objective of my life. I had gotten 616 rep points in one month. I deleted my Twitter and Google+ accounts so as not to lose a second. I needed a mere fourteen points! My questions would finally have a “mozmill” tag, and a “rhinounit” tag; I could solve such problems by myself if I came across them. I rejoiced in anticipation.

Then I found a quite innocent question about extracting some data from HTML. It seemed to be a pretty stably structured document, so I answered with a regex that could solve the problem. Note that I emphasized that the solution was quick’n’dirty, and that an unstable document would require a more sophisticated tool.

And I got a downvote. I could see my dreamed-of tags slipping away. I had just taken two steps back; my journey would be longer. What if more people found my answer and downvoted it too? What if I lost hundreds of rep points?! My tags! MY TAGS! I panicked. I barely managed to hold back my mourning long enough to, between hiccups, give my testimony here.

There is a clear lesson here: do not parse HTML with regular expressions in any way. It can destroy your dreams, your soul, your life. If you do it, you’ll end up smoking crack. I learned the lesson and am trying to rebuild my life, maybe - MAYBE - with the ability to create tags on SO. Do not make my mistake. It is not worth it.

Jeff, I really enjoyed your article. I posted an answer to the question on SO you referred to in this article here. Seeing as there are so many answers, it may never be read, but what do you think about Balancing Group Definitions? I just find it interesting because it allows a regex engine to keep state and act as a pushdown automaton (PDA).

Holler if you find my response interesting.

I see all the discussion about parsing HTML, but I still haven't been able to find an example that would parse

    <EventSummary>CLI command completed successfully.</EventSummary>
....   Etc

HtmlCleaner is a parser library that I used in the past to handle malformed HTML; it also provides a limited set of XPath selectors.

I think we should use the time saved by using regex to argue about why not to use regex.

Someone said:

“Simple things like finding all the href attributes in a document are easily accomplished with a regex.”

Not even that is true.

Say I have a document that includes the following:

    const pwn='<a href="">fail</a>';

The Cthulhu regexp will most likely extract the “” contained in the script, which was probably not the intent.
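
To make that concrete, here is a minimal sketch in Python, using only the standard `re` and `html.parser` modules. The sample document, the `<script>` wrapper around the line above, the `/real` link, and the names `HREF_RE`, `HrefCollector`, `hrefs_by_regex` and `hrefs_by_parser` are all made up for illustration; a real page and a real "Cthulhu regexp" may well behave differently.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one genuine link, plus the script line from above.
DOC = """<html><body>
<a href="/real">ok</a>
<script>
const pwn='<a href="">fail</a>';
</script>
</body></html>"""

# A naive pattern of the kind the quote has in mind (an assumption, not
# anyone's actual regex): grab whatever sits inside href="...".
HREF_RE = re.compile(r'href="([^"]*)"')


def hrefs_by_regex(doc):
    """Return every href value the regex sees, markup or not."""
    return HREF_RE.findall(doc)


class HrefCollector(HTMLParser):
    """Collect href values from <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Script bodies are handed to handle_data, never to this method,
        # so the string literal inside <script> is not seen as a tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def hrefs_by_parser(doc):
    """Return href values of real <a> elements in the document."""
    collector = HrefCollector()
    collector.feed(doc)
    collector.close()
    return collector.hrefs


print(hrefs_by_regex(DOC))   # includes the empty href from the script
print(hrefs_by_parser(DOC))  # only the genuine link
```

The regex has no idea that the second match lives inside a JavaScript string, while the parser treats the script body as plain data. Whether that difference matters depends, as usual, on how much you trust the document.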