Regular Expression Cheat in Python
The right way probably would have involved using an XML or HTML parser to extract the text from the document, and then my program could have dealt with it. On the other hand, sometimes the right thing is to do the simplest thing that might possibly work.
The data was a log from a chat program. The problem was to extract certain lines. The issue was that the log was in HTML.
I was in Python, where I love to be. I could have gotten any of the wonderful parsers and spent time working through the code, and would have had the right answer. It would have handled all the tags and nested tags and what-have-you. And there are other tools in the world to de-html-ify text that I could have used. All of that would have been more “proper” than what I did.
What I did was create a regular expression, <[^>]*>, and compile it with re.compile. Then I read lines from the file and used my tag pattern to replace every match with an empty string: tag_pattern.sub('', line). Was it wonderful and perfect? No. Could it be confused by tags that split across lines? Sure. Did it parse my input jolly well? It sure did.
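For the curious, here is a minimal sketch of that cheat, assuming the pattern is <[^>]*> (a "<", any run of characters that are not ">", then a ">"). The helper names strip_tags and strip_tags_from_file, and the UTF-8 encoding, are my own assumptions for illustration; only re.compile and tag_pattern.sub('', line) come from what I actually did.

```python
import re

# "<", then zero or more characters that are not ">", then ">".
tag_pattern = re.compile(r'<[^>]*>')


def strip_tags(line):
    """Return the line with every tag that opens and closes on it removed."""
    return tag_pattern.sub('', line)


def strip_tags_from_file(path):
    """Yield each line of the file with tags stripped (hypothetical helper)."""
    # The encoding is an assumption; use whatever the log was written in.
    with open(path, encoding='utf-8') as log:
        for line in log:
            yield strip_tags(line)


print(strip_tags('<b>alice</b>: hello <i>world</i><br>'))  # alice: hello world
```

Because each line is handled on its own, a tag that starts on one line and ends on the next never matches, so both halves stay in the output. That is the weakness admitted above. Entities such as &amp;amp; are also left untouched.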
Okay, a commercial tool needs to be smarter, but this was for fun and for friends. I didn’t care enough to be that careful, though. I wanted something to get the job done, and I got it done. Sue me.


