Blogging Ottinger (tim)

2005-October-28

Regular Expression Cheat in Python

Filed under: Linux, Programming

The right way probably would have been to use an XML or HTML parser to extract the text from the document, so that my program only had to deal with plain text. On the other hand, sometimes the right thing is to do the simplest thing that could possibly work.

The data was a log from a chat program. The problem was to extract certain lines. The issue was that the log was in HTML.

I was in Python, where I love to be. I could have gotten any of the wonderful parsers and spent time working through the code, and would have had the right answer. It would have handled all the tags and nested tags and what-have-you. And there are other tools in the world to de-html-ify text that I could have used. All of that would have been more “proper” than what I did.

What I did was create a regular expression, <[^>]*>, and compile it with re.compile. Then I read lines from the file and used my tag pattern to replace every match with an empty string: tag_pattern.sub('', line). Was it wonderful and perfect? No. Could it be confused by tags that split across lines? Sure. Did it parse my input jolly well? It sure did.
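For the curious, the cheat looks roughly like the sketch below. It is a reconstruction, not the original script: the pattern is assumed to be <[^>]*>, and the names strip_tags, strip_tags_from_file and the path argument are made up for illustration.

```python
import re

# Assumed pattern: '<', then any run of characters that are not '>', then '>'.
tag_pattern = re.compile(r"<[^>]*>")


def strip_tags(line):
    """Return the line with every tag-like span removed."""
    return tag_pattern.sub("", line)


def strip_tags_from_file(path):
    """Yield each line of the file at `path` (a hypothetical log file) without tags."""
    with open(path) as log:
        for line in log:
            yield strip_tags(line)


if __name__ == "__main__":
    sample = '<b>tim</b>: hello <a href="x">there</a>'
    print(strip_tags(sample))
```

Because the pattern is applied one line at a time, a tag that opens on one line and closes on the next has no closing '>' to match and is left alone, which is exactly the weakness admitted above.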

Okay, a commercial tool would need to be smarter, but this was for fun and for friends, and I didn’t care enough to be that careful. I wanted something to get the job done, and I got it done. Sue me.
