I’ve seen a number of questions about removing HTML tags from strings, but I’m still unclear on how my specific case should be handled.
I know that many posts advise against using regular expressions to handle HTML, but I suspect my case may warrant a judicious exception to this rule.
I’m trying to parse PDF files and I’ve successfully managed to convert each page from my sample PDF file into a string of UTF-32 text. When images appear, an HTML-style tag is inserted which contains the name and location of the image (which is saved elsewhere).
In a separate portion of my app, I need to get rid of these image tags. Because we’re only dealing with image tags, I suspect the use of a regex may be warranted.
My question is twofold:
- Should I use a regex to remove these tags, or should I still use an HTML parsing module such as BeautifulSoup?
- Which regex or BeautifulSoup construct should I use? In other words, how should I code this?
For clarity, the tags are structured as <img src="/path/to/file"/>
Thanks!
I would vote that in your case it is acceptable to use a regular expression. Something like this should work:
I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)
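For reference, a minimal sketch of a generic tag-stripping regex of the kind described there. The function name `remove_html_tags`, the constant `TAG_RE`, and the exact pattern are assumptions for illustration, not necessarily the snippet from the linked post:

```python
import re

# Assumed pattern: a non-greedy match of everything between '<' and the next '>'.
TAG_RE = re.compile(r'<.*?>')

def remove_html_tags(text):
    """Return text with every <...> span removed."""
    return TAG_RE.sub('', text)

# Hypothetical page text containing one of the image tags from the question.
page = 'Some text <img src="/path/to/file"/> more text'
print(remove_html_tags(page))
```

Note that this removes anything that looks like a tag, so page text such as `a < b > c` would also lose the `< b >` part. That is acceptable only if the extracted PDF text never contains angle brackets of its own.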
Edit: a version which will only remove things of the form <img .... />:
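A sketch of what such a narrower version might look like, assuming every image tag is self-closing and has at least one attribute, as in the question's `<img src="/path/to/file"/>`. The names `IMG_TAG_RE` and `remove_img_tags` are made up for this example:

```python
import re

# Matches '<img', one whitespace character, then any run of non-'>' characters
# up to the closing '/>'. Other tags (and stray '<' or '>') are left alone.
IMG_TAG_RE = re.compile(r'<img\s[^>]*?/>')

def remove_img_tags(text):
    """Return text with self-closing <img ... /> tags removed."""
    return IMG_TAG_RE.sub('', text)

# Hypothetical page text: the image tag is removed, the <b> tag is kept.
page = 'before <img src="/path/to/file"/>after <b>bold</b>'
print(remove_img_tags(page))
```

Because the pattern is anchored on the literal `<img` prefix and the `/>` suffix, it would not touch an unclosed `<img ...>` tag or a bare `<img/>` with no attributes. If the converter can emit those forms, the pattern would need to be loosened accordingly.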