I am using feedparser (http://code.google.com/p/feedparser/) to write a simple news aggregator.
I want plain text that keeps the <p> tags but has no links or images (i.e. no <a> or <img> tags).
Here are two methods to do that:
1. Edit the source code: http://code.google.com/p/feedparser/source/browse/branches/f8dy/feedparser/feedparser.py
    class _HTMLSanitizer(_BaseHTMLProcessor):
        acceptable_elements =[....]
Simply remove 'a' and 'img' from that list.
2. Patch the class at runtime:
    import feedparser
    # remove() modifies the collection in place and returns None,
    # so the result must not be assigned back to acceptable_elements.
    feedparser._HTMLSanitizer.acceptable_elements.remove('a')
    feedparser._HTMLSanitizer.acceptable_elements.remove('img')
That is, remove the two tags once, before parsing any feeds with feedparser.
Which method is better?
Are there any other good methods?
Thanks a lot!
Usually the faster option is the better one, and you can measure that with Python's timeit module. In your case, though, I'd leave the source code alone and go with the second option: it is easier to maintain.
Other options include writing a custom parser (use a C extension for maximum speed) or just letting your site's templating engine (Django, maybe?) strip those tags. Well, I've changed my mind: the last solution seems the best all-around…
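If you would rather strip the tags yourself after feedparser has done its work, here is a rough sketch of that idea in plain Python. It uses only the standard library's html.parser. The names STRIPPED_TAGS, TagStripper and strip_links_and_images are made up for illustration and are not part of feedparser. It drops the <a> and <img> tags but keeps the link text and all other markup.

```python
from html.parser import HTMLParser

# Tags to drop; this set is an assumption taken from the question.
STRIPPED_TAGS = {"a", "img"}


class TagStripper(HTMLParser):
    """Re-emit HTML, omitting STRIPPED_TAGS but keeping the text inside them."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in STRIPPED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/> or <img ... />.
        if tag not in STRIPPED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in STRIPPED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_links_and_images(html):
    """Return html with <a> and <img> tags removed (hypothetical helper)."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

You would call strip_links_and_images() on whatever HTML string you pull out of each entry (a summary, for example) before displaying it. This runs after feedparser's own sanitizing, so it does not depend on any private feedparser attribute.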