I am using http://code.google.com/p/feedparser/ to write a simple news integrator. But I want pure

Question

Editorial Team

Asked: May 31, 2026 (2026-05-31T07:12:04+00:00)

I am using http://code.google.com/p/feedparser/ to write a simple news integrator.

I want plain text (keeping the <p> tags), but no URLs or images (i.e. no <a> or <img> tags).

Here are two methods to do that:

1. Edit the source code: http://code.google.com/p/feedparser/source/browse/branches/f8dy/feedparser/feedparser.py

class _HTMLSanitizer(_BaseHTMLProcessor):
    acceptable_elements =[....]

Simply remove 'a' and 'img' from that list.

2. Leave the source alone and remove the two tags at runtime, before using feedparser:

import feedparser

# remove() changes the collection in place and returns None,
# so its result must not be assigned back to acceptable_elements.
feedparser._HTMLSanitizer.acceptable_elements.remove('a')
feedparser._HTMLSanitizer.acceptable_elements.remove('img')

Which method is better?

Are there any other good methods?

Thanks a lot!

1 Answer

Editorial Team · Answer 1 · 2026-05-31T07:12:05+00:00

Usually the faster option is the better one, and you can measure that with Python's timeit module. In your case, though, I'd rather not alter the source code and would stick with the second option, because it is easier to maintain.

Other options include writing a custom parser (as a C extension if you need maximum speed) or letting your site's templating engine (Django, maybe?) strip those tags. Actually, I've changed my mind: that last solution seems the best all-around.
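If you go the post-processing route without a templating engine, here is a minimal sketch of a small custom filter built on the standard library's html.parser. It is not part of feedparser: the names TagStripper and strip_tags are made up for this example, and it assumes you apply it to the HTML string of each entry after feedparser has parsed the feed. It drops the <a> and <img> tags, keeps the link text, and leaves every other tag as it was.

```python
from html import escape
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML, dropping the blocked tags but keeping their inner text.

    Hypothetical helper for illustration; not a feedparser API.
    """

    def __init__(self, blocked=("a", "img")):
        super().__init__(convert_charrefs=True)
        self.blocked = set(blocked)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; copy allowed start tags through verbatim.
        if tag not in self.blocked:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/> or <img ... />.
        if tag not in self.blocked:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.blocked:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape the text.
        self.parts.append(escape(data, quote=False))


def strip_tags(html_text, blocked=("a", "img")):
    """Return html_text without the blocked tags (their text content is kept)."""
    parser = TagStripper(blocked)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = '<p>Hello <a href="http://example.com">world</a><img src="x.png"></p>'
print(strip_tags(sample))
```

Because this runs on the output rather than patching feedparser's private _HTMLSanitizer class, it should keep working if the library's internals change, at the cost of parsing each entry a second time.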
