I know about utils like html2text, BeautifulSoup etc. but the issue is that they

Question

Asked: May 15, 2026 (2026-05-15T15:26:21+00:00)

I know about utilities like html2text and BeautifulSoup, but the issue is that they also extract JavaScript and mix it into the text, which makes it hard to separate the two.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternately,

from stripogram import html2text
extract = html2text(webPage)

Both of these also extract all the JavaScript on the page, which I don't want.

I only want the readable text, i.e. what you could select and copy from the page in your browser.

1 Answer

Editorial Team · Answer 1 · 2026-05-15T15:26:22+00:00

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you at the top level: it returns the root’s immediate children that are not script tags (a separate htmlDom.findAll(recursive=False, text=True) returns the strings that are immediate children of the root). Since both calls only look one level deep, you need to repeat this recursively, e.g., as a generator:

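def nonScript(tag):
    # True for any tag other than <script>
    return tag.name != 'script'

def getStrings(root):
    # Yield every string under root in document order, skipping <script> subtrees
    for s in root.childGenerator():
        if getattr(s, 'name', None) is None:  # no tag name: it's a string
            yield s
        elif nonScript(s):                    # a non-script tag: recurse into it
            for x in getStrings(s):
                yield x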

I’m using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
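If you want to try the idea without BeautifulSoup installed, here is a minimal sketch of the same "skip everything under a script tag" approach using Python 3's standard html.parser module instead. It is a different parser, not the code above: the names VisibleTextParser and visible_text are made up for illustration, and also skipping style elements is my own assumption rather than something the question asked for.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything nested inside skipped elements."""

    # Assumption: <style> is skipped as well as <script>.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []     # text pieces in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of html with script/style contents left out."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script><p>Bye</p></body></html>"
)
print(visible_text(sample))  # Hello world Bye
```

Note that joining the pieces with a single space is a crude choice: it will insert a space wherever a tag splits a word, so adjust the joining if exact whitespace matters to you.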
