I know about utilities like html2text and BeautifulSoup, but the issue is that they also extract JavaScript and mix it into the text, which makes the two hard to separate.
htmlDom = BeautifulSoup(webPage)
htmlDom.findAll(text=True)
Alternately,
from stripogram import html2text
extract = html2text(webPage)
Both of these extract all the JavaScript on the page as well, which is undesired.
I just want to extract the readable text, i.e. what you could copy from the page in your browser.
If you want to avoid extracting any of the contents of script tags with BeautifulSoup, a non-recursive findAll call that filters out script tags will do that for you, getting the root’s immediate children which are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get strings that are immediate children of the root). You need to do this recursively, e.g., as a generator that walks the tree and skips every script tag. I’m using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
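A minimal sketch of that idea, not the exact code from the answer. It assumes the current bs4 package rather than the older BeautifulSoup 3 API, so it uses .children (the bs4 counterpart of childGenerator) and find_all (the bs4 spelling of findAll). The names get_strings, web_page, html_dom and top_level are made up for illustration, and the sample HTML is invented.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def get_strings(root):
    """Yield the strings under root in document order, skipping <script> subtrees."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name == "script":
                continue  # skip the tag and everything inside it
            # Recurse into any other tag and pass its strings along.
            yield from get_strings(child)
        elif isinstance(child, NavigableString):
            yield str(child)


# Hypothetical sample page, used only to exercise the sketch.
web_page = (
    "<html><body><p>Hello</p>"
    "<script>var x = 1;</script>"
    "<p>World</p></body></html>"
)
html_dom = BeautifulSoup(web_page, "html.parser")

# Non-recursive variant: only the immediate children that are not script tags.
top_level = html_dom.body.find_all(lambda tag: tag.name != "script", recursive=False)

# Recursive variant: all readable strings, with whitespace-only strings dropped.
text = " ".join(s.strip() for s in get_strings(html_dom) if s.strip())
print(text)  # Hello World
```

As written, this only filters script tags. The contents of style tags and HTML comments would still be yielded, so the name check would probably need extending if those show up in the output.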