I have downloaded the web page into an html file. I am wondering what’s

Asked: May 13, 2026

I have downloaded the web page into an html file. I am wondering what’s the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.

To be clear:

Input:

<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>

Output:

Page title This is paragraph one. This is paragraph two.

Putting it together:

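import re
from bs4 import BeautifulSoup  # Beautiful Soup 4; the old "from BeautifulSoup import ..." is version 3 (Python 2 only)

def removeHtmlTags(page):
    # Regex approach: delete anything that looks like a tag, allowing quoted attribute values
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: collect every text node in the document and join them
    soup = BeautifulSoup(page, 'html.parser')
    return ''.join(soup.find_all(string=True))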
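
A third variant, as a minimal sketch only: where installing Beautiful Soup is not an option, the standard library's html.parser can collect the text nodes directly. The names TextExtractor and remove_html_tags_stdlib are made up for illustration, and skipping the contents of script and style elements is an assumption about what "strings a browser would display" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these never count as displayed text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if not self._skip_depth:
            self.chunks.append(data)


def remove_html_tags_stdlib(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (including newlines) into single spaces
    return " ".join("".join(parser.chunks).split())
```

Unlike the two functions above, this one also normalizes whitespace, so the sample input comes out on a single line.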

Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (famous don’t use regex to parse html rant)

1 Answer

Editorial Team · Answer 1 · 2026-05-13T22:40:01+00:00

Parse the HTML with Beautiful Soup.

To get all the text, without the tags, try:

''.join(soup.findAll(text=True))
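
For a fuller picture, here is a rough sketch of that one-liner as a complete script. It assumes Beautiful Soup 4 (the bs4 package) with the built-in "html.parser" backend; visible_text is a hypothetical helper name, and dropping script and style elements plus collapsing whitespace are additions beyond the line above, made so the result matches the output asked for in the question.

```python
from bs4 import BeautifulSoup  # assumes Beautiful Soup 4: pip install beautifulsoup4


def visible_text(page):
    soup = BeautifulSoup(page, "html.parser")
    # Drop elements whose contents a browser does not render as text
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text() concatenates the remaining text nodes;
    # split/join then collapses newlines and indentation into single spaces
    return " ".join(soup.get_text().split())


sample = """<html><head><title>Page title</title></head>
       <body><p id="firstpara" align="center">This is paragraph <b>one</b>.
       <p id="secondpara" align="blah">This is paragraph <b>two</b>.
       </html>"""

print(visible_text(sample))
# Page title This is paragraph one. This is paragraph two.
```

In Beautiful Soup 4, findAll(text=True) still works as a legacy spelling of find_all(string=True); get_text() is a shorter way to join the same text nodes.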
