My offline code works fine but I’m having trouble passing a web page from

Asked: June 15, 2026 (2026-06-15T23:37:43+00:00)

My offline code works fine, but I’m having trouble passing a web page from urllib via lxml to BeautifulSoup. I’m using urllib for basic authentication, then lxml to parse the page (it gives a good result with the specific pages we need to scrape), and then handing the result to BeautifulSoup.

#! /usr/bin/python
import urllib.request 
import urllib.error 
from io import StringIO
from bs4 import BeautifulSoup 
from lxml import etree 
from lxml import html 

file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly

With that working, I tried to feed it a page via urllib:

# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes

Trying to deal with the error message, I tried:

# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file

I didn’t know what to do about the OSError so I sought another method. I found suggestions to use lxml.html instead of lxml.etree so the next attempt is:

# attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer

This clearly gives a structure of some sort, but how do I pass it to BeautifulSoup? My question is twofold: How can I pass a page from urllib to lxml.etree (as in attempt 1, which is closest to my working code)? Or, how can I pass an lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around data types, but I don’t know what to do about them.

Python 3.3, lxml 3.0.1, BeautifulSoup 4. I’m new to Python. Thanks to the internet for code fragments and examples.

1 Answer

Editorial Team · Answer 1 · 2026-06-15T23:37:44+00:00

BeautifulSoup can use the lxml parser directly, so there is no need to go to these lengths. Pass it the bytes returned by page.read() and name the parser:

BeautifulSoup(doc, 'lxml')
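
For completeness, here is a minimal sketch of how the pieces from the question could fit together. It is not the only way to do it. The `doc` bytes literal is a made-up stand-in for whatever `urllib.request.urlopen(req).read()` returns, and the variable names are only illustrative. As far as I can tell, attempt 1 fails because `read()` returns bytes and `StringIO` only accepts `str`, and attempt 2 fails because `etree.parse` treats a plain string as a file name rather than as markup.

```python
from io import BytesIO

from bs4 import BeautifulSoup
from lxml import etree

# Assumption: stand-in for the bytes returned by urllib.request.urlopen(req).read()
doc = b"<html><body><p class='msg'>Hello</p></body></html>"

# Option 1: let BeautifulSoup drive lxml itself; it accepts bytes directly.
soup_direct = BeautifulSoup(doc, "lxml")

# Option 2 (attempt 1): etree.parse wants a file name or file-like object,
# so wrap the bytes in BytesIO instead of StringIO.
tree = etree.parse(BytesIO(doc), etree.HTMLParser())

# Option 3 (second question): BeautifulSoup expects markup, not an lxml
# element, so serialize the tree back to bytes before handing it over.
markup = etree.tostring(tree.getroot(), method="html")
soup_from_lxml = BeautifulSoup(markup, "lxml")

print(soup_direct.p.get_text())     # Hello
print(soup_from_lxml.p.get_text())  # Hello
```

Option 1 is usually enough. Options 2 and 3 parse the page twice, so they are only worth it if you need the intermediate lxml tree for something else.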
