I'm trying to parse the HTML of a webpage that requires a login. I can get the HTML of a public webpage using this script:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import re
# Fetch the page and parse it
webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage)
print soup
# This prints the source of example.com
But getting the source of a webpage that I have to be logged in to see is proving more difficult.
I tried replacing 'https://www.example.com' with 'https://user:pass@example.com', but I got an Invalid URL error.
Does anyone know how I could do this?
Thanks in advance.
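On the Invalid URL error: urlopen does not accept credentials embedded in the URL like that. If the site happens to use HTTP Basic authentication (an assumption; most sites with a login form do not, and this will not help there), the credentials can be sent in an Authorization header instead. Here is a minimal sketch, written for Python 3, where urllib2 became urllib.request. The helper name basic_auth_request is made up for illustration, and the URL and credentials are the placeholders from the question.

```python
import base64
from urllib.request import Request


def basic_auth_request(url, username, password):
    """Build a Request that carries an HTTP Basic Authorization header.

    Hypothetical helper for illustration; only useful if the server
    actually uses HTTP Basic authentication.
    """
    # Basic auth is "username:password", base64-encoded
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")

    request = Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder URL and credentials taken from the question
request = basic_auth_request("https://www.example.com", "user", "pass")
```

Passing the resulting request to urlopen and handing the response to BeautifulSoup would then work the same way as in the original script. For a site with a form-based login, this header is simply ignored and you need something that can submit the form and keep the session cookie.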
Selenium WebDriver (http://seleniumhq.org/projects/webdriver/) might be good for your needs here. You can log in to the page and then print the contents of the HTML. Here's an example: