I’m fetching and parsing a fairly large number of web pages, and I noticed that my script was ending spontaneously with a Python session restart. So far it only seems to happen when I try to make soup out of the nasa.gov page, e.g.:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.nasa.gov')
soup = BeautifulSoup(page)  # the session restarts when this line runs
=====================================RESTART=======================================
Does anyone know why this might be occurring and whether there’s any way I can avoid it? It doesn’t throw an exception or anything; the session just restarts. This happens on two different machines, although I’d be interested to hear if others can’t reproduce it. (I’m using Python 2.7.2, Enthought Distribution.)
EDIT/UPDATE:
I’ve just tried substituting lxml for BeautifulSoup, but it causes the same spontaneous restart, e.g.:
from lxml import html
page = html.parse('http://www.nasa.gov')
============================== RESTART =================================
As soon as Python opens the page and tries to parse it, the session restarts. Interestingly, reading the page and printing it to the console works fine.
The Doctype is wrong for that URL. Try this:
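A minimal sketch of one way to act on that, assuming the malformed doctype really is what upsets the parser: remove the `<!DOCTYPE ...>` declaration from the raw markup before handing it to the parser. The helper name `strip_doctype`, the regular expression, and the sample string are all illustrative assumptions, and the markup is assumed to already be a text string rather than an open URL handle.

```python
import re

# Matches a <!DOCTYPE ...> declaration, case-insensitively.
# Simplifying assumption: the declaration contains no '>' before its end
# (i.e. no internal subset with nested markup).
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)


def strip_doctype(markup):
    """Hypothetical helper: return markup with the first DOCTYPE removed."""
    return DOCTYPE_RE.sub("", markup, count=1)


# Made-up sample input standing in for the downloaded page source.
sample = '<!DOCTYPE html PUBLIC "bogus"><html><body><p>hi</p></body></html>'
cleaned = strip_doctype(sample)
print(cleaned)  # <html><body><p>hi</p></body></html>
```

The cleaned string could then be passed to `BeautifulSoup(cleaned)` or `lxml.html.fromstring(cleaned)` in place of the raw page. Whether that avoids the restart on the nasa.gov page is untested here.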