I’m trying to scrape some content from another site, and I’m not sure why BeautifulSoup is producing this output. It finds only a blank space inside the matched div, but the real HTML contains a large amount of markup. I apologize if this is something stupid on my part; I’m new to Python.
Here’s my code:
    import sys
    import os
    import mechanize
    import re
    from BeautifulSoup import BeautifulSoup

    def scrape_trails(BASE_URL, data):
        #Get the trail names
        soup = BeautifulSoup(data)
        sitesDiv = soup.findAll("div", attrs={"id" : "sitesDiv"})
        print sitesDiv

    def main():
        BASE_URL = "http://www.dnr.state.mn.us/skiing/skipass/list.html"
        br = mechanize.Browser()
        data = br.open(BASE_URL).get_data()
        links = scrape_trails(BASE_URL, data)

    if __name__ == '__main__':
        main()
If you follow that URL you can see the sitesDiv contains a lot of markup. I’m not sure if I’m doing something wrong or if this is just malformed markup that the script can’t handle. Thanks!
The problem is that the HTML served from that URL has an empty div#sitesDiv: in the raw page source, the div contains nothing but whitespace.
A script on the page fills in the div after the page has loaded. Your Python code doesn’t execute the JavaScript, so the div is never modified and is still empty when your code parses it.
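To make the point concrete, here is a small sketch that parses a page of that shape without running its script. The markup in SAMPLE_HTML is a made-up stand-in, not a copy of the real page, and DivTextCollector and div_text are hypothetical helper names. It is written for Python 3 and uses only the standard library's html.parser, rather than the BeautifulSoup and mechanize setup from the question.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the served page: the div is empty in the markup,
# and the script would only fill it in when run inside a browser.
SAMPLE_HTML = """
<html><body>
<div id="sitesDiv"> </div>
<script>
document.getElementById("sitesDiv").innerHTML = "<ul><li>Trail A</li></ul>";
</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect the text found inside the div with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target div
        self.chunks = []    # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the raw text inside the div, exactly as a static parser sees it."""
    parser = DivTextCollector(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# The script is never executed, so only the whitespace in the markup comes back.
print(repr(div_text(SAMPLE_HTML, "sitesDiv")))
```

The depth counter is deliberately simple and would miscount unclosed void tags such as a bare br inside the div, which is fine for this illustration but not for general scraping.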
The good news is that the data you’re looking for is served to the page as JSON from this URL: http://maps.dnr.state.mn.us/cgi-bin/mapserv54?map=/usr/local/mapserver/apps/prk/ski_pass/sites.map&mode=nquery&qformat=geojson . So you can skip BeautifulSoup altogether and just read and parse the JSON directly to get the info you want.
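A minimal sketch of reading that feed directly might look like the following. It assumes the endpoint returns a well-formed GeoJSON FeatureCollection, which I have not verified. The helper names feature_properties and fetch_sites, and the "name" key in the sample, are hypothetical. It is written for Python 3 with the standard library only; on Python 2 you would use urllib2 in place of urllib.request.

```python
import json
from urllib.request import urlopen

# URL quoted in the answer above.
SITES_URL = ("http://maps.dnr.state.mn.us/cgi-bin/mapserv54"
             "?map=/usr/local/mapserver/apps/prk/ski_pass/sites.map"
             "&mode=nquery&qformat=geojson")


def feature_properties(geojson_text):
    """Return the properties dict of every feature in a GeoJSON document."""
    doc = json.loads(geojson_text)
    # Assumption: a FeatureCollection with a top-level "features" list.
    return [feature.get("properties") or {} for feature in doc.get("features", [])]


def fetch_sites(url=SITES_URL):
    """Download the feed and return its feature properties (needs network access)."""
    with urlopen(url) as response:
        return feature_properties(response.read().decode("utf-8"))


# Offline check with a made-up document; the "name" key is hypothetical.
sample = ('{"type": "FeatureCollection", "features": ['
          '{"type": "Feature", "geometry": null, '
          '"properties": {"name": "Example Trail"}}]}')
print(feature_properties(sample))
```

Calling fetch_sites() and printing one of the returned dicts is the quickest way to see which keys the real feed actually uses before you rely on any of them.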