I’m trying to use Python to navigate through a website that has auth forms on its landing page, rendered by ASP scripts.
But when I use Python (with mechanize, requests, or urllib) to get the HTML of that site, I always end up with a semi-blank HTML file, due to such ASP scripts.
Would anyone know any method that I can use to get the final (as displayed in a browser) version of an ASP site?
Your target page is a `frameset`. There is nothing fancy going on from the server side that I can tell. When I use `requests` or `urllib` to download it, even sending no headers at all, I get exactly the same HTML that I see in Chrome or Firefox. There is some embedded JS, but it doesn’t do anything. Basically, all there is here is a `frameset` with a single `frame` in it.
The `frame` target is also a perfectly normal page with nothing fancy going on from the server side that I can tell. Again, if I fetch it with no headers, I get the exact same contents as in Chrome or Firefox. There is plenty of embedded JS here, but it’s not building the DOM from scratch or anything; the static contents that I get from the server have the whole page contents in them. I can strip out all the JS and render it, and it looks exactly the same.
There is a minor problem that neither the server nor the HTML specifies a charset anywhere, and yet the contents aren’t ASCII, which means you need to guess which charset to decode with if you want to process it as Unicode. But if you’re in Python 2.x, and just planning to grab things out of the DOM by ID or something, that won’t matter.
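
If you do need Unicode text, the guessing step might look something like the sketch below. It assumes Python 3, and the candidate list (`utf-8`, then `cp1252`) is only a guess at what a site like this is likely to use; the helper name `decode_guessing` is made up for illustration.

```python
def decode_guessing(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate encodings are an assumption; adjust them for the site.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # although the text may be wrong if the real charset is something else.
    return raw.decode("latin-1"), "latin-1"
```

Trying `utf-8` first is deliberate: it rejects most byte sequences that are not really UTF-8, whereas a single-byte codec will accept almost anything and silently produce the wrong characters.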
I suspect your real problem is just that you don’t know how HTML `frameset`s work. You’re downloading the `frameset`, not downloading the referenced `frame`, and wondering why the resulting page looks like an empty `frameset`.
Frames are an obsolete feature that nobody uses anymore for anything but a common trick for letting the user pop up a new window even in ancient browsers, and some obscure tricks for fooling popup blockers. In HTML 5, `frameset` and `frame` are finally gone (only `iframe` remains). But as long as ancient websites are out there and need to be scraped, you need to know how they work.
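
To get at the referenced `frame` rather than the `frameset` itself, a script has to read each `src` and fetch that URL too. Here is a minimal sketch using only the Python 3 standard library. The names `FrameSrcCollector`, `frame_urls`, and `fetch_frames` are hypothetical, and decoding the frameset as `latin-1` is an assumption made only because the page declares no charset.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class FrameSrcCollector(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(frameset_html, base_url):
    """Return absolute URLs for all frames found in frameset_html."""
    collector = FrameSrcCollector()
    collector.feed(frameset_html)
    collector.close()
    # src values are usually relative, so resolve them against the page URL.
    return [urljoin(base_url, src) for src in collector.sources]


def fetch_frames(frameset_url):
    """Download the frameset, then each frame it references (raw bytes)."""
    with urlopen(frameset_url) as response:
        # Assumption: latin-1 is good enough to find the tags.
        frameset_html = response.read().decode("latin-1")
    pages = {}
    for url in frame_urls(frameset_html, frameset_url):
        with urlopen(url) as response:
            pages[url] = response.read()
    return pages
```

Calling `fetch_frames` with the landing-page URL should return a dict mapping each frame URL to its raw contents, which is where the actual forms live. If the site requires cookies or a login session, the same idea applies with a `requests.Session` in place of `urlopen`.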
This isn’t a substitute for the full documentation, but here’s the short version of what a web browser does with a `frameset`: For each `frame` tag, it follows the `src` attribute, then it replaces the contents of the `frame` tag with a `#document` tag with no attributes, with the results of reading the `src` URL as its contents. Beyond that, of course, frames affect layout, but that probably doesn’t affect you.
Meanwhile, if you’re trying to learn web scraping, you really want to install your browser’s “Web Developer Tools” (different browsers have different names), or a full-on debugger like Firebug. That way, you can inspect the live tree that your browser is rendering, and compare it to what you get from your script (or, more simply, from `wget`). So, next time you can say “In Chrome’s Inspect Page, I see a `#document` under the `frame`, with a whole bunch of stuff underneath that, but when I try to read the same page myself, the `frame` has no children”.
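
For the script side of that comparison, it can help to print the tag tree of whatever you actually downloaded. The following is a rough sketch in Python 3 with the standard library only; `OutlineBuilder` and `outline` are made-up names, and the indentation is naive (it assumes non-void tags are properly closed), so treat the output as a quick look rather than a faithful DOM.

```python
from html.parser import HTMLParser

# Tags that never have children or a closing tag.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "frame", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
}


class OutlineBuilder(HTMLParser):
    """Record each start tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def outline(html):
    """Return an indented outline of the tags in html."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return "\n".join(builder.lines)
```

Run on a downloaded frameset, the outline stops at `frame` with nothing nested under it, while the browser's inspector shows a `#document` there. That difference is exactly the clue that the frame's `src` still has to be fetched.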