Yes that sounds overly complicated.
I am trying to mine data from pages on our intranet. The pages are secure. The connection is refused when I try to get the contents with urllib.urlopen().
So I would like to use Python to open a web browser, load the site, and then click some links that trigger JavaScript pop-ups containing tables of info that I want to collect.
Any suggestions on where to begin?
I know the format of the page. It is something like this:
<div id="list">
<ul id="list item">
<li><a onclick="Openpopup('1');">blah</a></li>
</ul>
<ul></ul>
etc
Then a hidden frame becomes visible and the fields in the table within are filled.
<div>
<table>
<tr><td><span id="info_i_want">...
First off, I suggest that it’s better to figure out what the page needs that JS is providing, and fake that – you’ll have an easier time scraping the page if a browser isn’t involved.
If it’s just JavaScript making an XMLHttpRequest, you can find the page from which the JavaScript fetches the iframe data and connect directly to that. Even so, you may need a library that executes JavaScript (if the reverse-engineering is too hard or the site uses challenge tokens). A web-rendering engine like Gecko or WebKit might be appropriate.
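Here is a rough sketch of the direct approach, assuming the Openpopup('1') handler ends up requesting a URL that takes the id as a query parameter. The host, the path and the parameter name "id" are made up for illustration; the real ones would have to come from reading the page's JavaScript or watching the browser's network traffic. If the intranet needs a login, you would pass in an opener that carries the right cookies or credentials.

```python
import re
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the URL the page's JavaScript really requests.
POPUP_URL = "https://intranet.example/popup"

# Matches the argument of Openpopup('...') as it appears in the onclick attributes.
ONCLICK_RE = re.compile(r"Openpopup\('([^']*)'\)")


def popup_ids(html):
    """Return every argument passed to Openpopup(...) in the page source."""
    return ONCLICK_RE.findall(html)


def popup_url(item_id, base=POPUP_URL):
    """Build the URL for one popup; the 'id' parameter name is an assumption."""
    return base + "?" + urllib.parse.urlencode({"id": item_id})


def fetch_popup(item_id, opener=None):
    """Download one popup's HTML.

    Pass an opener that carries login cookies or credentials if the
    site requires them; the default opener sends none.
    """
    opener = opener or urllib.request.build_opener()
    with opener.open(popup_url(item_id), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

You would call popup_ids() on the list page once, then fetch_popup() for each id it returns.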
Take a good look at Selenium if you insist on using a true web browser or cannot get the programmatic methods to work.
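If you do go the Selenium route, a minimal sketch might look like the following. It assumes Selenium 4 with Firefox and its driver installed, and it takes the "#list a" selector and the info_i_want id from the markup in the question. The function name and the timeout are my own choices.

```python
LINK_SELECTOR = "#list a"    # links inside <div id="list">, per the question's markup
TARGET_ID = "info_i_want"    # span that holds the data once the popup is filled


def collect_popup_info(start_url, timeout=10):
    """Click each link in the list and return the text of the target span."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    results = []
    try:
        driver.get(start_url)
        for link in driver.find_elements(By.CSS_SELECTOR, LINK_SELECTOR):
            link.click()
            # Wait until the span is visible before reading it.
            span = WebDriverWait(driver, timeout).until(
                EC.visibility_of_element_located((By.ID, TARGET_ID))
            )
            results.append(span.text)
    finally:
        driver.quit()
    return results
```

Two things this sketch glosses over: if the table really lives inside a frame you would need driver.switch_to.frame(...) before looking for the span, and after the first click the span is already visible, so you may need a stricter wait condition to be sure the fields have been refreshed.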
Once you’ve gotten the page contents via whatever method, you need an HTML parser (such as sgmllib or [almost] xml.dom). I suggest a DOM library. Parse the DOM and extract the contents from the appropriate node in the resulting tree.
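For the extraction step, here is a small sketch using html.parser from the standard library. Note that sgmllib only exists in Python 2; html.parser is the closest thing in Python 3. It is an event-based parser rather than a DOM library, so this is a stand-in for the DOM approach, not the same thing. The class and function names are mine, and the id comes from the markup in the question.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementText(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1             # nested tag inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1              # entering the target element

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, target_id="info_i_want"):
    """Return the stripped text of the element with target_id, or ''."""
    parser = ElementText(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

Feed it the popup's HTML, however you obtained it, and it returns the contents of the span. It assumes reasonably well-nested markup; for messy pages a more forgiving third-party parser is likely a better fit.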