I would like to start by pointing out that I know this is probably failing because of cross-domain restrictions; I just want that confirmed.
I have a window which I open with JavaScript. I then use an Ajax request to pull the contents of a site and echo that (including a base href tag so that relative paths resolve) into the new window.
The idea is that I can scrape the JS-rendered HTML to see if the site is really running our banners or not (we have a suspicion that they are not!).
I open the window with this:
msaScrape.msaWin = window.open ('null.php', 'msa_weed', "scrollbars=yes,toolbar=no,status=no,width=1000,height=1000");
This loads the new window with the contents of the target page and correctly loads and renders the JS-fired content too (the banners are the bit I'm after).
I have tried msaScrape.msaWin.document.body, msaScrape.msaWin.document.body.innerHTML and many, MANY other combinations, but none will give me back the fully rendered HTML.
When I run the test on the raw buffer from the Ajax request I can detect embedded strings fine – but since the banners are being loaded via JS I need them to be loaded into the DOM before I can search the HTML for the banner ID.
Is what I am trying to do possible, or am I trying to do something that cannot be done? I find it odd that I can write into this popup window, and that I can scan (and find matches in) the raw, unrendered buffer. It's as soon as I have allowed the popup page to render the HTML that it falls down and I can't get at the source.
If required I can post the entire (small) piece of JS that does the scrape and match; I am just checking whether the client minds me doing that (it's for a private client and I don't want to upset them!).
Here is how I got it to scan the innerHTML of a remotely loaded window:
stopScraper was just a form input that allowed me to give the focus back to the calling page.
The problem was being caused by the popup not having enough time to render its DOM (plus I had to inject a base href="http://www.example.com" when I grabbed the content as a string with PHP, to ensure that paths worked when I echoed the string out into null.php).
I ran it with an interval of 8.5 seconds between requests and then gave the popup another second to fully load its DOM before trying to read the content that was loaded by the in-page JS files.
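The timing described above can be sketched roughly as follows. This is only an illustration, not the original engine: domContains, scrapeOnce, the injected openPopup and schedule arguments, and the 'banner-123' ID are all hypothetical names, and it assumes null.php is served from the same origin as the calling page so that the popup's document is readable.

```javascript
// Hypothetical helper: true if the popup's rendered DOM contains `needle`.
function domContains(win, needle) {
  try {
    if (!win || win.closed || !win.document || !win.document.body) {
      return false;
    }
    return win.document.body.innerHTML.indexOf(needle) !== -1;
  } catch (err) {
    // Reading a document from another origin throws a SecurityError.
    return false;
  }
}

// Open the popup, wait `renderDelayMs` so in-page scripts can run,
// then report whether the needle is present. `openPopup` and `schedule`
// are injected so the timing can be exercised outside a browser.
function scrapeOnce(openPopup, needle, renderDelayMs, onResult, schedule) {
  var later = schedule || function (fn, ms) { return setTimeout(fn, ms); };
  var win = openPopup();
  later(function () {
    onResult(domContains(win, needle));
  }, renderDelayMs);
}

// Browser usage (assumed values; 'banner-123' is a made-up ID):
// setInterval(function () {
//   scrapeOnce(function () {
//     return window.open('null.php', 'msa_weed', 'width=1000,height=1000');
//   }, 'banner-123', 1000, function (hit) { console.log('hit:', hit); });
// }, 8500);
```

The one-second delay is a fixed guess rather than a guarantee; a slow page may need longer before its banners appear in innerHTML.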
Final results from live, cross-domain tests:
Requests: 4024
Scrapes: 4024 (didn't miss a beat!)
Hits: 147 (was looking for a particular banner in the DOM)
If people want more explanation of how I did this then it's probably better to email me and I'll just send you the whole engine; it has a test mode built in so you can try it before you run it against your other domain! It is several files though, plus I'm not too sure about the legality of what I was doing, so I don't think I should make the whole answer public!
In short though: load your content via the same domain using PHP's file_get_contents, add the base href (if missing), and echo the result as the content of null.php (open this window as a popup using JavaScript, as shown in the question). The code here WILL then match your test string against the fully loaded DOM.
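For the "add the base href (if missing)" step, I did that in PHP, but the string operation itself is simple enough to sketch in JavaScript. This is a rough illustration under assumptions: injectBaseHref is a hypothetical name, and the regex checks are deliberately naive (they do not parse HTML properly).

```javascript
// Hypothetical helper: add <base href="..."> to an HTML string if it has none.
function injectBaseHref(html, baseUrl) {
  if (/<base\s[^>]*href\s*=/i.test(html)) {
    return html; // already has a base href; leave it untouched
  }
  var tag = '<base href="' + baseUrl + '">';
  var headOpen = /<head(\s[^>]*)?>/i; // matches <head ...> but not <header>
  if (headOpen.test(html)) {
    // Insert right after the opening <head> so it precedes other URLs.
    return html.replace(headOpen, function (match) { return match + tag; });
  }
  return tag + html; // no <head> found: prepend the tag
}
```

Placing the tag immediately after the opening head tag matters, because relative URLs that appear before the base element are resolved without it.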
I would like to stress at this point that I needed to test everything (including banners loaded by external JS files), so I HAD to render the raw HTML in a browser to cause the JS to fire. I had also looked at PhantomJS but didn't need it in the end! Managed to solve the problem with nothing but JS 🙂