I have a web page with the following content (I’ve changed the URL in the src attribute for privacy reasons; otherwise it is identical to what I see when viewing the page source):
<HTML>
<BODY>
<script type="text/javascript" src="http://localhost/servlet?publicKey=abcdefg12345678&"></script>
</BODY>
</HTML>
The resulting page displays an image when viewed in a browser, and I’m trying to scrape that image. After scraping the images, I index them (see http://www.tineye.com for an example of an image search engine) and store them. If anybody knows how to scrape images from such websites, please let me know.
Note: the src does not contain ANY information about the image. It only calls the given servlet with a public key as the parameter. What I’ve posted above is EXACTLY what I see when I click View->Page Source in my browser (Firefox). Of course I’ve changed the actual URL and the public key for privacy reasons; otherwise everything is identical.
I’ve seen similar techniques used for some banners: http://coldjava.hypermart.net/servlets/banner.htm
The JavaScript returned by the servlet is probably manipulating the DOM and adding an image element. If so, the path to the image (.jpg, .png or .gif) should appear somewhere inside the JavaScript that the servlet sends back.
You can fetch that script yourself and use a regular expression to filter the path and filename of the image out of the JavaScript code.
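As a rough illustration of the regular-expression approach, here is a minimal Python sketch. It assumes the servlet's response is plain JavaScript text that contains the image path as a quoted string; the sample response, the `/banners/ad1.gif` path, and the helper names `fetch_script` and `extract_image_urls` are all made up for the example. If the script builds the URL from pieces at runtime, a regex like this will not find it and you would need to render the page in a real browser engine instead.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Quoted string ending in a common image extension, with an optional
# query string. The optional backslash allows for escaped quotes (\").
IMAGE_RE = re.compile(
    r"""["']([^"'\s<>\\]+\.(?:jpe?g|png|gif)(?:\?[^"'\s<>\\]*)?)\\?["']""",
    re.IGNORECASE,
)


def fetch_script(url):
    """Download the JavaScript that the <script src=...> tag points to."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_image_urls(script_text, base_url):
    """Return absolute image URLs found in script_text, without duplicates."""
    urls = []
    for match in IMAGE_RE.finditer(script_text):
        # Relative paths are resolved against the URL the script came from.
        absolute = urljoin(base_url, match.group(1))
        if absolute not in urls:
            urls.append(absolute)
    return urls


# Hypothetical servlet response; the real one may look quite different.
script_url = "http://localhost/servlet?publicKey=abcdefg12345678&"
sample = (
    "document.write('<a href=\"http://localhost/click\">"
    "<img src=\"/banners/ad1.gif\"></a>');"
)
found = extract_image_urls(sample, script_url)
print(found)
```

Against the real site you would replace `sample` with `fetch_script(script_url)` and then download each URL in the returned list. Some servlets check the Referer header or cookies before answering, in which case the request needs those headers as well.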