I’m trying to capture some images from an old database.
When writing scrapers, I use Ruby (though I’m comfortable with PHP as well) to open() a website directly and read its contents. Sometimes I have the script call the appropriate curl ... command instead.
However, the database I’m scraping returns a page that embeds the target image under a name made of a series of seemingly random numbers, which I assume are generated by a server-side script. For example:
<img ... show_image.jsp?343523.jpg
However, I cannot call this show_image script directly (access is denied); it only works when embedded in the page as a whole.
Can I use curl, or something within Ruby or PHP, to download the entire page (for example, 1929.2.14.aspx) in such a way that it includes the embedded image generated by show_image.jsp?343523.jpg?
If I simply curl the aspx file directly, I naturally just get the HTML. How might one save both the HTML and the embedded image via scripting, the way a browser’s “web archive” feature does manually?
Any tips, links to tutorials, etc. appreciated…
You should probably be using Mechanize to scrape websites in Ruby. It sets cookies and the Referer header for you, so getting the image will be as easy as:
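A minimal sketch of what this could look like, assuming the mechanize gem is installed. The helper names local_name and save_page_and_image, the page.html output name, and the example.com URL in the usage comment are illustrative and not taken from the real site:

```ruby
require "uri"

# Derive a local file name from an image src such as
# "show_image.jsp?343523.jpg" (here the query string carries the name).
def local_name(src)
  uri = URI.parse(src)
  File.basename(uri.query || uri.path)
end

# Fetch the page, then fetch the embedded image through the same agent,
# so the session cookies and Referer header match what a browser sends.
def save_page_and_image(page_url, src_pattern = /show_image\.jsp/)
  require "mechanize" # gem install mechanize; loaded only when needed

  agent = Mechanize.new
  page  = agent.get(page_url)
  File.binwrite("page.html", page.body)

  image = page.image_with(src: src_pattern)
  raise "no matching <img> found on #{page_url}" unless image

  # Image#fetch requests the image with the containing page as referer.
  file = image.fetch
  name = local_name(image.src)
  File.binwrite(name, file.body)
  name
end

# Usage (hypothetical URL, substitute the real record page):
#   save_page_and_image("http://example.com/1929.2.14.aspx")
```

If the server checks something beyond cookies and the referer (a User-Agent string, for instance), this may still be denied. In that case, comparing the request headers against what a browser sends is a reasonable next step.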