Currently my code downloads the images from a website like this:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HtmlAgilityPack;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using System.Net;
using System.Web;
using System.Threading;
using DannyGeneral;
using GatherLinks;
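namespace GatherLinks
{
    class RetrieveWebContent
    {
        HtmlAgilityPack.HtmlDocument doc;
        string imgg;
        int images;

        public RetrieveWebContent()
        {
            images = 0;
        }

        // Returns the src value of every <img> on the page and saves each
        // image that resolves to an http(s) URL into d:\MyImages.
        // Returns null if the page could not be loaded or another unexpected error occurs.
        public List<string> retrieveImages(string address)
        {
            try
            {
                doc = new HtmlAgilityPack.HtmlDocument();
                List<string> imgList = new List<string>();
                using (System.Net.WebClient wc = new System.Net.WebClient())
                {
                    using (Stream page = wc.OpenRead(address))
                    {
                        doc.Load(page);
                    }
                    HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@src]");
                    if (imgs == null) return imgList;
                    Uri pageUri = new Uri(address);
                    foreach (HtmlNode img in imgs)
                    {
                        HtmlAttribute src = img.Attributes["src"];
                        if (src == null)
                            continue;
                        imgList.Add(src.Value);
                        // Resolve the src against the page address so relative paths work too.
                        Uri imgUri;
                        if (!Uri.TryCreate(pageUri, src.Value, out imgUri))
                            continue;
                        if (imgUri.Scheme != Uri.UriSchemeHttp && imgUri.Scheme != Uri.UriSchemeHttps)
                            continue;
                        // Take the file name from the URL path, ignoring any query string.
                        imgg = Path.GetFileName(imgUri.LocalPath);
                        if (string.IsNullOrEmpty(imgg))
                            continue;
                        try
                        {
                            wc.DownloadFile(imgUri, Path.Combine(@"d:\MyImages", imgg));
                            images++;
                        }
                        catch (WebException)
                        {
                            // One failed image should not abort the rest.
                            Logger.Write("There Was Problem Downloading The Image: " + imgg);
                        }
                    }
                }
                return imgList;
            }
            catch (Exception ex)
            {
                Logger.Write("There Was Problem Processing The Page: " + address + " (" + ex.Message + ")");
                return null;
            }
        }
    }
}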
But in many cases the images are loaded through JavaScript and can't be downloaded the regular way. How can I download the images, or the complete website content including images and everything else, so that I later have the whole site with its content tree on my hard disk and can browse it offline?
I would use an actual browser, which runs the page's JavaScript, and then save the images from there. Take a look at Watir-WebDriver for a solution in Ruby. This library helps you automate a browser. I would use it in combination with Nokogiri to achieve what you are trying to do above.
Python equivalents also exist.
WebDriver does not yet support the save functionality, but the older “Watir” does. You might also want to look into CasperJS, which provides some browser automation in JavaScript.
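Since the question is in C#, here is a rough sketch of the same browser-automation idea using the Selenium WebDriver .NET bindings instead of Watir. This is a swapped-in tool, not one named above, and it assumes the Selenium.WebDriver NuGet package plus a local Chrome installation. `RenderedPageSaver`, `FileNameFor`, `SavePage` and the `index.html` file name are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using OpenQA.Selenium;          // assumption: Selenium.WebDriver NuGet package
using OpenQA.Selenium.Chrome;   // assumption: Chrome is installed locally

static class RenderedPageSaver
{
    // Hypothetical helper: derive a local file name from an image URL,
    // using only the path part (the query string is ignored).
    public static string FileNameFor(Uri imageUri)
    {
        string name = Path.GetFileName(imageUri.LocalPath);
        return string.IsNullOrEmpty(name) ? "image" : name;
    }

    // Loads the page in a real browser so its scripts run, then saves the
    // rendered HTML and every <img> present in the DOM at that moment.
    public static List<string> SavePage(string address, string folder)
    {
        var saved = new List<string>();
        Directory.CreateDirectory(folder);
        IWebDriver driver = new ChromeDriver();
        try
        {
            driver.Navigate().GoToUrl(address);
            File.WriteAllText(Path.Combine(folder, "index.html"), driver.PageSource);
            using (var wc = new WebClient())
            {
                foreach (IWebElement img in driver.FindElements(By.TagName("img")))
                {
                    // The browser normally reports src as an absolute URL.
                    string src = img.GetAttribute("src");
                    Uri uri;
                    if (!Uri.TryCreate(src, UriKind.Absolute, out uri))
                        continue;
                    if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
                        continue;
                    string target = Path.Combine(folder, FileNameFor(uri));
                    try
                    {
                        wc.DownloadFile(uri, target);
                        saved.Add(target);
                    }
                    catch (WebException)
                    {
                        // Skip images that cannot be fetched.
                    }
                }
            }
        }
        finally
        {
            driver.Quit(); // always close the browser
        }
        return saved;
    }
}
```

A call such as `RenderedPageSaver.SavePage("http://example.com", @"d:\MyImages")` would be the entry point. Note the limits of the sketch: images that scripts insert some time after the page loads may need an explicit wait before `FindElements`, files with the same name overwrite each other, and the saved HTML still points at the original image URLs, so browsing offline would additionally require rewriting those links.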