Is there a standard way to tell when a page was last modified? Currently I am doing this:
URLConnection uCon = url.openConnection();
uCon.setConnectTimeout(5000); // 5 seconds
String lastMod = uCon.getHeaderField("Last-Modified");
System.out.println("last mod: "+lastMod);
However, it looks like some sites do not send a Last-Modified field.
http://www.cbc.ca has these header fields:
X-Origin-Server
Connection
Expires
null
Date
Server
Content-Type
Transfer-Encoding
Cache-Control
I could parse the page to try to get its date, but this seems like a major pain. What is the standard?
(If possible, I would like to stick with URLConnection because that is what I use to download the webpage.)
There is no standard that works for every page. The Last-Modified header is optional, and dynamically generated web pages generally do not send it. Different web pages also include dates in different ways, and some do not include a modification date at all, only “© <current year>” at the bottom. You could try looking for a date near the top or the bottom of the page, but reliably extracting it would have to be site-specific.
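If you stay with URLConnection, one way to make the missing header explicit is to treat the value as optional. The following is only a minimal sketch under some assumptions: the class name LastModifiedCheck and the helper names parseHttpDate and lastModified are made up for illustration, and it assumes the server sends the date in the usual HTTP format (for example "Tue, 15 Nov 1994 12:45:26 GMT"). Anything else is treated the same as a missing header.

```java
import java.net.URLConnection;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper class; the names are illustrative only.
public class LastModifiedCheck {

    // Parses an HTTP date such as "Tue, 15 Nov 1994 12:45:26 GMT".
    // Returns empty if the header is missing or not in that format.
    static Optional<Instant> parseHttpDate(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(ZonedDateTime
                    .parse(headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant());
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    // Reads Last-Modified from an already opened connection.
    // Many dynamically generated pages will yield an empty result here.
    static Optional<Instant> lastModified(URLConnection uCon) {
        return parseHttpDate(uCon.getHeaderField("Last-Modified"));
    }
}
```

With the uCon from the question, LastModifiedCheck.lastModified(uCon) gives you either a timestamp or an empty Optional, so the site-specific fallback (if you write one) has an obvious place to go. URLConnection also has getLastModified(), which returns the value as milliseconds since the epoch, or 0 if it is not known.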