I need to parse a web page in my application, but I have one big problem: data usage. The page I want to parse is around 400-500 KB, depending on the time. I need to parse it a few times per day, depending on user requests and so on, but typically 10-20 times per day. That worries me: at 10-20 times per day, it’s 150-300 MB in a month (10 x 30 x 0.5 MB = 150 MB at the low end). That is too much, as many people have a 100 MB limit. Even with a 500 MB limit, I can’t eat half of it with my app.
I only need a very small part of the page’s data. Is there a way to download, for example, only part of the page source, or only some specific tags, or to download it compressed, or any other kind of download that doesn’t eat hundreds of MB per month?
Doing this would probably need some co-operation from the web server. If you are downloading the page from a server that isn’t under your control, then this is probably not possible.
One thing to bear in mind is that modern web servers typically gzip text-based data when the client says it can accept it (browsers do this automatically with the Accept-Encoding header; an HTTP library in your app may need to be told to send it), so the actual amount of data being transferred will be significantly less than the uncompressed size of the pages. To get a rough idea of how big the transfer will be, try using a zip utility to squash the raw HTML.
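As a rough sketch of the compression point above, the Python snippet below estimates how small a page gets under gzip and shows one way to ask a server for a compressed response. The function names `estimate_gzip_size` and `fetch_compressed` are made up for illustration, and whether a given server actually honours `Accept-Encoding: gzip` is an assumption you would need to check.

```python
import gzip
import urllib.request


def estimate_gzip_size(html: bytes) -> int:
    """Return the size in bytes of the page after gzip compression."""
    return len(gzip.compress(html))


def fetch_compressed(url: str) -> bytes:
    """Fetch a URL while advertising gzip support, and return the decoded body.

    urllib does not decompress responses on its own, so the body is
    decompressed here only if the server says it used gzip.
    """
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        body = response.read()
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return body


if __name__ == "__main__":
    # Stand-in for a saved copy of the real page; repetitive markup
    # like table rows tends to compress very well.
    sample = b"<tr><td>row</td></tr>\n" * 20000
    print(len(sample), "bytes raw,", estimate_gzip_size(sample), "bytes gzipped")
```

Real pages will compress less dramatically than this artificial sample, so it is worth running the estimate on a saved copy of the actual page.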
One further thing that might help is the HTTP Range header, which may or may not be supported by your server. It lets you request particular parts of a resource, specified by a byte range.
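A minimal sketch of a Range request in Python might look like the following. The helper names `build_range_request` and `fetch_range` and the example URL are hypothetical. A server that supports ranges replies with status 206 (Partial Content); one that ignores the header replies with 200 and starts sending the whole page, in which case this sketch gives up without reading the body.

```python
import urllib.request


def build_range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build a request for bytes start..end (inclusive) of the resource."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_range(url: str, start: int, end: int):
    """Return the requested bytes, or None if the server ignored the Range header."""
    with urllib.request.urlopen(build_range_request(url, start, end)) as response:
        if response.status != 206:
            # The server is sending the full page; stop here rather than read it all.
            return None
        return response.read()


if __name__ == "__main__":
    # Hypothetical URL; replace with the page you actually need.
    chunk = fetch_range("http://example.com/page.html", 0, 1023)
    print("range not supported" if chunk is None else f"got {len(chunk)} bytes")
```

This only helps if the part you need sits at a predictable byte offset, which is often not the case for a page whose content changes over time.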