I want my script to download only text/html content, not binaries or images that could take significantly longer to download. I know about the max_size parameter, but I would like to add a check on the Content-Type header. Is this doable?
As pointed out by others, you can perform a HEAD request before your GET request. You ought to do this as a way of being polite to the server, because it is easy for you to abort the connection, but not necessarily easy for the web server to abort sending a bunch of data and doing a bunch of work on its end.

There are some different ways to do this, depending on how sophisticated you want to be.
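A minimal sketch of the HEAD-then-GET check, assuming nothing beyond stock LWP::UserAgent. The helper names `looks_like_html` and `fetch_if_html` are made up for illustration. Since not every server answers HEAD the same way it answers GET, the sketch checks the Content-Type again on the GET response.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: true if the response declares the text/html media type.
# content_type returns the lowercased media type without parameters such as
# charset, or an empty string when the header is missing.
sub looks_like_html {
    my ($response) = @_;
    return $response->content_type eq 'text/html';
}

# Hypothetical helper: send a HEAD first and only GET when it reports text/html.
# Returns the GET response, or nothing if the resource was skipped or failed.
sub fetch_if_html {
    my ($ua, $url) = @_;

    my $head = $ua->head($url);
    return unless $head->is_success && looks_like_html($head);

    my $get = $ua->get($url);
    return unless $get->is_success && looks_like_html($get);
    return $get;
}

# Only touches the network when a URL is given on the command line.
if (defined(my $url = shift @ARGV)) {
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = fetch_if_html($ua, $url);
    my $message  = $response
        ? length($response->decoded_content) . " characters of HTML\n"
        : "skipped: not text/html, or the request failed\n";
    print $message;
}
```

This costs one extra round trip per URL, which is usually cheap compared with downloading a large binary by mistake.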
You can send an Accept header with your request which lists only text/html. A well-implemented HTTP server will return a 406 Not Acceptable status if you say you don’t accept the type of the file it would serve. Of course, it might send it to you anyway. You can do this with your HEAD request as well.

When using a recent version of LWP::UserAgent, you can use a handler subroutine to abort the rest of the request after the headers and before the content body.
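Something along these lines combines both ideas. It is a sketch under assumptions, not a tested recipe: the helper names `reject_non_html` and `html_only_agent` are invented here, and the check for the X-Died header relies on LWP trapping a die raised inside a response_header handler and recording the message there, which is my understanding of current LWP behaviour.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical handler: die, which aborts the request, unless the headers
# announce text/html. content_type strips parameters such as charset.
sub reject_non_html {
    my ($response) = @_;
    my $type = $response->content_type;
    die "Not text/html: $type\n" unless $type eq 'text/html';
    return;
}

# Hypothetical helper: a user agent that asks only for HTML and enforces it.
sub html_only_agent {
    my $ua = LWP::UserAgent->new(timeout => 10);

    # Ask politely: a strict server may answer 406 Not Acceptable.
    $ua->default_header(Accept => 'text/html');

    # Enforce it ourselves: response_header handlers run once the headers
    # have arrived and before any of the body is read.
    $ua->add_handler(response_header => \&reject_non_html);
    return $ua;
}

# Only touches the network when a URL is given on the command line.
if (defined(my $url = shift @ARGV)) {
    my $response = html_only_agent()->get($url);
    if (defined(my $died = $response->header('X-Died'))) {
        print "aborted: $died\n";
    }
    elsif ($response->is_success) {
        print length($response->decoded_content), " characters of HTML\n";
    }
    else {
        print "failed: ", $response->status_line, "\n";
    }
}
```

Because the handler is attached to the user agent rather than to a single request, every request made through that agent gets the same check.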
See the Handlers section of the LWP::UserAgent documentation for details on handlers.
I haven’t caught the exception thrown or made sure to deal with the 406 responses carefully here. I leave that as an exercise for the reader.