For some time I have been trying to solve a fairly common problem consisting of basically three steps:
- fetch an HTML page from a specified URL and store its content in a String
- detect the content encoding from either the HTML meta information or the HTTP header
- recode the content into UTF-8 for further processing
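To make the second and third steps concrete, here is a minimal sketch using only Ruby core String and Encoding methods. The helper names (charset_from_meta, charset_from_header, find_encoding, to_utf8) are made up for illustration and are not part of any gem. It assumes the body arrives as raw bytes, tries the meta information first and the HTTP header second, and falls back to replacing unmappable bytes.

```ruby
# Hypothetical helpers for illustration; not taken from rest-client or curb.

# Look for a charset in the HTML meta information (first 4 KB, raw bytes).
def charset_from_meta(html)
  head = html.byteslice(0, 4096).force_encoding(Encoding::BINARY)
  head[/<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i, 1]
end

# Look for a charset in a Content-Type header value, e.g. "text/html; charset=ISO-8859-2".
def charset_from_header(content_type)
  content_type.to_s[/charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i, 1]
end

# Return the Encoding for a name, or nil if Ruby does not know or cannot convert it.
def find_encoding(name)
  encoding = Encoding.find(name)
  encoding.dummy? ? nil : encoding
rescue ArgumentError
  nil
end

# Recode the body to UTF-8, trying meta, then header, then UTF-8 itself.
def to_utf8(body, content_type = nil)
  names = [charset_from_meta(body), charset_from_header(content_type), 'UTF-8'].compact
  names.each do |name|
    encoding = find_encoding(name)
    next unless encoding
    candidate = body.dup.force_encoding(encoding)
    next unless candidate.valid_encoding?
    begin
      return candidate.encode(Encoding::UTF_8)
    rescue EncodingError
      next # declared charset could not be converted, try the next candidate
    end
  end
  # Last resort: keep the ASCII part and replace every other byte.
  body.dup.force_encoding(Encoding::BINARY)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

A declared charset is only trusted if the bytes are actually valid in it, which is one way to cope with pages whose meta information and header disagree or are simply wrong.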
In real usage the first step is somewhat extended, with features like a “user-agent” instance with a cookie jar, a configurable timeout and number of GET attempts, a configurable limit on the number of requests per time frame, etc.
I implemented a rest-client wrapper but ran into several problems:
- class-global RestClient.proxy settings conflicting with e.g. couchrest (which uses rest-client itself)
- freezing: sometimes the timeout causes the process to freeze. AFAIK several of my friends have run into the same problem with rest-client
- redirect Location URI parsing: rest-client fails to fetch “http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte” correctly, complaining about the invalid URI ‘/indexnew.aspx?cidade=4,Belo Horizonte’ in the Location header of the 302 result, whereas curb follows this perfectly through to the target page. I’m about to reimplement the wrapper using curb
- recoding problems in the third step: I attempted to detect the encoding from the HTML page meta information and the HTTP header (in this order), but for some pages still to no avail
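For the redirect problem in the list above, one possible workaround (a sketch only, not what rest-client or curb do internally) is to percent-encode the offending characters in the Location value before resolving it against the request URL. The method names escape_unsafe and resolve_redirect are hypothetical; URI.join is from Ruby's standard library.

```ruby
require 'uri'

# Percent-encode every byte outside printable, non-space ASCII
# (spaces, control characters, non-ASCII). Existing %XX escapes are left alone.
def escape_unsafe(location)
  location.gsub(/[^!-~]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

# Resolve a possibly relative, possibly unescaped Location header
# against the URL that produced the redirect.
def resolve_redirect(base_url, location)
  URI.join(base_url, escape_unsafe(location)).to_s
end

base = 'http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte'
puts resolve_redirect(base, '/indexnew.aspx?cidade=4,Belo Horizonte')
```

This only repairs characters that are never legal in a URI; it does not try to guess whether a literal "%" was meant as an escape.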
I would love to hear of some cool gem out there that handles such needs, or of any hints towards a universal solution.
As nobody has answered, I needed to implement the curb-based solution: curburger
Perhaps somebody will find it useful.