I am using a query like this in jSoup:
Document doc = Jsoup.connect(urlString).timeout(1000).post();
It works for some sites, however:
- it doesn’t work for Google search queries (e.g. urlString = “http://www.google.com/search?q=text”), and I don’t know what makes them special
- the resulting documents contain messages like “JavaScript should be turned on in your browser”, which I would rather avoid
- there are probably more quirks, but I haven’t tested it fully yet…
My question: could these problems be avoided if we could mimic a web browser more closely? What is the best way to do it?
What are the other differences that can be encountered between getting pages via web browser and via Java (URLConnection or jSoup)?
I realized that the problem with some sites not responding was actually that I was using post() instead of get(). With get() it works fine now!
It also probably helps to set a userAgent on the request.
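Here is a minimal sketch of a GET request with a browser-like User-Agent, written against the JDK's own HttpURLConnection so it runs without extra libraries. The class name BrowserLikeFetch, the User-Agent string, the extra Accept headers and the 10-second timeouts are illustrative assumptions, not values any particular site requires. With jSoup the equivalent should be roughly Jsoup.connect(urlString).userAgent(...).get().

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class BrowserLikeFetch {
    // Example desktop-browser User-Agent (an assumption; substitute a current one).
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds a GET request with browser-like headers; nothing is sent yet.
    static HttpURLConnection open(String urlString) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(urlString).toURL().openConnection();
        conn.setRequestMethod("GET");      // GET rather than POST for plain page loads
        conn.setConnectTimeout(10_000);    // milliseconds
        conn.setReadTimeout(10_000);
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
        return conn;
    }

    // Sends the request and returns the body, assuming it is UTF-8 encoded.
    static String fetch(String urlString) throws IOException {
        HttpURLConnection conn = open(urlString);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BrowserLikeFetch <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

This only changes how the request looks to the server. It does not execute JavaScript, so pages that build their content in the browser will still come back with the “JavaScript should be turned on” notice.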
In the meantime, I’ve also tested HtmlUnit for the same task, and it worked, but it seems like overkill for simply getting an HTML file (for some kind of processing). It basically runs a whole invisible web browser in the background to do this task.