I need to read and parse a lot of HTML pages (100+) for specific content (a few lines of text that are almost the same on every page).
I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.
Both methods are slow and with jsoup I get the following error:
java.net.SocketTimeoutException: Read timed out (this happens on multiple computers with different connections)
Is there anything better?
EDIT:
Now that I’ve gotten jsoup to work, I think a better question is: how do I speed it up?
Did you try lengthening the timeout in jsoup? It’s only 3 seconds by default, I believe. You can raise it with Jsoup.connect(url).timeout(millis).
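On the speed question: with 100+ pages, most of the time is probably spent waiting on the network rather than parsing, so fetching several pages at once may help more than switching parsers. Below is a minimal sketch of that idea using only the standard library. The class name ParallelFetch, the method fetchAll, and the fetcher parameter are made up for illustration; the fetcher is where a jsoup call with a longer timeout would go.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs a page-fetching function over many URLs in parallel.
class ParallelFetch {

    // 'fetcher' is an assumption: any function that turns a URL into the text you want.
    // With jsoup it could wrap something like
    //   Jsoup.connect(url).timeout(10_000).get().text()
    // (catching IOException inside the lambda and rethrowing it unchecked).
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every URL first so the downloads overlap.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.put(url, pool.submit(task));
            }

            // Collect results in submission order; URLs whose fetch failed are skipped.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    results.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException e) {
                    System.err.println("Failed: " + entry.getKey() + " (" + e.getCause() + ")");
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

You would call it as ParallelFetch.fetchAll(urls, myFetcher, 8), where 8 threads is just a starting guess. Keep the pool small if all pages live on the same server, since too many simultaneous requests can themselves cause timeouts.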