I’m trying to extract all the links from a website. I’ve used Jsoup to do this in previous programs, but the problem here is that more content is generated by pressing the “more” button. It doesn’t change pages; it simply loads more content into the same page, so I’m not sure how to see all of the available links using Java and Jsoup.
The website is http://seekingalpha.com/symbol/msft and I’m simply trying to extract all the links to articles for a specific company, such as Microsoft.
You need to get yourself something that lets you spy on the requests you’re making over the wire. You could watch the HTTP traffic using the Network tab in Chrome, but personally I like Charles. Anyway, if you check what happens when you click the “more” button, you’ll see that a POST request is being made (using AJAX, of course) that looks like this:
http://seekingalpha.com/account/ajax_headlines_content 200 POST seekingalpha.com /account/ajax_headlines_content 432 ms 5.94 KB Complete
The params sent with the request are:
type all
page 2
slugs msft
is_symbol_page true
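A request with these params could be reproduced from Java roughly like this. It is only a sketch using the JDK's built-in `java.net.http` client (Java 11+), and it assumes the endpoint still accepts a plain form-encoded POST with no cookies or extra headers, which may no longer be true. The class name `HeadlinePager`, its helper methods, and the `maxPages` cap are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class HeadlinePager {
    // Endpoint seen in the captured request; the site may have changed since.
    static final String ENDPOINT =
            "http://seekingalpha.com/account/ajax_headlines_content";

    // Builds the form-encoded body from the four captured params.
    static String formBody(String slug, int page) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("type", "all");
        params.put("page", String.valueOf(page));
        params.put("slugs", slug);
        params.put("is_symbol_page", "true");
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Wraps the body in a POST request (assumed to need no other headers).
    static HttpRequest buildRequest(String slug, int page) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(slug, page)))
                .build();
    }

    // Returns the HTML fragment for one page, or "" on a non-200 response.
    static String fetchPage(HttpClient client, String slug, int page)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildRequest(slug, page), HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 ? response.body() : "";
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        int maxPages = 5; // hypothetical cap so the loop always terminates
        // The captured request was for page 2; page 1 is assumed to be the
        // content already present on the initial symbol page.
        for (int page = 2; page <= maxPages; page++) {
            String fragment = fetchPage(client, "msft", page);
            if (fragment.isBlank()) {
                break; // nothing more to load, or the request was rejected
            }
            System.out.println("page " + page + ": " + fragment.length() + " chars");
        }
    }
}
```

Each returned fragment can then be handed to `Jsoup.parseBodyFragment(fragment, "http://seekingalpha.com/")` and queried with `select("a[href]")`, the same way as in a normal Jsoup program.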
So if I were you, I’d just emulate that by making the POST request yourself, with the page param counting up until you’ve got all the content you want. By the way, the content returned was an HTML fragment, so it’s easy to parse, e.g.: