I’m trying to parse a bunch of webpages from an adult website using Ruby:
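require 'hpricot'
require 'open-uri'
# URI.open is the open-uri entry point on current Ruby; a bare open() no longer accepts URLs there.
doc = Hpricot(URI.open('random page on an adult website'))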
However, instead of the page I asked for, I get the initial ‘Site Agreement’ page that makes sure you’re 18+, etc.
How do I get past the Site Agreement and pull the webpages I want? (If there’s a way to do it, any language is fine.)
You’re going to have to figure out how the site detects that a visitor has accepted the agreement.
The most obvious mechanism is a cookie. When a visitor accepts the agreement, the site most likely sends a cookie to the browser, and the browser passes it back on every subsequent request.
You’ll have to get your script to act like a visitor by accepting the cookie and sending it with every subsequent request. That means writing code that requests the “accept agreement” page first, finds the cookie in the response, and stores it for reuse. It’s likely that they don’t use a dedicated cookie for the agreement, but rather record the acceptance in a server-side session, in which case the cookie you need is the session cookie.
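The steps above could be sketched roughly as follows, using only Ruby's standard library. This is a sketch under assumptions, not a tested solution for any particular site: `cookie_header` and `fetch_after_agreement` are hypothetical helper names, the two URLs are values you would have to find by inspecting the site, and it assumes the agreement is accepted with a plain GET (many sites use a POST form instead, in which case the first request would need to change).

```ruby
require 'net/http'
require 'uri'

# Turn raw Set-Cookie header values into a single Cookie request header.
# Only the leading "name=value" pair of each value is kept; attributes such
# as Path, Expires and HttpOnly are instructions to the client and are not
# sent back to the server.
def cookie_header(set_cookie_values)
  set_cookie_values.map { |value| value.split(';', 2).first.strip }.join('; ')
end

# Hypothetical helper: request the "accept agreement" URL, then reuse
# whatever cookies that response set when fetching the page we want.
# Both URLs are assumptions supplied by the caller.
def fetch_after_agreement(accept_url, page_url)
  accept_response = Net::HTTP.get_response(URI(accept_url))
  cookies = cookie_header(accept_response.get_fields('set-cookie') || [])

  page_uri = URI(page_url)
  request = Net::HTTP::Get.new(page_uri)
  request['Cookie'] = cookies unless cookies.empty?

  Net::HTTP.start(page_uri.host, page_uri.port,
                  use_ssl: page_uri.scheme == 'https') do |http|
    http.request(request).body
  end
end
```

The returned HTML string can then be handed to the parser, e.g. `doc = Hpricot(fetch_after_agreement(accept_url, page_url))`. If the site sets cookies across several redirects or requires form fields, a library that keeps a cookie jar for you (Mechanize, for instance) is probably less work than doing this by hand.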