I am using SimpleHtmlDOM PHP quite successfully to scrape some of my favorite webpages. Some of these pages, however, require me to log in before I can get at the information that I really care about. Does anyone know how (or whether it’s possible) to get this library to access a page that requires a username and password to be entered first? Everything I’ve done to date starts with something like…
$html = file_get_html('http://www.google.com/');
Very few sites use authentication mechanisms that are identical, so there’s no one way to always authenticate with a site.
Your best bet will be to use cURL and make your scraper look like a real browser. This means storing cookies between requests (cURL can keep them in a cookie file/jar via the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options), requesting the login form, submitting it successfully, and then continuing to use that "browser" session to perform your scraping.
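As a rough illustration of that flow, here is a minimal sketch using PHP's cURL extension. The helper names (make_session, fetch, login_and_fetch), the user-agent string, and any URLs or form field names you pass in are assumptions for illustration only; a real site will have its own login URL and field names, and many login forms also include hidden fields (such as CSRF tokens) that have to be read from the form and posted back.

```php
<?php
// Sketch only: helper names are hypothetical, not part of cURL or SimpleHtmlDOM.

/**
 * Create a cURL handle that acts like one browser session:
 * cookies are loaded from and saved to the same cookie-jar file.
 */
function make_session(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow post-login redirects
        CURLOPT_COOKIEJAR      => $cookieFile, // where cookies are written
        CURLOPT_COOKIEFILE     => $cookieFile, // where cookies are read from
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // assumed value
    ]);
    return $ch;
}

/**
 * Fetch a URL on an existing session. If $postFields is given,
 * the request is sent as a form POST; otherwise as a GET.
 */
function fetch($ch, string $url, ?array $postFields = null): string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($postFields !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    } else {
        curl_setopt($ch, CURLOPT_HTTPGET, true); // reset to GET after an earlier POST
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $body;
}

/**
 * Visit the login form, submit the credentials, then fetch the
 * protected page with the same session. Returns that page's HTML.
 */
function login_and_fetch(string $loginUrl, array $credentials, string $targetUrl, string $cookieFile): string
{
    $ch = make_session($cookieFile);
    try {
        fetch($ch, $loginUrl);               // pick up any initial session cookie
        fetch($ch, $loginUrl, $credentials); // submit the login form
        return fetch($ch, $targetUrl);       // scrape as the logged-in user
    } finally {
        curl_close($ch);                     // has no effect on PHP 8+, frees the handle on PHP 7
    }
}
```

The string returned by login_and_fetch() can then be handed to SimpleHtmlDOM with str_get_html() instead of calling file_get_html() on the URL directly, so the rest of the scraping code stays the same.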
Please make sure that the sites don’t mind being scraped in this way. If discovered, you may be banned from the site depending on how much the site owners dislike scraping.