For a company project, I need to create a web scraping application with PHP and JavaScript (including jQuery) that will extract specific data from each page of our clients’ websites. The scraping app needs to get two types of data for each page: 1) determine whether certain HTML elements with specific IDs are present, and 2) extract the value of a specific JavaScript variable. The JS variable name is the same on each page, but the value is usually different.
I believe I know how to meet the first requirement: use the PHP file_get_contents() function to fetch each page’s HTML, then use JavaScript/jQuery to parse that HTML and search for elements with the specific IDs. However, I’m not sure how to get the second piece of data: the JavaScript variable values. The JavaScript variable isn’t even found within each page’s HTML; instead, it is defined in an external JavaScript file that is linked to the page. And even if the JavaScript were embedded in the page’s HTML, I know that file_get_contents() would only return the JavaScript source code (along with the rest of the HTML), not any variable values.
Can anyone suggest a good approach to getting this variable value for each page of a given website?
EDIT: Just to clarify, I need the values of the JavaScript variables after the JavaScript code has been run. Is such a thing even possible?
Presumably this is impossible, because it seems so simple, but if it’s your own .js file you’re trying to detect, why not just have that script write something to the page that a scrape can detect?
Use the JS to populate a tag somewhere on the page (via element.innerHTML, presumably), so that the value ends up in the page’s markup.
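A minimal sketch of what that could look like, not a definitive implementation. The element id `scrape-value` and the helper names `escapeHtml`, `buildMarkerTag` and `publishValue` are made up for illustration; substitute whatever your script actually uses.

```javascript
// Hypothetical id for the hidden tag that will carry the value.
var MARKER_ID = "scrape-value";

// Escape characters that would otherwise break the surrounding HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the markup for a hidden tag containing the value.
function buildMarkerTag(value) {
  return '<span id="' + MARKER_ID + '" style="display:none">' +
    escapeHtml(value) + "</span>";
}

// In a browser, append the tag to the page once the variable has its
// final value. Returns false when there is no DOM to write to.
function publishValue(value) {
  if (typeof document === "undefined" || !document.body) {
    return false;
  }
  var holder = document.createElement("div");
  holder.innerHTML = buildMarkerTag(value);
  document.body.appendChild(holder.firstChild);
  return true;
}
```

You would call something like `publishValue(yourVariable)` at the end of the external script. Note that the tag only exists in the rendered DOM, so whatever does the scraping has to read the page after the script has run (for example jQuery running in a browser), not the raw source returned by file_get_contents().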
Edit: alternatively, maybe use document.write, if the script needs to be detectable on load.