pjscrape is great for scraping, but I’m having a hard time figuring out how to pass arguments to my scraper.
It takes command line arguments for config files but I’m not sure what the scope of my config functions/variables is.
I would love to have a config per domain that includes base URLs, selectors, etc., and build a fairly generic scraper that reads from this config.
How can I do that?
Pjscrape will evaluate all arguments as config files in the global scope, and you can pass in as many config files as you want. So configuring the per-domain scrapers in one or multiple files should be straightforward. For example:
base_config.js
my_site.js
scraper.js
Then invoke like:
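As a rough sketch of how the three files and the command line could fit together: the contents below are illustrative only. The siteConfig object, its fields, and the example.com URL are hypothetical names chosen for this sketch, and the typeof guard is there only so the snippet also loads outside Pjscrape. The three files are shown in one listing, separated by comments.

```javascript
// --- base_config.js: defaults shared by every site (hypothetical names) ---
var siteConfig = {
    baseUrl: '',
    path: '/',
    selector: 'h1'
};

// --- my_site.js: per-domain overrides, loaded after base_config.js ---
siteConfig.baseUrl = 'http://www.example.com';
siteConfig.path = '/about';
siteConfig.selector = 'h2.title';

// --- scraper.js: generic scraper that only reads siteConfig ---
var suite = {
    url: siteConfig.baseUrl + siteConfig.path,
    // A selector string rather than a function, so nothing has to
    // cross into the sandboxed page context.
    scraper: siteConfig.selector
};

// pjs is the global that Pjscrape provides to config files; the guard
// only lets this sketch load without errors when pjs is absent.
if (typeof pjs !== 'undefined') {
    pjs.addSuite(suite);
}

// Invocation, with config files listed in load order
// (adjust the path to pjscrape.js for your setup):
//   phantomjs pjscrape.js base_config.js my_site.js scraper.js
```

Because every file is evaluated in the same global scope, swapping my_site.js for another site's file is enough to retarget the scraper.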
The tricky part here is when you want to use scraper functions, not just selectors. PhantomJS runs the scrapers in a “sandboxed” environment, which does not have access to your global scope variables. So this will not work:
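A minimal sketch of the failure, runnable outside PhantomJS: it imitates the sandbox by rebuilding a function from its source text, which drops whatever the function closed over. The names runInSandbox, demo, and siteSelector are hypothetical, and a function-local variable stands in for a config global that the page context cannot see.

```javascript
// Hypothetical stand-in for the sandbox: the function is rebuilt from
// its source text, so it keeps its code but loses its closure.
function runInSandbox(fn) {
    var rebuilt = new Function('return (' + fn.toString() + ')();');
    return rebuilt();
}

function demo() {
    // Stands in for a variable defined in a config file.
    var siteSelector = 'h1';

    // Works when called directly, because the closure is intact here.
    var scraper = function () {
        return 'scraping ' + siteSelector;
    };

    try {
        return runInSandbox(scraper);
    } catch (e) {
        // siteSelector does not exist where the rebuilt function runs.
        return e.name + ': ' + e.message;
    }
}

var outcome = demo(); // a ReferenceError description, not 'scraping h1'
```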
This is a trivial example, but you get the idea. PhantomJS now has native support for passing in arguments to page.evaluate, but these aren’t built into Pjscrape yet. There are basically two ways to deal with this:

1. Always deal with functions that don’t need access to the outer scope. So each site config file would specify full scraper functions, not just pluggable variables.
2. Create your scrapers with new Function("..."), passing in your variables as you create the string. This is how Pjscrape does it under the hood, but fair warning – it can get ugly quickly in all but the most straightforward cases. One method I’ve used here is to use Function#toString and pass in arguments. This might look like this:
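One way the Function#toString approach could be sketched: write the scraper as a template function that receives everything through parameters, then bake JSON-serialized arguments into a new function's source. The helper name bindArgs and the template titleScraper are hypothetical, and the template returns a plain string so the sketch runs anywhere; a real scraper body would use the page's own tools, for example jQuery.

```javascript
// Hypothetical helper: builds a zero-argument function whose source
// contains the template plus literal, JSON-serialized arguments.
function bindArgs(fn, args) {
    var argList = args.map(function (arg) {
        return JSON.stringify(arg);
    }).join(', ');
    return new Function('return (' + fn.toString() + ')(' + argList + ');');
}

// Template scraper: everything it needs arrives as a parameter, so it
// never touches the outer scope. A real body might instead be
// something like: return $(selector).text();
function titleScraper(selector, prefix) {
    return prefix + selector;
}

// The resulting function is self-contained and survives being
// serialized and re-created elsewhere.
var scraper = bindArgs(titleScraper, ['h1', 'Title: ']);

// With Pjscrape, this function would then be used as the suite's scraper.
var result = scraper(); // 'Title: h1'
```

This only works for arguments that survive JSON.stringify (strings, numbers, booleans, plain objects and arrays); functions or DOM nodes cannot be passed this way.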