If there are other classes written to do this, a link would be awesome. If not, how can I do it with PHPCrawl?
Is it possible to store specific information from a crawled site based on a set of rules specific to that site? For example, [div.wantThis, img#defaultPicture] would be the array returned for site A, and only [div.shortTextContent] would be the array returned for site B.
In PHPCrawl, how can I get this information out of the $page_data array?
Needs
Must be able to target only certain elements.
Able to read the data storage rule from a variable (which could be an array specifying the element(s) to target).
What you are asking is how to parse specific content from site A and some other specific content from site B using PHPCrawl.
For site-specific parsing, an if-else check on the site being crawled can select which set of rules applies.
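The if-else approach could look roughly like the sketch below. The host names `site-a.example` and `site-b.example` and the function name `rulesForHost` are hypothetical, chosen only for illustration; the selectors are the ones from the question. No PHPCrawl call is used here, so the host is derived from a URL with PHP's built-in `parse_url`.

```php
<?php
// Return the list of element selectors to store for a given host.
// The hosts below are placeholders; replace them with the real sites.
function rulesForHost(string $host): array
{
    if ($host === 'site-a.example') {
        return ['div.wantThis', 'img#defaultPicture'];
    } elseif ($host === 'site-b.example') {
        return ['div.shortTextContent'];
    } else {
        // Unknown site: no rules, so nothing is stored.
        return [];
    }
}

// Inside the crawler's page handler, the URL of the page that was
// just received would be used here instead of this fixed example URL.
$host  = parse_url('http://site-a.example/page.html', PHP_URL_HOST);
$rules = rulesForHost($host);
print_r($rules);
```

Because the rules are returned as a plain array, they can just as well be read from a variable or a config array keyed by host once the number of sites grows.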
To extract the specific content, the following algorithm can be used: load the page source into a DOM parser, then query it for each targeted element.
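A minimal sketch of that extraction step follows. It uses PHP's built-in `DOMDocument` and `DOMXPath` rather than any particular third-party DOM library, and it only understands the simple `tag.class` and `tag#id` selectors used in the question. The function names `selectorToXPath` and `extractBySelectors` are made up for this example, and the HTML string stands in for the page source that the crawler passes to its handler.

```php
<?php
// Translate a simple "tag#id" or "tag.class" selector into XPath.
// Anything else is assumed to be a bare tag name.
function selectorToXPath(string $selector): string
{
    if (preg_match('/^(\w+)#([\w-]+)$/', $selector, $m)) {
        return "//{$m[1]}[@id='{$m[2]}']";
    }
    if (preg_match('/^(\w+)\.([\w-]+)$/', $selector, $m)) {
        // Match the class as a whole word inside the class attribute.
        return "//{$m[1]}[contains(concat(' ', normalize-space(@class), ' '), ' {$m[2]} ')]";
    }
    return '//' . $selector;
}

// Return an array keyed by selector; each value lists the matches.
function extractBySelectors(string $html, array $selectors): array
{
    $dom = new DOMDocument();
    // Crawled HTML is rarely valid, so silence libxml parse warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $result = [];
    foreach ($selectors as $selector) {
        $result[$selector] = [];
        foreach ($xpath->query(selectorToXPath($selector)) as $node) {
            // Images carry no text, so keep their src attribute instead.
            $result[$selector][] = $node->nodeName === 'img'
                ? $node->getAttribute('src')
                : trim($node->textContent);
        }
    }
    return $result;
}

$html = '<html><body><div class="wantThis">Hello</div>'
      . '<img id="defaultPicture" src="/a.png">'
      . '<div class="other">Skip</div></body></html>';
print_r(extractBySelectors($html, ['div.wantThis', 'img#defaultPicture']));
```

The selector list passed as the second argument can come from any variable, which covers the requirement of reading the storage rule per site.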
Note: There is a spectrum of parsing techniques available; I am implementing HTML DOM parsing here.
Reference:
HTML DOM
PHPCrawl Example