I’m considering developing a site where the server will crawl another site periodically, in order to gather content for certain entries in my database. My questions are as follows…
- How do you get the server to run a crawl on a schedule?
- Can you get it to execute PHP, or what language do you use to perform the crawl?
- Are there any good APIs to do this?
- Should I consider building my own? If so, some advice on how to get started would be great
Basically, what I want is for the server to execute a script (say every hour) which finds all entries in the database that haven’t yet been crawled on the other site. It will take a certain value from each of those entries and use it to crawl the other site… it might request a URL like this: www.anothersite.com/images?q=entryindb.
What I want it to do is then crawl the HTML, return an array, and log the values in the database. This is what I want the crawler to look for:
Find all instances of
<img> inside <a> inside <td> inside <tr> inside <tbody> inside <table> inside <div id='content'>
Return array of the img.src from all instances.
Is something like that possible? – If so, how would I go about doing it? – Please bear in mind that web dev wise, the only experience I have so far (server-side) is with PHP.
UPDATE: I will be using a Linux-based server, so I guess a cron job is how I should do it?
1. You need phpQuery to make your life easier with this.
Download phpQuery-0.9.5.386-onefile.zip from here.
2. Your PHP file would be something like this:
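A minimal sketch of such a file is given here. Note that it is not the phpQuery version: it assumes PHP's bundled DOM extension (DOMDocument and DOMXPath) so that it runs without the download. The function name extractImageSources and the idea of passing the URL as a command-line argument are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: returns the src of every <img> that sits inside
// an <a> inside a <td> inside a <tr> inside a <table> inside <div id='content'>.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them, then discard them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Descendant steps (//) are used so the query matches whether or not
    // the raw markup contains an explicit <tbody>.
    $nodes = $xpath->query("//div[@id='content']//table//tr//td//a//img");

    $images = [];
    foreach ($nodes as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $images[] = $src;
        }
    }
    return $images;
}

// Assumed usage: php crawler.php "http://www.anothersite.com/images?q=entryindb"
// (the URL is the placeholder from the question, not a real endpoint).
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $html = file_get_contents($argv[1]);
    if ($html !== false && $html !== '') {
        $images = extractImageSources($html);
        print_r($images);
    }
}
```

With phpQuery the XPath expression would be replaced by a CSS-style selector; the surrounding structure (fetch the page, select the images, collect each src) stays the same.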
The $images array will have a list of all the image sources.
3. Save the above code in a file, say crawler.php. Then in the crontab, if you want the crawler to run every hour, you would do:
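An hourly entry might look like the following. Both paths are assumptions: /usr/bin/php is a common location for the PHP command-line binary, and /path/to/crawler.php stands in for wherever the script was saved. Running `crontab -e` opens the current user's crontab for editing.

```
# minute hour day-of-month month day-of-week  command
# Run at minute 0 of every hour (paths are placeholders).
0 * * * * /usr/bin/php /path/to/crawler.php
```

The five leading fields are minute, hour, day of month, month and day of week, so `0 * * * *` fires once at the start of each hour.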