As part of my web app, I built a system that periodically pulls an RSS feed and scrapes its content. I also look for any image tags present in each feed item and attempt to fetch each image to query its size and such, to determine which “picture” to use.
Here is a rough sketch of that part of the code:
1. Is there an <image> node? If so, that is the image. Exit.
2. Parse the content of the description node through simplehtmldom and look for any and all img tags.
3. Iterate through all img tags:
   - getimagesize();
   - If the image size is greater than one I found earlier, use this picture.
4. Exit.
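For illustration, the steps above could be sketched roughly like this. It is a minimal sketch under a few assumptions, not my actual code: it uses PHP's built-in DOMDocument instead of simplehtmldom so that it is self-contained, it treats "size" as pixel area (width times height), and the names pickLargestImage and $sizeOf are hypothetical. The $sizeOf callable only exists so the lookup can be swapped out; by default it wraps getimagesize().

```php
<?php
// Hypothetical helper: returns the URL of the image to use, or null if none is found.
// $feedImageUrl is the URL from the feed's <image> node, if there is one.
// $sizeOf is an optional callable returning [width, height] or null for a URL.
function pickLargestImage(string $descriptionHtml, ?string $feedImageUrl = null, ?callable $sizeOf = null): ?string
{
    // Step 1: an <image> node wins outright.
    if ($feedImageUrl !== null && $feedImageUrl !== '') {
        return $feedImageUrl;
    }

    // Default lookup: getimagesize() returns false on failure,
    // otherwise an array whose first two entries are width and height.
    $sizeOf = $sizeOf ?? function (string $url): ?array {
        $info = @getimagesize($url);
        return $info === false ? null : [$info[0], $info[1]];
    };

    // Step 2: parse the description and collect every <img> tag.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // feed HTML is often sloppy
    $doc->loadHTML('<div>' . $descriptionHtml . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 3: keep the image with the largest area seen so far (assumed metric).
    $bestUrl = null;
    $bestArea = 0;
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src === '') {
            continue;
        }
        $size = $sizeOf($src);
        if ($size === null) {
            continue; // could not read this image, skip it
        }
        $area = $size[0] * $size[1];
        if ($area > $bestArea) {
            $bestArea = $area;
            $bestUrl = $src;
        }
    }

    // Step 4: done.
    return $bestUrl;
}
```

Each pass through the loop makes one getimagesize() call, which is where the time in step 3 goes.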
At step 3, the script can take a while, especially for feeds that have lots of images for me to check. I assume that each call to getimagesize() takes a certain amount of time and that it adds up quickly. I’m not too worried about it taking a long time (although if it could be reduced, that would be best). What worries me is that while this script is running, it effectively leaves all other concurrent users hanging until it has finished.
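One way the per-call cost might be bounded is sketched below. It assumes two things: that remote images are fetched through PHP's stream wrappers, so the default_socket_timeout ini setting applies to them, and that remembering sizes for URLs already checked is acceptable. The function name cachedImageSize and the 5-second default are hypothetical. This only shortens the run; it does not by itself stop other requests from waiting.

```php
<?php
// Hypothetical helper: look up an image size with a bounded wait and a cache.
// $cache maps URL => [width, height], or null when the lookup failed.
function cachedImageSize(string $url, array &$cache, int $timeoutSeconds = 5): ?array
{
    // Reuse an earlier result (including an earlier failure) without any I/O.
    if (array_key_exists($url, $cache)) {
        return $cache[$url];
    }

    // Temporarily lower the socket timeout (assumed to govern URL fetches).
    $old = ini_set('default_socket_timeout', (string) $timeoutSeconds);
    $info = @getimagesize($url);
    if ($old !== false) {
        ini_set('default_socket_timeout', $old); // restore the previous value
    }

    $cache[$url] = $info === false ? null : [$info[0], $info[1]];
    return $cache[$url];
}
```

The cache here is a plain array that lives for one run; persisting it between runs would need a file or database, which is left out of this sketch.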
I’d like to avoid this, but am not too proficient at server admin – perhaps someone could give me some guiding pointers?
Thanks!
Run it on a separate server if you need the performance boost.
getimagesize() can really slow things down. I’d recommend running the scraping script on its own server and hosting everything else on your current server.