Is it recommended to work with persistent connections when screen-scraping? What are the possible advantages/disadvantages?
I’m using PHP/cURL to scrape.
It won’t make much of a difference. The real performance decision you need to make is whether to scrape concurrently, because, persistent or not, a single connection can only process one request/response at a time.
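For reference, here is a minimal sketch of what "persistent" usually means with PHP/cURL: reusing one cURL handle for several requests, so that libcurl can keep the underlying connection open when the server allows it. The function name `fetchSequential` and the option values are assumptions for illustration, not something from the question. Note that the requests still run strictly one after another.

```php
<?php
// Hypothetical helper (not from the original post): fetch URLs one after
// another over a single, reused cURL handle. Reusing the handle lets libcurl
// keep the connection to the same host open between requests when possible.
function fetchSequential(array $urls): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 30,   // assumed timeout in seconds
    ]);

    $bodies = [];
    foreach ($urls as $url) {
        // Only the URL changes; the handle (and its connection cache) is kept.
        curl_setopt($ch, CURLOPT_URL, $url);
        $body = curl_exec($ch);
        $bodies[$url] = ($body === false) ? null : $body; // null marks a failed request
    }

    // The handle is released when $ch goes out of scope.
    return $bodies;
}
```

Calling `fetchSequential(['https://example.com/a', 'https://example.com/b'])` (placeholder URLs) would return an array keyed by URL. Any saving comes only from skipping repeated connection setup to the same host.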
Which brings me to my next point:
PHP is probably the wrong tool for this job. It’s not very good at concurrency, or at least the default build isn’t.
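If you stay with PHP anyway, the cURL extension's multi interface (`curl_multi_*`) is one way to run several transfers in parallel from a single process. The following is only a sketch under assumptions: the function name `fetchConcurrent` and the timeout are made up for illustration, and error handling is deliberately thin.

```php
<?php
// Hypothetical helper (not from the original post): run all transfers in
// parallel using the cURL multi interface.
function fetchConcurrent(array $urls, int $timeout = 30): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,     // needed for curl_multi_getcontent()
            CURLOPT_TIMEOUT        => $timeout, // assumed per-transfer timeout
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0); // wait for socket activity, up to 1 second
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        // A failed transfer typically yields an empty or null body here; real
        // code would inspect curl_multi_info_read() for per-transfer errors.
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $bodies;
}
```

This keeps everything in one process and one thread, so parsing the responses is still sequential; only the network waits overlap. For heavier workloads, the point above about choosing a tool built for concurrency still applies.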