I need to scrape some websites, and would like to avoid downloading images from the pages I am scraping – I only need the text. I am hoping this will speed up the process. Any ideas on how to manage this?
Thanks,
Jon
While scraping, you do not download the images themselves, only the IMG tags that reference them, along with the rest of the body. You can always remove the IMG tags on the server side before storing the content in your database or rendering it to the view. I would suggest using nokogiri to parse the content you receive and remove all occurrences of the IMG tag.
This, however, does not speed up the process: what you scrape is just plain old HTML. If you want fast fetching and parsing, go for Feedzirra if you are dealing with feeds, or Typhoeus for fetching just the HTML content.