I’m new to web development (and development in general) and I’m building a Rails app that scrapes data from a third-party website. I’m using Nokogiri to parse out the specific HTML elements I’m interested in, and those elements are stored in a database.
However, I’d also like to save the HTML of each whole page I scrape as a backup, in case I change my mind about what type of information I want, or in case the website removes the page (or updates it).
What’s the best practice for storing the archived HTML?
Should I extract it as a string and put it in a database, write it to a log or text file, or what?
Edit:
I should have clarified a bit. I am crawling on the order of 10K websites a week and anticipate only needing to access the backups on a one-off basis if I redefine the type of data I want.
So as an example, if I was crawling UN data on country populations and was originally looking at age distributions, but later realized I wanted the gender distributions as well, I’d want to go back to all my HTML archives and pull that data out. I don’t anticipate this happening much (maybe 1-3 times a month), but when it does I’ll want to retrieve it across 10K-100K listings. The task should only take a few hours for around 10K records, so I guess retrieving each archived page should take at most a second. I don’t need any versioning capability. Hope this clarifies.
I’m not sure what the “best practice” for this case is (it will vary by the specifics of your project), but as a starting point I’d suggest creating a model with a string field for the URL and a text field for the HTML itself, and saving the pages there. You might add a uniqueness validator for the URL, to make sure you don’t store the same page twice.
You could then optionally add model methods to instantiate a Nokogiri document from the HTML text, thus using the HTML string as the “master” record (in the DB) and generating the Nokogiri document on the fly when needed. But again, as @dave-newton points out, a lot of this will depend on what you’re going to do with this HTML.
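To make the shape of that suggestion concrete, here is a minimal plain-Ruby sketch rather than actual Rails code. The names `ArchivedPage` and `PageArchive` are hypothetical, and the in-memory hash only stands in for a database table. In a real Rails app, `ArchivedPage` would presumably be an ActiveRecord model with a string `url` column and a text `html` column, and the duplicate check would be `validates :url, uniqueness: true` (ideally backed by a unique index).

```ruby
# Plain-Ruby stand-in for the suggested model (hypothetical names).
# In Rails this would be an ActiveRecord model with `url` and `html` columns.
class ArchivedPage
  attr_reader :url, :html

  def initialize(url:, html:)
    @url = url
    @html = html
  end

  # The raw HTML string is the "master" record; the Nokogiri document is
  # built on demand and memoized. Nokogiri is required lazily, so the gem
  # is only needed when a page is actually re-parsed.
  def document
    @document ||= begin
      require "nokogiri"
      Nokogiri::HTML(html)
    end
  end
end

# In-memory stand-in for the table, enforcing one archived page per URL.
class PageArchive
  def initialize
    @pages = {}
  end

  # Stores the page and returns it, or returns nil if the URL is already archived.
  def save(url:, html:)
    return nil if @pages.key?(url)

    @pages[url] = ArchivedPage.new(url: url, html: html)
  end

  def find(url)
    @pages[url]
  end

  def size
    @pages.size
  end
end
```

With something like this in place, re-extracting a new field later would amount to looping over the archived pages and running a new selector against `page.document` (for example `page.document.css("table.gender")`, where the selector is purely illustrative), without touching the original site again.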