I’m creating a web service that often scrapes data from remote web pages. After scraping this data, I have a simple multidimensional array of information to use. The scraping process is fairly taxing on my server, and the page load takes a while. I was considering adding a simple cache system using a MySQL database, where I create one row per remote web page with a the array of information pulled from it stored as a JSON encoded string. Is this a good enough system? Or would something like a text file per web page be a better idea?
Share
Since you’re scraping multiple web pages, and you want to your data to be persistently cached, you have a few options — the best of which would be to use memcache or a database such as MySQL. Using text files is not a good idea, because you would have to serialize / deserialize your data, and read from your filesystem. To query a database or a memcache is many times more efficient.
Since you’re probably looking for your cache to be somewhat persistent, I would suggest going with MySQL. You would simply create a table that has an auto-incrementing primary key, which a column for each element in your parsed JSON object. (Note that MySQL currently does not support arrays. In order to emulate them, you will need to use relational tables, or serialize your array data and provide it to a text field. The former method is preferred).
Every time you scrape a page, you would run an
UPDATEstatement to update that individual page’s information in the database. If you specify a unique index on whatever you use to uniquely identify your page (URL / etc), you will achieve optimal look-up performance.