I am writing a cron script that will routinely go through rows in a specific table, parse the text, and then generate system-based tags for use in other operations.
This table is the lifeblood of our site and is quite large. I am wondering whether it is better to have the cron script work directly with this table, or to siphon the text to be parsed off to another table that my cron script can safely work with.
Here’s a diagram of my thoughts:
Option 1:
Table 1: "blogs"
Table 2: "blog tags"
** cron script 'scrapes' the blogs table, marks each scraped blog to prevent duplicate scrapes, and then puts tags in the blog tags table
Option 2:
Table 1: "blogs"
Table 2: "blogs to be parsed"
Table 3: "blog tags"
** when blogs are posted, some of their text and metadata is also inserted into "blogs to be parsed", which is the only table the cron script will have to then deal with.
Is there a performance or safety benefit to adding a layer of abstraction like this?
I don’t see any benefit to the extra table unless you are worried your script will accidentally mess up data in the blogs table. Since this is essentially a read-only operation (apart from setting the flag), you’d need a serious bug for that to happen. Using the original table also means you are using the indexes defined there, so it should be fast.
You are just reading rows and setting a flag. Your blog engine is doing way more with the same table.
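To make Option 1 concrete, here is a minimal sketch of one cron pass that works directly on the blogs table. It uses Python's built-in sqlite3 only so it runs anywhere; your real database will differ. The column names (`body`, `scraped`), the `blog_tags` layout, and the `extract_tags` function are all assumptions of mine, not anything from your schema.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical schema: a "scraped" flag lives on the blogs table itself.
    conn.executescript("""
        CREATE TABLE blogs (
            id INTEGER PRIMARY KEY,
            body TEXT NOT NULL,
            scraped INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE blog_tags (
            blog_id INTEGER NOT NULL REFERENCES blogs(id),
            tag TEXT NOT NULL
        );
    """)

def extract_tags(text):
    # Placeholder tagger (an assumption): unique lowercased words longer than 4 characters.
    words = (w.strip(".,!?").lower() for w in text.split())
    return sorted({w for w in words if len(w) > 4})

def scrape_pass(conn, batch_size=100):
    # Read a small batch of unscraped blogs.
    rows = conn.execute(
        "SELECT id, body FROM blogs WHERE scraped = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for blog_id, body in rows:
        # One short transaction per blog: insert its tags, then set the flag.
        with conn:
            conn.executemany(
                "INSERT INTO blog_tags (blog_id, tag) VALUES (?, ?)",
                [(blog_id, tag) for tag in extract_tags(body)],
            )
            conn.execute("UPDATE blogs SET scraped = 1 WHERE id = ?", (blog_id,))
    return len(rows)
```

Each blog gets its own short transaction, so the single-row flag update is held only briefly, and a second pass finds nothing left to do.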
Edit
If you are worried about writing the flag to the blogs table, address that directly. Create a table with a foreign key to blogs and a bit that says updated. Now the cron job doesn’t write to the blogs table at all. With InnoDB at its default isolation level, plain SELECTs are consistent non-locking reads, so the cron job’s reads won’t block your writers.
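A rough sketch of that side table, again in Python with sqlite3 purely so it is runnable. The table name `blog_parse_status`, the column names, and the helper functions are hypothetical, so adapt them to your schema and database.

```python
import sqlite3

# Hypothetical schema: the "updated" bit lives in its own table, keyed to blogs.
SCHEMA = """
CREATE TABLE blogs (
    id INTEGER PRIMARY KEY,
    body TEXT NOT NULL
);
CREATE TABLE blog_parse_status (
    blog_id INTEGER PRIMARY KEY REFERENCES blogs(id),
    updated INTEGER NOT NULL DEFAULT 0
);
"""

def unparsed_blogs(conn, limit=100):
    # The LEFT JOIN finds blogs with no status row yet, or with the bit still 0.
    # The blogs table is only ever read here.
    return conn.execute(
        """
        SELECT b.id, b.body
        FROM blogs AS b
        LEFT JOIN blog_parse_status AS s ON s.blog_id = b.id
        WHERE s.blog_id IS NULL OR s.updated = 0
        ORDER BY b.id
        LIMIT ?
        """,
        (limit,),
    ).fetchall()

def mark_parsed(conn, blog_id):
    # The only write goes to the side table, never to blogs.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO blog_parse_status (blog_id, updated) VALUES (?, 1)",
            (blog_id,),
        )
```

The cron job would call `unparsed_blogs`, generate tags for each row, and then call `mark_parsed`. The trade-off is an extra join on every pass in exchange for keeping all cron writes off the main table.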
However, unless the blog table is being hit hundreds of times per second, the update is so fast it will be imperceptible. As long as you are using InnoDB or another storage engine that supports row-level locking, you are good.
Also, make sure you run your updates at a low-traffic time, such as midnight.