First, I’ll explain what I need to do, then how I think I can achieve it. My current plan seems very inefficient in theory, so my question is whether there is a better way of accomplishing it.
I have two tables – let’s call them ‘Products’ and ‘Products_Temp’ – with identical structures. I need to download a large number of files (XML or XLS) which contain product details (stock, pricing etc.) from suppliers. These are then parsed into the Products_Temp table. Right now, I plan to use CF Scheduled Tasks to handle the downloading, and Navicat to do the actual parsing – I’m happy that this is adequate and efficient enough.
The next step is where I’m struggling – once the file has been downloaded and parsed, I need to look for any changes in the data. This will be compared against the Products table. If a change is found, then that row should be added or updated (if it should be removed, then I’ll need to flag it rather than just delete it). Once all the data has been compared, the products_temp table should be emptied.
I’m aware of methods to compare tables and sync them accordingly, however the issue I have is the fact I’ll be handling multiple files from different sources. I had considered using only the products table and append/update, but I’m unsure how I could manage the ‘flag deleted’ requirement.
Right now, the only way I know I can make it work is to loop through the products_temp table, run several cfquery calls for each row, and delete the row once complete. However, that seems incredibly inefficient, and given that we’re likely to be dealing with hundreds of thousands of rows, it is unlikely to be effective if we update everything daily.
Any pointers or advice on a better route would be appreciated!
Both responses have possibilities. Just to expand on your options a little...
Option #1
If MySQL supports some sort of hashing on a per-row basis (for example, MD5() over the concatenated columns), you could use a variation of comodoro’s suggestion to avoid hard deletes.
Identify Changed
To identify changes, do an inner join on the primary key and check the hash values. If they are different, the product was changed and should be updated:
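A minimal sketch of that hash comparison, written in Python against an in-memory SQLite database so it can be run as-is. The column names (name, price, stock) and the row_hash helper are assumptions for illustration only. In MySQL you would likely compute the hash in SQL instead, e.g. MD5(CONCAT_WS('|', ...)), and the same join can drive a multi-table UPDATE directly.

```python
import hashlib
import sqlite3


def row_hash(*values):
    """Hypothetical helper: hash the compared columns of one row.

    NULLs are replaced by an explicit marker so they still affect the hash.
    """
    joined = "|".join("<NULL>" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.create_function("row_hash", -1, row_hash)  # expose the helper to SQL
conn.executescript("""
    CREATE TABLE products      (id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER);
    CREATE TABLE products_temp (id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER);
    INSERT INTO products      VALUES (1, 'Widget', 9.99, 10), (2, 'Gadget', 4.50, 0);
    INSERT INTO products_temp VALUES (1, 'Widget', 9.99, 10), (2, 'Gadget', 4.75, 3);
""")

# Inner join on the primary key; a differing hash means the product changed.
changed_ids = [row[0] for row in conn.execute("""
    SELECT p.id
    FROM products p
    INNER JOIN products_temp t ON t.id = p.id
    WHERE row_hash(p.name, p.price, p.stock) <> row_hash(t.name, t.price, t.stock)
    ORDER BY p.id
""")]

# Set-based update of the changed rows only (no row-by-row loop).
conn.execute("""
    UPDATE products
    SET name  = (SELECT t.name  FROM products_temp t WHERE t.id = products.id),
        price = (SELECT t.price FROM products_temp t WHERE t.id = products.id),
        stock = (SELECT t.stock FROM products_temp t WHERE t.id = products.id)
    WHERE EXISTS (
        SELECT 1
        FROM products_temp t
        WHERE t.id = products.id
          AND row_hash(t.name, t.price, t.stock)
              <> row_hash(products.name, products.price, products.stock)
    )
""")
```

If the hash is stored in its own column at import time, the comparison becomes a plain column comparison and the per-row hashing cost is only paid once.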
Identify Deleted
Use a simple outer join from Products to the temp table to identify records that do not exist in the temp table, and flag them as “deleted”.
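Roughly, the soft delete could look like the sketch below (Python with in-memory SQLite so it runs as-is). The is_deleted column is an assumption; any flag or status column would do. The outer join is used to list the missing rows, and the flagging is written with NOT EXISTS because that form also avoids MySQL's restriction on selecting from the table being updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- is_deleted is a hypothetical soft-delete flag
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT,
        is_deleted INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE products_temp (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO products (id, name) VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Doohickey');
    INSERT INTO products_temp VALUES (1, 'Widget'), (3, 'Doohickey');
""")

# Outer join: products with no matching temp row come back with t.id NULL.
missing_ids = [row[0] for row in conn.execute("""
    SELECT p.id
    FROM products p
    LEFT JOIN products_temp t ON t.id = p.id
    WHERE t.id IS NULL
    ORDER BY p.id
""")]

# Flag instead of deleting, so the row (and its history) is kept.
conn.execute("""
    UPDATE products
    SET is_deleted = 1
    WHERE NOT EXISTS (SELECT 1 FROM products_temp t WHERE t.id = products.id)
""")

flagged_ids = [row[0] for row in conn.execute(
    "SELECT id FROM products WHERE is_deleted = 1 ORDER BY id"
)]
```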
Identify New
Finally, use a similar outer join to insert any “new” products.
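The insert of new products might be sketched like this, again in Python with in-memory SQLite and made-up column names. It is the same outer join as for deletes, just pointed the other way: start from the temp table and keep the rows that have no match in Products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products      (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE products_temp (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    INSERT INTO products      VALUES (1, 'Widget', 9.99);
    INSERT INTO products_temp VALUES (1, 'Widget', 9.99), (2, 'Gizmo', 1.25);
""")

# Temp rows with no matching product (p.id IS NULL) are new: insert them.
conn.execute("""
    INSERT INTO products (id, name, price)
    SELECT t.id, t.name, t.price
    FROM products_temp t
    LEFT JOIN products p ON p.id = t.id
    WHERE p.id IS NULL
""")

product_ids = [row[0] for row in conn.execute("SELECT id FROM products ORDER BY id")]
```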
Option #2
If per row hashing is not feasible, an alternate approach is a variation of Sharondio’s suggestion.
Add a “status” column to the temp table and flag all imported records as “new”, “changed” or “unchanged” through a series of joins. (The default should be “changed”).
Identify UN-Changed
First use an inner join, on all fields, to identify products that have NOT changed. (Note, if your table contains any nullable fields, remember to use something like coalesce. Otherwise, the results may be skewed because null values are not equal to anything.)
Identify New
Like before, use an outer join to identify “new” records.
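Here is a rough sketch of the two status passes (unchanged, then new), in Python with in-memory SQLite. The status column, its default, and the nullable note column are assumptions for illustration. The joins are written as correlated EXISTS subqueries so the statements stay portable; in MySQL the same conditions could go into a multi-table UPDATE with a JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL, note TEXT);
    -- status is a hypothetical working column; every import starts as 'changed'
    CREATE TABLE products_temp (
        id INTEGER PRIMARY KEY, name TEXT, price REAL, note TEXT,
        status TEXT NOT NULL DEFAULT 'changed'
    );
    INSERT INTO products VALUES (1, 'Widget', 9.99, NULL), (2, 'Gadget', 4.50, 'blue');
    INSERT INTO products_temp (id, name, price, note)
    VALUES (1, 'Widget', 9.99, NULL), (2, 'Gadget', 4.75, 'blue'), (3, 'Gizmo', 1.25, NULL);
""")

# Unchanged: a product exists that matches on ALL fields.
# COALESCE makes two NULL notes compare as equal; the '' sentinel
# also means NULL and '' are treated as the same value.
conn.execute("""
    UPDATE products_temp
    SET status = 'unchanged'
    WHERE EXISTS (
        SELECT 1
        FROM products p
        WHERE p.id = products_temp.id
          AND p.name = products_temp.name
          AND p.price = products_temp.price
          AND COALESCE(p.note, '') = COALESCE(products_temp.note, '')
    )
""")

# New: no product with that primary key exists at all.
conn.execute("""
    UPDATE products_temp
    SET status = 'new'
    WHERE NOT EXISTS (SELECT 1 FROM products p WHERE p.id = products_temp.id)
""")

# Anything still carrying the default is, by elimination, 'changed'.
statuses = dict(conn.execute("SELECT id, status FROM products_temp"))
```

Without the COALESCE, product 1 would be left as “changed” even though nothing differs, because NULL = NULL does not evaluate to true.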
By process of elimination, all other records in the temp table are “changed”. Once you have calculated the statuses, you can update the Products table:
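As a sketch only, applying the statuses could look like this (Python with in-memory SQLite, hypothetical columns, statuses pre-filled for the demo). It updates the “changed” rows, inserts the “new” ones, and then empties the temp table ready for the next file, all in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE products_temp (
        id INTEGER PRIMARY KEY, name TEXT, price REAL,
        status TEXT NOT NULL DEFAULT 'changed'
    );
    INSERT INTO products VALUES (1, 'Widget', 9.99), (2, 'Gadget', 4.50);
    -- statuses as they would look after the classification passes
    INSERT INTO products_temp VALUES
        (1, 'Widget', 9.99, 'unchanged'),
        (2, 'Gadget', 4.75, 'changed'),
        (3, 'Gizmo',  1.25, 'new');
""")

with conn:  # commit all three steps together, or roll back on error
    # Overwrite only the rows flagged as changed.
    conn.execute("""
        UPDATE products
        SET name  = (SELECT t.name  FROM products_temp t WHERE t.id = products.id),
            price = (SELECT t.price FROM products_temp t WHERE t.id = products.id)
        WHERE id IN (SELECT t.id FROM products_temp t WHERE t.status = 'changed')
    """)
    # Add the rows flagged as new.
    conn.execute("""
        INSERT INTO products (id, name, price)
        SELECT id, name, price FROM products_temp WHERE status = 'new'
    """)
    # Empty the staging table once everything has been applied.
    conn.execute("DELETE FROM products_temp")

rows = conn.execute("SELECT id, name, price FROM products ORDER BY id").fetchall()
temp_count = conn.execute("SELECT COUNT(*) FROM products_temp").fetchone()[0]
```

Rows marked “unchanged” are never touched, which is where most of the savings come from when only a small share of products changes each day.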