I’m making a feed aggregator using PHP and MySQL, and writing a paper about it which must contain math.
I have a table feeds (id, title, description, link) where id is the primary key.
When I collect new feeds I need to add them to the database, but I must not let any duplicates in. I see two ways to do that:
1) for each feed run something like this:
SELECT id FROM feeds
WHERE title=$feed.title AND description=$feed.description;
And see if it returns any feeds.
2) Assume that feeds which came from different sources never match. In this case:
for each source of feeds run something like this (this needs a source column, which the table definition above does not list):
SELECT title, description, source FROM feeds WHERE source=$source;
Then use PHP to match collected feeds against this array.
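Both approaches can be sketched in PHP roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes an already opened PDO connection in $pdo, a source column on the feeds table, and the hypothetical helper names feedExists, loadStoredFeeds, feedKey and filterNewFeeds, none of which come from the original post.

```php
<?php
declare(strict_types=1);

// Approach 1: one lookup query per collected feed.
// $pdo is assumed to be an open PDO connection to the MySQL database.
function feedExists(PDO $pdo, string $title, string $description): bool
{
    // Placeholders keep the feed text out of the SQL string itself.
    $stmt = $pdo->prepare(
        'SELECT id FROM feeds WHERE title = ? AND description = ? LIMIT 1'
    );
    $stmt->execute([$title, $description]);
    return $stmt->fetchColumn() !== false; // false means no matching row
}

// Approach 2, step 1: fetch everything stored for one source in a single query.
// Assumes the table has a source column.
function loadStoredFeeds(PDO $pdo, string $source): array
{
    $stmt = $pdo->prepare('SELECT title, description FROM feeds WHERE source = ?');
    $stmt->execute([$source]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Lookup key built from the two fields used for duplicate detection.
// serialize() length-prefixes each string, so ("ab", "c") and ("a", "bc") differ.
function feedKey(?string $title, ?string $description): string
{
    return serialize([$title, $description]);
}

// Approach 2, step 2: keep only the collected feeds that are not stored yet.
// Each lookup is a hash-table access, so the pass is linear in both array sizes.
function filterNewFeeds(array $stored, array $collected): array
{
    $seen = [];
    foreach ($stored as $row) {
        $seen[feedKey($row['title'], $row['description'])] = true;
    }
    $new = [];
    foreach ($collected as $feed) {
        $key = feedKey($feed['title'], $feed['description']);
        if (!isset($seen[$key])) {
            $seen[$key] = true; // also drops repeats inside the collected batch
            $new[] = $feed;
        }
    }
    return $new;
}
```

A call such as filterNewFeeds(loadStoredFeeds($pdo, $source), $collected) would then return the feeds that still need to be inserted for that source.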
I admit, I don’t have any performance problem. But I’m writing a paper about it and I must find some way to apply math to the problem. I’ve chosen the second approach because it allows me to go into math details about why it can be faster.
But I suspect that PHP might do the work much slower than MySQL would, and that it might actually be faster to run a query for each feed.
Am I right? Is there any practical reason to choose the second approach? How can I justify my choice?
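One hedged way to put the comparison into symbols is a simple cost model. Every quantity here is an assumption introduced for illustration, not a measurement: $n$ is the number of collected feeds, $s$ the number of sources, $N$ the number of rows already stored, $L$ the round-trip latency of one query, $q(N)$ the server-side cost of one lookup, $t$ the cost of transferring one row to PHP, and $h$ the cost of one hash-table insert or lookup in PHP. It also assumes every source is polled, so approach 2 ends up fetching all $N$ rows.

$$
T_1 = n \, \bigl( L + q(N) \bigr)
$$

$$
T_2 = s \, L + t \, N + h \, (N + n)
$$

Under this model approach 2 is faster exactly when

$$
s \, L + (t + h) \, N + h \, n < n \, \bigl( L + q(N) \bigr).
$$

With an index on (title, description), $q(N)$ grows like $\log N$; without one, it grows like $N$. The model suggests that approach 2 pays off when the per-query latency $L$ dominates and $n$ is large compared with $s$, while approach 1 wins once $N$ is large and only a few feeds are collected per run, because $T_2$ grows linearly in $N$ no matter how few feeds are new.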
Have you considered using a composite unique index instead?
This would prevent new rows from being added when the title and description, taken together, are already present in the table.
You would have to do a large number of inserts into a large database to get meaningful performance numbers, though.
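A composite unique index along these lines could look like the sketch below. It assumes an open PDO connection to MySQL, that title and description are VARCHAR columns short enough to be indexed in full, and that id is AUTO_INCREMENT. The index name uniq_title_description and the function names are made up for this example.

```php
<?php
declare(strict_types=1);

// One-time schema change: make the (title, description) pair unique.
// Assumes both columns are VARCHAR and short enough to be indexed in full.
function addUniqueIndex(PDO $pdo): void
{
    $pdo->exec(
        'ALTER TABLE feeds ADD UNIQUE KEY uniq_title_description (title, description)'
    );
}

// Returns true if the row was inserted, false if it was skipped as a duplicate.
// Assumes id is AUTO_INCREMENT, so it is not listed in the INSERT.
function insertFeedIfNew(PDO $pdo, string $title, string $description, string $link): bool
{
    // INSERT IGNORE downgrades the duplicate-key error to a warning and skips the row.
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO feeds (title, description, link) VALUES (?, ?, ?)'
    );
    $stmt->execute([$title, $description, $link]);
    return $stmt->rowCount() === 1;
}
```

Two caveats: the ALTER TABLE statement fails if the table already contains duplicate pairs, and INSERT IGNORE also downgrades some other errors (such as data truncation) to warnings, so it may hide problems unrelated to duplicates.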
Edit:
This does have one downside: in MySQL a unique index never treats NULL as equal to another NULL, so you could end up with several rows where title=NULL and description=NULL (or, more generally, where either column is NULL). You should check for this before attempting to insert the data.
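The check before insert could be as small as the following sketch. The function name hasDedupFields and the array shape of a feed are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// True only when both de-duplication fields are present and not NULL.
// isset() returns false for missing keys and for NULL values alike.
function hasDedupFields(array $feed): bool
{
    return isset($feed['title'], $feed['description']);
}

// Keep only the feeds that are safe to hand to the unique index.
function dropIncompleteFeeds(array $feeds): array
{
    return array_values(array_filter($feeds, 'hasDedupFields'));
}
```

An alternative that avoids the check entirely is to declare both columns NOT NULL, so that the index never sees a NULL in the first place.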