I have multiple entries in a temporary table in a database, and I need to merge them into permanent entries. The information comes from multiple XML feeds, and although I have all sorts of fields, the closest thing to a common key is the “title”, or in my case the name of the product.
Unfortunately, I have no other way to match them than by name (there are no shared IDs or anything like that).
So, for example, I have:
$primary = array('feedid' => 2, 'entry_name' => 'ACME Product Black Model #23');
$secondary = array('feedid' => 3, 'entry_name' => 'ACME Product Model #23');
The ACME product name may vary from "ACME Product Model #23" to "Model 23" to "Black Model #23", etc.
Also, the same feed may contain "ACME Product Model Black #22" and "CHOAM Product Black - Model 11".
The problem is that I can't just use similar_text() or levenshtein(), because they sometimes match the wrong items and sometimes don't match at all. Each feed has 100+ entries, and I can have up to about 10 feeds.
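To illustrate the mismatch problem, here is a minimal sketch. The "#22" name below is a made-up variant for illustration only. Plain edit distance rates a different model number as closer than the same product with an extra colour word:

```php
<?php
// Same product, one name just has the extra word "Black ".
$same = levenshtein('ACME Product Black Model #23', 'ACME Product Model #23'); // 6

// Different product (hypothetical "#22" entry), only one character differs.
$other = levenshtein('ACME Product Model #23', 'ACME Product Model #22'); // 1

// The wrong pair has the smaller distance, so a simple threshold picks it first.
var_dump($same > $other); // bool(true)
```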
Edit:
To put it in real terms: “iPhone 4”, “iPhone 4 White” and “iPhone 4 Black” should all be merged (I can handle the merging; I need to match them first).
So the rule, in this case, is: match the phones.
It could also be “Barby Doll White hair” and “Barby Doll Black Hair”, but not “Some other Doll with White Hair”. …
Any ideas appreciated 🙂
I think it is worth going with the preg_match() approach that hakre suggests.
I would go about it like this:
(Optionally) I would add one more field to the old temporary table, a tinyint called flag.
I would try preg_match() first and, on success, set a positive flag on the old table to indicate that this record was handled successfully by preg_match().
If preg_match() fails, I would fall back to text similarity, as hakre also suggests, and set a flag indicating that the record was handled by text similarity.
In the end, I would hope that a large percentage of the records are handled by preg_match() and only a few carry the “text similarity” flag. That would make the problem smaller, wouldn't it?
If you later find a better solution, you can use the flag to find the records that were not handled by preg_match().
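The flow above could be sketched roughly as follows. This is only an illustration under assumptions: the flag values, the function name matchEntry(), the hand-written pattern per product, and the 80% similarity threshold are all hypothetical and would need tuning against the real feeds.

```php
<?php
// Hypothetical values for the extra tinyint "flag" column.
const FLAG_UNMATCHED  = 0;
const FLAG_REGEX      = 1; // handled by preg_match()
const FLAG_SIMILARITY = 2; // handled by the text-similarity fallback

/**
 * Decide how (or whether) $candidate matches a known product.
 *
 * $pattern is a hand-written regex for that product and $canonical is its
 * reference name; both are assumptions made for this sketch.
 */
function matchEntry(string $candidate, string $canonical, string $pattern, float $minPercent = 80.0): int
{
    // First attempt: the regular expression.
    if (preg_match($pattern, $candidate) === 1) {
        return FLAG_REGEX;
    }

    // Fallback: similar_text() writes a similarity percentage into $percent.
    similar_text(strtolower($candidate), strtolower($canonical), $percent);

    return $percent >= $minPercent ? FLAG_SIMILARITY : FLAG_UNMATCHED;
}

// Usage: the returned flag is what would be stored on the temporary row.
$pattern = '/\biphone\s*4\b/i';
$flag = matchEntry('iPhone 4 White', 'iPhone 4', $pattern);
```

Rows that end up with FLAG_SIMILARITY or FLAG_UNMATCHED are the ones worth reviewing by hand or re-running later with a better rule.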
Then, for retrieving the new data, I would go with text similarity, for example something like MySQL's LIKE '%string%'.
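If the search string comes from the feed data, it probably helps to escape the LIKE wildcards before wrapping it in % signs. A small sketch, assuming MySQL's default backslash escape character; the helper name likePattern() and the table and column names in the comment are hypothetical:

```php
<?php
/**
 * Build a "contains" pattern for LIKE, escaping %, _ and \ in the term
 * so they are matched literally (assumes the default escape character "\").
 */
function likePattern(string $term): string
{
    return '%' . addcslashes($term, '%_\\') . '%';
}

// Intended use with a prepared statement (table/column names are made up):
//   $stmt = $pdo->prepare('SELECT * FROM products WHERE entry_name LIKE ?');
//   $stmt->execute([likePattern('iPhone 4')]);
```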
As for preg_match() being slow: you will only run this process once, so it shouldn't be a problem. In addition, I would add a conditioned loop so as not to exceed the max execution time.
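One way such a conditioned loop might look is shown below. It is a sketch only: processWithinBudget(), the $rows array and the $handler callback are hypothetical stand-ins for the temporary-table rows and the matching step, and the 80% safety margin is an arbitrary choice.

```php
<?php
/**
 * Run $handler on each row until the time budget is used up.
 * Returns how many rows were processed, so the next run can resume from there.
 */
function processWithinBudget(array $rows, callable $handler, float $budgetSeconds): int
{
    $start = microtime(true);
    $done  = 0;

    foreach ($rows as $row) {
        // Stop before the budget is exceeded instead of letting PHP kill the script.
        if (microtime(true) - $start >= $budgetSeconds) {
            break;
        }
        $handler($row);
        $done++;
    }

    return $done;
}

// Derive the budget from max_execution_time; 0 means "no limit".
$limit  = (int) ini_get('max_execution_time');
$budget = $limit > 0 ? $limit * 0.8 : 30.0;
```

Combined with the flag column, rows that were not reached in one run are simply the ones still unflagged, so the script can be called again until none remain.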