I have quite an interesting task, but I don’t know what to call it in one word in order to search for related topics. Even this topic’s title might not reflect what I need, so if somebody has a better title, suggestions are welcome.
I’ll try to explain my problem.
I have about 100,000 rows in a MySQL table, and I need to “compare” the entries in that table.
“Compare” doesn’t just mean testing for equality. There is an algorithm for calculating a similarity score: I have a weight coefficient for each table column. This means that if entry #1’s column1 equals entry #2’s column1, then I give, say, 5 points to this pair, and so on for each column.
The most straightforward way to do this is to apply the calculation rules to each pair of entries. Why am I afraid of this? 100,000 entries means about 5 billion “compare” operations. Sure, I can calculate this on demand and store the result in a cache, but I believe that the most obvious way is not the most efficient.
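For reference, the “about 5 billion” figure is just the number of unordered pairs among 100,000 rows, n(n-1)/2. A quick check, sketched in Python purely for illustration (the variable names are made up):

```python
import math

# Number of rows in the table, as stated above.
n_rows = 100_000

# Each unordered pair of distinct rows is compared once: n * (n - 1) / 2.
n_pairs = math.comb(n_rows, 2)

print(f"{n_pairs:,} pairwise comparisons")
```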
So, my first question is: is there a better way to achieve my goal than brute force?
My second question is about which tool is better suited for the calculations.
1. Do it in PHP, the application language. This means loading the whole table into memory and iterating over the data.
2. Create a stored procedure in MySQL.
3. Use MongoDB’s aggregation framework or MapReduce.
I like the first option least and the last one most.
I’m looking for any suggestions or advice from people who have experience with this sort of problem.
Since I don’t know how to ask Google for help, any links will be appreciated.
UPDATE:
The calculation rules are a bit more complicated than I described…
The table has a set of related columns that are to be used together as a group (not one by one).
Let’s assume:
The table has fields, say, tag_1, tag_2, ..., tag_n.
row_1 and row_2 are entries in the table.
The rule (pseudo-code):
if(row_1.tag_1==row_2.tag_1)
{
// gives 10 points
}
elseif(row_1.tag_1 is in row_2.tags && row_1.tag_1!=row_2.tag_1)
{
// gives 5 points
}
....
// and so on
Basically, I need to find the intersection of two arrays. If it is not empty, points are given. If the indexes of the tags in the two rows match, additional points are given.
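To make the rule concrete, here is a minimal sketch in a general-purpose language (Python, used only for illustration). The function name `pair_score` and the constant names are hypothetical, the weights 10 and 5 come from the pseudo-code above, and treating empty (NULL) tags as never matching is an assumption:

```python
from typing import Optional, Sequence

EXACT_POINTS = 10   # same tag at the same index in both rows
SHARED_POINTS = 5   # tag appears in the other row, but at a different index


def pair_score(tags_a: Sequence[Optional[str]],
               tags_b: Sequence[Optional[str]]) -> int:
    """Score one pair of rows by their tag_1..tag_n values.

    Assumption: a missing tag (None) never matches anything.
    """
    other_tags = {t for t in tags_b if t is not None}
    score = 0
    for i, tag in enumerate(tags_a):
        if tag is None:
            continue
        if i < len(tags_b) and tags_b[i] == tag:
            score += EXACT_POINTS      # indexes match
        elif tag in other_tags:
            score += SHARED_POINTS     # in the intersection only
    return score


example = pair_score(["a", "b", "c"], ["a", "c", "x"])
print(example)
```

In the example call, "a" matches at the same index (10 points) and "c" is shared at a different index (5 points), while "b" has no counterpart.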
I’m wondering how this can be accomplished in a stored procedure language, because it can be done pretty easily in any general-purpose programming language.
If a stored procedure can do this, then it is my choice.
If you have a static table, then it doesn’t make a difference which you choose, so long as you store the results somewhere (presumably back in the database).
If your data is changing, then you need to compare each new row to all rows, which is essentially a full-table scan. This is probably best done in a database.
If the data fits into memory (and 100,000 rows should fit into memory), then (2) will probably be faster than (3) on equivalent hardware. “Equivalent hardware” is a very important consideration.
In most cases, I would opt for (2). It sounds like the query is something like:
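As a rough sketch only, a pairwise self-join of this kind could look like the query below. The table name `entries`, the `id` column, the restriction to two tag columns, and the sample rows are all assumptions for illustration; the weights 10 and 5 are taken from the pseudo-code in the question. The query is run through Python's built-in `sqlite3` module just so the example is self-contained, and a MySQL version would be expected to have the same shape:

```python
import sqlite3

# In-memory database with a hypothetical two-tag table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, tag_1 TEXT, tag_2 TEXT)"
)
conn.executemany(
    "INSERT INTO entries (id, tag_1, tag_2) VALUES (?, ?, ?)",
    [(1, "red", "big"), (2, "red", "small"), (3, "big", "red")],
)

# Self-join: a.id < b.id visits each unordered pair exactly once.
# Per tag column: 10 points for a same-index match, 5 for a match
# against the other tag column, otherwise 0.
PAIR_SCORES_SQL = """
SELECT a.id AS id_a,
       b.id AS id_b,
       (CASE WHEN a.tag_1 = b.tag_1 THEN 10
             WHEN a.tag_1 = b.tag_2 THEN 5
             ELSE 0 END)
     + (CASE WHEN a.tag_2 = b.tag_2 THEN 10
             WHEN a.tag_2 = b.tag_1 THEN 5
             ELSE 0 END) AS score
FROM entries AS a
JOIN entries AS b ON a.id < b.id
ORDER BY a.id, b.id
"""

pair_scores = conn.execute(PAIR_SCORES_SQL).fetchall()
for id_a, id_b, score in pair_scores:
    print(id_a, id_b, score)
```

With more tag columns, each CASE expression would need one extra branch per additional column, so the query grows quickly; the results could be written to a separate table instead of being fetched.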
If you are much more comfortable with map-reduce, then you might find it easier to code there. I know both languages and prefer SQL for something like this.