CREATE TABLE hostname_table
(
id INT NOT NULL AUTO_INCREMENT,
hostname CHAR(65) NOT NULL,
interval_avg INT,
last_update DATETIME NOT NULL,
numb_updates INT,
PRIMARY KEY (id)
)
I have this table, and I import 500-600k rows of data into it. I do not check for duplicates when writing to the database, because I want to know how many duplicates of each host there is, and I also want to know the intervals between each update of said hostname.
Example values in hostname_table:
id hostname interval_avg last_update numb_updates
1 www.host.com 60 2012-04-25 20:22:21 1
2 www.hostname.com 10 2012-04-25 20:22:21 5
3 www.name.com NULL 2012-04-25 20:22:21 NULL
4 www.host.com NULL 2012-04-25 20:22:26 NULL
5 www.host.com NULL 2012-04-25 20:22:36 NULL
Example of what I want it to look like when I have cleaned it up:
id hostname interval_avg last_update numb_updates
1 www.host.com 25 2012-04-25 20:22:36 3
2 www.hostname.com 10 2012-04-25 20:22:21 5
3 www.name.com NULL 2012-04-25 20:22:21 NULL
With a huge database like this, I dont want to send too many queries to obtain this goal, but I believe 3 queries are the minimum for an operation like this(if I am wrong, please correct me). Each hour there will be ~500k new rows where ~50% or more will be duplicates, therefore its vital to get rid of those duplicates as efficiently as possible while still keeping a record of how many and how often the duplicates occured(hense the interval_avg and numb_update update).
This is a three step problem, and I was hoping the community here would give a helping hand.
So to summarize in pseudocode, I need help optimizing these queries;
- select all last_update and interval_avg values, get sum(numb_update), get count(duplicates) foreach hostname,
- update interval_avg in min(id), update numb_updates in min(id), update last_update in min(id) with the value from max(id),
- delete all duplicates except min(id)
SOLVED.
I have optimized one part by 94%, and another part by ~97% over the course of a couple of days researching. I truely hope this will help other searching for the same solutions. mySQL and large databases can be a big problem if you choose the wrong solution.
(I changed the last_update column from DATETIME to INT(10), and I changed from a formated time to a timestamp as value in my final solution to be able to get the max(last_update) and min(last_update) values)
(Thanks to GolezTrol for helping with parts of the problem)
You cannot get each different value of interval_avg and numb_updates for a hostname if you want to aggregate by that hostname. Did you mean to
SUMor maybeAVGthem? Or do you just want to keep the value of the lowest id?In the query below I sum them.
After this, you will need to update each found id with the right values for
interval_avgandnumb_updates.After that, you will need to delete each id that is not found by this query.