I have been trying to build a few reports from some log files (about 50 million records now, and this may grow tenfold). I have loaded the data into a table and made the necessary changes (removing dups etc.). The final table is supposed to hold the number of requests per product, per type and per day, so I am trying to cut the data down to just distinct products with a count column giving the number of requests.
Here is the original table with the log data:
*************************** 1. row ***************************
Table: cdnlog2
Create Table: CREATE TABLE `cdnlog2` (
`serial` int(32) DEFAULT NULL,
`ip` varchar(100) DEFAULT NULL,
`country` varchar(100) DEFAULT NULL,
`productid` int(11) DEFAULT NULL,
`type` varchar(100) DEFAULT NULL,
`query_date` date DEFAULT NULL,
KEY `aaa` (`country`),
KEY `ccc` (`productid`),
KEY `type` (`type`),
KEY `date_index` (`query_date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Destination table:
*************************** 1. row ***************************
Table: cdnlogfinal
Create Table: CREATE TABLE `cdnlogfinal` (
`country` varchar(100) DEFAULT NULL,
`productid` int(11) DEFAULT NULL,
`type` varchar(100) DEFAULT NULL,
`request_count` int(11) DEFAULT NULL,
`query_date` date DEFAULT NULL,
KEY `aaa` (`country`),
KEY `ccc` (`productid`),
KEY `type` (`type`),
KEY `date_index` (`query_date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
I am now trying to reduce the records to grouped values: just the distinct rows and their count (the log can contain dups, since the same product can be selected multiple times on the same day). However, the insert into the secondary table has been running for several hours with the status “Copying to tmp table on disk”. I have already changed the temp directory to one with sufficient space. Any pointers?
Thanks in advance
Your idea is a good one, and the end result will speed up your reporting queries considerably. You just need one more piece to solve the puzzle:
The problem is that the base table has too many rows to build the whole derived table in one query. Grouping every row at once forces MySQL to materialise the result in one huge temporary table, which spills to disk (the “Copying to tmp table on disk” status you are seeing), and a single statement of that size can run for hours, time out, or run out of space.
Instead, you must do this one day at a time:
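A minimal sketch of what such a per-day query might look like, assuming the `cdnlog2` and `cdnlogfinal` definitions from the question. The two date literals are placeholders, not values from the original post:

```sql
-- Aggregate a single day of raw log rows into the summary table.
-- '2013-01-01' and '2013-01-02' are placeholder dates; replace them
-- with the day being processed and the day after it.
INSERT INTO cdnlogfinal (country, productid, type, request_count, query_date)
SELECT country, productid, type, COUNT(*) AS request_count, query_date
FROM cdnlog2
WHERE query_date >= '2013-01-01'   -- start of the day (inclusive)
  AND query_date <  '2013-01-02'   -- start of the next day (exclusive)
GROUP BY country, productid, type, query_date;
```

Because the filter is on `query_date`, the existing `date_index` key should let MySQL read only that day's rows, so the temporary table for the `GROUP BY` stays small.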
Run this query separately for every day in your date range, changing the start and end dates accordingly.
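Rather than editing the dates by hand, the per-day runs could be driven by a small stored procedure. This is only a sketch: the procedure name `backfill_cdnlogfinal`, its parameters, and the dates in the `CALL` are made up for illustration, and the tables are assumed to be as defined in the question.

```sql
DELIMITER //

-- Hypothetical helper: aggregates cdnlog2 into cdnlogfinal one day at a time.
CREATE PROCEDURE backfill_cdnlogfinal(IN p_start DATE, IN p_end DATE)
BEGIN
  DECLARE d DATE;
  SET d = p_start;

  WHILE d <= p_end DO
    -- One small INSERT ... SELECT per day keeps each temporary table small.
    INSERT INTO cdnlogfinal (country, productid, type, request_count, query_date)
    SELECT country, productid, type, COUNT(*), query_date
    FROM cdnlog2
    WHERE query_date = d
    GROUP BY country, productid, type, query_date;

    SET d = d + INTERVAL 1 DAY;
  END WHILE;
END //

DELIMITER ;

-- Example call with placeholder dates (both ends inclusive).
CALL backfill_cdnlogfinal('2013-01-01', '2013-01-31');
```

Note that `cdnlogfinal` has no unique key, so processing the same day twice would insert duplicate rows. Deleting that day's rows before re-running is one way to keep the procedure safe to repeat.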
Once your historic data is calculated, run this once per day for the previous day as part of your batch processing.
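For the daily run, the date can be computed instead of hard-coded. A sketch, again assuming the table definitions from the question and that the job runs after the previous day's log has been fully loaded:

```sql
-- Daily batch step: summarise yesterday's requests.
INSERT INTO cdnlogfinal (country, productid, type, request_count, query_date)
SELECT country, productid, type, COUNT(*) AS request_count, query_date
FROM cdnlog2
WHERE query_date = CURDATE() - INTERVAL 1 DAY   -- yesterday
GROUP BY country, productid, type, query_date;
```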
What you are doing is creating a data warehouse. Strongly consider putting this data on a separate, dedicated server. There are many advantages to doing this; read up on data warehousing to find out what they are.