I have a table for logging access data of a lab. The table structure looks like this:
create table accesslog
(
userid int not null,
direction int not null,
accesstime datetime not null
);
This lab has only one gate under access control, so users must first “enter” the lab before they can “leave”. In my original design, the “direction” field is a flag that is either 1 (entering the lab) or -1 (leaving the lab), so that I can use a query like:
SELECT SUM(direction) FROM accesslog;
to get the total number of users currently in the lab. In theory it works, since “direction” should always follow the pattern 1 => -1 => 1 => -1 for any given userid.
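The same idea can be broken down per user. The following is only a sketch, assuming the accesslog table exactly as defined above; the alias net_direction is made up for illustration:

```sql
-- Net direction per user: with a complete 1 => -1 => 1 => -1 log,
-- a user who is inside sums to 1 and a user who is outside sums to 0.
SELECT userid,
       SUM(direction) AS net_direction
FROM accesslog
GROUP BY userid
HAVING SUM(direction) <> 0;   -- keep only users who are not "outside"
```

A net_direction other than 1 in this result is already a hint that something is off for that user, but a sum of 0 or 1 does not prove that the log is complete.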
But I soon found that log messages could get lost on the transmission path from the lab gate to the server, dropped either by a busy network or by hardware glitches. Of course I can harden the transmission path with sequence numbers, ACKs, retransmission, hardware redundancy, etc., but in the end I might still get something like this:
userid  direction  accesstime
-------------------------------------
1        1         2013/01/03 08:30
1       -1         2013/01/03 09:20
1        1         2013/01/03 10:10
1       -1         2013/01/03 10:50
1       -1         2013/01/03 13:40
1        1         2013/01/03 18:00
This is a recent log for user “1”. It is clear that I have lost one log message for that user entering the lab between 10:50 and 13:40. At the time I query this data he is still in the lab, so there are no exit logs after 2013/01/03 18:00 yet; that much is certain.
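In the table, that kind of gap shows up as two consecutive rows with the same direction for one user. A minimal sketch that lists such pairs, assuming the accesslog table above and that a user never has two rows with the same accesstime; the aliases gap_start, gap_end and repeated_direction are illustrative only:

```sql
-- Pair every row with the immediately preceding row of the same user
-- and keep the pairs whose direction did not alternate.
SELECT cur.userid,
       prev.accesstime AS gap_start,          -- last row before the missing message
       cur.accesstime  AS gap_end,            -- first row after the missing message
       cur.direction   AS repeated_direction
FROM accesslog AS cur
JOIN accesslog AS prev
  ON prev.userid = cur.userid
 AND prev.accesstime = (
        -- latest earlier row for this user
        SELECT MAX(p.accesstime)
        FROM accesslog AS p
        WHERE p.userid = cur.userid
          AND p.accesstime < cur.accesstime
     )
WHERE prev.direction = cur.direction
ORDER BY cur.userid, cur.accesstime;
```

For the six sample rows this pairs the 10:50 and 13:40 records, both with direction -1. The correlated subquery runs once per row, so on a large table it would probably need an index on (userid, accesstime), which the table definition above does not have.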
My question is: is there any way to “find” this data inconsistency with an SQL command? There are 5000 users in total in my system and the lab operates 24 hours a day, so there is no “magic time” at which the lab is guaranteed to be empty. It would be horrible if I had to write code that checks the continuity of the “direction” field line by line, user by user.
I know it’s not possible to “fix” the log with correct data. I just want to know “Oh, I have a data inconsistency issue for userid=1” so that I can add a marked amending record to correct the final statistics.
Any advice would be appreciated; even changing the table structure would be OK.
Thanks.
Edit: Sorry, I didn’t mention the details.
Currently I’m using a mixed SQL solution. The table shown above is in MySQL, and it contains only the logs from the last 24 hours, serving as the “real time” status for fast browsing.
Every day at 03:00 AM a pre-scheduled process written in C++ on POSIX is launched. This process calculates the statistics and adds the daily statistics to an Oracle DB via a proprietary-protocol TCP socket, then removes the old data from MySQL.
The Oracle part is not handled by me and I can do nothing about it. I just want to make sure that the final statistics for each day are correct.
The data size is about 200,000 records per day. I know it sounds crazy, but it’s true.
OK, I figured it out. Thanks to a_horse_with_no_name for the idea.
My final solution is this query:
First I create a pattern with @inout that yields 1 => -1 => 1 => -1 on successive rows, in the “rule” column. Then I compare the direction field with the rule column by multiplying the two.
It’s OK even if some users have an odd number of records, since each user is supposed to follow either the same pattern as “rule” or its reverse. So for a consistent user, the sum of the products should equal either COUNT() or -1 * COUNT().
By checking SUM() against COUNT(), I can tell which userids have gone wrong.
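Put together, the steps described above could look roughly like the sketch below. It is a reconstruction from the description, not necessarily the exact query that was run. It assumes the accesslog table from the question and MySQL user variables; @inout and rule are the names used above, while cnt, score, t and init are made up for illustration.

```sql
SELECT t.userid,
       COUNT(*)                  AS cnt,
       SUM(t.direction * t.rule) AS score
FROM (
    -- Walk the log ordered by user and time, flipping @inout on every row
    -- so that rule reads 1, -1, 1, -1, ...
    SELECT a.userid,
           a.direction,
           (@inout := -@inout) AS rule
    FROM accesslog AS a
    CROSS JOIN (SELECT @inout := -1) AS init   -- first row flips this to 1
    ORDER BY a.userid, a.accesstime
) AS t
GROUP BY t.userid
-- A consistent user matches the rule on every row (score = cnt)
-- or on no row (score = -cnt); anything in between is a break.
HAVING ABS(SUM(t.direction * t.rule)) <> COUNT(*);
```

For the six sample rows of user 1, the products agree on four rows and disagree on two, so score comes out as 2 or -2 against a cnt of 6 and the user is reported. One caveat: MySQL does not guarantee the order in which user variables are evaluated relative to ORDER BY, and assigning them inside expressions is deprecated in MySQL 8.0, so this pattern should be checked against the actual server version before relying on it.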