I have one large database table of request data, much like Apache request logs, of about 50 million rows:
request_url
user_agent
created
that contains data like this:
/profile/Billy
Mozilla.....
2012-06-17...
/profile/Jane
Mozilla.....
2012-06-17...
I then have my user database table, with all my user data including usernames.
At the moment, every night, I process the request data for the previous day, row by row and see if it contains an URL that matches one of the usernames in the users table. If it does, I increment a total in another table that stores stats that allows users to see how many pageviews they got for any particular day.
However as the datasets grow, this is becoming resource intensive and can also take a long time to complete, even when grouping the request data by URL and grabbing a count for that group.
Is there a better way of processing this information to get the end result I need? The request data is going to be logged anyway, so it would be preferable to to generate the stats after the fact rather than incrementing the total on every page view.
I’m running this on one server, so distributed processing of the data on multiple servers isn’t required.
Start with a fresh log-table every day. When the day is done, use it to increment the totals, then append it to that huge main log-table and delete it.