I have a database which looks a bit like this:
# user1, user2, action, days since 01/03/2010, week number, age1, age2, gen1, gen2
['1181206', '3560076', '2', 0, 0, '46', '45', 'M', 'F']
['1291903', '3675534', '2', 0, 0, '32', '30', 'M', 'F']
['3723809', '3686568', '1', 7, 1, '29', '26', 'M', 'F']
['3440145', '3258134', '1', 14, 2, '42', '42', 'M', 'F']
['3720125', '3147358', '1', 15, 2, '50', '51', 'F', 'M']
['2568920', '3753709', '1', 23, 3, '46', '43', 'M', 'F']
['3759313', '3541126', '1', 30, 4, '43', '42', 'M', 'F']
['3372869', '3409372', '1', 37, 5, '44', '45', 'F', 'M']
['2580655', '3816967', '1', 47, 6, '54', '48', 'M', 'F']
['3784183', '1978056', '1', 51, 7, '61', '50', 'M', 'F']
['4462684', '4406304', '1', 59, 8, '52', '51', 'F', 'M']
['3649081', '4524487', '1', 72, 10, '49', '47', 'M', 'F']
['4627173', '4537773', '3', 95, 13, '30', '37', 'F', 'M']
['4697735', '3144685', '1', 106, 15, '28', '29', 'F', 'M']
['3643353', '4740556', '1', 125, 17, '24', '29', 'F', 'M']
...
There are around 5 million rows. Each row represents an activity. user1 does the action on user2.
I need to order it somehow to make it easy to work out the activity for each user,
and in the end what I want to know is:
- Time in days between a user’s first activity and last activity.
- Number of users who have, e.g., 10-15 days between their first activity and last activity.
I’ve tried sorting it so that each user’s activity is grouped together, but it would take my machine too long (around 3 days)! A quick way of grouping each user’s activity would still be nice, though.
I’m thinking of setting up a class called Users(), in which each user is an object with the attributes age, gender and activity.
Then saying:
    for each line in database:
        if user doing the action is an object in the class:
            invoke a method which adds this activity to their activity attribute
        else:
            invoke a method which creates a new user object and adds this
            activity to their activity attribute.
I’m not entirely sure how to do this, though. Is there a method which can create new objects?
Then I would somehow loop through all the objects in the class, working out the number of days between each user’s first and last activity.
I know there are quite a few parts to this so help with any of them is greatly appreciated.
A rather naïve approach: your "Users" structure could be a dictionary ( http://docs.python.org/tutorial/datastructures.html#dictionaries ) of user activity, where the key is the user ID and the value is a small structure holding that user’s oldest and latest activity dates.
Then you go through the list of activities once, and each line is either:
- a new user (you add an entry to the dictionary, and initialise both the oldest and latest activity date with the activity date of the current line), or
- an existing user (you look up this user’s entry in the dictionary, and update the oldest / latest activity dates accordingly).
In the end you’ll have the activity span of all users.
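The steps above could be sketched roughly as follows. This is only a minimal illustration, assuming the rows are already available as Python lists shaped like the sample in the question (acting user in column 0, days since 01/03/2010 in column 3) and that only the user doing the action counts as active. The function names and the tiny example rows are made up for the sketch.

```python
def activity_spans(rows):
    """Return {user: days between first and last activity} in a single pass."""
    first_last = {}  # user id -> [earliest day seen, latest day seen]
    for row in rows:
        user, day = row[0], int(row[3])  # assumed column positions
        if user not in first_last:
            # New user: oldest and latest both start at this activity's day.
            first_last[user] = [day, day]
        else:
            # Existing user: widen the stored range if needed.
            bounds = first_last[user]
            if day < bounds[0]:
                bounds[0] = day
            if day > bounds[1]:
                bounds[1] = day
    return {user: last - first for user, (first, last) in first_last.items()}


def count_users_with_span(spans, low, high):
    """Count users whose activity span lies between low and high days, inclusive."""
    return sum(1 for span in spans.values() if low <= span <= high)


# Made-up rows for illustration only (not taken from the real data).
example_rows = [
    ['u1', 'u2', '2', 0, 0, '46', '45', 'M', 'F'],
    ['u3', 'u1', '1', 7, 1, '29', '46', 'F', 'M'],
    ['u1', 'u3', '1', 12, 1, '46', '29', 'M', 'F'],
]
spans = activity_spans(example_rows)
print(spans)
print(count_users_with_span(spans, 10, 15))
```

No sorting is needed here, because each row only touches the entry of its own user, so the rows can arrive in any order.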
Now the dictionary will grow as big as the number of users in your database, and the processing time will be proportional to the number of rows. This could pose issues of memory use and running time, but both depend on the database and the system you use.
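If memory turns out to be the concern, one option is to read the rows one at a time instead of loading them all first, so that only the per-user dictionary stays in memory. The sketch below assumes, purely for illustration, that the data can be exported as comma-separated text with the same column order as in the question. The function name, the column defaults and the file name in the comment are hypothetical.

```python
import csv


def spans_from_lines(lines, user_col=0, day_col=3):
    """Stream CSV lines and return {user: days between first and last activity}.

    `lines` can be any iterable of text lines, e.g. an open file, so the
    full data set never has to sit in memory at once.
    """
    first_last = {}  # user id -> (earliest day, latest day)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user, day = row[user_col], int(row[day_col])
        seen = first_last.get(user)
        if seen is None:
            first_last[user] = (day, day)
        else:
            first_last[user] = (min(seen[0], day), max(seen[1], day))
    return {user: last - first for user, (first, last) in first_last.items()}


# With a real export this might look like (hypothetical file name):
#     with open("activity.csv", newline="") as f:
#         spans = spans_from_lines(f)
sample_lines = [
    "u1,u2,2,0,0,46,45,M,F",
    "u4,u1,1,5,0,30,46,F,M",
    "u1,u3,1,12,1,46,29,M,F",
]
print(spans_from_lines(sample_lines))
```

Each row costs one dictionary lookup, so the running time stays proportional to the number of rows, and the memory used is proportional to the number of distinct users.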
Hoping this helps.