I have an app (for a given twitter user) that gets a list of twitter users that you follow but don’t follow you back. It does this:
- Compare two lists, one from time x and one from time y, to see whether more or fewer people followed you back.
- See how long it took for twitter user x to follow you back.
- See how many retweets/comments it took for user x to follow you back.
The easy way I came up with is just to have a has-many/belongs-to relationship between a user and the people not following them back, e.g.:
User table
-id
TwitterUser table
-user_id
-timestamp
-isFollowing
So w/ that SQL schema I can get all the non-following back users for a given user and they can be compared by timestamp to match requirements above.
However, I was hoping that there was a better DB backend to represent this dataset than an sql database. I’ve been experimenting w/ redis but not sure how to pull it off.
I’m thinking maybe a document store – b/c all I want to do is take a diff of two data sets. Or more precisely: I want to diff two lists of twitter user ids.
Any ideas?
A brute-force comparison of two arrays has a time complexity of O(N*M), where N and M are the sizes of the arrays. So we should instead store them in a data structure that lets us do this efficiently.
I’ve come up with the following approaches:
A list of Twitter ids is a set, because the ids are unique. Redis supports
sets and lets you perform set operations such as difference. Suppose
you have 2 sets with the keys ids_at_time_x and ids_at_time_y. Add elements to them using
SADD, which takes the key followed by one or more members (SADD key member [member ...]).
When you're ready to perform a diff, execute SDIFF ids_at_time_x ids_at_time_y.
This will return a list of ids from ids_at_time_x that are NOT present in
ids_at_time_y. If you want to do the reverse operation, i.e. retrieve the ids from ids_at_time_y that are not present in
ids_at_time_x, just swap the arguments: SDIFF ids_at_time_y ids_at_time_x.
The best thing about SDIFF is that it operates very efficiently –
time complexity is O(N) where N is the total number of elements in
these 2 sets. Even if you do 2 diff operations, time complexity will
still be linear.
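To make the semantics concrete, here is a minimal Python sketch that uses built-in sets as a stand-in for the two Redis sets. It does not talk to Redis, and the ids are made-up sample data; the only point is to show which ids each direction of the difference returns.

```python
# Plain Python sets standing in for the two Redis sets.
# The ids below are invented sample data, not real Twitter ids.
ids_at_time_x = {101, 102, 103, 104}  # not following back at time x
ids_at_time_y = {103, 104, 105}       # not following back at time y

# Same result as: SDIFF ids_at_time_x ids_at_time_y
# (ids that were on the list at time x but are gone at time y)
only_in_x = ids_at_time_x - ids_at_time_y

# Same result as: SDIFF ids_at_time_y ids_at_time_x
# (ids that are new on the list at time y)
only_in_y = ids_at_time_y - ids_at_time_x

print(sorted(only_in_x))
print(sorted(only_in_y))
```

With a Redis client such as redis-py and a running server, the same steps would presumably go through its sadd and sdiff methods instead of Python's set operators.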
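Alternatively, store them as a sorted list. Redis supports sorted sets. When adding
an id you have to include a score for the element (Redis sorts by score), which in your
case equals the id itself (ZADD key score member, with the id used as both score and member). Keep in mind that scores are double-precision floats, so this is exact only for ids up to 2^53.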
When the lists are ready, we retrieve both of them and compare them in
code with a single pass.
We use the fact that A and B are sorted. We keep two indexes, i for A and j for B, both starting at zero, and compare
A[i] with B[j]. If A[i] is less than B[j], we know
that A[i] is present only in A, so we add it to the list setA and
increase i by one. If B[j] is less than A[i], we add B[j]
to the list setB and increase j by one. If A[i] == B[j], we
add A[i] to the list of intersections and increment both indexes.
Once one list is exhausted, whatever remains in the other list is present only in that list.
This runs in linear time, O(N), where N is the total number of
elements in both A and B.
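The walk-through above can be sketched in Python roughly as follows. The function name diff_sorted is my own choice for illustration, and the sketch assumes both inputs are sorted in ascending order and contain no duplicates.

```python
def diff_sorted(a, b):
    """Split two ascending lists of unique ids into
    (only in a, only in b, present in both)."""
    set_a, set_b, intersection = [], [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            # a[i] cannot appear later in b, so it is only in a
            set_a.append(a[i])
            i += 1
        elif b[j] < a[i]:
            # b[j] cannot appear later in a, so it is only in b
            set_b.append(b[j])
            j += 1
        else:
            # same id in both lists
            intersection.append(a[i])
            i += 1
            j += 1
    # One list is exhausted; the rest of the other is unique to it.
    set_a.extend(a[i:])
    set_b.extend(b[j:])
    return set_a, set_b, intersection


result = diff_sorted([1, 2, 4, 7], [2, 3, 7, 9])
print(result)
```

Each element is visited once, which is where the linear running time comes from.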
Note that this approach will work with any database that can return a sorted list (meaning you can store it in a traditional SQL database and retrieve the lists using ORDER BY twitter_id).
Have a look at all the data types supported by Redis and the full list of their commands; they are nicely documented. Redis also has official clients available in many languages, so this shouldn't be a problem.
You can still store important data in an SQL database and let Redis handle lists of ids.
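For the SQL variant mentioned above, a minimal sketch with Python's built-in sqlite3 module could look like this. The table name not_following_back, the snapshot column, and the sample rows are assumptions made up for the example; only the ORDER BY twitter_id part comes from the answer.

```python
import sqlite3

# In-memory database with a hypothetical table: one row per
# (snapshot label, twitter id). Names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE not_following_back (snapshot TEXT, twitter_id INTEGER)"
)
conn.executemany(
    "INSERT INTO not_following_back VALUES (?, ?)",
    [("x", 104), ("x", 101), ("x", 103), ("y", 105), ("y", 103)],
)


def sorted_ids(snapshot):
    """Return the ids of one snapshot in ascending order."""
    rows = conn.execute(
        "SELECT twitter_id FROM not_following_back "
        "WHERE snapshot = ? ORDER BY twitter_id",
        (snapshot,),
    )
    return [row[0] for row in rows]


a = sorted_ids("x")
b = sorted_ids("y")
print(a, b)
```

The two lists come back already sorted, so they can be compared in one linear pass without sorting them again in application code.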