As part of trying to learn Hadoop, I’m working on a project using a large number of tweets from the Twitter Streaming API. Of ~20M tweets, I’ve generated a list of the N most active users, who I want to try to cluster based on the text of all their tweets.
So I have a list of a few thousand user names, and what I want to do is concatenate the content of all the tweets from each user together, and eventually generate a word count vector for each user.
I can’t figure out how to accomplish the concatenation though. I want to be able to write some mapper that takes in each tweet line, and says “if this tweet comes from a user I’m interested in, map it with key username and value tweetText, otherwise ignore it.” Then it would be simple for the reducer to concatenate the tweets like I want to.
My problem is, how do I tell the mapper about this big list of users that I’m interested in? It seems like it would be nice if the mapper could have a Hashtable with all the users, but I have no idea if that’s possible.
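To make the idea concrete, here is a rough sketch of the mapper I have in mind, written in plain Python rather than as a real Hadoop job. It assumes each input line looks like `username<TAB>tweetText` and that the user list is a text file with one name per line that has somehow been made available to every mapper; the function names `load_users` and `map_tweet` are made up for illustration. A few thousand names should fit comfortably in an in-memory set.

```python
def load_users(lines):
    """Build the lookup set from an iterable of lines, one username per line."""
    return {line.strip() for line in lines if line.strip()}


def map_tweet(line, wanted_users):
    """Emit (username, tweetText) only for users in the wanted set.

    Assumes a tab-separated "username<TAB>tweetText" record (my assumption,
    not the actual streaming API format). Malformed lines are skipped.
    """
    username, sep, text = line.rstrip("\n").partition("\t")
    if sep and username in wanted_users:
        yield username, text


if __name__ == "__main__":
    users = load_users(["alice\n", "bob\n"])
    for record in ["alice\thello world\n", "carol\tignored\n"]:
        for key, value in map_tweet(record, users):
            print(key, value, sep="\t")
```

The set lookup is constant time per tweet, so the filter itself is cheap; the open part is only how to get the list onto each node.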
Is there a good way to accomplish this, or is the problem just not a good fit for Map/Reduce?
Aw, never mind. I’ve been thinking about this for a while, but once I wrote it out here, I realized how I think I should be doing it. Instead of making a list of all the users with more than X tweets, and then going through the data again and trying to find their tweets, I can do it all at once.
Currently I am mapping [username,1] and then having the reducer sum all of the 1’s together to generate tweet counts. Then I try to find the tweets of all users with more than X tweets.
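As a small illustration of that counting job, here is a local simulation in Python (not actual Hadoop code). The tab-separated `username<TAB>tweetText` input and the names `map_tweet_count`, `reduce_tweet_count`, and `count_tweets` are assumptions for the sketch; the dictionary grouping stands in for the shuffle step the framework would do.

```python
from collections import defaultdict


def map_tweet_count(line):
    """Emit (username, 1) for each well-formed tweet line."""
    username, sep, _text = line.partition("\t")
    if sep:
        yield username, 1


def reduce_tweet_count(username, ones):
    """Sum the 1's for one user to get that user's tweet count."""
    return username, sum(ones)


def count_tweets(lines):
    """Run map, group by key (the simulated shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for user, one in map_tweet_count(line):
            grouped[user].append(one)
    return dict(reduce_tweet_count(user, ones) for user, ones in grouped.items())


if __name__ == "__main__":
    counts = count_tweets(["alice\ta", "bob\tb", "alice\tc"])
    print(counts)
```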
To do it all at once, I should map [username,completeTweet] and then have the reducer count the tweets for each user as it concatenates them, outputting data only for users who have more than X tweets and just ignoring the other users.
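Sketched the same way in Python (a local simulation, not real Hadoop code), the single-pass version could look like this. The `username<TAB>tweetText` input format, the space used to join tweets, and the names `map_user_tweet`, `reduce_active_user`, and `run_job` are all assumptions for illustration; `min_tweets` plays the role of X.

```python
from collections import defaultdict


def map_user_tweet(line):
    """Emit (username, tweetText) for every well-formed tweet line."""
    username, sep, text = line.rstrip("\n").partition("\t")
    if sep:
        yield username, text


def reduce_active_user(username, tweets, min_tweets):
    """Concatenate a user's tweets, but only emit if they have more than min_tweets."""
    texts = list(tweets)
    if len(texts) > min_tweets:
        yield username, " ".join(texts)


def run_job(lines, min_tweets):
    """Map, group by username (the simulated shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for user, text in map_user_tweet(line):
            grouped[user].append(text)
    output = {}
    for user in sorted(grouped):
        for key, document in reduce_active_user(user, grouped[user], min_tweets):
            output[key] = document
    return output


if __name__ == "__main__":
    sample = ["alice\tfirst\n", "bob\tonly one\n", "alice\tsecond\n", "alice\tthird\n"]
    print(run_job(sample, 2))
```

The trade-off is that every tweet's text now goes through the shuffle instead of just a 1 per tweet, in exchange for not needing a second pass over the data.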