I am aware of Twissandra, which is an example Twitter clone built on Cassandra, but I am interested to see whether anyone has shared a Cassandra schema meant not for cloning Twitter but for storing tweets coming through the Twitter Streaming API.
It very much depends what sort of queries you want to do with the data after you have ingested it – I see from your previous question “Dumping Twitter Streaming API tweets…” you probably just want to do big batch processing on it.
If this is the case, you just need to worry about load balancing: making sure each node in the cluster handles 1/n of the write load and contains 1/n of the data. Using the random partitioner and inserting one row per tweet, with the status id as the row key, will achieve this.
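To see why hashed row keys spread the load, here is a rough sketch in plain Python (no Cassandra involved). It assumes the partitioner behaves like an MD5 hash of the row key mapped onto equally sized token ranges; the function names `node_for_key` and `distribution` and the made-up status ids are mine, purely for illustration, and a real cluster's token assignment may differ.

```python
import hashlib
from collections import Counter


def node_for_key(row_key: str, node_count: int) -> int:
    """Map a row key to a node index via an MD5-derived token.

    Assumption: each node owns one equally sized slice of the token space.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    token = int.from_bytes(digest, "big")  # value in 0 .. 2**128 - 1
    return token * node_count // 2**128


def distribution(keys, node_count: int) -> list:
    """Count how many keys land on each node."""
    counts = Counter(node_for_key(key, node_count) for key in keys)
    return [counts.get(node, 0) for node in range(node_count)]


if __name__ == "__main__":
    # Hypothetical status ids: sequential, like ids arriving from a stream.
    status_ids = [str(10**17 + i) for i in range(10_000)]
    print(distribution(status_ids, node_count=4))
```

Even though the ids are sequential, hashing scatters them, so each node should end up with close to a quarter of the rows.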
However, if you want to do queries like “give me all tweets for a given user”, you will need a slightly more complicated schema, as the schema suggested above would require you to scan all the data. You could insert multiple tweets per row, with the row key being the user id, the column name being the tweet id, and the column value being the tweet. Then you could use get_slice to answer that query.
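The one-row-per-user layout can be modelled in memory to show what the query looks like. This is only a sketch of the data model, not Cassandra client code: the class `UserTimeline` and its `insert` and `slice` methods are hypothetical names, and `slice` only imitates the idea of a column slice (columns kept sorted by name, read back by range).

```python
from bisect import bisect_left, bisect_right, insort


class UserTimeline:
    """In-memory stand-in for a column family keyed by user id.

    Row key = user id, column name = tweet id, column value = tweet.
    """

    def __init__(self):
        # user_id -> (sorted list of tweet ids, {tweet_id: tweet})
        self._rows = {}

    def insert(self, user_id, tweet_id, tweet):
        ids, values = self._rows.setdefault(user_id, ([], {}))
        if tweet_id not in values:
            insort(ids, tweet_id)  # keep column names sorted
        values[tweet_id] = tweet

    def slice(self, user_id, start=None, finish=None, count=100):
        """Return up to `count` (tweet_id, tweet) pairs in id order,
        limited to the inclusive range start..finish when given."""
        ids, values = self._rows.get(user_id, ([], {}))
        lo = 0 if start is None else bisect_left(ids, start)
        hi = len(ids) if finish is None else bisect_right(ids, finish)
        return [(tid, values[tid]) for tid in ids[lo:hi][:count]]


if __name__ == "__main__":
    timeline = UserTimeline()
    timeline.insert(1, 30, "third tweet")
    timeline.insert(1, 10, "first tweet")
    timeline.insert(2, 15, "someone else")
    print(timeline.slice(1))  # only user 1's row is touched
```

The point of the layout is that answering the per-user query reads a single row rather than scanning every tweet.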
A good (somewhat related) blog post: http://blog.insidesystems.net/basic-time-series-with-cassandra