Editorial Team
Asked: June 16, 2026

In order to get a Cassandra insert going faster I’m using multithreading, its working

To make a Cassandra insert go faster I'm using multithreading. It's working OK, but adding more threads doesn't make any difference. I think I'm not generating more connections. Maybe I should be using pool.execute(f, *args, **kwargs), but I don't know how to use it and the documentation is quite scanty. Here's my code so far:

import connect_to_ks_bp
from connect_to_ks_bp import ks_refs
import time
import pycassa
from datetime import datetime 
import json
import threadpool
pool = threadpool.ThreadPool(20)
count = 1
bench = open("benchCassp20_100000.txt", "w")

def process_tasks(lines):

    #let threadpool format your requests into a list
    requests = threadpool.makeRequests(insert_into_cfs, lines)

    #insert the requests into the threadpool
    for req in requests:
        pool.putRequest(req) 

    pool.wait()

def read(file):
    """read data from json and insert into keyspace"""
    json_data=open(file)
    lines = []
    for line in json_data:
        lines.append(line)
    print len(lines)
    process_tasks(lines)


def insert_into_cfs(line):
    global count
    count +=1
    if count > 5000:
            bench.write(str(datetime.now())+"\n")
            count = 1
    #print count
    #print kspool.checkedout()
    """
    user_tweet_cf = pycassa.ColumnFamily(kspool, 'UserTweet')
    user_name_cf = pycassa.ColumnFamily(kspool, 'UserName')
    tweet_cf = pycassa.ColumnFamily(kspool, 'Tweet')
    user_follower_cf = pycassa.ColumnFamily(kspool, 'UserFollower')
    """
    tweet_data = json.loads(line)
    """Format the tweet time as an epoch seconds int value"""
    tweet_time = time.strptime(tweet_data['created_at'],"%a, %d %b %Y %H:%M:%S +0000")
    tweet_time  = int(time.mktime(tweet_time))

    new_user_tweet(tweet_data['from_user_id'],tweet_time,tweet_data['id'])
    new_user_name(tweet_data['from_user_id'],tweet_data['from_user_name'])
    new_tweet(tweet_data['id'],tweet_data['text'],tweet_data['to_user_id'])

    if tweet_data['to_user_id'] != 0:
        new_user_follower(tweet_data['from_user_id'],tweet_data['to_user_id'])


# The 4 functions below carry out the inserts into specific column families
def new_user_tweet(from_user_id,tweet_time,id):
    ks_refs.user_tweet_cf.insert(from_user_id,{(tweet_time): id})

def new_user_name(from_user_id,user_name):
    ks_refs.user_name_cf.insert(from_user_id,{'username': user_name})

def new_tweet(id,text,to_user_id):
    ks_refs.tweet_cf.insert(id,{
    'text': text
    ,'to_user_id': to_user_id
    })  

def new_user_follower(from_user_id,to_user_id):
    ks_refs.user_follower_cf.insert(from_user_id,{to_user_id: 0})   

if __name__ == '__main__':
    read('tweets.json')

This is the other file, the connect_to_ks_bp module imported above:

import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

"""This is a static class I set up to hold the global database connection stuff,
I only want to connect once and then the various insert functions will use these fields a lot"""
class ks_refs():
    pool = ConnectionPool('TweetsKS',use_threadlocal = True,max_overflow = -1)

    @classmethod
    def cf_connect(cls, column_family):
        cf = pycassa.ColumnFamily(cls.pool, column_family)
        return cf

ks_refs.user_name_cfo = ks_refs.cf_connect('UserName')
ks_refs.user_tweet_cfo = ks_refs.cf_connect('UserTweet')
ks_refs.tweet_cfo = ks_refs.cf_connect('Tweet')
ks_refs.user_follower_cfo = ks_refs.cf_connect('UserFollower')

#trying out a batch mutator, which is supposed to increase performance
ks_refs.user_name_cf = ks_refs.user_name_cfo.batch(queue_size=10000)
ks_refs.user_tweet_cf = ks_refs.user_tweet_cfo.batch(queue_size=10000)
ks_refs.tweet_cf = ks_refs.tweet_cfo.batch(queue_size=10000)
ks_refs.user_follower_cf = ks_refs.user_follower_cfo.batch(queue_size=10000)
1 Answer

    Editorial Team
    Added an answer on June 16, 2026 at 1:04 pm

    A few thoughts:

    • A batch queue_size of 10,000 is far too large. Try 100.
    • Make your ConnectionPool at least as large as the number of threads by setting the pool_size parameter. The default is 5. Pool overflow should only be used when the number of active threads may vary over time, not when you have a fixed number of threads, because overflow causes a lot of unnecessary opening and closing of connections, which is a fairly expensive process.
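    As a minimal sketch of the two points above, assuming the 20 worker threads and the 'TweetsKS' keyspace from the question: the helper names pool_settings and connect are hypothetical, and the pycassa imports are deferred so the sizing helper can be run without a Cassandra node. It uses only pool_size, max_overflow, use_threadlocal and batch(queue_size=...), which the question and this answer already mention.

```python
NUM_THREADS = 20        # matches threadpool.ThreadPool(20) in the question
BATCH_QUEUE_SIZE = 100  # the smaller batch size suggested above


def pool_settings(num_threads):
    """Keyword arguments for ConnectionPool, sized for a fixed thread count.

    Hypothetical helper, not part of pycassa.
    """
    return {
        'pool_size': num_threads,  # one connection per worker thread
        'max_overflow': 0,         # fixed thread count, so no overflow connections
        'use_threadlocal': True,   # as in the question's code
    }


def connect(keyspace='TweetsKS', num_threads=NUM_THREADS):
    """Open a pool and return it with a batch mutator for the Tweet column family."""
    # Imported here so the rest of the module loads without pycassa installed.
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool(keyspace, **pool_settings(num_threads))
    tweet_cf = ColumnFamily(pool, 'Tweet')
    return pool, tweet_cf.batch(queue_size=BATCH_QUEUE_SIZE)
```

    A batch mutator only sends automatically once its queue fills, so whatever is still queued at the end of the run has to be sent explicitly with the mutator's send() method.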

    After you’ve resolved those issues, look into these:

    • I’m not familiar with the threadpool library that you’re using. Take the insertions to Cassandra out of the picture and make sure that you still see performance increase as you increase the number of threads.
    • Python itself has a limit to how many threads may be useful due to the GIL. It shouldn’t normally max out at 20, but it might if you’re doing something CPU intensive or something that requires a lot of Python interpretation. The test that I described in my previous point will cover this as well. It may be the case that you should consider using the multiprocessing module, but you would need some code changes to handle that (namely, not sharing ConnectionPools, ColumnFamilies, or almost anything else between processes).
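    One way to run the test described above is sketched here: time only the parsing work from insert_into_cfs, with the Cassandra inserts removed, at several thread counts. It uses plain threading rather than the threadpool library, and the sample records from make_sample_lines are synthetic stand-ins for tweets.json, so treat it as an assumption-laden sketch rather than a faithful benchmark.

```python
import json
import threading
import time

TIME_FORMAT = "%a, %d %b %Y %H:%M:%S +0000"  # same format as in the question


def parse_line(line):
    """The per-tweet work from insert_into_cfs, minus the Cassandra inserts."""
    tweet = json.loads(line)
    parsed = time.strptime(tweet['created_at'], TIME_FORMAT)
    return int(time.mktime(parsed))


def timed_run(lines, num_threads):
    """Parse every line using num_threads threads; return elapsed seconds."""
    # Give each thread every num_threads-th line.
    chunks = [lines[i::num_threads] for i in range(num_threads)]

    def worker(chunk):
        for line in chunk:
            parse_line(line)

    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    start = time.time()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.time() - start


def make_sample_lines(n):
    """Synthetic JSON lines standing in for tweets.json (hypothetical data)."""
    record = {'created_at': 'Sun, 01 Jan 2012 12:00:00 +0000',
              'from_user_id': 1, 'to_user_id': 0, 'id': 1, 'text': 'hello'}
    return [json.dumps(record) for _ in range(n)]


if __name__ == '__main__':
    sample = make_sample_lines(20000)
    parse_line(sample[0])  # warm up strptime in the main thread first
    for n in (1, 5, 20):
        print("%d threads: %.3f s" % (n, timed_run(sample, n)))
```

    If the timings barely change between 1 and 20 threads, the parsing itself is bound by the GIL, and adding threads around the Cassandra calls is unlikely to help much either.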