I’m a python developer with pretty good RDBMS experience. I need to process a

Question

Asked: June 16, 2026 (2026-06-16T07:23:54+00:00)

I’m a Python developer with pretty good RDBMS experience. I need to process a fairly large amount of data (approx. 500 GB). The data is sitting in approximately 1200 CSV files in S3 buckets. I have written a script in Python and can run it on a server. However, it is way too slow. Based on the current speed and the amount of data, it will take approximately 50 days to get through all of the files (and of course, the deadline is WELL before that).

Note: the processing is sort of your basic ETL type of stuff – nothing terribly fancy. I could easily just pump it into a temp schema in PostgreSQL and then run scripts on top of it. But, again, from my initial testing, this would be way too slow.

Note: A brand new PostgreSQL 9.1 database will be its final destination.

So, I was thinking about trying to spin up a bunch of EC2 instances to try and run them in batches (in parallel). But, I have never done something like this before so I’ve been looking around for ideas, etc.

Again, I’m a Python developer, so it seems like Fabric + boto might be promising. I have used boto from time to time, but I have no experience with Fabric.

I know from reading/research that this is probably a great job for Hadoop, but I don’t know it and can’t afford to hire it done, and the timeline doesn’t allow for a learning curve or hiring someone. I should also note that it’s kind of a one-time deal, so I don’t need to build a really elegant solution. I just need it to work and to get through all of the data by the end of the year.

Also, I know this is not a simple Stack Overflow kind of question (something like “how can I reverse a list in Python”). But what I’m hoping for is someone to read this and say, “I do something similar and use XYZ… it’s great!”

I guess what I’m asking is: does anybody know of anything out there that I could use to accomplish this task (given that I’m a Python developer, I don’t know Hadoop or Java, and I have a tight timeline that prevents me from learning a new technology like Hadoop or a new language)?

Thanks for reading. I look forward to any suggestions.

1 Answer

Editorial Team · Answer 1 · 2026-06-16T07:23:55+00:00

I often use a combination of SQS/S3/EC2 for this type of batch work. Queue up messages in SQS for all of the work that needs to be performed, split into reasonably small chunks. Spin up N EC2 instances that are configured to read messages from SQS, perform the work, and put the results into S3. Then, and only then, delete the message from SQS.

You can scale this to crazy levels and it has always worked really well for me. In your case, I don’t know if you would store results in S3 or go right to PostgreSQL.
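
A minimal sketch of that pattern, assuming one SQS message per CSV file and a boto3-style SQS client (boto3 rather than the older boto mentioned in the question). The names `enqueue_files`, `run_worker`, and `process_file` are hypothetical, and `process_file` stands in for whatever ETL step you already have. The key point is that the message is deleted only after the work succeeds, so a crashed or failed worker leaves the message in place to be picked up again once its visibility timeout expires.

```python
import json


def enqueue_files(sqs, queue_url, keys):
    """Send one message per S3 object key (here: one CSV file = one unit of work)."""
    for key in keys:
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"key": key}))


def run_worker(sqs, queue_url, process_file, max_empty_polls=3):
    """Poll the queue, process each file, and delete a message only on success.

    Stops after `max_empty_polls` consecutive empty receives, which this
    sketch treats as "the queue is drained". Returns the number of files
    processed successfully.
    """
    processed = 0
    empty_polls = 0
    while empty_polls < max_empty_polls:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling: wait up to 20 s for a message
        )
        messages = resp.get("Messages", [])
        if not messages:
            empty_polls += 1
            continue
        empty_polls = 0
        for msg in messages:
            key = json.loads(msg["Body"])["key"]
            try:
                # Hypothetical callback: download the CSV, transform it,
                # write the result to S3 or PostgreSQL.
                process_file(key)
            except Exception as exc:
                # Do NOT delete: SQS makes the message visible again after
                # the visibility timeout, so another worker can retry it.
                print(f"failed on {key}: {exc}")
                continue
            # Delete only after the work has completed successfully.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed


# Usage on each EC2 instance (assumes boto3 is installed, credentials are
# configured, and QUEUE_URL / my_etl_step are defined by you):
#   import boto3
#   sqs = boto3.client("sqs")
#   run_worker(sqs, QUEUE_URL, my_etl_step)
```

Because every worker runs the same loop against the same queue, adding capacity is just a matter of starting more instances. The queue's visibility timeout should be set longer than the slowest file takes to process; otherwise a message can reappear and be processed twice.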
