I have a computationally intensive project that is highly parallelizable: basically, I have a function that I need to run on every observation in a large PostgreSQL table. The function itself is a stored Python procedure.
Amazon EC2 seems like an excellent fit for the project.
My question is this: should I build a custom image (AMI) that already contains the database? This would seem to have the advantage of minimizing data transfer and making parallelization simple: each instance could be assigned a block of indices to compute, e.g., instance 1 gets rows 1:100, instance 2 gets 101:200, and so on. Splitting the data and the instances apart (which most how-to guides suggest) doesn't seem to make sense for my application, but I'm very new to this, so I'm not confident my intuition is right.
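To make the block-assignment idea concrete, here is a minimal sketch of how I imagine splitting the row range across instances (function and parameter names are made up for illustration):

```python
def partition(n_rows, n_instances):
    """Split row ids 1..n_rows into contiguous blocks, one per instance.

    Returns a list of (start, end) pairs, inclusive on both ends.
    Extra rows from an uneven split go to the earliest instances.
    """
    size, rem = divmod(n_rows, n_instances)
    blocks = []
    start = 1
    for i in range(n_instances):
        end = start + size - 1 + (1 if i < rem else 0)
        blocks.append((start, end))
        start = end + 1
    return blocks

print(partition(1000, 4))
# -> [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Each instance would then only ever touch its own (start, end) range of the table.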
You will definitely want to keep the data and the server instance separate so that changes to your data persist after you are done with the instance. Your best bet is to start with a basic image that has the OS and database platform you want, customize it to suit your needs, and then mount one or more EBS volumes containing your data. You may also want to save your customized instance as your own image (AMI) once you are finished, unless what you are doing is fairly straightforward.
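As a rough sketch of what each worker would then run against its local database, something like the following (table, column, and procedure names are placeholders, and it assumes the psycopg2 driver is installed on the image):

```python
def block_query(table, proc, id_col):
    """Build the SQL that applies a stored procedure to one block of rows.

    The BETWEEN bounds are left as placeholders so the driver can pass
    them safely as query parameters.
    """
    return "SELECT %s(%s) FROM %s WHERE %s BETWEEN %%s AND %%s" % (
        proc, id_col, table, id_col
    )

def run_block(dsn, lo, hi):
    """Process one assigned block (lo..hi) on this worker's database."""
    import psycopg2  # third-party Postgres driver, assumed on the image
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    # e.g. table "observations", stored procedure "compute_obs", id column "id"
    cur.execute(block_query("observations", "compute_obs", "id"), (lo, hi))
    conn.commit()
    conn.close()
```

The point is that the instance itself stays disposable: only the EBS-backed database changes, and the same code runs on every worker with different (lo, hi) bounds.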
Some helpful links:
http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-10-01/creating-an-image.html
http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=100&externalID=1663
(You said Postgres, but this MySQL tutorial covers the same basic concepts you'll want to keep in mind.)