I have a conceptual question.
Suppose I have a procedure (in any language) that takes a data set as input, processes it, and writes the output to an array. This array is used downstream for further processing. The problem is that the code has a very long run time, long enough that it needs to be optimized!
What I am proposing is to partition the input data set into smaller chunks and call the procedure on each chunk in parallel. Sounds simple!
So I thought I would put the procedure in a separate file, build it as a separate executable, and submit that executable as a batch job for each of the smaller data sets.
The problem with this method is that each batch job is a separate process, so how do I build the array I was creating earlier from the output of all these jobs? I can think of writing each job's output to a file and then processing those files to put the array back together.
Is there a better way to do things in parallel?
Thanks for your suggestions 🙂
I agree it looks like MapReduce.
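To make the pattern concrete, here is a minimal sketch in Python (chosen only for illustration, it is not the Erlang version discussed below). It assumes the chunks can be processed independently and that the final array is just the per-chunk outputs joined in order. `process_chunk`, `split`, `merge` and `parallel_process` are hypothetical names, and `process_chunk` is a stand-in for your real procedure.

```python
from multiprocessing import Pool


def process_chunk(chunk):
    # Hypothetical stand-in for the expensive procedure: squares each value.
    return [x * x for x in chunk]


def split(data, n_chunks):
    # Cut data into at most n_chunks contiguous slices, preserving order.
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def merge(partial_results):
    # "Reduce" step: join the per-chunk outputs back into one array.
    return [y for part in partial_results for y in part]


def parallel_process(data, n_workers=4):
    chunks = split(data, n_workers)
    with Pool(n_workers) as pool:
        # "Map" step: each chunk is handled by a worker process.
        # pool.map returns the results in the same order as the chunks.
        partial_results = pool.map(process_chunk, chunks)
    return merge(partial_results)


if __name__ == "__main__":
    print(parallel_process(list(range(10))))
    # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The worker pool hands the partial results back to the parent process, so there is no need to go through intermediate files as long as everything runs on one machine.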
You might like to look at Erlang, which supports very elegant ways of partitioning and distributing work across processes, processors, and machines.
Joe Armstrong’s Erlang book “Programming Erlang – Software for a Concurrent World” presents a simple MapReduce implementation that can be used across processes.
I found these blogs which talk about Joe’s simple MapReduce:
http://bc.tech.coop/blog/070520.html
http://bc.tech.coop/blog/070601.html
which might explain the idea, and give Erlang code.
Erlang is Open Source, so you could do a few experiments for a small investment in time.
Concurrency and communication are built into the language, and it all works ‘out of the box’ on a single machine. You do need to set up a ‘key’ so that Erlang Virtual Machines can communicate, but once that’s done, a program can be run across a local area network.