Does anyone have a good framework to recommend that handles the stages of processing file-oriented transactions?
Our scenario is simple: we receive a file and validate it. If something is wrong, we abort and generate a report. If the file is good, it goes to the next stage. At some stages, valid data can be extracted and processed further, while invalid data is held and its errors are reported.
One can argue that an ESB can be used to do this, but I am really looking for something that is just a bit more automated than shell scripts and cron jobs.
Does anyone have a good open source framework to recommend for these file-watching, file-movement and job-triggering tasks?
Very Small Scale
incron (inotify-based cron), with every job as a single script.
Very simple: you drop your files into designated directories, and each file is automatically handed to the script as a job.
This does, however, require you to implement logging and file shuffling yourself. You also need a (simplistic) scheme for identifying jobs, so that a success or error can be logged and reported back against the right one.
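As a rough idea of what such a script might look like, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the `/data/incoming` directory, the script path, the `done`/`failed` directory names, and the "three comma-separated fields per line" validation rule. The incrontab line in the comment uses incron's `$@` (watched directory) and `$#` (file name) placeholders.

```python
#!/usr/bin/env python3
"""Hypothetical incron job: validate one dropped file, log, and move it."""
import logging
import shutil
import sys
import uuid
from pathlib import Path

# Example incrontab entry (paths are assumptions, adjust to your layout):
#   /data/incoming IN_CLOSE_WRITE /usr/local/bin/handle_file.py $@/$#

log = logging.getLogger("file_job")


def validate(path: Path) -> list:
    """Placeholder rule: each non-empty line must have 3 comma-separated fields."""
    errors = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip() and len(line.strip().split(",")) != 3:
                errors.append(f"line {number}: expected 3 fields")
    return errors


def handle(path: Path, done_dir: Path, failed_dir: Path) -> bool:
    """Validate the file, move it to done/ or failed/, and log the outcome."""
    job_id = uuid.uuid4().hex[:8]  # simplistic job identifier for the logs
    errors = validate(path)
    target_dir = failed_dir if errors else done_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    # Prefix the job id so repeated file names do not overwrite each other.
    shutil.move(str(path), str(target_dir / f"{job_id}-{path.name}"))
    if errors:
        for error in errors:
            log.error("job %s: %s: %s", job_id, path.name, error)
    else:
        log.info("job %s: %s accepted", job_id, path.name)
    return not errors


def main(incoming: Path) -> int:
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    base = incoming.parent.parent  # done/ and failed/ sit next to incoming/
    return 0 if handle(incoming, base / "done", base / "failed") else 1


# Only act when incron (or a person) actually supplied a file path.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(Path(sys.argv[1])))
```

The job id is only there so that log lines and moved files can be matched up later. A real setup would probably want the error report written somewhere more useful than the log.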
Small/Medium Scale
Celery and shared storage*.
The initial investment of setting up Celery is worth it: you get error reporting and a solid processing framework.
Small/Medium Scale
Delayed Job and shared storage*.
Like Celery, but Ruby-specific. Has a neat GUI.
Large Scale (just for kicks)
Luigi and Hadoop.
Luigi for processing, Hadoop for storing and providing your job data.
shared storage*: NFS would be the absolute simplest way of sharing files across nodes. You submit a file to your storage solution, then include a reference to that file in the job you submit to the associated framework.
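To make the "file in storage, reference in the job" idea concrete, here is a small framework-neutral sketch using only the Python standard library. The function names `submit` and `work`, the JSON message layout, and the line-counting "processing" are all made up for illustration. In practice the message would be whatever your job framework carries, for example a path string passed as a task argument.

```python
import json
import shutil
import uuid
from pathlib import Path


def submit(local_file: Path, shared_root: Path) -> str:
    """Copy a file to shared storage and return a job message referencing it."""
    job_id = uuid.uuid4().hex
    job_dir = shared_root / job_id  # one directory per job avoids name clashes
    job_dir.mkdir(parents=True)
    stored = job_dir / local_file.name
    shutil.copy2(local_file, stored)
    # The message carries only the path, never the file contents.
    return json.dumps({"job_id": job_id, "path": str(stored)})


def work(message: str) -> int:
    """Worker side: resolve the reference and process the file.

    The 'processing' here just counts lines, as a stand-in for real work.
    """
    job = json.loads(message)
    with open(job["path"], encoding="utf-8") as handle:
        return sum(1 for _ in handle)
```

This only works if every worker node mounts the shared storage at the same path, which is exactly what NFS gives you cheaply.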
Full disclosure: I work for Spotify, where Luigi was developed.