I am working on some code that takes a dataset and runs a series of algorithms on it.
The user uploads a dataset, then selects which algorithms should be run on it, creating a workflow like this:
workflow = {
    0: {'dataset': 'some dataset'},
    1: {'algorithm1': 'parameters'},
    2: {'algorithm2': 'parameters'},
    3: {'algorithm3': 'parameters'}
}
This means I take workflow[0] as my dataset and run algorithm1 on it. Then I take its results and run algorithm2 on them as my new dataset, then take those results and run algorithm3 on them. It goes on like this until the last item, and there is no length limit on the workflow.
I am writing this in Python. Can you suggest some strategies for processing this workflow?
You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:
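A minimal sketch with functools.reduce, assuming the workflow is a list of steps as shown below; the helper names (run_workflow, algo_by_name) and the step keys ('algorithm', 'params') are only illustrative choices, not anything fixed:

from functools import reduce

def run_workflow(dataset, steps):
    # Fold the steps over the dataset: each step's output becomes
    # the next step's input. algo_by_name is defined further down.
    return reduce(
        lambda data, step: algo_by_name(step['algorithm'])(data, step['params']),
        steps,
        dataset,
    )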
This assumes the workflow looks like the following (text-oriented, so you can load it with YAML/JSON):
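Something along these lines, with the dataset kept separate from the steps (the exact keys are an assumption; whatever your loader and algo_by_name agree on will do):

dataset = 'some dataset'
workflow = [
    {'algorithm': 'algorithm1', 'params': {'alpha': 0.1}},
    {'algorithm': 'algorithm2', 'params': {}},
    {'algorithm': 'algorithm3', 'params': {'iterations': 10}},
]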
And that your algorithms look like:
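That is, plain functions that take the current data plus a params mapping and return the transformed data. A purely illustrative example:

def algorithm1(data, params):
    # Do the real work here; whatever you return is fed to the next step.
    return f'{data} -> algorithm1({params!r})'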
algo_by_name takes a name and gives you an algo function; for example:
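The simplest version is a dict lookup (algorithms registered by hand here; you could also build the mapping from a module's namespace):

ALGORITHMS = {
    'algorithm1': algorithm1,
    # 'algorithm2': algorithm2, ... register the others the same way
}

def algo_by_name(name):
    # Unknown names raise KeyError, which is a reasonable failure mode.
    return ALGORITHMS[name]

With those pieces in place, result = run_workflow(dataset, workflow) runs the whole chain.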
(old edit: if you want a framework for writing pipelines, you could use Ruffus. It’s like a make tool, but with progress support and pretty flow charts.)