I come from a computer science background, but I am now doing genomics. My

Asked: May 17, 2026

I come from a computer science background, but I am now doing genomics.

My projects include a lot of bioinformatics, typically aligning sequences and comparing the overlap between sequences and various genome annotation features, using data from different classes of biological samples, time courses, microarrays, and high-throughput sequencing (“next-generation” sequencing, though it is actually the current generation), this kind of stuff.

The workflow for this kind of analysis is quite different from what I experienced during my computer science studies: no UML and thoughtfully designed objects shining with sublime elegance, no version management, no proper documentation (often no documentation at all), no software engineering at all.

Instead, what everyone in this field does is hack out one Perl script or AWK one-liner after another, usually for one-time use.

I think the reason is that the input data and formats change so fast, and the questions need to be answered so soon (deadlines!), that there seems to be no time for project organization.

One example to illustrate this: let’s say you want to write a raytracer. You would probably put a lot of effort into the software engineering first, then program it, eventually in some highly optimized form, because you would use the raytracer countless times with different input data and would keep changing the source code for years to come. So good software engineering is paramount when coding a serious raytracer from scratch. But imagine you want to write a raytracer that you already know you will use to render one single picture, ever, and that picture is of a reflecting sphere over a checkered floor. In this case you would just hack it together somehow. Bioinformatics is only ever like the latter case.

You end up with whole directory trees holding the same information in different formats, converted step by step until you reach the one particular format the next step needs, and dozens of files with names like “tmp_SNP_cancer_34521_unique_IDs_not_Chimp.csv”, where one day later you don’t have the slightest idea why you created the file or what exactly it is.

For a while I was using MySQL, which helped, but now new data is generated and changes format so quickly that proper database design is not possible.

I am aware of a single publication that deals with these issues (Noble, W. S. (2009, July). A quick guide to organizing computational biology projects. PLoS Comput Biol 5 (7), e1000424). The author sums up the goal quite nicely:

“The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.”

Well, that’s what I want, too! But I am already following the same practices as that author, and I feel they are absolutely insufficient.

Documenting each and every command you issue in Bash, with a comment on exactly why you did it, is tedious and error-prone. The steps in the workflow are just too fine-grained. Even if you do it, it can still be extremely tedious to figure out what each file was for, at which point a particular workflow was interrupted, for what reason, and where you continued.

(I am not using the word “workflow” in the sense of Taverna; by workflow I just mean the steps, commands and programs you choose to execute to reach a particular goal).

How do you organize your bioinformatics projects?

1 Answer

Editorial Team · Answer 1 · 2026-05-17T00:23:04+00:00

I’m a software specialist embedded in a team of research scientists, though in the earth sciences, not the life sciences. A lot of what you write is familiar to me.

One thing to bear in mind is that much of what you have learned in your studies is about engineering software for continued use. As you have observed, a lot of what research scientists do is for one-off use, and the engineered approach is not suitable for it. If you want to implement some aspects of good software engineering, you are going to have to pick your battles carefully.

Before you start fighting any battles, you are going to have to critically examine your own ideas to ensure that what you learned in school about general-purpose software engineering is valid for your current situation. Don’t assume that it is.

In my case the first battle I picked was the implementation of source code control. It wasn’t hard to find examples of all the things that go wrong when you don’t have version control in place:

some users had dozens of directories each with different versions of the ‘same’ code, and only the haziest idea of what most of them did that was unique, or why they were there;
some users had lost useful modifications by overwriting them and not being able to remember what they had done;
it was easy to find situations where people were working on what should have been the same program but were in fact developing incompatibly in different directions;
and so on.

Once I had gathered the information — and make sure you keep good notes about who said what and what it cost them — it became relatively easy to paint a picture of a better world with source code control.
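One way to gather that kind of evidence could be sketched as follows. This is only an illustration under my own assumptions, not something from the original answer: the function name `find_divergent_copies` and the default `*.pl` pattern are hypothetical, and "the same code" is approximated as "files sharing a file name".

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_divergent_copies(root, pattern="*.pl"):
    """Report file names that exist under `root` with differing contents.

    Returns a dict mapping file name -> {sha256 digest -> sorted list of paths},
    restricted to names that occur with more than one distinct content.
    """
    by_name = defaultdict(lambda: defaultdict(list))
    for path in Path(root).rglob(pattern):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_name[path.name][digest].append(str(path))
    # Keep only names whose copies have drifted apart (two or more digests).
    return {
        name: {digest: sorted(paths) for digest, paths in versions.items()}
        for name, versions in by_name.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    # Hypothetical usage: scan the current directory tree for drifting scripts.
    for name, versions in find_divergent_copies(".").items():
        print(f"{name}: {len(versions)} distinct versions")
        for digest, paths in versions.items():
            print(f"  {digest[:12]}  {', '.join(paths)}")
```

A report like this only shows that copies differ, not which one is right, but that is usually enough to start the conversation.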

Next, well, next you have to choose your own next battle. But one of the seeds of doubt you have to sow in your scientist colleagues’ minds is ‘reproducibility’. Scientific experiments are not valid if they are not reproducible; if their experiments involve software (and they always do), then careful software engineering is essential for reproducibility. A lot of this is about data provenance, but that’s a topic for another day.
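To make the provenance point a little more concrete, here is a minimal sketch, again under my own assumptions rather than anything described above. The helper `run_with_provenance` and the `.prov.json` sidecar convention are hypothetical. The idea is that every derived file gets a small record of the command that produced it, the checksums of its inputs, and a free-text note on why it was made.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def run_with_provenance(command, inputs, output, note=""):
    """Run `command` (a list of arguments), write its stdout to `output`,
    and save a provenance record next to it as `<output>.prov.json`."""
    started = datetime.now(timezone.utc).isoformat()
    with open(output, "wb") as out:
        # check=True raises CalledProcessError if the command fails.
        completed = subprocess.run(command, stdout=out, check=True)
    record = {
        "command": command,
        "note": note,
        "started_utc": started,
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "output": {str(output): sha256_of(output)},
        "returncode": completed.returncode,
    }
    Path(str(output) + ".prov.json").write_text(json.dumps(record, indent=2))
    return record


# Hypothetical usage (file names are made up for illustration):
# run_with_provenance(["sort", "-u", "ids.txt"], inputs=["ids.txt"],
#                     output="unique_ids.txt", note="deduplicate SNP IDs")
```

The sidecar file answers "why did I create this file and what is it" without anyone having to keep a separate lab notebook of shell commands, and it lives comfortably alongside the scripts in version control.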
