The Archive Base


Editorial Team
Asked: May 17, 2026

I come from a computer science background, but I am now doing genomics.

My projects involve a lot of bioinformatics, typically aligning sequences and comparing the overlap between sequences and various genome annotation features, across different classes of biological samples, time-course data, microarray data, and high-throughput sequencing data (“next-generation” sequencing, though it is actually the current generation).

The workflow for this kind of analysis is quite different from what I experienced during my computer science studies: no UML and thoughtfully designed objects shining with sublime elegance, no version management, no proper documentation (often no documentation at all), no software engineering at all.

Instead, what everyone in this field does is hack out one Perl script or AWK one-liner after another, usually for one-time use.

I think the reason is that the input data and formats change so fast, the questions need to be answered so soon (deadlines!), that there seems to be no time for project organization.

One example to illustrate this: let’s say you want to write a raytracer. You would probably put a lot of effort into the software engineering first, then program it, eventually in some highly optimized form, because you would use the raytracer countless times with different input data and would keep changing the source code for years to come. So good software engineering is paramount when coding a serious raytracer from scratch. But imagine you want to write a raytracer that you already know you will use to render one single picture, ever, and that picture is of a reflecting sphere over a checkered floor. In this case you would just hack it together somehow. Bioinformatics is always like the latter case.

You end up with whole directory trees holding the same information in different formats, until you have reached the one particular format necessary for the next step, and dozens of files with names like “tmp_SNP_cancer_34521_unique_IDs_not_Chimp.csv” where, one day later, you don’t have the slightest idea why you created the file or what exactly it is.

For a while I was using MySQL, which helped, but new data is now generated and changes format so quickly that proper database design is not possible.

I am aware of one single publication which deals with these issues (Noble, W. S. (2009, July). A quick guide to organizing computational biology projects. PLoS Comput Biol 5 (7), e1000424+). The author sums the goal up quite nicely:

“The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.”

Well, that’s what I want, too! But I am following the same practices as that author already, and I feel it is absolutely insufficient.

Documenting each and every command you issue in Bash, and commenting it with exactly why you did it, is tedious and error-prone. The steps in the workflow are just too fine-grained. Even if you do it, it can still be an extremely tedious task to figure out what each file was for, at which point a particular workflow was interrupted and for what reason, and where you continued.

(I am not using the word “workflow” in the sense of Taverna; by workflow I just mean the steps, commands and programs you choose to execute to reach a particular goal).

How do you organize your bioinformatics projects?


1 Answer

    Editorial Team
    Added an answer on May 17, 2026 at 12:23 am

    I’m a software specialist embedded in a team of research scientists, though in the earth sciences, not the life sciences. A lot of what you write is familiar to me.

    One thing to bear in mind is that much of what you learned in your studies is about engineering software for continued use. As you have observed, a lot of what research scientists do is one-off work, and the fully engineered approach is not suitable for it. If you want to implement some aspects of good software engineering, you are going to have to pick your battles carefully.

    Before you start fighting any battles, you are going to have to critically examine your own ideas to ensure that what you learned in school about general-purpose software engineering is valid for your current situation. Don’t assume that it is.

    In my case the first battle I picked was the implementation of source code control. It wasn’t hard to find examples of all the things that go wrong when you don’t have version control in place:

    • some users had dozens of directories each with different versions of the ‘same’ code, and only the haziest idea of what most of them did that was unique, or why they were there;
    • some users had lost useful modifications by overwriting them and not being able to remember what they had done;
    • it was easy to find situations where people were working on what should have been the same program but were in fact developing incompatibly in different directions;
    • and so on.

    Once I had gathered the information — and make sure you keep good notes about who said what and what it cost them — it became relatively easy to paint a picture of a better world with source code control.

    Next, well, next you have to choose your own next battle. But one of the seeds of doubt you have to sow in your scientist colleagues’ minds is ‘reproducibility’. Scientific experiments are not valid if they are not reproducible; if their experiments involve software (and they always do), then careful software engineering is essential for reproducibility. A lot of this is about data provenance, but that’s a topic for another day.
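To make the provenance point a little more concrete, here is a minimal sketch of one way a single analysis step could be recorded automatically instead of being documented by hand. It is only an illustration under stated assumptions: the helper names `sha256_of` and `run_logged`, the log file name `provenance.jsonl`, and the record layout are all hypothetical and not taken from any existing tool. The idea is simply to wrap each command so that its arguments, the reason for running it, and checksums of its input and output files are appended to a log.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_logged(command, inputs, outputs, reason, log_path="provenance.jsonl"):
    """Run a command and append one JSON record describing the step.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    started = datetime.now(timezone.utc).isoformat()
    # Hash inputs before the run so the record reflects what the command saw.
    input_hashes = {str(p): sha256_of(p) for p in inputs}
    # check=True raises CalledProcessError if the command fails,
    # so a failed step never produces a log record.
    result = subprocess.run(command, check=True)
    record = {
        "started": started,
        "command": [str(part) for part in command],
        "reason": reason,
        "inputs": input_hashes,
        "outputs": {str(p): sha256_of(p) for p in outputs},
        "returncode": result.returncode,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    # Toy step: upper-case a tiny sequence file using a child Python process.
    script = (
        "import sys; "
        "open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
    )
    with tempfile.TemporaryDirectory() as workdir:
        src = Path(workdir) / "reads.txt"
        dst = Path(workdir) / "reads_upper.txt"
        src.write_bytes(b"acgt\n")
        record = run_logged(
            [sys.executable, "-c", script, str(src), str(dst)],
            inputs=[src],
            outputs=[dst],
            reason="normalise case before alignment",
            log_path=Path(workdir) / "provenance.jsonl",
        )
        print(json.dumps(record, indent=2))
```

A log like this does not explain the science, but it does answer "which command produced this file, from which inputs, and why", and the checksums show whether an input has changed since the step was run. Keeping the log under source code control alongside the scripts ties the two battles together.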
