As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources like raw or annotated text corpora or syntactic tree banks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the features and values I need and transform the data from one format to the other. This usually results in a very large number of very large files and a very large number of small programs which process them in order to get the input for some machine learning framework (like the arff files for Weka).
One needs to be extremely well organised to deal with that and program with great care not to miss any important peculiarities, exceptions or errors in the tons of data. Many principles of good software design like design patterns or refactoring paradigms are no big use for these tasks because things like security, maintainability or sustainability are of no real importance – once the program successfully processed the data one doesn’t need it any longer. This has gone so far that I even stopped bothering about using classes or functions at all in my Python code and program in a simple procedural way. The next experiment will require different data sets with unique characteristics and in a different format so that their preparation will likely have to be programmed from scratch anyway. My experience so far is that it’s not unusual to spend 80-90% of a project’s time on the task of preparing training data. Hours and days go by only on thinking about how to get from one data format to another. At times, this can become quite frustrating.
Well, you probably guessed that I’m exaggerating a bit, on purpose even, but I’m positive you understand what I’m trying to say. My question, actually, is this:
Are there any general frameworks, architectures, best practices for approaching these tasks? How much of the code I write can I expect to be reusable given optimal design?
I find myself mostly using the textutils from GNU coreutils and flex for corpus preparation, chaining things together in simple scripts, at least when the preparations i need to make are simple enough for regular expressions and trivial filtering etc.
It is still possible to make things reusable, the general rules also apply here. If you are programming with no regard to best practices and the like and just program procedurally there is IMHO really no wonder that you have to do everything from scratch when starting a new project.
Even though the format requirements will vary a lot there is still many common tasks, ie. tag-stripping, tag-translation, selection, tabulation, some trivial data harvesting such as number of tokens, sentences and the like. Programming these tasks aming for high reusability will pay off, even though it takes longer at first.