I have a function calc_dG that, given an array representing a short DNA sequence (roughly 3 to 15 bases), returns the binding energy of that sequence. It is really just an array lookup: nndG is an array of binding energies for adjacent pairs of bases, so with sequences encoded numerically as a,g,c,t -> 0,1,2,3, the pair terms can be looked up with nndG[4*S[:-1]+S[1:]]. This means that the energies of many sequences can be calculated at once, very quickly, in numpy.
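For concreteness, a minimal sketch of such a lookup might look like the following. The nndG values are placeholders rather than real parameters, and summing the pair terms along each sequence is an assumption about how calc_dG combines them; the leading `...` in the indexing is only there so that the same function accepts either one sequence or a 2D array with one sequence per row.

```python
import numpy as np

# Placeholder energies (hypothetical): 16 values, one per ordered pair of bases.
# Real nearest-neighbour parameters would replace this array.
nndG = -np.arange(16, dtype=float) / 10.0

def calc_dG(S):
    """Sum the pair energies along the last axis of S.

    S holds sequences encoded a,g,c,t -> 0,1,2,3 and has shape (..., L).
    """
    S = np.asarray(S)
    pair_index = 4 * S[..., :-1] + S[..., 1:]  # one index in 0..15 per adjacent pair
    return nndG[pair_index].sum(axis=-1)

# Two sequences of length 4, evaluated in a single call.
seqs = np.array([[0, 1, 2, 3],
                 [3, 3, 3, 3]])
energies = calc_dG(seqs)
```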
I need to find, for a length L, every sequence that both fits some template and results in a binding energy value in a certain range.
This is very easy to do with iterators: just iterate through every possible array input, calculate the binding energy, and record the arrays that are in the range. This, however, is far too slow when implemented in Python (for length 15 with 4 possible values for each element, there are 4**15, or about 1.07 billion, possible arrays). I could use Weave or some other method of implementing it in C, but I’d prefer to find an array-based solution that is simple and fast.
For example, if every element has the same possible values (e.g., [0,1,2,3]), then an array of every possible length-L 1D array with those values can be generated with lambda L: indices(repeat([4],L)).reshape((L,-1)).transpose(). I can then call calc_dG(result) and index result with a boolean mask of the energies that fall in the desired range to get the arrays I want as a final result. This is much faster than using Python iterators, and likely almost as fast as, if not faster than, using C iterators. Unfortunately, it doesn’t work for arbitrary templates, and for longer sequences it will run out of memory, as it has to store every possible array in memory before calculating values.
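Spelled out as a minimal sketch, the all-at-once approach looks roughly like this. The names all_sequences, lo and hi are made up for illustration, the nndG values are placeholders, and calc_dG is assumed to sum the pair terms along each row.

```python
import numpy as np

nndG = -np.arange(16, dtype=float) / 10.0  # placeholder energies (hypothetical)

def calc_dG(S):
    # Assumed form of calc_dG: sum of pair energies along the last axis.
    S = np.asarray(S)
    return nndG[4 * S[..., :-1] + S[..., 1:]].sum(axis=-1)

def all_sequences(L, n_values=4):
    # np.indices on an L-dimensional grid has shape (L, n, n, ..., n);
    # flattening the grid axes and transposing gives one sequence per row.
    return np.indices((n_values,) * L).reshape(L, -1).T

L = 3
lo, hi = -2.05, -0.95           # hypothetical energy window
seqs = all_sequences(L)         # shape (4**L, L); all rows are held in memory
dG = calc_dG(seqs)
mask = (dG >= lo) & (dG <= hi)  # boolean mask of sequences inside the window
hits = seqs[mask]
```

The memory problem is visible in the shape: the full array has 4**L rows and L columns, which for L = 15 and 8-byte integers comes to roughly 129 GB.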
Is there some way to do all of this without resorting to C?
If I understand your problem correctly, you are maximizing a function f(i_1, i_2, ..., i_n) over integers in the set {0, 1, 2, 3}.
You can use a combination of iteration and vectorized indexing.
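One possible reading of that idea, as a sketch rather than a definitive implementation: loop in Python over the first few positions and let numpy handle all combinations of the remaining positions in one block, keeping only the rows whose energy falls in the window. Everything named here is an assumption for illustration: the search function, the n_iter parameter, the representation of a template as a list of allowed values per position, the placeholder nndG values, and the summing form of calc_dG.

```python
import itertools
import numpy as np

nndG = -np.arange(16, dtype=float) / 10.0  # placeholder energies (hypothetical)

def calc_dG(S):
    # Assumed form of calc_dG: sum of pair energies along the last axis.
    S = np.asarray(S)
    return nndG[4 * S[..., :-1] + S[..., 1:]].sum(axis=-1)

def search(template, lo, hi, n_iter=2):
    """Return every sequence matching template with lo <= dG <= hi.

    template: one list of allowed values per position (hypothetical format).
    n_iter:   number of leading positions handled by the Python loop;
              must be smaller than len(template).
    """
    head, tail = template[:n_iter], template[n_iter:]
    # Every combination of the trailing positions, built once.
    grids = np.meshgrid(*tail, indexing="ij")
    tail_block = np.stack([g.ravel() for g in grids], axis=1)
    block = np.empty((tail_block.shape[0], len(template)), dtype=np.int64)
    block[:, n_iter:] = tail_block
    hits = []
    for prefix in itertools.product(*head):
        block[:, :n_iter] = prefix      # overwrite only the leading columns
        dG = calc_dG(block)             # vectorized over the whole block
        mask = (dG >= lo) & (dG <= hi)
        hits.append(block[mask])        # boolean indexing returns a copy
    if not hits:
        return np.empty((0, len(template)), dtype=np.int64)
    return np.concatenate(hits)

# Example: first base fixed to a, third base g or c, last base fixed to t.
template = [[0], [0, 1, 2, 3], [1, 2], [0, 1, 2, 3], [3]]
found = search(template, lo=-3.05, hi=-1.95, n_iter=2)
```

Only one block of the trailing combinations is held in memory at a time, so n_iter trades Python loop iterations against block size: the loop runs once per combination of the leading positions, while each block has one row per combination of the trailing ones.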