I’m looking for an algorithm to segment a sequence of positive numbers into n subsequences, such that the standard deviation of the sum of the numbers in each subset is minimized.
The ordering of the numbers in each subsequence needs to be the same as the ordering in the original sequence
For example:
Suppose I have a sequence {1,1,1,1,1,1,10,1} that i wanted to segment into 2 subsequences.
I believe the optimal solution would be {1,1,1,1,1,1}, {10,1} .
The sum of the 1st subsequence is 6, the sum of the 2nd subsequence is 11
The standard deviation of the two numbers is ~3.5, which i believe is the lowest possible.
Suppose I have a sequence {4,1,1,1,1,6} that i wanted to segment into 3 subsequences.
I believe the optimal solution would be {4}, {1,1,1,1}, {6}
The sum of the subsequences is 4, 4, and 6.
The standard deviation of the 3 numbers is ~1.15, which i believe is the lowest possible.
The best algorithm i was able to come up with was to find the cumulative sum of each of the numbers in the sequence, and segment the sequence at each interval of [totalSum/numSubsequences].
For example, given the sequence {4,1,1,1,1,6} , the cumulative sums of the numbers of each sequence is {4,5,6,7,8,14}. The total of all numbers in the sequence is 14, so, given that i want 3 subsequences, i should segment the sequence when the total reaches 14/3 = 4.66 and 2 * 14/3 = 9.333333.
However, there is no actual place in the sequence where the cumulative total is equal to 4.66 – the first cumulative total is 4, and next cumulative total is 5. So should i round up or should i round down? In this case, rounding down to 4 gives the optimal solution, but that isn’t always the case. The best I can think of is to try every combination of rounding up and down, but that results in O(2^numSubsequences) complexity.
This seems to be the type of thing that would have a preexisting algorithm to apply, however my Googling has failed me. I am aware of the Partition Problem, which is NP-complete, but that deals with unordered sets, and not ordered sequences.
Any help would be appreciated.
Suppose the length of the original sequence is
Land the number of subsequences isN.You may simplify the expression for standard deviation to get
sqrt(E[X^2] - E[X]^2), whereEdenotes expectation/average andXdenotes your random variable — in your case, the sum of the subsequences. (A similar formula applies for the “sample standard deviation”.) Note thatE[X]does not depend on how you split your sequence, because it will always be the total sum divided byN. Thus, we just want to minimizeE[X^2]or equivalently, the sum ofX^2(they differ by a factor ofNby the definition of average).At this point, we can see that this problem can be solved with dynamic programming. Let
f(i,j), forifrom0toMandjfrom1toN, be the minimal sum of squares of sums of subsequences from the split of the firstielements of your sequence intojsubsequences. Then we see thatf(i,j)may be computed in terms of all thef(i',j')withi' <= iandj < j'. More specifically, if your sequence isa[k]indexed from0toM-1:Having minimized
f(N,L), you can use standard dynamic programming techniques to recover the splits. In particular, you can store thelthat minimizesf(i,j).The runtime of this solution is
O(L^2 N)because you computeO(L N)different values offand theminimumis overO(L)different values ofl.Here’s a straightforward implementation in Perl: