Here is my problem scenario:
I have a few thousand objects. Each object has 256 Boolean dimensions (true or false). I want to find clusters such that
- Each cluster has a minimal number of true dimensions (a dimension of a cluster is true iff any object in that cluster has this dimension marked as true).
- The overall sum of all true dimensions over all clusters is minimal.
- Each cluster is not bigger than a certain predefined value.
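To make the three requirements concrete, here is a minimal sketch of how the cost of a candidate clustering could be evaluated. It assumes each object is stored as an integer bitmask (one bit per dimension) and that "not bigger than a certain predefined value" refers to the number of objects per cluster. The names `cluster_mask`, `total_true_dimensions` and `max_size` are made up for illustration.

```python
from typing import Sequence


def cluster_mask(objects: Sequence[int], members: Sequence[int]) -> int:
    """OR the bitmasks of all members: a bit is set iff any member has it set."""
    mask = 0
    for idx in members:
        mask |= objects[idx]
    return mask


def total_true_dimensions(
    objects: Sequence[int],
    clusters: Sequence[Sequence[int]],
    max_size: int,
) -> int:
    """Sum of true dimensions over all clusters (the quantity to minimize)."""
    total = 0
    for members in clusters:
        # Assumption: the size limit counts objects per cluster.
        if len(members) > max_size:
            raise ValueError("cluster exceeds the size limit")
        total += bin(cluster_mask(objects, members)).count("1")
    return total


# Toy data with 4 dimensions instead of 256.
objects = [0b0011, 0b0001, 0b1100, 0b1000]
print(total_true_dimensions(objects, [[0, 1], [2, 3]], max_size=2))  # 4
print(total_true_dimensions(objects, [[0, 2], [1, 3]], max_size=2))  # 6
```

Grouping objects that share dimensions gives the lower total, which is the behaviour any heuristic for this problem would try to exploit.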
The solution does not need to be optimal, but the algorithm should be fast.
How should I best approach this problem? Is there an algorithm that you would recommend?
Note: I already implemented a brute force approach to this problem, but it is quite slow.
You can write this as a mixed-integer linear program (MILP):
You have a fixed number of clusters and objects.
A binary parameter is equal to 1 if dimension i is true in object k, and 0 otherwise.
Each cluster can have at most 256 true dimensions.
Parameter
You have the following variables:
You have the following constraints:
The second constraint is a tricky one because it doesn’t feel linear, but actually you can write it linearly.
The constraints can be written as:
The objective function can be the sum of all the binary cluster-dimension variables, so you minimize the overall sum of all true dimensions over all clusters.
Let me explain the second constraint: on the right-hand side, you compute the number of objects inside cluster i, minus the number of objects in that cluster having dimension j set to one. This is equal to zero if all objects in the cluster have dimension j, or something positive if not.
If this evaluates to zero, then the binary variable for dimension j in cluster i must be equal to one to avoid violating the constraint. If not, that variable can be anything (zero or one). This works because the variable appears in the objective function, which means that when the program has the choice between zero and one, it will choose zero.
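The formulas themselves are missing from the text above, so the following is only one possible reading of the explanation, not necessarily the exact formulation intended. All symbols are assumptions introduced here: $a_{kj} \in \{0,1\}$ says whether object $k$ has dimension $j$, $x_{ki} \in \{0,1\}$ says whether object $k$ is assigned to cluster $i$, $y_{ij} \in \{0,1\}$ flags dimension $j$ in cluster $i$, and $S$ is the maximum number of objects per cluster.

```latex
\begin{aligned}
\min \quad & \sum_{i} \sum_{j} y_{ij} \\
\text{s.t.} \quad
  & \sum_{i} x_{ki} = 1 && \forall k \\
  & \sum_{k} x_{ki} \le S && \forall i \\
  & y_{ij} \ge 1 - \Bigl( \sum_{k} x_{ki} - \sum_{k} a_{kj}\, x_{ki} \Bigr) && \forall i, j \\
  & x_{ki},\, y_{ij} \in \{0, 1\}
\end{aligned}
```

The bracketed term is the count described above: objects in cluster $i$ minus objects in cluster $i$ that have dimension $j$. When it is zero, the third constraint forces $y_{ij} = 1$. Note that this flags a dimension only when every object in the cluster has it. If the "any object" definition from the question is what is wanted, a common linearization is $y_{ij} \ge a_{kj}\, x_{ki}$ for every object $k$, cluster $i$ and dimension $j$.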
Once you have written this up, you can solve it with a commercial solver (if you have access to one; they give free licenses to students, in case you are one) or an open-source alternative such as COIN-OR, to name just one.
Just as a reminder: solving MILPs is NP-hard in general, so solve times can grow quickly with the number of objects and clusters.