What are some good algorithms for extracting hierarchical structure from sequences?
My primary concern is compressing the sequence, and the sequence has some hierarchical structure to it. I’m not too worried about runtime of the algorithm, though the length of the sequence is up to 256k symbols, and it shouldn’t run longer than a few seconds.
So far I’m aware of the sequitur algorithm, and I’d like to know of any other algorithms/ideas that could be similarly useful.
EDIT: The decompression needs to be very simple.
EDIT2: I am compressing code. I have elaborated a rather large function into a huge basic block of code that runs faster than the original recursive function for some sizes, but then the code grows to be unwieldy and large as I vary a parameter. I have been experimenting with sequitur to compress the fully elaborated function, and it works well — it allows me to achieve some middle ground between the recursive function and the fully elaborated basic block. I’m now wondering if there are other algorithms I should try as well.
LZ77, LZ78, and the Burrows-Wheeler Transform are good places to start. The first two work well with streamed data and can have very fast implementations. The pure dictionary style of LZ78 is well suited to extracting hierarchical structure: every dictionary phrase is an earlier phrase extended by one symbol, so the dictionary forms a tree.
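To make the LZ78 point concrete, here is a minimal Python sketch of the textbook scheme, not a tuned implementation. The function names `lz78_encode` and `lz78_decode` and the use of `None` to flush a trailing phrase are my own choices for illustration. Each output pair is (index of an earlier phrase, next symbol), and that back-reference is where the hierarchy comes from.

```python
def lz78_encode(seq):
    """Encode a sequence as (phrase_index, symbol) pairs; index 0 is the empty phrase."""
    dictionary = {}   # phrase (as a tuple) -> index, starting at 1
    pairs = []
    phrase = ()
    for sym in seq:
        candidate = phrase + (sym,)
        if candidate in dictionary:
            # Keep extending while the phrase is already known.
            phrase = candidate
        else:
            # Emit (known prefix, new symbol) and register the longer phrase.
            pairs.append((dictionary.get(phrase, 0), sym))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ()
    if phrase:
        # Input ended inside a known phrase: flush it with no extra symbol.
        pairs.append((dictionary[phrase], None))
    return pairs


def lz78_decode(pairs):
    """Rebuild the sequence; the decoder only needs a growing list of phrases."""
    phrases = [()]    # index 0 is the empty phrase
    out = []
    for index, sym in pairs:
        phrase = phrases[index] + (() if sym is None else (sym,))
        phrases.append(phrase)
        out.extend(phrase)
    return out


pairs = lz78_encode("abababab")
decoded = "".join(lz78_decode(pairs))
```

The decoder is a single loop over the pairs plus a list of phrases, which fits the requirement that decompression be very simple.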
If you are less concerned about fast compression and just want the structure, the Sequitur algorithm will be hard to beat; AFAICT, it is the best in class.
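On the decompression side, a grammar of the kind Sequitur produces is also very simple to expand. The sketch below covers only that half: the rule set is hand-written for illustration, not the output of an actual Sequitur run, and the names `expand` and `grammar` and the start symbol `"S"` are assumptions of mine.

```python
def expand(grammar, start="S"):
    """Expand a context-free grammar with one rule per nonterminal into its sequence."""
    out = []
    stack = [start]
    while stack:
        sym = stack.pop()
        if sym in grammar:
            # Nonterminal: push its right-hand side so the leftmost symbol pops first.
            stack.extend(reversed(grammar[sym]))
        else:
            # Terminal: emit it as-is.
            out.append(sym)
    return out


# Hand-written example grammar (hypothetical): rules nest, giving the hierarchy.
grammar = {
    "S": ["A", "A"],
    "A": ["B", "c", "B", "d"],
    "B": ["a", "b"],
}
expanded = "".join(expand(grammar))
```

An explicit stack is used instead of recursion so that deeply nested rules cannot hit Python's recursion limit.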