I was thinking about compressing large blocks of text using most frequent english words, but now I doubt it would be efficient, since lzw seems to be achieving just this in a better way.
Still, I can’t shake the feeling compressing character one by one is a little “brutal”, since one could just analyze the structure of sentences to better organize it into smaller chunks of data, and the structure is not exactly the same when decompressed, it could use classic compression methods.
Does “basic” NLP allows that ?
NLP?
Standard compression techniques can be applied to words instead of characters. These techniques would assign probabilities to what the next word is, based on the preceding words. I have not seen this in practice though, since there are so many more words than characters, resulting in prohibitive memory usage and excessive execution time for even low-order models.