I have a large database containing words and their inflected forms, e.g.:
BASIC_FORM ##### INFLECTED_FORM
talk ----- talk
talk ----- talking
talk ----- talked
talk ----- talks
paragraph ----- paragraph
paragraph ----- paragraphs
...
This database naturally takes up a lot of disk space once it reaches 1 million entries or more.
What is the best method to “compress” that set of data, i.e. reduce the required amount of disk space while no information is lost?
My first idea was to create an extra column which holds the number of characters that can be copied from the beginning of the basic form. Then you just have to save the part of the inflected form that differs, e.g.:
BASIC_FORM ##### NUM_EQUAL ##### INFLECTED_FORM
talk ----- 4 -----
talk ----- 4 ----- ing
talk ----- 4 ----- ed
talk ----- 4 ----- s
try ----- 3 -----
try ----- 2 ----- ied
paragraph ----- 9 -----
paragraph ----- 9 ----- s
...
This should save some disk space: “NUM_EQUAL” can be stored as a TINYINT in MySQL (for example), so it requires only 1 byte, while the “INFLECTED_FORM” string usually becomes shorter by more than 1 character (i.e. by more than 1 byte).
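To make the idea concrete, here is a minimal Python sketch of the scheme described above. The function names encode and decode are made up for illustration, and the cap of 255 assumes NUM_EQUAL is stored as an unsigned TINYINT; adjust it if you use a different column type.

```python
from typing import Tuple

# Assumption: NUM_EQUAL lives in an unsigned TINYINT, so it cannot exceed 255.
MAX_NUM_EQUAL = 255


def encode(basic: str, inflected: str) -> Tuple[int, str]:
    """Return (num_equal, suffix) so that basic[:num_equal] + suffix == inflected."""
    limit = min(len(basic), len(inflected), MAX_NUM_EQUAL)
    n = 0
    # Count how many leading characters the two words share.
    while n < limit and basic[n] == inflected[n]:
        n += 1
    return n, inflected[n:]


def decode(basic: str, num_equal: int, suffix: str) -> str:
    """Rebuild the inflected form from the basic form and the stored pair."""
    return basic[:num_equal] + suffix


if __name__ == "__main__":
    for basic, inflected in [("talk", "talking"), ("try", "tried"), ("talk", "talk")]:
        num_equal, suffix = encode(basic, inflected)
        print(basic, num_equal, repr(suffix), "->", decode(basic, num_equal, suffix))
```

Since decode(basic, *encode(basic, inflected)) always gives back the inflected form, no information is lost.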
Do you have other suggestions to save disk space?
You should normalize the model. That means creating a separate table for the basic_form and referencing it by ID from the table of inflected forms. I’m not sure how much space you will save that way, because it depends on the data (the longer the words and the more inflections each word has, the more space you’ll save). However, let’s say each word has only one inflected form (I know that’s not the case, but let’s take it to that extreme): then having two tables would actually increase the storage needed.
Now, after applying the previous refactoring (which will also save you some headaches, as normalization always does!), you can also apply YOUR system to reduce the space needed to store the inflections.
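Here is a rough sketch of how the two steps could fit together. It uses Python with an in-memory SQLite database only so that it runs anywhere; the table and column names (basic_form, inflected_form, basic_id, num_equal, suffix) and the helper functions are my own assumptions, and in MySQL you would pick the column types yourself (e.g. TINYINT UNSIGNED for num_equal).

```python
import sqlite3


def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def build_db(pairs):
    """Store (basic, inflected) pairs in two normalized tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE basic_form (
            id   INTEGER PRIMARY KEY,
            word TEXT NOT NULL UNIQUE
        );
        CREATE TABLE inflected_form (
            basic_id  INTEGER NOT NULL REFERENCES basic_form(id),
            num_equal INTEGER NOT NULL,
            suffix    TEXT NOT NULL
        );
    """)
    for basic, inflected in pairs:
        # Each basic form is stored once, no matter how many inflections it has.
        conn.execute("INSERT OR IGNORE INTO basic_form (word) VALUES (?)", (basic,))
        (basic_id,) = conn.execute(
            "SELECT id FROM basic_form WHERE word = ?", (basic,)
        ).fetchone()
        n = common_prefix_len(basic, inflected)
        conn.execute(
            "INSERT INTO inflected_form VALUES (?, ?, ?)",
            (basic_id, n, inflected[n:]),
        )
    return conn


def all_pairs(conn):
    """Rebuild the original (basic, inflected) pairs in insertion order."""
    rows = conn.execute("""
        SELECT b.word, substr(b.word, 1, i.num_equal) || i.suffix
        FROM inflected_form AS i
        JOIN basic_form AS b ON b.id = i.basic_id
        ORDER BY i.rowid
    """)
    return list(rows)


if __name__ == "__main__":
    data = [("talk", "talk"), ("talk", "talking"), ("try", "tried")]
    print(all_pairs(build_db(data)))
```

The query in all_pairs rebuilds every inflected form from the stored prefix length and suffix, so the original two-column list can always be recovered.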