I have a task to compress stock market data. The data is in a file where the stock value for each day is given on its own line, so it's a really big file.
E.g.,
123.45
234.75
345.678
889.56
…..
Now the question is how to compress the data (i.e. reduce the redundancy) using standard algorithms like Huffman coding, arithmetic coding, or LZ coding. Which one is preferable for this sort of data?
I have noticed that if I keep the first value and then take the difference between each pair of consecutive values, there is a lot of repetition in the differences. This makes me wonder whether taking these differences, finding their frequencies (and hence probabilities), and then applying Huffman coding would be a way to do it.
Am I right? Can anyone give me some suggestions?
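The idea in the question can be sketched roughly as below. This is only an illustration under assumptions: prices are held as integer cents (so differences are exact), the sample values in `prices` are made up, and the helper names `to_deltas` and `huffman_code_lengths` are hypothetical. It computes Huffman code lengths only, not an actual bitstream.

```python
import heapq
from collections import Counter

def to_deltas(prices_in_cents):
    """Keep the first value, then store each value minus its predecessor."""
    deltas = [prices_in_cents[0]]
    for prev, cur in zip(prices_in_cents, prices_in_cents[1:]):
        deltas.append(cur - prev)
    return deltas

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Made-up prices in cents (123.45, 123.50, ...), for illustration only.
prices = [12345, 12350, 12355, 12350, 12355, 12355, 12360, 12355]
deltas = to_deltas(prices)
# The first value is stored as-is; only the differences are Huffman coded.
lengths = huffman_code_lengths(deltas[1:])
total_bits = sum(lengths[d] for d in deltas[1:])
```

The more often a difference repeats, the shorter its code, which is why the repetition noticed in the differences matters. The code table itself also has to be stored alongside the data.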
I think your problem is more complex than merely subtracting the stock prices. You also need to store the date (unless you have a consistent time span that can be inferred from the file name).
The amount of data is not very large, though. Even if you had data for every second of every day for the last 30 years for 300 stocks, you could still store all of it on a higher-end home computer (say, a Mac Pro), as that amounts to about 5 TB uncompressed.
I wrote a quick and dirty script that fetches the IBM stock price from Yahoo for every day and stores it both “normally” (only the adjusted close) and using the “difference method” you mention, then compresses both with gzip. You do obtain savings: 16K vs 10K. The problem is that I did not store the date, so I don’t know which value corresponds to which date; you would have to include this, of course.
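As a rough illustration of this kind of comparison (a sketch on synthetic data, not the actual Yahoo/IBM script), the following builds a seeded random-walk price series, encodes it once as plain prices and once as first value plus differences, and gzips both. The series generator, the step sizes, and all function names are assumptions made for the example.

```python
import gzip
import random
from itertools import accumulate

def make_series(n, seed=0):
    """Synthetic random-walk prices in integer cents (not real market data)."""
    rng = random.Random(seed)
    cents = 10000
    series = []
    for _ in range(n):
        cents = max(1, cents + rng.choice([-50, -25, 0, 0, 25, 50]))
        series.append(cents)
    return series

def encode_raw(series):
    """One price per line, e.g. b'100.25'."""
    lines = [f"{c // 100}.{c % 100:02d}" for c in series]
    return "\n".join(lines).encode("ascii")

def encode_delta(series):
    """First value in cents, then one difference (in cents) per line."""
    lines = [str(series[0])]
    lines += [str(cur - prev) for prev, cur in zip(series, series[1:])]
    return "\n".join(lines).encode("ascii")

def decode_delta(data):
    """Rebuild the original series by summing the differences cumulatively."""
    numbers = [int(x) for x in data.decode("ascii").split("\n")]
    return list(accumulate(numbers))

series = make_series(2000)
raw = encode_raw(series)
delta = encode_delta(series)
raw_gz = gzip.compress(raw)
delta_gz = gzip.compress(delta)
print("raw:", len(raw), "bytes, gzipped:", len(raw_gz))
print("delta:", len(delta), "bytes, gzipped:", len(delta_gz))
```

The sizes depend entirely on the data, so the numbers printed here will not match the 16K vs 10K above. `decode_delta` shows that the difference format is lossless as long as the first value is kept.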
Good luck.
Now compare the “raw data” (raw.dat) versus the “compressed format” you propose (comp.dat)