I’m using Python to read a large CSV file (200 MB) that was generated in R.
I do some tinkering with the data (normalization, scaling, removing junk columns, etc.) and then save it again using numpy’s savetxt with ‘,’ as the delimiter to keep the CSV format.
The thing is, the new file is almost twice as large as the original (almost 400 MB). Both the original data and the new data are just arrays of floats.
If it helps, the new file seems to contain really small values that are written in exponential notation, which the original did not have.
Any idea why this is happening?
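Have you looked at how the floats are represented as text before and after? A line like “1.,2.,3.” might become “1.000000e+00,2.000000e+00,3.000000e+00” or something similar; both are valid and both represent the same numbers. Note that numpy’s savetxt defaults to fmt='%.18e', which writes every value in exponential notation with 18 digits after the decimal point, regardless of how many digits the value actually needs.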
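To see the effect of the text format on its own, here is a minimal sketch that writes the same small array twice with savetxt, once with the default format and once with a shorter one. The helper name to_csv_text and the choice of '%.6g' are only illustrative assumptions; pick whatever precision your data really needs.

```python
import io

import numpy as np

data = np.array([[1.0, 2.0, 3.0]])


def to_csv_text(arr, **kwargs):
    """Hypothetical helper: return what np.savetxt would write, as a string."""
    buf = io.StringIO()
    np.savetxt(buf, arr, delimiter=",", **kwargs)
    return buf.getvalue()


# Default fmt is '%.18e': 24 characters per value, whatever the value is.
default_text = to_csv_text(data)

# '%.6g' keeps at most 6 significant digits and drops trailing zeros.
compact_text = to_csv_text(data, fmt="%.6g")

print(repr(default_text), len(default_text))
print(repr(compact_text), len(compact_text))
```

The numbers are identical in both cases; only the number of bytes used to spell them out changes. Keep in mind that a shorter format such as '%.6g' rounds the written values, so it only makes sense if you do not need the extra digits.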
More likely, however, the original file contained floats with relatively few significant digits (for example “1.1, 2.2, 3.3”). When you normalize and scale them, you “create” more digits, which are needed to represent the results of your math but do not correspond to any real increase in precision. For example, normalizing the values in the last example so that they sum to 1.0 gives “0.1666666, 0.3333333, 0.5”.
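The same thing can be reproduced in a few lines of plain Python. This is just a sketch using the toy values from the example above, with repr standing in for "write the float with enough digits to read it back exactly":

```python
values = [1.1, 2.2, 3.3]

# Normalize so that the values sum to (approximately) 1.0.
total = sum(values)
normalized = [v / total for v in values]

# repr() emits the shortest text that round-trips to the same float.
original_text = ",".join(repr(v) for v in values)
normalized_text = ",".join(repr(v) for v in normalized)

print(original_text, len(original_text))
print(normalized_text, len(normalized_text))
```

The normalized line comes out several times longer than the original one, even though it holds the same number of values.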
I guess the short answer is that there is no guarantee (and no requirement) that floats represented as text occupy any particular amount of storage space, or that they use less than the maximum possible per float. The size can vary a lot even if the data stays the same, and it will certainly vary if the data changes.