I have performed some simple z-transforms on some variables in a pandas DataFrame. Of the 216 columns in the DataFrame, I transformed 197 and then concatenated those 197 onto the original 216, for 412 columns in total.
Then I used the to_csv function to write the new DataFrame to a CSV file. The original data is about 300MB, while the new dataset is 1.2GB. It seems odd that adding fewer columns than were already there leads to a roughly 4x increase in the size of the final file.
The code is:
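import pandas as pd

full_data = pd.read_csv('data.csv')

# Select the columns to standardize (skip the first 16 and the last 2).
names = full_data.columns.tolist()
names = names[16:-2]
len(names)  # 197 as expected

# z-transform: subtract each column's mean, then divide by its standard deviation.
transform = (full_data[names] - full_data[names].mean()) / full_data[names].std()  # 197 columns as expected

# Rename the transformed columns so they don't clash with the originals.
column_names = transform.columns.tolist()
new_names = {}
for name in column_names:
    new_names[name] = name + '_standardized'
transform = transform.rename(columns=new_names)

# Append the standardized columns to the original data and write everything out.
to_concat = [full_data, transform]
final_data = pd.concat(to_concat, axis=1)
final_data.to_csv('transformed_data.csv', index=False)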
Everything looks fine in the first row of the data, and the number of rows is the same across all three DataFrames.
Am I missing something? Is there a more efficient way to write DataFrames to CSV files?
The CSV stores string representations of data, so it’s not necessarily going to scale in an obvious way with the number of columns unless all columns have roughly the same size in string representation. It’s quite plausible that your CSV could increase a lot in size if your original data had only a few decimal places. If you read in numbers like 0.1, 0.2, 3, 1.7, whatever, and then z-scale them, you’re likely to get results with many decimal places. As a simple example, I did this:
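A minimal sketch of that kind of experiment, assuming a made-up DataFrame of one-decimal values (the shape, seed, and column names are illustrative, not the data from the question). It compares the length of the CSV text in memory instead of writing files:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 rows x 5 columns of values with one decimal place,
# e.g. 0.3, 4.7, 9.1 (each takes about 3 characters in a CSV).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(1000, 5)) / 10.0,
                  columns=list("abcde"))

# Same shape, same number of values; only the values themselves change.
short_csv = df.to_csv(index=False)
long_csv = np.sqrt(df).to_csv(index=False)

# Square roots such as 1.4142135623730951 need far more characters per cell.
print(len(short_csv), len(long_csv))
```

With no file name given, `to_csv` returns the CSV as a string, so its length is a reasonable stand-in for the file size.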
I didn’t add any rows or columns to the data at all, just took the square root, but the second CSV is 4 times the size of the first, because the second one has lots of decimal places that take more bytes to write out in string form. You’re likely to get similarly long decimals when you divide by the standard deviation.
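If you don't need all of those digits, one option is to cap the precision when writing, using the float_format parameter of to_csv. A rough sketch, assuming four decimal places are enough for the standardized columns (that precision, and the made-up data, are my assumptions and not something from your question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: one-decimal values, then z-scaled.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.integers(0, 100, size=(1000, 5)) / 10.0,
                   columns=list("abcde"))
z = (raw - raw.mean()) / raw.std()

# Default output writes every float at full precision.
full_precision = z.to_csv(index=False)

# float_format applies a printf-style format to every float column.
rounded = z.to_csv(index=False, float_format="%.4f")

print(len(full_precision), len(rounded))
```

Note that float_format applies to every float column in the frame, including the original ones, so check that the chosen precision doesn't throw away digits you care about.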