Suppose I have a CSV file with 400 columns. I cannot load the entire file into a DataFrame (it won’t fit in memory). However, I only really want 50 of the columns, and those will fit in memory. I don’t see any built-in pandas way to do this. What do you suggest? I’m open to using the PyTables interface, or pandas.io.sql.
The best-case scenario would be a function like: pandas.read_csv(...., columns=['name', 'age',...,'income']). I.e. we pass a list of column names (or numbers) that will be loaded.
There’s no default way to do this right now. I would suggest reading the file in chunks, iterating over them, and discarding the columns you don’t want from each chunk before combining the results.
So something like:
pd.concat([chunk[cols_to_keep] for chunk in pd.read_csv(..., chunksize=200)])
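For anyone who wants to try this out, here is a small self-contained sketch of the chunked approach. The in-memory CSV text, the column names, and the tiny chunksize are made up for illustration; with a real file you would pass its path and a much larger chunksize. Newer pandas releases also accept a usecols argument to read_csv, which does what the question asks for directly, so the sketch shows that as well for comparison.

```python
import io

import pandas as pd

# Hypothetical stand-in for the large file; a real call would pass a file path.
csv_text = (
    "name,age,city,income\n"
    "Ann,34,Oslo,52000\n"
    "Bob,45,Rome,61000\n"
    "Cid,29,Lima,48000\n"
)
cols_to_keep = ["name", "age", "income"]  # assumed column names

# Chunked approach: read a few rows at a time and keep only the wanted
# columns from each chunk, so the full-width table is never held at once.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
chunked = pd.concat([chunk[cols_to_keep] for chunk in chunks])

# usecols approach (available in newer pandas): the parser skips the
# unwanted columns while reading.
direct = pd.read_csv(io.StringIO(csv_text), usecols=cols_to_keep)

print(chunked)
print(direct)
```

Both calls should give the same three columns. If usecols is available in your pandas version it is probably the simpler option, and it can be combined with chunksize when even the 50 selected columns are too large to load in one go.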