I’m working with a CSV file in python, which will have ~100,000 rows when

Asked: May 19, 2026 (2026-05-19T22:11:26+00:00)

I’m working with a CSV file in python, which will have ~100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (float).

Since csv.DictReader and csv.reader return every value as a string, I’m currently iterating over all rows and converting the one numeric value to a float.

for i in csvDict:
    i[col] = float(i[col])

Is there a better way that anyone could suggest to do this? I’ve been playing around with various combinations of map, izip, itertools and have searched extensively for some samples of doing it more efficiently, but unfortunately haven’t had much success.

In case it helps:
I’m doing this on appengine. I believe that what I’m doing may be resulting in me hitting this error:
Exceeded soft process size limit with 267.789 MB after servicing 11 requests total – I only get it when the CSV is quite large.

Edit: My Goal
I’m parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded into a gviz DataTable for querying. Column types must be specified when this table is constructed. My problem could also be solved if anyone knew of a good gviz csv->datatable converter in python!

Edit2: My Code

I believe that my issue has to do with the way I attempt to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.

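import csv
import datetime
import StringIO  # Python 2 module, as used in __init__ below

import gviz_api

class GvizFromCsv(object):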
  """Convert CSV to Gviz ready objects."""

  def __init__(self, csvFile, dateTimeFormat=None):
    self.fileObj = StringIO.StringIO(csvFile)
    self.csvDict = list(csv.DictReader(self.fileObj))
    self.dateTimeFormat = dateTimeFormat
    self.headers = {}
    self.ParseHeaders()
    self.fixCsvTypes()

  def IsNumber(self, st):
    try:
      float(st)
      return True
    except ValueError:
      return False

  def IsDate(self, st):
    try:
      datetime.datetime.strptime(st, self.dateTimeFormat)
      return True  # without this the method returns None, so dates are never detected
    except ValueError:
      return False

  def ParseHeaders(self):
    """Attempts to figure out header types for gviz, based on first row"""
    for k, v in self.csvDict[0].items():
      if self.IsNumber(v):
        self.headers[k] = 'number'
      elif self.dateTimeFormat and self.IsDate(v):
        self.headers[k] = 'date'
      else:
        self.headers[k] = 'string'

  def fixCsvTypes(self):
    """Only fixes numbers."""
    update_to_numbers = []
    for k,v in self.headers.items():
      if v == 'number':
        update_to_numbers.append(k)
    for i in self.csvDict:
      for col in update_to_numbers:
        i[col] = float(i[col])

  def CreateDataTable(self):
    """creates a gviz data table"""
    data_table = gviz_api.DataTable(self.headers)
    data_table.LoadData(self.csvDict)
    return data_table

1 Answer

Editorial Team · Answer 1 · 2026-05-19T22:11:26+00:00

I first parsed the CSV file with a regex, but since the data is arranged very strictly in each row, we can simply use the split() function:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    def transf(surname,x,y):
        return (surname,float(x),float(y))

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file

Or, without defining a helper function:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )
    # to populate the data table by iterating in the CSV file

At first I thought I had to populate the data table one row at a time, because I was using a regex and needed to extract the match groups before converting the number strings to floats. With split(), everything can be done in a single LoadData() call.
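
One caveat with split(',') is that it breaks on quoted fields that contain commas. If that can happen in your data, a rough sketch of the same lazy idea built on csv.reader might look like the following. The name typed_rows and its numeric_columns and width parameters are my own invention for illustration, not part of gviz_api, and the sample lines are made up in the same layout as surnames.csv:

```python
import csv

def typed_rows(lines, numeric_columns=(1, 2), width=3):
    """Yield one converted row at a time from an iterable of CSV lines.

    Hypothetical helper: `numeric_columns` are the indexes converted to
    float, `width` is how many leading columns are kept.
    """
    reader = csv.reader(lines)   # accepts a file object or any iterable of lines
    next(reader, None)           # skip the header row
    for row in reader:
        if not row:              # csv.reader yields [] for blank lines
            continue
        yield [float(cell) if i in numeric_columns else cell
               for i, cell in enumerate(row[:width])]

# Made-up sample in the same layout as surnames.csv
sample = [
    "surname,percent,cumulative percent,rank\n",
    "SMITH,1.006,1.006,1,\n",
    '"O\'BRIEN, JR",0.810,1.816,2,\n',   # quoted field containing a comma
]
rows = list(typed_rows(sample))
```

Because typed_rows is a generator, it can be passed straight to data_table.LoadData(typed_rows(f)) in place of the generator expressions above, so no intermediate list of rows is built.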

Hence, your code can be shortened. I also don’t see why it needs to be a class; a function seems sufficient to me:

def GvizFromCsv(filename):
  """ creates a gviz data table from a CSV file """

  data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                   ('col2','number','ONE'    ),
                                   ('col3','number','TWO'    ) ])

  #  --- with such a table schema, every line in the file must look like this: ---
  #  blah, number, number, ...anything else...\n 
  #  SMITH,1.006,1.006, ...anything else...\n 
  #  JOHNSON,0.810,1.816, ...anything else...\n 
  #  WILLIAMS,0.699,2.515, ...anything else...\n

  with open(filename) as f:
    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                         for line in f )
  return data_table

Now you need to check whether the way your CSV data arrives (read from another API) can be plugged into this code, so that the data table is still populated by iterating over the rows.
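
As a minimal sketch of that, assuming the CSV arrives as one text string (as your csvFile argument seems to): wrap it in a file-like object and iterate over it line by line. I use Python 3's io.StringIO here; in Python 2 the equivalent is the StringIO.StringIO you already use. The name rows_from_csv_text and the sample text are hypothetical:

```python
import io

def rows_from_csv_text(csv_text):
    """Yield (surname, float, float) tuples from CSV held in a string.

    Hypothetical helper: assumes a header line followed by rows shaped
    like 'SMITH,1.006,1.006,...'.
    """
    lines = io.StringIO(csv_text)  # file-like wrapper, iterated line by line
    next(lines, None)              # skip the header line
    for line in lines:
        line = line.strip()
        if not line:               # ignore blank lines
            continue
        surname, x, y = line.split(',')[0:3]
        yield (surname, float(x), float(y))

# Made-up sample standing in for the text received from the other API
csv_text = (
    "surname,percent,cumulative percent,rank\n"
    "SMITH,1.006,1.006,1,\n"
    "JOHNSON,0.810,1.816,2,\n"
)
converted = list(rows_from_csv_text(csv_text))
```

The generator can be handed to data_table.LoadData(rows_from_csv_text(csv_text)) directly. The text itself is still held in memory, but the full list of row dicts that list(csv.DictReader(...)) builds is not created on top of it.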
