When I first started developing this project, there was no requirement for generating large

Question

0

Asked: May 17, 20262026-05-17T18:33:14+00:00 2026-05-17T18:33:14+00:00

When I first started developing this project, there was no requirement for generating large

0

When I first started developing this project, there was no requirement for generating large files, however it is now a deliverable.

Long story short, GAE just doesn’t play nice with any large scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a pdf with ReportLab with 1500 records seems to hit a DeadlineExceededError. This is just a simple pdf comprised of a table.

I am using the following code:

    self.response.headers['Content-Type'] = 'application/pdf'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
    doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))

    elements = []

    dataset = Voter.all().order('addr_str')

    data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]

    i = 0
    r = 1
    s = 100

    while ( i < 1500 ):
        voters = dataset.fetch(s, offset=i)
        for voter in voters:
            data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname ])
            r = r + 1
        i = i + s

    t=Table(data, '', r*[0.4*inch], repeatRows=1 )
    t.setStyle(TableStyle([('ALIGN',(0,0),(-1,-1),'CENTER'),
                           ('INNERGRID', (0,0), (-1,-1), 0.15, colors.black),
                           ('BOX', (0,0), (-1,-1), .15, colors.black),
                           ('FONTSIZE', (0,0), (-1,-1), 8)
                           ]))

    elements.append(t)

    doc.build(elements)

Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system and generate the file in bits, and then rejoin them that might work, but I think the system precludes this.

I need to do the same thing for a CSV file, however the limit is obviously a bit higher since it’s just raw output.

    self.response.headers['Content-Type'] = 'application/csv'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'

    dataset = Voter.all().order('addr_str')

    writer = csv.writer(self.response.out,dialect='excel')
    writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])

    i = 0
    s = 100
    while ( i < 2000 ):
        last_cursor = memcache.get('db_cursor')
        if last_cursor:
            dataset.with_cursor(last_cursor)
        voters = dataset.fetch(s)
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname])
        memcache.set('db_cursor', dataset.cursor())
        i = i + s
    memcache.delete('db_cursor')

Any suggestions would be very much appreciated.

Edit:
possible solutions

Above I had documented three possible solutions based on my research, plus suggestions etc

They aren’t necessarily mutually exclusive, and could be a slight variation or combination of any of the three, however the gist of the solutions are there. Let me know which one you think makes the most sense, and might perform the best.

Solution A: Using mapreduce (or tasks), serialize each record, and create a memcache entry for each individual record keyed with the keyname. Then process these items individually into the pdf/xls file. (use get_multi and set_multi)

Solution B: Using tasks, serialize groups of records, and load them into the db as a blob. Then trigger a task once all records are processed that will load each blob, deserialize them and then load the data into the final file.

Solution C: Using mapreduce, retrieve the keynames and store them as a list, or serialized blob. Then load the records by key, which would be faster than the current loading method. If I were to do this, which would be better, storing them as a list (and what would the limitations be…I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)

Thanks in advance for any advice.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-17T18:33:14+00:00

Here is one quick thought, assuming it is crapping out fetching from the datastore. You could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.

Start a task which does the initial query and fetches 300 (arbitrary number) records, then enqueues a named(!important) task that you pass the cursor to. That one in turn queries [your arbitrary number] records, and then passes the cursor to a new named task as well. Continue that until you have enough records.

Within each task process the entities, then store the serialized result in a text or blob property on a ‘processing’ model. I would make the model’s key_name the same as the task that created it. Keep in mind the serialized data will need to be under the API call size limit.

To serialize your table pretty fast you could use:

serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)

Have the last task (when you get enough records) kick of the PDf or CSV generation. If you use key_names for you models you, should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, you’ll know the model’s keys since you know the last task name. Again, you’ll want to be mindful size of your fetches from the datastore!

To deserialize:

list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))

Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help you’ll have to look into doing more of the processing in each task.

Don’t forget in the ‘build’ task you’ll want to raise an exception if any of the interim models are not yet present. Your final task will automatically retry.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

When I first started developing this project, there was no requirement for generating large

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply