I’m trying to simulate some code that I have working with SQL but using


Asked: May 21, 2026


I’m trying to simulate some code that I have working with SQL, but using all Python instead.
With some help from here:
CSV to Python Dictionary with all column names?

I can now read my zipped CSV file into a dict, but only one line, the last one. (How do I get a sample of lines, or the whole data file?)

I am hoping to have a memory-resident table that I can manipulate much like SQL when I’m done, e.g. clean data by matching bad data against another table of bad data and correct entries, then sum by type, average by time period, and the like. The total data file is about 500,000 rows. I’m not fussed about getting it all in memory, but I want to solve the general case as best I can, again so I know what can be done without resorting to SQL.

import csv, sys, zipfile
sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file    = zipfile.ZipFile(sys.argv[0])
items_file  = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass 
# Then my result is:
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
...
key=YEAR_BUILT_DESC, value=EXIST
key=SUBDIVISION, value=KNOLLWOOD
key=DOM, value=2
key=STREET_NAME, value=ORLEANS RD
key=BEDROOMS, value=3
key=SOLD_PRICE, value=
key=PROP_TYPE, value=SFR
key=BATHS_FULL, value=2
key=PENDING_DATE, value=
key=STREET_NUM, value=3828
key=SOLD_DATE, value=
key=LIST_PRICE, value=324900
key=AREA, value=200
key=STATUS_DATE, value=3/3/2011 11:54:56 PM
key=STATUS, value=A
key=BATHS_HALF, value=0
key=YEAR_BUILT, value=1968
key=ZIP, value=35243
key=COUNTY, value=JEFF
key=MLS_ACCT, value=492859
key=CITY, value=MOUNTAIN BROOK
key=OWNER_NAME, value=SPARKS
key=LIST_DATE, value=3/3/2011
key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM 
key=PARCEL_ID, value=28-15-3-009-001.0000
key=ACREAGE, value=0
key=WITHDRAWN_DATE, value=
>>>

I think I’m barking up a few wrong trees here…
One is that I only get one line of my roughly 500,000-line data file.
Two is that a dict may not be the right structure, since I don’t think I can just load all 500,000 lines and do various operations on them, like sum by group and date.
Plus it seems that duplicate keys may cause problems, i.e. the non-unique descriptors like county and subdivision.

I also don’t know how to read a specific small subset of lines into memory (like 10 or 100 to test with) before loading everything (which I also don’t get). I have read over the Python docs and several reference books, but it just is not clicking yet.

It seems that most of the answers I can find suggest using various SQL solutions for this sort of thing, but I am anxious to learn the basics of achieving similar results with Python, as in some cases I think it will be easier and faster, as well as expanding my tool set. But I’m having a hard time finding relevant examples.

One answer that hints at what I’m getting at is:

Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn’t normally the efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching requires data that is certainly not present in a CSV, like how dates are represented and which columns are dates.

An example of getting a column-oriented data structure (however, involving loading the whole file):

import csv
allrows=list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns=dict([(x[0],x[1:]) for x in zip(*allrows)])
The intermediate steps of going to list and storing in a variable aren't necessary. 
The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:

matchingrows=[rownum for (rownum,value) in enumerate(columns['one']) if value>2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate 
functions like datetime.datetime.strptime.

via Yann Vernier

Surely there is some good reference for this general topic?

1 Answer

Editorial Team · Answer 1 · 2026-05-21T11:42:00+00:00

You can only read one line at a time from the csv reader, but you can store them all in memory quite easily:

rows = []
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    rows.append(row)

# rows[0] (the csv module returns every value as a string)
{'keyA': '13', 'keyB': 'dataB' ... }
# rows[1]
{'keyA': '5', 'keyB': 'dataB' ... }
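If you only want 10 or 100 rows to experiment with, itertools.islice can stop the reader early. Here is a rough sketch, assuming Python 3 (your code is Python 2) and assuming the file is UTF-8 text; read_rows is just a name I made up for illustration:

```python
import csv
import io
import itertools
import zipfile


def read_rows(zip_path, member, limit=None):
    """Return rows of a tab-delimited file inside a zip as a list of dicts.

    limit=None loads every row; limit=100 stops after the first 100.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # ZipFile.open yields bytes, but the csv module wants text.
            # The utf-8 encoding is an assumption about the data file.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, dialect="excel", delimiter="\t")
            # islice pulls at most `limit` rows, so the rest is never parsed.
            return list(itertools.islice(reader, limit))
```

Something like read_rows(path, 'AllListing1RES.txt', limit=100) would then give you a 100-row test set, and dropping limit loads the whole file. Repeated values such as COUNTY or SUBDIVISION are not a problem here: each row is its own dict, so keys only need to be unique within a row.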

Then, to do aggregations and calculations (values read from a CSV are strings, so convert them first):

sum(int(row['keyA']) for row in rows)
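For the "sum by type, average by time period" kind of query, one pass over the rows with a dict of running totals is usually enough. This is only a sketch in Python 3; sum_by and average_by_month are hypothetical helper names, the sample rows are made up (including the CND type), and the date format is guessed from your LIST_DATE values:

```python
from collections import defaultdict
from datetime import datetime


def sum_by(rows, key_col, value_col):
    """Total of value_col per distinct key_col value (like GROUP BY + SUM)."""
    totals = defaultdict(float)
    for row in rows:
        if row[value_col]:  # skip blanks such as an empty SOLD_PRICE
            totals[row[key_col]] += float(row[value_col])
    return dict(totals)


def average_by_month(rows, date_col, value_col, date_format="%m/%d/%Y"):
    """Mean of value_col per (year, month) of date_col."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if not row[date_col] or not row[value_col]:
            continue  # ignore rows missing either field
        when = datetime.strptime(row[date_col], date_format)
        period = (when.year, when.month)
        sums[period] += float(row[value_col])
        counts[period] += 1
    return {period: sums[period] / counts[period] for period in sums}


# Tiny made-up sample in the shape DictReader produces (all strings).
rows = [
    {"PROP_TYPE": "SFR", "LIST_PRICE": "324900", "LIST_DATE": "3/3/2011"},
    {"PROP_TYPE": "SFR", "LIST_PRICE": "100000", "LIST_DATE": "3/9/2011"},
    {"PROP_TYPE": "CND", "LIST_PRICE": "150000", "LIST_DATE": "4/1/2011"},
]
print(sum_by(rows, "PROP_TYPE", "LIST_PRICE"))
print(average_by_month(rows, "LIST_DATE", "LIST_PRICE"))
```

Each helper touches every row once, so several statistics can share a single loop if speed becomes an issue.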

You may want to transform the data before it goes into rows, or use a friendlier data structure. Iterating over 500,000 rows for each calculation could become quite inefficient.
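As an illustration of transforming on the way in, you could convert types and apply your table of corrections once per row, before appending to rows. A minimal sketch in Python 3, where CITY_FIXES stands in for your real correction table (its entries are invented) and clean_row is a hypothetical helper:

```python
from datetime import datetime

# Hypothetical correction table: bad entry -> correct entry.
CITY_FIXES = {"MTN BROOK": "MOUNTAIN BROOK", "MOUNTIAN BROOK": "MOUNTAIN BROOK"}


def clean_row(row, fixes=CITY_FIXES):
    """Return a copy of one DictReader row with typed, corrected values."""
    cleaned = dict(row)
    # Normalise case and whitespace, then look the value up in the fixes table.
    city = row["CITY"].strip().upper()
    cleaned["CITY"] = fixes.get(city, city)
    # Empty strings become None rather than 0 so missing prices stay visible.
    cleaned["LIST_PRICE"] = int(row["LIST_PRICE"]) if row["LIST_PRICE"] else None
    cleaned["LIST_DATE"] = (
        datetime.strptime(row["LIST_DATE"], "%m/%d/%Y").date()
        if row["LIST_DATE"]
        else None
    )
    return cleaned
```

With that in place the loading loop becomes rows.append(clean_row(row)), and later calculations no longer need to parse strings each time.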

As a commenter mentioned, using an in-memory database could be really beneficial to you. Another question asks exactly how to transfer CSV data into a SQLite database:

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table t (col1 text, col2 float);")

# csv.DictReader uses the first line in the file as column headings by default
# (delimiter is an argument to DictReader, not to open)
dr = csv.DictReader(open('data.csv'), delimiter=',')
to_db = [(i['col1'], i['col2']) for i in dr]
c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
conn.commit()
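Once the rows are in the table, the grouping and averaging you already know from SQL runs entirely in memory. A small self-contained sketch (Python 3), using made-up values in place of the CSV contents and the same placeholder table t with col1 and col2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (col1 text, col2 float);")

# Stand-in for the rows read from the CSV file (made-up values).
to_db = [("a", "1.5"), ("b", "2"), ("a", "3.5")]
conn.executemany("insert into t (col1, col2) values (?, ?);", to_db)
conn.commit()

# Sum and average per group, the same way you would write it in SQL.
query = "select col1, sum(col2), avg(col2) from t group by col1 order by col1;"
result = conn.execute(query).fetchall()
print(result)
```

Because col2 is declared as float, SQLite converts the numeric strings from the CSV on insert, so no manual parsing is needed for that column.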
