The long (winded) version: I’m gathering research data using Python. My initial parsing is

Question

Asked: May 22, 2026

The long (winded) version:
I’m gathering research data using Python. My initial parsing is ugly (but functional) code which gives me some basic information and turns my raw data into a format suitable for heavy duty statistical analysis using SPSS. However, every time I modify the experiment, I have to dive into the analysis code.

For a typical experiment, I’ll have 30 files, one per user. The field count is fixed within an experiment but varies between experiments (10-20 fields). Files are typically 700-1000 records long with a header row. Records are tab-separated (see the sample, which has 4 integers, 3 strings, and 9 floats).

I need to sort my list into categories. In a 1000 line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I’m using the code below to count them. The integers at the beginning of each line dictate what category the float values in the row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.

Once they’re in categories, number crunching begins. I get statistical info (mean, sd, etc. for each category for each file).

The essentials:
I need to parse data like the sample below into categories. Categories are combos of the non-floats in each record. ~~I’m also trying to come up with a dynamic (graphical) way to associate column combinations with categories.~~ Will make a new post for this.

I’m looking for suggestions on how to do both.

    # data is a list of tab separated records
    # fields is a list of my field names

    # get a list of field types by running gettype on the first data row
    # (data[0] is the header row)
    # gettype is a function to get type from string without changing data
    fieldtype = [gettype(n) for n in data[1].split('\t')]

    # get the indexes for fields that aren't floats
    mask = [i for i, field in enumerate(fieldtype) if field != "float"]

    # for each row of data (skipping the first entry, the header, and the
    # last, empty one) we split on tabs and take the ith element of that
    # split, where i comes from mask, i.e. only the fields that are not floats
    records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]

    # we now get a unique set of combos
    # since set doesn't happily take a list of lists, we join each row of values
    # together in a comma-separated string, so we end up with a set of strings
    uniquerecs = set(",".join(row) for row in records)

    print(len(uniquerecs))
    quit()

    def gettype(s):
        try:
            int(s)
            return "int"
        except ValueError:
            pass
        try:
            float(s)
            return "float"
        except ValueError:
            return "string"
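One thing to watch when the field types are taken from a single row: a float column whose value happens to be written without a decimal point (say `5`) would be classified as "int" and end up in the category key. Whether that can occur depends on how the files are written, so treat it as an assumption rather than a known problem. A minimal sketch of inferring one type per column from all rows instead; `column_types` and the two toy rows are hypothetical, not part of the original code:

```python
def gettype(s):
    """Classify a string as "int", "float", or "string" without changing it."""
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"


def column_types(rows):
    """Infer one type per column, widening int -> float -> string as needed."""
    rank = {"int": 0, "float": 1, "string": 2}
    types = None
    for row in rows:
        seen = [gettype(v) for v in row]
        if types is None:
            types = seen
        else:
            # Keep the wider of the two types seen so far for each column.
            types = [max(a, b, key=rank.get) for a, b in zip(types, seen)]
    return types or []


# Hypothetical rows: the third column is a float that prints as "5" once.
rows = [
    ["10", "Right", "5", "0.03"],
    ["3", "Left", "8.01", "0.03"],
]
first_row_types = [gettype(v) for v in rows[0]]
types = column_types(rows)
print(first_row_types)
print(types)
```

The trade-off is one extra pass over the data; for files of about 1000 records that cost is negligible.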

Sample Data:

field0  field1  field2  field3  field4  field5  field6  field7  field8  field9  field10 field11 field12 field13 field14 field15
10  0   2   1   Right   Right   Right   5.76765674196   0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3   1   3   0   Left    Left    Right   8.00982745764   0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5   19  1   0   Right   Left    Left    4.69440026591   0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3   1   4   2   Left    Right   Left    9.58648184552   0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9   0   0   7   Left    Left    Left    7.65374257547   0.030318719717  0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397

1 Answer

Editorial Team · Answer 1 · 2026-05-22T16:05:44+00:00

Not sure if I understand your question, but here are a few thoughts:

For parsing the data files, the usual tool is Python's csv module.

For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:

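    from collections import defaultdict
    import csv

    # gettype() is the helper defined in the question
    data_of_category = defaultdict(list)
    # on Python 3 the csv module expects a text file opened with newline=''
    with open('data.file', newline='') as f:
        # drop blank lines (e.g. a trailing newline) while reading
        lines = [line for line in csv.reader(f, delimiter='\t') if line]
    # lines[0] is the header; the first data row tells us which columns are non-floats
    mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]
    for line in lines[1:]:
        category = ','.join(line[i] for i in mask)
        data_of_category[category].append(line)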

This way you don’t have to work out the categories in advance and can process the data in one pass.
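To make the grouping idea concrete, here is a self-contained sketch that also computes the per-category mean and sd mentioned in the question. It is only an illustration under some assumptions: it targets Python 3, the helper names `group_by_category` and `summarize` are made up for this example, the inline `SAMPLE` is a shortened stand-in for a real file, and a tuple is used as the key instead of a joined string so that values containing commas cannot collide.

```python
import csv
import io
import statistics
from collections import defaultdict


def gettype(s):
    """Classify a string as "int", "float", or "string" (as in the question)."""
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"


def group_by_category(lines):
    """Group the float values of each record under its non-float fields."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # skip the header row
    groups = defaultdict(list)
    mask = None
    for row in reader:
        if not row:
            continue  # ignore blank lines, e.g. a trailing newline
        if mask is None:
            # Decide which columns are non-floats from the first data row.
            mask = [i for i, v in enumerate(row) if gettype(v) != "float"]
        key = tuple(row[i] for i in mask)
        floats = [float(v) for i, v in enumerate(row) if i not in mask]
        groups[key].append(floats)
    return groups


def summarize(groups):
    """Return {category: [(mean, sd), ...]} with one pair per float column."""
    summary = {}
    for key, rows in groups.items():
        columns = list(zip(*rows))
        # statistics.stdev needs at least two values; 0.0 is a placeholder
        # chosen for this sketch when a category has a single record.
        summary[key] = [
            (statistics.mean(col), statistics.stdev(col) if len(col) > 1 else 0.0)
            for col in columns
        ]
    return summary


# Hypothetical, shortened stand-in for one data file (2 ints, 1 string, 2 floats).
SAMPLE = (
    "field0\tfield1\tfield2\tfield3\tfield4\n"
    "3\t1\tLeft\t8.0\t0.5\n"
    "3\t1\tLeft\t10.0\t1.5\n"
    "5\t19\tRight\t4.5\t0.25\n"
)

groups = group_by_category(io.StringIO(SAMPLE))
summary = summarize(groups)
for key, stats in sorted(summary.items()):
    print(key, stats)
```

For a real file, pass an open file object (opened with `newline=''`) instead of the `io.StringIO` wrapper. Lumping several combinations into one category could then be done by mapping keys to a shared label before calling `summarize`.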

And I didn’t understand the part about “a dynamic (graphical) way to associate column combinations with categories”.
