The Archive Base


Editorial Team
Asked: May 22, 2026

The long (winded) version:
I’m gathering research data using Python. My initial parsing is ugly (but functional) code that gives me some basic information and turns my raw data into a format suitable for heavy-duty statistical analysis in SPSS. However, every time I modify the experiment, I have to dive back into the analysis code.

For a typical experiment, I’ll have 30 files, one per user. The field count is fixed within an experiment but varies between experiments (10-20 fields). Files are typically 700-1000 records long with a header row. Records are tab separated (see the sample, which is 4 integers, 3 strings, and 10 floats).

I need to sort my list into categories. In a 1000 line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I’m using the code below to count them. The integers at the beginning of each line dictate what category the float values in the row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.
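
One way to express the lumping is a plain lookup table from combinations to category labels. This is only a minimal sketch, assuming each combination is held as a tuple of the non-float fields; the name `CATEGORY_OF`, the function `categorize`, and the combinations and labels themselves are made up for illustration:

```python
# Hypothetical lookup table: the combinations and labels are invented
# for illustration and are not taken from the real experiment.
CATEGORY_OF = {
    ("10", "0", "Right"): "A",
    ("3", "1", "Left"): "A",    # lumped into the same category as above
    ("5", "19", "Right"): "B",
}


def categorize(key, default="uncategorized"):
    """Return the label for a key tuple, or `default` if it is unmapped."""
    return CATEGORY_OF.get(key, default)


labels = [categorize(k) for k in [("3", "1", "Left"), ("9", "0", "Left")]]
```

Keeping the table as data rather than code means a changed experiment would only need a new table, not new analysis code.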

Once they’re in categories, number crunching begins. I get statistical info (mean, sd, etc. for each category for each file).
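
The number crunching step could look roughly like the following. It is a sketch under the assumption that the rows have already been grouped into a dict of category key to lists of floats; `summarize` and the toy values are hypothetical, and only the standard `statistics` module is used:

```python
from statistics import mean, stdev


def summarize(categories):
    """Return {category: [(mean, sd), ...]}, one pair per float column."""
    summary = {}
    for key, rows in categories.items():
        columns = list(zip(*rows))  # transpose rows into columns
        summary[key] = [
            # the sample standard deviation needs at least two values
            (mean(col), stdev(col) if len(col) > 1 else 0.0)
            for col in columns
        ]
    return summary


# Toy data (made up): two categories with two float columns each.
categories = {
    ("10", "0", "Right"): [[1.0, 2.0], [3.0, 4.0]],
    ("3", "1", "Left"): [[5.0, 6.0]],
}
stats = summarize(categories)
```

Reporting 0.0 for a single-row category is a choice made here for simplicity; skipping such categories or reporting a missing value may suit the analysis better.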

The essentials:
I need to parse data like the sample below into categories. Categories are combos of the non-floats in each record. I’m also trying to come up with a dynamic (graphical) way to associate column combinations with categories. I will make a new post for this.

I’m looking for suggestions on how to do both.

    # data is a list of tab separated records
    # fields is a list of my field names

    # get a list of fieldtypes via gettype on our first row
    # gettype is a function to get type from string without changing data
    fieldtype = [gettype(n) for n in data[1].split('\t')]

    # get the indexes for fields that aren't floats
    mask =  [i for i, field in enumerate(fieldtype) if field!="float"]

    # for each row of data[skipping first and last empty lists] we split(on tabs)
    # and take the ith element of that split where i is taken from the list mask
    # which tells us which fields are not floats
    records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]

    # we now get a unique set of combos
    # since set doesn't happily take a list of lists, we join each row of values
    # together in a comma separated string. So we end up with a list of strings.
    uniquerecs = set([",".join(row) for row in records])

    # number of distinct categories found in this file
    print(len(uniquerecs))
    quit()

    def gettype(s):
        try:
            int(s)
            return "int"
        except ValueError:
            pass
        try:
            float(s)
            return "float"
        except ValueError:
            return "string"
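
To see what the type sniffing yields, here is a small self-contained check that repeats `gettype` and applies it to the first sample record (the row string is copied from the sample data, with tabs written as `\t`):

```python
def gettype(s):
    """Classify a string as "int", "float" or "string" without converting it."""
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"


row = ("10\t0\t2\t1\tRight\tRight\tRight\t5.76765674196\t0.0310912272139\t"
       "0.0573603238282\t0.0582901376612\t0.0648936500524\t0.0655294305058\t"
       "0.0720571099855\t0.0748289246137\t0.446033755751")

fieldtype = [gettype(n) for n in row.split("\t")]
# indexes of the fields that are not floats, i.e. the category fields
mask = [i for i, field in enumerate(fieldtype) if field != "float"]
```

One caveat: because the types come from a single row, a float column whose value in that row happens to be written without a decimal point (for example `5`) would be classified as "int" and end up in the mask.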

Sample Data:

field0  field1  field2  field3  field4  field5  field6  field7  field8  field9  field10 field11 field12 field13 field14 field15
10  0   2   1   Right   Right   Right   5.76765674196   0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3   1   3   0   Left    Left    Right   8.00982745764   0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5   19  1   0   Right   Left    Left    4.69440026591   0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3   1   4   2   Left    Right   Left    9.58648184552   0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9   0   0   7   Left    Left    Left    7.65374257547   0.030318719717  0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 4:05 pm

    Not sure if I understand your question, but here are a few thoughts:

    For parsing the data files, the usual choice is Python's csv module.

    For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:

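    from collections import defaultdict
    import csv

    # gettype is the helper defined in the question
    # Python 3: open in text mode with newline='' as the csv module expects
    with open('data.file', newline='') as f:
        # drop blank lines, such as a trailing empty one
        lines = [line for line in csv.reader(f, delimiter='\t') if line]

    # indexes of the non-float fields, judged from the first data row
    mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]

    data_of_category = defaultdict(list)
    for line in lines[1:]:  # skip the header row
        category = ','.join(line[i] for i in mask)
        data_of_category[category].append(line)
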

    This way you don’t have to work out the categories up front and can process the data in one pass.
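
    A variation on the same idea, as a rough sketch: use tuples as keys instead of comma-joined strings, and convert the floats while grouping. It assumes the non-float fields always come first, as in the sample; `group_rows`, `n_key_fields`, and the shortened in-memory sample are hypothetical stand-ins for the real files.

```python
import csv
import io
from collections import defaultdict

# Shortened, made-up stand-in for one tab separated data file.
SAMPLE = (
    "field0\tfield1\tfield2\tvalue0\tvalue1\n"
    "10\t0\tRight\t5.5\t0.25\n"
    "3\t1\tLeft\t8.0\t0.5\n"
    "10\t0\tRight\t4.5\t0.75\n"
)


def group_rows(handle, n_key_fields):
    """Group the float columns of each row by its leading key fields."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    groups = defaultdict(list)
    for row in reader:
        if not row:  # ignore blank lines, e.g. a trailing one
            continue
        key = tuple(row[:n_key_fields])
        groups[key].append([float(x) for x in row[n_key_fields:]])
    return groups


groups = group_rows(io.StringIO(SAMPLE), n_key_fields=3)
```

    Tuples are hashable, so they work as dict keys directly, and they avoid any ambiguity if a string field ever contains a comma.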

    And I didn’t understand the part about “a dynamic (graphical) way to associate column combinations with categories”.

© 2021 The Archive Base. All Rights Reserved