I have data in a CSV file. One of the column lists a persons

Question

0

Asked: May 13, 20262026-05-13T08:09:36+00:00 2026-05-13T08:09:36+00:00

I have data in a CSV file. One of the column lists a persons

0

I have data in a CSV file. One of the column lists a persons name and all the rows that follow in that column provide some descriptive attributes about that person until the next persons name shows up. I can tell when the row has a name or an attribute by the LTYPE column, N in that column indicates that in that row the NAME value is actually a name, an A in that column indicates that the data in the NAME column is an attribute. The attributes are coded and I have 600K lines of the data. Here is a sample. The data is grouped and the befinning of each grouping is indicated by RID resetting to 1.

{'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'}
{'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'}
{'LTYPE': 'A', 'RID': '3', 'NAME': 'B'}
{'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'}
{'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '6', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '7', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '8', 'NAME': 'DA'}
{'LTYPE': 'N', 'RID': '9', 'NAME': 'Robert Smith'}
{'LTYPE': 'A', 'RID': '10', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '11', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '12', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '13', 'NAME': 'RB'}
{'LTYPE': 'A', 'RID': '14', 'NAME': 'VC'}
{'LTYPE': 'N', 'RID': '15', 'NAME': 'Harvey Smith'}
{'LTYPE': 'A', 'RID': '16', 'NAME': 'SA'}
{'LTYPE': 'A', 'RID': '17', 'NAME': 'AS'}
{'LTYPE': 'N', 'RID': '18', 'NAME': 'Lukas Smith'}
{'LTYPE': 'A', 'RID': '19', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '20', 'NAME': 'AS'}

I want to create the following:

{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'N', 'RID': '1', 'PERSON_NAME': 'Jason Smith', 'NAME': 'Jason Smith'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '2', 'PERSON_NAME': 'Jason Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '3', 'PERSON_NAME': 'Jason Smith', 'NAME': 'B'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'N', 'RID': '4', 'PERSON_NAME': 'John Smith', 'NAME': 'John Smith'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '5', 'PERSON_NAME': 'John Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '6', 'PERSON_NAME': 'John Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '7', 'PERSON_NAME': 'John Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '8', 'PERSON_NAME': 'John Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'N', 'RID': '9', 'PERSON_NAME': 'Robert Smith', 'NAME': 'Robert Smith'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '10', 'PERSON_NAME': 'Robert Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '11', 'PERSON_NAME': 'Robert Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '12', 'PERSON_NAME': 'Robert Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '13', 'PERSON_NAME': 'Robert Smith', 'NAME': 'RB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '14', 'PERSON_NAME': 'Robert Smith', 'NAME': 'VC'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'N', 'RID': '15', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'Harvey Smith'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '16', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'SA'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '17', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'AS'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'N', 'RID': '18', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'Lukas Smith'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '19', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '20', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'AS'}

I started off by getting the index positions of LTYPE

nameIndex=[]
attributeIndex=[]
for line in thedata:
    if line['LTYPE']=='N':
        nameIndex.append(int(line["RID"])-1)
    if line['LTYPE']=='A':
        attributeIndex.append(int(line["RID"])-1)

So I have the list index of each of the rows classified as a name in one list and the list index of each of the rows classified as an attribute in another list. It is then easy to attach the name to each observation as follows

for counter, row in enumerate(thedata):
    if counter in nameIndex:
        row['PERSON_NAME']=row['NAME']
        person_NAME=row['NAME']
    if counter not in nameIndex:
        row['PERSON_NAME']=person_NAME

I am struggling to determine and assign the list of attributes to each person.

First I need to combine the attributes that belong together so I did this:

 newAttribute=[]
 for counter, row in enumerate(thedata):
     if counter in attributeIndex:
         tempAttribute=tempAttribute+' '+row['NAME']

     if counter not in attributeIndex:
         if counter==0:
             tempAttribute=""
             pass
         if counter!=0:
             newAttribute.append(tempAttribute.lstrip())
             tempAttribute=""

one problem with my approach is that I still have to add the last group to the newAttribute list since the loop finishes before it is added. So to get the list of grouped attributes I have to run

newAttribute.append(tempAttribute)

But even then I can’t seem to find a clean way to add the attributes I have to do it in two steps. First, I create a dictionary with the nameIndex positions as the key and the attributes as the values

tempDict={}
for each in range(len(nameIndex)):
    tempdict[nameIndex[each]]=newAttribute[each]

I cycle through the list once putting in the attribute on the name line

for counter,row in enumerate(thedata):
    if counter in tempDict:
        thedata[counter]['TA']=tempDict[counter]

and then I go through it again checking if the key ‘TA’ exists and using the existence to set the PERSON_ATTRIBUTE key

for each in thedata:
    if each.has_key('TA'):
        each['PERSON_ATTRIBUTES']=each['TA']
        holdAttribute=each['TA']
    else:
        each['PERSON_ATTRIBUTES']=holdAttribute

There has got to be a cleaner way to think about this and so I was wondering if anyone would like to point me in the direction of some functions that I could read about that would let me clean up this code. I know I still have to drop the ‘TA’ key but I figured that I have taken enough space.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-13T08:09:37+00:00

I suggest a different, index-free approach based on itertools.groupby:

import itertools, operator

data = [
{'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'},
{'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'},
{'LTYPE': 'A', 'RID': '3', 'NAME': 'B'},
{'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'},
{'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'},
{'LTYPE': 'A', 'RID': '6', 'NAME': 'CB'},
{'LTYPE': 'A', 'RID': '7', 'NAME': 'DB'},
{'LTYPE': 'A', 'RID': '8', 'NAME': 'DA'},
]

for k, g in itertools.groupby(data, operator.itemgetter('LTYPE')):
  if k=='N':
    person_name_record = next(g)
  else:
    attribute_records = list(g)
    person_attributes = ' '.join(r['NAME'] for r in attribute_records)
    addfields = dict(PERSON_ATTRIBUTES=person_attributes,
                     PERSON_NAME=person_name_record['NAME'])
    person_name_record.update(addfields)
    for r in attribute_records: r.update(addfields)

for r in data: print r

This prints your desired results for the first couple people (and each person is treated separately, so it should work just the same for a few hundred thousand people;-).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have data in a CSV file. One of the column lists a persons

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply