I have downloaded this csv file , which creates a spreadsheet of gene information.

Question

0

Asked: June 15, 20262026-06-15T09:49:17+00:00 2026-06-15T09:49:17+00:00

I have downloaded this csv file , which creates a spreadsheet of gene information.

0

I have downloaded this csv file, which creates a spreadsheet of gene information. What is important is that in the HLA-* columns, there is gene information. If the gene is too low of a resolution e.g. DQB1*03 then the row should be deleted. If the data is too high resoltuion e.g. DQB1*03:02:01, then the :01 tag at the end needs to be removed. So, ideally I want to proteins to be in the format DQB1*03:02, so that it has two levels of resolution after DQB1*. How can I tell python to look for these formats, and ignore the data stored in them.
e.g.

if (csvCell is of format DQB1*03:02:01):
   delete the :01 # but do this in a general format
elif (csvCell is of format DQB1*03):
   delete row
else:
   goto next line

UPDATE: Edited code I referenced

import csv
import re
import sys

csvdictreader = csv.DictReader(open('mhc.csv','r+b'), delimiter=',')
csvdictwriter = csv.DictWriter(file('mhc_fixed.csv','r+b'), fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]

for rowfields in csvdictreader:
  keep = True
  for field in targets:
    value = rowfields[field]
    if re.match(r'^\w+\*\d\d$', value):
      keep = False
      break # quit processing target fields
    elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
      rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
    else: # reduce gene resolution if too high
              # by only keeping first two alles if three are present
      rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
  if keep:
     csvdictwriter.writerow(rowfields)

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-15T09:49:18+00:00

Here’s something that I think will do what you want. It’s not as simple as Peter’s answer because it uses Python’s csv module to process the file. It could probably be rewritten and simplified to just treat the file as a plain text as his does, but that should be easy.

import csv
import re
import sys

csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]

for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^DQB1\*\d\d$', value): # gene resolution too low?
            keep = False
            break # quit processing target fields
        else: # reduce gene resolution if too high
              # by only keeping first two alles if three are present
            rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
                                      r'DQB1*\1:\2', value)
    if keep:
        csvdictwriter.writerow(rowfields)

The hardest part for me was determining what you wanted to do.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have downloaded this csv file , which creates a spreadsheet of gene information.

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply