I’m not sure if my title is correct for what I’m looking for, but I think that the referencing is the problem.
I have a Reader object through which I can loop:
msrun = pymzml.run.Reader(mzmlFile)
for feature in msrun:
    print(feature['id'])
With this code I get the ids, starting at 1, of all the features in msrun. However, I first need to loop through msrun once to collect all the keys that I want and put them in a list, like this:
def getKeys(msrun, excludeList):
    spectrumKeys = []
    done = False
    for spectrum in msrun:
        if done:
            break
        if spectrum['ms level'] == 2:
            for key in spectrum:
                if key not in excludeList and not key.startswith('MS:'):
                    done = True
                    spectrumKeys.append(key)
            spectrumKeys.extend(spectrum['precursors'][0].keys())
            precursorKeys = spectrum['precursors'][0].keys()
            break
    return spectrumKeys, precursorKeys
However, if I run this code:
msrun = pymzml.run.Reader(mzmlFile)
specKeys, precursKeys = getKeys(msrun, ['title','name'])
for feature in msrun:
    print(feature['id'])
it starts at the first id that was not consumed by the loop in getKeys() (11 instead of 1). So I guess pymzml.run.Reader() works like a generator object. So I tried copying the object. First I tried
copyMsrun = msrun
specKeys, precursKeys = getKeys(copyMsrun, ['title','name'])
But this gives the same problem. If I understood correctly, that is because copyMsrun = msrun makes both names point to the same object.
Then I tried
import copy
copyMsrun = copy.copy(msrun)
But I still had the same problem. I used copy.copy instead of copy.deepcopy because I don’t think that the Reader object contains other objects, and when I try deepcopy I get
TypeError: object.__new__(generator) is not safe, use generator.__new__().
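Both attempts can be reproduced without pymzML. The sketch below uses a hypothetical `FakeReader` class (not part of pymzML) that wraps a generator, on the assumption that the real Reader keeps similar single-pass state internally. Plain assignment only creates a second name for the same object, and `copy.copy` creates a new object that still shares the same underlying generator.

```python
import copy


class FakeReader(object):
    """Hypothetical stand-in for a single-pass reader (not pymzML's class)."""

    def __init__(self, n):
        # The generator is the single-pass state that gets consumed.
        self._source = (i for i in range(1, n + 1))

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._source)

    next = __next__  # Python 2 spelling of the same method


reader = FakeReader(5)
alias = reader               # same object under a second name
shallow = copy.copy(reader)  # new object, but _source is shared

first = next(alias)      # consumes one item through the alias
second = next(shallow)   # consumes the next item through the shallow copy
rest = list(reader)      # the original only sees what is left

print(alias is reader)
print(shallow is reader)
print(shallow._source is reader._source)
print(first, second, rest)
```

In this sketch all three names drain the same generator, which matches the symptom of the second loop starting partway through the run.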
So how do I copy an object so that looping through one doesn’t affect the other? Should I just do
msrun = pymzml.run.Reader(mzmlFile)
copyMsrun = pymzml.run.Reader(mzmlFile)
?
Edit:
Following Ade YU’s comment, I tried storing the spectra in a list, but when I do
spectrumList = []
for spectrum in msrun:
    print(spectrum['id'])
    spectrumList.append(spectrum)
for spectrum in spectrumList:
    print(spectrum['id'])
The first loop prints 1 to 10, but the second loop prints 10 ten times.
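One way to get this output is a reader that reuses a single spectrum object and overwrites it on every iteration, so that every list entry refers to the same object. The sketch below reproduces the effect with a hypothetical `reusing_reader` generator; that pymzML works exactly this way internally is an assumption here.

```python
def reusing_reader(n):
    """Hypothetical reader that yields the same dict on every iteration."""
    spectrum = {}
    for i in range(1, n + 1):
        spectrum['id'] = i  # overwrite in place instead of creating a new object
        yield spectrum


collected = []
seen_while_looping = []
for spectrum in reusing_reader(10):
    seen_while_looping.append(spectrum['id'])  # plain value at iteration time
    collected.append(spectrum)                 # stores a reference, not a copy

ids_afterwards = [s['id'] for s in collected]
print(seen_while_looping)
print(ids_afterwards)
```

Storing the plain values works because integers are captured at iteration time, while the stored references all end up showing the state of the last iteration.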
From the pymzML publication and documentation, it is clear that this “pathological design” is intentional. Initializing thousands of spectrum objects would create a large overhead in both memory and CPU cycles that is simply not needed. Parsing large mzML files naturally calls for an analyze-while-parsing approach rather than collecting everything first and analyzing it later.
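A minimal sketch of the analyze-while-parsing idea, using plain dicts in place of spectra: each item is reduced to the values of interest as soon as it is read, so nothing depends on the spectrum object after the loop has moved on. The `fake_run` generator and the `summarize` helper are hypothetical and not part of pymzML.

```python
def fake_run():
    """Hypothetical stand-in for a reader; yields one spectrum-like dict at a time."""
    levels = [1, 2, 2, 1, 2]
    for i, level in enumerate(levels, start=1):
        yield {'id': i, 'ms level': level}


def summarize(run):
    """Single pass: extract what is needed from each item immediately."""
    ms2_ids = []
    count = 0
    for spectrum in run:
        count += 1
        if spectrum['ms level'] == 2:
            ms2_ids.append(spectrum['id'])  # keep only the plain value
    return count, ms2_ids


total, ms2_ids = summarize(fake_run())
print(total, ms2_ids)
```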
Having said this, pymzML still offers a way to “deep copy” a spectrum: call spectrum.deRef(). The advantage of this function is that all unnecessary data is stripped prior to copying, which yields smaller objects. See the pymzML documentation for deRef.
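The effect of copying before storing can be sketched with a stand-in. Here `copy.deepcopy` on a plain dict plays the role that `spectrum.deRef()` plays for pymzML spectra. The `reusing_reader` generator is hypothetical, and the exact behaviour of `deRef()` should be checked against the documentation of the installed pymzML version.

```python
import copy


def reusing_reader(n):
    """Hypothetical reader that reuses one dict, as a stand-in for a spectrum."""
    spectrum = {'peaks': []}
    for i in range(1, n + 1):
        spectrum['id'] = i
        spectrum['peaks'] = [(100.0 * i, 1.0)]
        yield spectrum


snapshots = []
for spectrum in reusing_reader(10):
    # Store an independent copy so later iterations cannot overwrite it.
    snapshots.append(copy.deepcopy(spectrum))

ids = [s['id'] for s in snapshots]
print(ids)
```

With pymzML, the corresponding change to the loop in the question would be `spectrumList.append(spectrum.deRef())`.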
Hope that helps.