New to python (very cool), first question. I am reading a 50+ mb ascii file, scanning for property tags and parsing the data into a numpy array. I have placed timing reports throughout the loop and found the culprit, the while loop using np.append(). Wondering if there is a faster method.
This is a sample input file format with fake data for debugging:
…
tag parameter
char name “Poro”
array float data 100
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 56 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84
85 86 87 88 89 90 91 92 93 94 95 96
97 98 99 100
endtag
…
and this is the code fragment, where it’s the while loop that is taking 70 seconds for a 350k element array:
def readParameter(self, parameterName):
startTime = time.time()
intervalTime = time.time()
token = "tag parameter"
self.inputBuffer.seek(0)
for lineno, line in enumerate(self.inputBuffer, 1):
if token in line:
line = self.inputBuffer.next().replace('"', '').split()
elapsedTime = time.time() - intervalTime
logging.debug(" Time to readParameter find token: " + str(elapsedTime))
intervalTime = time.time()
if line[2] == parameterName:
line = self.inputBuffer.next()
line = self.inputBuffer.next()
np.parameterArray = np.fromstring(line, dtype=float, sep=" ")
line = self.inputBuffer.next()
**while not "endtag" in line:
np.parameterArray = np.append(np.parameterArray, np.fromstring(line, dtype=float, sep=" "))
line = self.inputBuffer.next()**
elapsedTime = time.time() - startTime
logging.debug(" Time to readParameter load array: " + str(elapsedTime))
break
elapsedTime = time.time() - startTime
logging.debug(" Time to readParameter: " + str(elapsedTime))
logging.debug(np.parameterArray)
np.parameterArray = self.make3D(np.parameterArray)
return np.parameterArray
Thanks, Jeff
Appending to an array requires resizing the array, which usually requires allocating a new block of memory that’s big enough to hold the new array, copying the existing array to the new location, and freeing the memory it used to use. All of those operations are expensive, and you’re doing them for each element. With 350k elements, it’s basically garbage-collector memory fragmentation stress-test.
Pre-allocate your array. You’ve got the count parameter, so make an array that size, and inside your loop, just assign the newly-parsed element to the next spot in the array, instead of appending it. You’ll have to keep your own counter of how many elements have been filled. (You could instead iterate over the elements of the blank array and replace them, but that would make error handling a bit trickier to add in.)