OK, I have the following working program. It opens a file of column data that is too large for Excel and finds the average value of each column:
Sample data is:
Joe Sam Bob
1 2 3
2 1 3
And it returns
Joe Sam Bob
1.5 1.5 3
This is good. The problem is that some columns have NA as a value. I want to skip each NA and calculate the average of the remaining values.
So
Bobby
1
NA
2
Should output as
Bobby
1.5
Here is my existing program built with help from here. Any help is appreciated!
with open('C://avy.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    for line in f:
        # Skip empty lines
        if not line.strip():
            continue
        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1
    with open('c://finished.txt', 'w') as ouf:
        for index, summedRowValue in enumerate(sums):
            print>>ouf, columns[index], 1.0 * summedRowValue / numRows
Now I have this:
with open('C://avy.txt', "rtU") as f:
def get_averages(f):
headers = f.readline().split()
ncols = len(headers)
sumx0 = [0] * ncols
sumx1 = [0.0] * ncols
lino = 1
for line in f:
lino += 1
values = line.split()
for colindex, x in enumerate(values):
if colindex >= ncols:
print >> sys.stderr, "Extra data %r in row %d, column %d" %(x, lino, colindex+1)
continue
try:
value = float(x)
except ValueError:
continue
sumx0[colindex] += 1
sumx1[colindex] += value
print headers
print sumx1
print sumx0
averages = [
total / count if count else None
for total, count in zip(sumx1, sumx0)
]
print averages
and it says:
Traceback (most recent call last):
  File "C:/avy10.py", line 11, in <module>
    lino += 1
NameError: name 'lino' is not defined
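The traceback is consistent with an indentation problem: `lino = 1` is assigned inside `get_averages`, so it is local to that function, while the loop that runs `lino += 1` appears to sit outside it at module level. A minimal Python 3 sketch of that scoping behaviour follows; the names `setup` and `counter` are made up for illustration and do not come from the program above.

```python
def setup():
    # counter exists only inside setup(); it is gone once the call returns
    counter = 1
    return counter


setup()

try:
    counter += 1  # outside the function, no variable named counter exists
except NameError as exc:
    message = str(exc)

print(message)
```

Indenting the whole loop so that it belongs to the function body, and then actually calling the function, would avoid this error.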
The following code handles varying counts properly and also detects extra data … in other words, it’s rather robust. It could be improved by adding explicit messages (1) if the file is empty and (2) if the header line is empty. Another possibility is testing explicitly for "NA" and issuing an error message if a field is neither "NA" nor floatable.
Edit add here:
Edit
Normal usage:
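As a rough illustration, here is a minimal sketch of the approach described above, written for Python 3 rather than the Python 2 used in the question. It folds in the suggested improvements (explicit errors for an empty file or empty header line, and an explicit "NA" test). The function name `column_averages`, the error messages, and the in-memory sample data are assumptions made for this example, not the original code.

```python
import io
import sys


def column_averages(f):
    """Return (headers, averages); "NA" fields are skipped per column."""
    header_line = f.readline()
    if not header_line:
        raise ValueError("file is empty")
    headers = header_line.split()
    if not headers:
        raise ValueError("header line is empty")

    ncols = len(headers)
    counts = [0] * ncols    # number of usable values seen in each column
    totals = [0.0] * ncols  # running sum of those values

    for lino, line in enumerate(f, start=2):  # the header was line 1
        for colindex, field in enumerate(line.split()):
            if colindex >= ncols:
                print("Extra data %r in row %d, column %d"
                      % (field, lino, colindex + 1), file=sys.stderr)
                continue
            if field == "NA":
                continue  # missing value: leave this column's count alone
            try:
                value = float(field)
            except ValueError:
                raise ValueError("Bad value %r in row %d, column %d"
                                 % (field, lino, colindex + 1))
            counts[colindex] += 1
            totals[colindex] += value

    # A column with no usable values has no average, so report None.
    averages = [total / count if count else None
                for total, count in zip(totals, counts)]
    return headers, averages


# In-memory stand-in for the data file (blank lines contribute nothing).
sample = io.StringIO("Joe Sam Bobby\n1 2 1\n2 1 NA\n\n3 3 2\n")
headers, averages = column_averages(sample)
print(headers)   # ['Joe', 'Sam', 'Bobby']
print(averages)  # [2.0, 2.0, 1.5]
```

Because each column keeps its own count, the NA in the Bobby column lowers only that column's divisor. With a real file, an open file object can be passed in place of the `StringIO` sample.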