I have a data file with multiple rows and 8 columns. I want to average column 8 over rows that have the same values in columns 1, 2 and 5. For example, my file can look like this:
564645 7371810 0 21642 1530 1 2 30.8007
564645 7371810 0 21642 8250 1 2 0.0103
564645 7371810 0 21643 1530 1 2 19.3619
I want to average the last column of the first and third rows, since their columns 1, 2 and 5 are identical.
I want the output to look like this:
564645 7371810 0 21642 1530 1 2 25.0813
564645 7371810 0 21642 8250 1 2 0.0103
My files (text files) are fairly big (~10000 lines), and the redundant rows (by the rule above) do not occur at regular intervals, so I want the code to find the redundant rows and average them.
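To make the rule above concrete, here is a minimal sketch using pandas, assuming the file is whitespace-separated with no header row. The function name average_duplicates is made up for illustration, and keeping the first row's values for the non-key columns (3, 4, 6, 7) is an assumption read off the desired output above.

```python
import io
import pandas as pd

def average_duplicates(df):
    """Average column 7 (0-based) over rows sharing columns 0, 1 and 4.

    Assumption: for the other columns, the value from the first row
    of each group is kept, as in the desired output in the question.
    """
    keys = [0, 1, 4]
    # Keep the first value of every non-key column except the one we average
    how = {c: "first" for c in df.columns if c not in keys and c != 7}
    how[7] = "mean"
    # sort=False keeps groups in order of first appearance in the file
    out = df.groupby(keys, sort=False, as_index=False).agg(how)
    # Restore the original column order
    return out[list(df.columns)]

# The three sample rows from the question
sample = (
    "564645 7371810 0 21642 1530 1 2 30.8007\n"
    "564645 7371810 0 21642 8250 1 2 0.0103\n"
    "564645 7371810 0 21643 1530 1 2 19.3619\n"
)
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)
result = average_duplicates(df)
print(result.to_string(index=False, header=False))
```

For a real file, the path would go to pd.read_csv in place of the io.StringIO wrapper.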
In response to larsks' comment, here is the code I have so far:
import os
import numpy as np
datadirectory = input('path to the data directory, ')
os.chdir(datadirectory)
# Read the data file and create an array
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
# Sort the data by common X, Y and frequency (columns 1, 2 and 5);
# np.lexsort uses the LAST key as the primary sort key
datasort = np.lexsort((data[:, 0], data[:, 1], data[:, 4]))
datasorted = data[datasort]
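One way the sorted array could be reduced with NumPy alone, as a sketch rather than a definitive solution: once rows with equal keys are adjacent, group boundaries are the rows whose key differs from the previous row, and np.add.reduceat sums column 8 per group. The function name average_sorted_groups and its parameters are hypothetical, and the sketch assumes at least one row.

```python
import numpy as np

def average_sorted_groups(datasorted, key_cols=(0, 1, 4), value_col=7):
    """Average value_col over runs of adjacent rows with equal key columns.

    Assumes datasorted is already sorted on key_cols and is non-empty.
    The other columns are taken from the first row of each run.
    """
    keys = datasorted[:, list(key_cols)]
    # A row starts a new group if any key differs from the previous row
    is_start = np.ones(len(keys), dtype=bool)
    is_start[1:] = np.any(keys[1:] != keys[:-1], axis=1)
    starts = np.flatnonzero(is_start)
    # Per-group sums and sizes
    sums = np.add.reduceat(datasorted[:, value_col], starts)
    counts = np.diff(np.append(starts, len(keys)))
    result = datasorted[starts].copy()
    result[:, value_col] = sums / counts
    return result

# The three sample rows from the question
data = np.array([
    [564645, 7371810, 0, 21642, 1530, 1, 2, 30.8007],
    [564645, 7371810, 0, 21642, 8250, 1, 2, 0.0103],
    [564645, 7371810, 0, 21643, 1530, 1, 2, 19.3619],
])
datasort = np.lexsort((data[:, 0], data[:, 1], data[:, 4]))
averaged = average_sorted_groups(data[datasort])
print(averaged)
```

Note that the rows come out in sorted order (by frequency, then Y, then X), not in the original file order.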
OK, based on hury's input I updated the code.
It worked with the test data as posted by hury, but with my own file the df = … step does not seem to work. I get output like this:
Traceback (most recent call last):
  File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in <module>
    df = pd.read_csv(data, sep="\s+", header=None)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
    f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
  File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
    f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216……….
Any ideas?
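A possible reading of the traceback: the "file name" in the IOError is the file's contents, which suggests that data holds the text of the file and that read_csv is treating that string as a path. If so, either passing the path itself or wrapping the text in a file-like buffer should avoid the error. A minimal sketch of the second option, written for Python 3 (on Python 2.7 the StringIO module would be the likely equivalent); the helper name frame_from_text is made up for illustration:

```python
import io
import pandas as pd

def frame_from_text(text):
    """Parse whitespace-separated text that is already in memory.

    read_csv interprets a plain string as a file path, so the text
    is wrapped in a file-like buffer first.
    """
    return pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

# Two tab-separated lines with \r\n endings, as seen in the error message
sample = (
    "564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n"
    "564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n"
)
df = frame_from_text(sample)
print(df.shape)
```

If the file is on disk anyway, handing its path straight to pd.read_csv is probably simpler than reading it into a string first.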