I have a data file with multiple rows, and 8 columns – I want


Asked: June 16, 2026 (2026-06-16T05:51:35+00:00)

I have a data file with multiple rows and 8 columns. I want to average column 8 over rows that have the same values in columns 1, 2 and 5. For example, my file can look like this:

564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619

I want to average the last column of the first and third rows, since their columns 1, 2 and 5 are identical.

I want the output to look like this:

564645  7371810 0   21642   1530    1   2   25.0813
564645  7371810 0   21642   8250    1   2   0.0103

My files (text files) are pretty big (~10000 lines), and the redundant rows (by the rule above) do not occur at regular intervals, so I want the code to find the redundant rows and average them.

In response to larsks' comment, here is the code I have so far:

import os
import numpy as np

datadirectory = input('path to the data directory, ')
os.chdir(datadirectory)

## Read the data file and create an array
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
## Sort the rows on common X, Y and frequency (columns 1, 2 and 5);
## np.lexsort treats the last key, column 5, as the primary sort key
datasort = np.lexsort((data[:, 0], data[:, 1], data[:, 4]))
datasorted = data[datasort]
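
The rule described in the question (average column 8 over rows that share columns 1, 2 and 5) can be sketched with NumPy alone, starting from an array like `data`. This is only a minimal sketch, not the poster's code: the function name `average_by_key` and its parameters are made up for illustration, and it assumes the remaining columns can simply be taken from the first row of each group, as in the expected output above.

```python
import numpy as np


def average_by_key(data, key_cols=(0, 1, 4), value_col=7):
    """Average value_col over rows that share the same values in key_cols.

    Returns one row per group, sorted by the key columns. Columns other
    than value_col are copied from the first row of each group (an
    assumption, based on the expected output in the question).
    """
    data = np.asarray(data, dtype=float)
    keys = data[:, list(key_cols)]

    # first[i]   : index of the first row belonging to group i
    # inverse[j] : group number of row j
    _, first, inverse = np.unique(
        keys, axis=0, return_index=True, return_inverse=True
    )
    inverse = inverse.ravel()  # some NumPy versions return a 2-D inverse here

    sums = np.bincount(inverse, weights=data[:, value_col])
    counts = np.bincount(inverse)

    result = data[first].copy()
    result[:, value_col] = sums / counts
    return result


# The three sample rows from the question.
rows = np.array([
    [564645, 7371810, 0, 21642, 1530, 1, 2, 30.8007],
    [564645, 7371810, 0, 21642, 8250, 1, 2, 0.0103],
    [564645, 7371810, 0, 21643, 1530, 1, 2, 19.3619],
])
averaged = average_by_key(rows)
print(averaged)
```

Because the grouping goes through `np.unique`, the rows do not have to be sorted first, and the redundant rows can sit anywhere in the file.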


1 Answer

Editorial Team · Answer 1 · 2026-06-16T05:51:37+00:00

OK, based on Hury's input I updated the code:

import os  # needed system utils
import numpy as np  # for array data processing
import pandas as pd  # import the pandas module

datadirectory = input('path to the data directory, ')
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir(working)

## Read the data file and convert it to a single string
dataset = open(input('dataset_to_be_used, ')).readlines()
data = ''.join(dataset)

df = pd.read_csv(data, sep="\\s+", header=None)
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)

This worked with the test data as posted by hury, but when I use my own file, the df = … line does not seem to work. I get output like this:

Traceback (most recent call last):
  File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in <module>
    df = pd.read_csv(data, sep="\s+", header=None)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
    f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
  File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
    f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216……….

Any ideas?
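
The traceback points at a likely cause: `pd.read_csv` was handed the file's contents as one long string, and it treats a plain string as a path to open, hence "File name too long". A possible fix is to pass the file name itself, or to wrap in-memory text in `io.StringIO` so pandas sees a file-like object. The sketch below assumes a recent pandas on Python 3. The column names `c1` … `c8` are made up for illustration, since the `X.1`-style labels used above seem to depend on the pandas version, and the sample text stands in for the real file.

```python
import io

import pandas as pd

# Sample text standing in for the file contents (tab-separated rows).
text = (
    "564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\n"
    "564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\n"
    "564645\t7371810\t0\t21643\t1530\t1\t2\t19.3619\n"
)

# Hypothetical column names; the file itself has no header row.
names = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]

# read_csv expects a path or a file-like object, not the text itself,
# so in-memory text is wrapped in StringIO.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, names=names)

# With a file on disk, the path can be passed directly instead, e.g.
#   df = pd.read_csv(path_to_file, sep=r"\s+", header=None, names=names)
# where path_to_file is whatever the user typed at the prompt.

# Average column 8 within each (column 1, column 2, column 5) group.
averaged = df.groupby(["c1", "c2", "c5"])["c8"].mean().reset_index()
datas = averaged.to_numpy()
print(averaged)
```

Note that this keeps only the three key columns and the averaged column. If the other columns are needed in the output, they would have to be aggregated as well, for example by taking the first value in each group.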
