I need to extract certain data from a file, but this file is formatted

Asked: May 15, 2026 (2026-05-15T10:48:16+00:00)

I need to extract certain data from a file, but this file is formatted to be read by humans, and is therefore irregular. First off there is a large amount of text before any of the data actually begins:

   DL_POLY Version 2.20

                        Running on   10 nodes
*************** DLPOLY: LiNbO3 >***************

SIMULATION CONTROL PARAMETERS

simulation temperature 1.4500E+03

simulation pressure (katm) 0.0000E+00

selected number of timesteps 8000

equilibration period 500

data printing interval 80

statistics file interval 80

simulation timestep 5.0000E-04

Nose-Hoover (Melchionna) isotropic N-P-T
thermostat relaxation time 1.0000E-01
barostat relaxation time 5.0000E-01

trajectory file option on
trajectory file start 1
trajectory file interval 80
trajectory file info key 2
…

Then after a while there is the actual data but it is in this funny form:

step eng_tot temp_tot eng_cfg eng_vdw eng_cou eng_bnd > eng_ang eng_dih eng_tet
time(ps) eng_pv temp_rot vir_cfg vir_vdw vir_cou vir_bnd >vir_ang vir_con vir_tet
cpu (s) volume temp_shl eng_shl vir_shl alpha beta >gamma vir_pmf press

1 -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
0.0 -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
0.8 5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01

rolling -1.1289E+05 1.4750E+03 -1.1386E+05 1.7276E+04 -1.3114E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
averages -1.1545E+05 0.0000E+00 9.6539E+03 -1.2118E+05 1.3083E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
5.3733E+04 1.2367E+02 0.0000E+00 0.0000E+00 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -7.5549E+01

80 -1.1290E+05 1.5021E+03 -1.1392E+05 2.1894E+04 -1.3726E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
0.0 -1.1256E+05 0.0000E+00 8.6671E+02 -1.3974E+05 1.3707E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
10.6 5.3149E+04 1.1377E+03 1.4419E+03 3.5382E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.1119E+01

rolling -1.1290E+05 1.6145E+03 -1.1398E+05 2.0750E+04 -1.3588E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
averages -1.1333E+05 0.0000E+00 3.3694E+03 -1.3512E+05 1.3565E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
5.3481E+04 1.0997E+03 1.1430E+03 2.8391E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.2096E+01

160 -1.1287E+05 1.2629E+03 -1.1376E+05 2.1450E+04 -1.3633E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
0.1 -1.1249E+05 0.0000E+00 3.8761E+02 -1.3824E+05 1.3612E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
20.5 5.3375E+04 4.9015E+02 1.1243E+03 2.5052E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.2676E+01

rolling -1.1288E+05 1.4677E+03 -1.1389E+05 2.1589E+04 -1.3663E+05 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
averages -1.1235E+05 0.0000E+00 2.1147E+02 -1.3884E+05 1.3643E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
5.3152E+04 7.4818E+02 1.1440E+03 2.6211E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 1.7174E+01

On the 9th data interval there is a slight anomaly:

switching off temperature scaling at step 500
 560 -1.1287E+05  1.4709E+03 -1.1390E+05  2.1600E+04 -1.3678E+05  0.0000E+00  >0.0000E+00  0.0000E+00  0.0000E+00
 0.3 -1.1292E+05  0.0000E+00  1.9253E+03 -1.3743E+05  1.3656E+05  0.0000E+00  >0.0000E+00  0.0000E+00  0.0000E+00
68.4  5.4300E+04  1.5043E+02  1.2775E+03  2.7947E+03  5.6396E+01  5.6396E+01  >5.6396E+01  0.0000E+00  2.0576E-01
rolling -1.1286E+05 1.4784E+03 -1.1390E+05 2.1546E+04 -1.3673E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
averages -1.1298E+05 0.0000E+00 2.1361E+03 -1.3717E+05 1.3651E+05 0.0000E+00 >0.0000E+00 0.0000E+00 0.0000E+00
5.4303E+04 2.2261E+02 1.2785E+03 2.8027E+03 5.6396E+01 5.6396E+01 >5.6396E+01 0.0000E+00 -1.7421E+00

As you can see, there is a pair of '----' lines which may interfere with proper parsing of the data.

Let's say I want to get just the 'eng_tot' data from this file (the bolded numbers). How would I go about doing that in Python? The number is always in the same place in the file (second quantity, first row after the second set of dashes).

By the way, the header part with all the definitions in it repeats every 8 steps, except the first step, in which there are 9 lines. I'd like to just ignore the first step. For now, let's say I want to start with line 295 inclusive. Just so you know, I'm quite new to Python and programming in general, so all the help you can provide is appreciated.

Here's the code I tried, but Eng_Total is still an empty list:

import re
import inspect

def lineno():
    """Returns the current line number"""
    linenum = inspect.currentframe().f_back.f_lineno
infile =  open('FilePath/OUTPUT.01').read()
Eng_Total = []
for line in infile:
#    if 'eng_tot' in line.split(): 
     if re.match("\s+-+\s+", line):
    lineno(line)
        line = linenum+1
        sanitized_line = line[8:]
        eng_total = line.split()[0]
        Eng_Total.append(eng_total)
print Eng_Total

1 Answer

Editorial Team · Answer 1 · 2026-05-15T10:48:17+00:00

I’d probably do this:

Iterate over the lines in the output.
Search for one containing eng_tot:
- if 'eng_tot' in line.split(): process_blocks
Gobble up lines until one consists entirely of dashes (with optional spaces on either side):
- if re.match(r"\s*-+\s*$", line): process_metrics_block
Process the first line of metrics:
- cut the first column off the line (it makes it harder to parse, because it might not be there)
  - sanitized_line = line[8:]
  - eng_total = sanitized_line.split()[0] (the first remaining column is now eng_total)
Skip lines until you reach another line of dashes, then start again.
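
The steps above can also be sketched as a single pass over the lines. This is only a rough illustration, not a tested parser for your real file: it assumes a separator is a line made up only of dashes, and it recognises a metrics line by its integer step number instead of cutting off the first 8 characters. The names SEPARATOR and extract_eng_tot are made up for this example.

```python
import re

# Assumption: a separator line contains only dashes plus optional whitespace.
SEPARATOR = re.compile(r"\s*-+\s*$")

def extract_eng_tot(lines):
    """Return eng_tot (as float) from the first metrics line after each separator."""
    totals = []
    after_separator = False
    for line in lines:
        if SEPARATOR.match(line):
            after_separator = True   # the next non-blank line starts a new block
            continue
        fields = line.split()
        if not fields:
            continue                 # blank lines do not start a block
        if after_separator:
            after_separator = False
            # Only a metrics line starts with an integer step number; this skips
            # the "step eng_tot ..." header and status messages.
            if len(fields) > 1 and fields[0].isdigit():
                totals.append(float(fields[1]))
    return totals
```

Because it keeps only one flag of state, this works on an open file object as well as on a list of lines, so the whole file never has to be held in memory.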

After seeing your edits:

You need to import the re (regular expression) module at the top of the file: import re
The process_blocks and process_metrics_block names were pseudocode. Those don’t exist unless you define them. 🙂 You don’t need those functions exactly; you can avoid them using basic looping (while) and conditional (if) statements.
You’ll have to make sure you understand what you’re doing, not just copy from Stack Overflow! 🙂
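
One more thing that keeps Eng_Total empty in the code from the question: open(...).read() returns the whole file as one string, and looping over a string gives you single characters, not lines. A small sketch of the difference, where io.StringIO and the two-line text are just stand-ins for your real file:

```python
import io
import re

text = "  ----  \n 1 -1.1289E+05\n"

# Iterating over a string yields single characters, so the pattern
# (which needs at least three characters) never matches.
chars = [c for c in text if re.match(r"\s+-+\s+", c)]

# Iterating over a file-like object yields whole lines, so the dashes are found.
lines = [l for l in io.StringIO(text) if re.match(r"\s+-+\s+", l)]

print(chars)
print(lines)
```

Looping over the file object itself, or calling readlines() as in the code below, avoids the problem.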

It looks like you’re trying to do something like this. It seems to work, but I’m sure with some effort, you can come up with something nicer:

import re

def find_header(lines):
  for (i, line) in enumerate(lines):
    if 'eng_tot' in line.split():
      return i
  return None

def find_next_separator(lines, start):
  for (i, line) in enumerate(lines[start+1:]):
    # Anchor with $ so only all-dash lines match, not lines starting with a negative number.
    if re.match(r"\s*-+\s*$", line):
      return i + start + 1
  return None

if __name__ == '__main__':
  totals = []
  lines = open('so.txt').readlines()

  header = find_header(lines)
  start = find_next_separator(lines, header+1)

  while True:
    end = find_next_separator(lines, start+1)
    if end is None: break

    # Pull out block, after line of dashes.
    metrics_block = lines[start+1:end]

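    # Pull out 2nd column from the 1st non-blank line of the block. Only keep it
    # when that line starts with a step number, so repeated headers and status
    # messages (e.g. "switching off ...") are skipped.
    fields = next((l.split() for l in metrics_block if l.strip()), [])
    if len(fields) > 1 and fields[0].isdigit():
      totals.append(fields[1])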

    start = end

  print(totals)

You can use a generator to be a little more pythonic:

def metric_block_iter(lines):
  start = find_next_separator(lines, find_header(lines)+1)
  while True:
    end = find_next_separator(lines, start+1)
    if end is None: break
    yield (start, end)
    start = end


if __name__ == '__main__':
  totals = []
  lines = open('so.txt').readlines()

  for (start, end) in metric_block_iter(lines):
    # Pull out block, after line of dashes.
    metrics_block = lines[start+1:end]

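    # Pull out 2nd column from the 1st non-blank line of the block. Only keep it
    # when that line starts with a step number, so repeated headers and status
    # messages (e.g. "switching off ...") are skipped.
    fields = next((l.split() for l in metrics_block if l.strip()), [])
    if len(fields) > 1 and fields[0].isdigit():
      totals.append(fields[1])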

  print(totals)
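
Two loose ends from the question: the collected totals are still strings, and you wanted to start at line 295. A minimal sketch of both, using itertools.islice from the standard library; lines_from and to_floats are names invented for this example, and 295 is simply the line number you mentioned:

```python
from itertools import islice

def lines_from(path, first_line):
    """Yield lines of the file starting at the 1-based line number first_line."""
    with open(path) as f:
        # islice counts from 0, so line 295 is index 294.
        for line in islice(f, first_line - 1, None):
            yield line

def to_floats(values):
    """Convert strings such as '-1.1289E+05' to floats."""
    return [float(v) for v in values]
```

For example, lines = list(lines_from('so.txt', 295)) could replace the readlines() call, and to_floats(totals) gives you numbers you can plot or average.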
