I have a huge text file, which is structured as:
SEPARATOR
STRING1
(arbitrary number of lines)
SEPARATOR
...
SEPARATOR
STRING2
(arbitrary number of lines)
SEPARATOR
SEPARATOR
STRING3
(arbitrary number of lines)
SEPARATOR
....
The only things that change between the different “blocks” of the file are the STRING and the content between the separators. I need a script in bash or Python which, given a STRING_i as input, writes an output file containing:
SEPARATOR
STRING_i
(number of lines for this string)
SEPARATOR
What is the best approach here: bash or Python? Or is there another option? It must also be fast.
Thanks
In Python 2.6 or better:
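A minimal sketch of such a function, written to match the description that follows rather than taken from the original listing. The function name `extract_block` is hypothetical; the parameter names `inf`, `ouf`, `thestring` and `separator` follow the text. It assumes every block has the form separator, id line, body, separator, with nothing between blocks, and that the wanted output is the matching block itself, lines included.

```python
def extract_block(inf, ouf, thestring, separator='SEPARATOR'):
    """Copy the block whose id line equals `thestring` from inf to ouf.

    Returns True if the block was found, False otherwise.
    Closes both files before returning, as the answer describes.
    """
    try:
        for line in inf:
            # Each pass of this loop starts on a block's opening separator.
            assert line.rstrip('\n') == separator, 'expected a separator line'
            blockid = next(inf)  # inf.next() in Python older than 2.6
            wanted = blockid.rstrip('\n') == thestring
            if wanted:
                ouf.write(line)
                ouf.write(blockid)
            # Consume the body up to and including the closing separator,
            # copying it only when this is the block we want.
            for line in inf:
                if wanted:
                    ouf.write(line)
                if line.rstrip('\n') == separator:
                    break
            else:
                assert False, 'block is missing its closing separator'
            if wanted:
                return True  # stop early: the rest of the file is never read
        return False
    finally:
        inf.close()
        ouf.close()
```

As a pipeline filter this could be called as `extract_block(sys.stdin, sys.stdout, 'STRING2')` after `import sys`. The file is streamed line by line in a single pass, so memory use stays flat even for a huge input, and reading stops as soon as the wanted block has been written.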
In older Python versions you can do almost the same, but change the line `blockid = next(inf)` to `blockid = inf.next()`.
The assumptions here are that the input and output files are opened by the caller (which also passes in the interesting values of `thestring`, and optionally `separator`) but it’s this function’s job to close them (e.g. for maximum ease of use as a pipeline filter, with `inf` of `sys.stdin` and `ouf` of `sys.stdout`); easy to tweak if needed of course.
Removing the `assert`s will speed it up microscopically, but I like their “sanity checking” role (and they may also help understand the logic of the code flow).
Key to this approach is that a file is an iterator (of lines) and iterators can be advanced in multiple places (so we can have multiple `for` statements, or specific “advance the iterator” calls such as `next(inf)`, and they cooperate properly).
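The iterator point can be seen in isolation in the small sketch below. `line_after` is a hypothetical helper made up for illustration; it shows a `for` loop and a `next()` call sharing one iterator, each picking up where the other stopped.

```python
def line_after(lines, marker):
    """Return the line following the first line equal to `marker`, or None."""
    it = iter(lines)  # for an open file, iter(f) is the file object itself
    for line in it:
        # The for loop consumes lines up to and including the marker...
        if line.rstrip('\n') == marker:
            # ...and next() continues from exactly that position.
            return next(it, None)
    return None
```

After the call returns, the same iterator is still positioned just past the returned line, so a later loop over it resumes from there instead of starting over.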