For example, given a line a11b12c22d322 e... the fields of break are the numbers or spaces, we want to transform it into
a
b
c
d
e
...
sed need to read the whole line into memory, for gigabytes a line, it would not be efficient, and the job could not be done if we don’t have sufficient memory.
EDIT:
Could anyone please explain how do grep, tr, Awk, perl, and python manipulate the memory in reading a large file? What and how much content do they read into memory once a time?
If you use
gawk(which is the defaultawkon Linux, I believe), you can use theRSparameter to specify that multi-digit numbers or spaces are recognized as line terminators instead of a new-line.As to your second question, all of these programs will need to read some fixed number of bytes and search for its idea of a line separator in an internal buffer to simulate the appearance of reading a single line at a time. To prevent it from reading too much data while searching for the end of the line, you need to change the programs idea of what terminates a line.
Most languages allow you to do this, but only allow you to specify a single character.
gawkmakes it easy by allowing you to specify a regular expression to recognize an end-of-line character. This saves you from having to implement the fixed-size buffer and end-of-line search yourself.