So I was trying to write myself a command for a Linux pipeline. Think of it as a replica of GNU ‘cat’ or ‘sed’ that takes input from stdin, does some processing, and writes to stdout.
I originally wrote an AWK script but wanted more performance, so I used the following C++ code:
std::string crtLine;
crtLine.reserve(1000);
while (true)
{
    std::getline(std::cin, crtLine);
    if (!std::cin) // failbit (EOF immediately found) or badbit (I/O error)
        break;
    std::cout << crtLine << "\n";
}
This is exactly what cat (without any parameters) does.
As it turns out, this program is about as slow as its AWK counterpart, and nowhere near as fast as cat.
Testing on a 1GB file:
$ time cat 'file' | cat | wc -l
real 0m0.771s
$ time cat 'file' | filter-range.sh | wc -l
real 0m44.267s
Instead of getline(istream, string) I tried cin.getline(buffer, size), but it brought no improvement. This is embarrassing; is it a buffering issue? I also tried fetching 100KB at a time instead of just one line, and that did not help either. Any ideas?
EDIT:
What you folks say makes sense, BUT the culprit is not string building/copying, and neither is scanning for newlines (nor is it the size of the buffer). Take a look at these two programs:
char buf[200];
while (fgets(buf, 200, stdin))
    std::cout << buf;
$ time cat 'file' | ./FilterRange > /dev/null
real 0m3.276s
char buf[200];
while (std::cin.getline(buf, 200))
    std::cout << buf << "\n";
$ time cat 'file' | ./FilterRange > /dev/null
real 0m55.031s
Neither of them manipulates strings and both of them scan for newlines, yet one is about 17 times slower than the other. They differ only in the use of cin.
I think we can safely conclude that cin screws up the timing.
Not really. Your program has exactly the same effect as /bin/cat, but it does not use the same method.
/bin/cat looks more like a bare loop around the read() and write() system calls. Notice that /bin/cat does no processing on its input. It doesn’t build a std::string out of it, it doesn’t scan it for \n, it just does one system call after another.
Your program, on the other hand, builds strings, makes copies of them, scans for \n, and so on.
This small, complete program runs 2-3 orders of magnitude slower than /bin/cat:
I timed it thus:
EDIT
This program gets within 50% of the performance of /bin/cat:
In short, if your requirement is to perform line-by-line analysis of the input, then you will have to pay some price to use formatted input. If, on the other hand, you need to perform byte-by-byte analysis, then you can use unformatted input and go faster.