I have a shell script with the following line to remove double quotes (") from a text file.
sed 's/\"//g' old_file.txt > new_file.txt
There is one more awk statement that selects only specific columns from a ^-separated text file.
Both statements work as expected, but the server hangs when the input file is more than a few GB in size. I would like to know if Python can do the same more efficiently.
update:
It is not stopping the server, but mysql hosted on the same server is slow when I run the shell script.
It's unlikely that Python could do that faster. With a bit of work, it could do the same thing with roughly the same efficiency, unless you do it wrong, in which case it will be slower.
Both sed & awk operate in line mode. They are quite I/O-optimized, and I don't think you could improve on that. A Python script might be faster when it comes to the processing itself, but in this case that is very unlikely to matter.
Just pipe them like @paxdiablo suggests:
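A minimal sketch of such a pipeline follows. The sample data and the awk program (columns 1, 2 and 4 of a ^-separated file) are assumptions made for illustration, since the original awk statement is not shown; substitute your own.

```shell
# Made-up sample input: '^'-separated fields wrapped in double quotes.
printf '%s\n' '"a"^"b"^"c"^"d"' '"e"^"f"^"g"^"h"' > old_file.txt

# Strip the quotes and select columns in one pass; no intermediate file is written.
# The column numbers are hypothetical; replace the awk program with your own.
sed 's/"//g' old_file.txt | awk -F'^' -v OFS='^' '{ print $1, $2, $4 }' > new_file.txt
```

Because the two commands run concurrently and the data between them never touches the disk, this should roughly halve the amount of disk I/O compared with running them one after the other through a temporary file.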
Or, if the column format is simple enough, you can replace `awk` with the simpler `cut`, which would be faster, for example when selecting columns 1, 2 & 4 of a space-separated file.
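As a rough sketch of the `cut` variant, assuming made-up sample data and the space-separated layout mentioned above:

```shell
# Made-up sample input: space-separated fields wrapped in double quotes.
printf '%s\n' '"a" "b" "c" "d"' '"e" "f" "g" "h"' > old_file.txt

# -d sets the single-character delimiter, -f lists the fields to keep.
sed 's/"//g' old_file.txt | cut -d' ' -f1,2,4 > new_file.txt
```

For the ^-separated file from the question, `-d'^'` would take the place of `-d' '`. Note that `cut` only handles a fixed single-character delimiter and cannot reorder fields, which is what "simple enough" means here.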
And if you need the intermediate output, you can put `tee` in the pipeline to write it in the meantime. But it may actually be less efficient, since both `inter_file.txt` and `new_file.txt` will be written at the same time.

OK, now I think I understand what the problem is. Your problem is not that the script is too slow; it already runs as fast as it can. It's your hard drive that hits its throughput limit, so other applications using it get delayed. You could say that the script is simply too fast for your hard drive.
One solution is to try using `ionice` to give it a lower priority. It may help, or it may make no difference at all. Run with the idle class against the current process ID, it gives the lowest (idle) I/O priority to the current shell or script. Similarly, you can start your script with a given priority by launching it through `ionice`.
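Both forms might look like the sketch below. It assumes the util-linux `ionice` on Linux; the sample data and the script name in the comment are placeholders.

```shell
# Made-up sample input.
printf '%s\n' '"a"^"b"' > old_file.txt

# Form 1: move the current shell (PID $$) into the idle class (class 3).
# Child processes inherit the I/O priority, so the sed below runs at idle priority too.
ionice -c 3 -p $$ || echo 'could not change I/O priority; continuing anyway' >&2
sed 's/"//g' old_file.txt > new_file.txt

# Form 2 (left as a comment): start a single command at idle priority.
# ionice -c 3 ./your_script.sh    # your_script.sh is a placeholder name
```

The `|| echo` fallback is only there so the sketch keeps going on systems where `ionice` is missing or not permitted.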
The results may vary depending on the I/O scheduler used. Some schedulers will ignore this, and some may actually make the script slower (whenever mysql requests I/O).
Alternatively, you could use an additional program to limit the throughput going to sed, effectively slowing it down and leaving some headroom for mysql. You will, however, need to measure what throughput is optimal for you.
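One program that can do this is `pv`, whose `-L` option caps the transfer rate. The sketch below assumes `pv` is installed; the 10M (bytes per second) limit is an arbitrary starting point rather than a recommendation, and the `cat` fallback exists only to keep the example runnable where `pv` is missing.

```shell
# Made-up sample input.
printf '%s\n' '"a"^"b"' '"c"^"d"' > old_file.txt

RATE=10M   # placeholder limit; measure and tune for your disk

if command -v pv >/dev/null 2>&1; then
    # -q suppresses the progress display, -L caps the throughput.
    pv -q -L "$RATE" old_file.txt
else
    # No throttling available: plain pass-through.
    cat old_file.txt
fi | sed 's/"//g' > new_file.txt
```

Since sed can only consume data as fast as the limiter feeds it, the whole pipeline is held to that rate.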
And finally, if none of the above is an option, you could jump into Python and add `time.sleep()` every few hundred or thousand lines, to stop the script for a while and let mysql do its job.
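The same pause-every-N-lines idea can also be sketched without leaving the shell. This is a swapped-in awk analogue rather than the Python `time.sleep()` approach itself, and the `every` and `pause` values are placeholders kept tiny so the demo finishes instantly.

```shell
# Made-up sample input.
printf '%s\n' '"a"^"b"' '"c"^"d"' '"e"^"f"' > old_file.txt

# every: number of lines between pauses; pause: seconds to sleep each time.
awk -v every=2 -v pause=0 '
    { gsub(/"/, ""); print }                     # same effect as the sed command
    NR % every == 0 { system("sleep " pause) }   # periodically yield the disk
' old_file.txt > new_file.txt
```

In real use something like `every=100000` and `pause=1` would be a more plausible starting point. Each `system()` call spawns a new process, so `every` should stay large.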