For my application, I had to write a custom “readline” method since I wanted to detect and preserve the newline endings in an ASCII text file. The Java readLine() method does not tell which newline sequence (\r, \n, \r\n) or EOF was encountered, so I cannot put the exact same newline sequence when writing to the modified file.
Here is the SSCE of my test example.
public class TestLineIO {
public static java.util.ArrayList<String> readLineArrayFromFile1(java.io.File file) {
java.util.ArrayList<String> lineArray = new java.util.ArrayList<String>();
try {
java.io.BufferedReader br = new java.io.BufferedReader(new java.io.FileReader(file));
String strLine;
while ((strLine = br.readLine()) != null) {
lineArray.add(strLine);
}
br.close();
} catch (java.io.IOException e) {
System.err.println("Could not read file");
System.err.println(e);
}
lineArray.trimToSize();
return lineArray;
}
public static boolean writeLineArrayToFile1(java.util.ArrayList<String> lineArray, java.io.File file) {
try {
java.io.BufferedWriter out = new java.io.BufferedWriter(new java.io.FileWriter(file));
int size = lineArray.size();
for (int i = 0; i < size; i++) {
out.write(lineArray.get(i));
out.newLine();
}
out.close();
} catch (java.io.IOException e) {
System.err.println("Could not write file");
System.err.println(e);
return false;
}
return true;
}
public static java.util.ArrayList<String> readLineArrayFromFile2(java.io.File file) {
java.util.ArrayList<String> lineArray = new java.util.ArrayList<String>();
try {
java.io.FileInputStream stream = new java.io.FileInputStream(file);
try {
java.nio.channels.FileChannel fc = stream.getChannel();
java.nio.MappedByteBuffer bb = fc.map(java.nio.channels.FileChannel.MapMode.READ_ONLY, 0, fc.size());
char[] fileArray = java.nio.charset.Charset.defaultCharset().decode(bb).array();
if (fileArray == null || fileArray.length == 0) {
return lineArray;
}
int length = fileArray.length;
int start = 0;
int index = 0;
while (index < length) {
if (fileArray[index] == '\n') {
lineArray.add(new String(fileArray, start, index - start + 1));
start = index + 1;
} else if (fileArray[index] == '\r') {
if (index == length - 1) { //last character in the file
lineArray.add(new String(fileArray, start, length - start));
start = length;
break;
} else {
if (fileArray[index + 1] == '\n') {
lineArray.add(new String(fileArray, start, index - start + 2));
start = index + 2;
index++;
} else {
lineArray.add(new String(fileArray, start, index - start + 1));
start = index + 1;
}
}
}
index++;
}
if (start < length) {
lineArray.add(new String(fileArray, start, length - start));
}
} finally {
stream.close();
}
} catch (java.io.IOException e) {
System.err.println("Could not read file");
System.err.println(e);
e.printStackTrace();
return lineArray;
}
lineArray.trimToSize();
return lineArray;
}
public static boolean writeLineArrayToFile2(java.util.ArrayList<String> lineArray, java.io.File file) {
try {
java.io.BufferedWriter out = new java.io.BufferedWriter(new java.io.FileWriter(file));
int size = lineArray.size();
for (int i = 0; i < size; i++) {
out.write(lineArray.get(i));
}
out.close();
} catch (java.io.IOException e) {
System.err.println("Could not write file");
System.err.println(e);
return false;
}
return true;
}
public static void main(String[] args) {
System.out.println("Begin");
String fileName = "test.txt";
long start = 0;
long stop = 0;
start = java.util.Calendar.getInstance().getTimeInMillis();
java.io.File f = new java.io.File(fileName);
java.util.ArrayList<String> javaLineArray = readLineArrayFromFile1(f);
stop = java.util.Calendar.getInstance().getTimeInMillis();
System.out.println("Total time = " + (stop - start) + " ms");
java.io.File oj = new java.io.File(fileName + "_readline.txt");
writeLineArrayToFile1(javaLineArray, oj);
start = java.util.Calendar.getInstance().getTimeInMillis();
java.util.ArrayList<String> myLineArray = readLineArrayFromFile2(f);
stop = java.util.Calendar.getInstance().getTimeInMillis();
System.out.println("Total time = " + (stop - start) + " ms");
java.io.File om = new java.io.File(fileName + "_custom.txt");
writeLineArrayToFile2(myLineArray, om);
System.out.println("End");
}
}
Version 1 uses readLine(), whereas version 2 is my version, which preserves newline characters.
On a text file with about 500K lines, version1 takes about 380 ms, whereas version2 takes 1074 ms.
How can I speed-up the performance of version2?
I checked Google guava and apache-commons libraries but cannot find a suitable replacement for “readLine()” that will tell which newline character was encountered when reading a text file.
Whenever the issue regards a program’s speed, the main thing you should keep in mind is that, for any continuous process within that program, the speed is nearly always limited by one of two things: CPU (processing power) or IO (memory allocation and transfer speed).
Usually either your CPU is faster than your IO, or the contrary. Because of this, your program’s speed-limit is almost always dictated by one of them, and it’s usually easy to know which:
Things are kinda straightforward when trying to improve an CPU-bounded program’s speed. It mostly comes down to achieving the same goal or effect while making less operations.
This, on the other hand, does not make the process any easier. In fact, it’s usually much harder to optimize CPU-bounded programs than to optimize IO-bounded ones, because each CPU-related operation is usually unique, and has to be revised individually.
Although generally easier once you have the experience, things are not so straightforward with IO-bound programs. There are a lot more stuff to consider when dealing with IO-bound processes.
I’ll be using Hard-Disk Drives (HDDs) as the basis, since the characteristics I’ll mention affect HDDs the strongest (because they are mechanical), but you should keep in mind that many of the same concepts apply, to some extent, to almost every memory-storage hardware, including Solid-State Drives (SSDs) and even RAM!
These are the main performance characteristics of most memory-storage hardware:
Access time: Also known as response time, it is the time it takes before the hardware can actually transfer data.
Seek time: The time it takes for the hardware to seek (reach) the correct position within it’s internal subdivisions, in order to read from or write to addresses in that section.
Average seek time ranges from 3 ms (~) for high-end server drives, to 15 ms (~) for mobile drives, with the most common desktop drives typically having a seek time around 9 ms (~).
Typical SSDs will have a seek time between 0.08 to 0.16 ms (~), with RAM being even faster.
Command-Processing time: Also known as command overhead, it is the time it takes for the drive’s electronics to set up the necessary communication between the various internal components, so it can read or write the data.
This is in the range of 0.003 ms (~) for both, mechanical and circuital devices, and is usually ignored in benchmarks.
Settle time: It is the time it takes for the heads to settle on the target track and stop vibrating, so that they do not read or write off-track.
This amount is usually very small (typically less than 0.1 ms), and typically included in benchmarks as part of the seek time.
Data-Transfer rate: Also called throughput, it covers both: The internal rate, which is the time it takes to move data between the disk surface and the controller on the drive. And the external rate, which is the time to move data between the controller on the drive and an external component in the host system. It has a few sub-factors within:
This means that the main performance issues regarding IO are caused by going back-and-forth between IO and processing. An issue that can be enormously diminished by using buffers, and processing and reading/writhing in bigger chunks of data, rather than every byte.
As you can also see, although many of the speed characteristics are still present, RAM and SSDs do not have the same internal limits of HDDs, so their internal and external transfer rates often reach the maximum capabilities of the drive-to-host interface.
Chunk approach example:
This example will create a
Testfolder on the desktop, and generate aTest.txtfile within.The file is generated with an specified number of lines, each line containing the word
"Test"repeated for an specific number of times (for file-size purposes). Each line is ended by"\r","\n"or"\r\n", sequentially.It’s meaningless to save the results of each chunk in-memory cumulatively, as doing so would lead the whole file end up in-memory eventually, which is nearly the same problem of not using chunks to begin with.
As such, an output file is created in the same
Testfolder, to which the result of every chunk is stored at, once that chunk is finished.The base file is read using buffers, and those buffers are additionally used as the chunks.
The process here is simply printing a textual version of the line-separator (
"\\r","\\n"or"\\r\\n"), followed by": ", followed by the line contents; But for the last line,"EOF"is used instead.To actually operate with chunks, it’s probably easier to manage with a class-based approach, rather than a purely function-based one.
Anyways, here goes the code:
I don’t know what IDE you are using, but if it’s NetBeans, make a memory-profile of your code and compare to a profile of this one. You should notice a big difference in the amount of memory needed during processing.
Here, the chunk approach’s memory usage, which includes not only the chunk itself but also the program’s own variables and structures, does not go over 40 MB even tough we are dealing with a file bigger than 100 MB. As you can see:

It also spends very little time in GB, mostly less than 5% at any given point:
