I have to work with big files (many GB) and need quick lookups to retrieve specific lines on request.
The idea has been to maintain a mapping:
some_key -> byte_location
Where the byte location represents where in the file the line starts.
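A minimal sketch of how such a mapping could be built in one pass over the file. It assumes the key is simply the line number (the real some_key is not specified above) and that lines end in "\n"; the class name LineOffsetIndex is hypothetical. Scanning raw bytes is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class LineOffsetIndex {
    /** Returns the byte offset at which each line of the file starts (index = line number). */
    static List<Long> build(Path file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;               // number of bytes consumed so far
            boolean atLineStart = true; // true until the first byte of a line is seen
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    offsets.add(pos);   // this byte is the first byte of a new line
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') {
                    atLineStart = true;
                }
            }
        }
        return offsets;
    }
}
```

Because the bytes are counted by hand, the offsets are exact regardless of how much the underlying stream buffers. For files of many GB a primitive long[] or an on-disk index would likely be preferable to a List<Long>.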
Edit: the question changed a little bit:
First I used:
FileInputStream stream = new FileInputStream(file);
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
FileChannel channel = stream.getChannel();
I noticed that FileChannel.position() does not return the exact position the reader has reached, because it is a “buffered” reader. It reads chunks of a fixed size (16k here), so what I get from the FileChannel is a multiple of 16k and not the offset of the line the reader is actually on.
PS: the file is in UTF-8
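The effect described above can be reproduced with a small experiment. This is only a sketch: the class name BufferedPositionDemo is hypothetical, and it assumes a non-empty file whose first line ends in "\n".

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

class BufferedPositionDemo {
    /**
     * Reads the first line and returns two numbers:
     * [0] the bytes that line really occupies (including the "\n"),
     * [1] what FileChannel.position() reports at that moment.
     */
    static long[] measure(Path file) throws IOException {
        try (FileInputStream stream = new FileInputStream(file.toFile());
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            FileChannel channel = stream.getChannel();
            String first = reader.readLine(); // assumed non-null (file is not empty)
            long consumed = first.getBytes(StandardCharsets.UTF_8).length + 1;
            return new long[] { consumed, channel.position() };
        }
    }
}
```

The second value is larger than the first because the reader has already pulled a whole chunk from the stream, which is why the channel position cannot be used directly as a line offset.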
I would have tried something like this:
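A minimal sketch of that idea, assuming it amounts to RandomAccessFile.seek() to the stored byte location followed by readLine(); the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

class SeekAndReadLine {
    /** Jumps to the given byte offset and reads one line from there. */
    static String lineAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);      // offset comes from the some_key -> byte_location mapping
            return raf.readLine(); // returns null if offset is at or past end of file
        }
    }
}
```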
The problem is that readLine() turns each byte into a character with the top 8 bits zero. That’s fine if your file is ASCII or Latin-1, but problematic for UTF-8.
However, if you are prepared to use RandomAccessFile to write the file, you can use readUTF() and writeUTF() to read and write “lines” encoded as modified UTF-8 Strings.
FOLLOWUP
Yea … see above.
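The readUTF()/writeUTF() suggestion above might look roughly like this. It is a sketch under the assumption that you control how the file is written; the class UtfRecordFile and its methods are hypothetical. Note that writeUTF() stores a 2-byte length prefix before each string, so the file is no longer a plain text file, and a single record is limited to 65535 encoded bytes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.List;

class UtfRecordFile {
    /** Writes each line with writeUTF() and returns the byte offset of every record. */
    static long[] write(Path file, List<String> lines) throws IOException {
        long[] offsets = new long[lines.size()];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.setLength(0); // start from an empty file
            for (int i = 0; i < lines.size(); i++) {
                offsets[i] = raf.getFilePointer(); // exact position, nothing is buffered
                raf.writeUTF(lines.get(i));
            }
        }
        return offsets;
    }

    /** Reads back the record that starts at the given offset. */
    static String read(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.readUTF();
        }
    }
}
```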
Another idea for coping with UTF-8 with RandomAccessFile: use the readFully(byte[]) method to read a bunch of bytes into a byte[], find pos, the position of the end of line in the buffer, and use new String(bytes, 0, pos, "UTF-8") to convert to a Java String.
This is more cumbersome than using readLine(), but it should be faster than using FileInputStream and skip() when reading multiple lines from the files in random order.
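A sketch of the readFully() approach, with a few assumptions of my own: lines end in "\n" (a trailing "\r" is stripped), the chunk size is a parameter, and the class name Utf8LineReader is hypothetical. Two details matter: readFully() throws EOFException if fewer bytes remain than requested, so the request is capped at the bytes left in the file; and the bytes are decoded only once the whole line has been collected, so a multi-byte character split across two chunks is not corrupted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

class Utf8LineReader {
    /** Reads the UTF-8 line that starts at the given byte offset. */
    static String lineAt(Path file, long offset, int chunkSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            byte[] buf = new byte[chunkSize];
            boolean done = false;
            while (!done) {
                // Cap the request so readFully() never runs past end of file.
                long remaining = raf.length() - raf.getFilePointer();
                int want = (int) Math.min(buf.length, remaining);
                if (want <= 0) {
                    break; // end of file: the last line has no terminator
                }
                raf.readFully(buf, 0, want);
                int pos = 0; // position of the end of line within this chunk
                while (pos < want && buf[pos] != '\n') {
                    pos++;
                }
                line.write(buf, 0, pos); // keep everything before the newline
                done = pos < want;       // a newline was found in this chunk
            }
            byte[] bytes = line.toByteArray();
            int len = bytes.length;
            if (len > 0 && bytes[len - 1] == '\r') {
                len--; // tolerate Windows line endings
            }
            return new String(bytes, 0, len, StandardCharsets.UTF_8);
        }
    }
}
```

When many lookups are made, keeping one RandomAccessFile open and reusing it would avoid reopening the file on every call.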