I am reading a Shift-JIS encoded XML file into a ByteBuffer, converting it to a String, and using Pattern and Matcher to find the start and the end of the document. I then use these two positions to write that part of the buffer to a file. This works when there are no multibyte characters. If there is a multibyte character, some text is missing at the end, because the value of end is slightly off.
static final Pattern startPattern = Pattern.compile("<\\?xml ");
static final Pattern endPattern = Pattern.compile("</doc>\n");

public static void main(String[] args) throws Exception {
    File f = new File("20121114000606JA.xml");
    FileInputStream fis = new FileInputStream(f);
    FileChannel fci = fis.getChannel();
    ByteBuffer data_buffer = ByteBuffer.allocate(65536);
    while (true) {
        int read = fci.read(data_buffer);
        if (read == -1)
            break;
    }
    ByteBuffer cbytes = data_buffer.duplicate();
    cbytes.flip();
    Charset data_charset = Charset.forName("UTF-8");
    String request = data_charset.decode(cbytes).toString();
    Matcher start = startPattern.matcher(request);
    if (start.find()) {
        Matcher end = endPattern.matcher(request);
        if (end.find()) {
            int i0 = start.start();
            int i1 = end.end();
            String str = request.substring(i0, i1);
            String filename = "test.xml";
            FileChannel fc = new FileOutputStream(new File(filename), false).getChannel();
            data_buffer.position(i0);
            data_buffer.limit(i1 - i0);
            long offset = fc.position();
            long sz = fc.write(data_buffer);
            fc.close();
        }
    }
    System.out.println("OK");
}
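Using the String indices i0 and i1 as byte positions in data_buffer.position(i0) and data_buffer.limit(i1 - i0) is erroneous. String indices count chars, whereas buffer positions count bytes, and the two diverge as soon as a character is encoded in more than one byte. Moreover, Unicode text does not have a unique representation: ĉ can also be written as two characters, c plus the combining circumflex, so translating back and forth between chars and bytes is not only expensive but also error-prone (in rare cases of specific data). It is safer to decode once and keep working with characters, for instance with a CharBuffer, which implements CharSequence.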
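To make the mismatch concrete, here is a minimal sketch. The class IndexMismatchDemo and the helper byteOffset are hypothetical names that are not part of the question's code, and the sample text is made up. The helper re-encodes the prefix of the string to find out how many bytes precede a given char index.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class IndexMismatchDemo {
    /** Number of bytes that the first charIndex chars of text occupy in the given charset. */
    static int byteOffset(String text, int charIndex, Charset charset) {
        return text.substring(0, charIndex).getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Three kanji (written as escapes) between ASCII tags: 15 chars in total.
        String text = "<doc>\u65E5\u672C\u8A9E</doc>\n";
        int charEnd = text.length();
        int byteEnd = byteOffset(text, charEnd, StandardCharsets.UTF_8);
        // Each kanji takes 3 bytes in UTF-8, so the byte offset is larger.
        System.out.println(charEnd + " chars, " + byteEnd + " bytes"); // prints "15 chars, 21 bytes"
    }
}
```

Limiting the byte buffer to the char index (15 here) would cut off the last 6 bytes, which matches the symptom of missing text at the end.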
Instead of writing the raw bytes to the FileChannel fc, write the extracted String str through a Writer that uses the intended charset.
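One way this could look is sketched below. It is not necessarily the exact code originally intended here. The class WriteSubstring and the method writeText are hypothetical names, the sample input is made up, and the charset should be whatever the output file is supposed to use.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class WriteSubstring {
    /** Encodes text with the given charset and writes it to target, replacing any old content. */
    static void writeText(Path target, String text, Charset charset) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, charset)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the decoded input; the kanji are written as escapes.
        String request = "junk<?xml version=\"1.0\"?>\n<doc>\u65E5\u672C\u8A9E</doc>\ntrailing";
        int i0 = request.indexOf("<?xml ");
        int i1 = request.indexOf("</doc>\n") + "</doc>\n".length();
        // The char indices are used only on the String, never on a byte buffer.
        writeText(Paths.get("test.xml"), request.substring(i0, i1), StandardCharsets.UTF_8);
    }
}
```

Because the indices are applied only to the String, multibyte characters no longer shift the end of the output.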
A CharBuffer version would need more rewriting, also touching the pattern matching.
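A rough sketch of what such a CharBuffer version might look like follows. The class CharBufferExtract and the method extract are hypothetical names, and the code assumes the whole input fits in memory and is decoded with a single charset. The patterns are the ones from the question, matched directly against the CharBuffer because it is a CharSequence.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharBufferExtract {
    static final Pattern START = Pattern.compile("<\\?xml ");
    static final Pattern END = Pattern.compile("</doc>\n");

    /** Returns the re-encoded bytes of the document, or null if either pattern is missing. */
    static ByteBuffer extract(ByteBuffer raw, Charset charset) {
        CharBuffer chars = charset.decode(raw);   // decode once
        Matcher start = START.matcher(chars);
        Matcher end = END.matcher(chars);
        // Search for the end marker only after the start marker.
        if (!start.find() || !end.find(start.end())) {
            return null;
        }
        // Both indices are char indices into the CharBuffer, so they are consistent.
        CharBuffer doc = chars.subSequence(start.start(), end.end());
        return charset.encode(doc);               // ready to be written to a channel
    }

    public static void main(String[] args) {
        String input = "junk<?xml version=\"1.0\"?>\n<doc>\u65E5\u672C\u8A9E</doc>\ntrailing";
        ByteBuffer raw = ByteBuffer.wrap(input.getBytes(StandardCharsets.UTF_8));
        ByteBuffer doc = extract(raw, StandardCharsets.UTF_8);
        System.out.println(doc.remaining() + " bytes extracted");
    }
}
```

The returned ByteBuffer could then be passed to fc.write(...) in place of data_buffer, since its position and limit already delimit exactly the encoded document.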