We want to guarantee that the consumer process reads the data created by the producer only once the producer has finished writing to the file in HDFS. The following is one approach used in an application, which we are trying to improve.
Producer:
private void produce(String file, int sleepSeconds) throws Exception {
    Configuration conf = new Configuration();
    conf.addResource(new Path(
            "C:\\dev\\software\\hadoop-0.22.0-src\\conf\\core-site.xml"));
    conf.set("fs.defaultFS", "hdfs://XXX:9000");
    FileSystem fileSystem = FileSystem.get(conf);
    Path path = new Path(file);
    if (fileSystem.exists(path)) {
        fileSystem.delete(path, false);
    }
    System.out.println("Creating file");
    FSDataOutputStream out = fileSystem.create(path);
    System.out.println("Writing data");
    out.writeUTF("--data--");
    System.out.println("Sleeping");
    Thread.sleep(sleepSeconds * 1000L);
    System.out.println("Writing data");
    out.writeUTF("--data--");
    System.out.println("Flushing");
    out.flush();
    out.close();
    fileSystem.close();
    System.out.println("Releasing lock on file");
}
Consumer:
private void consume(String file) throws Exception {
    Configuration conf = new Configuration();
    conf.addResource(new Path(
            "C:\\dev\\software\\hadoop-0.22.0-src\\conf\\core-site.xml"));
    conf.set("fs.defaultFS", "hdfs://XXX:9000");
    FileSystem fileSystem = FileSystem.get(conf);
    Path path = new Path(file);
    if (fileSystem.exists(path)) {
        System.out.println("File exists");
    } else {
        System.out.println("File doesn't exist");
        return;
    }
    FSDataOutputStream fsOut = null;
    while (fsOut == null) {
        try {
            fsOut = fileSystem.append(path);
        } catch (IOException e) {
            Thread.sleep(1000);
        }
    }
    FSDataInputStream in = fileSystem.open(path);
    OutputStream out = new BufferedOutputStream(System.out);
    byte[] b = new byte[1024];
    int numBytes = 0;
    while ((numBytes = in.read(b)) > 0) {
        out.write(b, 0, numBytes);
    }
    in.close();
    out.close();
    if (fsOut != null)
        fsOut.close();
    fileSystem.close();
    System.out.println("Releasing lock on file");
}
The requirements for how the processes should be run are as follows:
- The producer process (not thread) is started. The Thread.sleep simulates a bunch of database calls and business logic.
- The consumer process (not thread) is started on a different machine and blocks until the producer releases its lock. While the consumer reads, no other process should modify the data file.
Any advice on how to improve this code/design using the HDFS Java API, while still guaranteeing that the reader does not miss any data?
One solution is to write to a file with a temporary suffix / prefix, and rename the file once the writing is complete:
For example, to output to the file file1.txt:
- Write the data to .file1.txt or file1.txt.tmp
- When writing is complete, rename file1.txt.tmp to file1.txt
- The consumer waits for file1.txt to become available
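The steps above can be sketched roughly as follows. This is only an illustration of the pattern, not a drop-in replacement for the code in the question: it uses java.nio.file on a local file system as a stand-in so that it runs without a cluster, and the class name RenameOnComplete, the .tmp suffix, and the poll interval are assumptions made for the example. With the HDFS API the corresponding calls would be FileSystem.create for the temporary file, FileSystem.rename(Path, Path) to publish it, and FileSystem.exists(Path) in the consumer's polling loop. The producer runs in a thread here only to keep the demo in one file; in the scenario described it would be a separate process.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenameOnComplete {

    // Producer: write everything to "<name>.tmp", then publish it with one rename.
    static void produce(Path finalPath, String data) throws IOException {
        Path tmpPath = finalPath.resolveSibling(finalPath.getFileName() + ".tmp");
        // Files.write opens, writes, and closes the temporary file.
        Files.write(tmpPath, data.getBytes(StandardCharsets.UTF_8));
        // The final name appears only after the data is fully written.
        Files.move(tmpPath, finalPath, StandardCopyOption.ATOMIC_MOVE);
    }

    // Consumer: poll until the final name exists, then read the whole file.
    static String consume(Path finalPath, long pollMillis)
            throws IOException, InterruptedException {
        while (!Files.exists(finalPath)) {
            Thread.sleep(pollMillis);
        }
        return new String(Files.readAllBytes(finalPath), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("rename-demo");
        Path target = dir.resolve("file1.txt");

        // Simulated slow producer (stands in for the separate producer process).
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(500);
                produce(target, "--data----data--");
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Blocks until file1.txt has been published by the rename.
        System.out.println(consume(target, 100));
        producer.join();
    }
}
```

With this design the consumer no longer needs the append call to probe for a lock: it never sees file1.txt until the producer has finished and renamed it. The requirement that nothing modifies the file while the consumer reads is not enforced by the rename itself and would still rely on writers following the same convention.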