I was looking for a disk-intensive Hadoop application to test I/O activity in Hadoop, but I couldn't find one that keeps disk utilization above, say, 50%, or that otherwise keeps the disks busy. I tried randomwriter, but surprisingly it is not disk-I/O intensive.
So I wrote a tiny program that creates a file in the Mapper and writes some text into it. The application works, but disk utilization is high only on the master node, which is also the name node, the job tracker, and one of the slaves. Utilization is nil or negligible on the other task trackers, and I can't understand why disk I/O is so low there. Could anyone please nudge me in the right direction if I'm doing something wrong? Thanks in advance.
Here is the code segment I added to WordCount.java to create a file and write a UTF string into it:
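// Needs: org.apache.hadoop.conf.Configuration and
// org.apache.hadoop.fs.{FileSystem, Path, FSDataOutputStream}
Configuration conf = context.getConfiguration(); // reuse the job's configuration
FileSystem fs = FileSystem.get(conf);
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
    // Create a small scratch file per token; the relative path resolves
    // against the file system's working directory.
    Path outFile = new Path("./dummy" + context.getTaskAttemptID());
    FSDataOutputStream out = fs.create(outFile);
    try {
        out.writeUTF("helloworld");
    } finally {
        out.close(); // always release the stream, even if the write fails
    }
    fs.delete(outFile, false); // non-recursive delete; delete(Path) is deprecated
}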
OK, I should have checked this earlier. The actual problem was that not all of my data nodes were actually running. I reformatted the namenode and everything fell back into place; I am now getting a utilization of 15-20%, which is not bad for WordCount. I will run TestDFSIO next and see whether I can push disk utilization even higher.
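For a baseline outside Hadoop, something like the following could be run directly on a node to see how busy its local disk can get. This is only a minimal sketch under my own assumptions: the class name `DiskLoadSketch` and the method `writeBlocks` are hypothetical, it uses plain `java.nio` rather than any Hadoop API, and it writes to the local file system, not HDFS. Unlike the 10-byte `writeUTF` calls above, it writes whole blocks and forces each one to the storage device.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper (not part of Hadoop): generates sustained local disk writes.
final class DiskLoadSketch {

    // Writes `blocks` zero-filled blocks of `blockSize` bytes to `file`,
    // forcing each block to the storage device, and returns the bytes written.
    static long writeBlocks(Path file, int blocks, int blockSize) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(blockSize);
        long total = 0;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < blocks; i++) {
                buf.clear(); // reset position so the whole buffer is written again
                while (buf.hasRemaining()) {
                    total += ch.write(buf);
                }
                ch.force(false); // flush file data (not metadata) to the device
            }
        }
        return total;
    }
}
```

Calling, for example, `DiskLoadSketch.writeBlocks(somePath, 1024, 1 << 20)` would write about 1 GiB in 1 MiB blocks; watching a disk monitor while it runs gives a rough upper bound to compare the Hadoop job's utilization against.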