I want to iterate over a very large number of files stored in a deep folder hierarchy.
The files in question are 15 GB of MS Word documents that I intend to process with POI. POI works fine, but a simple recursive function throws an OutOfMemoryError:
public void checkDir(File dir) {
    File[] children = dir.listFiles();
    if (children == null)
        return; // Not a directory, or an I/O error occurred.
    // listFiles() never returns the "." and ".." aliases, so no check is needed.
    for (File child : children) {
        if (child.isFile())
            processFile(child); // do something
        else if (child.isDirectory())
            checkDir(child);
    }
}
// check if the word file can be read by POI
private void processFile(File file) {
    try {
        InputStream in = new FileInputStream(file);
        try {
            WordExtractor extractor = new WordExtractor(in);
            extractor.getText();
        } catch (Exception e) {
            // This can happen if the file has the "doc" extension, but is
            // not a Word document
            throw new Exception(file + " is not a doc", e);
        } finally {
            in.close();
        }
    } catch (Exception e) {
        // log the error to a file
        FileWriter fw = null;
        try {
            fw = new FileWriter("corruptFiles.txt", true);
            fw.write(file.getAbsolutePath() + "\r\n");
        } catch (Exception e2) {
            e2.printStackTrace();
        } finally {
            if (fw != null) {
                try {
                    fw.close();
                } catch (IOException e3) {
                    // nothing useful to do if closing the log fails
                }
            }
        }
    }
}
Trying to accomplish this with org.apache.commons.io.FileUtils.iterateFiles yields the same error:
String[] extensions = { "doc" };
Iterator<File> iter = FileUtils.iterateFiles(dir, extensions, true);
while (iter.hasNext()) {
    File file = iter.next();
    processFile(file); // do something
}
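For comparison, a walk that avoids recursion and keeps only the not-yet-visited directories in memory can be sketched as below. This is a minimal sketch under assumptions, not a drop-in fix: the class name IterativeWalk and the FileVisitor callback are hypothetical stand-ins for the surrounding class and processFile, and it uses only APIs available in Java 6. It does not guard against directory links that loop back on themselves.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeWalk {

    /** Hypothetical callback; stands in for processFile(File). */
    public interface FileVisitor {
        void visit(File file);
    }

    /**
     * Walks the tree below root without recursion and returns the
     * number of ".doc" files handed to the visitor.
     */
    public static int walk(File root, FileVisitor visitor) {
        Deque<File> pending = new ArrayDeque<File>(); // directories still to scan
        pending.push(root);
        int visited = 0;
        while (!pending.isEmpty()) {
            File dir = pending.pop();
            File[] children = dir.listFiles();
            if (children == null) {
                continue; // not a directory, or it could not be read
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    pending.push(child);
                } else if (child.isFile()
                        && child.getName().toLowerCase().endsWith(".doc")) {
                    visitor.visit(child);
                    visited++;
                }
            }
        }
        return visited;
    }
}
```

Calling walk(dir, visitor) with a visitor that forwards each file to processFile would take the place of checkDir. Only one directory listing plus the queue of pending directories is held at any time.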
I am running Java 6 on Windows 7 and am not allowed to move or rearrange the files.
What are my options?
Thank you for your time.
[EDIT] Added the processFile function. I just completed a successful run with a simplified version of processFile after increasing the heap size to 512 MB.
So my problem is related to POI and NOT to iterating over the files.
private void processFile(File file) {
    System.out.println(file);
}
[EDIT2] I narrowed the cause down to a single 33 MB file. Trying to parse it results in java.lang.OutOfMemoryError: Java heap space. I will post a ticket to the POI bug tracker. Thanks everybody for your suggestions.
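A possible stopgap while the POI issue is open is to skip files above a size cutoff and log them for separate handling. The following is only a sketch: the class name SizeGuard, the method isSafeToParse and the 16 MB cutoff are all assumptions for illustration, not anything POI prescribes, and a suitable cutoff depends on the heap actually available.

```java
import java.io.File;

public class SizeGuard {

    // Assumed cutoff, not a POI limit: tune it to the heap you actually have.
    public static final long MAX_BYTES = 16L * 1024 * 1024;

    /** Returns true if the path is a regular file no larger than maxBytes. */
    public static boolean isSafeToParse(File file, long maxBytes) {
        return file.isFile() && file.length() <= maxBytes;
    }
}
```

In the loop, files for which SizeGuard.isSafeToParse(file, SizeGuard.MAX_BYTES) returns false could be written to a separate log instead of being passed to processFile.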
I’ll accept MathAsmLang’s answer as that helped to overcome the iteration problem.
I would have accepted krishnakumarp’s comment as an answer, as he was the first one to point that out, but that proved to be impossible 😉
Because it is an OutOfMemoryError, you should try starting the JVM with different memory parameters, i.e. a larger heap size.
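The maximum heap size is set with the -Xmx launcher option, for example -Xmx512m for 512 MB. To confirm which limit the running JVM actually picked up, something like the following can help. It is a small sketch, and the class name HeapInfo is made up for illustration.

```java
public class HeapInfo {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

Run it as java -Xmx512m HeapInfo. The reported value may come out somewhat below 512, because the JVM can exclude part of the heap from this figure depending on the garbage collector in use.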