I am trying to read a file in parallel from FTP using MapReduce. I have working code that reads a file and performs a word count on it. However, it fails when the input is large (over 2 MB, to be specific).
It stops with a Spill 0 completed message, then Map 100% Reduce 0%, and then a connection closed by server error.
I don't quite get it. What does Spill 0 mean? Why does the code fail for large inputs? How can I split the input and provide it to the mappers? Will that help?
Can I extend the FileInputFormat class to work this out?
Thanks 🙂
Yes, you can implement your own InputFormat. Apart from FileInputFormat there are several others in Hadoop, such as TextInputFormat, KeyValueTextInputFormat, etc. You can also define how a record is read from a split; for that you need to implement your own RecordReader. See http://developer.yahoo.com/hadoop/tutorial/module4.html
For instance, the default InputFormat is TextInputFormat, which reads a file and uses a LineRecordReader to get records line by line. If you are reading structured data from a file, you could implement your own RecordReader so that each record is one data structure from that file.
In any case, running a MapReduce job to read a file from FTP is really strange. Hadoop works because the data is stored on the Hadoop Distributed File System (HDFS), a distributed filesystem where each file is divided into blocks that are spread across the nodes of the cluster, so mappers can process those blocks in parallel. The way you should approach this, IMHO, is to download that file to HDFS and then execute your MapReduce job.
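To make the split and record-reader idea above more concrete, here is a minimal plain-Java sketch (not Hadoop code, and not a drop-in RecordReader). The class and method names (SplitLineReader, computeSplits, readLines) are made up for illustration, and the input is held in a byte array for simplicity. It follows the convention a line-oriented reader typically uses so that no line is lost or read twice: a split that does not start at byte 0 skips its first (possibly partial) line, and every split keeps reading until it has finished the line that starts at or before its end offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: splits a byte range and reads
// whole lines per split, the way a line-oriented record reader would.
class SplitLineReader {

    // Cuts [0, length) into consecutive ranges {start, end} of at most splitSize bytes.
    static List<long[]> computeSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < length; start += splitSize) {
            splits.add(new long[] {start, Math.min(start + splitSize, length)});
        }
        return splits;
    }

    // Returns the lines owned by the split [start, end).
    // Assumption: the data fits in memory, so offsets can be cast to int.
    static List<String> readLines(byte[] data, long start, long end) {
        int pos = (int) start;
        if (start != 0) {
            // The previous split reads the line that covers this offset, so skip it.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it starts at or before 'end'.
        while (pos < data.length && pos <= end) {
            int next = nextLineStart(data, pos);
            // Drop the trailing newline, if the line has one.
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Index of the first byte after the next '\n' at or after 'from' (or data.length).
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (long[] split : computeSplits(data.length, 7)) {
            System.out.println("split [" + split[0] + ", " + split[1] + ") -> "
                    + readLines(data, split[0], split[1]));
        }
    }
}
```

Each split can be handed to a different worker, and together the workers see every line exactly once even though the byte boundaries fall in the middle of lines. This only pays off when every worker can cheaply seek to its own offset, which is what HDFS blocks give you and what a single FTP stream does not.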