I am trying to export data from a DB table and write it into HDFS.
My question is: will the NameNode become a bottleneck? And how does the mechanism work: will the NameNode cache a slice (64 MB) and then hand it to a DataNode?
Also, is there a better way than writing to HDFS directly? I think that doesn't take advantage of parallelism.
Thanks:)
Have you considered using Sqoop? Sqoop can extract data from any DB that supports JDBC and put it in HDFS.
http://www.cloudera.com/blog/2009/06/introducing-sqoop/
The Sqoop import command takes the number of map tasks to run (the default depends on the version; Sqoop 1.4.x uses 4). When parallelizing the work (map tasks > 1), you can specify the splitting column, or Sqoop will pick one based on the table's primary key. Each map task writes its results to a separate file in the target directory. The NN will not be a bottleneck unless the number of files created is huge (the NN keeps the metadata about files in memory).
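To make the splitting idea concrete, here is a rough Python sketch of how a numeric key range could be divided among map tasks. It is only an illustration of the technique, not Sqoop's actual code; the function names `split_ranges` and `split_queries`, and the table and column names in the usage note, are made up for this example.

```python
def split_ranges(lo, hi, num_tasks):
    """Divide the inclusive key range [lo, hi] into at most num_tasks
    half-open ranges [start, end) of roughly equal size."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    total = hi - lo + 1  # number of integer key values to cover
    # Evenly spaced boundaries; the last boundary is hi + 1 (exclusive).
    bounds = [lo + (total * i) // num_tasks for i in range(num_tasks + 1)]
    # Drop empty ranges, which appear when there are more tasks than keys.
    return [(bounds[i], bounds[i + 1])
            for i in range(num_tasks)
            if bounds[i] < bounds[i + 1]]


def split_queries(table, column, lo, hi, num_tasks):
    """Build one SELECT per map task (hypothetical helper, illustration only)."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {start} AND {column} < {end}"
        for start, end in split_ranges(lo, hi, num_tasks)
    ]
```

For example, `split_ranges(0, 1000, 4)` yields four ranges, and each map task would run one of the queries from `split_queries("sometable", "id", 0, 1000, 4)` and write its rows to its own output file. An evenly spaced split like this works well only when the key values are distributed fairly uniformly.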
Sqoop can also recognize the source DB (Oracle, MySQL or others) and use DB-specific tools such as mysqldump to do the import instead of the JDBC channel, for better performance.
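As a sketch of how these options fit together, the small Python helper below assembles a `sqoop import` command line. The flag names (`--connect`, `--table`, `--target-dir`, `--num-mappers`, `--split-by`, `--direct`) are the ones from the Sqoop 1.4.x user guide; older releases may spell them differently, so check your version. The helper name `build_sqoop_import` and the JDBC URL, table, and paths in the usage note are hypothetical.

```python
def build_sqoop_import(jdbc_url, table, target_dir,
                       num_mappers=4, split_by=None, direct=False):
    """Return the argument list for a `sqoop import` run (not executed here)."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC connection string
        "--table", table,                   # source table to import
        "--target-dir", target_dir,         # HDFS directory for the output files
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]
    if split_by:
        # Column used to divide the work; otherwise Sqoop picks one itself.
        cmd += ["--split-by", split_by]
    if direct:
        # Ask Sqoop to use the DB-specific fast path instead of plain JDBC.
        cmd.append("--direct")
    return cmd
```

Calling `build_sqoop_import("jdbc:mysql://db.example.com/sales", "orders", "/data/orders", num_mappers=8, split_by="id", direct=True)` returns a list you could pass to `subprocess.run(cmd, check=True)` on a machine where Sqoop is installed.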