I always use “dfs -get” or “dfs -cat”, but I imagine there might be something better. With “dfs -cat | pv”, it appears my network connection isn’t saturating (I’m getting only 20MB/sec). Is there a way to parallelize it, maybe?
“dfs -cat” has to shuttle all the data through a single process, so it gets very little parallelism. What I’ve done is run a mapper-only streaming job that dumps to scratch space on each disk, and then rsync the results back to a single machine. Both parts do a good job of exercising the cluster to the full, and since rsync is nicely idempotent you can start it going at the same time as the hdfs->local part.
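If a streaming job plus rsync is more machinery than you want, a lighter variant of the same idea is to run several “hadoop fs -get” processes at once, one per part file, from the receiving machine. This is not the mapper-only approach described above, just a rough sketch of client-side parallelism. It assumes the data is already split into multiple files, that the “hadoop” command is on the PATH, and that the function names, paths and worker count are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def build_get_command(hdfs_path, local_dir):
    """Return the argv that copies one HDFS file into local_dir."""
    name = PurePosixPath(hdfs_path).name
    return ["hadoop", "fs", "-get", hdfs_path, f"{local_dir.rstrip('/')}/{name}"]


def fetch_all(hdfs_paths, local_dir, workers=8, run=subprocess.check_call):
    """Copy every path in hdfs_paths concurrently and return the commands used.

    `run` is injectable so the function can be exercised without a cluster;
    by default each command is executed as a separate hadoop process.
    """
    commands = [build_get_command(p, local_dir) for p in hdfs_paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation, so a failed copy raises here.
        list(pool.map(run, commands))
    return commands


# Example call (hypothetical paths, needs a working hadoop client):
# fetch_all([f"/data/job/part-{i:05d}" for i in range(16)], "/tmp/job")
```

Each copy is its own process talking to the datanodes, so the threads only wait on subprocesses. Whether this saturates the link depends on how many files there are and where their blocks live, so the worker count is something to tune rather than a recommended value.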