I always use dfs -get or dfs -cat, but I imagine there might be

Question

Asked: May 19, 2026 (2026-05-19T22:11:32+00:00)

I always use “dfs -get” or “dfs -cat”, but I imagine there might be something better. With “dfs -cat | pv”, my network connection does not appear to be saturated (I’m getting only 20 MB/sec). Is there a way to parallelize the transfer?

1 Answer

Editorial Team · Answer 1 · 2026-05-19T22:11:32+00:00

dfs -cat has to funnel all the data through a single client process, so it gets very little parallelism.

What I’ve done is run a mapper-only streaming job that dumps to scratch space on each disk, and then rsync the results back to a single machine. Both parts do a good job of exercising the cluster to the full, and since rsync is idempotent you can start it while the HDFS-to-local part is still running.
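As a rough illustration of the two steps described above, here is a minimal Python sketch. It is not the answerer's actual script: the function names, the `part-<host>-<pid>` file naming, and the example paths are all assumptions made for illustration. `dump_stream` is what a streaming mapper could call on its standard input to write whatever records the job feeds it into node-local scratch space, and `rsync_command` builds the argument list for pulling one node's scratch directory back to the collecting machine.

```python
import os
import shutil
import socket


def dump_stream(source, scratch_dir):
    """Copy a binary stream to a uniquely named file under scratch_dir.

    In a streaming mapper, `source` would be sys.stdin.buffer.
    The scratch_dir location is hypothetical; use a local disk path.
    """
    os.makedirs(scratch_dir, exist_ok=True)
    # Host name plus PID keeps concurrent mappers from overwriting each other.
    name = "part-{}-{}".format(socket.gethostname(), os.getpid())
    target = os.path.join(scratch_dir, name)
    with open(target, "wb") as out:
        shutil.copyfileobj(source, out)
    return target


def rsync_command(host, scratch_dir, dest_dir):
    """Build the rsync argv that pulls one node's scratch directory.

    -a preserves file attributes; --partial keeps interrupted transfers
    so a rerun can pick up where it left off.
    """
    remote = "{}:{}/".format(host, scratch_dir.rstrip("/"))
    return ["rsync", "-a", "--partial", remote, dest_dir]
```

A mapper script would import `sys` and call `dump_stream(sys.stdin.buffer, "/scratch/hdfs-dump")`, and the collecting machine would pass the result of `rsync_command("node1", "/scratch/hdfs-dump", "/data/out")` to `subprocess.run`, once per node. Because rerunning rsync only transfers what is missing or changed, the pull can be repeated in a loop while the mappers are still writing.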
