I wonder if it is possible to install a “background” hadoop cluster. I mean, after all it is meant to be able to deal with nodes being unavailable or slow sometimes.
So assuming some university has a computer lab. Say, 100 boxes, all with upscale desktop hardware, gigabit etherner, probably even identical software installation. Linux is really popular here, too.
However, these 100 boxes are of course meant to be desktop systems for students. There are times where the lab will be full, but also times where the lab will be empty. User data is mostly stored on a central storage – say NFS – so the local disks are not used a lot.
Sounds like a good idea to me to use the systems as Hadoop cluster in their idle time. The simplest setup would be of course to have a cron job start the cluster at night, and shut down in the morning. However, also during the day many computers will be unused.
However, how would Hadoop react to e.g. nodes being shut down when any user logs in? Is it possible to easily “pause” (preempt!) a node in hadoop, and moving it to swap when needed? Ideally, we would give Hadoop a chance to move away the computation before suspending the task (also to free up memory). How would one do such a setup? Is there a way to signal Hadoop that a node will be suspended?
As far as I can tell, datanodes should not be stopped, and maybe replication needs to be increased to have more than 3 copies. With YARN there might also be a problem that by moving the task tracker to an arbitrary node, it may be the one that gets suspended at some point. But maybe it can be controlled that there is a small set of nodes that is always on, and that will run the task trackers.
Is it appropriate to just stop the tasktracker or send a SIGSTOP (then resume with SIGCONT)? The first would probably give hadoop the chance to react, the second would continue faster when the user logs out soon (as the job can then continue). How about YARN?
First of all, hadoop doesn’t support ‘preempt’, how you described it.
Hadoop simply restarts task, if it detects, that task tracker dead.
So in you case, when user logins into host, some script simply kills
tasktracker, and jobtracker will mark all mappers/reducers, which were run
on killed tasktracker, as FAILED. After that this tasks will be rescheduled
on different nodes.
Of course such scenario is not free. By design, mappers and reducers
keep all intermediate data on local hosts. Moreover, reducers fetch mappers
data directly from tasktrackers, where mappers was executed. So, when
tasktracker will be killed, all those data will be lost. And in case
of mappers, it is not a big problem, mapper usually works on relatively
small amount of data (gigabytes?), but reducer will suffer greater.
Reducer runs shuffle, which is costly in terms of network bandwidth and
cpu. If tasktracker runs some reducer, restart of this reducer means,
that all data should be redownloaded once more onto new host.
And I recall, that jobtracker doesn’t see immediately, that
tasktracker is dead. So, killed tasks shouldn’t restart immediately.
If you workload is light, datanodes can live forever, don’t put them offline,
when user login. Datanode eats small amount of memory (256M should be enough
in case small amount of data) and if you workload is light, don’t eat much
of cpu and disk io.
As conclusion, you can setup such configuration, but don’t rely on
good and predictable job execution on moderated workloads.