I am about to embark on writing a system that needs to rebalance its load distribution among the remaining nodes once one or more of the nodes involved fail. Does anyone have any good references on what to avoid and what works?
In particular, I’m curious how one should start building such a system so that it can be unit-tested.
This question smells like my distributed systems class. So I feel I should point out the textbook we used.
It covers many aspects of distributed systems at an abstract level, so a lot of its content would apply to what you’re going to do.
It does a pretty good job of pointing out pitfalls and common mistakes, as well as giving possible solutions.
The first edition is available for free download from the authors.
The book doesn’t really cover unit-testing of distributed systems, though. I could see an entire book being written on just that.
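One starting point that tends to help with testability (my own suggestion, not something from the book) is to keep the rebalancing decision in a pure function with no networking in it, so a unit test can feed it a cluster state and a set of failed nodes and check the result. Here is a minimal sketch in Python. The `rebalance` function, the node and task names, and the "give each orphaned task to the least-loaded survivor" policy are all assumptions for illustration, not a recommendation of a particular algorithm.

```python
def rebalance(assignments, failed):
    """Return a new node -> tasks mapping with failed nodes removed.

    assignments: dict mapping node name -> list of task names (hypothetical model)
    failed: set of node names considered dead
    """
    # Copy the surviving nodes so the caller's data is never mutated.
    survivors = {n: list(t) for n, t in assignments.items() if n not in failed}
    if not survivors:
        raise ValueError("no surviving nodes to take over the load")

    # Collect orphaned tasks in a deterministic order (sorted by node name).
    orphaned = [
        task
        for node in sorted(assignments)
        if node in failed
        for task in assignments[node]
    ]

    # Hand each orphaned task to the currently least-loaded survivor.
    # Ties are broken by node name so the result is reproducible in tests.
    for task in orphaned:
        target = min(sorted(survivors), key=lambda n: len(survivors[n]))
        survivors[target].append(task)
    return survivors


cluster = {"a": ["t1", "t2"], "b": ["t3"], "c": ["t4", "t5"]}
print(rebalance(cluster, {"c"}))
```

Because the function is deterministic and touches no sockets or clocks, failure detection and message passing can be tested separately (or faked), and the policy itself can be checked with plain assertions.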