If it possible to provide a service to multiple clients whereby if the server providing this service goes down, another one takes it’s place- without some sort of centralised “control” which detects whether the main server has gone down and to redirect the clients to the new server?
Is it possible to do without having a centralised interface/gateway?
In other words, its a bit like asking can you design a node balancer without having a centralised control to direct clients?
This answer is a general overview to high availability for networked applications, not specific to Erlang. I don’t know too much about what is available in the OTP framework yet because I am new to the language.
There are a few different problems here:
Problem 1 – Moving client connection
This may be solved in many different ways and on different layers of the network architecture. The easiest thing is to code it right into the client, so that when a connection is lost it reconnects to another machine.
If you need network transparency you may use some technology to sync TCP states between different machines and then reroute all traffic to the new machine, which may be entirely invisible for the client. This is much harder to do than the first suggestion.
I’m sure there are lots of things to do in-between these two.
Problem 2 – State data
You obviously need to transfer the session state from the crashed machine unto the backup machine. This is really hard to do in a reliable way and you may lose the last few transactions because the crashed machine may not be able to send the last state before the crash. You can use a synchronized call in this way to be really sure about not losing state:
This may potentially be expensive (or at least not responsive enough) in some scenarios since you depend on the backup machine and the connection to it, including latency, before even confirming anything to the client. To make it perform better you can let the client check with the backup machine upon connection what transactions it received and then resend the lost ones, making it the client’s responsibility to queue the work.
Problem 3 – Detecting a crash
This is an interesting problem because a crash is not always well-defined. Did something really crash? Consider a network program that closes the connection between the client and server, but both are still up and connected to the network. Or worse, makes the client disconnect from the server without the server noticing. Here are some questions to think about: