Let's say I have 100 servers, each running a daemon (let's call it server). That daemon is responsible for spawning a thread for each user of this particular service (let's say 1000 threads per server). Every N seconds, each thread does some work and fetches information for that particular user (this request/response model cannot be changed). The problem I have is that sometimes a thread hangs and stops doing anything. I need some way to know that a user's data is stale and needs to be refreshed.
The only idea I have is to have each thread update a MySQL record associated with that user every 5N seconds (a last_scanned column in the users table), and to have another process check that table every 15N seconds. If the last_scanned column is not current, that process restarts the thread.
The general way to handle this is to have the threads report their status back to the server daemon. If you haven’t seen a status update within the last 5N seconds, then you kill the thread and start another.
You can keep track of the current active threads that you’ve spun up in a list, then just loop through them occasionally to determine state.
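As a rough illustration of that idea, here is a minimal Python sketch. The names Worker, Supervisor, check, and the task callable are all made up for this example and are not from your code. One caveat: Python has no supported way to forcibly kill a thread, so this sketch asks the stalled worker to stop via an Event and starts a replacement; a truly hung thread is simply abandoned until it returns.

```python
import threading
import time


class Worker:
    """One per-user thread that records a heartbeat after each cycle."""

    def __init__(self, user_id, interval, task):
        self.user_id = user_id
        self.interval = interval            # N seconds between cycles
        self.task = task                    # hypothetical callable(user_id)
        self.last_beat = time.monotonic()   # time of the last status update
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def _run(self):
        while not self.stop.is_set():
            self.task(self.user_id)
            self.last_beat = time.monotonic()   # report status to the daemon
            self.stop.wait(self.interval)       # sleep, but wake early on stop


class Supervisor:
    """Keeps the active workers and replaces any that have gone quiet."""

    def __init__(self, interval, task, timeout_factor=5):
        self.interval = interval
        self.task = task
        self.timeout = timeout_factor * interval   # e.g. 5N seconds
        self.workers = {}                          # user_id -> Worker

    def add_user(self, user_id):
        worker = Worker(user_id, self.interval, self.task)
        self.workers[user_id] = worker
        worker.start()

    def check(self, now=None):
        """Replace stale or dead workers; return the user ids restarted."""
        now = time.monotonic() if now is None else now
        restarted = []
        for user_id, worker in list(self.workers.items()):
            stale = now - worker.last_beat > self.timeout
            if stale or not worker.thread.is_alive():
                worker.stop.set()        # cooperative stop, not a real kill
                self.add_user(user_id)   # start a fresh worker for this user
                restarted.append(user_id)
        return restarted
```

The daemon would call check() on a timer, for example once every N seconds. Using time.monotonic() rather than wall-clock time keeps the staleness test from being thrown off by clock adjustments.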
Of course, you should also fix the errors in your program that are causing threads to hang or exit prematurely.
Premature exits and killing a thread could also leave your program in an unexpected, non-atomic state. You should probably also have the server daemon run a cleanup process that makes sure any items in your queue, or whatever you’re using to determine the workload, get reset after a certain period of inactivity.
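A cleanup pass along those lines might look like the sketch below. It assumes, purely for illustration, that each unit of work is tracked as a WorkItem with a claimed_at timestamp; WorkItem, reset_stale_items, and max_idle are hypothetical names, and your actual workload tracking may look quite different.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    user_id: str
    claimed_at: Optional[float] = None   # None means the item is unclaimed


def reset_stale_items(items, max_idle, now=None):
    """Release items claimed more than max_idle seconds ago.

    Returns the user ids of the items that were reset.
    """
    now = time.monotonic() if now is None else now
    released = []
    for item in items:
        if item.claimed_at is not None and now - item.claimed_at > max_idle:
            item.claimed_at = None       # make the item available again
            released.append(item.user_id)
    return released
```

A worker would set claimed_at when it picks an item up and clear it when it finishes, so anything still claimed after max_idle seconds was most likely held by a thread that hung or died.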