I have a distributed app where resources get locked for exclusive use by tasks.

Question

Asked: May 20, 2026 (2026-05-20T18:19:54+00:00)

I have a distributed app where resources get locked for exclusive use by tasks. Each task runs in its own process. I’d like to automatically unlock resources if a task process exits or the server it’s running on dies (e.g., power failure).

How could I remotely detect such a process exit/failure within a few seconds?

After some Googling I came up with a few ideas, but I don’t have direct experience with any of them…

Use the advisory lock functions built into MySQL (GET_LOCK) or PostgreSQL (pg_advisory_lock). These would automatically release the locks if the database connection closed, which would happen on a process exit or server crash.
Use a dedicated distributed lock manager, like ZooKeeper. This would work, but it seems like more than I need.
Make a TCP connection from the task process to a remote monitoring process with the TCP/socket keepalive option enabled. This seems doable, but I’d rather build on something that takes care of the low-level network details for me.

Another thought was to split the problem up. Since server crashes are fairly uncommon, I could use a local watchdog process to monitor for process exits and then use something else to monitor for server crashes.

Thanks for any feedback!

1 Answer

Editorial Team · Answer 1 · 2026-05-20T18:19:55+00:00

You may want to read the paper “The ϕ Accrual Failure Detector”. I found it to be the most generic and theoretically sound approach to failure detection. It is never simply a question of “detecting failures within seconds”; there is always a trade-off between how fast and how reliable your failure detection is. By knowing how to collect and process statistics on failures that were correctly or mistakenly detected in the past, you can estimate the probability of failure as a function of the time you have been waiting for a response from the remote server.
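To make the idea concrete, here is a minimal sketch of an accrual-style detector in Python. It is not a reference implementation: the class name, the window size, the `min_std` floor, and the choice of a normal distribution for heartbeat intervals are all assumptions made for illustration. The monitor records when heartbeats arrive and reports a suspicion level phi that grows the longer the next heartbeat is overdue.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative accrual failure detector (hypothetical names and defaults)."""

    def __init__(self, window=100, min_std=0.05):
        # Sliding window of recent inter-arrival times, in seconds.
        self.intervals = deque(maxlen=window)
        self.last_arrival = None
        # Assumed floor on the standard deviation, so perfectly regular
        # heartbeats do not cause a division by zero.
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely failed."""
        if not self.intervals:
            return 0.0  # not enough history to judge yet
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that a heartbeat would still arrive later than
        # `elapsed`, assuming normally distributed intervals.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) on underflow
        return -math.log10(p_later)
```

The caller picks a threshold on phi rather than a fixed timeout: a low threshold releases locks sooner but suspects healthy tasks more often, while a high one does the opposite. That threshold is where the speed versus reliability trade-off is made explicit.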

TCP keep-alive with default settings is useless here: its “ping” is too coarse, typically starting only after 2 hours of idle time.
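The 2-hour figure is only the default idle time. On Linux the keep-alive timers can be lowered per socket, which is sketched below in Python. The helper name and the chosen values are assumptions for illustration, and the `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are platform-specific, so the code checks for them before use. Whether a tuned keep-alive is reliable enough for releasing locks is a separate question from whether it can be made fast.

```python
import socket


def enable_fast_keepalive(sock, idle_s=2, interval_s=1, probes=3):
    """Turn on TCP keep-alive with short timers (hypothetical helper).

    idle_s:     seconds of idle time before the first probe
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timer options below exist on Linux; other platforms
    # may lack them or use different names, hence the hasattr checks.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

With the values assumed here, an idle connection to a dead peer would be dropped after roughly `idle_s + interval_s * probes` seconds, at which point a blocked read on the monitoring side fails and the lock can be released.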
