I have a PostgreSQL 9.1.3 streaming replication setup on Ubuntu 10.04.2 LTS (primary and standby). Replication is initialized with a streamed base backup (pg_basebackup). The restore_command script tries to fetch the required WAL archives from a remote archive location with rsync.
Everything works as described in the documentation when the restore_command script fails with an exit code other than 255:
At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.
But when the restore_command script fails with exit code 255 (because the script returns the exit code of the failed rsync call), the server process dies with the following error:
2012-05-09 23:21:30 CEST - @ LOG: database system was interrupted; last known up at 2012-05-09 23:21:25 CEST
2012-05-09 23:21:30 CEST - @ LOG: entering standby mode
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
2012-05-09 23:21:30 CEST - @ FATAL: could not restore file "00000001000000000000003D" from archive: return code 65280
2012-05-09 23:21:30 CEST - @ LOG: startup process (PID 8184) exited with exit code 1
2012-05-09 23:21:30 CEST - @ LOG: aborting startup due to startup process failure
So my question is: is this a bug, is there a special meaning of exit code 255 that is missing from the otherwise excellent documentation, or am I missing something else here?
On the primary server, you have WAL files sitting in the pg_xlog/ directory. While WAL files are there, PostgreSQL is able to deliver them to the standby should they be requested.
Typically, you also have a local archived WAL location. Once PostgreSQL moves files there, they can no longer be delivered to the standby online, and the standby expects them to come from the archived WAL location via restore_command.
If the primary and the standby are set up with different locations for archived WALs, then there is no way for a WAL file to reach the standby, and you have a gap.
In your case this might mean that:
- 00000001000000000000003D had been archived by the primary PostgreSQL;
- restore_command doesn't see it in the configured source location.
You might consider manually copying the missing WAL files from the primary to the standby using scp or rsync. It might also be necessary to review your WAL locations and make sure both servers look in the same place.
EDIT:
Grepping for restore_command in the sources, only access/transam/xlog.c references it. In the function RestoreArchivedFile, almost at the end (around line 3115 in the 9.1.3 sources), there's a check whether restore_command exited normally or received a signal.
In the first case, the message is classified as DEBUG2. If restore_command received a signal other than SIGTERM (and wasn't able to handle it properly, I guess), a FATAL error is reported. This is also true for all exit codes greater than 125.
I will not be able to tell you why, though.
I recommend asking on the hackers list.
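To make the numbers concrete, here is a rough sketch of the rule described in the EDIT. The "return code 65280" in your log looks like the raw wait status rather than the exit code itself: on Linux the exit code sits in bits 8-15, and 65280 >> 8 = 255. The function names below are made up for illustration and are not part of PostgreSQL. The last function assumes you would rather have the standby treat an rsync/ssh connection failure (255) as an ordinary "file not available" result, which may or may not be what you want in your setup.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Extract the exit code from a raw wait status such as the one in the log
# (on Linux the exit code occupies bits 8-15 of the status).
exit_code_from_status() {
    echo $(( ($1 >> 8) & 255 ))
}

# Severity the standby would log for a given exit code, following the rule
# described above: anything greater than 125 is reported as FATAL.
severity_for_exit_code() {
    if [ "$1" -gt 125 ]; then
        echo FATAL
    else
        echo DEBUG2
    fi
}

# Assumption: map only exit code 255 (connection failure passed through by
# rsync) to 1, and pass every other code through unchanged.
soften_exit_code() {
    if [ "$1" -eq 255 ]; then
        echo 1
    else
        echo "$1"
    fi
}

exit_code_from_status 65280                        # prints 255
severity_for_exit_code 255                         # prints FATAL
severity_for_exit_code "$(soften_exit_code 255)"   # prints DEBUG2
```

In a restore_command script, the idea would be to capture rsync's exit code right after the call and exit with the softened value instead of returning it as-is. Note that this would also hide a real connection problem from PostgreSQL, so logging the original code somewhere seems sensible.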