Scenario:
- We have a scheduler which is using JDBC Job Store. Quartz version is 2.1.2.
- The job being scheduled also updates a database.
- Quartz and the job use the same database, hosted on MySQL Server. The application tables and the Quartz tables are stored side by side in it.
- The application and Quartz use separate connection pools. The application uses Spring for connection pooling, and Quartz is forced to use its own pool via quartz.properties. Here is the relevant snippet of quartz.properties:

    org.quartz.dataSource.qzDS.driver = com.mysql.jdbc.Driver
    org.quartz.dataSource.qzDS.URL = jdbc:mysql://localhost:3306/dbname?autoReconnect=true
    org.quartz.dataSource.qzDS.user = dbuser
    org.quartz.dataSource.qzDS.password = dbpassword
    org.quartz.dataSource.qzDS.maxConnections = 30
    org.quartz.datasource.qzDS.validationQuery = select 1
    #org.quartz.datasource.qzDS.minEvictableIdleTimeMillis=21600000
    #org.quartz.datasource.qzDS.timeBetweenEvictionRunsMillis=1800000
    #org.quartz.datasource.qzDS.numTestsPerEviction=-1
    #org.quartz.datasource.qzDS.testWhileIdle=true
    org.quartz.datasource.qzDS.debugUnreturnedConnectionStackTraces=true
    org.quartz.datasource.qzDS.unreturnedConnectionTimeout=120
    org.quartz.datasource.qzDS.initialPoolSize=5
    org.quartz.datasource.qzDS.minPoolSize=5
    org.quartz.datasource.qzDS.maxPoolSize=30
    org.quartz.datasource.qzDS.acquireIncrement=5
    org.quartz.datasource.qzDS.maxIdleTime=120
    org.quartz.datasource.qzDS.validateOnCheckout=true

- The database is clustered with master-master replication on two servers, which are accessed through a virtual IP everywhere in the application and in Quartz.
- The scheduler (Quartz) is also clustered, on the same two machines that host the MySQL cluster.
The problem:
One of the servers (so far only the backup server) occasionally throws a database connection error while calling the notifyJobStoreJobComplete method. As a result the job stays in the BLOCKED state: the job itself completed successfully, but Quartz was unable to update its status.
Questions:
- What can be the cause of the problem?
- How can we move the BLOCKED jobs back into the WAITING state, so that they at least run at their next scheduled time? Directly editing the QRTZ_SIMPLE_TRIGGERS table would not be a good solution, even if it works.
EDIT: To bump up the question.
I think the main problem was a communication link failure from MySQL, which we solved by increasing ‘wait_timeout’ to 14 days. Our maintenance is scheduled every 15 days, and during it we restart each MySQL server in our DB cluster (we have master-master replication in place). With this approach we haven’t had any communication link failures since. In fact, sometimes we don’t restart the servers every 15 days and still see no errors (touch wood). 🙂
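For reference, wait_timeout is expressed in seconds, so "14 days" has to be converted before it is applied. The snippet below is only a minimal sketch of that arithmetic; the class and method names (WaitTimeout, seconds, setGlobalSql) are made up for illustration and are not part of MySQL or Quartz.

```java
import java.time.Duration;

/** Hypothetical helper: converts a number of days into a MySQL wait_timeout value. */
final class WaitTimeout {

    /** wait_timeout is given in seconds, so convert days to seconds. */
    static long seconds(int days) {
        return Duration.ofDays(days).getSeconds();
    }

    /** Builds the statement that changes the server-wide value for new connections. */
    static String setGlobalSql(int days) {
        return "SET GLOBAL wait_timeout = " + seconds(days);
    }
}
```

For 14 days this gives 1209600 seconds. A value set with SET GLOBAL only applies to connections opened afterwards and is lost when the server restarts, so it would also need to go into the MySQL server configuration file to survive the maintenance restarts.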
As for the Quartz triggers getting stuck in the BLOCKED state, we updated Quartz to 2.1.4, which possibly has a fix for almost the same problem. Since the update, triggers end up in the BLOCKED state far less frequently.
We are still unable to find out how to get a trigger out of the BLOCKED state without directly modifying the Quartz tables. Whenever we face this problem, we manually remove the entry for the BLOCKED trigger from the qrtz_fired_triggers table, and that solves the problem. I think the enterprise version of Quartz may offer this feature through some web UI.
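The manual cleanup described above could be scripted roughly as follows. This is a minimal sketch, not a supported Quartz API: it assumes the default QRTZ_ table prefix and the Quartz 2.x schema, in which QRTZ_FIRED_TRIGGERS rows carry SCHED_NAME, TRIGGER_NAME and TRIGGER_GROUP columns. The class and method names (FiredTriggerCleanup, deleteSql, removeFiredTrigger) are hypothetical. It still edits the Quartz tables directly, so it has the same drawbacks as doing it by hand.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper that removes the stale fired-trigger row; not part of Quartz. */
final class FiredTriggerCleanup {

    /**
     * Builds the DELETE statement. The table prefix is taken from trusted
     * configuration (for example "QRTZ_"), never from user input.
     */
    static String deleteSql(String tablePrefix) {
        return "DELETE FROM " + tablePrefix + "FIRED_TRIGGERS"
                + " WHERE SCHED_NAME = ? AND TRIGGER_NAME = ? AND TRIGGER_GROUP = ?";
    }

    /** Deletes the fired-trigger entries for one trigger and returns the row count. */
    static int removeFiredTrigger(Connection conn, String tablePrefix, String schedName,
                                  String triggerName, String triggerGroup) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(deleteSql(tablePrefix))) {
            ps.setString(1, schedName);
            ps.setString(2, triggerName);
            ps.setString(3, triggerGroup);
            return ps.executeUpdate();
        }
    }
}
```

A call would look like removeFiredTrigger(conn, "QRTZ_", schedulerName, triggerName, triggerGroup), where the last three values are placeholders for the affected scheduler and trigger. It is probably safest to run it only after confirming that the job is not actually executing on either node.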