Monitoring reconciliation jobs using the fail-safe mechanism

From Service Pack 1 onward, this feature is available only when you configure a value for the Job Idle Time (Minutes) parameter.

The fail-safe mechanism in the reconciliation engine monitors the progress of all running jobs (scheduled, continuous, and non-continuous). When it detects an unresponsive job, it automatically restarts the job.

For non-continuous jobs, the mechanism stops the current run and starts a new run. For continuous jobs, it stops the current run and lets the next run start when the continuous interval of that job has elapsed.

For example, if the job idle time is set to 500 minutes, the fail-safe mechanism checks all jobs every 500 minutes. If it detects a job that has not been responding (that is, has not processed a CI) for more than 500 minutes, it restarts the unresponsive job.
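The check described above can be illustrated with a minimal sketch. It is not the reconciliation engine's actual implementation: the Job class, its fields, and the function names are hypothetical, and treating an idle time of 0 as "no monitoring" is an assumption based on the requirement that a value be configured for the parameter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Job:
    """Hypothetical stand-in for a running reconciliation job."""
    name: str
    continuous: bool
    last_ci_processed_at: datetime  # assumed timestamp of the last processed CI


def find_unresponsive(jobs: List[Job], now: datetime, idle_minutes: int) -> List[Job]:
    """Return the jobs that have been idle for more than idle_minutes."""
    if idle_minutes <= 0:
        # Assumption: no configured idle time means nothing is monitored.
        return []
    limit = timedelta(minutes=idle_minutes)
    return [job for job in jobs if now - job.last_ci_processed_at > limit]


def restart_action(job: Job) -> str:
    """Describe how an unresponsive job is restarted, by job type."""
    if job.continuous:
        # Stop the current run; the next run starts after the continuous interval.
        return "wait_for_interval"
    # Stop the current run and start a new run.
    return "restart_now"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    jobs = [
        Job("identify-and-merge", False, now - timedelta(minutes=501)),
        Job("nightly-sync", True, now - timedelta(minutes=100)),
    ]
    for job in find_unresponsive(jobs, now, idle_minutes=500):
        print(job.name, restart_action(job))
```

In this sketch, a job that last processed a CI exactly 500 minutes ago is not flagged, because the text specifies "more than" the configured idle time.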

This feature logs all traces in the arrecond.log file. Usually, this log file resides in the installation directory, for example, \Program Files\BMC Software\AtriumCore\Logs. If you have configured the log file to reside in a different directory, look for it there.

You should consider the following when working with the fail-safe mechanism:

  • The time interval for which a job can remain idle is configurable.
  • The configured time interval is in minutes.
  • The default job idle time is 0 minutes.

To configure the job idle time, use the configuration options provided in Server Configuration.

