Failure recovery and pipeline resilience
You must ensure that CI jobs and downstream deployments can tolerate intermittent failures, fail fast on unrecoverable conditions, retain useful diagnostics (logs), and automatically roll back or regress when a quality gate fails or a deployment faults.
Retry logic for network and API failures
Network flakiness (such as HCI timeouts, CES REST hiccups, and transient Db2 interruptions) is common in mainframe integrations. A robust pipeline treats transient errors as retriable, but avoids infinite retries for real failures.
Follow these implementation principles:
- Safe-to-retry operations: Retry only those operations that are safe to repeat (for example, GET set information, download, polling). Do not blindly retry operations that create side effects unless you know they are safe to retry, or you can detect duplicates.
- Exponential backoff and jitter: Apply progressively longer wait times between retries, and add a small random delay to each interval to prevent multiple jobs from retrying at the same time (for example, three attempts at ~5s, ~15s, ~30s and a random 0–5s).
- Retry limits and error categorization:
  - Transient (retry): network timeouts, HTTP 5xx responses, connection refused.
  - Permanent (fail and alert immediately): authentication failures (401/403), bad requests (4xx other than 429), invalid payloads.
- Logging and metrics: Log attempt number, error code, and raw response to pipeline logs, which are retained for troubleshooting.
Jenkins (Groovy) example
// isTransient(err) classifies errors as retriable; a sketch follows this example
int maxRetries = 3
int attempts = 0
while (attempts < maxRetries) {
    try {
        attempts++
        // example: call ISPW REST or plugin operation
        def result = ispwOperation(connectionId: 'CES_Conn', ...)
        return result
    } catch (err) {
        if (attempts >= maxRetries || !isTransient(err)) {
            throw err   // retry budget exhausted, or the error is permanent
        }
        // back off a little longer on each attempt; jitter avoids synchronized retries
        sleep time: (5 * attempts) + new Random().nextInt(5), unit: 'SECONDS'
    }
}
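The isTransient helper is not part of the plugin; the following is a minimal Groovy sketch keyed to the error categories listed earlier (adapt the string checks to the exceptions your calls actually throw):
// Illustrative classifier: treat network hiccups and 5xx/429 responses as retriable,
// everything else (401/403, other 4xx, invalid payloads) as permanent.
boolean isTransient(def err) {
    def msg = (err.getMessage() ?: '').toLowerCase()
    // network-level hiccups
    if (msg.contains('timeout') || msg.contains('connection refused') || msg.contains('connection reset')) {
        return true
    }
    // HTTP 5xx or 429 surfaced in the exception message
    return (msg =~ /\b(5\d\d|429)\b/).find()
}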
Shell wrapper (CLI invocation)
# retry the SCM download up to three times with increasing backoff and a small jitter
for i in 1 2 3; do
  SCMDownloaderCLI --cesUrl "$CES_URL" --cesToken "$CES_TOKEN" ... && break
  echo "Download attempt $i failed; sleeping..."
  sleep $(( i * 10 + RANDOM % 5 ))
done
Timeouts for CLI and plug-in tasks
Avoid pipeline runs that have stopped responding (blocked JCL submission, stuck downloads) by setting deterministic timeouts that match your operational SLAs.
Where to configure
- CLI tools: Many CLIs include a -timeout or --timeout argument (for example, -timeout "90"). Use a non-zero value for network operations. A zero value often means “wait indefinitely.”
- TotalTest CLI: This has -wait and -maxwait parameters. The default maxwait is 20 minutes; adjust for long test runs. Use -wait=false if you submit and poll asynchronously.
- Jenkins plug-in or Azure tasks: Set task-level timeouts in the job definition (pipeline DSL or YAML) and fail fast if a step exceeds expected duration.
We recommend the following defaults (tune them to your environment):
- SCM download or resolve copybooks: 2–5 minutes per container (depending on size).
- Generate or Build set: Use webhook callback as the standard approach (avoid blocking). Typical build completion time is 10–30 minutes.
- TotalTest runs: Per-suite timeouts are configured via -maxwait; set the overall pipeline stage timeout to the sum of expected test durations plus a safety margin.
- REST calls: client HTTP timeout 30s–60s; long polling operations use larger timeouts but prefer callback webhooks (where supported).
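For the stage- and step-level timeouts above, the standard Jenkins timeout step can wrap any long-running operation; a minimal scripted-pipeline sketch (the ispwOperation arguments shown are illustrative):
// abort the step (and fail the stage) if the deploy does not finish within 30 minutes
timeout(time: 30, unit: 'MINUTES') {
    ispwOperation(connectionId: 'CES_Conn', credentialsId: 'cesToken', ispwAction: 'DeployAssignment', ispwRequestBody: "assignmentId=${params.ISPW_Assignment}")
}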
At the CLI level, pass the timeout flag directly; a minimal sketch of the SCM downloader invocation from earlier with a 90-second timeout (adjust the flag name to your CLI version) follows:
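# SCM download with a hard 90-second network timeout (other flags elided as in the earlier example)
SCMDownloaderCLI --cesUrl "$CES_URL" --cesToken "$CES_TOKEN" -timeout "90" ...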
Rollback and automatic regress on failures
If a quality gate (Sonar) or a deployment fails, automatically revert the target environment to the last known good state (where possible), or at least flag the failure and notify for manual rollback.
Examples of supported Code Pipeline operations are as follows:
- Regress or RegressAssignment - Revert an assignment to the previous version.
- FallbackAssignment or FallbackRelease - Revert a deployment to the previously deployed state; graceful-fallback operation templates are provided in the CI templates.
How to implement?
- Quality gate failure:
  - After SonarScanner runs, query the Sonar quality gate (via the Sonar Web API). If the status is not OK:
    - Invoke RegressAssignment (or RegressRelease) via plugin, task, or REST with the ASSIGNMENT_ID/SET_ID.
    - Fail the CI job and send notifications.
- Deployment failure:
  - If DeployAssignment returns a failure or deployment health checks fail:
    - Invoke FallbackAssignment (or FallbackRelease) immediately with the same container ID and parameters to revert.
    - If the automatic fallback fails, raise an incident and keep the logs for support escalation.
- When NOT to auto-rollback:
  - If rollback requires database schema changes, or could cause data loss, prefer manual review and controlled rollback windows.
Example (Jenkins snippet)
// assumes helper steps runSonarScanner(), checkSonarGate(), and deployAssignment() are defined elsewhere in the pipeline
runSonarScanner()
def gate = checkSonarGate() // returns 'OK' or 'ERROR'
if (gate != 'OK') {
    // revert the assignment before failing the build
    ispwOperation(connectionId: 'CES_Conn', credentialsId: 'cesToken', ispwAction: 'RegressAssignment', ispwRequestBody: "assignmentId=${params.ISPW_Assignment}")
    error "Quality gate failed; assignment regressed."
}
try {
    deployAssignment()
} catch (e) {
    // On deploy error: attempt fallback, then fail the build
    ispwOperation(connectionId: 'CES_Conn', credentialsId: 'cesToken', ispwAction: 'FallbackAssignment', ispwRequestBody: "assignmentId=${params.ISPW_Assignment}")
    throw e
}
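The checkSonarGate() helper above is a placeholder; one common implementation, assuming the SonarQube Scanner for Jenkins plugin and a SonarQube webhook pointing back at Jenkins, uses the waitForQualityGate step:
def checkSonarGate() {
    // do not wait indefinitely for the quality gate webhook
    timeout(time: 10, unit: 'MINUTES') {
        def qg = waitForQualityGate()
        return qg.status   // 'OK', 'ERROR', ...
    }
}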
Predefined templates and runner jobs include variables for ASSIGNMENT_ID, CES_TOKEN, LEVEL, and so on, to support these operations. Use the marketplace or runner templates (where available).
Log retention
Troubleshooting mainframe CI/CD requires STC output, JCL spools, and test/coverage artifacts. BMC Support and internal ops rely on these artifacts.
We recommend that you retain the following artifacts:
- STC spool output for the Code Pipeline engine: retain on the spool for at least a few days and ideally archive it permanently in a spool management tool (ADCP best practice); BMC Support requires STC output for issue resolution.
- JCL execution logs (runner JCL), exit codes, and submitted JCL outputs.
- Total Test: The CLI creates Output (execution logs) and Suites folders and writes additional logs to a logs directory inside the workspace; collect all of these and copy them to artifact storage.
- Coverage reports and Sonar artifacts: store XMLs produced by Xpediter/Total Test and Sonar scan output (or preserve CI job artifacts that produced them).
Storage and retention policy (recommended)
- Transient artifacts (console logs, CI temporary files): keep 7–14 days.
- STC / JCL spools and coverage results: Keep 30–90 days based on audit needs; archive older runs to long-term storage. ADCP recommends retaining STC output for at least a few days and ideally archiving it permanently.
- Critical production regression evidence: Keep indefinitely (or per company policy).
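On the CI side, Jenkins can enforce the job-artifact portion of these windows per pipeline; a minimal declarative sketch (the values shown are illustrative):
options {
    // keep run metadata for 14 days and archived artifacts for 90 days
    buildDiscarder(logRotator(daysToKeepStr: '14', artifactDaysToKeepStr: '90'))
}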
How to implement?
- CI job artifacts: Publish JUnit, coverage XML, and zipped logs to artifact storage (Azure Artifacts) and attach them to the pipeline run UI (see the sketch after this list).
- Mainframe spool: Configure mainframe spool management to ship STC outputs to a searchable archive (or to a file server) automatically. ADCP recommends spool retention for BMC Support.
- Automated snapshot: After every deploy or regress, record the set ID and archive the relevant JCL or runner logs for that transaction.
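A minimal Jenkins sketch of the artifact-publishing step referenced above (the paths are illustrative; substitute your workspace layout):
post {
    always {
        // keep unit-test results with the run (path depends on where Total Test writes results)
        junit allowEmptyResults: true, testResults: 'TTTUnit/*.xml'
        // archive CLI logs, the Total Test Output folder, and coverage XML for later troubleshooting
        archiveArtifacts artifacts: 'logs/**, Output/**, Coverage/*.xml', allowEmptyArchive: true
    }
}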
Notification integration
Deliver timely, actionable alerts to developers and operations when pipelines fail, when regressions are applied, or when manual approvals are required.
We recommend the following notifications:
- Failure alerts: Pipeline stage name, error code, short stack trace, link to archived logs, suggested next step (for example, Run ispwOperation GetSetInfo setId=...).
- Regression alerts: Include assignment or set ID, reason (Sonar fail or deploy fail), and link to STC or JCL logs.
- Deployment success: Include release ID and components deployed.
- Manual approvals: Notify approvers with a link to the approval UI (Jenkins, ADO, or GitHub).
Follow these implementation patterns:
- Native CI notifications: Use Jenkins email-ext, Azure DevOps Notifications, or GitHub Actions workflow dispatch and third-party actions to push to Slack or Teams.
- Webhook and CES notifications: CES and some DevX components support email configuration (for example, measurement jobs include an EMAIL param to send completion notices). Set SMTP in CES Web Server Settings for automated mail from CES actions.
- Structured payloads: Send JSON attachments with setId, assignmentId, ISPW_Changed_Programs_File link, and artifact storage links. Use adaptive cards for Teams or formatted Slack blocks.
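A minimal sketch of such a structured payload, assuming the Jenkins HTTP Request plugin and a webhook URL held in a hypothetical CHAT_WEBHOOK_URL environment variable (field values and artifact paths are illustrative):
import groovy.json.JsonOutput

// machine-readable failure payload; consumers can render it as an adaptive card or Slack block
def payload = [
    setId       : params.ISPW_SetId ?: '',   // ISPW_SetId is a hypothetical pipeline parameter
    assignmentId: params.ISPW_Assignment,
    changedFiles: "${env.BUILD_URL}artifact/ISPW_Changed_Programs_File.txt",
    logs        : "${env.BUILD_URL}artifact/logs/"
]
httpRequest url: env.CHAT_WEBHOOK_URL,
            httpMode: 'POST',
            contentType: 'APPLICATION_JSON',
            requestBody: JsonOutput.toJson(payload)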
Example: Jenkins and Slack (post-build)
post {
    failure {
        slackSend channel: '#devops', color: 'danger', message: "Build failed: ${env.BUILD_URL}, assignment ${params.ISPW_Assignment}"
        emailext body: "See ${env.BUILD_URL}\nLogs: artifact link", subject: "PIPELINE FAIL: ${env.JOB_NAME}"
    }
    success {
        slackSend channel: '#deploys', color: 'good', message: "Deploy succeeded: ${env.BUILD_URL}"
    }
}
Observability and operational playbooks
Observability
- Expose the following metrics and feed them to native DevOps dashboards: pipeline duration, retry count, regress invocations, and deploy failure rate.
- Store structured JSON logs for API calls: timestamp, operation, status code, request ID, and correlation ID.
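A minimal Groovy sketch of one such log entry (the logApiCall helper is hypothetical; shipping the emitted line to your log store is environment-specific):
import groovy.json.JsonOutput

// emit one structured JSON line per API call so dashboards and searches can parse it
def logApiCall(String operation, int statusCode, String requestId, String correlationId) {
    def entry = [
        timestamp    : new Date().format("yyyy-MM-dd'T'HH:mm:ss'Z'", TimeZone.getTimeZone('UTC')),
        operation    : operation,
        statusCode   : statusCode,
        requestId    : requestId,
        correlationId: correlationId
    ]
    echo JsonOutput.toJson(entry)
}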
Runbooks or playbooks must be available to the on-call team:
- When a Sonar quality gate fails, follow these steps:
  - The pipeline automatically triggers RegressAssignment (documented and executed).
  - Alert the developer and operations channels with the assignment ID and links to the STC logs.
  - The developer triages the source, fixes it, and re-promotes.
- When a deploy fails, follow these steps:
  - The pipeline calls FallbackAssignment (or FallbackRelease) to revert.
  - If the fallback fails, escalate to on-call with priority and attach the STC, JCL, and coverage artifacts.
- When the CLI hangs on a JCL submit, follow these steps:
  - Cancel the running JCL if possible (use cancelling templates such as .cancel_deployment or cancel-assignment templates).
  - Inspect the STC spool and re-run with an adjusted timeout.
Practical checklist — implement these items
- Set per-step timeouts (CLI -timeout, TotalTest -maxwait, task-level timeouts in CI).
- Implement retry wrapper around network/API calls (exponential backoff + jitter).
- Make RegressAssignment / FallbackAssignment calls part of pipeline failure handlers; validate parameters.
- Archive STC spool output and JCL logs for at least 7–30 days, longer for regulated environments.
- Configure Total Test CLI logging (--loglevel, workspace logs) and copy Output/logs to artifact store.
- Wire notifications to Slack/Teams/email with links to retained artifacts and a recommended follow-up action.
- Document runbooks for Sonar fail, deploy fail, and CLI hangs (include commands to invoke regress/fallback and to retrieve STC output).