Apache Tomcat recovery use case

At Calbro Systems, a customer-facing documentation website was running on an Apache Tomcat instance connected to a PostgreSQL database. Customers began calling the Calbro Systems NOC service desk to complain that the documentation site performed poorly and was sometimes completely inaccessible.

Initial meeting

At the initial meeting with management, several diagrams and calculations were sketched out on paper.

Before process automation and orchestration

The meeting with management resulted in a sketch of the current process.

[Figure: Before process automation and orchestration]

After process automation and orchestration

An automated and orchestrated process was sketched on paper:

[Figure: After process automation and orchestration]

Problem analysis

An analysis of the problem is detailed in the table below.

Metric                  Value
Incident frequency      ~15/month
Time to resolve         ~20 minutes
1 full-time employee    ~$100K/year

252 workdays/year
480 minutes/8-hour workday

$100K ÷ 252 workdays = $397/workday
$397 ÷ 480 minutes = $0.82/minute
15 incidents × 20 minutes/incident = 300 minutes/month
300 minutes × $0.82/minute = $246/month in incident costs
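
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the same calculation; the function and variable names are illustrative and do not come from any BMC product. Note that carrying full precision gives a rate of about $0.83/minute and about $248/month, while the $246 figure above results from using a per-minute rate of $0.82.

```python
def cost_per_minute(annual_cost, workdays_per_year, minutes_per_workday):
    """Cost of one minute of a full-time employee's working time."""
    return annual_cost / workdays_per_year / minutes_per_workday


def monthly_incident_cost(incidents_per_month, minutes_per_incident, rate_per_minute):
    """Labor cost of handling all incidents in one month."""
    return incidents_per_month * minutes_per_incident * rate_per_minute


# Inputs taken from the problem analysis table.
rate = cost_per_minute(annual_cost=100_000, workdays_per_year=252, minutes_per_workday=480)
monthly = monthly_incident_cost(incidents_per_month=15, minutes_per_incident=20, rate_per_minute=rate)

print(f"Rate: ${rate:.2f}/minute")
print(f"Incident cost: ${monthly:.2f}/month")
```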

Subsequent meetings

The next meetings, with the technical team, identified the current business process and costs to resolve the issue.

Process

  1. The service desk personnel spent time on the phone with the customer and then created an incident, which remained in the queue until a technician was assigned to work on it.
  2. An IT technician checked the logs for out-of-memory messages or other events.
  3. The network team determined that the issue was not network related, and the database team reported that the database was functioning properly.
  4. The issue was escalated to the application owner, who checked the disk space, CPU usage, and memory. Although all were within normal ranges, the Tomcat Java Virtual Machine (JVM) memory usage was high.
  5. Further investigation showed that the Tomcat log contained out-of-memory exceptions, specifically reports that the connection pool was exhausted. A defect in the application was exhausting its database connection pool.
  6. After taking a thread and heap dump, the application owner restarted Tomcat.
  7. Service desk personnel checked the documentation site. After it appeared to be functioning as expected, they closed the incident ticket.

The total time to resolve the issue was 45 minutes and involved 15 IT staff members.

In subsequent incidents, the team only verified the symptoms and restarted the application, which reduced the time to resolution to 15 minutes.
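
The symptom check that the technicians performed by hand (scanning the Tomcat log for out-of-memory and connection-pool messages) could be approximated as follows. This is a minimal sketch, not the team's actual procedure. The two regular expressions are assumptions: the exact wording of the messages depends on the JVM and on the connection pool library in use, so they would need to be adjusted to match the real log.

```python
import re
from collections import Counter

# Assumed patterns. Adjust them to the messages that actually appear
# in your Tomcat log; the wording varies by JVM and pool library.
PATTERNS = {
    "out_of_memory": re.compile(r"OutOfMemoryError"),
    "pool_exhausted": re.compile(r"pool\s+(?:is\s+)?exhausted", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def symptoms_present(counts):
    """Return True if any known symptom was seen at least once."""
    return any(counts[name] > 0 for name in PATTERNS)
```

Because `scan_log_lines` accepts any iterable of strings, it can be fed an open file object directly, for example `scan_log_lines(open(log_path, encoding="utf-8", errors="replace"))`, where `log_path` is whatever location your Tomcat instance logs to.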

Costs

Over the span of six months, this issue recurred an average of 11.54 times every month. Each incident averaged 15.35 minutes of downtime, totaling 177 minutes of monthly downtime. The cost per incident was $5,312.50, totaling $61,306.25 in lost productivity every month.
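
The monthly totals follow directly from the per-incident averages. The sketch below reproduces them with Python's `decimal` module so that the currency figures are not affected by binary floating-point rounding. The variable names are illustrative only.

```python
from decimal import Decimal

# Six-month averages quoted in the text.
incidents_per_month = Decimal("11.54")
minutes_per_incident = Decimal("15.35")
cost_per_incident = Decimal("5312.50")

# Monthly downtime in minutes and monthly cost in dollars.
monthly_downtime = incidents_per_month * minutes_per_incident
monthly_cost = incidents_per_month * cost_per_incident

print(f"Downtime: {monthly_downtime.quantize(Decimal('1'))} minutes/month")
print(f"Lost productivity: ${monthly_cost.quantize(Decimal('0.01'))}/month")
```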

How BMC Atrium Orchestrator can help

A monitoring solution can be used to automatically detect the issue and notify BMC Atrium Orchestrator to execute the proper workflow to resolve the issue. The same workflow could be used independently from the monitoring solution.

The workflow creates the incident ticket, restarts the application, and then updates the incident ticket.

Develop formal drawings

Translate the hand-drawn sketches into more formal drawings that show all involved systems and workflow activities.

[Figure: Sample formal drawing]

Automate the workflow

Using BMC Atrium Orchestrator, the team developed a workflow that checks for the problem condition, restarts the Tomcat instance, and continuously updates the incident ticket.
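
The control flow of such a workflow can be outlined in plain Python. This is not BMC Atrium Orchestrator syntax and does not use any BMC API; it is a hedged sketch of the sequence described above. The `Incident` class and the three callables (`problem_detected`, `restart_app`, `site_healthy`) are hypothetical stand-ins for the monitoring check, the Tomcat restart, and the post-restart verification. The escalation branch is also an assumption, since the text does not say what happens if the restart does not help.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Incident:
    """Hypothetical stand-in for a ticket in the service desk system."""
    summary: str
    notes: List[str] = field(default_factory=list)
    status: str = "open"


def remediate(
    problem_detected: Callable[[], bool],
    restart_app: Callable[[], None],
    site_healthy: Callable[[], bool],
    summary: str = "Documentation site degraded",
) -> Optional[Incident]:
    """Check for the problem, restart the application, and keep the ticket updated."""
    if not problem_detected():
        return None  # Nothing to do, so no ticket is created.

    incident = Incident(summary)
    incident.notes.append("Problem condition confirmed")

    restart_app()
    incident.notes.append("Application restart requested")

    if site_healthy():
        incident.notes.append("Site verified after restart")
        incident.status = "resolved"
    else:
        # Assumption: hand the ticket to a person if the restart did not help.
        incident.notes.append("Site still unhealthy after restart")
        incident.status = "escalated"
    return incident
```

Passing the checks and the restart action in as callables keeps the outline independent of any particular monitoring tool or restart mechanism, which mirrors the point that the same workflow can be used with or without the monitoring solution.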