BMC AMI Ops Insight


BMC AMI Ops Insight is a forward-looking tool that helps you detect anomalies in your environment. It ingests your historical data and uses machine learning to understand normal behavior for your systems. It then uses multivariate analysis to detect anomalies in real-time data. This approach minimizes detection time, maximizes lead time to remediation, and reduces false positives. Built-in domain knowledge and data science expertise lower the learning curve for your employees: the product tracks carefully selected key metrics and connects the dots for you, so you don't have to spend time configuring the solution and navigating through trial and error.
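As a conceptual illustration only (not AMI OI's actual algorithm, and using made-up metric names and a made-up threshold), the general baseline-then-detect approach can be sketched like this: learn each metric's normal mean and spread from history, then flag current readings that deviate too far from normal.

```python
# Conceptual sketch of baseline-driven anomaly detection.
# All data, metric names, and the threshold here are hypothetical.
from statistics import mean, stdev

def learn_baseline(history):
    """history: {metric: [historical values]} -> {metric: (mean, std_dev)}"""
    return {m: (mean(v), stdev(v)) for m, v in history.items()}

def detect_anomalies(baseline, reading, threshold=3.0):
    """Return metrics whose current value deviates more than
    `threshold` standard deviations from the learned mean."""
    anomalous = []
    for metric, value in reading.items():
        mu, sigma = baseline[metric]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            anomalous.append(metric)
    return anomalous

history = {"cpu_time": [10, 11, 9, 10, 12, 11],
           "threads": [50, 52, 48, 51, 49, 50]}
baseline = learn_baseline(history)
print(detect_anomalies(baseline, {"cpu_time": 40, "threads": 51}))  # ['cpu_time']
```

A real product would also analyze how metrics move together (multivariate analysis) rather than checking each metric in isolation; this sketch shows only the per-metric baseline idea.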

Getting Started

Through this test drive, you will see how easy it is to identify a Db2 performance degradation before a traditional monitoring product can detect it, so you can act before it impacts the business.
In this case, an application change was made to a Db2 subsystem. Shortly after the application is introduced into production, BMC AMI Ops Insight immediately raises an event that needs to be investigated, allowing you to identify the problem and minimize or eliminate the disruption entirely.


To get started in AMI OI, do the following:

Do This

Choose "Web Browser" as Application

Do This

Click the Set as default button at the top, then click the Advanced button at the bottom

Do This

Click the 'Proceed to dtw-cwcc-td.nasa.cpwr.corp (unsafe)' link

Do This

Enter your assigned Test Drive ID (for example, cwezxxx) and your assigned password, and click Sign In

Upon successfully logging in, you will be taken to the Active Events screen, with a list of active events that need to be investigated.

This environment is made up of three Db2 subsystems (DMS1, DMS2 and DMS3) included in a Sharing Group (DSNDMS).

An application change was put into production on the DMS2 Db2 subsystem on a Sunday (March 28th, 2021). As the change caused some issues in the DMS2 and DMS3 subsystems, as well as in the overall Sharing Group, you will see how quickly we can detect the problem using AMI OI.

So let us investigate.

Do This

Click on the Investigate link for DMS2, as we know that this is where the bad application was introduced, to see what triggered this event. 

Do This

If you see a warning message, you can ignore it. It will disappear automatically.

It simply indicates that this test environment is not connected directly to a Tomcat server.

The Probable Cause Analysis view displays the current most probable sources, or classifications, of the anomaly based on the current data. A classification is the suspected source KPI (Key Performance Indicator) Group of the anomaly.

Above the graphical representation is a timeline of the event. You can select a specific point on the timeline to see the state of things at a specific time. All other areas of the tab are affected by the timeline.

Use the arrows (< or >) or drag the slider (blue circle) to jump backward or forward on the timeline, or click Event Start or Latest Ingestion to go to the beginning of the event or the latest data.

Do This

Set 9:50 as the analysis time

You will see that Local Contention on DMS2 is driving Global Contention on the Sharing Group, and this Global Contention is causing an increase in the IRLM (Db2 Internal Resource Lock Manager) CPU time.

As Global Contention is increasing, a lot of work is being delayed, so CICS starts more Db2 threads, which leads to more real and auxiliary storage usage.

The KPI Groups that are identified as the classifications of the anomaly, Workload and Contention, are listed on the left.

Do This

Click on the View Detail Analysis button for the Contention event on the left

You will be taken to the detailed analysis and the root cause of the problem. The details are divided into two sections:

  1. Overview: a page is in contention (a Db2 lock), caused by a batch job ($VSLKR10) in the DMS3 Db2 subsystem (QAC 5 LPAR).
  2. Analysis: this section displays the connections between the work units involved in the Event and how they are affecting each other. It provides the right view and context for further investigation in the AMI Ops Monitors (not covered in this scenario). The small question marks provide information about the data presented in the Overview and Analysis sections.

Do This

Close the Detailed Analysis view by clicking on the X on the upper right side

Do This

Click on the Event Progression tab to the right of Probable Cause Analysis

On the Event Progression tab, you can see details of the event over time.

The event log, on the left, shows a sequence of KPI Groups going into anomaly state and Exceptions occurring. You will see that Db2 has issued the same exception, QTXATIM (Timeouts), repeatedly since 9:42 because of the Global Contention. Db2 exceptions are actively monitored by AMI OI.
The timeline, on the right, shows a graphic display of the performance of the KPI Groups and Exceptions.

A green up-arrow, after the KPI Group name, indicates things are getting better for that KPI. A red down-arrow indicates things are getting worse for that KPI. A dash indicates things are holding steady for that KPI.

Finally, the numeric values on the right are Z-Score values, which measure how far a raw value is from its modeled mean, in standard deviation units. A Z-Score near zero means the value is close to normal; the larger the absolute Z-Score, the further the value is from its normal behavior.
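The Z-Score calculation itself is standard statistics. As an illustration (the baseline numbers below are hypothetical, not AMI OI values):

```python
# Illustrative Z-Score calculation; the baseline mean and standard
# deviation here are made-up example values, not AMI OI internals.
def z_score(value, mean, std_dev):
    """Distance of a raw value from its modeled mean, in std-dev units."""
    return (value - mean) / std_dev

# Suppose a KPI's modeled normal behavior is a mean of 120 with a
# standard deviation of 15.
print(z_score(120.0, mean=120.0, std_dev=15.0))  # 0.0 -> exactly at normal
print(z_score(165.0, mean=120.0, std_dev=15.0))  # 3.0 -> three std-devs above normal
```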

Do This

Select a specific point on the timeline to see the state of things at that time. The timelines are connected; clicking a time on the left graph will adjust the time on the graph on the right.


Do This

Click on the Switch to Graph View link on the upper right side

Graph view displays individual graphs per KPI Group for all KPI Groups at the selected time, with a breakdown to the individual KPIs.

Do This

To see KPI values on a graph for a specific time, mouse-over the graph

You can also view a larger version of a graph by hovering over the graph title (for example, "Thread-Counts"), clicking the down arrow when it appears, and then selecting "View"

Do This

To exit the large graph, click the left arrow on the left side of the graph

Do This

Scroll down on the right to see more graphs

You will see all the KPIs that are increasing or decreasing significantly after 9:42: Thread Counts, Local Contention, Page Plock Activity, Storage, etc.

Let's look at how Db2 Subsystem Events can cause Sharing Group Events.

Do This

Click on the 'View Event Correlations-Go to Timeline' button on the left at the bottom

When you get to the Timeline view you can see how Events occur in each of the Db2 subsystems and how issues in those subsystems affected the Db2 Sharing Group.

Review the red Events on Db2 subsystems DMS2 and DMS3. Note the red dots within the Event. 

Do This

Mouse-over the red dots to see the Exceptions that also occurred during this Event time interval

Click on the red Event box for DMS2

This action takes you to the timeline specifically for DMS2 and lets you drill down into each Category to see the associated KPI Groups and their sparkline graphs. A row for Exceptions is positioned above all of the Categories.

Do This

Click on the small down arrow next to the Contention Category to expand and see the KPI Groups

Do This

Notice how the high Global Contention graph aligns with the red Event box near the top of the view

So, in summary, you will get two fundamental benefits from AMI OI:

  1. Lead time to resolution: AMI OI shifts issue detection left by surfacing deviations from normal behavior, so you can start investigating immediately, before a problem fully materializes.
  2. Narrowing the hotspot for you: AMI OI tells you which LPAR, Db2 subsystem, and KPI Group are experiencing deviations from normal, and which AMI Ops monitor view you can hyperlink to in order to start the investigation.

Next Steps

Choose your next road trip!
