BMC AMI Ops Insight
BMC AMI Ops Insight is a forward-looking tool that helps you detect anomalies in your environment. It ingests your historical data and uses machine learning to learn the normal behavior of your systems, then applies multivariate analysis to detect anomalies in real-time data. This approach minimizes detection time, maximizes lead time to remediation, and reduces false positives. Built-in domain knowledge and data science expertise lower the learning curve for your employees: the solution tracks carefully selected key metrics and connects the dots for you, so you do not have to spend time configuring it through trial and error.
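The idea of learning normal behavior from history and flagging deviations can be sketched in a few lines. This is a toy illustration only, not BMC's actual model; the metric names and values are hypothetical:

```python
import statistics

# Hypothetical historical samples per KPI (illustrative numbers only).
history = {
    "local_contention": [2.0, 2.2, 1.9, 2.1, 2.0, 2.3],
    "irlm_cpu_time":    [5.0, 5.1, 4.9, 5.2, 5.0, 4.8],
}

# "Learn" normal behavior: per-KPI mean and standard deviation.
baseline = {
    kpi: (statistics.mean(v), statistics.stdev(v)) for kpi, v in history.items()
}

def anomalous_kpis(current, threshold=3.0):
    """Return the KPIs whose current value deviates more than
    `threshold` standard deviations from the learned baseline."""
    flagged = []
    for kpi, value in current.items():
        mean, std = baseline[kpi]
        if std > 0 and abs(value - mean) / std > threshold:
            flagged.append(kpi)
    return flagged

# A spike in local contention is flagged; IRLM CPU stays within normal range.
print(anomalous_kpis({"local_contention": 9.5, "irlm_cpu_time": 5.1}))
# → ['local_contention']
```

A real product models many KPIs jointly and over time; the point here is only that a learned baseline lets anomalies surface the moment data drifts from normal, without hand-tuned thresholds.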
Through this test drive, you will see how easy it is to identify a Db2 performance degradation before a traditional monitoring product can detect it, so you can act on it before it impacts the business.
In this case, an application change was made to a Db2 subsystem. Shortly after the application is introduced into production, BMC AMI Ops Insight notifies you of an event that needs to be investigated, allowing you to identify the problem and minimize, or even eliminate, the disruption.
To get started in AMI OI, do the following:
Choose "Web Browser" as the application
Click the Set as default button at the top, then click the Advanced button at the bottom
Click the 'Proceed to dtw-cwcc-td.nasa.cpwr.corp (unsafe)' link
Enter your assigned Test Drive ID (for example, cwezxxx) and your assigned password, then click Sign In
Upon successfully logging in, you will be taken to the Active Events screen, with a list of active events that need to be investigated.
This environment is made up of three Db2 subsystems (DMS1, DMS2, and DMS3) that belong to a Sharing Group (DSNDMS).
An application change was put into production on the DMS2 Db2 subsystem on Sunday, March 28th, 2021. Because the change caused issues in the DMS2 and DMS3 subsystems, as well as in the overall Sharing Group, you will see how quickly the problem can be detected using AMI OI.
So let us investigate.
Click on the Investigate link for DMS2, as we know that this is where the bad application was introduced, to see what triggered this event.
Warning: if you see this message, you can ignore it. It will disappear automatically.
It simply indicates that this test environment is not connected directly to a Tomcat server.
The Probable Cause Analysis view displays the most probable sources, or classifications, of the anomaly (the suspected source KPI (Key Performance Indicator) Groups), based on the current data.
Above the graphical representation is a timeline of the event. You can select a specific point on the timeline to see the state of things at a specific time. All other areas of the tab are affected by the timeline.
Use the arrows (< or >) or drag the slider (blue circle) to jump backward or forward on the timeline, or click Event Start or Latest Ingestion to go to the beginning of the event or the latest data.
Set 9:50 as the analysis time
You will see that Local Contention on DMS2 is driving Global Contention on the Sharing Group, and this Global Contention is causing an increase in the IRLM (Db2 Internal Resource Lock Manager) CPU time.
As Global Contention is increasing, a lot of work is being delayed, so CICS starts more Db2 threads, which leads to more real and auxiliary storage usage.
The KPI Groups that are identified as the classifications of the anomaly, Workload and Contention, are listed on the left.
Click on the View Detail Analysis button for the Contention event on the left
You will be taken to the detailed analysis and the root cause of the problem. The details are divided into two sections:
- Overview: there is a page in contention (Db2 lock) caused by a batch job ($VSLKR10) in the DMS3 Db2 subsystem (QAC 5 LPAR)
- Analysis: This section displays the connections between the work units involved in the Event and how they affect each other. It provides the right view and context for further investigation in the AMI Ops Monitors (not covered in this scenario). The small question marks provide information about the data presented in the Overview and Analysis sections.
Close the Detailed Analysis view by clicking on the X on the upper right side
Click on the Event Progression tab to the right of Probable Cause Analysis
On the Event Progression tab, you can see details of the event over time.
The event log, on the left, shows a sequence of KPI Groups going into anomaly state and Exceptions occurring. You will see that Db2 has issued the same exception, QTXATIM (Timeouts), repeatedly since 9:42 because of the Global Contention. Db2 exceptions are actively monitored by AMI OI.
The timeline, on the right, shows a graphic display of the performance of the KPI Groups and Exceptions.
A green up-arrow after the KPI Group name indicates that the KPI is improving. A red down-arrow indicates that it is worsening. A dash indicates that it is holding steady.
Finally, the numeric values on the right are Z-Score values, which measure how far a raw value is from its modeled mean in standard deviation units. A falling Z-Score indicates that the KPI is returning toward its normal value; a rising Z-Score indicates that it is moving further away.
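The Z-Score calculation itself is simple; as a minimal sketch with made-up numbers (not product data):

```python
def z_score(value, mean, std_dev):
    """Distance of a raw value from its modeled mean,
    expressed in standard deviation units."""
    return (value - mean) / std_dev

# If the modeled mean for a timeout count is 4 with a standard deviation
# of 2, an observed value of 12 is 4 standard deviations above normal:
print(z_score(12, mean=4, std_dev=2))  # → 4.0
```

A value sitting exactly at its modeled mean scores 0; the larger the magnitude, the more anomalous the reading.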
You can select a specific point on the timeline to see the state of things at that time. The timelines are connected and clicking a time on the left graph will adjust the time on the graph on the right.
Click on the Switch to Graph View link on the upper right side
Graph view displays individual graphs per KPI Group for all KPI Groups at the selected time, with a breakdown to the individual KPIs.
To see KPI values on a graph for a specific time, mouse-over the graph
You can also view a larger version of a graph: mouse over a graph title, such as "Thread-Counts", click the down arrow when it appears, and then select "View"
To exit the large graph, click the left arrow on the left side of the graph
Scroll down on the right to see more graphs
You will see all the KPIs that are increasing or decreasing significantly after 9:42: Thread Counts, Local Contention, Page Plock Activity, Storage, etc.
Let's look at how Db2 Subsystem Events can cause Sharing Group Events.
Click on the 'View Event Correlations-Go to Timeline' button on the left at the bottom
When you get to the Timeline view, you can see how Events occur in each of the Db2 subsystems and how issues in those subsystems affect the Db2 Sharing Group.
Review the red Events on Db2 subsystems DMS2 and DMS3. Note the red dots within the Event.
Mouse-over the red dots to see the Exceptions that also occurred during this Event time interval
Click on the red Event box for DMS2
This action takes you to the timeline specifically for DMS2 and allows you to drill down into each Category to see the associated KPI Groups and their sparkline graphs. A row for Exceptions is positioned above all of the Categories.
Click on the small down arrow next to the Contention Category to expand and see the KPI Groups
Notice how the high Global Contention graph aligns with the red Event box near the top of the view
So, in summary, you will get two fundamental benefits from AMI OI:
- Lead time to resolution: AMI OI shifts detection left by surfacing deviations from normal behavior as they emerge, so you can start investigating before an issue actually impacts the business.
- Narrowing the hotspot: AMI OI tells you which LPAR, Db2 subsystem, and KPI Group are deviating from normal, and which AMI Ops monitor view you can hyperlink to in order to start the investigation.
Choose your next road trip!