Detect problems using the automated self health monitoring capability
Mix Technologies is a large enterprise company in the Silicon space. It has the following deployment:
- 5000 servers in the IT infrastructure
- 1500 servers in a virtual environment using VMware
- 500 servers in a public cloud environment
Mix Technologies monitors its network devices using events through SNMP. It also uses deep dive network topology tools. The rest of the application infrastructure and servers are monitored using application performance and traditional monitoring tools. The help desk personnel and application owners are responsible for monitoring and managing the servers in the private cloud as well.
There are many user roles involved in the deployment, operation, and management of Infrastructure Management. Your company may employ the roles as described below, consolidate them into fewer roles, or divide them into roles with more granular responsibilities, and it may use other titles for these roles.
The following role is required to complete this use case:
- Roger - Distributed Service Operations User
Roger handles the following responsibilities:
- Maintaining the ongoing performance and availability of production systems with a focus on server infrastructure
- Performing administrative functions on servers and monitoring tools
- Monitoring performance and solving availability, performance, and capacity problems
Solving a data collection problem
One of the two Integration Services in an Integration Service cluster goes down. As a result, a KM loaded on that Integration Service cannot collect data, so no data is sent to the BMC TrueSight Infrastructure Management Server. Consequently, no event is generated if a problem occurs.
Roger needs to be notified immediately if there are problems with data collection because he has internal SLAs on availability. Without monitoring the SLAs, he cannot create compliance reports.
Roger can use the self health monitoring feature to solve this problem.
Whenever a KM has a problem collecting data, or the connection between the Integration Service and the BMC TrueSight Infrastructure Management Server is down, Infrastructure Management automatically generates an event that identifies the Integration Service system on which the KM is installed and from which the data collection problem originated. To solve the problem, Roger can:
- Log on to the operator console.
- View the Details pane to see the name of the Integration Service system on which the KM is installed, the severity of the event, and so on.
- Perform event operations, or perform probable cause analysis to drill down to the cause of the event and try to fix it. For more information, see Performing probable cause analysis on an event from the TrueSight console.
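The two conditions that trigger a self health event can be illustrated with a small Python sketch. This is a conceptual model only, not the product's implementation or API: the class names, field names, severity labels, and messages below are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KMStatus:
    """Hypothetical snapshot of one KM running on one Integration Service system."""
    integration_service_host: str
    km_name: str
    collecting_data: bool        # False if the KM cannot collect data
    server_connection_up: bool   # False if the link to the server is down


@dataclass
class SelfHealthEvent:
    """Hypothetical event carrying the details an operator would look for."""
    integration_service_host: str
    km_name: str
    severity: str                # assumed labels, not the product's severity scale
    message: str


def check_self_health(status: KMStatus) -> Optional[SelfHealthEvent]:
    """Return an event if either trigger condition holds, otherwise None."""
    if not status.server_connection_up:
        return SelfHealthEvent(
            status.integration_service_host,
            status.km_name,
            "CRITICAL",
            "Connection to the Infrastructure Management Server is down",
        )
    if not status.collecting_data:
        return SelfHealthEvent(
            status.integration_service_host,
            status.km_name,
            "MAJOR",
            "KM cannot collect data",
        )
    return None


def collect_events(statuses: List[KMStatus]) -> List[SelfHealthEvent]:
    """Check every KM status and keep only the ones that produced an event."""
    events = []
    for status in statuses:
        event = check_self_health(status)
        if event is not None:
            events.append(event)
    return events


if __name__ == "__main__":
    # Example: one healthy KM and one KM that has stopped collecting data.
    statuses = [
        KMStatus("is-host-1", "example_km", True, True),
        KMStatus("is-host-2", "example_km", False, True),
    ]
    for event in collect_events(statuses):
        print(event.integration_service_host, event.severity, event.message)
```

In this sketch, each event records the Integration Service system name and a severity, mirroring the kind of information Roger reads in the Details pane before starting probable cause analysis.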