Monitoring services
As an operator or a site reliability engineer (SRE), it's critical that you are able to observe the business services in your organization to monitor their overall health. When a service gets impacted by any factor, you need to view the events generated because of the impact, analyze the causes of the impact, and quickly remediate those events to restore the health of the impacted services.
BMC Helix AIOps provides a set of comprehensive, service-centric monitoring capabilities. A service is a logical group of applications, middleware, security, storage, networks, and other child services that work together to achieve a business goal.
BMC Helix AIOps offers a single-pane view of all the business services used by your organization. This view brings all the relevant information together in one place so that users can respond faster. Operators or SREs can view the following information:
- Number of impacted services by severity - Critical, Major, Minor, or OK
- Number of situations (correlated events), events, incidents, and CIs for each impacted service
- Association between child services of each service
- Health details of each service, along with its impacting events, situations, incidents, or changes
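As a rough illustration of the first item in the list, the following sketch counts services per severity from a list of service records. The record shape (`name`, `severity`), the function name, and every service name other than TrainsApp are hypothetical and used only for illustration; this is not the BMC Helix AIOps API or data model.

```python
from collections import Counter

# Severity levels named in this topic, in display order.
SEVERITIES = ("Critical", "Major", "Minor", "OK")


def count_by_severity(services):
    """Return the number of services at each severity level.

    `services` is a list of dicts with a "severity" key (hypothetical shape).
    Severities with no services are reported as 0.
    """
    counts = Counter(service["severity"] for service in services)
    return {severity: counts.get(severity, 0) for severity in SEVERITIES}


# Hypothetical sample data; only TrainsApp appears in this topic.
services = [
    {"name": "TrainsApp", "severity": "Critical"},
    {"name": "Payments", "severity": "OK"},
    {"name": "Search", "severity": "Minor"},
    {"name": "Notifications", "severity": "OK"},
]

summary = count_by_severity(services)
print(summary)
```

Reporting every severity, including those with a count of zero, keeps the summary stable regardless of which services are currently impacted.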
The APEX Global IT Train Ticketing System is built on a microservices-based architecture. It provides a portal for booking and managing train reservations.
Susan is a site reliability engineer at APEX Global IT and is responsible for monitoring the overall health of all the services used for the train ticketing system by using BMC Helix AIOps. The Services page on the console shows the health of each service, with healthy services marked in green. Today, she observes that the TrainsApp service, which travelers typically use to book tickets, is red. Investigating further, she observes that the service health score is 0, which means the service is 100% impacted, and that there are 73 events, 1 situation, and 1 incident against the service. She can also see that there is only 1 impacted CI (configuration item), which tells her that the problem is probably affecting only 1 node in the service. She clicks the Impacting Events link to view the events, situations, incidents, and changes for the impacted service.
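The relationship between the health score and the impact that Susan reads off the page can be sketched as follows. The only data point this topic confirms is that a score of 0 means 100% impacted; the 0 to 100 range and the linear mapping are assumptions made for illustration, and the function name is hypothetical.

```python
def impact_percent(health_score):
    """Convert a health score to an impact percentage.

    Assumes (not confirmed by the product documentation) that scores
    range from 0 to 100 and that impact is simply 100 minus the score.
    """
    if not 0 <= health_score <= 100:
        raise ValueError("health_score must be between 0 and 100")
    return 100 - health_score


# The TrainsApp case from the scenario: a score of 0 is fully impacted.
trains_app_impact = impact_percent(0)
print(f"TrainsApp is {trains_app_impact}% impacted")
```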
Susan wants to find the root cause of the events, so she views the event details under the Analyze Root Cause section. If automation is available, such as restarting a service or clearing disk space, Susan can try to remediate the issue by invoking the available automation from the list of applicable items. In addition, Susan can delegate the event to another user for further action or change the priority of the event. Finally, she can review the service behavior for the last 15 days and look for a pattern or trend that points to service degradation. Susan can then take corrective measures to improve the service health.
All these capabilities help Susan ensure that the services in her organization:
- Remain available and healthy at all times
- Perform at an optimal level
- Have low downtime and minimal impact on the business
Learn about the advanced monitoring and analytical capabilities by using the topics listed in the following table:
Action | Reference |
---|---|
Learn about the service health, health score, and metrics. | Service health score and health timeline |
View the services in a heatmap view or tile view and monitor the overall health of each service on the Services page. | Getting started with service monitoring |
On the service details page:
|