IT has the ability to deliver the mission-critical business services that an enterprise needs to engage customers, increase efficiency, drive innovation, and unlock business insights. These services (supply chain systems, e-commerce portals, collaboration platforms, and others) have to be highly available and responsive. Service degradations and outages have an enormous and immediate business impact. And the long-term impact of frequent degradations and outages can be colossal: poor business service health destroys an organization’s ability to deliver on its promises and drives away customers.
Unfortunately, IT is still plagued by service availability and performance issues. IT operations continue to drown in a deluge of infrastructure events, without any real understanding of how these events affect business services. For example, when a server or network connection fails, IT doesn’t know how the failure impacts the business. The failure could be relatively unimportant—or it could be something critical, such as being unable to process credit card transactions. Similarly, when a customer complains about poor response times, it’s incredibly difficult to find the root cause since IT doesn’t know which infrastructure components support a specific service or how these components are connected.
IT operations today can be disconnected and lack visibility.
Multiple disconnected monitoring tools make this situation worse. Each tool generates its own siloed stream of data, and multiple tools often report the same issue. There’s a huge amount of noise. A single issue can create thousands of events, and many events are irrelevant secondary symptoms or have no business impact at all. Network operations center (NOC) staff have to manually correlate this information to understand what is actually happening. This is incredibly time-consuming, and errors are common, dramatically increasing the time it takes to fix service outages. Issues are also missed, leading to poor performance and further outages down the road.
In short, IT operations continue to drown in a deluge of infrastructure events, without any real understanding of how these events affect business services.
Traditional event management tools aren’t “service-aware”.
In an attempt to create service visibility, IT software vendors have developed tools that display events on business service maps. These maps show all of the applications, databases, servers, and other IT components that support a business service, with corresponding events attached to each IT component or configuration item (CI).
However, these maps aren’t “service-aware”. To provide service visibility, service maps have to be up to date and accurate, but they aren’t. They are typically created using a manual process that requires extensive input from domain experts and application owners. A single service map can take weeks to build, so it is obsolete by the time it’s finished. And constantly updating hundreds of service maps is a gargantuan task, far beyond the resources of most IT organizations.
Because service maps are inaccurate and out of date, IT operations staff misdiagnose some service issues and miss others completely. This makes service outages more severe and prolonged, and it destroys the staff’s trust in the very maps that were supposed to make things better. As a result, these tools fall into disuse, leaving IT back where it started, struggling to deliver the service availability and performance the business demands.
But what about a different approach to event management?
Creating service visibility is only part of the solution. Simply displaying events on a service map doesn’t reduce today’s overwhelming event volumes. And as business services become more complex and distributed, these volumes are only increasing. IT operations teams are drowning in events, and the situation is getting worse.
A different approach is needed to keep up with this pace of change.
For example, artificial intelligence (AI) can be applied to IT operations, an approach known as AIOps. Machine learning can automatically analyze and process the vast quantity of events generated by today’s IT environments, far more effectively than manually defining and managing event rules.
Event management systems need to automatically identify topological and temporal event patterns and then use these patterns to correlate event data. By applying this type of machine learning, event management systems can automatically adapt to today’s rapidly evolving IT environments, dramatically reducing noise by identifying multiple symptoms of a single underlying issue.
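To make the idea of topological and temporal correlation concrete, here is a minimal Python sketch. It is not how any particular product works: the service map, the CI names, the sample events, and the 60-second window are all hypothetical, and a simple hand-written rule stands in for the patterns a machine learning system would learn. The rule groups two events when they occur close together in time and sit on the same CI or on directly connected CIs.

```python
from collections import defaultdict

# Hypothetical service map: each configuration item (CI) lists the CIs it depends on.
SERVICE_MAP = {
    "web-portal": ["app-server"],
    "app-server": ["database", "switch-1"],
    "database": ["switch-1"],
    "switch-1": [],
    "mail-server": [],
}

# Hypothetical events: (timestamp in seconds, CI name, message).
EVENTS = [
    (0, "switch-1", "link down"),
    (5, "database", "connection timeout"),
    (8, "app-server", "query failed"),
    (12, "web-portal", "HTTP 500"),
    (20, "mail-server", "disk 80% full"),
    (300, "database", "slow query"),
]


def build_adjacency(service_map):
    """Return an undirected view of the map: CI -> set of directly connected CIs."""
    adjacency = defaultdict(set)
    for ci, dependencies in service_map.items():
        for dep in dependencies:
            adjacency[ci].add(dep)
            adjacency[dep].add(ci)
    return adjacency


def correlate(events, service_map, window_seconds=60):
    """Greedily group events that are near in time and adjacent in the service map."""
    adjacency = build_adjacency(service_map)
    groups = []
    for event in sorted(events):  # process in timestamp order
        timestamp, ci, _ = event
        target = None
        for group in groups:
            for seen_ts, seen_ci, _ in group:
                close_in_time = timestamp - seen_ts <= window_seconds
                connected = ci == seen_ci or ci in adjacency[seen_ci]
                if close_in_time and connected:
                    target = group
                    break
            if target is not None:
                break
        if target is None:  # no related group found, so start a new one
            target = []
            groups.append(target)
        target.append(event)
    return groups


groups = correlate(EVENTS, SERVICE_MAP)
for number, group in enumerate(groups, start=1):
    print(f"Group {number}: {[(ci, message) for _, ci, message in group]}")
```

With this sample data, the four events along the chain from the switch to the web portal land in one group, while the unrelated mail server event and the much later database event each form their own. This greedy pass never merges two existing groups, so it is only an illustration of the principle, not a production correlation engine.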
Discover how to create efficient, intelligent, service-aware event management in this complete Service Health with AIOps eBook.
- Accurate, up-to-date service maps
- Service-aware event correlation
- Operational context
- More tips to deliver high-performing business services