AIOps Is Your Catalyst To Delivering High Performing Business Services

ServiceNow Upgrade Madrid Cover Image

Today, perhaps more than ever before, IT delivers mission-critical business services that an enterprise needs to keep employees safe, engage and retain customers, increase efficiency, drive innovation, and unlock business insights. These services—supply chain systems, e-commerce portals, collaboration platforms, and others—have to be highly available and responsive. And the long-term impact of frequent service degradations and outages can be colossal—poor business service health destroys an organization’s ability to deliver on its promises and drives away customers. Service degradations and outages have an enormous and immediate business impact.

Unfortunately, IT is still plagued by service availability and performance issues. IT operations continue to drown in a deluge of infrastructure events, without any real understanding of how these events affect business services. For example, when a server or network connection fails, IT doesn’t know how the failure impacts the business. The failure could be relatively unimportant—or it could be something critical, such as being unable to process credit card transactions. Similarly, when a customer complains about poor response times, it’s incredibly difficult to find the root cause since IT doesn’t know which infrastructure components support a specific service or how these components are connected. Multiple disconnected monitoring tools make this situation worse. Each tool generates its own siloed stream of data, and multiple tools often report the same issue.

There’s a huge amount of noise.

A single issue can create thousands of events, and many events are irrelevant secondary symptoms or have no business impact at all. Network operations center (NOC) staff have to manually correlate this information to understand what is happening. This is incredibly time-consuming, and errors are common, dramatically increasing the time it takes to fix service outages. Issues are also missed, leading to poor performance and further outages down the road. IT provides mission-critical services that businesses need to engage customers, automate processes, and drive innovation—that must be highly available and responsive.

Traditional event management tools aren’t service-aware

In an attempt to create service visibility, IT software vendors have developed tools that display events on business service maps. These maps show all of the applications, databases, servers, and other IT components that support a business service, with corresponding events attached to each IT component or configuration item (CI).

However, these maps aren’t service-aware.

To provide service visibility, these service maps have to be up to date and accurate—but they aren’t. They are typically created using a manual process that requires extensive input from domain experts and application owners. A single service map can take weeks to complete, rendering it obsolete by the time it’s complete. And constantly updating hundreds of service maps is a gargantuan task, far beyond the resources of most IT organizations. Because service maps are inaccurate and out of date, IT operations staff end up misdiagnosing service issues and miss others completely. This makes service outages more severe and prolonged, and it destroys the staff’s trust in the very maps that were supposed to make things better. Because of this, these tools fall into disuse, leaving IT back where it started, struggling to deliver the service availability and performance the business demands.

A different approach to event management

Creating service visibility is only part of the solution. Simply displaying events on a service map doesn’t reduce today’s overwhelming event volumes. And as business services become more complex and distributed, these volumes are only increasing. IT operations teams are drowning in events, and the situation is getting worse. Traditional event management systems use manually defined rules in an attempt to normalize, deduplicate, and filter events, turning them into a smaller number of alerts—but it’s not enough. A single underlying issue can generate thousands of events as symptoms propagate over time across the IT environment.

There’s just no way to correlate all of these using static rules—there are just too many scenarios to cover. And even if IT could create all of these rules, they would need to be updated every time IT deploys a new application or technology. Even a simple software version upgrade can affect multiple rules, making maintenance a nightmare. A different approach is needed to keep up with this pace of change. For example, artificial intelligence (AI) can be used to apply algorithms to IT operations (AIOps).

Machine learning can be used to automatically analyze and process the vast quantity of events being generated by today’s IT environments—far more effectively than manually defining and managing event rules. Event management systems need to automatically identify topological and temporal event patterns and then use these patterns to correlate event data. By applying this type of machine learning, event management systems can automatically adapt to today’s rapidly evolving IT environments, dramatically reducing noise by identifying multiple symptoms of a single underlying issue.

The bottom line…

To effectively manage business service availability and performance, IT needs event management to be service-aware and intelligent.

Displaying a flood of events on an out-of-date business service map doesn’t solve the problem. To deliver this intelligence and service visibility, modern event management platforms must:

  • Deliver up-to-date, accurate service maps
  • Use machine learning to turn a flood of events into clear, actionable service health information
  • Accelerate diagnosis and remediation by providing operational context

Service Maps

ServiceNow Service Mapping helps IT operations create accurate service maps in the ServiceNow configuration management database (CMDB) and then automatically keeps these maps and related configuration items (CIs) up to date. A patented service discovery mechanism identifies all of the IT components that support a business service and how they are related. Starting from a service entry point—for example, a URL or message queue—it drills down through applications and IT infrastructure, interrogating each CI in turn to determine its service-specific relationships with other CIs.

Once Service Mapping has mapped a business service, it keeps the map up to date by automatically tracking service topology changes—including across public and private cloud environments. When Service Mapping automatically detects a change, it updates the service map in the CMDB, maintaining a full change history. Instead of inaccurate, out-of-date business service information, you’ll have a complete, current view of how your business services are delivered.

Event Management

ServiceNow Event Management works seamlessly with virtually all monitoring and event management tools, providing out-of-the-box integrations with leading vendors including Nagios, SolarWinds, Splunk, Microsoft SCOM, HP Operations Manager, IBM Netcool, VMware vRealize, Amazon CloudWatch, and others. It normalizes, deduplicates, and filters events across these sources, significantly reducing event noise by turning them into a smaller number of alerts. Advanced machine learning techniques are used to correlate these alerts, automatically creating groups of related alerts. This further reduces noise, letting NOC staff see which symptoms are due to a single underlying cause.

ServiceNow Event Management does this by learning patterns of repeating alerts and then applying these patterns to new alerts. It provides temporal correlation—sequences of alerts that occur over time—as well as topological correlation, using service maps and other relationships in the ServiceNow CMDB to identify how symptoms propagate across your business services and IT infrastructure. You can manually add or delete alerts from these automatically-generated groups, and you can provide feedback on their usefulness. Event Management learns from this, modifying its future alert grouping behavior accordingly. 10 As well as grouping alerts, Event Management also assesses their impact on your business services.

This goes far beyond simply mapping alerts to specific CIs in the service map. For instance, it weighs alerts to reflect their service-specific impact and adjusts severities to account for redundant service topologies such as service clusters. Event Management displays this service health information on an intuitive service health dashboard, letting you see the health of all of your business services—or specific groups of services—at a glance. From the dashboard, you can also instantly drill down into individual service maps, which show the health of each component in the map—allowing you to quickly investigate the root cause of service issues.