Anomaly Detection – What, why and now!

#include <std_disclaimer.h>

I’ve been doing a little research lately to learn about anomaly detection and wanted to share. I can’t think of a better way to start than with a visceral example.

If you are old enough to be a working professional you already inherently know what anomaly detection is. We all learned it watching Sesame Street as children!

Do you remember the “One of these things is not like the others” song? It brings back warm memories.

What is anomaly detection?

An anomaly is something that doesn’t belong. It’s an exception, an outlier, an aberration. It’s just peculiar. Not that there’s anything wrong with that.

In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations that do not conform to an expected pattern or other items in a dataset. This is the formal definition from Wikipedia. In plain English, anomaly detection solutions use software algorithms to understand the streams of operational metrics and their inter-relationships to automatically identify events that shouldn’t be happening and the likely causes.

And how can we put this into perspective with things we already know? We can better relate this new machine learning and analytics to what we are familiar with by looking at the business intelligence (data analysis and reporting) maturity cycle:

Data     >>     Information     >>     Knowledge

In the early days of computing, we used computers primarily to collect data: to record each item sold, produced, or inventoried. Computer systems at the time were very transaction oriented. At the end of each month, management could see a report summarizing the transactions: total sales, total units.

In the 1980s and 1990s, enterprise use of data became much more informational, examining sales by region by month or week to manage productivity. This work used to require specialized business analysts, who used analytical reporting tools (OLAP) to perform interactive, multi-dimensional analysis of metrics, pivoting through the data to spot trends important to the business. Now we have those capabilities in Excel, and every accounting department is creating pivot tables. This is traditional, mainstream business intelligence.

Knowledge bordering on foresight was once reserved for giant telcos and financial services firms. Now it can be gleaned by applying powerful data mining algorithms, running on today’s ultrafast hardware, to the thousands of metrics and millions of data points most businesses collect about user experience and business results.

Data mining and analytics are augmented cognition.

Why is anomaly detection important to business?

We have entered the Age of the Customer. The iPhone’s consumerization of mobile touch technology brought always-connected mobile computing, access to cloud services, and social networks to everyone, including our children and our parents. As a result, businesses can no longer survive by winning customers simply by being good closers. Whatever generation we want as customers, succeeding with them today requires substance: content, relationship, trust.

Customers no longer want to buy products; they want experiences. And creating digital experiences that are memorable and worth sharing is what’s driving the way tomorrow’s successful businesses will engage users.

For IT, the world is different, too. We are no longer creating systems of record but systems of engagement, which directly impact business results. Execution of the customer experience through application delivery has a bigger impact on business results than ever.

Here’s a quick video about how Snap Interactive uses Anomaly Detection to power the decision making behind real-time campaign management.

Leveraging the data the business generates from business and technical operations can yield insights that improve business operations and results delivered to stakeholders. 

Why is anomaly detection important to IT Operations?

With application delivery having such a big impact on customer engagement, these powerful new analytics are critical to managing the risk of IT operations. There are just too many metrics from too many tiers flowing out of today’s complex application delivery environments, powered by cloud, SaaS, virtualization, and third-party content and components.

Businesses are functioning in real time, so the IT that supports the business must be much more real-time. Without our software systems, the business of our businesses often comes to a halt. Add to that the new Continuous Deployment scenarios, and things are just happening a lot faster than they used to.

Today’s application and IT operations monitoring are generating terabytes of big data. With more frequent collection intervals a lot more data is being generated. Many sensors today are generating data every few seconds. The complex infrastructures we are monitoring have many more data elements and are spread across more servers, app servers, networks, and third party components. And the importance of business systems to the operation of the business means that all of the data we are collecting needs to be analyzed in real-time, across all of the many dimensions.

The old methods are not enough.

Dashboards are insufficient to manage today’s complex, real-time businesses. How many metrics can a person really watch on a dashboard and understand? Maybe a dozen or so – not nearly enough. Threshold setting is also no longer effective. Manually setting static values that represent Red – Yellow – Green doesn’t work in dynamic environments and doesn’t scale. Besides, thresholds assume you already know what you are looking for.
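To make the threshold problem concrete, here is a minimal sketch comparing a hand-picked static limit with a baseline learned from recent history. The response times, the 300 ms limit, the window size and the function names are all hypothetical, chosen only for illustration; real products use far more sophisticated models.

```python
from statistics import mean, stdev


def static_alerts(values, threshold):
    """Indices where the value exceeds a fixed, hand-picked threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def rolling_alerts(values, window=5, k=3.0):
    """Indices where a value is more than k standard deviations away
    from the mean of the preceding `window` values."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts


# Hypothetical response times in ms for the same service at two times of day.
night = [100, 102, 98, 101, 99, 103, 180, 100, 101]   # one real spike (180)
day = [400, 405, 395, 402, 398, 401, 399, 403, 400]   # busy but steady

STATIC_LIMIT_MS = 300  # assumed "red" threshold

night_static = static_alerts(night, STATIC_LIMIT_MS)
day_static = static_alerts(day, STATIC_LIMIT_MS)
night_rolling = rolling_alerts(night)
day_rolling = rolling_alerts(day)
```

With these made-up numbers the static limit misses the night-time spike entirely and fires on every steady daytime sample, while the rolling baseline flags only the spike. The same limit cannot be right for both regimes, which is the scaling problem in miniature.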

Data mining and machine learning will augment our cognitive abilities to deliver user experiences with greater reliability and consistency.

Anomaly detection in action

The analytics behind anomaly detection use powerful statistical techniques, such as k-nearest neighbor, cluster analysis and neural networks, to understand the data and train the algorithms about what’s normal.
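As a rough illustration of the nearest-neighbor idea, the sketch below scores each observation by its average distance to its closest neighbors, so points far from everything else get a high score. The sample values and the `knn_scores` helper are hypothetical, and this brute-force version is only meant to show the principle, not how a commercial product does it.

```python
import math


def knn_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.
    A larger score means the point is more isolated."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical (CPU %, memory %) samples; the last one sits far from the rest.
samples = [(20, 30), (22, 31), (21, 29), (23, 32), (19, 30), (80, 15)]

scores = knn_scores(samples, k=2)
most_anomalous = max(range(len(samples)), key=scores.__getitem__)
```

Here `most_anomalous` ends up pointing at the last sample, the one that "is not like the others".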

Here’s a quick example of a multivariate Gaussian outlier analysis courtesy of holehouse.org.

The graph below shows how CPU and Memory utilization are fitted with an elliptical probability model that identifies the green X as an outlier using a formula something like this:

[Image: multivariate Gaussian formula]

[Image: multivariate Gaussian fit of CPU and memory utilization]
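For readers who prefer code to pictures, here is a minimal sketch of a two-metric Gaussian outlier check. It is not the holehouse.org implementation: the sample values, the `EPSILON` cutoff and the function names are all assumptions made for illustration. It fits a mean and covariance to "normal" CPU and memory readings, then uses the standard multivariate Gaussian density to decide how unlikely a new reading is.

```python
import math


def fit_gaussian_2d(points):
    """Estimate the mean vector and 2x2 covariance of (x, y) samples."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)


def density(point, mean, cov):
    """Multivariate Gaussian density for two dimensions:
    p(x) = exp(-0.5 * (x - mu)^T Sigma^-1 (x - mu)) / (2 * pi * sqrt(det Sigma))
    """
    (mx, my), (sxx, sxy, syy) = mean, cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form, using the closed-form inverse of a 2x2 matrix.
    maha_sq = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * maha_sq) / (2 * math.pi * math.sqrt(det))


# Hypothetical "normal" readings: CPU % and memory % rise together.
normal = [(30, 32), (35, 36), (40, 41), (45, 44), (50, 52), (55, 54), (60, 61)]
mean, cov = fit_gaussian_2d(normal)

EPSILON = 1e-4  # assumed cutoff; in practice it is tuned on labeled examples

typical = density((45, 46), mean, cov)   # follows the usual CPU/memory trend
suspect = density((60, 35), mean, cov)   # high CPU with unusually low memory
is_outlier = suspect < EPSILON
```

Note that the suspect reading has a CPU value and a memory value that each fall inside the range seen in the normal data. Only the combination is strange, which is exactly what the elliptical model captures and what per-metric thresholds miss.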

What benefits can I expect from using these new anomaly detection tools?

Anomaly detection systems are automatic: they process a nearly unlimited amount of data in real time and generate expert analysis. Anomaly detection turns unknown business and technical events into known events, giving you a better understanding of what’s going on with your business. These systems perform complex event correlation and are much better at isolating root causes, as opposed to just symptoms.

Analytics can see past broad averages and analyze trends by their dimensions, quickly spotting issues that affect only a single cluster or a particular region and might normally go unnoticed. The alerts produced are both more proactive and more reliable. These systems can find system bugs by quickly identifying troubling behavior during a 10% production deploy, or by identifying sudden abandonment points from visitor analytics or RUM.
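Here is a small sketch of how a broad average can hide a regional problem. The traffic numbers, region names and the "three times the median" rule are all hypothetical, picked only to keep the example short.

```python
from statistics import median

# Hypothetical traffic per region: (region, requests, errors).
traffic = [
    ("us-east", 100_000, 500),
    ("us-west", 80_000, 400),
    ("eu", 90_000, 450),
    ("apac", 2_000, 200),
]

# The broad average: one error rate (%) across all regions.
total_requests = sum(requests for _, requests, _ in traffic)
total_errors = sum(errors for _, _, errors in traffic)
overall_rate = 100 * total_errors / total_requests

# The same metric, sliced by the region dimension.
rate_by_region = {
    region: 100 * errors / requests for region, requests, errors in traffic
}

# Flag regions whose rate is far above what is typical for the others.
typical_rate = median(rate_by_region.values())
flagged = [r for r, rate in rate_by_region.items() if rate > 3 * typical_rate]
```

With these numbers the overall error rate stays under 1%, which looks healthy on a dashboard, while the low-traffic region is failing one request in ten. Slicing by dimension surfaces it immediately.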

Finally, this is all about risk mitigation.

Anomaly detection is a safety system for your business just like Traction Control is for your car.

Not that long ago, traction control was only found on luxury cars, but now you wouldn’t consider purchasing an automobile without this important safety capability, which greatly mitigates the risk of an accident in slippery or changing conditions.

Leading vendors and market dynamics

There are several leading vendors in the analytics space that Gartner has dubbed ITOA, for IT Operations Analytics. Prelert, Netuitive and Sumo Logic come to mind, and these successful vendors are independently selling these analytics as add-ons to supplement the results from current tools. In fact, Prelert has a nice partnership with both Splunk and CA APM.

By now you can probably tell that I think anomaly detection and event correlation are capabilities of sensor systems, not something I believe can sustain a company, or many companies, in the long term.

Given the way technology product cycles are compressing, I wonder how long the market leaders will remain independent.

The takeaway!

These powerful data mining algorithms are now accessible in commercially available products, and they are an important safety control system for your business.

Analytics turns unknown business and technical events into known events, giving you a better understanding of what’s going on with your business.

You shouldn’t operate without them.
