From chess-playing robots to robot assassins from the future, the potential of artificial intelligence (AI) has been thrilling our imaginations for decades. Now, even as Mark Zuckerberg and Elon Musk duke it out over the risks, we are finally starting to see practical applications of this technology, some as close to home as our personal devices. Meanwhile, on the IT side, CIOs are building AI initiatives into their digital transformation strategies, and IT vendors are investing in AI-based self-healing networks, applications, and infrastructure to keep pace. Learning-based technologies are at the peak of their hype cycle; if AI can deliver the promised ROI for IT operations, we are on the brink of the AIOps revolution.
What does this mean for you?
First you’ll need data…
In practical terms, if you’re looking to establish an artificial intelligence and data science practice at your organization, the first thing you will need is lots of really good data. At the end of the day, analytics are only as good as the data sets they are applied to. Incomplete or unreliable data sets will inevitably lead to wrong or inconclusive results. For example, if you’re capturing only partial call trees, tracing only when certain “snapshot” conditions are met, or sampling transactions, and are then trying to run a code audit to find the longest-running, most frequently called methods, you simply won’t have enough data to trust your results. Worse yet, you’ll be capturing only a fraction of the universe of potential issues, say 10% (depending on your sampling rate and the overall volume of transactions), while having zero visibility into the remaining 90%. Maybe with really good snapshot conditions you could improve your odds, but when applications are still slow and users are unable to complete transactions, you’ll have to admit your APM solution simply isn’t capturing all the relevant data. You won’t find the needles in your haystacks if you’re only examining a handful of hay from each stack.
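To make the sampling problem concrete, here is a small, purely illustrative simulation. The 1% issue rate and 10% sampling rate are invented numbers, not measurements from any real APM deployment:

```python
import random

random.seed(42)

# Hypothetical scenario: 10,000 transactions, of which 1% hit a rare
# slow code path. Sampling only 10% of transactions captures just a
# sliver of those rare events and leaves the rest invisible.
TOTAL = 10_000
transactions = [{"id": i, "slow_path": random.random() < 0.01} for i in range(TOTAL)]

sampled = [t for t in transactions if random.random() < 0.10]  # 10% sampling

slow_total = sum(t["slow_path"] for t in transactions)
slow_seen = sum(t["slow_path"] for t in sampled)

print(f"Slow-path transactions overall: {slow_total}")
print(f"Slow-path transactions visible in the 10% sample: {slow_seen}")
```

Run it a few times with different seeds and the lesson holds: most of the slow-path transactions never make it into the sample, so any code audit built on the sampled data starts out blind to the bulk of the problem.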
What is pattern recognition?
One of the ways that data scientists extract information out of data is with pattern recognition algorithms. Pattern recognition is a form of machine learning that looks for regularities or patterns in data sets to find things that go together and figures out what they have in common. With pattern recognition, the more granular and complete your data set, the better the results you can expect to see.
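As a toy illustration of the idea (not Aternity’s actual algorithm), here is a two-cluster k-means on a handful of made-up response times. The pattern it “recognizes” is that the transactions fall into two distinct groups:

```python
def kmeans_1d(values, iters=20):
    """Split a list of numbers into two clusters (simple 1-D k-means)."""
    lo, hi = min(values), max(values)  # seed centroids at the extremes
    for _ in range(iters):
        # Assign each value to the nearest centroid...
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        # ...then move each centroid to the mean of its cluster.
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    return (lo, hi), (a, b)

# Made-up response times: most transactions around 0.2s, a few near 5s.
times = [0.21, 0.25, 0.19, 0.23, 0.22, 5.1, 5.3, 4.9, 0.20, 5.0]
(fast_c, slow_c), (fast, slow) = kmeans_1d(times)
print(f"fast cluster: {len(fast)} txns around {fast_c:.2f}s")
print(f"slow cluster: {len(slow)} txns around {slow_c:.2f}s")
```

With granular, complete data the slow group stands out crisply; with sampled or partial data it might contain too few points to separate from noise, which is exactly the data-quality point made earlier.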
To help you quickly grok the concept of pattern recognition, let’s talk for a minute about how Google learned to recognize cats. Back in 2012, Google’s X labs pointed its deep learning algorithm at 10 million YouTube videos. After three days, the algorithm was able to recognize humans and cats with over 70% accuracy. How did it do that? First, it used image recognition to look for groups of images with patterns in common, identifying “thing X” (a cat face) or “thing Y” (a human body). It then looked through all the metadata for words that appeared most often in conjunction with those images. In the case of cat faces, it found the word “cat” and was therefore able to correctly conclude that “thing X” was a cat.
This is a classic example of what is called large-scale unsupervised learning. (Six years down the road, Google can now tell all your cats apart and make home videos of them!)
What can patterns tell me about my application?
When all is well in the world of your application, transaction response times should be more or less randomly distributed around some optimal mean. When patterns show up in transaction data, this usually points to something warranting further investigation. One of the ways Riverbed surfaces patterns is with a scatterplot visualization called TruePlot, shown below.
Each data point represents an individual transaction (or server, or instance) and its response time. A quick glance can immediately reveal common issues like timeouts (horizontal lines), stalls (vertical lines), server-side queuing (left-leaning ramps), client-side queuing (right-leaning ramps), load thrashing (rise-and-fall ramps), and so on. When multiple issues are present, you’ll see combinations of patterns. When the TruePlot data points are color-coded by transaction type or geographical region, they reveal even more. A horizontal line of a single color probably indicates a true code-related issue impacting that set of transactions, while multicolored spikes probably point to a shared downstream dependency or underlying resource issue. Clusters of transactions that stand out from the rest can also indicate an attack bot or other external factor impacting application performance. Some patterns are easy to identify visually, but others lie hidden inside vast amounts of data. This is where the ability to automatically surface patterns with machine learning comes in.
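To show how even a simple heuristic might surface the “horizontal line” pattern automatically, here is a hedged sketch in Python. The 30-second timeout, the bucket size, and the thresholds are all invented for illustration, not how TruePlot works internally:

```python
import random
from collections import Counter

random.seed(7)

# Made-up data: 500 healthy transactions with naturally varying response
# times, plus 40 transactions pinned at a hypothetical 30s timeout.
times = [random.gauss(0.4, 0.1) for _ in range(500)]
times += [30.0 + random.gauss(0, 0.01) for _ in range(40)]

# Bucket response times to the nearest 0.1s. A crowded high-latency
# bucket looks like a horizontal line on the scatterplot and suggests
# a timeout rather than natural variance.
buckets = Counter(round(t, 1) for t in times)
suspects = sorted((sec, n) for sec, n in buckets.items() if n >= 20 and sec > 1.0)

for sec, n in suspects:
    print(f"{n} transactions pinned near {sec}s -- possible timeout")
```

Healthy transactions spread across many low-latency buckets and never trip the filter; the timed-out transactions all pile into one bucket, which is exactly the regularity a pattern-recognition algorithm keys on.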
Clusters, correlations and anomaly detection
Aternity APM has built-in machine learning algorithms for clustering, correlations and anomaly detection to help troubleshooters be more proactive and solve problems faster.
The cluster operator shown above uses pattern recognition algorithms to automatically find sets of transactions with potential issues. It can also help pinpoint the root cause of the issue, such as the specific SQL statement in this example. The Insights card below statistically correlates key performance indicators across all application tiers to find groups of metrics on tiers that are spiking in tandem. This narrows down the set of possible culprits for a troubleshooter to investigate, which is especially helpful in chasing down performance issues in distributed containerized and microservices environments with thousands of nodes. Automated anomaly detection, the ability to statistically learn what is normal and what is not for different sets of transactions, proactively watches for things that are not behaving as they should.
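As a minimal sketch of the idea behind automated anomaly detection, consider learning a statistical baseline and flagging anything outside it. The baseline numbers and the three-sigma threshold below are illustrative assumptions, not Aternity’s actual algorithm:

```python
import statistics

# Learn what "normal" looks like from a baseline window of response
# times (made-up values in seconds).
baseline = [0.31, 0.28, 0.30, 0.33, 0.29, 0.32, 0.30, 0.27, 0.31, 0.30]
mean = statistics.mean(baseline)
std = statistics.stdev(baseline)

def is_anomaly(response_time, sigmas=3.0):
    """Flag observations beyond mean +/- sigmas * std of the baseline."""
    return abs(response_time - mean) > sigmas * std

for t in [0.30, 0.35, 1.8]:
    print(f"{t:.2f}s -> {'ANOMALY' if is_anomaly(t) else 'normal'}")
```

The payoff of learning thresholds per set of transactions, rather than hard-coding one global limit, is that a 1.8s response can be a glaring anomaly for a lightweight lookup yet perfectly normal for a heavyweight report.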
These AI-powered analytics can help you:
- Surface issues and causes beyond the usual suspects that you may not have thought to look for
- Quickly point you in the right direction for faster troubleshooting and remediation
- Proactively alert you when normal operating thresholds are breached
Of course, the ultimate goal of AIOps is automated root cause analysis and remediation, eliminating the need for a human operator altogether. We haven’t tackled that part yet, but with all the data we collect, Aternity is uniquely well positioned to deliver on that promise.
Even the smartest algorithms are only as good as the data they are applied to, and coaxing information out of application performance data with algorithms and statistics remains the key to beating the competition and realizing greater business and IT efficiencies. Aternity APM takes a big data approach, collecting and storing the most complete set of non-aggregated diagnostics data for troubleshooting applications. Alongside other intuitive analytics and visualizations, Aternity uses pattern recognition to surface issues and root causes that would otherwise be missed in conventional workflows.
In my next blog, I’ll be diving deeper into how our customers are using our data visualization and search capabilities to extract business and application intelligence from vast amounts of APM data—in just two clicks!