Over the past 15 years I’ve encountered countless cases where inadequate performance data either masked or misrepresented the true symptom or root cause of a performance problem, leading troubleshooters on futile quests that never reached a resolution. A recurring theme behind these cases is a set of fundamental flaws in how monitoring data is captured, stored and reported. The result is often data that’s not merely imprecise, but wholly incorrect.
This issue occurs in any practice which converts analog information to digital data. The digital media industry leverages fundamental digital signal processing principles to ensure that analog events are captured with precise fidelity, yet in the field of performance monitoring, these principles seem to be largely absent in many tools, which then imprecisely capture the performance data we rely on.
I recently gave a talk at the Velocity conference which explored this, drawing parallels between digital audio, imaging and video on one side and performance monitoring on the other. The feedback from attendees was so positive that I decided to evolve the talk into a blog series, where I’ll dive even deeper into the reasons behind these issues and how to ensure that your monitoring practices don’t fall victim to them. I’ll post the slides and video from Velocity in the next installment, but in the meantime here are some related talks I’ve given in the past.
A real-world example
I had a customer who hosted an application they had developed in-house in the cloud. They had been using the cloud vendor’s tools to monitor their VMs, and had recently installed Aternity APM to do deeper diagnostics. After they released a new build of their application, the cloud monitoring tool reported that CPU load had increased more than 8x, from 7% to 61%:
This became a SEV1 issue, and management was pressing for a quick resolution. They dove into the issue with Aternity APM and were very surprised by what they saw:
Aternity APM showed that average CPU load had increased by only 5 percentage points, from 20% to 25%! The blue charts show how Aternity APM’s 1-second granular metrics revealed that the application wasn’t consuming CPU at a constant rate, but rather processing requests in batches once a minute, for about 15 seconds each time.
In version x.1 of the app, the cloud monitoring tool was sampling once per minute, in between the spikes, so it completely missed them and misreported the CPU load as 7%, lower than its actual average load of 20%.
In version x.2 of the app, the batch processing time increased from 15 seconds to 20 seconds, which widened the spikes enough that they aligned with the cloud monitoring tool’s samples, which then reported the CPU load as 61%, much higher than its actual average load of 25%.
In both cases the cloud monitoring tool’s coarse 1-minute samples were completely wrong. If you think about this further, the cloud monitoring tool could have reported the CPU 60 different ways, depending on which second in that window it coincidentally happened to be sampling at, and every way would have been wrong.
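To make the arithmetic concrete, here is a minimal Python sketch of this scenario. It is not how either monitoring tool works internally. It assumes an idealized rectangular workload: CPU sits at 7% when idle and jumps to 61% during each batch, with one batch starting at the top of every minute. Those two levels are borrowed from the readings the cloud tool reported; the function names and the rectangular shape are my own simplifications.

```python
def cpu_trace(burst_seconds, period=60, duration=600, busy=61.0, idle=7.0):
    """Per-second CPU% for a workload that is busy for the first
    `burst_seconds` of every `period` seconds (assumed rectangular shape)."""
    return [busy if (t % period) < burst_seconds else idle
            for t in range(duration)]


def sample_every(trace, interval=60, offset=0):
    """Keep one reading per `interval` seconds, starting at second `offset`."""
    return trace[offset::interval]


def mean(values):
    return sum(values) / len(values)


for version, burst in (("x.1", 15), ("x.2", 20)):
    trace = cpu_trace(burst)
    true_avg = mean(trace)
    # What a once-per-minute sampler reports for every possible phase offset.
    reported = sorted({mean(sample_every(trace, offset=o)) for o in range(60)})
    at_17 = mean(sample_every(trace, offset=17))
    print(f"{version}: true average {true_avg:.1f}%, "
          f"possible 1-minute readings {reported}, "
          f"reading at offset 17s = {at_17:.1f}%")
```

Under these assumptions the true averages come out to 20.5% and 25.0%, close to what the 1-second data showed. A sampler that happens to fire 17 seconds into each minute reads 7% for the 15-second batches and 61% for the 20-second batches, which reproduces the apparent jump. With a perfectly rectangular workload the sampler can only ever report one of two values; real traces ramp up and down, so the range of wrong answers is wider.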
A slightly different example
Here’s another CPU Load chart as reported by the cloud monitoring tool’s 1-minute samples:
It appears that during this 15-minute period, CPU Load increased from about 5% to 95%, and then dropped back down to about 25%. However, when we look at this with Aternity APM’s 1-second granularity metrics (blue), you can see that the CPU Load pattern is once again spiking repeatedly:
Let’s compare the two cases side-by-side:
The only difference between them is that the spikes in the left graph occur every 61 seconds, while those in the right graph occur every 60 seconds. This slight difference dramatically altered how the 1-minute samples (red) aligned with the spikes, resulting in completely different sampled results for very similar source data.
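The effect of that one-second difference can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of the charts: the spikes are modeled as rectangles at 95% over a 5% baseline (the two extremes visible in the sampled chart), the 15-second spike width and the 20-second sampling offset are hypothetical, and the function names are mine.

```python
def spike_trace(period, burst_seconds=15, duration=900, busy=95.0, idle=5.0):
    """Per-second CPU% with a spike at the start of every `period` seconds.
    Spike width and rectangular shape are assumptions for illustration."""
    return [busy if (t % period) < burst_seconds else idle
            for t in range(duration)]


def minute_samples(trace, offset=0):
    """One reading every 60 seconds, starting at second `offset`."""
    return trace[offset::60]


for period in (60, 61):
    trace = spike_trace(period)
    samples = minute_samples(trace, offset=20)
    # Position of each sample within the spike cycle it lands in.
    phases = [(20 + 60 * k) % period for k in range(len(samples))]
    true_avg = sum(trace) / len(trace)
    print(f"spikes every {period}s: true average {true_avg:.1f}%")
    print(f"  sample phase within cycle: {phases}")
    print(f"  1-minute readings:         {samples}")
```

With a 60-second spike period, every sample lands at the same point in the cycle, so the reading never changes. With a 61-second period, each sample lands one second earlier in the cycle than the previous one, so the sampler slowly walks through the spike and the readings appear to climb from the baseline to the peak, even though the underlying workload is just as repetitive as before.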
These examples show that the 1-minute sampled monitoring data was completely wrong in both cases, and would have led troubleshooters towards time-wasting dead ends. The 1-second metrics from Aternity APM, by contrast, accurately represented how the application was behaving.
More to come…
In the next installment I’ll explain the theory behind why coarse sampling gets this wrong, and discuss similar examples from other fields. Hopefully by the end of this series you’ll be able to identify the cases where your performance data can (and can’t) be trusted.