Billions of transactions a day. Thousands of distributed traces. Dynamic workloads. The reality of monitoring modern hyperscale cloud-native applications is a big data reality where the application environment changes rapidly on short time scales.
A big data approach to application performance management (APM) is necessary for:
- Completeness. Tools that sample transactions, or only snapshot based on exceptions, provide a fragmented, incomplete view that leaves too many blind spots for troubleshooting. This is especially true for the long tail of performance issues.
- Context. Some tools discard payload and metadata information, which obscures the business relevance that helps prioritize efforts and makes each transaction unique. If you capture the checkout transaction and discard the shopping cart contents, how do you know if the failed transaction was worth $10 or $10,000?
- Correlation. In today’s distributed environments, transactions can traverse thousands of application tiers. To provide a clear end-to-end picture, distributed traces must be correlated across application components and stitched together into a single trace.
- Causality. Without high-definition monitoring of the application environment, it’s easy to miss CPU or memory spikes entirely, or to be misled by “aliasing” effects, where coarse sampling intervals distort or hide short-lived behavior. Second-by-second metrics are needed to accurately pinpoint root causes such as shared resource issues and other infrastructure dependencies.
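To make the Causality point concrete, here is a minimal sketch of how a short CPU spike disappears at coarse resolution. The data is synthetic and the `downsample` helper is hypothetical; it illustrates the general effect only, not how AppInternals collects metrics.

```python
from statistics import mean

def downsample(samples, interval):
    """Average consecutive per-second samples into buckets of `interval` seconds."""
    return [mean(samples[i:i + interval]) for i in range(0, len(samples), interval)]

# Synthetic CPU utilization (%), one sample per second for two minutes:
# a steady 20% baseline with a 5-second spike to 100% starting at t=30.
cpu = [20.0] * 120
for t in range(30, 35):
    cpu[t] = 100.0

per_second_peak = max(cpu)            # the spike is fully visible at 1-second resolution
per_minute_avg = downsample(cpu, 60)  # averaging smears the spike across the minute
per_minute_poll = cpu[::60]           # polling once a minute misses the spike entirely

print(per_second_peak)
print([round(v, 1) for v in per_minute_avg])
print(per_minute_poll)
```

At 1-second resolution the peak is 100%. The one-minute average for the same window is roughly 27%, and a once-a-minute poll reports the 20% baseline both times, so the spike never appears.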
Riverbed is the only APM vendor architected to support these requirements, delivering deep visibility with proven low overhead, even in high-throughput production environments. Let’s take a look under the hood and see how it works.
Architected for big data
Overhead is a key concern for any technology that sits in line with the application. One of the ways AppInternals keeps its overhead low is with a distributed three-tier architecture:
- In-app instrumentation records traces of every critical method and application call in a transaction
- Traces are compressed and streamed, along with 1-second metrics, by an aggregator agent
- Data is stored in a non-relational proprietary big data datastore which can be scaled horizontally
Separating the in-app instrumentation from the time-series metrics collection, aggregation, compression and streaming functions eliminates the need to add compute or processing power to the application stack being monitored. This separation also results in greater reliability, with protection from data faults.
AppInternals is now fully adapted to hyperscale cloud environments. With the multi-system clustered deployment approach introduced this year, we have increased scalability by an order of magnitude—a single AppInternals analysis server can now support tens of thousands of agents capturing many billions of transactions per day.
The cluster mode distributes the analysis server workload across redundant data processing pipelines for scale and availability. Some of the worker processes that are running in a cluster include:
- Controller Worker: storing metrics, transaction traces and configuration data
- Parser Worker: parsing and analyzing transaction trace files
- Indexer Worker: indexing transaction segments and stitching them into end-to-end transactions
- UI Worker: hosting the analysis server web interface to view performance data and perform configuration tasks
The persistence layer maintains state and can be set up for failover in a master/slave architecture. Internal watchdog processes monitor health and trigger recovery or failover. The agents can be configured to be multihomed, with a primary and a failover analysis server for additional redundancy. Agents also buffer data locally for a period of time, so even if the entire backend is disconnected for a while, no data is lost.
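The multihoming and local buffering behavior described above could look something like the sketch below. Every name here (`BufferingAgent`, `FakeEndpoint`, `max_buffered`) is hypothetical, and the bounded in-memory buffer is an assumption made for illustration; this is not the AppInternals agent implementation or its configuration.

```python
from collections import deque

class BufferingAgent:
    """Hypothetical agent: try the primary endpoint, then the failover,
    and keep records in a local buffer while neither is reachable."""

    def __init__(self, endpoints, max_buffered=10_000):
        self.endpoints = endpoints                # ordered: primary first, then failover
        self.buffer = deque(maxlen=max_buffered)  # assumption: bounded, oldest dropped when full

    def _try_send(self, record):
        for send in self.endpoints:
            try:
                send(record)
                return True
            except ConnectionError:
                continue                          # fall through to the next endpoint
        return False

    def submit(self, record):
        self.buffer.append(record)
        self.flush()

    def flush(self):
        # Deliver in arrival order; stop at the first failure so nothing is reordered.
        while self.buffer:
            if not self._try_send(self.buffer[0]):
                break
            self.buffer.popleft()

class FakeEndpoint:
    """Stand-in for an analysis server that can be switched off."""
    def __init__(self):
        self.up = True
        self.received = []

    def __call__(self, record):
        if not self.up:
            raise ConnectionError("endpoint unreachable")
        self.received.append(record)

primary, failover = FakeEndpoint(), FakeEndpoint()
agent = BufferingAgent([primary, failover])

agent.submit("trace-1")   # primary is up
primary.up = False
agent.submit("trace-2")   # primary is down, failover takes it
failover.up = False
agent.submit("trace-3")   # whole backend is down, record stays buffered
primary.up = True
agent.flush()             # backend is back, buffered record is delivered
```

In this run the primary ends up with `trace-1` and `trace-3`, the failover with `trace-2`, and the buffer is empty. A real agent would presumably persist the buffer and bound it by time rather than by count.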
Virtually unbounded call stack depth
AppInternals automatically tunes its code-level instrumentation to keep performance impact to a minimum, using a technique we call hotspot detection. With hotspot detection, you can monitor your entire application without making any trade-offs, and you need no intimate knowledge of the application code to configure what to include or exclude.
Hotspots are essentially short, uninteresting methods that are called with excessive frequency. Because the time added to each instrumented method is fixed, even an extremely low footprint of 250 ns per method can add up across many calls and create a “hotspot” of response-time overhead. By dynamically monitoring the execution time of all methods, AppInternals quickly identifies methods that are called many times but have a very short execution time. If a method meets these (and other) criteria, which ensure it is an unlikely candidate for optimization, monitoring of that method is automatically disabled. Over time, the methods that are most expensive to instrument are eliminated from monitoring entirely. Although there are typically only a few hotspot methods in an application (even in very large applications), our testing shows that disabling monitoring of these few methods dramatically reduces instrumentation overhead. AppInternals keeps a record of all hotspot methods in an easy-to-access directory for visibility.
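The idea behind hotspot detection can be sketched in a few lines. The thresholds (`min_calls`, `max_avg_ns`), class names, and method names below are assumptions chosen for illustration; the actual AppInternals criteria are not described here beyond "called many times" and "very short execution time".

```python
from dataclasses import dataclass

@dataclass
class MethodStats:
    calls: int = 0
    total_ns: int = 0
    enabled: bool = True   # False once the method is classified as a hotspot

class HotspotDetector:
    """Hypothetical detector: disable monitoring for methods that are
    called very often but take very little time per call."""

    def __init__(self, min_calls=100_000, max_avg_ns=1_000):
        self.min_calls = min_calls      # assumed threshold for "called many times"
        self.max_avg_ns = max_avg_ns    # assumed threshold for "very short execution time"
        self.stats = {}

    def record(self, name, elapsed_ns):
        stats = self.stats.setdefault(name, MethodStats())
        if not stats.enabled:
            return                      # already excluded: no further bookkeeping
        stats.calls += 1
        stats.total_ns += elapsed_ns
        avg_ns = stats.total_ns / stats.calls
        if stats.calls >= self.min_calls and avg_ns <= self.max_avg_ns:
            stats.enabled = False

    def hotspots(self):
        return sorted(n for n, s in self.stats.items() if not s.enabled)

detector = HotspotDetector()
for _ in range(200_000):
    detector.record("Cart.getId", 50)        # trivial getter, 50 ns per call
for _ in range(500):
    detector.record("Db.query", 2_000_000)   # slow but rarely called, 2 ms per call

# Fixed cost of instrumenting the getter at 250 ns per call, in milliseconds.
overhead_ms = 200_000 * 250 / 1_000_000

print(detector.hotspots())
print(overhead_ms)
```

Here the getter is flagged after its 100,000th call, while the slow query stays monitored. At 250 ns per call, instrumenting 200,000 getter calls would add 50 ms of overhead for a method nobody is likely to optimize.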
This is how AppInternals can capture a call stack that is virtually unbounded, reaching even thousands of levels deep into user code. OpenTracing spans are seamlessly integrated into our call stacks, and transactions are stitched end-to-end across all tiers, delivering the most complete distributed tracing in the industry.
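Stitching per-tier segments into one end-to-end transaction can be pictured as rebuilding a tree from parent references. The sketch below assumes each segment carries `trace_id`, `span_id`, and `parent_id` fields and that each trace has a single entry point; these field names are hypothetical and are not the AppInternals data format or any tracing library's API.

```python
from collections import defaultdict

def stitch(segments):
    """Group segments by trace and link each one to its parent.
    Returns {trace_id: root_segment}; every segment gains a 'children' list."""
    by_trace = defaultdict(dict)
    for seg in segments:
        by_trace[seg["trace_id"]][seg["span_id"]] = dict(seg, children=[])

    roots = {}
    for trace_id, nodes in by_trace.items():
        for node in nodes.values():
            parent = nodes.get(node["parent_id"])
            if parent is None:
                roots[trace_id] = node      # no known parent: treat as the entry point
            else:
                parent["children"].append(node)
    return roots

def walk(node, depth=0):
    """Yield (depth, tier) pairs in call order."""
    yield depth, node["tier"]
    for child in node["children"]:
        yield from walk(child, depth + 1)

# Segments reported by different tiers, arriving out of order.
segments = [
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "tier": "database"},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "tier": "web"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "tier": "checkout-service"},
]

path = list(walk(stitch(segments)["t1"]))
for depth, tier in path:
    print("  " * depth + tier)
```

The three segments come back as a single chain from `web` to `checkout-service` to `database`, regardless of arrival order.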
Efficient data transfer, storage and retrieval
Another level of pruning is applied to the data by the AppInternals aggregator agent to enable efficient data transfer and storage. This is done with proprietary data compression technology similar to the kind of smarts we’ve built into our SteelHeads for WAN optimization. Think of it as lossless compression, the way a lossless image format shrinks a file without any loss of picture quality.
Data storage and retrieval are further optimized with Riverbed’s proprietary data structures. With our NoSQL database, you can capture billions of transactions on less disk space than what is probably in your laptop right now, and quickly retrieve meaningful transaction data. On our own demo servers we’ve surpassed 2B transactions on less than 500 GB of disk.
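As a quick back-of-the-envelope check on the demo-server figures above, the snippet below works out the implied average footprint per transaction. It assumes decimal gigabytes (10^9 bytes) and treats both figures as exact bounds, so the result is an upper limit, not a measured value.

```python
transactions = 2_000_000_000     # "surpassed 2B transactions"
disk_bytes = 500 * 10**9         # "less than 500 GB", assuming decimal GB

# Upper bound on the average stored size of one transaction.
max_bytes_per_transaction = disk_bytes / transactions

print(max_bytes_per_transaction)
```

That works out to at most 250 bytes per stored transaction on average.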
Riverbed has developed proprietary technology for capturing, storing and indexing very large data sets. Unlike other solutions, we capture every transaction across every tier in full detail, without sampling, along with system metrics monitored at 1-second intervals. Riverbed’s approach has minimal overhead on the system being monitored, even in high-throughput production environments benchmarked at 500M transactions/minute. As a result, we offer the most complete distributed tracing in the industry. In addition, our big data architecture allows you to store the non-aggregated raw data for analysis for as long as it is needed without overloading your storage resources.
High-precision data holds the key to trustworthy machine-automated analysis. Our big data captures all trends, corner cases, and anomalies, which our analytics can then surface reliably to map dependencies accurately, resolve complex problems, and optimize performance.
To learn more, read our whitepaper on “Why Big Data is Critical for APM” or try it for yourself in our sandbox.