APM and Cloud Monitoring’s Data Quality in a Large Scale Environment

Riverbed Technology June 26, 2018

If you are planning to run applications in the cloud, you’re probably expecting a large number of concurrent users, applications, transactions, servers, instances, and objects. As you know, application performance management (APM), when deployed in large scale environments, presents a whole new set of challenges. Many traditional APM solutions are unable to keep up with a large volume of transactions, and frequently force a choice between scale and data quality by relying on sampling and trigger-based approaches. If you are running your applications in a cloud environment and have run into some of these challenges, SteelCentral AppInternals can help!

APM and cloud monitoring challenge 1: sampling transactions

Many APM products sample. They don’t capture all transactions for all users all the time. When you use a product that samples in a large scale cloud environment, you capture a smaller percentage of transactions. The more concurrent users you have, and the more transactions your system processes, the smaller the percentage of transactions those APM products capture. For example, sampling one transaction every second may sound like a pretty good rate until you realize that it yields only 86,400 samples a day, which is less than a tenth of your transactions if you’re seeing a million transactions a day. For many of our customers, daily transactions number in the billions, so you can see how quickly this degrades…
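The arithmetic behind the sampling example above can be checked with a short sketch. It assumes a sampler that keeps at most one transaction per second (86,400 per day); the function name `sampled_fraction` and the daily volumes are illustrative and do not describe any particular APM product.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sampled_fraction(transactions_per_day: int, samples_per_second: float = 1.0) -> float:
    """Fraction of daily transactions captured by a fixed-rate sampler.

    Assumption: the sampler keeps at most `samples_per_second` transactions
    each second, so it can never capture more than the total volume.
    """
    if transactions_per_day <= 0:
        raise ValueError("transactions_per_day must be positive")
    captured = min(transactions_per_day, samples_per_second * SECONDS_PER_DAY)
    return captured / transactions_per_day


if __name__ == "__main__":
    # Hypothetical daily volumes: one million and one billion transactions.
    for volume in (1_000_000, 1_000_000_000):
        print(f"{volume:>13,} tx/day -> {sampled_fraction(volume):.4%} captured")
```

Under these assumptions, coverage is about 8.6% at a million transactions a day and falls below 0.01% at a billion.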

Sampling impacts the accuracy of data and calculations, not to mention the increased risk of missing very important transactions. What if you capture the $37 transaction and miss the $37,000 transaction? How do you scale when every transaction counts and every transaction is important?

SteelCentral AppInternals captures ALL transactions for ALL users ALL the time! With SteelCentral AppInternals all calculations are based on a complete data set including all transactions, not just a sample. When looking for a specific user’s transaction you will find it in our big data, and our natural language search capabilities help you easily sift through the data to find those transactions quickly!

APM and cloud monitoring challenge 2: size of monitored environment

To monitor applications in a cloud environment your APM solution needs to be able to monitor the large number of objects in those applications. In some cases, the sheer number of monitored objects is something that can bring down a monitoring solution. Some monitoring solutions choose to use SaaS for the larger size environments and use a totally different on-premises solution for the smaller environments.

In many cases APM solutions will be configured to monitor almost everything in the smaller development environment and settle for monitoring only a subset of the objects running in production. When an outage inevitably occurs, the typical process is for APM admins to add more objects to be monitored so that they can find the root cause in the next outage. This makes the investigation process longer and more stressful! A common alternative is to divide the monitored environment among multiple monitoring servers. This is a good solution in some cases, but wouldn’t it be nice if your monitoring solution could scale to support growth in your business and a large application environment with a single analysis server?

SteelCentral AppInternals runs both on-premises and as a SaaS. In a recent release, we introduced a multi-system cluster deployment of the analysis server to support the demand for monitoring large scale environments.

clustered architecture

With the new clustered architecture introduced last month, we have increased AppInternals scalability by an order of magnitude—a single analysis server can now support tens of thousands of agents capturing many billions of transactions per day.

The clustered approach distributes processing across multiple worker processes and increases the ability to store metrics, analyze transactions, and deliver performance data. Some of the worker processes that are running in a cluster include:

  • Controller Worker: stores metrics, transaction traces, and configuration data.
  • Parser Worker: parses and analyzes transaction trace files.
  • Indexer Worker: indexes transaction segments and stitches them into end-to-end transactions.
  • UI Worker: hosts the analysis server web interface used to view performance data and perform configuration tasks.
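To make the Indexer Worker's stitching step more concrete, here is a minimal sketch of grouping transaction segments into end-to-end transactions. It is not AppInternals code: the `Segment` fields, the shared `transaction_id`, and the `stitch` and `end_to_end_ms` functions are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One hop of a transaction as seen by a single agent (hypothetical shape)."""
    transaction_id: str  # assumed correlation ID shared by every hop
    tier: str            # e.g. "web", "app", "db"
    start_ms: int
    end_ms: int


def stitch(segments):
    """Group segments by transaction ID and order each group by start time."""
    by_id = defaultdict(list)
    for seg in segments:
        by_id[seg.transaction_id].append(seg)
    return {tid: sorted(segs, key=lambda s: s.start_ms) for tid, segs in by_id.items()}


def end_to_end_ms(stitched_segments):
    """Wall-clock duration from the earliest start to the latest end."""
    starts = [s.start_ms for s in stitched_segments]
    ends = [s.end_ms for s in stitched_segments]
    return max(ends) - min(starts)
```

A real indexer would also have to handle late or missing segments and clock skew between hosts, which this sketch ignores.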

APM and cloud monitoring challenge 3: architecture complexity

Understanding the relationships between components in a system is very important for monitoring and troubleshooting problems. Simplistic monitoring of hosts, JVMs, instances, or classes, without an understanding of the relationships between them, will prove ineffective when troubleshooting problems.

What are the components in your application? With legacy applications the answer used to be simpler, but with cloud applications it is more challenging. Your application may include user code, company infrastructure, and third-party libraries. Your application may also be calling external services you were not even aware of. If your APM solution does not automatically discover components and instrument the code, how will it fully understand your application and its complexity? How do you know what to troubleshoot and what is causing issues? Some solutions settle for simple resource monitoring; others may get information from a central administrative interface and present it to the user. In some cases the display is a simple “who is talking to who” map, but is looking at that “spaghetti” good enough?

SteelCentral AppInternals auto-discovers the relationships between application components. It automatically instruments the code and discovers the relationships between transactions and the time consumed in code, SQL, or external calls. AppInternals uses an innovative visualization we call Performance Graph to simplify the complexity of the architecture. In the picture below, the entry point to the transactions on the left may be many hops from the code/SQL on the right. The Performance Graph abstracts away that complexity and presents the direct connection between the most time-consuming business transactions and the source of the time consumed. The Performance Graph gives app owners the ability to quickly see where to invest valuable dev resources to fix or optimize the applications, focusing efforts on what ultimately matters most to the business.

APM and Cloud Monitoring, application performance management
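The idea of connecting an entry point directly to the place where time is consumed can be illustrated with a small sketch. This is not how the Performance Graph is implemented; the `direct_edges` function and the shape of its input (a call path plus the time spent at the end of that path) are assumptions chosen to show the principle of collapsing intermediate hops.

```python
from collections import defaultdict


def direct_edges(call_paths):
    """Collapse multi-hop call paths into (entry point, time sink) edges.

    Each item in `call_paths` is (path, self_time_ms), where `path` lists
    component names from the entry-point transaction to the code or SQL
    that consumed the time (a hypothetical input format).
    Returns ((entry, sink), total_ms) pairs, largest total first.
    """
    totals = defaultdict(float)
    for path, self_time_ms in call_paths:
        if not path:
            continue  # nothing to attribute this time to
        totals[(path[0], path[-1])] += self_time_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Sorting by total time puts the most expensive transaction-to-source connections first, however many hops separated them in the original paths.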

Ready to start monitoring your cloud application? Try out SteelCentral AppInternals today!
