When a Bad Day for Application Performance Happens to Good People

Aternity December 17, 2014


We’ve heard the big stories about bad days for application performance, like Best Buy’s challenge on Black Friday. I discussed similar issues in this recent post. And we’re still hearing about Healthcare.gov’s performance problems.

The cost of application downtime can be crippling, even if it lasts just seconds, much like a kid spilling their ice cream cone.

OK, realistically, having a bad application day is quite different from spilling an ice cream cone.

Imagine yourself being in each of the following three scenarios:

The delivery driver

You’ve got a mobile device that connects over a virtual private network (VPN) to an app that gives you status updates on your pickup and delivery times en route through a crowded suburb of a big city. Each day, you must pick up stuff and deliver stuff based on what the app tells you. When the app is running perfectly, you typically make four deliveries and two pickups per hour. When the app or VPN is down or even slow, you have to stop and wait. Sometimes you start doing something else altogether, like getting something to eat nearby. No, not ice cream 🙂

When the app is down or slow for you, it tends to also be slow for about 500 other drivers in your region. The number of deliveries plummets; guaranteed shipping dates are missed, and those orders need to be refunded. You end up eating way too much food and return to your delivery truck feeling frustrated.

Problem: An app server's CPU usage spiked because of a scheduling error from one of your engineering teams.

Server Health

The online shopper

Finally, you have a moment in your busy schedule to order some office chairs for your new business. You go to your favorite online shopping site and start looking at the chairs one by one, then decide it would be better to narrow down your search. When you click the filter button to sort by price from low to high, you get a list of seemingly unrelated stuff, and then the page hangs. You figure it might just be momentary, so you click over to another tab and update your Facebook status.

You go back to the page, and it’s still hung. Instead of waiting for this to resolve, you just click the back button and start searching on another site. Because there are only a few major sites that sell office chairs, you realize that this could be happening to thousands of folks at the same time, and that is just in the last hour. Imagine if this lasted for hours, like it did for Best Buy.

Problem: A new developer for the online store application recently changed the database call to the SQL server that updates the price filter.

SQL Error

The mortgage broker

You’re a mortgage broker for a popular online mortgage company and you’re sitting with your colleagues. You’ve just gotten a fresh batch of leads from an inside sales rep. You go to the app where the leads are stored with the initial contact information. You log on successfully and then start to make the calls. Your first prospect has an awesome credit profile, and you’re close to entering some of the final details that will go to underwriting. Then, you guessed it, something goes wrong, and you politely ask your customer to hold briefly while you update the information. The app does not respond.

Just as that happens, you hear a big sigh from other members of your team and a couple of expletives. The app is down, and the IT supervisor comes to explain that one of the app servers crashed and will be back up in about an hour. Your customer needs to hang up.

Problem: IT staff just rolled out a new mobile app without realizing how much traffic it would add to one of the app tiers of this application, and that tier ran out of disk space.

Disk Error

Why visibility is the key to fixing a bad-day scenario

In all three scenarios, the loss of revenue is evident. When you multiply it by the potential number of other users having the same issue, you can quickly see how seconds, minutes, and hours of downtime or slowness can have a huge impact. In addition, each of these bad-day scenarios includes effects that are less tangible but worth noting: the frustration of a loyal customer, employees doing something else, or lost trust in your IT staff.
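To make that multiplication concrete, here is a rough back-of-the-envelope sketch in Python using the delivery-driver numbers from above (about 500 drivers, four deliveries per hour each). The function name `lost_deliveries` is made up for illustration, and the sketch assumes every affected driver is completely idle for the whole outage, which is a simplification.

```python
def lost_deliveries(drivers: int, deliveries_per_hour: float, outage_minutes: float) -> float:
    """Estimate deliveries missed while every affected driver sits idle.

    Assumption (not from any real system): all drivers are fully blocked
    for the entire outage and make no deliveries during that time.
    """
    return drivers * deliveries_per_hour * (outage_minutes / 60)


# Figures from the delivery-driver scenario: ~500 drivers, 4 deliveries/hour each.
for minutes in (1, 15, 60):
    missed = lost_deliveries(500, 4, minutes)
    print(f"{minutes:>2}-minute outage: ~{missed:.0f} deliveries missed")
```

Under those assumptions, a single hour of downtime works out to roughly 2,000 missed deliveries across the region, before counting pickups or refunds.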

Visibility into where the performance problem is and who is responsible enables fast diagnosis and troubleshooting. This is where Riverbed SteelCentral comes in, with the most complete visibility of your applications from end to end. We can also help you with your Performance Engineering strategies with both APM and NPM professional services.

Many of our customers now realize it’s not just about packets. More complete visibility into application page views, page times, and related analytics of server health and application tiers gives you a more comprehensive way to shorten your mean time to resolution. Check out this dashboard that provides visibility that is actionable right away.
