The Hidden Costs of On-Call: False Alarms

The video of my LISA17 talk is posted on YouTube.

Abstract:

On-call teams, postmortems, and the costs of downtime are well-covered topics in DevOps. What isn’t discussed is the cost of false alarms in your alerting. This noise hinders the team’s ability to handle true issues effectively. What are these hidden costs, and how do you eliminate false alarms?

While you’re at LISA17, how many monitoring emails do you expect to receive? 50? 100? How many of those need someone’s intervention? Odds are that very few of them will send you off into a corner with your laptop to fix something critical.

Noisy monitoring-system defaults and untuned alerts barrage us with information that isn’t necessary. Those false alerts have a cost, even if it’s not directly attributable to payroll. We’ll walk through some of these costs, their dollar impact on companies, and strategies to reduce the false alarms.

Talk slides:

If you would like to read more about monitoring and on-call, you may enjoy these posts:

Citations:

Using Fault-Tree Analysis To Reduce Failures in Software

Fault tree analysis is a top-down analysis of an undesired system state to determine the best ways to reduce risk. It uses Boolean logic to combine contributing events, giving overall probabilities of failure. Fault trees are used primarily in high-risk industries such as aerospace, nuclear, and chemical. However, the technique can also be used in software to review and harden systems against failures.

Walking through a fault tree

Suppose we’re concerned with the uptime of our Software as a Service (SaaS) offering, and we want to run a formal analysis to understand why the site goes down.

We’ll start from the top of the analysis with our end state: Site unavailable.

What might be contributing causes to this outage?

  • CDN not delivering assets
  • Frontend server errors (bad HTML, application crashes, etc.)
  • Backend server errors (database lookup failures, etc.)

To keep this example simple, we’ll assume that these failures are independent: the CDN serves static content like images and JavaScript, not pages rendered by the frontend servers.

We can recognize that any one of these conditions can be true, which means they contribute to site unavailability through an OR operation. Events can be combined with standard Boolean logic: AND, OR, XOR. There are also some combination functions you may not be familiar with, such as Priority AND, an AND where the failures must occur in a given order for the output event to occur.

Wikipedia has a reference with the symbols used.

[Figure: our initial fault tree]

In a formal fault tree analysis, we would assign probabilities to each event. The overall uptime of our SaaS is 2 9s (99%). This means our system is in a failure state 1% of the time.

Our CDN is “always” up according to our records, but the SLA is 4 9s (99.99%). We can treat the SLA as our probability of success. The chance of failure is then 0.01%.

We can measure the uptime and error rates of the frontend servers. Suppose that our frontend servers have 99.97% uptime. That is to say, only 0.03% of the time is the frontend server at fault: a code bug produces bad output, the uwsgi or nginx process dies, etc.

If we had no further data measured or known about our systems, we could estimate the probability of failure of the backend. The site is only up when every component is up, so for that AND combination we’d multiply the probabilities: P(Uptime) = P(CDN) * P(Frontend) * P(Backend). The failure side of the tree is an OR combination, and because the individual failure rates are small (at most ~1%), the overall probability of failure is approximately the sum of the individual failure probabilities:

P(Failure) ≈ P(CDN failure) + P(Frontend failure) + P(Backend failure)

This gives us a probability of failure of the backend of 0.96% (1% – 0.01% – 0.03%), or an uptime of 99.04%. If we had measured values for the backend uptime which did not match this estimate, we should re-evaluate the assumptions made (e.g. OR vs. AND), verify the correctness of the failure modeling used, or consider additional causes of failure not initially identified.
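
For a quick sanity check on that arithmetic, here is a minimal Python sketch of the tree above. The probabilities come from this example; the or_gate helper and the variable names are just for illustration.

def or_gate(*failure_probs):
    """Probability that at least one independent input event occurs."""
    p_none = 1.0
    for p in failure_probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# Known values from the example
p_site = 0.01        # 2 9s of uptime -> 1% failure
p_cdn = 0.0001       # 99.99% SLA    -> 0.01% failure
p_frontend = 0.0003  # 99.97% uptime -> 0.03% failure

# Exact backend failure rate implied by the OR gate:
# 1 - p_site = (1 - p_cdn) * (1 - p_frontend) * (1 - p_backend)
p_backend = 1 - (1 - p_site) / ((1 - p_cdn) * (1 - p_frontend))

# Small-probability approximation used above
p_backend_approx = p_site - p_cdn - p_frontend

print("backend failure (exact):  %.4f%%" % (p_backend * 100))
print("backend failure (approx): %.4f%%" % (p_backend_approx * 100))
print("site failure check:       %.4f%%" % (or_gate(p_cdn, p_frontend, p_backend) * 100))

Running it confirms the roughly 0.96% backend failure rate estimated above, and that the three failure probabilities combine back into the 1% overall failure rate.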

[Figure: the fault tree with the failure percentages noted]

Where does this backend failure rate come from? Repeat the process to identify the component failures, extending the tree. Since the application backend contributes the most to the SaaS downtime, we should start tackling the problems there. We would identify several potential solutions and weigh the return on investment (ROI) of each as it relates to our uptime concerns.

Fault trees in practice

Recognize that fault trees are a tool from high-risk industries, where “catastrophic explosion” or “toxic vapor cloud” are unexaggerated outcomes to prevent alongside “production halt”. They’re strict and formal so the documents can be reviewed and approved by multiple individuals.

Fault trees don’t need to trace failures through external systems. The CDN goes down? Maybe that’s due to ISP problems or a problem at the hosting provider. There is no need to trace that back further and break it down into component failures.

Fault trees for software may not require failure rates. Fault trees in high-risk industries are largely based in the physical realm — the parts purchased and installed have spec sheets with MTBF values because they’ve been tested under a variety of conditions by the manufacturer in order to model failure rates. Software, however, may be under active use until the first failure occurs, so there is no model available. In this case, identifying and charting potential causes of failures is more important than determining the probability of a given failure.

Treating fault trees as an exercise in mapping dependencies will help identify single points of failure, as well as common dependencies and co-dependencies not initially recognized. Software is so easy to change that it’s trivial to introduce additional dependencies by pulling more information into applications.

Further Reading

Fault Tree Handbook (NUREG-0492) from the US Nuclear Regulatory Commission.

For a bottom-up analysis of common dependencies, check out event tree analysis. It was developed as an alternative to fault tree analysis because the fault trees of some systems become too large.

Overcoming Monitoring Alarm Flood

You’ve most likely had 10, 20, 50, or even more alerts hit your inbox or pager in a short span of time or all at once. What do you call this situation?

It turns out there’s a name for this influx of alerts: “alarm flood”.

“Alarm flood” originates in the power and process industries, but the concept can be applied to any industry. Alarm flood deals with the interaction between humans and computers: specifically, more automated alerts than the human element can process, interpret, and correctly respond to. It is the result of multiple small changes, redesigns, and additions to a system over time: why would you not want to “let the operator know” that a system has changed states?

Alarm flood has been discussed in those industries for at least the past 20 years, but it was formally defined in 2009 in the ANSI/ISA-18.2 Alarm Management Standard as 10 or more alarms in any 10-minute period, per operator.

In tech, the “operator” will be the person on call. In a smaller operation with only one or two engineers on call at a single time, any significant event could turn into a flood, making it difficult for the engineer to identify and address the root causes. How do we fix this flood state and provide better information to our engineers?
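
As a first step, it helps to measure whether you’re in a flood state at all. Here is a minimal Python sketch of that 10-alarms-in-10-minutes threshold; the alarm timestamps and the operator name are hypothetical.

from collections import defaultdict, deque
from datetime import datetime, timedelta

FLOOD_THRESHOLD = 10                  # alarms...
FLOOD_WINDOW = timedelta(minutes=10)  # ...within any 10-minute period, per operator

def find_floods(alarms):
    """alarms: iterable of (timestamp, operator) tuples.
    Returns (operator, window_start, window_end, count) for each flood state seen."""
    floods = []
    recent = defaultdict(deque)
    for ts, operator in sorted(alarms):
        window = recent[operator]
        window.append(ts)
        # Drop alarms that have fallen out of the 10-minute window
        while ts - window[0] > FLOOD_WINDOW:
            window.popleft()
        if len(window) >= FLOOD_THRESHOLD:
            floods.append((operator, window[0], ts, len(window)))
    return floods

# Hypothetical example: 12 alarms in five minutes for one on-call engineer
alarms = [(datetime(2017, 11, 1, 3, 0) + timedelta(seconds=25 * i), "oncall-primary")
          for i in range(12)]
for operator, start, end, count in find_floods(alarms):
    print("%s: %d alarms between %s and %s" % (operator, count, start.time(), end.time()))

In practice you would feed this from your alerting system’s history rather than a hand-built list, and use the flood windows it finds to decide which alerts to consolidate or suppress.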

Continue reading “Overcoming Monitoring Alarm Flood”

Reducing the Stresses of On-Call

Being on-call is stressful. It feels like the future of the company–or at the very least your job–depends on your vigilance. When will the pager alert come? How bad will it be?

Where is this stress coming from?

  • Urgency – Typically the on-call engineer only has a certain amount of time to respond to an incident. The idea of being late to respond is stressful for many. There’s also an implied urgency in that down = bad, so services should be restored as quickly as possible.
  • Uncertainty – The failures could come at any time. Want to run to the store? Better be quick. Want to see a movie? You might have to miss the ending. Good luck sleeping, the alert could come as soon as your head hits the pillow or in the middle of the night or become your new alarm for tomorrow morning.
  • Duration – Long on-call rotations wear you down from the knowledge that there’s still more to come. Too short and you have to check your calendar for everything to figure out if you’re on-call that day. Frequency has a role too — with only 1 or 2 people in a rotation your “week of on-call” quickly turns into half the year or the entire year.
  • Expectations – Either internally or externally, it’s easy to be pressured by the expectations of on-call. For the duration of the rotation it’s your job to fix what’s broken, so if the environment is broken, you must not be doing your job.

This is all on top of your normal job stress – responsibilities, deadlines, work environment, office politics. And we haven’t touched on stress from life outside of work.

What are the impacts of this stress?

Short answer? Mistakes. Health issues.

The National Institute for Occupational Safety and Health (NIOSH) has a great publication about the impact of stress at work. You can find it on the CDC website here, DHHS (NIOSH) Publication No. 99-101. I’ve pulled a few quotes from that document.

Mistakes

The St. Paul Fire and Marine Insurance Company conducted several studies on the effects of stress prevention programs in hospital settings.

[…]

In one study, the frequency of medication errors declined by 50% after prevention activities were implemented in a 700-bed hospital. In a second study, there was a 70% reduction in malpractice claims in 22 hospitals that implemented stress prevention activities. In contrast, there was no reduction in claims in a matched group of 22 hospitals that did not implement stress prevention activities.
—Journal of Applied Psychology

You’re thinking: “Great, they made a reduction in malpractice claims and medication errors, but I’m in tech, this doesn’t relate”.

It does relate. People make mistakes.

  • GitLab – Admin ran command on production instead of secondary database, losing 6 hours of data.
  • Reddit – 1.5 hours of downtime during planned migration because Puppet wasn’t disabled.
  • Amazon S3 – Command entered incorrectly by admin, resulting in 2 hours of downtime. Even Amazon’s status page broke in this outage.

To be clear: I’m not blaming these companies or employees involved for their outages. It’s great that they have publicly available postmortems that we can all learn from.

I’m also not claiming stress reduction or prevention would have made a difference in these particular cases. People are going to make mistakes regardless, but why open yourself up to the possibility of having twice as many, when those could be reduced and you’d have a better workplace to boot?

Health Issues

Health care expenditures are nearly 50% greater for workers who report high levels of stress.
—Journal of Occupational and Environmental Medicine

[W]orkers who must take time off work because of stress, anxiety, or a related disorder will be off the job for about 20 days.
—Bureau of Labor Statistics

If the business can’t afford downtime, can they afford higher health care premiums, or to miss an employee for 20 days?

How to reduce stress

Going back to the hospital studies (67 hospitals, 12,000 individuals), what did they do to reduce stress?

  1. Employee and management education on job stress
  2. Changes in hospital policies and procedures to reduce organizational sources of stress
  3. Establishment of employee assistance programs (specifically help and counseling for work-related and personal problems)

Employee assistance programs, while helpful, are most likely not something you can implement if you don’t have a role in employee benefits.

Reading this post and the NIOSH report on stress, and then sending them to your friends, coworkers, and managers, will help educate others about on-call stress.

How do we in tech reduce organizational sources of stress?

Blameless Postmortems. Etsy’s blog explains. Practicing blameless postmortems helps reduce the stress the team experiences by lowering external expectations, because the team culture isn’t to attack mistakes. The person on-call knows this, which should ease their own internal expectations as well.

Proper on-call coverage. On-call rotations where you’re responsible for 24-hour coverage should ideally be a week long — rotating more frequently leads to too many handoffs and is difficult to plan around, while longer durations impact the employee’s life outside of work. If you have at least 4 people in the rotation, you’ll be on-call no more than one week a month. Allowing the rotation to have proper fall-back coverage, for example having a secondary as backup or letting people get coverage for a few hours or a day, gives employees the flexibility to do meaningful things in their lives outside of work.

Avoid too many cooks in the kitchen during outages. If people are hovering over the on-call individual or team trying to solve the issue, asking for constant updates, this adds to the stress of urgency and expectations. Organizationally, you can limit this. PagerDuty has an excellent on-call response guide, which appears to be modeled after the Incident Command System and National Incident Management System, if you want some ideas. Schedule check-ins for updates, giving time for work to get done.

No on-call heroes. One person’s “heroic” effort to restore the environment is another person’s worst nightmare. They don’t want to spend hours solving the problem, or to be the only one working on an outage where everyone is relying on them. There’s no satisfaction in being a “hero”, only dread. When they see others being praised for those actions, they want to run the other way. Be mindful of whether you’re promoting this sort of behavior in your team’s culture or simply acknowledging employee contributions. Strive to reduce the need for these situations in the future.

Improve the systems. Minimize the uncertainty of being on-call by making it reasonable to be on-call. Set a quantitative goal to reduce the number of after-hours pages within your organization. Some of the changes will be easy to make, others may be more involved. But if the entire team is striving toward that goal, progress will be made and the whole team will benefit from it.

Anyone can help drive change in their team and organization to reduce sources of stress. Even small changes will help make the work environment more enjoyable, so do your part where you can.

Using virtualenv and PYTHONPATH with Datadog

Datadog is a great service I’ve used for monitoring. Since the agent is Python-based, it’s very extensible through a collection of pip-installable libraries, but the documentation is limited on how to handle these libraries.

If you use the provided datadog-agent package, Datadog comes with its own set of embedded applications to monitor your server, including Python for the agent, supervisord to manage the Datadog processes, and pip. Since this is all just Python, surely we can take advantage of that. Can’t we import our own custom libraries in our custom checks? Yes we can.
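
As a taste of what the full post covers, here is a minimal sketch of the idea for an agent 5.x-style custom check. The virtualenv path, the check itself, and the use of requests are hypothetical; the point is simply that sys.path can be extended before importing the extra library.

import sys

# Hypothetical virtualenv holding this check's extra dependencies
VENV_SITE_PACKAGES = "/opt/my-check-venv/lib/python2.7/site-packages"
if VENV_SITE_PACKAGES not in sys.path:
    sys.path.insert(0, VENV_SITE_PACKAGES)

import requests  # now importable from the virtualenv rather than the embedded agent Python

from checks import AgentCheck  # custom check base class shipped with agent 5.x


class HTTPLatencyCheck(AgentCheck):
    def check(self, instance):
        url = instance.get("url", "http://localhost:8080/health")
        response = requests.get(url, timeout=5)
        # Report the request latency as a gauge, tagged with the URL
        self.gauge("custom.http.latency_seconds",
                   response.elapsed.total_seconds(),
                   tags=["url:%s" % url])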

Continue reading “Using virtualenv and PYTHONPATH with Datadog”

Tiered Storage With Elasticsearch

Elasticsearch allows you to set up heterogeneous clusters, that is, nodes with different configurations within the same cluster. Elastic (the company) refers to this architecture as “Hot-Warm”, but it’s called tiered storage if you come from a storage background.

The canonical example is that you have a bunch of data you want to keep online and able to query, but it becomes less relevant over time. You want to cut costs, so you keep your “hot” data that is written and/or read frequently on more expensive nodes, most likely with SSDs, and your “warm” data that is accessed less frequently on less expensive nodes, most likely with spinning disks. But it doesn’t end there: this architecture can be used and extended in different ways.
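
To make that concrete, here is a minimal sketch of steering an index between tiers, assuming each node has been tagged in elasticsearch.yml with a custom attribute (for example node.attr.box_type: hot on the SSD nodes and warm on the spinning-disk nodes). The attribute name, index names, and localhost endpoint are assumptions for illustration.

import requests

ES = "http://localhost:9200"

def move_index_to_tier(index, tier):
    """Restrict the index's shard allocation to nodes whose box_type
    attribute matches the given tier ("hot" or "warm")."""
    resp = requests.put(
        "%s/%s/_settings" % (ES, index),
        json={"index.routing.allocation.require.box_type": tier},
    )
    resp.raise_for_status()
    return resp.json()

# New indices land on the hot (SSD) tier; older ones age out to warm.
move_index_to_tier("logs-2017.11.01", "hot")
move_index_to_tier("logs-2017.10.25", "warm")

Changing the allocation setting only tells Elasticsearch where the index’s shards may live; the cluster relocates them to the matching nodes in the background.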

Continue reading “Tiered Storage With Elasticsearch”

Take That Vacation: Eliminate Alerts Dragging You Back to the Office

I authored this as part of SysAdvent, which posts one system administration-related post each day in December, ending on the 25th. You can find the original posted here: http://sysadvent.blogspot.com/2016/12/day-15-take-that-vacation-eliminate.html

It’s mid-afternoon and you’ve just sat down for that holiday meal with your family and friends. Your phone goes off and you look at the number. Work, again.

Before you even read the text or answer the call with the robotic voice telling you about the latest problem, you’re wondering to yourself, “How long will it take?” Your relatives are only in town for another day or two before you have to take them to the airport. What if it goes off again later? A holiday potentially ruined.

You read the text. Maybe it’s a false alarm. Maybe it’s not. Either way you’re out of the moment, worrying about work and whether things are going to break over the holidays.

Don’t Be Your Own Grinch

It’s possible to engineer yourself and your environment for success. Continue reading “Take That Vacation: Eliminate Alerts Dragging You Back to the Office”