Sustainable On-Call

I saw a tweet by Charity Majors that got me thinking.

On-call is stressful, and it is overwhelmingly the negative kind of stress (distress) rather than the positive kind (eustress). I’ve written about the stresses of on-call before: urgency, uncertainty, duration, and expectations. We all know that distress can contribute to burnout, but individually those four factors are fairly benign. Expectations are part of any job. People on oil rigs work 12+ hours a day for two weeks straight. A number of jobs, such as first responder or doctor, involve uncertain and urgent tasks. If these components can be managed, why then is on-call so miserable?

Digging deeper, I came to the conclusion that the worst parts of on-call revolve around frequency and volume. I believe everything we do to improve on-call is really an attempt to attack these two causes. Why do these two factors have such an impact on on-call, and how can they be mitigated?

Continue reading “Sustainable On-Call”

The Hidden Costs of On-Call: False Alarms

The video of my LISA17 talk is posted on YouTube.

Abstract:

On-call teams, postmortems, and the costs of downtime are well-covered topics in DevOps. What’s rarely spoken of is the cost of false alarms in your alerting. That noise hinders the team’s ability to effectively handle true issues. What are these hidden costs, and how do you eliminate false alarms?

While you’re at LISA17, how many monitoring emails do you expect to receive? 50? 100? How many of those actually need someone’s intervention? Odds are you won’t need to go off into a corner with your laptop to fix something critical for every one of those emails.

Noisy monitoring system defaults and untuned alerts barrage us with information that isn’t necessary. Those false alerts have a cost, even if it isn’t directly attributable to payroll. We’ll walk through some of these costs, their dollar impact on companies, and strategies to reduce false alarms.
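As a quick illustration of one such strategy, here’s a minimal sketch in Python, not tied to any particular monitoring product (the check name and threshold are made up): only page when a condition has failed several checks in a row, so a single transient blip never wakes anyone up.

```python
from collections import defaultdict

# Hypothetical tuning strategy: require N consecutive failing checks before
# paging, so one transient blip doesn't become a 3 a.m. page.
CONSECUTIVE_FAILURES_REQUIRED = 3

failure_streaks = defaultdict(int)

def should_page(check_name, is_failing):
    """Return True only when the check has failed enough times in a row."""
    if not is_failing:
        failure_streaks[check_name] = 0  # a healthy result resets the streak
        return False
    failure_streaks[check_name] += 1
    return failure_streaks[check_name] >= CONSECUTIVE_FAILURES_REQUIRED

# Two failing samples followed by a recovery stay silent; a sustained failure pages.
samples = [True, True, False, True, True, True]
print([should_page("cpu.load.high", s) for s in samples])
# [False, False, False, False, False, True]
```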

Talk slides:

If you would like to read more about monitoring and on-call, you may enjoy these posts:

Overcoming Monitoring Alarm Flood

You’ve most likely had 10, 20, 50, or even more alerts hit your inbox or pager in a short span of time or all at once. What do you call this situation?

It turns out there’s a name for this influx of alerts: “alarm flood”.

“Alarm flood” originates in the power and process industries, but the concept applies to any industry. Alarm flood deals with the interaction between humans and computers: specifically, more automated alerts than the human element can process, interpret, and correctly respond to. It is the result of many small changes, redesigns, and additions to a system over time: after all, why would you not want to “let the operator know” that a system has changed state?

Alarm flood has been discussed in those industries for at least the past 20 years, but it was formally defined in 2009 in the ANSI/ISA 18.2 Alarm Management Standard as 10 or more alarms in any 10-minute period, per operator.
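To make that definition concrete, here is a small sketch in plain Python (the alarm timestamps are made up) that flags a flood whenever one operator receives ten or more alarms inside any rolling ten-minute window.

```python
from collections import deque
from datetime import datetime, timedelta

FLOOD_THRESHOLD = 10                  # alarms per operator...
FLOOD_WINDOW = timedelta(minutes=10)  # ...within any 10-minute window

def is_alarm_flood(alarm_times):
    """Return True if any rolling 10-minute window holds 10+ alarms for one operator."""
    window = deque()
    for t in sorted(alarm_times):
        window.append(t)
        # Drop alarms that have aged out of the 10-minute window ending at t.
        while t - window[0] > FLOOD_WINDOW:
            window.popleft()
        if len(window) >= FLOOD_THRESHOLD:
            return True
    return False

# Twelve alarms fired 15 seconds apart is a flood; one alarm an hour is not.
burst = [datetime(2017, 10, 31, 3, 0) + timedelta(seconds=15 * i) for i in range(12)]
print(is_alarm_flood(burst))  # True
```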

In tech, the “operator” is the person on call. In a smaller operation with only one or two engineers on call at a time, any significant event could turn into a flood, making it difficult for the engineer to identify and address the root causes. How do we fix this flood state and provide better information to our engineers?
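One possible way to provide that better information, sketched below with made-up alert data (this is an illustration, not what the full post prescribes), is to collapse a burst into one summary per affected service, so the engineer sees a handful of incidents instead of dozens of pages.

```python
from collections import defaultdict

def summarize_flood(alerts):
    """Collapse a burst of alerts into one summary line per service."""
    by_service = defaultdict(list)
    for alert in alerts:
        by_service[alert["service"]].append(alert)
    return [
        "%s: %d alert(s) (%s)"
        % (service, len(group), ", ".join(sorted({a["check"] for a in group})))
        for service, group in by_service.items()
    ]

alerts = [
    {"service": "db01", "check": "disk.full"},
    {"service": "db01", "check": "replication.lag"},
    {"service": "web01", "check": "http.5xx"},
    {"service": "db01", "check": "disk.full"},
]
for line in summarize_flood(alerts):
    print(line)
# db01: 3 alert(s) (disk.full, replication.lag)
# web01: 1 alert(s) (http.5xx)
```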

Continue reading “Overcoming Monitoring Alarm Flood”

Using virtualenv and PYTHONPATH with Datadog

Datadog is a great service I’ve used for monitoring. Since the agent is Python-based, it’s very extensible through a collection of pip-installable libraries, but the documentation is limited on how to handle these libraries.

If you use the provided datadog-agent package, Datadog ships with its own set of embedded applications to monitor your server, including Python for the agent, supervisord to manage the Datadog processes, and pip. Since this is all just Python, surely we can build on it. Can’t we import our own custom libraries in our custom checks? Yes, we can.
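Here is a rough sketch of the idea (not necessarily how the full post wires it up): a custom check dropped into checks.d can extend sys.path to point at a separately managed virtualenv and then import whatever pip has installed there. The virtualenv path, the boto3/SQS example, and the metric name are placeholders chosen for illustration; the base class is the agent v5-style AgentCheck interface.

```python
import sys

# Point the agent's embedded interpreter at a separately managed virtualenv.
# (Placeholder path; adjust to wherever your virtualenv actually lives.)
sys.path.append("/opt/dd-venv/lib/python2.7/site-packages")

import boto3  # illustrative pip-installed library living in that virtualenv

from checks import AgentCheck  # agent v5-style custom check base class


class QueueDepthCheck(AgentCheck):
    """Report the depth of an SQS queue using a library the stock agent doesn't ship."""

    def check(self, instance):
        sqs = boto3.client("sqs", region_name=instance.get("region", "us-east-1"))
        attrs = sqs.get_queue_attributes(
            QueueUrl=instance["queue_url"],
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
        self.gauge("custom.sqs.queue_depth", depth,
                   tags=["queue:%s" % instance["queue_url"]])
```

Exporting PYTHONPATH for the agent’s processes is the other obvious way to get the same import path without touching sys.path inside the check itself.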

Continue reading “Using virtualenv and PYTHONPATH with Datadog”