Sustainable On-Call

I saw a tweet by Charity Majors that got me thinking.

On-call is stressful, and overwhelmingly the negative type of stress (distress) rather than the positive type (eustress). I’ve written before about the stresses of on-call: urgency, uncertainty, duration, and expectations. We all know that distress can contribute to burnout, but individually those four factors are fairly benign. Expectations are part of any job. People on oil rigs work 12+ hours a day for two weeks straight. A number of jobs involve uncertain and urgent tasks, such as first responders or doctors. If these components can be managed, why can on-call be so miserable?

Digging deeper, I came to the conclusion that the worst parts of on-call revolve around frequency and volume. I believe everything we do to improve on-call attacks these two causes. Why do these factors impact on-call, and how can they be mitigated?

Continue reading “Sustainable On-Call”

Book Review: Incident Management for Operations

I have an interest in bringing ideas from outside of the tech industry and seeing how they fit. After working with Kerim Satirli (@ksatirli) on my SysAdvent post about multiple root causes, he was kind enough to send me a book “Incident Management for Operations”. The book focuses on using the Incident Management System, pioneered in emergency services for fighting wildfires, in managing outages in tech.

“Incident Management for Operations” was authored by Rob Schnepp, Ron Vidal, and Chris Hawley of Blackrock 3 Partners. You can find the book on Amazon or Safari Books Online.

In A Nutshell

The authors have adapted the Incident Management System (IMS) for use in IT operations. IMS is a standardized, scalable method for incident response to facilitate coordination between responders. This translates nicely to organizations where separate departments or teams are responsible for different pieces of a business’s IT infrastructure, and multiple disciplines are required for incident resolution.

The book lays out the framework for IMS and includes examples of applying the framework to IT. Since implementation can vary in practice (alignment with DevOps, ITIL, etc.), the book stops short of prescribing how to set up organizations, but gives enough information to determine how your organization could adapt to IMS.

The authors provide a number of mnemonics such as “CAN” (Conditions, Actions, Needs), “STAR” (Size up, Triage, Act, Review), and “TIME” (Tone, Interaction, Management, Engagement) to aid in implementing IMS and effectively leading as an Incident Commander. If your organization implements IMS, I’d suggest making a quick reference card with these mnemonics to put on your ID badge holder in case you forget during a 3 a.m. incident.

Continue reading “Book Review: Incident Management for Operations”

Root Cause is Plural

Below is a copy of my post from Sysadvent 2017 (Day 3). I’d like to thank Kerim Satirli (@ksatirli) once again for his help in editing the post and improving it.

Root Cause is Plural

Post-mortems are an industry-standard process that happens after incidents and outages as a method of continuous learning and improvement. While the exact format varies from company to company, your post-mortem report typically addresses the Five W’s:

  • What happened?
  • Where did it happen?
  • Who was impacted by the incident?
  • When did problem and resolution events occur?
  • Why did the incident occur?

The first four questions are generally easy to answer. The question that takes the majority of the time is the why. To determine why the incident occurred requires investigative skills, critical thinking, and logical deductions. Sometimes determining the true why takes multiple incidents, as various fixes are attempted before the incident is resolved, but eventually a “root cause” is designated as the root of all the problems and the report is complete.

But if your “root cause” amounts to a single failure, you have stopped your process too soon.

Continue reading “Root Cause is Plural”

The Hidden Costs of On-Call: False Alarms

The video of my LISA17 talk is posted on YouTube.


On-call teams, postmortems, and the costs of downtime are well-covered topics in DevOps. What’s rarely discussed is the cost of false alarms in your alerting. This noise hinders the team’s ability to effectively handle true issues. What are these hidden costs, and how do you eliminate false alarms?

While you’re at LISA17, how many monitoring emails do you expect to receive? 50? 100? How many of those need someone’s intervention? Odds are you won’t need to go off into a corner with your laptop to fix something critical on all of those emails.

Noisy monitoring system defaults and un-tuned alerts barrage us with information that isn’t necessary. Those false alerts have a cost, even if it’s not directly attributable to payroll. We’ll walk through some of these costs, their dollar impacts on companies, and strategies to reduce the false alarms.

Talk slides:

If you would like to read more about monitoring and on-call, you may enjoy these posts:



Overcoming Monitoring Alarm Flood

You’ve most likely had 10, 20, 50, or even more alerts hit your inbox or pager in a short span of time or all at once. What do you call this situation?

It turns out, there’s a name for this influx of alerts–“alarm flood”.

“Alarm flood” originates in the power and process industries, but the concept can be applied to any industry. Alarm flood deals with the interaction between humans and computers–specifically more automated alerts than the human element can process, interpret, and correctly respond to. It is the result of multiple small changes, redesigns, and additions to a system over time: Why would you not want to “let the operator know” that a system has changed states?

Alarm flood has been discussed in those industries for at least the past 20 years, but it was formally defined in 2009 in the ANSI/ISA 18.2 Alarm Management Standard as 10 or more alarms in any 10-minute period, per operator.

In tech, the “operator” will be the person on call. In a smaller operation with only one or two engineers on call at a single time, any significant event could turn into a flood, making it difficult for the engineer to identify and address the root causes. How do we fix this flood state and provide better information to our engineers?
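The ANSI/ISA 18.2 definition above is simple enough to check programmatically against your own alert history. Here is a minimal sketch (the function name and input format are my own, not from any standard or library) that flags a flood if any 10-minute window contains 10 or more alarms:

```python
from collections import deque
from datetime import datetime, timedelta

FLOOD_THRESHOLD = 10                  # alarms per operator...
FLOOD_WINDOW = timedelta(minutes=10)  # ...in any 10-minute period

def is_alarm_flood(timestamps):
    """Return True if any 10-minute window holds 10+ alarms,
    per the ANSI/ISA 18.2 definition of alarm flood."""
    window = deque()
    for t in sorted(timestamps):
        window.append(t)
        # Drop alarms older than the window relative to the newest alarm.
        while t - window[0] > FLOOD_WINDOW:
            window.popleft()
        if len(window) >= FLOOD_THRESHOLD:
            return True
    return False
```

Feeding this your paging history per on-call engineer would tell you how often your rotation crosses the flood threshold.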

Continue reading “Overcoming Monitoring Alarm Flood”

Reducing the Stresses of On-Call

Being on-call is stressful. It feels like the future of the company–or at the very least your job–depends on your vigilance. When will the pager alert come? How bad will it be?

Where is this stress coming from?

  • Urgency – The on-call engineer typically has only a certain amount of time to respond to an incident. The idea of being late to respond is stressful for many. There’s also an implied urgency in that down = bad, so services should be restored as quickly as possible.
  • Uncertainty – The failures could come at any time. Want to run to the store? Better be quick. Want to see a movie? You might have to miss the ending. Good luck sleeping: the alert could come as soon as your head hits the pillow, in the middle of the night, or as your new alarm for tomorrow morning.
  • Duration – Long on-call rotations wear you down with the knowledge that there’s still more to come. Too short, and you have to check your calendar for everything to figure out whether you’re on-call that day. Frequency plays a role too: with only one or two people in a rotation, your “week of on-call” quickly turns into half the year or the entire year.
  • Expectations – Internally or externally, it’s easy to feel pressured by the expectations of on-call. During your rotation it’s your job to fix what’s broken, so if the environment is broken, you must not be doing your job.

This is all on top of your normal job stress – responsibilities, deadlines, work environment, office politics. And we haven’t touched on stress from life outside of work.

What are the impacts of this stress?

Short answer? Mistakes. Health issues.

The National Institute for Occupational Safety and Health (NIOSH) has a great publication about the impact of stress at work. You can find it on the CDC website here, DHHS (NIOSH) Publication No. 99-101. I’ve pulled a few quotes from that document.


The St. Paul Fire and Marine Insurance Company conducted several studies on the effects of stress prevention programs in hospital settings.


In one study, the frequency of medication errors declined by 50% after prevention activities were implemented in a 700-bed hospital. In a second study, there was a 70% reduction in malpractice claims in 22 hospitals that implemented stress prevention activities. In contrast, there was no reduction in claims in a matched group of 22 hospitals that did not implement stress prevention activities.
—Journal of Applied Psychology

You’re thinking: “Great, they reduced malpractice claims and medication errors, but I’m in tech; this doesn’t relate.”

It does relate. People make mistakes.

  • GitLab – Admin ran command on production instead of secondary database, losing 6 hours of data.
  • Reddit – 1.5 hours of downtime during planned migration because Puppet wasn’t disabled.
  • Amazon S3 – Command entered incorrectly by admin, resulting in 2 hours of downtime. Even Amazon’s status page broke in this outage.

To be clear: I’m not blaming these companies or employees involved for their outages. It’s great that they have publicly available postmortems that we can all learn from.

I’m also not claiming stress reduction or prevention would have made a difference in these particular cases. People are going to make mistakes regardless, but why open yourself up to the possibility of having twice as many, when those could be reduced and you’d have a better workplace to boot?

Health Issues

Health care expenditures are nearly 50% greater for workers who report high levels of stress.
—Journal of Occupational and Environmental Medicine

[W]orkers who must take time off work because of stress, anxiety, or a related disorder will be off the job for about 20 days.
—Bureau of Labor Statistics

If the business can’t afford downtime, can they afford higher health care premiums, or to miss an employee for 20 days?

How to reduce stress

Going back to the hospital studies (67 hospitals, 12,000 individuals), what did they do to reduce stress?

  1. Employee and management education on job stress
  2. Changes in hospital policies and procedures to reduce organizational sources of stress
  3. Establishment of employee assistance programs (specifically help and counseling for work-related and personal problems)

Employee assistance programs, while helpful, are most likely not something you can implement if you don’t have a role in employee benefits.

Reading this post and the NIOSH report on stress, then sending them to your friends, coworkers, and managers, will help educate them about on-call stress.

How do we in tech reduce organizational sources of stress?

Blameless Postmortems. Etsy’s blog explains the concept. Practicing blameless postmortems reduces the team’s stress by lowering external expectations, because the team culture isn’t to attack mistakes. The person on-call knows this, which helps temper their own internal expectations.

Proper on-call coverage. On-call rotations with 24-hour coverage should ideally be a week long. Rotating more frequently leads to too many handoffs and is difficult to plan around, while longer durations impact the employee’s life outside of work. With at least four people in the rotation, you’ll be on-call no more than one week a month. Give the rotation proper fall-back coverage as well, for example a secondary as backup, or letting people arrange coverage for a few hours or a day, so employees have the flexibility to do meaningful things in their lives outside of work.

Avoid too many cooks in the kitchen during outages. If people are hovering over the on-call individual or team, asking for constant updates while they try to solve the issue, it adds to the stress of urgency and expectations. Organizationally, you can limit this. If you want ideas, PagerDuty has an excellent on-call response guide, which appears to be modeled after the Incident Command System and National Incident Management System. Schedule check-ins for updates, giving time for work to get done.

No on-call heroes. One person’s “heroic” effort to restore the environment is another person’s worst nightmare: they don’t want to spend hours solving the problem, or to be the only one working on the outage while everyone relies on them. There’s no satisfaction in being a “hero”, only dread, and when they see others praised for those actions, they want to run the other way. Be mindful of whether you’re promoting this sort of behavior in your team’s culture or simply acknowledging employee contributions, and strive to reduce the need for these situations in the future.

Improve the systems. Minimize the uncertainty of being on-call by making it reasonable to be on-call. Set a quantitative goal to reduce the number of after-hours pages within your organization. Some of the changes will be easy to make; others may be more involved. But if the entire team is striving toward that goal, progress will be made and the whole team will benefit.
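As one way to baseline that quantitative goal, you could count how many pages land outside business hours. A minimal sketch, assuming pages arrive as datetime objects and that business hours are 9:00–17:00 on weekdays (both the function name and the hours are illustrative):

```python
from datetime import datetime

def after_hours_pages(pages, start_hour=9, end_hour=17):
    """Count pages arriving outside weekday business hours
    (assumed here to be 9:00-17:00, Monday-Friday)."""
    count = 0
    for t in pages:
        weekend = t.weekday() >= 5  # Saturday=5, Sunday=6
        off_hours = not (start_hour <= t.hour < end_hour)
        if weekend or off_hours:
            count += 1
    return count
```

Tracking this number week over week gives the team a concrete metric to drive down.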

Anyone can help drive change in their team and organization to reduce sources of stress. Even small changes will help make the work environment more enjoyable, so do your part where you can.

Using virtualenv and PYTHONPATH with Datadog

Datadog is a great service I’ve used for monitoring. Since the agent is Python-based, it’s very extensible through a collection of pip-installable libraries, but the documentation is limited on how to handle these libraries.

If you use the provided datadog-agent package, Datadog comes with its own set of embedded applications to monitor your server, including Python for the agent, supervisord to manage the Datadog processes, and pip. Since this is all just Python, surely this can lead to something. Can’t we import our own custom libraries in our custom checks? Yes, we can.
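To sketch the underlying mechanism (the helper function and paths below are illustrative, not part of the Datadog agent): making a virtualenv’s libraries importable comes down to getting its site-packages directory onto sys.path, which is exactly what PYTHONPATH does at interpreter startup.

```python
import os
import sys

def extend_sys_path(env_var="PYTHONPATH"):
    """Prepend each directory listed in env_var to sys.path,
    mirroring what the interpreter does with PYTHONPATH at startup."""
    for entry in filter(None, os.environ.get(env_var, "").split(os.pathsep)):
        if entry not in sys.path:
            sys.path.insert(0, entry)

# Illustrative: point the agent's embedded Python at a virtualenv's
# packages before importing libraries in a custom check.
# os.environ["PYTHONPATH"] = "/opt/my-venv/lib/python3.8/site-packages"
# extend_sys_path()
```

In practice you’d more likely set PYTHONPATH in the environment that launches the agent (e.g. its supervisord or service configuration) so the embedded interpreter picks it up at startup, rather than manipulating sys.path inside the check itself.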

Continue reading “Using virtualenv and PYTHONPATH with Datadog”

Take That Vacation: Eliminate Alerts Dragging You Back to the Office

I authored this as part of SysAdvent, which publishes one system-administration post each day in December, ending on the 25th. You can find the original on the SysAdvent blog.

It’s mid-afternoon and you’ve just sat down for that holiday meal with your family and friends. Your phone goes off and you look at the number. Work, again.

Before you even read the text or answer the call with the robotic voice telling you about the latest problem, you’re wondering, “How long will it take?” Your relatives are only in town for another day or two before you have to take them to the airport. What if it goes off again later? A holiday potentially ruined.

You read the text. Maybe it’s a false alarm. Maybe it’s not. Either way you’re out of the moment–worrying about work and if things are going to break over the holidays.

Don’t Be Your Own Grinch

It’s possible to engineer yourself and your environment for success. Continue reading “Take That Vacation: Eliminate Alerts Dragging You Back to the Office”