Recent Posts
Playing with Python Structured Logs
Migrating to Hugo
I recently migrated my blog from Wordpress.com to a static site powered by Hugo on Netlify.
While I had no problems with Wordpress.com with respect to hosting, ease of setup or use, I did not care for the plugin situation. Specifically, Wordpress has a very limited whitelist of Javascript functionality and plugins on the personal plan, and this leads to some weirdness around behavior.
Sustainable On-Call
I saw a tweet by Charity Majors that got me thinking–
Yes, yes. On call sucks and can destroy your life. I know this. Bored now.On call is a fact of life for anyone who cares about developing high quality software for the long run. So how can we make it not suck?
— Charity Majors (@mipsytipsy) January 31, 2018
On-call is stressful, overwhelmingly the negative type of stress, distress, rather than positive, eustress. I’ve written about the stresses of on-call before–urgency, uncertainty, duration, and expectations. We all know that distress can contribute to burnout, but individually those four factors are fairly benign. Expectations are part of any job. People on oil rigs work 12+ hours a day for two weeks straight. A number of jobs have uncertain and urgent of tasks, such as first responders or doctors. If these components can be managed, why then can on-call be so miserable?
Digging deeper, I came to the conclusion that the worst part of on-call revolves around frequency and volume. Everything we do around improving on-call I believe tries to attack these two causes. Why do these factors impact on-call, and how can they be mitigated?
Book Review: Incident Management for Operations
I have an interest in bringing ideas from outside of the tech industry and seeing how they fit. After working with Kerim Satirli (@ksatirli) on my SysAdvent post about multiple root causes, he was kind enough to send me a book “Incident Management for Operations”. The book focuses on using the Incident Management System, pioneered in emergency services for fighting wildfires, in managing outages in tech.
“Incident Management for Operations” was authored by Rob Schnepp, Ron Vidal, and Chris Hawley of Blackrock 3 Partners. You can find the book on Amazon or Safari Books Online.
In A Nutshell
The authors have adapted the Incident Management System (IMS) for use in IT operations. IMS is a standardized, scalable method for incident response to facilitate coordination between responders. This translates nicely to organizations where separate departments or teams are responsible for different pieces of a business’s IT infrastructure, and multiple disciplines are required for incident resolution.
The book lays out the framework for IMS and includes examples of applying the framework to IT. Since implementation can vary in practice (alignment with DevOps, ITIL, etc.), the book stops short of prescribing how to setup organizations, but gives enough information to determine how your organization could adapt to IMS.
The authors provide a number of mnemonics such as “CAN” (Conditions, Actions, Needs), “STAR” (Size up, Triage, Act, Review), and “TIME” (Tone, Interaction, Management, Engagement) to aid in implementing IMS and effectively leading as an Incident Commander. If your organization implements IMS, I’d suggest making a quick reference card with these mnemonics to put on your ID badge holder in case you forget during a 3 a.m. incident.
Root Cause is Plural
Below is a copy of my post from Sysadvent 2017 (Day 3). I’d like to thank Kerim Satirli (@ksatirli) once again for his help in editing the post and improving it.
Root Cause is Plural
Post-mortems are an industry standard process that happens after incidents and outages as a method of continuous learning and improvement. While the exact format varies from company to company, your post-mortem report typically addresses the Five W’s:
What happened?
What happened?
Where did it happen?
Who was impacted by the incident?
When did problem and resolution events occur?
Why did the incident occur?
The first four questions are generally easy to answer. The question that takes the majority of the time is the why. To determine why the incident occurred requires investigative skills, critical thinking, and logical deductions. Sometimes determining the true why takes multiple incidents, as various fixes are attempted before the incident is resolved, but eventually a “root cause” is designated as the root of all the problems and the report is complete.
But if your “root cause” amounts to a single failure, you have stopped your process too soon.