FIXIT: Let’s make alerting great again

Background

As of November 2018, WG2 has more than 70 on-call alerts. This alone is a great accomplishment, as it gives us the ability to proactively fix connectivity issues in our services or network.

However, we found that creating a good on-call alert can be trickier than expected. Many of our alerts were incomplete, with missing links to dashboards and playbooks, or had playbooks that were not specific to the alert. In turn, this made it challenging for an on-call engineer (OCE) to properly investigate and escalate the problem.

So how did we “fix” this problem? A FIXIT!

A FIXIT is a process where we all pull together and spend an afternoon fixing a simple and fundamental problem.

On-call alerts

Before we can fix a problem, we need to better understand what the problem is.

There is a lot of information in the Google SRE book on the topic of monitoring.

Summed up by @dape, these questions reflect a fundamental philosophy on pages and pagers:

So essentially, on-call alerts (pages) should be actionable and should require intelligence: investigate, then escalate.

Playbooks

Playbooks support the investigation process. A playbook for an OCE alert should follow these guidelines:

  1. It should be actionable, i.e. it is clear what the OCE should do in order to:
    a. Determine if the problem is real.
    b. Escalate to the right party, or – failing to determine that – the alarm owner!
  2. It should aid the investigation:
    a. Links to a dashboard where the problem is visible.
    b. Pointers to related systems, with a brief description of their relationship.
  3. It should describe the potential consequence of the outage.
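The guidelines above can be turned into a simple checklist. The following is only a minimal sketch, not the tooling we used: the Playbook class, its field names, and the lint_playbook function are all hypothetical, and it assumes a playbook can be represented as a small structured record. It reports which guideline (by its number in the list above) a playbook fails to meet.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Playbook:
    """Hypothetical playbook record; all field names are illustrative."""
    verification_steps: List[str] = field(default_factory=list)    # guideline 1a
    alarm_owner: str = ""                                          # guideline 1b (fallback)
    dashboard_urls: List[str] = field(default_factory=list)        # guideline 2a
    related_systems: Dict[str, str] = field(default_factory=dict)  # 2b: name -> relationship
    consequence: str = ""                                          # guideline 3


def lint_playbook(playbook: Playbook) -> List[str]:
    """Return one message per guideline the playbook does not satisfy."""
    problems = []
    if not playbook.verification_steps:
        problems.append("1a: no steps to determine if the problem is real")
    if not playbook.alarm_owner.strip():
        problems.append("1b: no alarm owner to fall back on for escalation")
    if not playbook.dashboard_urls:
        problems.append("2a: no link to a dashboard where the problem is visible")
    if not playbook.related_systems:
        problems.append("2b: no pointers to related systems")
    if not playbook.consequence.strip():
        problems.append("3: potential consequence of the outage not described")
    return problems


# Example: a playbook that only names an owner (placeholder value).
incomplete = Playbook(alarm_owner="team-example")
for problem in lint_playbook(incomplete):
    print(problem)
```

A check like this only confirms that each section exists. Whether the steps are actually specific to the alert still needs a human review.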

Now that we understand what makes a good on-call alert and how to write a good playbook, let’s fix it!

So let’s make alerting great again! - FIXIT style

During our Friday afternoon FIXIT, we managed to address more than half of the current on-call alerts. With 13 PRs, we made changes to:

While there is still more work to be done, it is key that each team takes ownership over the on-call alerts related to their services.

Special thanks to everyone who participated and helped make the alerts better!

Useful links:
