This page describes the vision for the GitLab Improve workflow, which is part of our Monitor stage.
Improve is the process of reviewing all events that happened around (before, during, and after) an incident and identifying how to change processes, system behavior, and human behavior to prevent future incidents. Conducting an effective Post Incident Review requires preparation.
Preparing for a Post Incident Review begins during the Triage process by documenting events and actions as they take place. This makes Post Incident Reviews much more effective. In addition to capturing events and actions taken by team members, it is helpful to collect metric visualizations that show when and how a system changed at the time of the incident.
Effective Post Incident Reviews are blameless. It should be stated at the beginning of the review that everyone involved acted with good intent and made the best decision they could with the information they had. Setting this tone at the beginning of a review helps the team discover all system flaws and potential improvements. The review walks through the event timeline of the incident, asking why at each step (the Five Whys method is an iterative interrogative technique used by the GitLab Infrastructure Team to uncover the true root cause).
Once the root causes of all critical events that happened during the incident have been uncovered and understood, the team brainstorms improvements to change or prevent those events, ultimately preventing the incident from happening again or preparing better response plans in case a similar incident occurs. Action items should be written down, prioritized, and scheduled. Each action item should be assigned a DRI (directly responsible individual) to ensure completion.
Action items are of little use if the team does not follow up with the DRI to ask about progress and completion. Follow-up may occur during daily or weekly stand-ups.
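As an illustration only, the action-item bookkeeping described above could be sketched in a few lines of Python. This is not a GitLab feature or API. The `ActionItem` fields, the `needs_follow_up` helper, and the sample data are all hypothetical names chosen for this sketch, and it assumes a lower priority number means higher urgency.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One improvement identified in a Post Incident Review (hypothetical model)."""
    description: str
    dri: str        # directly responsible individual
    priority: int   # assumption: 1 is the highest priority
    due: date       # scheduled completion date
    done: bool = False


def needs_follow_up(items, today):
    """Return open items that are past due, highest priority first."""
    overdue = [item for item in items if not item.done and item.due < today]
    return sorted(overdue, key=lambda item: (item.priority, item.due))


# Sample data, invented for the example.
items = [
    ActionItem("Add alert for disk saturation", "alice", 1, date(2024, 1, 10)),
    ActionItem("Update failover runbook", "bob", 2, date(2024, 1, 5)),
    ActionItem("Document escalation path", "carol", 3, date(2024, 2, 1)),
    ActionItem("Fix connection pool limit", "dave", 1, date(2024, 1, 3), done=True),
]

# A stand-up could start from this list of items to raise with each DRI.
for item in needs_follow_up(items, today=date(2024, 1, 15)):
    print(f"Follow up with {item.dri}: {item.description} (due {item.due})")
```

Recording the DRI and due date on every item is what makes the follow-up step mechanical: anything open and past due is surfaced, and completed or not-yet-due items stay out of the way.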
We have not yet enabled the entire workflow detailed above. However, a couple of features are available today to simplify your Improve processes:
This workflow is currently at the Planned stage. Workflows in the Operations section are graded on the same maturity scale as categories.
We plan to empower teams with a guided Post Incident Review experience that makes it simple to feed system and process improvements back into the Plan stage, completing the DevOps loop. Work supporting this vision is captured in this epic.