Monitor workflow - Resolve

Resolve

This page describes the GitLab vision for the Resolve workflow, which is part of our Monitor stage.

Why Resolve?

Resolve is the process of restoring IT services following an incident that disrupted availability. This workflow follows Triage, in which the problem at hand was investigated and the root cause determined. Once a fix for the root cause has been identified, the solution must be verified in a local environment before it is deployed to production. Following release, the services must be monitored to ensure that they return to levels that meet SLOs.

User Journey

Change proposal

The root cause has been discovered and responders have determined a potential solution. The next step is to propose a set of changes for review with the intention of restoring impacted services. In this scenario, the responding team is typically under pressure, and the proposed solution may not be a long-term one. The goal is to restore services for stakeholders as quickly as possible and to follow up the incident with a review where a long-term solution can be designed, discussed, and scheduled for implementation.

Verify and deploy

The solution has been reviewed and approved. A responder implements the solution and tests it in their local environment before pushing to master and deploying to production. Depending on how far the team has automated its delivery process, these steps may be streamlined using CI/CD workflows.

Monitor metrics

After release, it is important to monitor production metrics to confirm that the solution was comprehensive and worked as intended. Alerts will often auto-resolve during this phase.
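As an illustration of what "monitoring until services meet SLOs" can look like in practice, here is a minimal Python sketch. It is not a GitLab feature or API: the `Slo` class, the `meets_slo` function, the sample format, and the "three consecutive healthy samples" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO: minimum success ratio each sample must reach."""
    name: str
    min_success_ratio: float  # e.g. 0.99 means 99% of requests succeed


def success_ratio(total: int, errors: int) -> float:
    """Fraction of requests that succeeded; an idle interval counts as healthy."""
    if total == 0:
        return 1.0
    return (total - errors) / total


def meets_slo(
    samples: Sequence[tuple[int, int]],
    slo: Slo,
    required_consecutive: int = 3,
) -> bool:
    """Return True if the most recent `required_consecutive` samples all meet the SLO.

    Each sample is (total_requests, error_requests) for one scrape interval.
    Requiring several consecutive healthy samples (an assumed rule) avoids
    declaring recovery on a single good data point.
    """
    if len(samples) < required_consecutive:
        return False
    recent = samples[-required_consecutive:]
    return all(success_ratio(t, e) >= slo.min_success_ratio for t, e in recent)


if __name__ == "__main__":
    availability = Slo(name="availability", min_success_ratio=0.99)
    # Error counts fall after the fix is deployed (made-up numbers).
    post_release = [(1000, 50), (1000, 20), (1000, 5), (1000, 2), (1000, 0)]
    print(meets_slo(post_release, availability))
```

In a real setup the samples would come from the team's monitoring system, and the threshold and window length would follow the team's own SLO definitions.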

External communication

Services have been restored and meet SLOs. A member of the team, often the Incident Commander, communicates with stakeholders through various channels (Status Page, social media platforms, internal email, etc.) to inform them that services are back up and available.

Documentation

After an incident, it is important to document what happened and how it was fixed. Taking the time to document this information may help the team triage and resolve a similar incident much faster in the future.

Today

What's possible

We have not enabled the entire workflow detailed above. However, we do have a couple of features you can take advantage of today to simplify your Resolve processes:

Maturity

This workflow is currently at the Planned stage. Workflows in the Operations section are graded on the same maturity scale as categories.

What's next

We plan to provide a Resolve experience that allows our users to efficiently restore services, whether that means deploying a patch to application code or running a script to unclog ETL pipelines. Work supporting this workflow is captured in this epic.