This page describes GitLab's vision for the triage workflow as part of our Monitor stage.
Triage is the process of detecting and identifying application performance bottlenecks, with the goal of understanding the root cause of a problem quickly and accurately. Fast, effective troubleshooting requires that all the relevant information has been collected and is accessible, so that the degradation can be diagnosed correctly. In this process, we aim to provide meaningful insights through deep visibility into all segments of an application. This way, when it breaks, we'll help you quickly figure out why.
The triage flow usually starts with an alert or a customer complaint. GitLab will immediately alert you about your application's health via email, Slack, PagerDuty, or any other third-party tool.
Once an alert has been triggered, it is examined, and a verification process begins to determine whether this is a real problem and whether it is still ongoing. It is also recommended to check known issues. Afterward, the alert is acknowledged and assigned to the right team for further investigation.
An essential part of the triage flow is understanding the business impact of a problem. Is it a system-wide failure? Does it affect all users or just a subset? The business impact dictates the level of urgency and the course of action to take, for example, which runbooks to follow or which team members to collaborate with.
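As an illustration of how business impact could drive urgency and runbook selection, here is a minimal Python sketch. It is not GitLab functionality: the `Impact` type, the thresholds, the urgency labels, and the runbook paths are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical summary of who is affected by a problem."""
    affected_users: int
    total_users: int
    system_wide: bool


# Hypothetical mapping from urgency level to a runbook path.
RUNBOOKS = {
    "critical": "runbooks/full-outage.md",
    "high": "runbooks/major-degradation.md",
    "medium": "runbooks/partial-degradation.md",
    "low": "runbooks/monitor-only.md",
}


def urgency(impact: Impact) -> str:
    """Classify urgency from business impact (thresholds are assumptions)."""
    if impact.system_wide:
        return "critical"
    if impact.total_users == 0:
        return "low"
    share = impact.affected_users / impact.total_users
    if share >= 0.5:
        return "high"
    if share > 0:
        return "medium"
    return "low"


def plan(impact: Impact) -> tuple[str, str]:
    """Return the urgency level and the runbook suggested for it."""
    level = urgency(impact)
    return level, RUNBOOKS[level]


if __name__ == "__main__":
    # A subset of users is affected, so the medium-urgency runbook is chosen.
    print(plan(Impact(affected_users=10, total_users=1000, system_wide=False)))
```

In practice the thresholds and the mapping would come from each team's own incident policy rather than fixed constants.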
Collaboration is critical to successfully troubleshooting an incident. Different teams often need to work together to reduce the MTTR (Mean Time To Resolution) and to detect the root cause of the problem. As a result, actions such as involving other teams, communicating internally, and notifying stakeholders all need to take place.
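To make the MTTR metric concrete, the sketch below computes it as the average time between detection and resolution across a set of incidents. The function name and the input shape (pairs of detection and resolution timestamps) are assumptions made for illustration, not a GitLab API.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average (resolved_at - detected_at) over all incidents.

    Each incident is a hypothetical (detected_at, resolved_at) pair.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual resolution times, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    incidents = [
        (start, start + timedelta(minutes=30)),
        (start, start + timedelta(minutes=90)),
    ]
    print(mean_time_to_resolution(incidents))
```

Tracking this value over time is one way to check whether faster collaboration is actually shortening incidents.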
To conduct an effective investigation of a problem, it is expected that an observability solution would have:
Once the investigation is over, it is common to document the findings for future analysis.
We plan to provide a streamlined triage experience that allows our users to quickly identify and effectively troubleshoot an application problem, as described in the following flow:
Detailed information can be found in the triage to minimal epic.