Category Strategy - Disaster Recovery

🚨 Disaster Recovery

Last updated: 2020-05-05

Introduction and how you can help

GitLab installations hold business critical information and data. The Disaster Recovery (DR) category helps our customers fulfill their business continuity plans by creating processes that allow the recovery of a GitLab instance following a natural or human-created disaster. Disaster Recovery complements GitLab's Reference Architectures and utilizes Geo nodes to enable a failover in a disaster situation. We want disaster recovery to be robust and easy to use for systems administrators - especially in a potentially stressful recovery situation.

Please reach out to Fabian Zimmer, Product Manager for the Geo group (Email) if you'd like to provide feedback or ask any questions related to this product category.

This strategy is a work in progress, and everyone can contribute.

Current state

⚠️ Currently, there are some limitations of what data is replicated. Please make sure to check the documentation!

Setting up a disaster recovery solution for GitLab requires significant investment and is cumbersome in more complex setups, such as high availability configurations. Geo doesn't replicate all parts of GitLab yet, which means that users need to be aware of what is automatically covered by replication via a Geo node and what parts need to be backed up separately.

Where we are headed

In the future, our users should be able to use a GitLab Disaster Recovery solution that fits within their business continuity plan. Users should be able to choose which Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are acceptable to them and GitLab's DR solution should provide configurations that fit those requirements.

A systems administrator should be able to confidently set up a DR solution even when the setup is complex, as is the case for high availability. In case of an actual disaster, a systems administrator should be able to follow a simple and clear set of instructions that allows them to recover a working GitLab installation. To ensure that DR works, planned failovers should be tested frequently.

We envision that GitLab's Disaster Recovery processes and solution should be robust, simple to follow, and exercised regularly through planned failovers.

Target audience and experience

Sidney (Systems Administrator)

For more information on how we use personas and roles at GitLab, please click here.

What's Next & Why

Improve support for planned failovers

We want DR processes to be simpler and believe that improving the planned failover process is the best place to start. It currently takes more than 20 steps to perform a failover, many of which can be automated. DR procedures should be tested regularly, and we aim to provide better support for this. A simple example of a planned failover process is sketched below.
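
The following is a minimal automation sketch, not a supported tool. It assumes an Omnibus installation where the primary has already been made read-only, and that the `gitlab-rake geo:status`, `gitlab-ctl promotion-preflight-checks`, and `gitlab-ctl promote-to-primary-node` commands are available on the secondary; the `wait_for_sync` helper is a hypothetical simplification.

```python
#!/usr/bin/env python3
"""Sketch of an automated planned failover for a single-node Geo setup.

Assumptions (not a supported tool): this runs on the secondary node, the
primary has already been made read-only, and the Omnibus commands
`gitlab-rake geo:status`, `gitlab-ctl promotion-preflight-checks` and
`gitlab-ctl promote-to-primary-node` are available.
"""
import subprocess
import sys
import time


def run(cmd):
    """Run a command and fail loudly, since every step is critical."""
    print(f"==> {' '.join(cmd)}")
    subprocess.run(cmd, check=True)


def wait_for_sync(interval=30):
    """Poll Geo status until replication has caught up.

    Hypothetical check: a real implementation would parse the output of
    `gitlab-rake geo:status` (or query the Geo API) for 100% synced and
    zero failures instead of asking a human.
    """
    while True:
        subprocess.run(["gitlab-rake", "geo:status"], check=True)
        answer = input("Is everything synchronised and verified? [y/N] ")
        if answer.lower() == "y":
            return
        time.sleep(interval)


def main():
    # 1. Primary is already in a read-only / maintenance state (manual today).
    # 2. Wait until the secondary has fully caught up with the primary.
    wait_for_sync()
    # 3. Run the documented preflight checks before promoting.
    run(["gitlab-ctl", "promotion-preflight-checks"])
    # 4. Promote the secondary to become the new primary.
    run(["gitlab-ctl", "promote-to-primary-node"])
    # 5. DNS / load balancer changes to point users at the new primary
    #    remain a manual, environment-specific step.
    print("Promotion finished. Update DNS to point at the new primary.")


if __name__ == "__main__":
    sys.exit(main())
```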

Promoting a secondary should be simple

It is currently possible to promote a secondary node to a primary node, either during a planned failover or in a genuine disaster recovery situation; however, we believe that this process should be much simpler.

Geo supports promotion for a single-node installation and for an HA configuration. The current promotion process consists of a large number of manual preflight checks, followed by the actual promotion; overall, there are more than 20 steps. Promotion is only possible from the command line, there is no UI flow, and for high-availability configurations, modifications to the gitlab.rb file are required on almost all nodes. Given the critical nature of this process, Geo should make it simple to promote a secondary, especially for more complex high-availability configurations.
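
To illustrate why the HA case is toilsome today, here is a rough, hypothetical sketch of the per-node work involved. The node inventory, SSH helper, and `update-gitlab-rb-for-primary` placeholder are invented for illustration; the exact gitlab.rb changes are environment-specific and covered by the Geo disaster recovery documentation.

```python
"""Rough illustration of why HA promotion is toilsome today.

Everything here is hypothetical glue: node lists, the SSH helper and the
exact gitlab.rb edits vary per environment. The point is only that every
application node in the secondary site needs a config change plus a
reconfigure before the actual promotion can run.
"""
import subprocess

# Hypothetical inventory of the secondary site's application nodes.
SECONDARY_APP_NODES = ["app-1.geo.example.com", "app-2.geo.example.com"]


def ssh(host, command):
    """Run a command on a remote node (simplified; no error handling)."""
    subprocess.run(["ssh", host, command], check=True)


def promote_secondary_site():
    for host in SECONDARY_APP_NODES:
        # Each node's /etc/gitlab/gitlab.rb must stop declaring the node
        # part of a Geo secondary before it can act as the new primary.
        # The command below is a placeholder for that manual edit.
        ssh(host, "sudo update-gitlab-rb-for-primary")  # hypothetical helper
        ssh(host, "sudo gitlab-ctl reconfigure")

    # The promotion itself still has to be run on the right node afterwards.
    ssh(SECONDARY_APP_NODES[0], "sudo gitlab-ctl promote-to-primary-node")


if __name__ == "__main__":
    promote_secondary_site()
```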

Add a GitLab maintenance mode

As stated above, part of a planned failover process is usually putting your instance in a maintenance mode. This would block any write operations and would allow a primary and secondary to be fully in sync before making the switch. Additionally, a maintenance period may be useful in other situations e.g. during upgrades or other infrastructure changes.
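
The sketch below is purely conceptual and is not GitLab's implementation; it only illustrates the "allow reads, block writes" behaviour a maintenance mode would provide, using a generic WSGI middleware with a hypothetical `is_maintenance_enabled` flag.

```python
"""Conceptual sketch of a maintenance mode: a WSGI middleware that
rejects write requests while replication catches up. Not GitLab's
implementation; it only illustrates blocking writes while reads stay
available."""

READ_ONLY_METHODS = {"GET", "HEAD", "OPTIONS"}


class MaintenanceModeMiddleware:
    def __init__(self, app, is_maintenance_enabled):
        self.app = app
        # A callable, so the flag can be flipped at runtime (e.g. from a
        # setting stored in the database or an environment variable).
        self.is_maintenance_enabled = is_maintenance_enabled

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET")
        if self.is_maintenance_enabled() and method not in READ_ONLY_METHODS:
            start_response(
                "503 Service Unavailable",
                [("Content-Type", "text/plain"), ("Retry-After", "3600")],
            )
            return [b"GitLab is undergoing maintenance; writes are disabled.\n"]
        # Read-only traffic passes through untouched.
        return self.app(environ, start_response)
```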

Replication should be easy to pause and resume

Right now, DR depends on PostgreSQL streaming replication from the primary to the Geo secondary node. It should be easy to pause and resume this database replication during a planned failover or upgrade event.
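
As a sketch of what pausing could build on, the standard PostgreSQL functions pg_wal_replay_pause() and pg_wal_replay_resume() can pause and resume WAL replay on the secondary's replica database. The connection settings below are placeholders; on an Omnibus node the equivalent statements would typically be run via gitlab-psql, and a complete solution would also pause Geo's background services.

```python
"""Sketch of pausing/resuming WAL replay on the Geo secondary's replica
database using standard PostgreSQL functions. Connection details are
placeholders; a full solution would also stop the Geo log cursor and
Sidekiq so no stale jobs are processed."""
import psycopg2

# Placeholder DSN; adjust to the secondary's replica database.
DSN = "dbname=gitlabhq_production host=/var/opt/gitlab/postgresql"


def set_replication_paused(paused: bool) -> bool:
    """Pause or resume WAL replay and return the resulting state."""
    fn = "pg_wal_replay_pause" if paused else "pg_wal_replay_resume"
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(f"SELECT {fn}();")
            cur.execute("SELECT pg_is_wal_replay_paused();")
            return cur.fetchone()[0]


if __name__ == "__main__":
    print("replication paused:", set_replication_paused(True))
```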

Building a self-service Geo framework

As of May 2020, only ~60% of data types (we need a better name) are replicated and only 22% are fully verified. We have made efforts to change this situation by replicating the remaining data types and verifying those data types. As part of those efforts, we learned that replicating data types is hard, and so is verifying the data.

In order to change this situation and allow data types to be added to Geo more quickly, we are investigating how to build a scalable, self-service Geo replication and verification framework. This should make it easier for other teams within GitLab to add new data types and allow us to manage GitLab's growth. Additionally, it will make it easier for the community to contribute to Geo. The goal is to allow new features to ship with Geo support by default without impacting velocity. A conceptual sketch of such a framework follows below.
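
The sketch below is hypothetical and illustrative only; the real Geo framework lives in the GitLab Rails code base and is written in Ruby. It shows the general idea: a team adding a new data type declares only how to replicate and checksum its records, while the framework owns event handling and verification. All names, including the `package_file` example, are placeholders.

```python
"""Hypothetical self-service replication framework: a data type declares
how to replicate and checksum its records; the framework handles events
and verification generically. Illustrative only."""
from abc import ABC, abstractmethod
from typing import Dict, Type

REGISTRY: Dict[str, Type["Replicator"]] = {}


def register(data_type: str):
    """Class decorator so new data types self-register with the framework."""
    def wrap(cls):
        REGISTRY[data_type] = cls
        return cls
    return wrap


class Replicator(ABC):
    """What the framework asks each data type to provide."""

    @abstractmethod
    def replicate(self, record_id: int) -> None:
        """Copy a single record from the primary to this secondary."""

    @abstractmethod
    def checksum(self, record_id: int) -> str:
        """Return a checksum used to verify the replicated copy."""


@register("package_file")
class PackageFileReplicator(Replicator):
    """Example data type; the download and hashing calls are placeholders."""

    def replicate(self, record_id: int) -> None:
        print(f"downloading package file {record_id} from the primary")

    def checksum(self, record_id: int) -> str:
        return "sha256:placeholder"


def handle_event(data_type: str, record_id: int) -> bool:
    """Framework side: replicate, then verify against the primary's checksum."""
    replicator = REGISTRY[data_type]()
    replicator.replicate(record_id)
    # In reality the primary's checksum would come from the database or an API.
    return replicator.checksum(record_id) == "sha256:placeholder"


if __name__ == "__main__":
    print("verified:", handle_event("package_file", 42))
```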

In a year

Enable Geo on GitLab.com for Disaster Recovery

GitLab.com is by far the largest GitLab instance and is used by GitLab to dogfood GitLab itself. Currently, GitLab.com does not use GitLab Geo for DR purposes. This has many disadvantages, and the Geo team is working with Infrastructure to enable Geo on GitLab.com. We have recently enabled Geo on staging and are now evaluating Geo for Disaster Recovery. You can learn more in our Geo on staging blog post!

What is not planned right now

We currently don't plan to replace PostgreSQL with a different database, e.g. CockroachDB.

Maturity plan

This category is currently at the minimal maturity level, and our next maturity target is viable (see our definitions of maturity levels).

In order to move this category from minimal to viable, the main initiatives are to create a simplified disaster recovery process, enable DR via Geo on GitLab.com, and add a maintenance mode. You can track the work in the viable maturity epic.

Competitive landscape

We have to understand the current DR landscape better and we are actively engaging with customers to understand what features are required to move the DR category forward.

Analyst landscape

We do need to interact more closely with analysts to understand the landscape better.

Top customer success/sales issue(s)

Top user issues

Top internal customer issues/epics

Top strategy item(s)