The GitLab recovery - what can we learn?
GitLab is a Git repository hosting and management platform. On 31 January 2017, GitLab.com suffered an outage and data loss.
After any recovery exercise, test or incident we should review the lessons learned and see how we can improve. This particular incident received a great deal of press and social media attention, and GitLab provided details of what went wrong and how they recovered, which makes it an excellent case to review.
The incident
Here is GitLab's own "post-mortem": https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
There were initial reports that the loss might have been caused by a cyber-attack, but the cause was simply "accidental removal of data". So far, this should not be cause for serious concern: we all have accidental deletions.
We frequently perform these kinds of low-urgency restores for our customers: some data has been removed, it is not currently causing a major operational problem, and it just needs to be restored quickly.
It was at this stage that the incident escalated from something relatively minor to a serious operational problem. GitLab tried several different methods to restore the data, but they all failed. In GitLab's own words: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." The incident notes list the individual failures:
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage.
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don't appear to be working, producing files only a few bytes in size.
- SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost.
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
- Our backups to S3 apparently don't work either: the bucket is empty.
Source: https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/
Eventually, most of the data was recovered from a staging server, but there was still some data loss and around 18 hours of downtime.
Transparency
Firstly, a note on GitLab's transparency throughout the incident. It is refreshing to see a business reveal its inner workings and processes so openly. GitLab live-streamed its recovery and, at the peak, over 5,000 viewers were watching on YouTube.
Although the event was embarrassing, the general attitude on social media at the time was one of real respect for the way GitLab handled the situation.
This is a great example of a business actually gaining goodwill through good management of a downtime incident.
Lessons
GitLab provided a root cause analysis with the actions it is taking to improve its recovery procedures, but there are some more universal lessons that we can all learn from:
- Backups must be monitored
GitLab was forced to recover from a staging environment because its standard backups "failed silently". Backups must be monitored and you must receive alerts for any failures.
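Two of the findings above, dump files only a few bytes in size and an empty S3 bucket, are exactly the kind of failure a basic automated check catches. The following is a minimal sketch rather than a drop-in tool: the backup directory, file pattern, maximum age and minimum size are hypothetical values you would tune for your own environment.

```python
from pathlib import Path
from datetime import datetime, timedelta
import sys

# Hypothetical values - adjust for your own environment.
BACKUP_DIR = Path("/var/backups/postgres")   # where dumps are written
MAX_AGE = timedelta(hours=26)                # a daily job gets ~2 hours of slack
MIN_SIZE_BYTES = 10 * 1024 * 1024            # a real dump should never be tiny

def check_latest_backup() -> list[str]:
    """Return a list of problems with the most recent backup, empty if OK."""
    problems = []
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        return [f"no backup files found in {BACKUP_DIR}"]

    latest = dumps[-1]
    age = datetime.now() - datetime.fromtimestamp(latest.stat().st_mtime)
    if age > MAX_AGE:
        problems.append(f"{latest.name} is {age} old (limit {MAX_AGE})")
    if latest.stat().st_size < MIN_SIZE_BYTES:
        problems.append(f"{latest.name} is only {latest.stat().st_size} bytes")
    return problems

if __name__ == "__main__":
    issues = check_latest_backup()
    if issues:
        # Wire this into whatever actually pages a human: email, Slack, PagerDuty...
        print("BACKUP CHECK FAILED: " + "; ".join(issues))
        sys.exit(1)
    print("backup check passed")
```

The important part is not the code but that the result feeds an alerting channel someone actually watches; a check that logs to a file nobody reads fails just as silently as the backup itself.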
- Storage affects your recovery speed
Backup storage is one of the first places most organisations look to save money. Storage performance isn't critical for most backup jobs, but it matters a great deal for recovery: restoring 1 TB at 50 MB/s takes roughly five and a half hours, while at 500 MB/s it takes a little over half an hour.
- The location of your storage affects your recovery speed
GitLab's recovery was slow in part because data had to be copied from a different storage account. The more universal lesson here, and the issue we commonly see, is the loss of local backup storage forcing a recovery from offsite copies. A local recovery is significantly faster, so it is important to separate local backup storage from production systems where possible, reducing the chance of a single failure affecting both live and backup data.
- Use the right backup/replication method for your recovery
Snapshots are great for fast recovery of entire disks, but they are not suited to granular restores of individual items. Choose the backup or replication method that matches the recovery scenarios you actually expect to face.
- Have someone responsible for backups
If no one owns the backup process, failures go unnoticed and unfixed.
- Always have an alternative method of working
GitLab uses its own product, GitLab.com, and so was affected by the outage. It was able to fall back on "a private GitLab instance we normally use for private/sensitive workflow" while GitLab.com was unavailable. The more universal lesson for businesses here is to have alternative systems to use in a disaster. For instance, how would you communicate with staff during a disaster? If you lose access to your office and IT, you can't use email. Where do you store your contact directory? The solution here is to use an offsite Mass Notification Service, or to build one.
- Do not assume backups are working (or anything else)
For GitLab, multiple backup methods did not work when called upon. Any assumption is dangerous and should be tested; as an illustration, the sketch below shows the kind of pre-flight check that would have caught the silent pg_dump version mismatch described earlier.
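This is only a sketch under stated assumptions, not GitLab's actual tooling: it assumes pg_dump and psql are on the PATH and that the usual libpq environment variables (PGHOST, PGUSER and so on) point at the database being backed up. The idea is simply to fail loudly, before the dump runs, when the client and server major versions disagree, instead of producing an empty file in silence.

```python
import re
import subprocess
import sys

def major_version(version_string: str) -> str:
    """Extract the major version, e.g. '9.6' from 'pg_dump (PostgreSQL) 9.6.1'.

    Good enough for the 9.x series involved in this incident.
    """
    match = re.search(r"(\d+\.\d+)", version_string)
    if not match:
        raise ValueError(f"could not parse version from: {version_string!r}")
    return match.group(1)

def client_version() -> str:
    out = subprocess.run(["pg_dump", "--version"],
                         capture_output=True, text=True, check=True).stdout
    return major_version(out)

def server_version() -> str:
    # Connection details come from libpq environment variables (PGHOST, PGUSER, ...).
    out = subprocess.run(["psql", "-t", "-A", "-c", "SHOW server_version;"],
                         capture_output=True, text=True, check=True).stdout
    return major_version(out)

if __name__ == "__main__":
    client, server = client_version(), server_version()
    if client != server:
        print(f"REFUSING TO BACK UP: pg_dump is {client} but the server is {server}")
        sys.exit(1)
    print(f"versions match ({client}); safe to run pg_dump")
```

Even a check like this only proves that the dump was taken with matching tools; it says nothing about whether the result can actually be restored. Which leads to...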
- Recoveries must be tested
The most important lesson of all is that if you do not test your recovery, you cannot be sure it will work. Every one of the issues above is quickly flagged and fixed if you test; if you don't, you will be forced to make those discoveries and improvements in the middle of a disaster.
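A recovery test does not have to be elaborate. The sketch below is a hedged example, not a production tool: it assumes a custom-format pg_dump file, a scratch PostgreSQL instance reachable through the standard client tools, and it uses a hypothetical table name for the sanity check, which in practice would be a table you know must never be empty.

```python
import subprocess
import sys

DUMP_FILE = "/var/backups/postgres/latest.dump"  # hypothetical path to a custom-format dump
SCRATCH_DB = "restore_test"                      # throwaway database on a scratch server

def run(cmd: list[str]) -> str:
    """Run a command, raising if it fails, and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def test_restore() -> None:
    # Recreate a throwaway database and restore the latest dump into it.
    run(["dropdb", "--if-exists", SCRATCH_DB])
    run(["createdb", SCRATCH_DB])
    run(["pg_restore", "--no-owner", "--dbname", SCRATCH_DB, DUMP_FILE])

    # Sanity check: a table we know must never be empty (hypothetical name).
    count = run(["psql", "-t", "-A", "-d", SCRATCH_DB,
                 "-c", "SELECT count(*) FROM projects;"]).strip()
    if int(count) == 0:
        raise RuntimeError("restore completed but the projects table is empty")

if __name__ == "__main__":
    try:
        test_restore()
    except Exception as exc:
        print(f"RECOVERY TEST FAILED: {exc}")
        sys.exit(1)
    print("recovery test passed")
```

Run something like this on a schedule against a machine that is not production, and the empty dumps, wrong binaries and misconfigured buckets all surface as a failing test instead of as an 18-hour outage.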