Most mainstream databases implement some form of “incremental” backup. If you’ve already performed a backup, subsequent ones can be done by only transferring what has changed since your last backup. It’s a clever optimization.
But it turns out that, in many cases, a full backup is much faster than an incremental one.
TL;DR:
This article will argue for the following rule of thumb:
A full backup is faster if `backup_age` is greater than `download_time * apply_rate`

Where:

- `backup_age` is the number of transactions behind you are when you start the incremental backup
- `download_time` is the time it takes to download the full database, not counting time to catch up afterwards
- `apply_rate` is the rate at which your database is able to recover / apply / replay transactions for your particular workload
Read on for specifics!
Background
This post applies to databases like Postgres, Neo4j and MySQL, where the “diff” for incremental backups is some form of transaction log.
It’s primarily intended for people who implement these protocols, to help them decide when to pick one strategy over the other. That said, as a user of these databases you could determine the parameter values and choose the appropriate strategy in your backup scripts.
I found it interesting that the arrival rate of new transactions during the backup doesn’t affect which strategy is faster. I think you will too.
First, what does each strategy involve?
Incremental backup
- Given an existing database backup, ask the source to stream all transactions that have occurred since the backup was taken
- Apply those transactions to the backup
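As a minimal sketch of that loop, in Python, with hypothetical `source` and `backup` handles standing in for whatever client API your database actually exposes:

```python
def incremental_backup(source, backup):
    # Ask the source for every transaction committed after the last
    # one the backup has applied, and replay them in order.
    # `source` and `backup` are made-up handles, not a real API.
    for tx in source.transactions_since(backup.last_tx_id):
        backup.apply(tx)
        backup.last_tx_id = tx.id
```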
Full backup
This is the procedure on Neo4j; I believe it’s roughly the same on Postgres and MySQL, though I admit I’ve not read the code:
- Given nothing, ask the source database to transfer all data
- Once transferred, ask for all transactions that occurred during the transfer
- Apply those transactions to the backup
The last two steps heal the inconsistencies caused by copying the files from a live database while it is under load.
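In the same hypothetical notation, a full backup looks roughly like this; the final replay is what heals the torn copy:

```python
def full_backup(source, backup):
    # Note where the source is before the copy starts.
    start_tx_id = source.last_tx_id
    # Bulk-copy the live store. The copied files may be internally
    # inconsistent, since writes continue during the transfer.
    backup.receive_all_files(source)
    # Replaying everything committed since the copy began repairs
    # those inconsistencies and catches the backup up to the source.
    for tx in source.transactions_since(start_tx_id):
        backup.apply(tx)
        backup.last_tx_id = tx.id
```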
Benefit
You can see how the incremental approach may be faster on a low write-load database.
Say it takes 10 minutes to transfer the raw store wholesale, but that there’s only been a single write since your last backup. An incremental backup in this case would be effectively instant.
But conversely, imagine you have a two month old backup, and performing an incremental backup would involve applying hundreds of millions of transactions. The 10 minutes you save from not copying the full store will likely be overtaken by the long time to catch up your old backup image.
The question is: when exactly does that overtake point occur?
Setup
There are three key variables in this problem to be aware of:
- `download_time` - How long a full backup takes just to copy the database files
- `apply_rate` - The rate at which we can apply transactions to the backup
- `backup_age` - How many transactions behind the source the existing backup is, in the incremental case
When I first started modelling this, I was convinced you also had to take into account the arrival rate of new transactions.
It turns out you do if `arrival_rate` is greater than `apply_rate`, because then you are, if you’ll excuse my language, fucked.
The backup would then never complete, falling further and further behind the source database.
Think of doing a backup from a large production server under high load onto a small backup machine.
Otherwise, it factors out.
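The pathological case is worth making concrete: under a constant-rate assumption, draining a backlog while new transactions keep arriving takes `backlog / (apply_rate - arrival_rate)` seconds, which blows up as the two rates converge. A small sketch:

```python
def catch_up_time(backlog, arrival_rate, apply_rate):
    # Net drain rate is apply_rate - arrival_rate transactions per
    # second; if that isn't positive, the backup never catches up.
    if apply_rate <= arrival_rate:
        return float("inf")
    return backlog / (apply_rate - arrival_rate)
```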
We can estimate the time for each method with these two formulas:
```
full_backup_time = download_time
                 + (transactions_that_arrived_during_download / apply_rate)

incremental_backup_time = (backup_age / apply_rate)
                        + (transactions_that_arrived_during_catchup / apply_rate)
```
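As a sketch in Python, estimating each `transactions_that_arrived_during_*` count as the phase duration times `arrival_rate` (and, like the formulas above, ignoring the second-order arrivals during the catch-up phase itself):

```python
def full_backup_time(download_time, arrival_rate, apply_rate):
    # Transactions that committed while the files were downloading
    # must be replayed once the copy finishes.
    catch_up = download_time * arrival_rate / apply_rate
    return download_time + catch_up

def incremental_backup_time(backup_age, arrival_rate, apply_rate):
    # First replay the backlog, then the transactions that arrived
    # while we were replaying it.
    replay = backup_age / apply_rate
    catch_up = replay * arrival_rate / apply_rate
    return replay + catch_up
```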
The first term in each (`download_time` and `backup_age / apply_rate`) is about creating a base snapshot.
Once complete, that base snapshot is as up to date as the source was when the backup started.
The second term then brings the base snapshot up to speed with the source. There’s some nuance here, because in some cases you can omit this step when doing an incremental backup. The rest of this assumes you need the backup to be in sync with the source at the time the backup completes.
Intuition
If you look at the two second terms, you’ll note they are effectively the same.
They are both `number_of_transactions_arriving_during_the_first_term / apply_rate`.
This means the second term does not matter. It has the same form in both formulas, so each total is just its first term scaled by the same factor of `1 + arrival_rate / apply_rate`, and whichever strategy has the shorter first term also has the shorter total. The only thing that matters is how long the first term takes.
Maybe this is obvious, but it took me a while to get it. An example might help.
Say:

- The source database you are backing up from is at transaction `1000`
- New transactions are arriving at `1 tx / s`
- The old backup you are doing an incremental backup onto is at transaction `100`

Then:

- Once the first portion of the full backup is done, where it downloads the files, it will be, effectively, at transaction `1000`, same as the source it was downloaded from
- Once the incremental backup has replayed the first `900` transactions, it will also be a snapshot of the store at transaction `1000`
Both base snapshots now need to replay the transactions that arrived at `1 tx / s` while they were produced.
How many of those transactions there are depends only on how long the base snapshots took to produce.
And so, we can factor out the second term in both formulas.
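Plugging the example’s numbers into those formulas, with an `apply_rate` of `10 tx / s` and a `download_time` of `300 s` (both made up for illustration), shows the factoring numerically:

```python
arrival_rate = 1     # tx / s, from the example above
apply_rate = 10      # tx / s, assumed for illustration
download_time = 300  # s, assumed for illustration
backup_age = 900     # tx, from the example (1000 - 100)

# Catch-up inflates both totals by the same factor.
scale = 1 + arrival_rate / apply_rate                  # 1.1

full_total = download_time * scale                     # 330.0 s
incremental_total = (backup_age / apply_rate) * scale  # 99.0 s

# The comparison (99 s vs 330 s) is decided entirely by the first
# terms (90 s vs 300 s); the shared factor can't flip it.
print(full_total, incremental_total)
```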
Putting it together
If you buy the argument above (and if you’re into this stuff you know I’m hand-waving several things), the rest is reasonably straightforward.
Referring back to the formulas in the Setup section, if we take out the second term in both, we get:
```
full_backup_base_snapshot_time = download_time

incremental_backup_base_snapshot_time = backup_age / apply_rate
```
Set them equal to tell us where they intersect:

```
download_time = backup_age / apply_rate
```

Solve for `backup_age` to tell us how old a backup would need to be for a full backup to be better:

```
backup_age = download_time * apply_rate
```
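In code, the whole rule collapses to a one-line check. A minimal sketch (you’d measure `download_time` and `apply_rate` for your own hardware and workload):

```python
def full_backup_is_faster(backup_age, download_time, apply_rate):
    # backup_age in transactions, download_time in seconds,
    # apply_rate in transactions per second.
    return backup_age > download_time * apply_rate
```

For example, with a made-up `download_time` of `600 s` and an `apply_rate` of `10,000 tx / s`, any backup more than six million transactions behind is better refreshed from scratch.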
Rule of thumb
A full backup is faster if `backup_age` is greater than `download_time * apply_rate`

Where:

- `backup_age` is the number of transactions behind you are when you start the incremental backup
- `download_time` is the time it takes to download the full database, not counting time to catch up afterwards
- `apply_rate` is the rate at which your database is able to recover / apply / replay transactions for your particular workload
Work on problems like this!
If you think queue-theory-ish things are interesting, we are hiring at Neo4j. Come build distributed systems and self-healing fleets of databases.
Acknowledgements
Foremost, to all the people that presumably figured this out a long time ago. Nothing is ever new in computing.
But also to Hugo and Andy for helping me think about this.