Most mainstream databases implement some form of “incremental” backup. If you’ve already performed a backup, subsequent ones can be done by only transferring what has changed since your last backup. It’s a clever optimization.

But it turns out that in many cases a full backup is much faster than an incremental one.

TL;DR:

This article will argue for the following rule of thumb:

A full backup is faster if backup_age is greater than download_time * apply_rate

Where

  • backup_age is the number of transactions behind you are when you start the incremental backup
  • download_time is the time it takes to download the full database, not counting time to catch up afterwards
  • apply_rate is the rate at which your database is able to recover / apply / replay transactions for your particular workload

Read on for specifics!

Background

This post applies to databases like Postgres, Neo4j and MySQL, where the “diff” for incremental backups is some form of transaction log.

It’s primarily intended for people who implement these protocols, to help decide when to pick one strategy over the other. That said, as a user of these databases you could measure the parameter values yourself and pick the appropriate strategy in your backup scripts.

I found it interesting that the arrival rate of new transactions during the backup doesn’t affect which strategy is faster. I think you will too.

First, what does each strategy involve?

Incremental backup

  • Given an existing database backup, ask the source to stream all transactions that have occurred since the backup was taken
  • Apply those transactions to the backup
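
In code, the shape of that is roughly the loop below. This is only a sketch: stream_transactions_since and apply are hypothetical stand-ins for whatever your database’s backup protocol actually exposes.

def incremental_backup(source, backup):
    # Ask the source for every transaction committed after the last
    # one the backup already contains, and replay them in order.
    # (Both methods are hypothetical illustrations, not a real API.)
    for tx in source.stream_transactions_since(backup.last_applied_tx_id):
        backup.apply(tx)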

Full backup

This is the procedure in Neo4j; I believe it’s roughly the same in Postgres and MySQL, though I admit I haven’t read their code:

  • Given nothing, ask the source database to transfer all data
  • Once transferred, ask for all transactions that occurred during the transfer
  • Apply those transactions to the backup

The last two steps heal the inconsistencies caused by copying the files from a live database while it is under load.
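
Sketched the same way, with the same caveat that these helper calls are invented for illustration:

def full_backup(source, target_dir):
    # Remember where the source was when we started copying.
    tx_id_at_start = source.current_tx_id()

    # Copy the raw store files while the source keeps taking writes.
    # The copy may be internally inconsistent because of those writes.
    source.copy_store_files_to(target_dir)

    # Replaying everything committed since the copy began repairs the
    # inconsistencies and brings the backup up to date.
    backup = open_store(target_dir)
    for tx in source.stream_transactions_since(tx_id_at_start):
        backup.apply(tx)
    return backup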

Benefit

You can see how the incremental approach can be faster on a database with a low write load.

Say it takes 10 minutes to transfer the raw store wholesale, but that there’s only been a single write since your last backup. An incremental backup in this case would be effectively instant.

But conversely, imagine you have a two month old backup, and performing an incremental backup would involve applying hundreds of millions of transactions. The 10 minutes you save by not copying the full store will likely be overtaken by the long time it takes to catch up your old backup image.

The question is: when exactly does that overtake point occur?

Setup

There are three key variables in this problem to be aware of:

  • download_time - how long a full backup takes just to copy the database files
  • apply_rate - the rate at which we can apply transactions to the backup
  • backup_age - how many transactions behind the existing backup is in the incremental case

When I first started modelling this, I was convinced you also had to take into account the arrival rate of new transactions. It turns out you do if arrival_rate is greater than apply_rate, because then you are, if you’ll excuse my language, fucked. The backup will never complete, falling further and further behind the source database. Think of doing a backup from a large production server under high load onto a small backup machine.

Otherwise, it factors out.

We can estimate the time for each method with these two formulas:

full_backup_time = 
  download_time 
  + (transactions_that_arrived_during_download / apply_rate)
  
incremental_backup_time = 
  (backup_age / apply_rate) 
  + (transactions_that_arrived_during_catchup / apply_rate)

The first term in each (download_time and (backup_age / apply_rate)) is about creating a base snapshot. Once complete, that base snapshot is as up to date as the source was when the backup started.

The second term then brings the base snapshot up to speed with the source. There’s some nuance here, because in some cases you can skip this step for an incremental backup. The rest of this post assumes you need the backup to be in sync with the source at the time the backup completes.
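
Translated literally into Python, with the transactions_that_arrived_during_* terms written as arrival_rate times the length of the first term. Like the formulas above, this ignores the small tail of transactions that arrive during the catch-up itself.

def full_backup_time(download_time, arrival_rate, apply_rate):
    # Time to copy the store, plus time to replay whatever
    # arrived while the copy was running.
    arrived_during_download = arrival_rate * download_time
    return download_time + arrived_during_download / apply_rate

def incremental_backup_time(backup_age, arrival_rate, apply_rate):
    # Time to replay the backlog, plus time to replay whatever
    # arrived while we were replaying the backlog.
    catchup_time = backup_age / apply_rate
    arrived_during_catchup = arrival_rate * catchup_time
    return catchup_time + arrived_during_catchup / apply_rate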

Intuition

If you look at the two second terms, you’ll note they have the same shape. Each is the number of transactions that arrived during the first term, divided by apply_rate, which works out to first_term * arrival_rate / apply_rate.

This means the second term doesn’t change the comparison. Each total is just its first term multiplied by the same factor, (1 + arrival_rate / apply_rate), so whichever strategy has the shorter first term also has the shorter total. The only thing that matters is how long the first term takes. This is also why arrival_rate factors out, provided it stays below apply_rate.

Maybe this is obvious, but it took me a while to get it. An example might help.

Say:

  • The source database you are backing up from is at transaction 1000
  • New transactions are arriving at 1 tx / s
  • The old backup you are doing incremental backup onto is at transaction 100

Then

  • Once the first portion of the full backup is done, where it downloads the files, it will be, effectively, at transaction 1000, same as the source it was downloaded from
  • Once the incremental backup has replayed the first 900 transactions, it will also be a snapshot of the store at transaction 1000

Both base snapshots now need to replay the transactions that arrived at 1 tx / s while they were produced. How many of those transactions there are depends only on how long the base snapshots took to produce.

And so, we can factor out the second term in both formulas.
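
To see it in numbers, plug the example above into the estimate functions from the earlier sketch and vary only the arrival rate. The apply_rate of 50 tx/s is a made-up figure for illustration; the 600 seconds is the 10 minute download from before, and 900 is how far behind the old backup is.

for arrival_rate in (0.1, 1.0, 10.0):  # tx / s, all below apply_rate
    full = full_backup_time(download_time=600, arrival_rate=arrival_rate, apply_rate=50)
    incr = incremental_backup_time(backup_age=900, arrival_rate=arrival_rate, apply_rate=50)
    winner = "full" if full < incr else "incremental"
    print(f"arrival_rate={arrival_rate}: full={full:.0f}s, incremental={incr:.0f}s -> {winner}")

The gap between the two moves around, but the winner never changes.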

Putting it together

If you buy the argument above (and if you’re into this stuff you know I’m hand-waving several things), the rest is reasonably straightforward.

Referring back to the formulas in the Setup section, if we take out the second term in each, we get:

full_backup_base_snapshot_time = 
  download_time

incremental_backup_base_snapshot_time = 
  (backup_age / apply_rate)

Setting them equal tells us where they intersect:

download_time = backup_age / apply_rate

Solving for backup_age tells us how old a backup needs to be before a full backup is the better choice:

backup_age = download_time * apply_rate

Rule of thumb

A full backup is faster if backup_age is greater than download_time * apply_rate

Where

  • backup_age is the number of transactions behind you are when you start the incremental backup
  • download_time is the time it takes to download the full database, not counting time to catch up afterwards
  • apply_rate is the rate at which your database is able to recover / apply / replay transactions for your particular workload
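
If you want your backup scripts to make this call automatically, the check itself is a one-liner; the real work is estimating the three parameters for your own hardware and workload. The numbers in the example call below are placeholders, not recommendations.

def pick_strategy(backup_age, download_time, apply_rate):
    # backup_age in transactions, download_time in seconds,
    # apply_rate in transactions per second.
    return "full" if backup_age > download_time * apply_rate else "incremental"

# A 10 minute download and a 5,000 tx/s apply rate put the crossover at
# 3,000,000 transactions, so a backup 4,000,000 transactions behind
# should be replaced by a full one.
print(pick_strategy(backup_age=4_000_000, download_time=600, apply_rate=5_000))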

Work on problems like this!

If you think queue-theory-ish things are interesting, we are hiring at Neo4j. Come build distributed systems and self-healing fleets of databases.

Acknowledgements

Foremost, to all the people that presumably figured this out a long time ago. Nothing is ever new in computing.

But also to Hugo and Andy for helping me think about this.

