*by Neil Fraser & Christopher Allen*

A common problem when archiving backups is how to have, at any given time, a backup from a day ago, a backup from two days ago, a backup from four days ago, and so on. Backups may be taken daily, but due to storage constraints most must be deleted. More backups should be retained of recent history than of ancient history. Figuring out which backups to keep and which to delete is tricky.

There are many backup rotation schemes, such as "Grandfather, Father, Son", or the more elegant "Tower of Hanoi". However, while these systems are simple and produce good results, they don't recover well when there are holes in the backup cycles (resulting from downtime, data-loss, or other factors). These strategies are also wasteful of available media before they hit their maximum saturation point (e.g. if there is room to store 10 backups, then at the end of 10 days all 10 daily backups should be available). Furthermore, they don't scale once the media is fully used, requiring either the oldest history to be dropped or additional media to be added (albeit at a logarithmically decreasing rate).

The logarithmic strategy presented here is based on the premise that the available media will be maximally utilized, but will not be expected to grow. This strategy also accommodates irregular backups and changes to the amount of media available.

The process starts by taking and storing a new backup. Then, before the next backup is taken, a determination is made as to whether the media space is exhausted (either by counting the number of backups, or by checking whether there is sufficient free space for a new backup of approximately the same size as the previous one). While there is insufficient space, one existing backup is chosen for deletion.
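The count-based check can be sketched as a trivial helper (a hypothetical function for illustration, not the page's own code; the free-space variant would query the file system instead):

```javascript
// Hypothetical helper: decide whether an existing backup must be
// deleted, using the count-based capacity check described above.
function needsPruning(backupTimes, maxBackups) {
  // Prune while the number of stored backups exceeds the media's capacity.
  return backupTimes.length > maxBackups;
}
```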

To delete one backup, we start with three pieces of data, all of which must be in the same units (seconds are suggested):

`INTERVAL`
- The expected time between regular backup events.

`CURRENT_TIME`
- The present time stamp.

`BACKUP_TIMES`
- A sorted array of time stamps for any existing backups (most recent first, oldest last).

From this we compute the following:

`BACKUP_COUNT`
- The desired number of backups. This is `BACKUP_TIMES.length - 1`.

`TOTAL_TIME`
- The total date range between now and the oldest backup. This is `CURRENT_TIME - BACKUP_TIMES.lastElement`.

`MISSING`
- The number of backups that have been deleted or are otherwise not present. This is `max(TOTAL_TIME / INTERVAL - BACKUP_COUNT, 0)`.

`DECAY_RATE`
- A decay rate of 2.0 would dictate that each archived backup would ideally be twice the age of its predecessor. The decay rate is computed as `(MISSING + 1) ^ (1 / BACKUP_COUNT)`.

`IDEAL_TIMES`
- An array of time stamps that the backups should have. This is computed with `CURRENT_TIME - INTERVAL * (n + (DECAY_RATE ^ n) - 1)` where `n` is every integer from `0` (inclusive) to `BACKUP_COUNT` (inclusive).
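These definitions can be sketched in JavaScript (an illustrative reimplementation built from the formulas above, not the page's own source):

```javascript
// Compute the ideal backup time stamps from the definitions above.
// All arguments are in the same units (e.g. seconds); backupTimes is
// sorted most recent first, oldest last.
function computeIdealTimes(interval, currentTime, backupTimes) {
  const backupCount = backupTimes.length - 1;
  const totalTime = currentTime - backupTimes[backupTimes.length - 1];
  const missing = Math.max(totalTime / interval - backupCount, 0);
  // With no missing backups the decay rate is 1 (evenly spaced ideals).
  const decayRate = Math.pow(missing + 1, 1 / backupCount);
  const ideals = [];
  for (let n = 0; n <= backupCount; n++) {
    ideals.push(currentTime - interval * (n + Math.pow(decayRate, n) - 1));
  }
  return ideals;
}
```

Note that when no backups are missing, `DECAY_RATE` is 1 and the ideal times come out evenly spaced, `INTERVAL` apart.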

By this stage we now have `IDEAL_TIMES` (a list of ideal time stamps for backups) and `BACKUP_TIMES` (a list of actual times). Both lists are the same length. The remaining task is just to identify the one existing backup which should be deleted.

The method for choosing this backup is to simulate the deletion of each backup, measuring the total difference between the elements of `BACKUP_TIMES` and `IDEAL_TIMES`, and choosing the deletion that results in the minimum difference. Implementing this in O(n^2) time is straightforward, but doing so in O(n) time is more interesting.

The first step in computing a linear-time solution is to create an array called `RIGHT_DIFF` of cumulative errors, starting from the right side:

`RIGHT_DIFF[n] = abs(BACKUP_TIMES[n] - IDEAL_TIMES[n]) + RIGHT_DIFF[n + 1]`

The second step is to compute the cumulative errors starting from the left side:

`LEFT_DIFF[n] = abs(BACKUP_TIMES[n - 1] - IDEAL_TIMES[n]) + LEFT_DIFF[n - 1]`

Note that the `BACKUP_TIMES` array is compared at `[n - 1]` since this would be the alignment after one element has been deleted. Also note that actually constructing a `LEFT_DIFF` array is not needed, since only the previous value is ever used.

The total projected error for deleting each element is `LEFT_DIFF[n] + RIGHT_DIFF[n + 1]`. Where this formula returns the smallest result, delete the element at position `n`.
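Putting the two passes together, the linear-time scan can be sketched as follows (an illustrative reimplementation of the description above, not the page's own source; `LEFT_DIFF` is kept as a running value rather than an array):

```javascript
// Return the index of the backup whose deletion leaves the remaining
// time stamps closest to the ideal time stamps. Both arrays have the
// same length and run from most recent to oldest.
function chooseDeletion(backupTimes, idealTimes) {
  const len = backupTimes.length;
  // RIGHT_DIFF[n]: cumulative error from the right, unshifted alignment.
  const rightDiff = new Array(len + 1).fill(0);
  for (let n = len - 1; n >= 0; n--) {
    rightDiff[n] = Math.abs(backupTimes[n] - idealTimes[n]) + rightDiff[n + 1];
  }
  // LEFT_DIFF as a running value: cumulative error from the left with
  // the post-deletion alignment (BACKUP_TIMES compared at [n - 1]).
  let leftDiff = 0;
  let bestIndex = 0;
  let bestError = Infinity;
  for (let n = 0; n < len; n++) {
    const error = leftDiff + rightDiff[n + 1];
    if (error < bestError) {
      bestError = error;
      bestIndex = n;
    }
    if (n + 1 < len) {
      leftDiff += Math.abs(backupTimes[n] - idealTimes[n + 1]);
    }
  }
  return bestIndex;
}
```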

As can be seen in the demo below, this algorithm tracks the ideal curve as closely as possible.

Start by just pressing the 'Play' button. First the algorithm fills up the available storage space. After ten cycles it starts deleting selected backups.

At any time one can edit the parameters, including deleting backups or introducing discontinuities in the list of backups. Open your browser's console to see a log of creation/deletion events while the demo is running. View this page's source for a JavaScript implementation.

Last modified: 29 November 2018