by Neil Fraser & Christopher Allen
A common problem when archiving backups is how to have, at any given time, a backup from a day ago, a backup from two days ago, a backup from four days ago, and so on. Backups may be taken daily, but due to storage constraints most must eventually be deleted. More backups should be retained of recent history than of ancient history. Figuring out which backups to keep and which to delete is tricky.
There are many backup rotation schemes, such as "Grandfather, Father, Son", or the more elegant "Tower of Hanoi". However, while these systems are simple and produce good results, they don't recover well when there are holes in the backup cycles (resulting from downtime, data-loss, or other factors). These strategies are also wasteful of available media before they hit their maximum saturation point (e.g. if there is room to store 10 backups, then at the end of 10 days all 10 daily backups should be available). Furthermore, they don't scale once the media is fully used, requiring either the oldest history to be dropped or additional media to be added (albeit at a logarithmically decreasing rate).
The logarithmic strategy presented here is based on the premise that the available media will be maximally utilized, but will not be expected to grow. This strategy also accommodates irregular backups and changes to the amount of media available.
The process starts by taking and storing a new backup. Then, before the next backup is taken, a determination is made as to whether the media space is exhausted (this can be done either by counting the number of backups or by computing whether there is sufficient free space to store a new backup of approximately the same size as the previous one). While there is insufficient space, one existing backup is chosen for deletion.
To delete one backup, we start with three pieces of data, all of which must be in the same units (seconds suggested):
- CURRENT_TIME: the timestamp of the present moment.
- INTERVAL: the intended amount of time between backups.
- BACKUP_TIMES: a list of timestamps of the existing backups, sorted from newest to oldest.
From this we compute the following:
- BACKUP_COUNT: one less than the number of existing backups (the highest index into BACKUP_TIMES): BACKUP_TIMES.length - 1.
- TOTAL_TIME: the time spanned by the backups: CURRENT_TIME - BACKUP_TIMES.lastElement.
- MISSING: the number of backups that would have been taken during TOTAL_TIME but were not: max(TOTAL_TIME / INTERVAL - BACKUP_COUNT, 0).
- DECAY_RATE: (MISSING + 1) ^ (1 / BACKUP_COUNT).
- IDEAL_TIMES: CURRENT_TIME - INTERVAL * (n + (DECAY_RATE ^ n) - 1), where n is every integer from 0 to BACKUP_COUNT.
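As a sketch, these derived quantities can be computed directly from the three inputs. The Python below uses illustrative names and assumes BACKUP_TIMES is sorted newest first:

```python
def compute_ideal_times(current_time, interval, backup_times):
    """Compute the ideal timestamp for each backup slot.

    current_time, interval, and backup_times must share one unit
    (seconds suggested); backup_times is sorted newest first and
    holds at least two entries.
    """
    backup_count = len(backup_times) - 1
    total_time = current_time - backup_times[-1]  # span back to the oldest backup
    # Backups that should have been taken during TOTAL_TIME but weren't.
    missing = max(total_time / interval - backup_count, 0)
    decay_rate = (missing + 1) ** (1 / backup_count)
    # Recent slots are densely spaced, old slots sparsely spaced.
    return [current_time - interval * (n + decay_rate ** n - 1)
            for n in range(backup_count + 1)]
```

As a sanity check, when no backups are missing the decay rate is 1 and the ideal times are evenly spaced exactly INTERVAL apart.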
By this stage we now have
IDEAL_TIMES (a list of ideal time
stamps for backups) and
BACKUP_TIMES (a list of actual times).
Both lists are the same length. The remaining task is just to identify the one
existing backup which should be deleted.
The method for choosing this backup is to simulate the deletion of each
backup, measuring the total difference between the elements of
BACKUP_TIMES and IDEAL_TIMES, and choosing the
deletion that results in the minimum difference. Implementing this in O(n^2)
time is straightforward, but doing so in O(n) time is more interesting.
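The quadratic simulation might be sketched as follows; it assumes the imminent new backup will occupy slot 0, so that every element before the deleted one shifts right by one position:

```python
def choose_deletion_quadratic(current_time, backup_times, ideal_times):
    """Return the index of the backup whose deletion leaves the archive
    closest to ideal_times.  O(n^2) reference version.
    """
    best_index, best_error = None, float("inf")
    for k in range(len(backup_times)):
        # Simulate: element k is deleted, a new backup taken now fills
        # slot 0, and elements 0..k-1 shift right by one position.
        simulated = [current_time] + backup_times[:k] + backup_times[k + 1:]
        error = sum(abs(s - i) for s, i in zip(simulated, ideal_times))
        if error < best_error:
            best_index, best_error = k, error
    return best_index
```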
The first step for computing a linear-time solution is to create an array
RIGHT_DIFF of cumulative errors, starting from the right (RIGHT_DIFF is one element longer than BACKUP_TIMES, with the extra final element initialized to zero):
RIGHT_DIFF[n] = abs(BACKUP_TIMES[n] - IDEAL_TIMES[n]) + RIGHT_DIFF[n + 1]
The second step is to compute the cumulative errors starting from the left side.
LEFT_DIFF[n] = abs(BACKUP_TIMES[n - 1] - IDEAL_TIMES[n]) + LEFT_DIFF[n - 1]
Note that the
BACKUP_TIMES array is compared at
[n - 1]
since this would be the alignment after one element has been deleted. Also note
that actually constructing a LEFT_DIFF array is unnecessary, since
only the previous value is ever used.
The total projected error for deleting the element at position n is
LEFT_DIFF[n] + RIGHT_DIFF[n + 1]. Where this formula returns the smallest result,
delete the element at position n.
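The linear-time choice can be sketched like this, keeping LEFT_DIFF as a single running value rather than an array, as noted above:

```python
def choose_deletion_linear(backup_times, ideal_times):
    """O(n) deletion choice: deleting element n is projected to cost
    LEFT_DIFF[n] + RIGHT_DIFF[n + 1].
    """
    length = len(backup_times)
    # Cumulative error of the unshifted alignment, built from the right;
    # the extra final element is the zero base case.
    right_diff = [0] * (length + 1)
    for n in range(length - 1, -1, -1):
        right_diff[n] = abs(backup_times[n] - ideal_times[n]) + right_diff[n + 1]
    best_index, best_error = None, float("inf")
    left_diff = 0  # running LEFT_DIFF[n]; no array is required
    for n in range(length):
        total = left_diff + right_diff[n + 1]
        if total < best_error:
            best_index, best_error = n, total
        if n + 1 < length:
            # Shifted alignment: backup n would slide into slot n + 1.
            left_diff += abs(backup_times[n] - ideal_times[n + 1])
    return best_index
```

Because the left-hand sum is accumulated on the fly, the function needs only O(1) extra space beyond the RIGHT_DIFF array.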
As can be seen in the demo below, this algorithm tracks the ideal curve as closely as possible.
[Interactive demo: configurable maximum number of backups and backup interval in seconds; displays the current backup count, total time, and decay rate.]
Start by just pressing the 'Play' button. First the algorithm fills up the available storage space. After ten cycles it starts deleting selected backups.