mirror of https://github.com/gilbertchen/duplicacy synced 2025-12-06 00:03:38 +00:00

Compare commits


93 Commits

Author SHA1 Message Date
gilbertchen
8ba06efb85 Update DESIGN.md 2016-02-26 15:27:43 -05:00
gilbertchen
f00c47faf1 Update DESIGN.md 2016-02-26 13:45:58 -05:00
gilbertchen
da478ae340 Update DESIGN.md 2016-02-26 12:39:55 -05:00
gilbertchen
d76576e508 Update DESIGN.md 2016-02-26 12:37:20 -05:00
gilbertchen
21fb36a078 Update DESIGN.md 2016-02-26 12:36:12 -05:00
gilbertchen
22ba312ca8 Update GUIDE.md 2016-02-26 08:16:32 -05:00
gilbertchen
34a8090ca6 Update GUIDE.md 2016-02-26 08:14:31 -05:00
gilbertchen
58a876c4d6 Update README.md 2016-02-26 08:01:35 -05:00
gilbertchen
efca1b4459 Update README.md 2016-02-26 00:59:04 -05:00
gilbertchen
93a640cfc1 Update README.md 2016-02-26 00:17:46 -05:00
gilbertchen
d9f2eab28b Update README.md 2016-02-26 00:07:24 -05:00
gilbertchen
b09190a79c Update README.md 2016-02-26 00:06:04 -05:00
gilbertchen
2e58b8739e Update README.md 2016-02-25 22:01:13 -05:00
gilbertchen
38ba6a3400 Update README.md 2016-02-25 20:26:38 -05:00
gilbertchen
7ea2139b37 Update README.md 2016-02-25 20:26:07 -05:00
gilbertchen
1a1983cc09 Update README.md 2016-02-25 20:24:22 -05:00
gilbertchen
96b10648b5 Update README.md 2016-02-25 16:30:11 -05:00
gilbertchen
fe86a58a64 Update README.md 2016-02-25 15:13:04 -05:00
gilbertchen
30d4201192 Update README.md 2016-02-25 15:12:29 -05:00
gilbertchen
5631fc4a50 Update README.md 2016-02-25 15:11:38 -05:00
gilbertchen
fb55d31dd0 Update README.md 2016-02-25 15:10:34 -05:00
gilbertchen
74a72ac93c Update README.md 2016-02-25 08:13:06 -05:00
gilbertchen
c74b0681bc Update GUIDE.md 2016-02-25 00:43:11 -05:00
gilbertchen
76e4f4f267 Update README.md 2016-02-24 22:17:15 -05:00
gilbertchen
3410f4fff1 Update GUIDE.md 2016-02-24 22:12:45 -05:00
gilbertchen
0b44bed06e Update GUIDE.md 2016-02-24 22:11:41 -05:00
gilbertchen
f9df29edca Update GUIDE.md 2016-02-24 21:56:42 -05:00
gilbertchen
a96f0ba3d3 Update GUIDE.md 2016-02-24 21:34:16 -05:00
gilbertchen
a6dba8707c Update GUIDE.md 2016-02-24 21:33:23 -05:00
gilbertchen
0870af3008 Update README.md 2016-02-24 21:32:18 -05:00
gilbertchen
b9affbdb0d Update GUIDE.md 2016-02-24 21:02:46 -05:00
gilbertchen
46b22cc161 Update README.md 2016-02-24 19:36:16 -05:00
gilbertchen
d5bdd0bf81 Update GUIDE.md 2016-02-24 17:10:12 -05:00
gilbertchen
59fd834b14 Update GUIDE.md 2016-02-24 16:18:19 -05:00
gilbertchen
ab91fd414b Update GUIDE.md 2016-02-24 15:34:44 -05:00
gilbertchen
c3d0da9983 Update GUIDE.md 2016-02-24 15:30:54 -05:00
gilbertchen
c80a021542 Update GUIDE.md 2016-02-24 14:45:51 -05:00
gilbertchen
106ddd0581 Update GUIDE.md 2016-02-24 14:08:44 -05:00
gilbertchen
6153fdd254 Update GUIDE.md 2016-02-24 13:41:43 -05:00
gilbertchen
2bf7df9189 Update GUIDE.md 2016-02-24 13:37:57 -05:00
gilbertchen
1d4cf6f48b Update GUIDE.md 2016-02-24 12:28:01 -05:00
gilbertchen
013eaa5611 Create GUIDE.md 2016-02-24 12:02:35 -05:00
gilbertchen
011ed4e66e Update DESIGN.md 2016-02-24 11:54:18 -05:00
gilbertchen
0c67f47e2c Update DESIGN.md 2016-02-24 11:46:11 -05:00
gilbertchen
f5f1eeaaa5 Added files via upload 2016-02-24 11:41:11 -05:00
gilbertchen
9a8bd41057 Update README.md 2016-02-24 11:06:33 -05:00
gilbertchen
fdb788b026 Update README.md 2016-02-24 11:05:07 -05:00
gilbertchen
0aa3199291 Update README.md 2016-02-24 10:00:39 -05:00
gilbertchen
2255b756f6 Update README.md 2016-02-24 09:59:51 -05:00
gilbertchen
475b9ed378 Update README.md 2016-02-24 09:56:33 -05:00
gilbertchen
e23b486f15 Update README.md 2016-02-24 09:53:38 -05:00
gilbertchen
217dd99adc Delete start_dropbox_token.html 2016-02-24 09:51:39 -05:00
gilbertchen
fd0974e35a Create start_dropbox_token.html 2016-02-24 09:49:13 -05:00
gilbertchen
460baafe8e Update README.md 2016-02-24 09:46:58 -05:00
gilbertchen
b6aa983290 Update README.md 2016-02-24 09:42:32 -05:00
gilbertchen
24fb80ea61 Update README.md 2016-02-24 08:29:40 -05:00
gilbertchen
04b27d06ec Update README.md 2016-02-24 08:28:59 -05:00
gilbertchen
67d6706a2b Update DESIGN.md 2016-02-24 07:58:05 -05:00
gilbertchen
6f963a8194 Update README.md 2016-02-24 07:56:15 -05:00
gilbertchen
102150459e Update README.md 2016-02-23 23:02:37 -05:00
gilbertchen
04c8b69817 Update DESIGN.md 2016-02-23 22:54:55 -05:00
gilbertchen
d4b0e77ab4 Update DESIGN.md 2016-02-23 22:52:51 -05:00
gilbertchen
2a8126cf30 Update DESIGN.md 2016-02-23 22:51:58 -05:00
gilbertchen
36adbf59e4 Update DESIGN.md 2016-02-23 22:41:13 -05:00
gilbertchen
0ada49d11d Update DESIGN.md 2016-02-23 22:35:57 -05:00
gilbertchen
85d52feb42 Delete none 2016-02-23 22:29:05 -05:00
gilbertchen
a8ad7d130c Added files via upload 2016-02-23 22:28:39 -05:00
gilbertchen
d1a5874fc2 Create none 2016-02-23 22:27:40 -05:00
gilbertchen
4d50cf8622 Update README.md 2016-02-23 21:44:34 -05:00
gilbertchen
f1f16b5bab Update README.md 2016-02-23 21:36:09 -05:00
gilbertchen
bf4bfad413 Update README.md 2016-02-23 21:35:43 -05:00
gilbertchen
0a54795b7f Update README.md 2016-02-23 21:34:56 -05:00
gilbertchen
04a62b4ca2 Update README.md 2016-02-23 21:31:49 -05:00
gilbertchen
97419646f0 Update README.md 2016-02-23 21:31:26 -05:00
gilbertchen
0957eeba47 Update README.md 2016-02-23 21:30:27 -05:00
gilbertchen
1f585e2df3 Update README.md 2016-02-23 21:30:02 -05:00
gilbertchen
b3da6ad762 Update DESIGN.md 2016-02-23 21:02:08 -05:00
gilbertchen
328843b399 Update README.md 2016-02-23 16:17:22 -05:00
gilbertchen
896c2b5074 Update README.md 2016-02-23 15:40:33 -05:00
gilbertchen
9336fc97ae Update DESIGN.md 2016-02-23 15:38:02 -05:00
gilbertchen
ef9f1b7cb7 Update DESIGN.md 2016-02-23 15:35:07 -05:00
gilbertchen
9f1f5b7b23 Update README.md 2016-02-23 14:27:51 -05:00
gilbertchen
566a081224 Update DESIGN.md 2016-02-23 12:49:41 -05:00
gilbertchen
0e960106e4 Update DESIGN.md 2016-02-23 12:45:45 -05:00
gilbertchen
be187b7314 Update README.md 2016-02-23 12:41:15 -05:00
gilbertchen
19b0af86fc Update DESIGN.md 2016-02-23 12:40:53 -05:00
gilbertchen
b0d9fed137 Update DESIGN.md 2016-02-23 12:40:33 -05:00
gilbertchen
1c8fe0810d Update DESIGN.md 2016-02-23 12:23:00 -05:00
gilbertchen
9f816547b4 Create DESIGN.md 2016-02-23 12:19:43 -05:00
gilbertchen
73e5b398a4 Update README.md 2016-02-23 11:58:32 -05:00
gilbertchen
60881fb112 Update README.md 2016-02-23 11:47:35 -05:00
gilbertchen
57b297edfe Update README.md 2016-02-23 11:41:04 -05:00
gilbertchen
80c8ef7869 Add a short tutorial 2016-02-23 11:19:01 -05:00
6 changed files with 858 additions and 11 deletions

215
DESIGN.md Normal file

@@ -0,0 +1,215 @@
## Lock-Free Deduplication
The three elements of lock-free deduplication are:
* Use a variable-size chunking algorithm to split files into chunks
* Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without using a centralized indexing database
* Apply a *two-step fossil collection* algorithm to remove chunks that become unreferenced after a backup is deleted
The variable-size chunking algorithm, also called Content-Defined Chunking, is well known and has been adopted by many
backup tools. The main advantage of the variable-size chunking algorithm over the fixed-size chunking algorithm (as used
by rsync) is that in the former the rolling hash is only used to search for boundaries between chunks, after which a far
more collision-resistant hash function like MD5 or SHA256 is applied to each chunk. In contrast, in the fixed-size
chunking algorithm, for the purpose of detecting insertions or deletions, a lookup in the table of known hashes is required every
time the rolling hash window is shifted by one byte, thus significantly reducing the chunk splitting performance.
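To make the boundary search concrete, below is a minimal sketch of content-defined chunking in Go (the language Duplicacy is written in). The hash function, the divisor, and the size limits here are illustrative stand-ins; Duplicacy's actual rolling hash and parameters are not specified in this document.
```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	minChunk = 1 << 20  // illustrative 1M minimum
	maxChunk = 16 << 20 // illustrative 16M maximum
	divisor  = 4 << 20  // aim for a ~4M average chunk size
)

// split performs a toy content-defined chunking pass: a boundary is
// declared whenever a content-derived hash hits a fixed pattern,
// subject to the minimum and maximum chunk sizes. A real
// implementation uses a rolling hash over a small sliding window, so
// an insertion only shifts boundaries locally.
func split(data []byte) [][]byte {
	var chunks [][]byte
	var hash uint64
	start := 0
	for i := range data {
		hash = hash*31 + uint64(data[i])
		size := i - start + 1
		if size >= maxChunk || (size >= minChunk && hash%divisor == divisor-1) {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			hash = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 8<<20)
	for i := range data {
		data[i] = byte(i) // deterministic sample content
	}
	for _, c := range split(data) {
		// Each chunk is then named by a collision-resistant hash.
		h := sha256.Sum256(c)
		fmt.Printf("%x... %d bytes\n", h[:8], len(c))
	}
}
```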
What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing
chunks and for determining which chunks are no longer needed. Instead, to check whether a chunk has already been uploaded,
one can simply perform a file lookup via the file storage API, using the file name derived from the hash of the chunk.
This effectively turns a cloud storage offering only a very limited
set of basic file operations into a powerful modern backup backend capable of both block-level and file-level deduplication. More importantly, the absence of a centralized indexing database means that there is no need to implement a distributed locking mechanism on top of the file storage.
By eliminating the chunk indexing database, lock-free deduplication not only reduces the code complexity but also makes the deduplication less error-prone. Each chunk is saved individually in its own file and, once saved, never needs to be modified. Data corruption is therefore less likely to occur because of the immutability of chunk files. Another benefit that comes naturally from lock-free deduplication is that when one client creates a new chunk, other clients that happen to have the same original file will notice that the chunk already exists and therefore will not upload the same chunk again. This pushes deduplication to its highest level -- clients without knowledge of each other can share identical chunks with no extra effort.
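As a sketch of this idea, the following Go snippet uploads a chunk only when no file with the hash-derived name already exists. The `Storage` interface, the in-memory backend, and the `chunks/` prefix are hypothetical, chosen only to illustrate that basic file operations suffice.
```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Storage is a hypothetical backend exposing only basic file
// operations -- all that lock-free deduplication requires.
type Storage interface {
	Exists(name string) bool
	Upload(name string, data []byte) error
}

// saveChunk uploads a chunk only if no file with the hash-derived
// name is present, so identical chunks from any client are stored
// exactly once, with no indexing database involved.
func saveChunk(s Storage, chunk []byte) (string, error) {
	hash := sha256.Sum256(chunk)
	name := "chunks/" + hex.EncodeToString(hash[:])
	if s.Exists(name) {
		return name, nil // already uploaded, possibly by another client
	}
	return name, s.Upload(name, chunk)
}

// memStorage is an in-memory stand-in for a real backend.
type memStorage map[string][]byte

func (m memStorage) Exists(name string) bool { _, ok := m[name]; return ok }

func (m memStorage) Upload(name string, data []byte) error { m[name] = data; return nil }

func main() {
	s := memStorage{}
	a, _ := saveChunk(s, []byte("identical content"))
	b, _ := saveChunk(s, []byte("identical content")) // second client: no re-upload
	fmt.Println(a == b, len(s))                       // true 1
}
```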
There is one problem, though.
Deletion of snapshots without an indexing database, when concurrent access is permitted, turns out to be a hard problem.
If exclusive access to a file storage by a single client can be guaranteed, the deletion procedure can simply search for
chunks not referenced by any backup and delete them. However, if concurrent access is required, an unreferenced chunk
can't be trivially removed, because of the possibility that a backup procedure in progress may reference the same chunk.
The ongoing backup procedure, still unknown to the deletion procedure, may have already encountered that chunk during its
file scanning phase, but decided not to upload the chunk again since it already exists in the file storage.
Fortunately, there is a solution to address the deletion problem and make lock-free deduplication practical. The solution is a *two-step fossil collection* algorithm that deletes unreferenced chunks in two steps: identify and collect them in the first step, and then permanently remove them once certain conditions are met.
## Two-Step Fossil Collection
Interestingly, the two-step fossil collection algorithm hinges on a basic file operation supported almost universally, *file renaming*.
When the deletion procedure identifies a chunk not referenced by any known snapshots, instead of deleting the chunk file
immediately, it changes the name of the chunk file (and possibly moves it to a different directory).
A chunk that has been renamed is called a *fossil*.
The fossil still exists in the file storage. Two rules are enforced regarding the access of fossils:
* A restore, list, or check procedure that reads existing backups can read the fossil if the original chunk cannot be found.
* A backup procedure does not check the existence of a fossil. That is, it must upload a chunk if it cannot find the chunk, even if an equivalent fossil exists.
In the first step of the deletion procedure, called the *fossil collection* step, the names of all identified fossils will
be saved in a fossil collection file. The deletion procedure then exits without performing further actions. Because of the first fossil access rule, this step does not effectively change any chunk references. If a backup procedure references a chunk after it is marked as a fossil, a new chunk will be uploaded because of the second fossil access rule, as shown in Figure 1.
<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_1.png?raw=true"
alt="Reference after Rename"/>
</p>
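A minimal sketch of the fossil collection step is shown below, assuming a hypothetical storage API that offers only a `Rename` operation; the `fossils/` prefix and the function names are illustrative, not Duplicacy's actual layout.
```go
package main

import "fmt"

// renamer is a hypothetical storage API: fossil collection only
// needs a rename operation, which nearly every backend supports.
type renamer interface {
	Rename(from, to string) error
}

// collectFossils is step one of two-step fossil collection: every
// chunk not referenced by any known snapshot is renamed to a fossil,
// and the fossil names are recorded for the later fossil deletion
// step. No chunk is deleted here.
func collectFossils(s renamer, allChunks []string, referenced map[string]bool) ([]string, error) {
	var collection []string
	for _, chunk := range allChunks {
		if referenced[chunk] {
			continue
		}
		fossil := "fossils/" + chunk
		if err := s.Rename("chunks/"+chunk, fossil); err != nil {
			return nil, err
		}
		collection = append(collection, fossil)
	}
	return collection, nil
}

type noopRenamer struct{}

func (noopRenamer) Rename(from, to string) error { return nil }

func main() {
	fossils, _ := collectFossils(noopRenamer{},
		[]string{"aa11", "bb22"}, map[string]bool{"aa11": true})
	fmt.Println(fossils) // [fossils/bb22]
}
```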
The second step, called the *fossil deletion* step, will permanently delete fossils, but only when two conditions are met:
* For each snapshot id, there is a new snapshot that was not seen by the fossil collection step
* The new snapshot must finish after the fossil collection step
The first condition guarantees that if a backup procedure references a chunk before the deletion procedure turns it into a fossil, the reference will be detected in the fossil deletion step, which will then turn the fossil back into a normal chunk.
The second condition guarantees that any backup procedure unknown to the fossil deletion step can start only after the fossil collection step finishes. Therefore, if it references a chunk that was identified as a fossil in the fossil collection step, it will observe the fossil, not the chunk, and will thus upload a new chunk, according to the second fossil access rule.
Therefore, if a backup procedure references a chunk before the chunk is marked as a fossil, the fossil deletion step will not
delete the chunk until it sees that backup procedure finish (as indicated by the appearance of a new snapshot file uploaded to the storage). This ensures that the scenarios depicted in Figure 2 will never happen.
<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_2.png?raw=true"
alt="Reference before Rename"/>
</p>
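The eligibility check of the fossil deletion step can be sketched as follows. The representation of snapshots, and the use of end times to approximate "new, and finished after the collection step", are assumptions made for illustration.
```go
package main

import (
	"fmt"
	"time"
)

// snapshot records, for this sketch, just what the deletion check
// needs: which repository it came from and when it finished.
type snapshot struct {
	ID      string
	EndTime time.Time
}

// canDeleteFossils checks the two safety conditions of the fossil
// deletion step: every known snapshot id must have produced a new
// snapshot that the fossil collection step did not see, and that
// snapshot must have finished after the collection step.
func canDeleteFossils(collectionTime time.Time, ids []string, latest map[string]snapshot) bool {
	for _, id := range ids {
		s, ok := latest[id]
		if !ok || !s.EndTime.After(collectionTime) {
			return false // this repository has no qualifying new snapshot yet
		}
	}
	return true
}

func main() {
	collected := time.Now().Add(-24 * time.Hour)
	latest := map[string]snapshot{
		"host1": {"host1", time.Now()},
	}
	fmt.Println(canDeleteFossils(collected, []string{"host1", "host2"}, latest)) // false: host2 lags
}
```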
## Snapshot Format
A snapshot file is a file that the backup procedure uploads to the file storage after it finishes splitting files into
chunks and uploading all new chunks. It mainly contains metadata for the backup overall, metadata for all the files,
and chunk references for each file. Here is an example snapshot file for a repository containing 3 files (file1, file2,
and dir1/file3):
```json
{
  "id": "host1",
  "revision": 1,
  "tag": "first",
  "start_time": 1455590487,
  "end_time": 1455590487,
  "files": [
    {
      "path": "file1",
      "content": "0:0:2:6108",
      "hash": "a533c0398194f93b90bd945381ea4f2adb0ad50bd99fd3585b9ec809da395b51",
      "size": 151901,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "file2",
      "content": "2:6108:3:7586",
      "hash": "f6111c1562fde4df9c0bafe2cf665778c6e25b49bcab5fec63675571293ed644",
      "size": 172071,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "dir1/",
      "size": 102,
      "time": 1455590487,
      "mode": 2147484096
    },
    {
      "path": "dir1/file3",
      "content": "3:7586:4:1734",
      "hash": "6bf9150424169006388146908d83d07de413de05d1809884c38011b2a74d9d3f",
      "size": 118457,
      "time": 1455590487,
      "mode": 420
    }
  ],
  "chunks": [
    "9f25db00881a10a8e7bcaa5a12b2659c2358a579118ea45a73c2582681f12919",
    "6e903aace6cd05e26212fcec1939bb951611c4179c926351f3b20365ef2c212f",
    "4b0d017bce5491dbb0558c518734429ec19b8a0d7c616f68ddf1b477916621f7",
    "41841c98800d3b9faa01b1007d1afaf702000da182df89793c327f88a9aba698",
    "7c11ee13ea32e9bb21a694c5418658b39e8894bbfecd9344927020a9e3129718"
  ],
  "lengths": [
    64638,
    81155,
    170593,
    124309,
    1734
  ]
}
```
When Duplicacy splits a file into chunks using the variable-size chunking algorithm, if the end of the file is reached before the boundary marker terminating a chunk
has been found, the next file, if there is one, will be read in and the chunking algorithm continues. It is as if all
files were packed into a big tar file which is then split into chunks.
The *content* field of a file indicates the indexes of its starting and ending chunks and the corresponding offsets. For
instance, *file1* starts at chunk 0 offset 0 and ends at chunk 2 offset 6108, immediately followed by *file2*.
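A small sketch of decoding this field, under the assumption that it is exactly the colon-separated quadruple shown in the example above:
```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseContent decodes a snapshot *content* field of the form
// "startChunk:startOffset:endChunk:endOffset"; e.g. "0:0:2:6108"
// means the file begins at offset 0 of chunk 0 and ends at offset
// 6108 of chunk 2.
func parseContent(content string) (startChunk, startOffset, endChunk, endOffset int, err error) {
	parts := strings.Split(content, ":")
	if len(parts) != 4 {
		return 0, 0, 0, 0, fmt.Errorf("malformed content field %q", content)
	}
	n := make([]int, 4)
	for i, p := range parts {
		if n[i], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, 0, err
		}
	}
	return n[0], n[1], n[2], n[3], nil
}

func main() {
	sc, so, ec, eo, _ := parseContent("0:0:2:6108")
	fmt.Println(sc, so, ec, eo) // 0 0 2 6108
}
```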
The backup procedure can run in one of two modes. In the default quick mode, only modified or new files are scanned. Chunks
referenced only by old files that have been modified are removed from the chunk sequence, and then chunks referenced by new
files are appended. Indices for unchanged files need to be updated too.
In the safe mode (enabled by the -hash option), all files are scanned and the chunk sequence is regenerated.
The length sequence stores the lengths of all chunks, which are needed when calculating statistics such as the total
length of chunks. For a repository containing a large number of files, the snapshot file can be very large.
To make matters worse, a big snapshot file would have to be uploaded for every backup, even if only a few files have changed since the
last one. To save space, the variable-size chunking algorithm is also applied to the three dynamic fields of a snapshot
file, *files*, *chunks*, and *lengths*.
Chunks produced during this step are deduplicated and uploaded in the same way as regular file chunks. The final snapshot file
contains sequences of chunk hashes and other fixed size fields:
```json
{
  "id": "host1",
  "revision": 1,
  "start_time": 1455590487,
  "tag": "first",
  "end_time": 1455590487,
  "file_sequence": [
    "21e4c69f3832e32349f653f31f13cefc7c52d52f5f3417ae21f2ef5a479c3437"
  ],
  "chunk_sequence": [
    "8a36ffb8f4959394fd39bba4f4a464545ff3dd6eed642ad4ccaa522253f2d5d6"
  ],
  "length_sequence": [
    "fc2758ae60a441c244dae05f035136e6dd33d3f3a0c5eb4b9025a9bed1d0c328"
  ]
}
```
In the extreme case where the repository has not been modified since the last backup, a new backup procedure will not create any new chunks,
as shown by the following output from a real use case:
```
$ duplicacy backup -stats
Storage set to sftp://gchen@192.168.1.100/Duplicacy
Last backup at revision 260 found
Backup for /Users/gchen/duplicacy at revision 261 completed
Files: 42367 total, 2,204M bytes; 0 new, 0 bytes
File chunks: 447 total, 2,238M bytes; 0 new, 0 bytes, 0 bytes uploaded
Metadata chunks: 6 total, 11,753K bytes; 0 new, 0 bytes, 0 bytes uploaded
All chunks: 453 total, 2,249M bytes; 0 new, 0 bytes, 0 bytes uploaded
Total running time: 00:00:05
```
## Encryption
When encryption is enabled (by the -e option with the *init* or *add* command), Duplicacy will generate four random 256-bit keys:
* *Hash Key*: for generating a chunk hash from the content of a chunk
* *ID Key*: for generating a chunk id from a chunk hash
* *Chunk Key*: for encrypting chunk files
* *File Key*: for encrypting non-chunk files such as snapshot files.
Here is a diagram showing how these keys are used:
<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/duplicacy_encryption.png?raw=true"
alt="encryption"/>
</p>
Chunk hashes are used internally and stored in the snapshot file. They are never exposed unless the snapshot file is decrypted. Chunk ids are used as the file names for the chunks and are therefore exposed. When the *cat* command is used to print out a snapshot file, the chunk hashes stored in the snapshot file will first be converted into chunk ids, which are then displayed instead.
Chunk content is encrypted by AES-GCM, with an encryption key that is the HMAC-SHA256 of the chunk hash, with the *Chunk Key* as the secret key.
The snapshot file is encrypted by AES-GCM too, using an encryption key that is the HMAC-SHA256 of the file path, with the *File Key* as the secret key.
These four random keys are saved in a file named 'config' in the file storage, encrypted with a master key derived by applying the PBKDF2 function to
the storage password chosen by the user.
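The following Go sketch puts the pieces together for a single chunk. The HMAC construction for the chunk encryption key follows the description above; using the same HMAC construction for the chunk id, as well as the nonce handling and output layout, are assumptions made for illustration.
```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// derive computes HMAC-SHA256 of message under key, the kind of
// derivation described above for chunk ids and chunk encryption keys.
func derive(key, message []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(message)
	return mac.Sum(nil)
}

// encryptChunk names a chunk by a key derived from the ID Key and
// encrypts it with AES-GCM under a key derived from the Chunk Key.
// Nonce handling and the output layout are illustrative, not
// Duplicacy's exact on-disk format.
func encryptChunk(idKey, chunkKey, chunk []byte) (name string, sealed []byte, err error) {
	chunkHash := sha256.Sum256(chunk)
	name = hex.EncodeToString(derive(idKey, chunkHash[:]))

	block, err := aes.NewCipher(derive(chunkKey, chunkHash[:]))
	if err != nil {
		return "", nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", nil, err
	}
	return name, gcm.Seal(nonce, nonce, chunk, nil), nil
}

func main() {
	idKey, chunkKey := make([]byte, 32), make([]byte, 32)
	rand.Read(idKey)
	rand.Read(chunkKey)
	name, sealed, _ := encryptChunk(idKey, chunkKey, []byte("chunk content"))
	fmt.Println(name, len(sealed))
}
```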

479
GUIDE.md Normal file

@@ -0,0 +1,479 @@
# Duplicacy User Guide
## Commands
#### Init
```
SYNOPSIS:
duplicacy init - Initialize the storage if necessary and the current directory as the repository
USAGE:
duplicacy init [command options] <snapshot id> <storage url>
OPTIONS:
-encrypt, -e encrypt the storage with a password
-chunk-size, -c 4M the average size of chunks
-max-chunk-size, -max 16M the maximum size of chunks (defaults to chunk-size * 4)
-min-chunk-size, -min 1M the minimum size of chunks (defaults to chunk-size / 4)
-compression-level, -l <level> compression level (defaults to -1)
```
The *init* command first connects to the storage specified by the storage URL. If the storage has already been
initialized before, it will download the storage configuration (stored in the file named *config*) and ignore the options provided on the command line. Otherwise, it will create the configuration file from the options and upload the file.
The initialized storage will then become the default storage for other commands if the -storage option is not specified
for those commands. This default storage actually has a name, *default*.
After that, it will prepare the current working directory as the repository to be backed up. Under the hood, it will create a directory
named *.duplicacy* in the repository and put a file named *preferences* that stores the snapshot id and encryption and storage options.
The snapshot id is an id used to distinguish different repositories connected to the same storage. Each repository must have a unique snapshot id.
The -e option controls whether or not encryption will be enabled for the storage. If encryption is enabled, you will be prompted to enter a storage password.
The three chunk size parameters are passed to the variable-size chunking algorithm. Their values are important to the overall performance, especially for cloud storages. If the chunk size is too small, a lot of overhead will be incurred in sending requests and receiving responses. If the chunk size is too large, the effect of deduplication will be less obvious as more data will need to be transferred with each chunk.
The compression level parameter is passed to the zlib library. Valid values are -1 through 9, with 0 meaning no compression, 9 best compression (slowest), and -1 being the default value (equivalent to level 6).
Once a storage has been initialized, these parameters cannot be modified any more.
#### Backup
```
SYNOPSIS:
duplicacy backup - Save a snapshot of the repository to the storage
USAGE:
duplicacy backup [command options]
OPTIONS:
-hash detect file differences by hash (rather than size and timestamp)
-t <tag> assign a tag to the backup
-stats show statistics during and after backup
-vss enable the Volume Shadow Copy service (Windows only)
-storage <storage name> backup to the specified storage instead of the default one
```
The *backup* command creates a snapshot of the repository and uploads it to the storage. If -hash is not provided,
it will upload only files that are new or modified since the last backup, detected by comparing file sizes and timestamps.
Otherwise, every file is scanned to detect changes.
You can assign a tag to the snapshot so that later you can refer to it by tag in other commands.
If the -stats option is specified, statistical information such as the transfer speed and the number of chunks will be displayed
throughout the backup procedure.
The -vss option works on Windows only to turn on the Volume Shadow Copy service such that files opened by other
processes with exclusive locks can be read as usual.
When the repository has multiple storages (added by the *add* command), you can select the storage to back up to
by giving a storage name.
You can specify patterns to include/exclude files by putting them in a file named *.duplicacy/filters*. Please refer to the [Include/Exclude Patterns](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#includeexclude-patterns) section for how to specify the patterns.
#### Restore
```
SYNOPSIS:
duplicacy restore - Restore the repository to a previously saved snapshot
USAGE:
duplicacy restore [command options] [--] [pattern] ...
OPTIONS:
-r <revision> the revision number of the snapshot (required)
-hash detect file differences by hash (rather than size and timestamp)
-overwrite overwrite existing files in the repository
-delete delete files not in the snapshot
-stats show statistics during and after restore
-storage <storage name> restore from the specified storage instead of the default one
```
The *restore* command restores the repository to a previous revision. By default the restore procedure will treat
files that have the same sizes and timestamps as those in the snapshot as unchanged files, but with the -hash option, every file will be fully scanned to make sure they are in fact unchanged.
By default the restore procedure will not overwrite existing files, unless the -overwrite option is specified.
The -delete option indicates that files not in the snapshot will be removed.
If the -stats option is specified, statistical information such as the transfer speed and the number of chunks will be displayed
throughout the restore procedure.
When the repository has multiple storages (added by the *add* command), you can select the storage to restore from by specifying the storage name.
Unlike the *backup* procedure, which reads the include/exclude patterns from a file, the *restore* procedure reads them
from the command line. If the patterns can cause confusion to the command line argument parser, -- should be placed before
the patterns. Please refer to the [Include/Exclude Patterns](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#includeexclude-patterns) section for how to specify patterns.
#### List
```
SYNOPSIS:
duplicacy list - List snapshots
USAGE:
duplicacy list [command options]
OPTIONS:
-all, -a list snapshots with any id
-id <snapshot id> list snapshots with the specified id rather than the default one
-r <revision> [+] the revision number of the snapshot
-t <tag> list snapshots with the specified tag
-files print the file list in each snapshot
-chunks print chunks in each snapshot or all chunks if no snapshot specified
-reset-password take passwords from input rather than keychain/keyring or env
-storage <storage name> retrieve snapshots from the specified storage
```
The *list* command lists information about specified snapshots. By default it will list snapshots created from the
current repository, but you can list all snapshots stored in the storage by specifying the -all option, or list snapshots
with a different snapshot id using the -id option, and/or snapshots with a particular tag with the -t option.
The revision number is a number assigned to the snapshot when it is being created. This number will keep increasing
every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using
the -r option, which either takes a single revision number (-r 123) or a range (-r 123-456).
There can be multiple -r options.
If -files is specified, for each snapshot to be listed, this command will also print information about every file
contained in the snapshot.
If -chunks is specified, the command will also print out every chunk the snapshot references.
The -reset-password option is used to reset stored passwords and to allow passwords to be entered again. Please refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more information.
When the repository has multiple storages (added by the *add* command), you can specify the storage to list
by giving the storage name.
#### Check
```
SYNOPSIS:
duplicacy check - Check the integrity of snapshots
USAGE:
duplicacy check [command options]
OPTIONS:
-all, -a check snapshots with any id
-id <snapshot id> check snapshots with the specified id rather than the default one
-r <revision> [+] the revision number of the snapshot
-t <tag> check snapshots with the specified tag
-fossils search fossils if a chunk can't be found
-resurrect turn referenced fossils back into chunks
-files verify the integrity of every file
-storage <storage name> retrieve snapshots from the specified storage
```
The *check* command checks, for each specified snapshot, that all referenced chunks exist in the storage.
By default the *check* command will check snapshots created from the
current repository, but you can check all snapshots stored in the storage at once by specifying the -all option, or
snapshots from a different repository using the -id option, and/or snapshots with a particular tag with the -t option.
The revision number is a number assigned to the snapshot when it is being created. This number will keep increasing
every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using
the -r option, which either takes a single revision number (-r 123) or a range (-r 123-456).
There can be multiple -r options.
By default the *check* command only verifies the existence of chunks. To verify the full integrity of a snapshot,
you should specify the -files option, which will download chunks and compute file hashes in memory, to
make sure that all hashes match.
By default the *check* command does not look for fossils. If the -fossils option is specified, it will search for
the fossil when a referenced chunk cannot be found. If the -resurrect option is specified, it will turn the fossil back into a chunk.
When the repository has multiple storages (added by the *add* command), you can specify the storage to check
by giving the storage name.
#### Cat
```
SYNOPSIS:
duplicacy cat - Print to stdout the specified file, or the snapshot content if no file is specified
USAGE:
duplicacy cat [command options] [<file>]
OPTIONS:
-id <snapshot id> retrieve from the snapshot with the specified id
-r <revision> the revision number of the snapshot
-storage <storage name> retrieve the file from the specified storage
```
The *cat* command prints a file or the entire snapshot content if no file is specified.
The file must be specified with a path relative to the repository.
You can specify a different snapshot id rather than the default id.
The -r option is optional. If not specified, the latest revision will be selected.
You can use the -storage option to select a different storage other than the default one.
#### Diff
```
SYNOPSIS:
duplicacy diff - Compare two snapshots or two revisions of a file
USAGE:
duplicacy diff [command options] [<file>]
OPTIONS:
-id <snapshot id> diff with the snapshot with the specified id
-r <revision> [+] the revision number of the snapshot
-hash compute the hashes of on-disk files
-storage <storage name> retrieve files from the specified storage
```
The *diff* command compares the same file in two different snapshots if a file is given, otherwise compares the
two snapshots.
The file must be specified with a path relative to the repository.
You can specify a different snapshot id rather than the default snapshot id.
If only one revision is given by -r, the right hand side of the comparison will be the on-disk file.
The -hash option can then instruct this command to compute the hash of the file.
You can use the -storage option to select a different storage other than the default one.
#### History
```
SYNOPSIS:
duplicacy history - Show the history of a file
USAGE:
duplicacy history [command options] <file>
OPTIONS:
-id <snapshot id> find the file in the snapshot with the specified id
-r <revision> [+] show history of the specified revisions
-hash show the hash of the on-disk file
-storage <storage name> retrieve files from the specified storage
```
The *history* command shows how the hash, size, and timestamp of a file change over the specified set of revisions.
You can specify a different snapshot id rather than the default snapshot id, and multiple -r options to specify the
set of revisions.
The -hash option computes the hash of the on-disk file. Otherwise, only the size and timestamp of the on-disk
file will be included.
You can use the -storage option to select a different storage other than the default one.
#### Prune
```
SYNOPSIS:
duplicacy prune - Prune snapshots by revision, tag, or retention policy
USAGE:
duplicacy prune [command options]
OPTIONS:
-id <snapshot id> delete snapshots with the specified id instead of the default one
-all, -a match against all snapshot IDs
-r <revision> [+] delete snapshots with the specified revisions
-t <tag> [+] delete snapshots with the specified tags
-keep <n:m> [+] keep 1 snapshot every n days for snapshots older than m days
-exhaustive find all unreferenced chunks by scanning the storage
-exclusive assume exclusive access to the storage (disable two-step fossil collection)
-dry-run, -d show what would have been deleted
-delete-only delete fossils previously collected (if deletable) and don't collect fossils
-collect-only identify and collect fossils, but don't delete fossils previously collected
-ignore <id> [+] ignore the specified snapshot id when deciding if fossils can be deleted
-storage <storage name> prune snapshots from the specified storage
```
The *prune* command implements the two-step fossil collection algorithm. It will first find fossil collection files
from previous runs and check if contained fossils are eligible for permanent deletion (the fossil deletion step). Then it
will search for snapshots to be deleted, mark unreferenced chunks as fossils (by renaming) and save them in a new fossil
collection file stored locally (the fossil collection step).
If a snapshot id is specified, that snapshot id will be used instead of the default one. The -a option will find
snapshots with any id. Snapshots to be deleted can be specified by revision numbers, by a tag, by retention policies,
or by any combination of them.
The retention policies are specified by the -keep option, which accepts an argument in the form of two numbers *n:m*, where *n* indicates the number of days between two consecutive snapshots to keep, and *m* means that the policy only applies to snapshots at least *m* days old. If *n* is zero, any snapshots older than *m* days will be removed.
Here are a few sample retention policies:
```sh
$ duplicacy prune -keep 1:7 # Keep 1 snapshot per day for snapshots older than 7 days
$ duplicacy prune -keep 7:30 # Keep 1 snapshot every 7 days for snapshots older than 30 days
$ duplicacy prune -keep 30:180 # Keep 1 snapshot every 30 days for snapshots older than 180 days
$ duplicacy prune -keep 0:360 # Keep no snapshots older than 360 days
```
Multiple -keep options must be sorted by their *m* values in decreasing order. For instance, to combine the above policies into one line, it would become:
```sh
$ duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
```
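The semantics of stacked -keep options can be sketched as follows: for each snapshot, the first applicable policy (largest *m* first) decides its fate. This is only an illustration of the documented rules under simplifying assumptions, not Duplicacy's actual implementation.
```go
package main

import (
	"fmt"
	"time"
)

// policy mirrors one -keep n:m option: for snapshots older than M
// days, keep one snapshot every N days (N == 0 means delete all).
type policy struct{ N, M int }

// toDelete applies policies (sorted by decreasing M, as required)
// to snapshot times sorted oldest first, returning the indexes of
// snapshots to prune.
func toDelete(now time.Time, snapshots []time.Time, policies []policy) []int {
	var deleted []int
	var lastKept time.Time
	for i, t := range snapshots {
		age := int(now.Sub(t).Hours() / 24)
		keep := true
		for _, p := range policies {
			if age < p.M {
				continue // policy does not apply; try the next (smaller M)
			}
			// The first applicable policy wins.
			keep = p.N > 0 && (lastKept.IsZero() || t.Sub(lastKept) >= time.Duration(p.N)*24*time.Hour)
			break
		}
		if keep {
			lastKept = t
		} else {
			deleted = append(deleted, i)
		}
	}
	return deleted
}

func main() {
	now := time.Now()
	var snaps []time.Time
	for d := 40; d > 0; d-- { // one snapshot per day for 40 days
		snaps = append(snaps, now.Add(-time.Duration(d)*24*time.Hour))
	}
	// Equivalent of: duplicacy prune -keep 7:30 -keep 1:7
	fmt.Println(toDelete(now, snaps, []policy{{7, 30}, {1, 7}}))
}
```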
The -exhaustive option will scan the list of all chunks in the storage, so it will find not only
unreferenced chunks from deleted snapshots, but also chunks that became unreferenced for other reasons, such as
those from an incomplete backup. It will also find any files that do not look like chunk files.
In contrast, by default the *prune* command will only identify
chunks that are referenced by deleted snapshots but not by any remaining snapshots.
The -exclusive option will assume that no other clients are accessing the storage, effectively disabling the
*two-step fossil collection* algorithm. With this option, the *prune* command will immediately remove unreferenced chunks.
The -dry-run option is used to test what changes the *prune* command would have made. It is guaranteed not to make
any changes on the storage, not even creating the local fossil collection file. The following command checks whether the
chunk directory is clean (i.e., whether there are any unreferenced chunks, temporary files, or anything else):
```
$ duplicacy prune -d -exclusive -exhaustive # Prints out nothing if the chunk directory is clean
```
The -delete-only option will skip the fossil collection step, while the -collect-only option will skip the fossil deletion step.
For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least
one new snapshot from *each* snapshot id must be created between two runs of the *prune* command. However, some repositories
may not be set up to back up on a regular schedule, which would block other repositories from deleting any fossils. By default, Duplicacy ignores repositories that have no new backup in the past 7 days. It also provides an
-ignore option that can be used to skip certain repositories when deciding whether fossils can be deleted.
You can use the -storage option to select a different storage other than the default one.
#### Password
```
SYNOPSIS:
duplicacy password - Change the storage password
USAGE:
duplicacy password [command options]
OPTIONS:
-storage <storage name> change the password used to access the specified storage
```
The *password* command decrypts the storage configuration file *config* using the old password, and re-encrypts the file
using a new password. It does not change any of the encryption keys used to encrypt and decrypt chunk files,
snapshot files, etc.
You can specify the storage to change the password for when working with multiple storages.
#### Add
```
SYNOPSIS:
duplicacy add - Add an additional storage to be used for the existing repository
USAGE:
duplicacy add [command options] <storage name> <snapshot id> <storage url>
OPTIONS:
-encrypt, -e Encrypt the storage with a password
-chunk-size, -c 4M the average size of chunks
-max-chunk-size, -max 16M the maximum size of chunks (defaults to chunk-size * 4)
-min-chunk-size, -min 1M the minimum size of chunks (defaults to chunk-size / 4)
-compression-level, -l <level> compression level (defaults to -1)
-copy <storage name> make the new storage copy-compatible with an existing one
```
The *add* command connects another storage to the current repository. Like the *init* command, if the storage has not
been initialized before, a storage configuration file derived from the command line options will be uploaded, but those
options will be ignored if the configuration file already exists in the storage.
A unique storage name must be given in order to distinguish it from other storages.
The -copy option is required if later you want to copy snapshots between this storage and another storage.
Two storages are copy-compatible if they have the same average chunk size, the same maximum chunk size,
the same minimum chunk size, the same chunk seed (used in calculating the rolling hash in the variable-size chunks
algorithm), and the same hash key. If the -copy option is specified, these parameters will be copied from
the existing storage rather than from the command line.
#### Set
```
SYNOPSIS:
duplicacy set - Change the options for the default or specified storage
USAGE:
duplicacy set [command options]
OPTIONS:
-encrypt, -e[=true] encrypt the storage with a password
-no-backup[=true] backup to this storage is prohibited
-no-restore[=true] restore from this storage is prohibited
-no-save-password[=true] don't save password or access keys to keychain/keyring
-key add a key/password whose value is supplied by the -value option
-value the value of the key/password
-storage <storage name> use the specified storage instead of the default one
```
The *set* command changes the options for the specified storage.
The -e option turns on the storage encryption. If specified as -e=false, it turns off the storage encryption.
The -no-backup option disallows backups to this storage.
The -no-restore option disallows restores from this storage.
The -no-save-password option will require every password or token to be entered every time, without being saved anywhere.
The -key and -value options are used to store (in plain text) access keys or tokens needed by various storages. Please
refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more details.
You can select a storage to change options for by specifying a storage name.
#### Copy
```
SYNOPSIS:
duplicacy copy - Copy snapshots between compatible storages
USAGE:
duplicacy copy [command options]
OPTIONS:
-id <snapshot id> copy snapshots with the specified id instead of all snapshot ids
-r <revision> [+] copy snapshots with the specified revisions
-from <storage name> copy snapshots from the specified storage
-to <storage name> copy snapshots to the specified storage
```
The *copy* command copies snapshots from one storage to another. The two storages must be copy-compatible, i.e., certain
configuration parameters must be identical, which requires that one of them was initialized with the -copy option of the *add* command.
Instead of copying all snapshots, you can specify a set of snapshots to copy by giving the -r options. The *copy* command
preserves the revision numbers, so if a revision number already exists on the destination storage the command will fail.
If no -from option is given, the snapshots from the default storage will be copied. The -to option specifies the
destination storage and is required.
## Include/Exclude Patterns
An include pattern starts with +, and an exclude pattern starts with -. Patterns may contain wildcard characters such as * and ? with their normal meaning.
When matching a path against a list of patterns, the path is compared with the part after + or -, one pattern at a time. Therefore, the order of the patterns is significant. If a match with an include pattern is found, the path is said to be included without further comparisons. If a match with an exclude pattern is found, the path is said to be excluded without further comparisons. If no match is found, the path will be excluded if all patterns are include patterns, but included otherwise.
Note that the path in Duplicacy for a directory always ends with a /, even on Windows. The path of a file does not end with a /. This can be used to exclude directories only.
For the *backup* command, the include/exclude patterns are read from a file named *filters* under the *.duplicacy* directory.
For the *restore* command, the include/exclude patterns are specified as the command line arguments.
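A sketch of the first-match-wins rule described above is given below; `filepath.Match` stands in for the real wildcard semantics, which this guide does not fully specify.
```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// matchPath applies include (+) and exclude (-) patterns in order;
// the first match wins. If nothing matches, the path is excluded
// when every pattern is an include pattern, and included otherwise.
func matchPath(path string, patterns []string) bool {
	allInclude := true
	for _, p := range patterns {
		include := strings.HasPrefix(p, "+")
		if !include {
			allInclude = false
		}
		if ok, _ := filepath.Match(p[1:], path); ok {
			return include
		}
	}
	return !allInclude
}

func main() {
	patterns := []string{"+src/*", "-*.log"}
	fmt.Println(matchPath("src/main.go", patterns)) // true: first pattern matches
	fmt.Println(matchPath("build.log", patterns))   // false: excluded
	fmt.Println(matchPath("README.md", patterns))   // true: mixed patterns default to include
}
```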
## Managing Passwords
Duplicacy attempts to retrieve the storage password and the storage-specific access tokens/keys in three ways:
* If a secret vault service is available, Duplicacy will store the password input by the user in such a secret vault and later retrieve it when needed. On Mac OS X it is Keychain, and on Linux it is gnome-keyring. On Windows the password is encrypted and decrypted by the Data Protection API, and the encrypted password is stored in the file *.duplicacy/keyring*. However, if the -no-save-password option is specified for the storage, Duplicacy will not save passwords this way.
* If an environment variable for a password is provided, Duplicacy will always take it. The table below shows the name of the environment variable for each kind of password. Note that if the storage is not the default one, the storage name is included in the name of the environment variable.
* If a matching key and its value are saved to the preferences file (.duplicacy/preferences) by the *set* command, the value will be used as the password. The last column in the table below lists the name of the preference key for each type of password.
| password type | environment variable (default storage) | environment variable (non-default storage) | key in preferences |
|:----------------:|:----------------:|:----------------:|:----------------:|
| storage password | DUPLICACY_PASSWORD | DUPLICACY_<STORAGENAME>_PASSWORD | password |
| sftp password | DUPLICACY_SSH_PASSWORD | DUPLICACY_<STORAGENAME>_SSH_PASSWORD | ssh_password |
| Dropbox Token | DUPLICACY_DROPBOX_TOKEN | DUPLICACY_<STORAGENAME>_DROPBOX_TOKEN | dropbox_token |
| S3 Access ID | DUPLICACY_S3_ID | DUPLICACY_<STORAGENAME>_S3_ID | s3_id |
| S3 Secret Key | DUPLICACY_S3_KEY | DUPLICACY_<STORAGENAME>_S3_KEY | s3_key |
| BackBlaze Account ID | DUPLICACY_B2_ID | DUPLICACY_<STORAGENAME>_B2_ID | b2_id |
| BackBlaze Application Key | DUPLICACY_B2_KEY | DUPLICACY_<STORAGENAME>_B2_KEY | b2_key |
| Azure Access Key | DUPLICACY_AZURE_KEY | DUPLICACY_<STORAGENAME>_AZURE_KEY | azure_key |
Note that passwords stored in environment variables and the preferences file are in plaintext and thus insecure; these methods should be avoided whenever possible.
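As an illustration of the environment variable naming in the table, here is a sketch of a lookup that checks the environment and then the preferences file; the keychain/keyring branch is omitted, and the exact precedence among the three sources is as described above.
```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// envName builds the environment variable name from the table above:
// DUPLICACY_<KIND> for the default storage, and
// DUPLICACY_<STORAGENAME>_<KIND> (upper-cased) otherwise.
func envName(storage, kind string) string {
	if storage == "default" {
		return "DUPLICACY_" + kind
	}
	return "DUPLICACY_" + strings.ToUpper(storage) + "_" + kind
}

// lookupPassword checks the environment first and falls back to the
// preferences map in this sketch.
func lookupPassword(storage, kind, prefKey string, prefs map[string]string) (string, bool) {
	if v, ok := os.LookupEnv(envName(storage, kind)); ok {
		return v, true
	}
	v, ok := prefs[prefKey]
	return v, ok
}

func main() {
	fmt.Println(envName("default", "PASSWORD")) // DUPLICACY_PASSWORD
	fmt.Println(envName("remote", "S3_KEY"))    // DUPLICACY_REMOTE_S3_KEY
	pw, ok := lookupPassword("default", "SSH_PASSWORD", "ssh_password",
		map[string]string{"ssh_password": "example"})
	fmt.Println(pw, ok)
}
```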
## Scripts
You can instruct Duplicacy to run a script before or after executing a command. For example, if you create a bash script with the name *pre-prune* under the *.duplicacy/scripts* directory, this bash script will be run before the *prune* command starts. A script named *post-prune* will be run after the *prune* command finishes. This rule applies to all commands except *init*.

175
README.md

@@ -1,43 +1,196 @@
# Duplicacy: A new generation cloud backup tool
This repository contains only binary releases and documentation for Duplicacy. It also serves as an issue tracker for user-developer communication during the beta testing phase.
## Overview
Duplicacy supports major cloud storage providers (Amazon S3, Google Cloud Storage, Microsoft Azure, Dropbox, and BackBlaze) and offers all essential features of a modern backup tool:
* Incremental backup: only back up what has been changed
* Full snapshot : although each backup is incremental, it behaves like a full snapshot
* Deduplication: identical files must be stored as one copy (file-level deduplication), and identical parts from different files must be stored as one copy (block-level deduplication)
* Encryption: encrypt not only file contents but also file paths, sizes, times, etc.
* Deletion: every backup can be deleted independently without affecting others
* Concurrent access: multiple clients can back up to the same storage at the same time
The key idea behind Duplicacy is a concept called **Lock-Free Deduplication**, which can be summarized as follows:
* Use a variable-size chunking algorithm to split files into chunks
* Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without using a centralized indexing database
* Apply a *two-step fossil collection* algorithm to remove chunks that become unreferenced after a backup is deleted
The [design document](https://github.com/gilbertchen/duplicacy-beta/blob/master/DESIGN.md) explains lock-free deduplication in detail.
## Getting Started
During beta testing only binaries are available. Please visit the [releases page](https://github.com/gilbertchen/duplicacy-beta/releases/latest) to download and run the executable for your platform. Installation is not needed.
Once you have the Duplicacy executable under your path, you can change to the directory that you want to back up (called *repository*) and run the *init* command:
```
$ cd path/to/your/repository
$ duplicacy init mywork sftp://192.168.1.100/path/to/storage
```
This *init* command connects the repository with the remote storage at 192.168.1.100 via SFTP. It will initialize the remote storage if this has not been done before. It also assigns the snapshot id *mywork* to the repository. This snapshot id is used to uniquely identify this repository if other repositories also back up to the same storage.
You can now create snapshots of the repository by invoking the *backup* command. The first snapshot may take a while depending on the size of the repository and the upload bandwidth. Subsequent snapshots will be much faster, as only new or modified files will be uploaded. Each snapshot is identified by the snapshot id and an increasing revision number starting from 1.
```sh
$ duplicacy backup -stats
```
Duplicacy provides a set of commands, such as list, check, diff, cat, and history, to manage snapshots:
```makefile
$ duplicacy list # List all snapshots
$ duplicacy check # Check integrity of snapshots
$ duplicacy diff # Compare two snapshots, or the same file in two snapshots
$ duplicacy cat # Print a file in a snapshot
$ duplicacy history # Show how a file changes over time
```
The *restore* command rolls back the repository to a previous revision:
```sh
$ duplicacy restore -r 1
```
The *prune* command removes snapshots by revisions, tags, or retention policies:
```sh
$ duplicacy prune -r 1 # Remove the snapshot with revision number 1
$ duplicacy prune -t quick # Remove all snapshots with the tag 'quick'
$ duplicacy prune -keep 1:7 # Keep 1 snapshot per day for snapshots older than 7 days
$ duplicacy prune -keep 7:30 # Keep 1 snapshot every 7 days for snapshots older than 30 days
$ duplicacy prune -keep 0:180 # Remove all snapshots older than 180 days
```
The first time the *prune* command is called, it removes the specified snapshots but keeps all unreferenced chunks as fossils.
Since it uses the two-step fossil collection algorithm to clean chunks, you will need to run it again to remove those fossils from the storage:
```sh
$ duplicacy prune # Chunks from deleted snapshots will be removed if deletion criteria are met
```
To back up to multiple storages, use the *add* command to add a new storage. The *add* command is similar to the *init* command, except that the first argument is a storage name used to distinguish different storages:
```sh
$ duplicacy add s3 mywork s3://amazon.com/mybucket/path/to/storage
```
You can back up to any storage by specifying the storage name:
```sh
$ duplicacy backup -storage s3
```
However, snapshots created this way will be different on different storages if the repository has changed between the two backup operations. A better approach is to use the *copy* command to copy specified snapshots from one storage to another:
```sh
$ duplicacy copy -r 1 -to s3 # Copy snapshot at revision 1 to the s3 storage
$ duplicacy copy -to s3 # Copy every snapshot to the s3 storage
```
The [User Guide](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md) contains a complete reference to
all commands and other features of Duplicacy.
## Storages
Duplicacy currently supports local file storage, SFTP, and 5 cloud storage providers.
#### Local disk
```
Storage URL: /path/to/storage (on Linux or Mac OS X)
C:\path\to\storage (on Windows)
```
#### SFTP
```
Storage URL: sftp://username@server/path/to/storage
```
Login methods include password authentication and public key authentication. Due to a limitation of the underlying Go SSH library, the key pair for public key authentication must be generated without a passphrase. To work with a key that has a passphrase, you can set up SSH agent forwarding, which is also supported by Duplicacy.
#### Dropbox
```
Storage URL: dropbox://path/to/storage
```
For Duplicacy to access your Dropbox storage, you must provide an access token that can be obtained in one of two ways:
* Create your own app on the [Dropbox Developer](https://www.dropbox.com/developers) page, and then generate the [access token](https://blogs.dropbox.com/developers/2014/05/generate-an-access-token-for-your-own-account/)
* Or authorize Duplicacy to access its app folder inside your Dropbox (following [this link](https://dl.dropboxusercontent.com/u/95866350/start_dropbox_token.html)), and Dropbox will generate the access token (which is not visible to us, as the redirect page showing the token is merely a static html hosted by Dropbox)
Dropbox has two advantages over other cloud providers. First, if you are already a paid user, then using the unused space as backup storage is basically free. Second, unlike other providers, Dropbox does not charge API usage fees.
#### Amazon S3
```
Storage URL: s3://amazon.com/bucket/path/to/storage (default region is us-east-1)
s3://region@amazon.com/bucket/path/to/storage (other regions must be specified)
```
You'll need to input an access key and a secret key to access your Amazon S3 storage.
#### Google Cloud Storage
```
Storage URL: s3://storage.googleapis.com/bucket/path/to/storage
```
Duplicacy uses the s3 protocol to access Google Cloud Storage, so you must enable the [s3 interoperability](https://cloud.google.com/storage/docs/migrating#migration-simple) in your Google Cloud Storage settings.
#### Microsoft Azure
```
Storage URL: azure://account/container
```
You'll need to input the access key when prompted.
#### BackBlaze
```
Storage URL: b2://bucket
```
You'll need to input the account id and application key.
BackBlaze offers perhaps the least expensive cloud storage at 0.5 cent per GB per month. Unfortunately their API does not support file renaming, so the -exclusive option is required when pruning old backups. This means concurrent access and deletion can't be permitted at the same time.
## Comparison with Other Backup Tools
[duplicity](http://duplicity.nongnu.org) works by applying the rsync algorithm (or, more specifically, the [librsync](https://github.com/librsync/librsync) library)
to find the differences from previous backups and then uploading only the differences. It is the only existing backup tool with extensive cloud support -- the [long list](http://duplicity.nongnu.org/duplicity.1.html#sect7) of storage backends covers almost every cloud provider one can think of. However, duplicity's biggest flaw lies in its incremental model -- a chain of dependent backups starts with a full backup followed by a number of incremental ones, and ends when another full backup is uploaded. Deleting one backup will render useless all the subsequent backups on the same chain. Periodic full backups are required in order to make previous backups disposable.
[bup](https://github.com/bup/bup) also uses librsync to split files into chunks but saves chunks in the git packfile format. It supports neither cloud storage nor deletion of old backups.
[Obnam](http://obnam.org) got the incremental backup model right in the sense that every incremental backup is actually a full snapshot. Although Obnam also splits files into chunks, it does not adopt either the rsync algorithm or the variable-size chunking algorithm. As a result, deletions or insertions of a few bytes will foil the
[deduplication](http://obnam.org/faq/dedup).
Deletion of old backups is possible, but no cloud storages are supported.
Multiple clients can back up to the same storage, but only sequential access is granted by the [locking on-disk data structures](http://obnam.org/locking/).
It is unclear if the lack of cloud backends is due to difficulties in porting the locking data structures to cloud storage APIs.
[Attic](https://attic-backup.org) has been acclaimed by some as the [Holy Grail of backups](https://www.stavros.io/posts/holy-grail-backups). It follows the same incremental backup model as Obnam, but embraces the variable-size chunking algorithm for better performance and better deduplication. Deletion of old backups is also supported. However, no cloud backends are implemented, as in Obnam. Although concurrent backups from multiple clients to the same storage are in theory possible through the use of locking, this is
[not recommended](http://librelist.com/browser//attic/2014/11/11/backing-up-multiple-servers-into-a-single-repository/#e96345aa5a3469a87786675d65da492b) by the developer due to chunk indices being kept in a local cache.
Concurrent access is not only a convenience; it is a necessity for better deduplication. For instance, if multiple machines with the same OS installed can back up their entire drives to the same storage, only one copy of the system files needs to be stored, greatly reducing the storage space regardless of the number of machines. Attic still adopts the traditional approach of using a centralized indexing database to manage chunks, and relies heavily on caching to improve performance. The presence of exclusive locking makes it hard to be adapted for cloud storage APIs and reduces the level of deduplication.
[restic](https://restic.github.io) is a more recent addition. It is worth mentioning here because, like Duplicacy, it is written in Go. It uses a format similar to the git packfile format, but not exactly the same. Multiple clients backing up to the same storage are still guarded by
[locks](https://github.com/restic/restic/blob/master/doc/Design.md#locks).
A command to delete old backups is in the developer's [plan](https://github.com/restic/restic/issues/18). S3 storage is supported, although it is unclear how hard it is to support other cloud storage APIs because of the need for locking. Overall, it still falls in the same category as Attic. Whether it will eventually reach the same level as Attic remains to be seen.
The following table compares the feature lists of all these backup tools:
| Tool | Incremental Backup | Full Snapshot | Deduplication | Encryption | Deletion | Concurrent Access |Cloud Support |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| duplicity | Yes | No | Weak | Yes | No | No | Extensive |
| bup | Yes | Yes | Yes | Yes | No | No | No |
| Obnam | Yes | Yes | Weak | Yes | Yes | Exclusive locking | No |
| Attic | Yes | Yes | Yes | Yes | Yes | Not recommended | No |
| restic | Yes | Yes | Yes | Yes | No | Exclusive locking | S3 only |
| **Duplicacy** | **Yes** | **Yes** | **Yes** | **Yes** | **Yes** | **Lock-free** | **S3, GCS, Azure, Dropbox, BackBlaze** |

Binary file not shown (new image, 29 KiB)

Binary file not shown (new image, 29 KiB)

Binary file not shown (new image, 28 KiB)