## Lock-Free Deduplication

The three elements of lock-free deduplication are:

* Use a variable-size chunking algorithm to split files into chunks
* Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without using a centralized indexing database
* Apply a *two-step fossil collection* algorithm to remove chunks that become unreferenced after a backup is deleted

The variable-size chunking algorithm, also called Content-Defined Chunking, is well known and has been adopted by many backup tools. Its main advantage over the fixed-size chunking algorithm (as used by rsync) is that the rolling hash is only used to search for boundaries between chunks, after which a far more collision-resistant hash function such as MD5 or SHA-256 is applied to each chunk. In contrast, the fixed-size chunking algorithm must perform a lookup in the known-hash table every time the rolling hash window is shifted by one byte in order to detect insertions or deletions, which significantly reduces chunk-splitting performance.
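
To make the chunking step concrete, here is a minimal, self-contained sketch of content-defined chunking in Go. The rolling hash, window size, and boundary mask below are illustrative placeholders, not Duplicacy's actual parameters; only the overall structure (a cheap rolling hash to find boundaries, a strong hash per chunk) follows the description above.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Illustrative parameters only; Duplicacy's real chunker uses different ones.
const (
	windowSize = 48
	boundary   = 0x1FFF // declare a boundary when hash&boundary == 0 (~8 KiB average)
	minChunk   = 2048
	maxChunk   = 65536
)

// split finds chunk boundaries with a toy rolling hash and then identifies
// each chunk with a strong hash (SHA-256 here).
func split(data []byte) [][32]byte {
	var hashes [][32]byte
	var rolling uint32
	start := 0
	for i := range data {
		rolling += uint32(data[i]) // toy additive rolling hash
		if i-start >= windowSize {
			rolling -= uint32(data[i-windowSize]) // drop the byte leaving the window
		}
		size := i - start + 1
		atBoundary := size >= minChunk && rolling&boundary == 0
		if atBoundary || size >= maxChunk || i == len(data)-1 {
			hashes = append(hashes, sha256.Sum256(data[start:i+1])) // strong hash per chunk
			start = i + 1
			rolling = 0
		}
	}
	return hashes
}

func main() {
	data := make([]byte, 200000)
	for i := range data {
		data[i] = byte(i * 31)
	}
	for _, h := range split(data) {
		fmt.Printf("%x\n", h[:8])
	}
}
```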

What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing chunks and for determining which chunks are no longer needed. Instead, to check whether a chunk has already been uploaded, one can simply perform a file lookup via the file storage API using the file name derived from the hash of the chunk. This effectively turns a cloud storage offering only a very limited set of basic file operations into a powerful modern backup backend capable of both block-level and file-level deduplication. More importantly, the absence of a centralized indexing database means that there is no need to implement a distributed locking mechanism on top of the file storage.
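
The lookup-before-upload idea can be sketched as follows. The `Storage` interface, the `chunks/` path prefix, and the function names are hypothetical, introduced only for illustration; they are not Duplicacy's actual API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Storage is a hypothetical minimal backend interface: the only operations
// needed for deduplication are an existence check and an upload.
type Storage interface {
	FileExists(path string) bool
	UploadFile(path string, data []byte) error
}

// saveChunk uploads a chunk only if a file with the hash-derived name is not
// already present, so identical chunks (even from different clients) are
// stored exactly once.
func saveChunk(s Storage, chunk []byte) (string, error) {
	sum := sha256.Sum256(chunk)
	name := hex.EncodeToString(sum[:])
	path := "chunks/" + name
	if s.FileExists(path) {
		return name, nil // already uploaded by this or another client
	}
	return name, s.UploadFile(path, chunk)
}

// memStorage is an in-memory stand-in used only to make the sketch runnable.
type memStorage map[string][]byte

func (m memStorage) FileExists(p string) bool            { return m[p] != nil }
func (m memStorage) UploadFile(p string, d []byte) error { m[p] = d; return nil }

func main() {
	s := memStorage{}
	name, _ := saveChunk(s, []byte("example chunk"))
	fmt.Println("stored as", name)
}
```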

By eliminating the chunk indexing database, lock-free deduplication not only reduces code complexity but also makes deduplication less error-prone. Each chunk is saved individually in its own file, and once saved it never needs to be modified. Data corruption is therefore less likely to occur because of the immutability of chunk files. Another benefit that comes naturally from lock-free deduplication is that when one client creates a new chunk, other clients that happen to have the same original file will notice that the chunk already exists and therefore will not upload it again. This pushes deduplication to its highest level -- clients without knowledge of each other can share identical chunks with no extra effort.

There is one problem, though. Deleting snapshots without an indexing database, when concurrent access is permitted, turns out to be hard. If exclusive access to the file storage by a single client can be guaranteed, the deletion procedure can simply search for chunks not referenced by any backup and delete them. However, if concurrent access is required, an unreferenced chunk cannot be trivially removed, because a backup procedure in progress may reference that same chunk. The ongoing backup procedure, still unknown to the deletion procedure, may have already encountered the chunk during its file scanning phase, but decided not to upload it again since it already exists in the file storage.

Fortunately, there is a solution that addresses the deletion problem and makes lock-free deduplication practical: a *two-step fossil collection* algorithm that deletes unreferenced chunks in two steps -- identify and collect them in the first step, then permanently remove them once certain conditions are met.

## Two-Step Fossil Collection

Interestingly, the two-step fossil collection algorithm hinges on a basic file operation supported almost universally: *file renaming*. When the deletion procedure identifies a chunk not referenced by any known snapshot, instead of deleting the chunk file immediately, it changes the name of the chunk file (and possibly moves it to a different directory). A chunk that has been renamed is called a *fossil*.

The fossil still exists in the file storage. Two rules are enforced regarding access to fossils:

* A restore, list, or check procedure that reads existing backups can read the fossil if the original chunk cannot be found.
* A backup procedure does not check the existence of a fossil. That is, it must upload a chunk if it cannot find the chunk, even if an equivalent fossil exists.

In the first step of the deletion procedure, called the *fossil collection* step, the names of all identified fossils are saved in a fossil collection file. The deletion procedure then exits without performing further actions. Because of the first fossil access rule, this step does not effectively change any chunk references. If a backup procedure references a chunk after it has been marked as a fossil, a new chunk will be uploaded because of the second fossil access rule, as shown in Figure 1.
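
A minimal sketch of the fossil collection step follows, assuming a hypothetical storage interface with a rename operation; the directory names and the handling of the fossil collection file are illustrative only.

```go
package main

import "fmt"

// FossilStore is a hypothetical minimal view of the storage backend for the
// fossil collection step: the only operation needed is a rename.
type FossilStore interface {
	RenameFile(from, to string) error
}

// collectFossils renames each unreferenced chunk into a fossil and returns the
// fossil names, which the deletion procedure would then record in a fossil
// collection file together with the snapshots it has seen.
func collectFossils(s FossilStore, unreferenced []string) ([]string, error) {
	var fossils []string
	for _, name := range unreferenced {
		// Renaming (not deleting) is what turns a chunk into a fossil; a
		// restore can still fall back to it, while a backup will ignore it.
		if err := s.RenameFile("chunks/"+name, "fossils/"+name); err != nil {
			return fossils, err
		}
		fossils = append(fossils, name)
	}
	return fossils, nil
}

// logStore just prints the renames so the sketch is runnable.
type logStore struct{}

func (logStore) RenameFile(from, to string) error {
	fmt.Println("rename", from, "->", to)
	return nil
}

func main() {
	collectFossils(logStore{}, []string{"9f25db00", "6e903aac"})
}
```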

<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_1.png?raw=true"
alt="Reference after Rename"/>
</p>

The second step, called the *fossil deletion* step, permanently deletes fossils, but only when two conditions are met:

* For each snapshot id, there is a new snapshot that was not seen by the fossil collection step
* The new snapshot must finish after the fossil collection step

The first condition guarantees that if a backup procedure references a chunk before the deletion procedure turns it into a fossil, the reference will be detected in the fossil deletion step, which will then turn the fossil back into a normal chunk.

The second condition guarantees that any backup procedure unknown to the fossil deletion step can only have started after the fossil collection step finished. Therefore, if it references a chunk that was identified as a fossil in the fossil collection step, it will observe the fossil, not the chunk, and will upload a new chunk, according to the second fossil access rule.

Therefore, if a backup procedure references a chunk before the chunk is marked as a fossil, the fossil deletion step will not delete the chunk until it sees that the backup procedure has finished (as indicated by the appearance of a new snapshot file uploaded to the storage). This ensures that the scenario depicted in Figure 2 will never happen.
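
The two conditions can be sketched as a simple check. The data structures below are illustrative assumptions, not Duplicacy's actual fossil collection file format.

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot records just enough to evaluate the two fossil-deletion
// conditions; the fields are illustrative, not Duplicacy's actual format.
type Snapshot struct {
	ID       string // snapshot id of the repository
	Revision int
	EndTime  time.Time
}

// fossilsDeletable checks the two conditions described above: for every
// snapshot id, there must be a snapshot newer than the latest one seen at
// fossil collection time, and that snapshot must have finished after the
// fossil collection step.
func fossilsDeletable(collectionTime time.Time, seenAtCollection map[string]int, latest []Snapshot) bool {
	newer := map[string]bool{}
	for _, s := range latest {
		if s.Revision > seenAtCollection[s.ID] && s.EndTime.After(collectionTime) {
			newer[s.ID] = true
		}
	}
	for id := range seenAtCollection {
		if !newer[id] {
			return false // this repository has not backed up since collection
		}
	}
	return true
}

func main() {
	collected := time.Now().Add(-24 * time.Hour)
	seen := map[string]int{"host1": 260, "host2": 97}
	latest := []Snapshot{
		{"host1", 261, time.Now()},
		{"host2", 98, time.Now()},
	}
	fmt.Println("deletable:", fossilsDeletable(collected, seen, latest))
}
```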

<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_2.png?raw=true"
alt="Reference before Rename"/>
</p>

## Snapshot Format

A snapshot file is a file that the backup procedure uploads to the file storage after it finishes splitting files into chunks and uploading all new chunks. It mainly contains metadata for the backup as a whole, metadata for all the files, and the chunk references for each file. Here is an example snapshot file for a repository containing 3 files (file1, file2, and dir1/file3):

```json
{
  "id": "host1",
  "revision": 1,
  "tag": "first",
  "start_time": 1455590487,
  "end_time": 1455590487,
  "files": [
    {
      "path": "file1",
      "content": "0:0:2:6108",
      "hash": "a533c0398194f93b90bd945381ea4f2adb0ad50bd99fd3585b9ec809da395b51",
      "size": 151901,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "file2",
      "content": "2:6108:3:7586",
      "hash": "f6111c1562fde4df9c0bafe2cf665778c6e25b49bcab5fec63675571293ed644",
      "size": 172071,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "dir1/",
      "size": 102,
      "time": 1455590487,
      "mode": 2147484096
    },
    {
      "path": "dir1/file3",
      "content": "3:7586:4:1734",
      "hash": "6bf9150424169006388146908d83d07de413de05d1809884c38011b2a74d9d3f",
      "size": 118457,
      "time": 1455590487,
      "mode": 420
    }
  ],
  "chunks": [
    "9f25db00881a10a8e7bcaa5a12b2659c2358a579118ea45a73c2582681f12919",
    "6e903aace6cd05e26212fcec1939bb951611c4179c926351f3b20365ef2c212f",
    "4b0d017bce5491dbb0558c518734429ec19b8a0d7c616f68ddf1b477916621f7",
    "41841c98800d3b9faa01b1007d1afaf702000da182df89793c327f88a9aba698",
    "7c11ee13ea32e9bb21a694c5418658b39e8894bbfecd9344927020a9e3129718"
  ],
  "lengths": [
    64638,
    81155,
    170593,
    124309,
    1734
  ]
}
```

When Duplicacy splits a file into chunks using the variable-size chunking algorithm, if the end of a file is reached and the boundary marker terminating a chunk has not yet been found, the next file, if there is one, is read in and the chunking algorithm continues. It is as if all files were packed into one big tar file which is then split into chunks.

The *content* field of a file indicates the indexes of its starting and ending chunks and the corresponding offsets. For instance, *file1* starts at chunk 0, offset 0, and ends at chunk 2, offset 6108, immediately followed by *file2*.
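
A small sketch of decoding the *content* field as shown in the example above; the struct and function names are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ChunkSpan is an illustrative decoding of the "content" field, which packs
// "start_chunk:start_offset:end_chunk:end_offset" into one string.
type ChunkSpan struct {
	StartChunk, StartOffset int
	EndChunk, EndOffset     int
}

func parseContent(content string) (ChunkSpan, error) {
	parts := strings.Split(content, ":")
	if len(parts) != 4 {
		return ChunkSpan{}, fmt.Errorf("malformed content field: %q", content)
	}
	nums := make([]int, 4)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return ChunkSpan{}, err
		}
		nums[i] = n
	}
	return ChunkSpan{nums[0], nums[1], nums[2], nums[3]}, nil
}

func main() {
	// file1 from the example snapshot: starts at chunk 0 offset 0 and ends
	// at chunk 2 offset 6108.
	span, err := parseContent("0:0:2:6108")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", span)
}
```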

The backup procedure can run in one of two modes. In the default quick mode, only modified or new files are scanned. Chunks referenced only by old files that have been modified are removed from the chunk sequence, and then chunks referenced by new files are appended. Indices for unchanged files need to be updated too.

In the safe mode (enabled by the -hash option), all files are scanned and the chunk sequence is regenerated.

The length sequence stores the lengths of all chunks, which are needed when calculating statistics such as the total length of chunks. For a repository containing a large number of files, the snapshot file can become huge. To make matters worse, this big snapshot file would have to be uploaded for every backup, even if only a few files have changed since the last backup. To save space, the variable-size chunking algorithm is therefore also applied to the three dynamic fields of a snapshot file: *files*, *chunks*, and *lengths*.

Chunks produced during this step are deduplicated and uploaded in the same way as regular file chunks. The final snapshot file contains sequences of chunk hashes and other fixed-size fields:

```json
{
  "id": "host1",
  "revision": 1,
  "start_time": 1455590487,
  "tag": "first",
  "end_time": 1455590487,
  "file_sequence": [
    "21e4c69f3832e32349f653f31f13cefc7c52d52f5f3417ae21f2ef5a479c3437"
  ],
  "chunk_sequence": [
    "8a36ffb8f4959394fd39bba4f4a464545ff3dd6eed642ad4ccaa522253f2d5d6"
  ],
  "length_sequence": [
    "fc2758ae60a441c244dae05f035136e6dd33d3f3a0c5eb4b9025a9bed1d0c328"
  ]
}
```

In the extreme case where the repository has not been modified since the last backup, a new backup procedure will not create any new chunks, as shown by the following output from a real use case:

```
$ duplicacy backup -stats
Storage set to sftp://gchen@192.168.1.100/Duplicacy
Last backup at revision 260 found
Backup for /Users/gchen/duplicacy at revision 261 completed
Files: 42367 total, 2,204M bytes; 0 new, 0 bytes
File chunks: 447 total, 2,238M bytes; 0 new, 0 bytes, 0 bytes uploaded
Metadata chunks: 6 total, 11,753K bytes; 0 new, 0 bytes, 0 bytes uploaded
All chunks: 453 total, 2,249M bytes; 0 new, 0 bytes, 0 bytes uploaded
Total running time: 00:00:05
```

## Encryption

When encryption is enabled (by the -e option with the *init* or *add* command), Duplicacy will generate four random 256-bit keys:

* *Hash Key*: for generating a chunk hash from the content of a chunk
* *ID Key*: for generating a chunk id from a chunk hash
* *Chunk Key*: for encrypting chunk files
* *File Key*: for encrypting non-chunk files such as snapshot files

Here is a diagram showing how these keys are used:

<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/duplicacy_encryption.png?raw=true"
alt="encryption"/>
</p>

Chunk hashes are used internally and stored in the snapshot file. They are never exposed unless the snapshot file is decrypted. Chunk ids are used as the file names for the chunks and are therefore exposed. When the *cat* command is used to print out a snapshot file, the chunk hashes stored in the snapshot file are first converted into chunk ids, which are then displayed instead.

Chunk content is encrypted with AES-GCM, using an encryption key that is the HMAC-SHA256 of the chunk hash with the *Chunk Key* as the secret key.
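
A minimal sketch of this construction using Go's standard crypto packages. The nonce handling and helper names are assumptions; the document only specifies the HMAC-SHA256 key derivation and the use of AES-GCM.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// encryptChunk derives the per-chunk encryption key as HMAC-SHA256 of the
// chunk hash keyed with the Chunk Key, then seals the chunk with AES-GCM.
// The nonce handling here is illustrative; the document does not specify it.
func encryptChunk(chunkKey, chunkHash, content []byte) ([]byte, error) {
	mac := hmac.New(sha256.New, chunkKey)
	mac.Write(chunkHash)
	encryptionKey := mac.Sum(nil) // 32 bytes -> AES-256

	block, err := aes.NewCipher(encryptionKey)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so the decrypting side can recover it.
	return gcm.Seal(nonce, nonce, content, nil), nil
}

func main() {
	chunkKey := make([]byte, 32) // stand-in for the randomly generated Chunk Key
	chunkHash := sha256.Sum256([]byte("chunk content"))
	sealed, err := encryptChunk(chunkKey, chunkHash[:], []byte("chunk content"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("encrypted %d bytes\n", len(sealed))
}
```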

The snapshot file is also encrypted with AES-GCM, using an encryption key that is the HMAC-SHA256 of the file path with the *File Key* as the secret key.

These four random keys are saved in a file named *config* in the file storage, encrypted with a master key derived by applying the PBKDF2 function to the storage password chosen by the user.
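
A sketch of the master key derivation, assuming SHA-256 as the underlying hash and an arbitrary iteration count and salt; none of these parameters are specified in this document.

```go
package main

import (
	"crypto/sha256"
	"fmt"

	"golang.org/x/crypto/pbkdf2"
)

// deriveMasterKey derives a 256-bit master key from the storage password
// with PBKDF2. The salt, iteration count, and hash function shown here are
// illustrative assumptions; the document only says PBKDF2 is used.
func deriveMasterKey(password, salt []byte) []byte {
	const iterations = 16384
	return pbkdf2.Key(password, salt, iterations, 32, sha256.New)
}

func main() {
	key := deriveMasterKey([]byte("storage password"), []byte("per-storage salt"))
	fmt.Printf("master key: %x\n", key)
}
```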

# Duplicacy User Guide

## Commands

#### Init

```
SYNOPSIS:
   duplicacy init - Initialize the storage if necessary and the current directory as the repository

USAGE:
   duplicacy init [command options] <snapshot id> <storage url>

OPTIONS:
   -encrypt, -e                    encrypt the storage with a password
   -chunk-size, -c 4M              the average size of chunks
   -max-chunk-size, -max 16M       the maximum size of chunks (defaults to chunk-size * 4)
   -min-chunk-size, -min 1M        the minimum size of chunks (defaults to chunk-size / 4)
   -compression-level, -l <level>  compression level (defaults to -1)
```

The *init* command first connects to the storage specified by the storage URL. If the storage has already been initialized, it will download the storage configuration (stored in the file named *config*) and ignore the options provided on the command line. Otherwise, it will create the configuration file from the options and upload it.

The initialized storage then becomes the default storage for other commands if the -storage option is not specified for those commands. This default storage actually has a name, *default*.

After that, the command prepares the current working directory as the repository to be backed up. Under the hood, it creates a directory named *.duplicacy* in the repository and puts there a file named *preferences* that stores the snapshot id and the encryption and storage options.

The snapshot id is an id used to distinguish different repositories connected to the same storage. Each repository must have a unique snapshot id.

The -e option controls whether or not encryption is enabled for the storage. If encryption is enabled, you will be prompted to enter a storage password.

The three chunk size parameters are passed to the variable-size chunking algorithm. Their values are important to the overall performance, especially for cloud storages. If the chunk size is too small, a lot of overhead is spent on sending requests and receiving responses. If the chunk size is too large, the effect of deduplication is less obvious, as more data needs to be transferred with each chunk.

The compression level parameter is passed to the zlib library. Valid values are -1 through 9, with 0 meaning no compression, 9 the best (and slowest) compression, and -1 the default (equivalent to level 6).

Once a storage has been initialized with these parameters, they cannot be modified any more.

#### Backup

```
SYNOPSIS:
   duplicacy backup - Save a snapshot of the repository to the storage

USAGE:
   duplicacy backup [command options]

OPTIONS:
   -hash                    detect file differences by hash (rather than size and timestamp)
   -t <tag>                 assign a tag to the backup
   -stats                   show statistics during and after backup
   -vss                     enable the Volume Shadow Copy service (Windows only)
   -storage <storage name>  backup to the specified storage instead of the default one
```

The *backup* command creates a snapshot of the repository and uploads it to the storage. If -hash is not provided, it uploads only files that are new or modified since the last backup, as determined by comparing file sizes and timestamps. Otherwise, every file is scanned to detect changes.

You can assign a tag to the snapshot so that later you can refer to it by tag in other commands.

If the -stats option is specified, statistical information such as the transfer speed and the number of chunks will be displayed throughout the backup procedure.

The -vss option works on Windows only. It turns on the Volume Shadow Copy service so that files opened by other processes with exclusive locks can be read as usual.

When the repository has multiple storages (added by the *add* command), you can select the storage to back up to by giving a storage name.

You can specify patterns to include/exclude files by putting them in a file named *.duplicacy/filters*. Please refer to the [Include/Exclude Patterns](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#includeexclude-patterns) section for how to specify the patterns.

#### Restore

```
SYNOPSIS:
   duplicacy restore - Restore the repository to a previously saved snapshot

USAGE:
   duplicacy restore [command options] [--] [pattern] ...

OPTIONS:
   -r <revision>            the revision number of the snapshot (required)
   -hash                    detect file differences by hash (rather than size and timestamp)
   -overwrite               overwrite existing files in the repository
   -delete                  delete files not in the snapshot
   -stats                   show statistics during and after restore
   -storage <storage name>  restore from the specified storage instead of the default one
```

The *restore* command restores the repository to a previous revision. By default the restore procedure treats files that have the same sizes and timestamps as those in the snapshot as unchanged, but with the -hash option every file is fully scanned to make sure it is in fact unchanged.

By default the restore procedure will not overwrite existing files, unless the -overwrite option is specified.

The -delete option indicates that files not in the snapshot should be removed.

If the -stats option is specified, statistical information such as the transfer speed and the number of chunks will be displayed throughout the restore procedure.

When the repository has multiple storages (added by the *add* command), you can select the storage to restore from by specifying the storage name.

Unlike the *backup* procedure, which reads the include/exclude patterns from a file, the *restore* procedure reads them from the command line. If the patterns could confuse the command line argument parser, -- should be prepended to the patterns. Please refer to the [Include/Exclude Patterns](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#includeexclude-patterns) section for how to specify patterns.

#### List

```
SYNOPSIS:
   duplicacy list - List snapshots

USAGE:
   duplicacy list [command options]

OPTIONS:
   -all, -a                 list snapshots with any id
   -id <snapshot id>        list snapshots with the specified id rather than the default one
   -r <revision> [+]        the revision number of the snapshot
   -t <tag>                 list snapshots with the specified tag
   -files                   print the file list in each snapshot
   -chunks                  print chunks in each snapshot, or all chunks if no snapshot is specified
   -reset-password          take passwords from input rather than keychain/keyring or env
   -storage <storage name>  retrieve snapshots from the specified storage
```

The *list* command lists information about specified snapshots. By default it lists snapshots created from the current repository, but you can list all snapshots stored in the storage by specifying the -all option, list snapshots with a different snapshot id using the -id option, and/or list snapshots with a particular tag with the -t option.

The revision number is a number assigned to a snapshot when it is created. This number keeps increasing every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using the -r option, which takes either a single revision number (-r 123) or a range (-r 123-456). There can be multiple -r options.

If -files is specified, for each snapshot to be listed, this command also prints information about every file contained in the snapshot.

If -chunks is specified, the command also prints out every chunk the snapshot references.

The -reset-password option is used to reset stored passwords and to allow passwords to be entered again. Please refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more information.

When the repository has multiple storages (added by the *add* command), you can specify the storage to list by giving the storage name.

#### Check

```
SYNOPSIS:
   duplicacy check - Check the integrity of snapshots

USAGE:
   duplicacy check [command options]

OPTIONS:
   -all, -a                 check snapshots with any id
   -id <snapshot id>        check snapshots with the specified id rather than the default one
   -r <revision> [+]        the revision number of the snapshot
   -t <tag>                 check snapshots with the specified tag
   -fossils                 search fossils if a chunk can't be found
   -resurrect               turn referenced fossils back into chunks
   -files                   verify the integrity of every file
   -storage <storage name>  retrieve snapshots from the specified storage
```

The *check* command checks, for each specified snapshot, that all referenced chunks exist in the storage.

By default the *check* command checks snapshots created from the current repository, but you can check all snapshots stored in the storage at once by specifying the -all option, check snapshots from a different repository using the -id option, and/or check snapshots with a particular tag with the -t option.

The revision number is a number assigned to a snapshot when it is created. This number keeps increasing every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using the -r option, which takes either a single revision number (-r 123) or a range (-r 123-456). There can be multiple -r options.

By default the *check* command only verifies the existence of chunks. To verify the full integrity of a snapshot, you should specify the -files option, which downloads chunks and computes file hashes in memory to make sure that all hashes match.

By default the *check* command does not look for fossils. If the -fossils option is specified, it will look for the corresponding fossil when a referenced chunk does not exist. If the -resurrect option is specified, it will turn the fossil back into a chunk.

When the repository has multiple storages (added by the *add* command), you can specify the storage to check by giving the storage name.

#### Cat

```
SYNOPSIS:
   duplicacy cat - Print to stdout the specified file, or the snapshot content if no file is specified

USAGE:
   duplicacy cat [command options] [<file>]

OPTIONS:
   -id <snapshot id>        retrieve from the snapshot with the specified id
   -r <revision>            the revision number of the snapshot
   -storage <storage name>  retrieve the file from the specified storage
```

The *cat* command prints a file, or the entire snapshot content if no file is specified.

The file must be specified with a path relative to the repository.

You can specify a different snapshot id rather than the default id.

The -r option is optional. If it is not specified, the latest revision is selected.

You can use the -storage option to select a storage other than the default one.

#### Diff

```
SYNOPSIS:
   duplicacy diff - Compare two snapshots or two revisions of a file

USAGE:
   duplicacy diff [command options] [<file>]

OPTIONS:
   -id <snapshot id>        diff with the snapshot with the specified id
   -r <revision> [+]        the revision number of the snapshot
   -hash                    compute the hashes of on-disk files
   -storage <storage name>  retrieve files from the specified storage
```

The *diff* command compares the same file in two different snapshots if a file is given; otherwise it compares the two snapshots.

The file must be specified with a path relative to the repository.

You can specify a different snapshot id rather than the default snapshot id.

If only one revision is given by -r, the right-hand side of the comparison is the on-disk file. The -hash option can then instruct this command to compute the hash of that file.

You can use the -storage option to select a storage other than the default one.

#### History

```
SYNOPSIS:
   duplicacy history - Show the history of a file

USAGE:
   duplicacy history [command options] <file>

OPTIONS:
   -id <snapshot id>        find the file in the snapshot with the specified id
   -r <revision> [+]        show history of the specified revisions
   -hash                    show the hash of the on-disk file
   -storage <storage name>  retrieve files from the specified storage
```

The *history* command shows how the hash, size, and timestamp of a file change over the specified set of revisions.

You can specify a different snapshot id rather than the default snapshot id, and use multiple -r options to specify the set of revisions.

The -hash option computes the hash of the on-disk file. Otherwise, only the size and timestamp of the on-disk file are included.

You can use the -storage option to select a storage other than the default one.

#### Prune

```
SYNOPSIS:
   duplicacy prune - Prune snapshots by revision, tag, or retention policy

USAGE:
   duplicacy prune [command options]

OPTIONS:
   -id <snapshot id>        delete snapshots with the specified id instead of the default one
   -all, -a                 match against all snapshot ids
   -r <revision> [+]        delete snapshots with the specified revisions
   -t <tag> [+]             delete snapshots with the specified tags
   -keep <n:m> [+]          keep 1 snapshot every n days for snapshots older than m days
   -exhaustive              find all unreferenced chunks by scanning the storage
   -exclusive               assume exclusive access to the storage (disable two-step fossil collection)
   -dry-run, -d             show what would have been deleted
   -delete-only             delete fossils previously collected (if deletable) and don't collect fossils
   -collect-only            identify and collect fossils, but don't delete fossils previously collected
   -ignore <id> [+]         ignore the specified snapshot id when deciding if fossils can be deleted
   -storage <storage name>  prune snapshots from the specified storage
```

The *prune* command implements the two-step fossil collection algorithm. It first finds fossil collection files from previous runs and checks whether the contained fossils are eligible for permanent deletion (the fossil deletion step). Then it searches for snapshots to be deleted, marks unreferenced chunks as fossils (by renaming them), and saves them in a new fossil collection file stored locally (the fossil collection step).

If a snapshot id is specified, that snapshot id will be used instead of the default one. The -a option matches snapshots with any id. Snapshots to be deleted can be specified by revision numbers, by a tag, by retention policies, or by any combination of these.

Retention policies are specified by the -keep option, which accepts an argument in the form of two numbers *n:m*, where *n* indicates the number of days between two consecutive snapshots to keep, and *m* means that the policy only applies to snapshots at least *m* days old. If *n* is zero, any snapshot older than *m* days will be removed.

Here are a few sample retention policies:

```sh
$ duplicacy prune -keep 1:7       # Keep 1 snapshot per day for snapshots older than 7 days
$ duplicacy prune -keep 7:30      # Keep 1 snapshot every 7 days for snapshots older than 30 days
$ duplicacy prune -keep 30:180    # Keep 1 snapshot every 30 days for snapshots older than 180 days
$ duplicacy prune -keep 0:360     # Keep no snapshots older than 360 days
```

Multiple -keep options must be sorted by their *m* values in decreasing order. For instance, to combine the above policies into one line, it would become:

```sh
$ duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
```
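
To illustrate the stated semantics, here is a sketch that decides which snapshots a set of -keep policies would retain. It is an illustration of the rules described above, not Duplicacy's actual pruning code.

```go
package main

import "fmt"

// Policy mirrors one -keep n:m option: for snapshots older than M days,
// keep one snapshot every N days (N == 0 means keep none).
type Policy struct{ N, M int }

// prunePlan reports, for each snapshot age (in days, ordered from oldest to
// newest), whether the policies would keep it. Policies are assumed sorted
// by decreasing M, as the guide requires.
func prunePlan(ages []int, policies []Policy) []bool {
	keep := make([]bool, len(ages))
	lastKept := 1 << 30 // age of the most recent snapshot kept so far
	for i, age := range ages {
		n := -1 // -1: no policy applies, keep unconditionally
		for _, p := range policies {
			if age > p.M {
				n = p.N
				break
			}
		}
		switch {
		case n == -1:
			keep[i] = true
		case n == 0:
			keep[i] = false // older than M days and nothing should be kept
		default:
			keep[i] = lastKept-age >= n // keep one snapshot every N days
		}
		if keep[i] {
			lastKept = age
		}
	}
	return keep
}

func main() {
	policies := []Policy{{0, 360}, {30, 180}, {7, 30}, {1, 7}}
	ages := []int{400, 200, 190, 100, 20, 3}
	fmt.Println(prunePlan(ages, policies))
}
```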

The -exhaustive option scans the list of all chunks in the storage, so it will find not only unreferenced chunks from deleted snapshots, but also chunks that became unreferenced for other reasons, such as those left behind by an incomplete backup. It will also find any file that does not look like a chunk file. In contrast, a default *prune* command only identifies chunks referenced by deleted snapshots but not by any other snapshot.

The -exclusive option assumes that no other clients are accessing the storage, effectively disabling the *two-step fossil collection* algorithm. With this option, the *prune* command immediately removes unreferenced chunks.

The -dry-run option is used to test what changes the *prune* command would have made. It is guaranteed not to make any changes to the storage, not even creating the local fossil collection file. The following command checks whether the chunk directory is clean (i.e., whether there are any unreferenced chunks, temporary files, or anything else):

```
$ duplicacy prune -d -exclusive -exhaustive  # Prints out nothing if the chunk directory is clean
```

The -delete-only option skips the fossil collection step, while the -collect-only option skips the fossil deletion step.

For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least one new snapshot from *each* snapshot id must be created between two runs of the *prune* command. However, some repositories may not back up on a regular schedule, which would block other repositories from deleting any fossils. Duplicacy by default ignores repositories that have no new backup in the past 7 days. It also provides an -ignore option that can be used to skip certain repositories when deciding whether fossils can be deleted.

You can use the -storage option to select a storage other than the default one.

#### Password

```
SYNOPSIS:
   duplicacy password - Change the storage password

USAGE:
   duplicacy password [command options]

OPTIONS:
   -storage <storage name>  change the password used to access the specified storage
```

The *password* command decrypts the storage configuration file *config* using the old password and re-encrypts it using a new password. It does not change the encryption keys used to encrypt and decrypt chunk files, snapshot files, etc.

You can specify the storage to change the password for when working with multiple storages.

#### Add

```
SYNOPSIS:
   duplicacy add - Add an additional storage to be used for the existing repository

USAGE:
   duplicacy add [command options] <storage name> <snapshot id> <storage url>

OPTIONS:
   -encrypt, -e                    encrypt the storage with a password
   -chunk-size, -c 4M              the average size of chunks
   -max-chunk-size, -max 16M       the maximum size of chunks (defaults to chunk-size * 4)
   -min-chunk-size, -min 1M        the minimum size of chunks (defaults to chunk-size / 4)
   -compression-level, -l <level>  compression level (defaults to -1)
   -copy <storage name>            make the new storage copy-compatible with an existing one
```

The *add* command connects another storage to the current repository. Like the *init* command, if the storage has not been initialized before, a storage configuration file derived from the command line options will be uploaded, but those options will be ignored if the configuration file already exists in the storage.

A unique storage name must be given in order to distinguish it from other storages.

The -copy option is required if you later want to copy snapshots between this storage and another storage. Two storages are copy-compatible if they have the same average chunk size, the same maximum chunk size, the same minimum chunk size, the same chunk seed (used in calculating the rolling hash in the variable-size chunking algorithm), and the same hash key. If the -copy option is specified, these parameters will be copied from the existing storage rather than taken from the command line.

#### Set

```
SYNOPSIS:
   duplicacy set - Change the options for the default or specified storage

USAGE:
   duplicacy set [command options]

OPTIONS:
   -encrypt, -e[=true]       encrypt the storage with a password
   -no-backup[=true]         backup to this storage is prohibited
   -no-restore[=true]        restore from this storage is prohibited
   -no-save-password[=true]  don't save passwords or access keys to keychain/keyring
   -key                      add a key/password whose value is supplied by the -value option
   -value                    the value of the key/password
   -storage <storage name>   use the specified storage instead of the default one
```

The *set* command changes the options for the specified storage.

The -e option turns on storage encryption. If specified as -e=false, it turns storage encryption off.

The -no-backup option disallows creating backups from this repository.

The -no-restore option disallows restoring this repository to a different revision.

The -no-save-password option requires every password or token to be entered every time; nothing is saved anywhere.

The -key and -value options are used to store (in plain text) access keys or tokens needed by various storages. Please refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more details.

You can select the storage to change options for by specifying a storage name.

#### Copy

```
SYNOPSIS:
   duplicacy copy - Copy snapshots between compatible storages

USAGE:
   duplicacy copy [command options]

OPTIONS:
   -id <snapshot id>     copy snapshots with the specified id instead of all snapshot ids
   -r <revision> [+]     copy snapshots with the specified revisions
   -from <storage name>  copy snapshots from the specified storage
   -to <storage name>    copy snapshots to the specified storage
```

The *copy* command copies snapshots from one storage to another. The two storages must be copy-compatible, i.e., certain configuration parameters must be the same; one of them must have been initialized with the -copy option of the *add* command.

Instead of copying all snapshots, you can specify a set of snapshots to copy by giving -r options. The *copy* command preserves revision numbers, so if a revision number already exists on the destination storage the command will fail.

If no -from option is given, snapshots from the default storage will be copied. The -to option specifies the destination storage and is required.

## Include/Exclude Patterns

An include pattern starts with +, and an exclude pattern starts with -. Patterns may contain wildcard characters such as * and ? with their usual meaning.

When matching a path against a list of patterns, the path is compared with the part after the + or -, one pattern at a time. The order of the patterns is therefore significant. If a match with an include pattern is found, the path is said to be included without further comparisons. If a match with an exclude pattern is found, the path is said to be excluded without further comparisons. If no match is found, the path will be excluded if all patterns are include patterns, but included otherwise.

Note that in Duplicacy the path of a directory always ends with a /, even on Windows, while the path of a file does not. This can be used to write patterns that match directories only.
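
A sketch of the matching rules just described. The wildcard matching via Go's path.Match is an assumption made for illustration; only the first-match-wins ordering and the default include/exclude behavior follow the text above.

```go
package main

import (
	"fmt"
	"path"
)

// included applies the documented matching rules: patterns are tried in
// order; the first matching include (+) or exclude (-) pattern decides; if
// nothing matches, the path is excluded only when every pattern is an
// include pattern.
func included(p string, patterns []string) bool {
	allIncludes := true
	for _, pat := range patterns {
		include := pat[0] == '+'
		if !include {
			allIncludes = false
		}
		if ok, _ := path.Match(pat[1:], p); ok {
			return include
		}
	}
	return !allIncludes
}

func main() {
	patterns := []string{"+src/*", "-*.tmp", "+*.go"}
	fmt.Println(included("src/main.go", patterns)) // true: matches +src/*
	fmt.Println(included("notes.tmp", patterns))   // false: matches -*.tmp
	fmt.Println(included("readme.md", patterns))   // true: no match, and not all patterns are includes
}
```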

For the *backup* command, the include/exclude patterns are read from a file named *filters* under the *.duplicacy* directory.

For the *restore* command, the include/exclude patterns are specified as command line arguments.

## Managing Passwords

Duplicacy attempts to retrieve the storage password and the storage-specific access tokens/keys in three ways:

* If a secret vault service is available, Duplicacy will store the password entered by the user in the vault and retrieve it later when needed. On Mac OS X this is Keychain; on Linux it is gnome-keyring. On Windows the password is encrypted and decrypted by the Data Protection API, and the encrypted password is stored in the file *.duplicacy/keyring*. However, if the -no-save-password option is specified for the storage, Duplicacy will not save passwords this way.
* If an environment variable for a password is provided, Duplicacy will always use it. The table below shows the name of the environment variable for each kind of password. Note that if the storage is not the default one, the storage name is included in the name of the environment variable.
* If a matching key and its value are saved to the preferences file (.duplicacy/preferences) by the *set* command, the value will be used as the password. The last column in the table below lists the name of the preference key for each type of password.

| password type | environment variable (default storage) | environment variable (non-default storage) | key in preferences |
|:----------------:|:----------------:|:----------------:|:----------------:|
| storage password | DUPLICACY_PASSWORD | DUPLICACY_<STORAGENAME>_PASSWORD | password |
| sftp password | DUPLICACY_SSH_PASSWORD | DUPLICACY_<STORAGENAME>_SSH_PASSWORD | ssh_password |
| Dropbox Token | DUPLICACY_DROPBOX_TOKEN | DUPLICACY_<STORAGENAME>_DROPBOX_TOKEN | dropbox_token |
| S3 Access ID | DUPLICACY_S3_ID | DUPLICACY_<STORAGENAME>_S3_ID | s3_id |
| S3 Secret Key | DUPLICACY_S3_KEY | DUPLICACY_<STORAGENAME>_S3_KEY | s3_key |
| Backblaze Account ID | DUPLICACY_B2_ID | DUPLICACY_<STORAGENAME>_B2_ID | b2_id |
| Backblaze Application Key | DUPLICACY_B2_KEY | DUPLICACY_<STORAGENAME>_B2_KEY | b2_key |
| Azure Access Key | DUPLICACY_AZURE_KEY | DUPLICACY_<STORAGENAME>_AZURE_KEY | azure_key |

Note that passwords stored in environment variables or in the preferences file are in plaintext and thus insecure; this should be avoided whenever possible.

## Scripts

You can instruct Duplicacy to run a script before or after executing a command. For example, if you create a bash script named *pre-prune* under the *.duplicacy/scripts* directory, it will be run before the *prune* command starts; a script named *post-prune* will be run after the *prune* command finishes. The same rule applies to all commands except *init*.

# Duplicacy: A new generation cloud backup tool

This repository contains only binary releases and documentation for Duplicacy. It also serves as an issue tracker for user-developer communication during the beta testing phase.

## Overview

Duplicacy supports major cloud storage providers (Amazon S3, Google Cloud Storage, Microsoft Azure, Dropbox, and BackBlaze) and offers all essential features of a modern backup tool:

* Incremental backup: only back up what has been changed
* Full snapshot: although each backup is incremental, it behaves like a full snapshot
* Deduplication: identical files must be stored as one copy (file-level deduplication), and identical parts from different files must be stored as one copy (block-level deduplication)
* Encryption: encrypt not only file contents but also file paths, sizes, times, etc.
* Deletion: every backup can be deleted independently without affecting others
* Concurrent access: multiple clients can back up to the same storage at the same time

The key idea behind Duplicacy is a concept called **Lock-Free Deduplication**, which can be summarized as follows:

* Use a variable-size chunking algorithm to split files into chunks
* Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without using a centralized indexing database
* Apply a *two-step fossil collection* algorithm to remove chunks that become unreferenced after a backup is deleted

The [design document](https://github.com/gilbertchen/duplicacy-beta/blob/master/DESIGN.md) explains lock-free deduplication in detail.

## Getting Started

During beta testing only binaries are available. Please visit the [releases page](https://github.com/gilbertchen/duplicacy-beta/releases/latest) to download and run the executable for your platform. Installation is not needed.

Once you have the Duplicacy executable under your path, you can change to the directory that you want to back up (called the *repository*) and run the *init* command:

```
$ cd path/to/your/repository
$ duplicacy init mywork sftp://192.168.1.100/path/to/storage
```

This *init* command connects the repository with the remote storage at 192.168.1.100 via SFTP. It will initialize the remote storage if this has not been done before. It also assigns the snapshot id *mywork* to the repository. This snapshot id is used to uniquely identify this repository if other repositories also back up to the same storage.

You can now create snapshots of the repository by invoking the *backup* command. The first snapshot may take a while depending on the size of the repository and the upload bandwidth. Subsequent snapshots will be much faster, as only new or modified files are uploaded. Each snapshot is identified by the snapshot id and an increasing revision number starting from 1.

```sh
$ duplicacy backup -stats
```

Duplicacy provides a set of commands, such as list, check, diff, cat, and history, to manage snapshots:

```makefile
$ duplicacy list     # List all snapshots
$ duplicacy check    # Check integrity of snapshots
$ duplicacy diff     # Compare two snapshots, or the same file in two snapshots
$ duplicacy cat      # Print a file in a snapshot
$ duplicacy history  # Show how a file changes over time
```

The *restore* command rolls back the repository to a previous revision:

```sh
$ duplicacy restore -r 1
```

The *prune* command removes snapshots by revisions, tags, or retention policies:

```sh
$ duplicacy prune -r 1         # Remove the snapshot with revision number 1
$ duplicacy prune -t quick     # Remove all snapshots with the tag 'quick'
$ duplicacy prune -keep 1:7    # Keep 1 snapshot per day for snapshots older than 7 days
$ duplicacy prune -keep 7:30   # Keep 1 snapshot every 7 days for snapshots older than 30 days
$ duplicacy prune -keep 0:180  # Remove all snapshots older than 180 days
```

The first time the *prune* command is called, it removes the specified snapshots but keeps all unreferenced chunks as fossils. Since it uses the two-step fossil collection algorithm to clean chunks, you will need to run it again to remove those fossils from the storage:

```sh
$ duplicacy prune  # Chunks from deleted snapshots will be removed if the deletion criteria are met
```

To back up to multiple storages, use the *add* command to add a new storage. The *add* command is similar to the *init* command, except that the first argument is a storage name used to distinguish different storages:

```sh
$ duplicacy add s3 mywork s3://amazon.com/mybucket/path/to/storage
```

You can back up to any storage by specifying the storage name:

```sh
$ duplicacy backup -storage s3
```

However, snapshots created this way will differ between storages if the repository has changed between the two backup operations. A better approach is to use the *copy* command to copy specified snapshots from one storage to another:

```sh
$ duplicacy copy -r 1 -to s3  # Copy snapshot at revision 1 to the s3 storage
$ duplicacy copy -to s3       # Copy every snapshot to the s3 storage
```

The [User Guide](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md) contains a complete reference to all commands and other features of Duplicacy.

## Storages

Duplicacy currently supports local file storage, SFTP, and 5 cloud storage providers.

#### Local disk

```
Storage URL:  /path/to/storage     (on Linux or Mac OS X)
              C:\path\to\storage   (on Windows)
```

#### SFTP

```
Storage URL:  sftp://username@server/path/to/storage
```

Login methods include password authentication and public key authentication. Due to a limitation of the underlying Go SSH library, the key pair for public key authentication must be generated without a passphrase. To work with a key that has a passphrase, you can set up SSH agent forwarding, which is also supported by Duplicacy.

#### Dropbox

```
Storage URL:  dropbox://path/to/storage
```

For Duplicacy to access your Dropbox storage, you must provide an access token that can be obtained in one of two ways:

* Create your own app on the [Dropbox Developer](https://www.dropbox.com/developers) page, and then generate the [access token](https://blogs.dropbox.com/developers/2014/05/generate-an-access-token-for-your-own-account/)

* Or authorize Duplicacy to access its app folder inside your Dropbox (following [this link](https://dl.dropboxusercontent.com/u/95866350/start_dropbox_token.html)), and Dropbox will generate the access token (which is not visible to us, as the redirect page showing the token is merely a static html page hosted by Dropbox)

Dropbox has two advantages over other cloud providers. First, if you are already a paid user, using the unused space as backup storage is basically free. Second, unlike other providers, Dropbox does not charge API usage fees.

#### Amazon S3

```
Storage URL:  s3://amazon.com/bucket/path/to/storage           (default region is us-east-1)
              s3://region@amazon.com/bucket/path/to/storage    (other regions must be specified)
```

You'll need to input an access key and a secret key to access your Amazon S3 storage.

#### Google Cloud Storage

```
Storage URL:  s3://storage.googleapis.com/bucket/path/to/storage
```

Duplicacy uses the s3 protocol to access Google Cloud Storage, so you must enable [s3 interoperability](https://cloud.google.com/storage/docs/migrating#migration-simple) in your Google Cloud Storage settings.

#### Microsoft Azure

```
Storage URL:  azure://account/container
```

You'll need to input the access key when prompted.

#### BackBlaze

```
Storage URL:  b2://bucket
```

You'll need to input the account id and application key.

BackBlaze offers perhaps the least expensive cloud storage, at 0.5 cents per GB per month. Unfortunately their API does not support file renaming, so the -exclusive option is required when pruning old backups. This means concurrent access and deletion cannot be permitted at the same time.

## Comparison with Other Backup Tools

[duplicity](http://duplicity.nongnu.org) works by applying the rsync algorithm (or more specifically, the [librsync](https://github.com/librsync/librsync) library) to find the differences from previous backups and only then uploading the differences. It is the only existing backup tool with extensive cloud support -- the [long list](http://duplicity.nongnu.org/duplicity.1.html#sect7) of storage backends covers almost every cloud provider one can think of. However, duplicity's biggest flaw lies in its incremental model -- a chain of dependent backups starts with a full backup followed by a number of incremental ones, and ends when another full backup is uploaded. Deleting one backup will render useless all the subsequent backups on the same chain. Periodic full backups are required in order to make previous backups disposable.

[bup](https://github.com/bup/bup) also uses librsync to split files into chunks but saves chunks in the git packfile format. It doesn't support any cloud storage, or deletion of old backups.

[Obnam](http://obnam.org) got the incremental backup model right in the sense that every incremental backup is actually a full snapshot. Although Obnam also splits files into chunks, it adopts neither the rsync algorithm nor the variable-size chunking algorithm. As a result, deletions or insertions of a few bytes will foil the [deduplication](http://obnam.org/faq/dedup). Deletion of old backups is possible, but no cloud storages are supported. Multiple clients can back up to the same storage, but only sequential access is granted by the [locking on-disk data structures](http://obnam.org/locking/). It is unclear if the lack of cloud backends is due to difficulties in porting the locking data structures to cloud storage APIs.

[Attic](https://attic-backup.org) has been acclaimed by some as the [Holy Grail of backups](https://www.stavros.io/posts/holy-grail-backups). It follows the same incremental backup model as Obnam, but embraces the variable-size chunking algorithm for better performance and better deduplication. Deletion of old backups is also supported. However, no cloud backends are implemented, as in Obnam. Although concurrent backups from multiple clients to the same storage are in theory possible through the use of locking, this is [not recommended](http://librelist.com/browser//attic/2014/11/11/backing-up-multiple-servers-into-a-single-repository/#e96345aa5a3469a87786675d65da492b) by the developer due to chunk indices being kept in a local cache. Concurrent access is not only a convenience; it is a necessity for better deduplication. For instance, if multiple machines with the same OS installed can back up their entire drives to the same storage, only one copy of the system files needs to be stored, greatly reducing the storage space regardless of the number of machines. Attic still adopts the traditional approach of using a centralized indexing database to manage chunks, and relies heavily on caching to improve performance. The presence of exclusive locking makes it hard to adapt to cloud storage APIs and reduces the level of deduplication.

[restic](https://restic.github.io) is a more recent addition. It is worth mentioning here because, like Duplicacy, it is written in Go. It uses a format similar to the git packfile format, although not exactly the same. Multiple clients backing up to the same storage are still guarded by [locks](https://github.com/restic/restic/blob/master/doc/Design.md#locks). A command to delete old backups is in the developer's [plan](https://github.com/restic/restic/issues/18). S3 storage is supported, although it is unclear how hard it would be to support other cloud storage APIs because of the need for locking. Overall, it still falls in the same category as Attic. Whether it will eventually reach the same level as Attic remains to be seen.

The following table compares the feature lists of all these backup tools:

| Tool | Incremental Backup | Full Snapshot | Deduplication | Encryption | Deletion | Concurrent Access | Cloud Support |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| duplicity | Yes | No | Weak | Yes | No | No | Extensive |
| bup | Yes | Yes | Yes | Yes | No | No | No |
| Obnam | Yes | Yes | Weak | Yes | Yes | Exclusive locking | No |
| Attic | Yes | Yes | Yes | Yes | Yes | Not recommended | No |
| restic | Yes | Yes | Yes | Yes | No | Exclusive locking | S3 only |
| **Duplicacy** | **Yes** | **Yes** | **Yes** | **Yes** | **Yes** | **Lock-free** | **S3, GCS, Azure, Dropbox, BackBlaze** |