mirror of https://github.com/gilbertchen/duplicacy synced 2025-12-06 00:03:38 +00:00

cleanup markdown

m@
2017-08-31 22:18:05 -05:00
parent dfa6113279
commit 46ec852d4d
4 changed files with 98 additions and 158 deletions


@@ -8,9 +8,9 @@ Duplicacy is based on the following open source projects:
|https://github.com/Azure/azure-sdk-for-go | Apache-2.0 |
|https://github.com/tj/go-dropbox | MIT |
|https://github.com/aws/aws-sdk-go | Apache-2.0 |
|https://github.com/goamz/goamz | LGPL with static link exception |
|https://github.com/goamz/goamz | LGPL with static link exception |
|https://github.com/howeyc/gopass | ISC |
|https://github.com/tmc/keyring | ISC |
|https://github.com/pcwizz/xattr | BSD-2-Clause |
|https://github.com/pcwizz/xattr | BSD-2-Clause |
|https://github.com/minio/blake2b-simd | Apache-2.0 |
|https://github.com/go-ole/go-ole | MIT |


@@ -27,7 +27,7 @@ If exclusive access to a file storage by a single client can be guaranteed, the
chunks not referenced by any backup and delete them. However, if concurrent access is required, an unreferenced chunk
can't be trivially removed, because of the possibility that a backup procedure in progress may reference the same chunk.
The ongoing backup procedure, still unknown to the deletion procedure, may have already encountered that chunk during its
file scanning phase, but decided not to upload the chunk again since it already exists in the file storage.
file scanning phase, but decided not to upload the chunk again since it already exists in the file storage.
Fortunately, there is a solution to address the deletion problem and make lock-free deduplication practical. The solution is a *two-step fossil collection* algorithm that deletes unreferenced chunks in two steps: identify and collect them in the first step, and then permanently remove them once certain conditions are met.
@@ -47,7 +47,7 @@ In the first step of the deletion procedure, called the *fossil collection* step
be saved in a fossil collection file. The deletion procedure then exits without performing further actions. This step has not effectively changed any chunk references due to the first fossil access rule. If a backup procedure references a chunk after it is marked as a fossil, a new chunk will be uploaded because of the second fossil access rule, as shown in Figure 1.
<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_1.png?raw=true"
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_1.png?raw=true"
alt="Reference after Rename"/>
</p>
@@ -64,7 +64,7 @@ Therefore, if a backup procedure references a chunk before the chunk is marked a
delete the chunk until it sees that backup procedure finishes (as indicated by the appearance of a new snapshot file uploaded to the storage). This ensures that scenarios depicted in Figure 2 will never happen.
<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_2.png?raw=true"
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/fossil_collection_2.png?raw=true"
alt="Reference before Rename"/>
</p>
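The interplay between renaming chunks to fossils and the two fossil access rules can be sketched with a toy in-memory storage. This is only an illustration, not Duplicacy's implementation: the `Storage` class and its method names are hypothetical, and it assumes the first rule lets readers fall back to a fossil while the second rule makes a backup ignore fossils when checking whether a chunk exists.

```python
class Storage:
    """Hypothetical in-memory storage; names map to chunk contents."""

    def __init__(self):
        self.chunks = {}   # live chunks
        self.fossils = {}  # chunks that have been renamed to fossils

    def read(self, name):
        # Assumed first fossil access rule: a reader that cannot find
        # a chunk falls back to the fossil with the same name.
        if name in self.chunks:
            return self.chunks[name]
        return self.fossils[name]

    def upload_if_missing(self, name, data):
        # Assumed second fossil access rule: a backup does not look at
        # fossils, so a fossilized chunk is uploaded again.
        if name in self.chunks:
            return False
        self.chunks[name] = data
        return True

    def collect(self, referenced):
        # Fossil collection step: rename every unreferenced chunk and
        # return the list that would be saved in the collection file.
        collection = [n for n in self.chunks if n not in referenced]
        for name in collection:
            self.fossils[name] = self.chunks.pop(name)
        return collection
```

Nothing is deleted in this step; permanent removal of the returned names is deferred to a later run.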
@@ -128,25 +128,25 @@ and dir1/file3):
170593,
124309,
1734
]
]
}
```
When Duplicacy splits a file into chunks using the variable-size chunking algorithm, if the end of a file is reached and yet the boundary marker for terminating a chunk
hasn't been found, the next file, if there is one, will be read in and the chunking algorithm continues. It is as if all
hasn't been found, the next file, if there is one, will be read in and the chunking algorithm continues. It is as if all
files were packed into a big tar file which is then split into chunks.
The *content* field of a file indicates the indexes of starting and ending chunks and the corresponding offsets. For
instance, *file1* starts at chunk 0, offset 0 and ends at chunk 2, offset 6108, immediately followed by *file2*.
The backup procedure can run in one of two modes. In the default quick mode, only modified or new files are scanned. Chunks only
referenced by old files that have been modified are removed from the chunk sequence, and then chunks referenced by new
referenced by old files that have been modified are removed from the chunk sequence, and then chunks referenced by new
files are appended. Indices for unchanged files need to be updated too.
In the safe mode (enabled by the -hash option), all files are scanned and the chunk sequence is regenerated.
The length sequence stores the lengths for all chunks, which are needed when calculating some statistics such as the total
length of chunks. For a repository containing a large number of files, the size of the snapshot file can be tremendous.
length of chunks. For a repository containing a large number of files, the size of the snapshot file can be tremendous.
To make the situation worse, a big snapshot file would have to be uploaded every time, even if only a few files have changed since the
last backup. To save space, the variable-size chunking algorithm is also applied to the three dynamic fields of a snapshot
file, *files*, *chunks*, and *lengths*.
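To illustrate how the *content* field relates to the chunk and length sequences, here is a minimal sketch that treats all files as one packed stream and records where each file starts and ends. The function name `locate_files` and the sample sizes are hypothetical, and the sketch assumes the chunk lengths add up to at least the total file size.

```python
def locate_files(file_sizes, chunk_lengths):
    """Return (start_chunk, start_offset, end_chunk, end_offset) per file."""
    positions = []
    chunk, offset = 0, 0
    for size in file_sizes:
        start_chunk, start_offset = chunk, offset
        remaining = size
        # Consume whole chunks while the file extends past the current one.
        while remaining > chunk_lengths[chunk] - offset:
            remaining -= chunk_lengths[chunk] - offset
            chunk, offset = chunk + 1, 0
        offset += remaining
        positions.append((start_chunk, start_offset, chunk, offset))
    return positions


# Hypothetical example: three 10-byte chunks holding a 26-byte file
# followed by a 4-byte file.
print(locate_files([26, 4], [10, 10, 10]))
```

In this example the first file ends at chunk 2, offset 6, and the second file starts at exactly that position, mirroring how *file2* immediately follows *file1*.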
@@ -200,7 +200,7 @@ When encryption is enabled (by the -e option with the *init* or *add* command),
Here is a diagram showing how these keys are used:
<p align="center">
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/duplicacy_encryption.png?raw=true"
<img src="https://github.com/gilbertchen/duplicacy-beta/blob/master/images/duplicacy_encryption.png?raw=true"
alt="encryption"/>
</p>
@@ -210,6 +210,4 @@ Chunk content is encrypted by AES-GCM, with an encryption key that is the HMAC-S
The snapshot is encrypted by AES-GCM too, using an encryption key that is the HMAC-SHA256 of the file path with the *File Key* as the secret key.
These four random keys are saved in a file named 'config' in the storage, encrypted with a master key derived from the PBKDF2 function on
the storage password chosen by the user.
These four random keys are saved in a file named 'config' in the storage, encrypted with a master key derived from the PBKDF2 function on the storage password chosen by the user.
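The two derivations described here can be sketched with Python's standard library. This is an illustration only: the function names are hypothetical, and the salt, iteration count, and the choice of SHA-256 inside PBKDF2 are assumptions, since the text does not state them. The AES-GCM encryption itself is omitted.

```python
import hashlib
import hmac


def derive_master_key(password, salt, iterations):
    # PBKDF2 over the storage password; salt, iteration count and the
    # underlying hash are assumptions made for this sketch.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def snapshot_encryption_key(file_key, file_path):
    # HMAC-SHA256 of the file path, keyed with the File Key.
    return hmac.new(file_key, file_path.encode("utf-8"), hashlib.sha256).digest()
```

Because the snapshot key depends on the file path, two snapshot files under different paths would be encrypted with different keys even though they share the same *File Key*.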

181
GUIDE.md
View File

@@ -16,25 +16,22 @@ OPTIONS:
-chunk-size, -c 4M the average size of chunks
-max-chunk-size, -max 16M the maximum size of chunks (defaults to chunk-size * 4)
-min-chunk-size, -min 1M the minimum size of chunks (defaults to chunk-size / 4)
-pref-dir <preference directory path> Specify alternate location for .duplicacy preferences directory
-pref-dir <preference directory path> Specify alternate location for .duplicacy preferences directory
```
The *init* command first connects to the storage specified by the storage URL. If the storage has been already been
initialized before, it will download the storage configuration (stored in the file named *config*) and ignore the options provided in the command line. Otherwise, it will create the configuration file from the options and upload the file.
The *init* command first connects to the storage specified by the storage URL. If the storage has already been initialized, it will download the storage configuration (stored in the file named *config*) and ignore the options provided in the command line. Otherwise, it will create the configuration file from the options and upload the file.
The initialized storage will then become the default storage for other commands if the -storage option is not specified
for those commands. This default storage actually has a name, *default*.
The initialized storage will then become the default storage for other commands if the `-storage` option is not specified for those commands. This default storage actually has a name, *default*.
After that, it will prepare the the current working directory as the repository to be backed up. Under the hood, it will create a directory
named *.duplicacy* in the repository and put a file named *preferences* that stores the snapshot id and encryption and storage options.
After that, it will prepare the current working directory as the repository to be backed up. Under the hood, it will create a directory named *.duplicacy* in the repository and put a file named *preferences* that stores the snapshot id and encryption and storage options.
The snapshot id is an id used to distinguish different repositories connected to the same storage. Each repository must have a unique snapshot id. A snapshot id must contain only characters valid in Linux and Windows paths (letters, digits, underscores, dashes, etc.), but cannot include `/`, `\`, or `@`.
The -e option controls whether or not encryption will be enabled for the storage. If encryption is enabled, you will be prompted to enter a storage password.
The `-e` option controls whether or not encryption will be enabled for the storage. If encryption is enabled, you will be prompted to enter a storage password.
The three chunk size parameters are passed to the variable-size chunking algorithm. Their values are important to the overall performance, especially for cloud storages. If the chunk size is too small, a lot of overhead will be in sending requests and receiving responses. If the chunk size is too large, the effect of deduplication will be less obvious as more data will need to be transferred with each chunk.
The three chunk size parameters are passed to the variable-size chunking algorithm. Their values are important to the overall performance, especially for cloud storages. If the chunk size is too small, a lot of overhead will be in sending requests and receiving responses. If the chunk size is too large, the effect of de-duplication will be less obvious as more data will need to be transferred with each chunk.
The -pref-dir controls the location of the preferences directory. If not specified, a directory named .duplicacy is created in the repository. If specified, it must point to a non-existing directory. The directory is created and a .duplicacy file is created in the repository. The .duplicacy file contains the absolute path name to the preferences directory.
The `-pref-dir` option controls the location of the preferences directory. If not specified, a directory named .duplicacy is created in the repository. If specified, it must point to a non-existing directory. The directory is created and a .duplicacy file is created in the repository. The .duplicacy file contains the absolute path name to the preferences directory.
Once a storage has been initialized with these parameters, these parameters cannot be modified any more.
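The defaults listed for the three chunk size options (maximum is the chunk size times 4, minimum is the chunk size divided by 4) can be expressed as a small helper. The names `parse_size` and `chunk_sizes` are hypothetical, and treating `M` as 1024 × 1024 bytes is an assumption of this sketch.

```python
def parse_size(text):
    # Assumes binary multiples for the K/M/G suffixes.
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def chunk_sizes(average="4M", maximum=None, minimum=None):
    avg = parse_size(average)
    # Fall back to the documented defaults when a bound is not given.
    max_size = parse_size(maximum) if maximum else avg * 4
    min_size = parse_size(minimum) if minimum else avg // 4
    return min_size, avg, max_size
```

With the default average of `4M`, this yields a minimum of 1M and a maximum of 16M, matching the option listing.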
@@ -52,29 +49,24 @@ OPTIONS:
-t <tag> assign a tag to the backup
-stats show statistics during and after backup
-threads <n> number of uploading threads
-limit-rate <kB/s> the maximum upload rate (in kilobytes/sec)
-limit-rate <kB/s> the maximum upload rate (in kilobytes/sec)
-vss enable the Volume Shadow Copy service (Windows only)
-storage <storage name> backup to the specified storage instead of the default one
```
The *backup* command creates a snapshot of the repository and uploads it to the storage. If -hash is not provided,
it will upload new or modified files since last backup by comparing file sizes and timestamps.
Otherwise, every file is scanned to detect changes.
The *backup* command creates a snapshot of the repository and uploads it to the storage. If `-hash` is not provided, it will upload new or modified files since the last backup by comparing file sizes and timestamps. Otherwise, every file is scanned to detect changes.
You can assign a tag to the snapshot so that later you can refer to it by tag in other commands.
If the -stats option is specified, statistical information such as transfer speed, the number of chunks will be displayed
throughout the backup procedure.
If the `-stats` option is specified, statistical information such as transfer speed and the number of chunks will be displayed throughout the backup procedure.
The -threads option can be used to specify more than one thread to upload chunks.
The `-threads` option can be used to specify more than one thread to upload chunks.
The -limit-rate option sets a cape on the maximum upload rate.
The `-limit-rate` option sets a cap on the maximum upload rate.
The -vss option works on Windows only to turn on the Volume Shadow Copy service such that files opened by other
processes with exclusive locks can be read as usual.
The `-vss` option works on Windows only to turn on the Volume Shadow Copy service such that files opened by other processes with exclusive locks can be read as usual.
When the repository can have multiple storages (added by the *add* command), you can select the storage to back up to
by giving a storage name.
When the repository can have multiple storages (added by the *add* command), you can select the storage to back up to by giving a storage name.
You can specify patterns to include/exclude files by putting them in a file named *.duplicacy/filters*. Please refer to the [Include/Exclude Patterns](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#includeexclude-patterns) section for how to specify the patterns.
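The size-and-timestamp comparison used when `-hash` is not given can be pictured as follows. This is a simplified sketch, not Duplicacy's code: `changed_files` is a hypothetical helper, and each file is represented only by a `(size, mtime)` pair.

```python
def changed_files(previous, current):
    """Both arguments map a path to a (size, mtime) tuple."""
    # A file is new or modified if it is absent from the previous
    # snapshot or if its size or timestamp differs.
    return sorted(
        path for path, meta in current.items() if previous.get(path) != meta
    )
```

A file whose content changed without altering its size or timestamp would be missed by such a comparison, which is the case the `-hash` option addresses by scanning every file.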
@@ -93,29 +85,25 @@ OPTIONS:
-delete delete files not in the snapshot
-stats show statistics during and after restore
-threads <n> number of downloading threads
-limit-rate <kB/s> the maximum download rate (in kilobytes/sec)
-limit-rate <kB/s> the maximum download rate (in kilobytes/sec)
-storage <storage name> restore from the specified storage instead of the default one
```
The *restore* command restores the repository to a previous revision. By default the restore procedure will treat
files that have the same sizes and timestamps as those in the snapshot as unchanged files, but with the -hash option, every file will be fully scanned to make sure they are in fact unchanged.
The *restore* command restores the repository to a previous revision. By default, the restore procedure will treat files that have the same sizes and timestamps as those in the snapshot as unchanged files, but with the `-hash` option, every file will be fully scanned to make sure they are in fact unchanged.
By default the restore procedure will not overwriting existing files, unless the -overwrite option is specified.
By default the restore procedure will not overwrite existing files, unless the `-overwrite` option is specified.
The -delete option indicates that files not in the snapshot will be removed.
The `-delete` option indicates that files not in the snapshot will be removed.
If the -stats option is specified, statistical information such as transfer speed, number of chunks will be displayed
throughout the restore procedure.
If the `-stats` option is specified, statistical information such as transfer speed and the number of chunks will be displayed throughout the restore procedure.
The -threads option can be used to specify more than one thread to download chunks.
The `-threads` option can be used to specify more than one thread to download chunks.
The -limit-rate option sets a cape on the maximum upload rate.
The `-limit-rate` option sets a cap on the maximum download rate.
When the repository can have multiple storages (added by the *add* command), you can select the storage to restore from by specifying the storage name.
Unlike the *backup* procedure that reading the include/exclude patterns from a file, the *restore* procedure reads them
from the command line. If the patterns can cause confusion to the command line argument parser, -- should be prepended to
the patterns. Please refer to the [Include/Exclude Patterns](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#includeexclude-patterns) section for how to specify patterns.
Unlike the *backup* procedure, which reads the include/exclude patterns from a file, the *restore* procedure reads them from the command line. If the patterns could confuse the command line argument parser, `--` should be prepended to the patterns. Please refer to the [Include/Exclude Patterns](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#includeexclude-patterns) section for how to specify patterns.
#### List
@@ -124,7 +112,7 @@ SYNOPSIS:
duplicacy list - List snapshots
USAGE:
duplicacy list [command options]
duplicacy list [command options]
OPTIONS:
-all, -a list snapshots with any id
@@ -137,24 +125,17 @@ OPTIONS:
-storage <storage name> retrieve snapshots from the specified storage
```
The *list* command lists information about specified snapshots. By default it will list snapshots created from the
current repository, but you can list all snapshots stored in the storage by specifying the -all option, or list snapshots
with a different snapshot id using the -id option, and/or snapshots with a particular tag with the -t option.
The *list* command lists information about specified snapshots. By default it will list snapshots created from the current repository, but you can list all snapshots stored in the storage by specifying the `-all` option, or list snapshots with a different snapshot id using the `-id` option, and/or snapshots with a particular tag with the `-t` option.
The revision number is a number assigned to the snapshot when it is being created. This number will keep increasing
every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using
the -r option, which either takes a single revision number (-r 123) or a range (-r 123-456).
There can be multiple -r options.
The revision number is a number assigned to the snapshot when it is being created. This number will keep increasing every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using the `-r` option, which either takes a single revision number `-r 123` or a range `-r 123-456`. There can be multiple `-r` options.
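How several `-r` values combine into one set of revisions can be sketched as below. The helper `parse_revisions` is hypothetical, and treating a range such as `123-456` as inclusive on both ends is an assumption.

```python
def parse_revisions(options):
    """Expand values such as "123" or "123-456" into revision numbers."""
    revisions = set()
    for option in options:
        if "-" in option:
            first, last = option.split("-", 1)
            # Assumed inclusive range.
            revisions.update(range(int(first), int(last) + 1))
        else:
            revisions.add(int(option))
    return sorted(revisions)
```

For instance, passing the values `3` and `5-7` would select revisions 3, 5, 6, and 7.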
If -files is specified, for each snapshot to be listed, this command will also print information about every file
contained in the snapshot.
If `-files` is specified, for each snapshot to be listed, this command will also print information about every file contained in the snapshot.
If -chunks is specified, the command will also print out every chunk the snapshot references.
If `-chunks` is specified, the command will also print out every chunk the snapshot references.
The -reset-password option is used to reset stored passwords and to allow passwords to be entered again. Please refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more information.
The `-reset-password` option is used to reset stored passwords and to allow passwords to be entered again. Please refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more information.
When the repository can have multiple storages (added by the *add* command), you can specify the storage to list
by specifying the storage name.
When the repository can have multiple storages (added by the *add* command), you can specify the storage to list by specifying the storage name.
#### Check
```
@@ -178,23 +159,15 @@ OPTIONS:
The *check* command checks, for each specified snapshot, that all referenced chunks exist in the storage.
By default the *check* command will check snapshots created from the
current repository, but you can check all snapshots stored in the storage at once by specifying the -all option, or
snapshots from a different repository using the -id option, and/or snapshots with a particular tag with the -t option.
current repository, but you can check all snapshots stored in the storage at once by specifying the `-all` option, or snapshots from a different repository using the `-id` option, and/or snapshots with a particular tag with the `-t` option.
The revision number is a number assigned to the snapshot when it is being created. This number will keep increasing
every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using
the -r option, which either takes a single revision number (-r 123) or a range (-r 123-456).
There can be multiple -r options.
The revision number is a number assigned to the snapshot when it is being created. This number will keep increasing every time a new snapshot is created from a repository. You can refer to snapshots by their revision numbers using the `-r` option, which either takes a single revision number `-r 123` or a range `-r 123-456`. There can be multiple `-r` options.
By default the *check* command only verifies the existence of chunks. To verify the full integrity of a snapshot,
you should specify the -files option, which will download chunks and compute file hashes in memory, to
make sure that all hashes match.
By default the *check* command only verifies the existence of chunks. To verify the full integrity of a snapshot, you should specify the `-files` option, which will download chunks and compute file hashes in memory, to make sure that all hashes match.
By default the *check* command does not find fossils. If the -fossils option is specified, it will find
the fossil if the referenced chunk does not exist. if the -resurrect option is specified, it will turn the fossil back into a chunk.
By default the *check* command does not find fossils. If the `-fossils` option is specified, it will find the fossil if the referenced chunk does not exist. If the `-resurrect` option is specified, it will turn the fossil back into a chunk.
When the repository can have multiple storages (added by the *add* command), you can specify the storage to check
by specifying the storage name.
When the repository can have multiple storages (added by the *add* command), you can specify the storage to check by specifying the storage name.
#### Cat
@@ -217,9 +190,9 @@ The file must be specified with a path relative to the repository.
You can specify a different snapshot id rather than the default id.
The -r option is optional. If not specified, the latest revision will be selected.
The `-r` option is optional. If not specified, the latest revision will be selected.
You can use the -storage option to select a different storage other than the default one.
You can use the `-storage` option to select a different storage other than the default one.
#### Diff
```
@@ -235,17 +208,15 @@ OPTIONS:
-hash compute the hashes of on-disk files
-storage <storage name> retrieve files from the specified storage
```
The *diff* command compares the same file in two different snapshots if a file is given, otherwise compares the
two snapshots.
The *diff* command compares the same file in two different snapshots if a file is given, otherwise compares the two snapshots.
The file must be specified with a path relative to the repository.
You can specify a different snapshot id rather than the default snapshot id.
If only one revision is given by -r, the right hand side of the comparison will be the on-disk file.
The -hash option can then instruct this command to compute the hash of the file.
If only one revision is given by `-r`, the right hand side of the comparison will be the on-disk file. The `-hash` option can then instruct this command to compute the hash of the file.
You can use the -storage option to select a different storage other than the default one.
You can use the `-storage` option to select a different storage other than the default one.
#### History
```
@@ -264,13 +235,11 @@ OPTIONS:
The *history* command shows how the hash, size, and timestamp of a file change over the specified set of revisions.
You can specify a different snapshot id rather than the default snapshot id, and multiple -r options to specify the
set of revisions.
You can specify a different snapshot id rather than the default snapshot id, and multiple `-r` options to specify the set of revisions.
The -hash option is to compute the hash of the on-disk file. Otherwise, only the size and timestamp of the on-disk
file will be included.
The `-hash` option is to compute the hash of the on-disk file. Otherwise, only the size and timestamp of the on-disk file will be included.
You can use the -storage option to select a different storage other than the default one.
You can use the `-storage` option to select a different storage other than the default one.
#### Prune
```
@@ -295,16 +264,11 @@ OPTIONS:
-storage <storage name> prune snapshots from the specified storage
```
The *prune* command implements the two-step fossil collection algorithm. It will first find fossil collection files
from previous runs and check if contained fossils are eligible for permanent deletion (the fossil deletion step). Then it
will search for snapshots to be deleted, mark unreferenced chunks as fossils (by renaming) and save them in a new fossil
collection file stored locally (the fossil collection step).
The *prune* command implements the two-step fossil collection algorithm. It will first find fossil collection files from previous runs and check if contained fossils are eligible for permanent deletion (the fossil deletion step). Then it will search for snapshots to be deleted, mark unreferenced chunks as fossils (by renaming) and save them in a new fossil collection file stored locally (the fossil collection step).
If a snapshot id is specified, that snapshot id will be used instead of the default one. The -a option will find
snapshots with any id. Snapshots to be deleted can be specified by revision numbers, by a tag, by retention policies,
or by any combination of them.
If a snapshot id is specified, that snapshot id will be used instead of the default one. The `-a` option will find snapshots with any id. Snapshots to be deleted can be specified by revision numbers, by a tag, by retention policies, or by any combination of them.
The retention policies are specified by the -keep option, which accepts an argument in the form of two numbers *n:m*, where *n* indicates the number of days between two consecutive snapshots to keep, and *m* means that the policy only applies to snapshots at least *m* day old. If *n* is zero, any snapshots older than *m* days will be removed.
The retention policies are specified by the `-keep` option, which accepts an argument in the form of two numbers *n:m*, where *n* indicates the number of days between two consecutive snapshots to keep, and *m* means that the policy only applies to snapshots at least *m* days old. If *n* is zero, any snapshots older than *m* days will be removed.
Here are a few sample retention policies:
@@ -315,37 +279,28 @@ $ duplicacy prune -keep 30:180 # Keep 1 snapshot every 30 days for snapshots
$ duplicacy prune -keep 0:360 # Keep no snapshots older than 360 days
```
Multiple -keep options must be sorted by their *m* values in decreasing order. For instance, to combine the above policies into one line, it would become:
Multiple `-keep` options must be sorted by their *m* values in decreasing order. For instance, to combine the above policies into one line, it would become:
```sh
$ duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
```
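The ordering requirement on multiple `-keep` options can be checked mechanically. The sketch below parses *n:m* pairs, rejects lists that are not sorted by decreasing *m*, and looks up which policy would govern a snapshot of a given age. Both function names are hypothetical, and picking the first policy whose *m* does not exceed the snapshot age is an assumption about how the policies combine.

```python
def parse_keep(options):
    """Parse "n:m" strings and verify m values are in decreasing order."""
    policies = []
    for option in options:
        n, m = (int(part) for part in option.split(":"))
        policies.append((n, m))
    ms = [m for _, m in policies]
    if ms != sorted(ms, reverse=True):
        raise ValueError("-keep options must be sorted by decreasing m")
    return policies


def interval_for_age(policies, age_days):
    # Assumed rule: the first policy whose m is not greater than the
    # snapshot age applies; None means no policy covers the snapshot.
    for n, m in policies:
        if age_days >= m:
            return n
    return None
```

With the four policies combined above, a 200-day-old snapshot falls under `30:180`, while a 3-day-old snapshot is not covered by any policy and is kept.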
The -exhaustive option will scan the list of all chunks in the storage, therefore it will find not only
unreferenced chunks from deleted snapshots, but also chunks that become unreferenced for other reasons, such as
those from an incomplete backup. It will also find any file that does not look like a chunk file.
In contrast, a default *prune* command will only identify
The `-exhaustive` option will scan the list of all chunks in the storage, therefore it will find not only unreferenced chunks from deleted snapshots, but also chunks that become unreferenced for other reasons, such as those from an incomplete backup. It will also find any file that does not look like a chunk file. In contrast, a default *prune* command will only identify
chunks referenced by deleted snapshots but not any other snapshots.
The -exclusive option will assume that no other clients are accessing the storage, effectively disabling the
*two-step fossil collection* algorithm. With this option, the *prune* command will immediately remove unreferenced chunks.
The `-exclusive` option will assume that no other clients are accessing the storage, effectively disabling the *two-step fossil collection* algorithm. With this option, the *prune* command will immediately remove unreferenced chunks.
The -dryrun option is used to test what changes the *prune* command would have done. It is guaranteed not to make
any changes on the storage, not even creating the local fossil collection file. The following command checks if the
chunk directory is clean (i.e., if there are any unreferenced chunks, temporary files, or anything else):
The `-dryrun` option is used to test what changes the *prune* command would have done. It is guaranteed not to make any changes on the storage, not even creating the local fossil collection file. The following command checks if the chunk directory is clean (i.e., if there are any unreferenced chunks, temporary files, or anything else):
```
$ duplicacy prune -d -exclusive -exhaustive # Prints out nothing if the chunk directory is clean
```
The -delete-only option will skip the fossil collection step, while the -collect-only option will skip the fossil deletion step.
The `-delete-only` option will skip the fossil collection step, while the `-collect-only` option will skip the fossil deletion step.
For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least
one new snapshot from *each* snapshot id must be created between two runs of the *prune* command. However, some repository
may not be set up to back up with a regular schedule, and thus literally blocking other repositories from deleting any fossils. Duplicacy by default will ignore repositories that have no new backup in the past 7 days. It also provide an
-ignore option that can be used to skip certain repositories when deciding the deletion criteria.
For fossils collected in the fossil collection step to be eligible for safe deletion in the fossil deletion step, at least one new snapshot from *each* snapshot id must be created between two runs of the *prune* command. However, some repositories may not be set up to back up on a regular schedule, thereby blocking other repositories from deleting any fossils. Duplicacy by default will ignore repositories that have no new backup in the past 7 days. It also provides an `-ignore` option that can be used to skip certain repositories when deciding the deletion criteria.
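The deletion criteria can be summarized in a short predicate. This is a sketch under stated assumptions rather than the actual implementation: `fossils_deletable` and its parameters are hypothetical names, and each snapshot id is represented only by the time of its latest snapshot.

```python
from datetime import datetime, timedelta


def fossils_deletable(collected_at, latest_snapshot_by_id, now,
                      ignored=(), inactive_days=7):
    """Return True if every relevant snapshot id has a newer snapshot."""
    for snapshot_id, latest in latest_snapshot_by_id.items():
        if snapshot_id in ignored:
            continue  # skipped explicitly, as with the -ignore option
        if now - latest > timedelta(days=inactive_days):
            continue  # no new backup in the past 7 days: ignored by default
        if latest <= collected_at:
            return False  # this repository has not backed up since collection
    return True
```

A single active repository without a new snapshot since the collection step is enough to postpone the deletion of all fossils in that collection.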
You can use the -storage option to select a different storage other than the default one.
You can use the `-storage` option to select a different storage other than the default one.
#### Password
@@ -384,17 +339,11 @@ OPTIONS:
-copy <storage name> make the new storage copy-compatible with an existing one
```
The *add* command connects another storage to the current repository. Like the *init* command, if the storage has not
been initialized before, a storage configuration file derived from the command line options will be uploaded, but those
options will be ignored if the configuration file already exists in the storage.
The *add* command connects another storage to the current repository. Like the *init* command, if the storage has not been initialized before, a storage configuration file derived from the command line options will be uploaded, but those options will be ignored if the configuration file already exists in the storage.
A unique storage name must be given in order to distinguish it from other storages.
The -copy option is required if later you want to copy snapshots between this storage and another storage.
Two storages are copy-compatible if they have the same average chunk size, the same maximum chunk size,
the same minimum chunk size, the same chunk seed (used in calculating the rolling hash in the variable-size chunks
algorithm), and the same hash key. If the -copy option is specified, these parameters will be copied from
the existing storage rather than from the command line.
The `-copy` option is required if later you want to copy snapshots between this storage and another storage. Two storages are copy-compatible if they have the same average chunk size, the same maximum chunk size, the same minimum chunk size, the same chunk seed (used in calculating the rolling hash in the variable-size chunks algorithm), and the same hash key. If the `-copy` option is specified, these parameters will be copied from the existing storage rather than from the command line.
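
The copy-compatibility condition can be expressed as a comparison of the five parameters listed above. The sketch below is only an illustration: the `StorageConfig` class, its field names, and the example values are assumptions made for this example and do not reflect Duplicacy's internal data structures.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StorageConfig:
    name: str                 # storage name; not part of the compatibility check
    average_chunk_size: int
    maximum_chunk_size: int
    minimum_chunk_size: int
    chunk_seed: bytes
    hash_key: bytes


# The parameters that must match for two storages to be copy-compatible.
COPY_FIELDS = ("average_chunk_size", "maximum_chunk_size",
               "minimum_chunk_size", "chunk_seed", "hash_key")


def copy_compatible(a, b):
    """Return True if both storages agree on every copy-relevant parameter."""
    return all(getattr(a, f) == getattr(b, f) for f in COPY_FIELDS)


def make_copy_compatible(existing, new_name):
    """Mimic the effect of -copy: take the parameters from an existing storage."""
    return replace(existing, name=new_name)


default = StorageConfig("default", 4 << 20, 16 << 20, 1 << 20, b"seed", b"key")
offsite = make_copy_compatible(default, "offsite")
print(copy_compatible(default, offsite))
```

A storage created with different chunk sizes, seed, or hash key would fail this check, which is why the parameters are taken from the existing storage rather than from the command line when `-copy` is given.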
#### Set
```
@@ -416,16 +365,15 @@ OPTIONS:
The *set* command changes the options for the specified storage.
The -e option turns on the storage encryption. If specified as -e=false, it turns off the storage encryption.
The `-e` option turns on the storage encryption. If specified as `-e=false`, it turns off the storage encryption.
The -no-backup option will not allow backups from this repository to be created.
The `-no-backup` option will not allow backups from this repository to be created.
The -no-restore option will not allow restoring this repository to a different revision.
The `-no-restore` option will not allow restoring this repository to a different revision.
The -no-save-password option will require every password or token to be entered every time and not saved anywhere.
The `-no-save-password` option will require every password or token to be entered every time and not saved anywhere.
The -key and -value options are used to store (in plain text) access keys or tokens needed by various storages. Please
refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more details.
The `-key` and `-value` options are used to store (in plain text) access keys or tokens needed by various storages. Please refer to the [Managing Passwords](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md#managing-passwords) section for more details.
You can select a storage to change options for by specifying a storage name.
@@ -445,14 +393,11 @@ OPTIONS:
-to <storage name> copy snapshots to the specified storage
```
The *copy* command copies snapshots from one storage to another storage. They must be copy-compatible, i.e., some
configuration parameters must be the same. One storage must be initialized with the -copy option provided by the *add* command.
The *copy* command copies snapshots from one storage to another storage. They must be copy-compatible, i.e., some configuration parameters must be the same. One storage must be initialized with the `-copy` option provided by the *add* command.
Instead of copying all snapshots, you can specify a set of snapshots to copy by giving the -r options. The *copy* command
preserves the revision numbers, so if a revision number already exists on the destination storage the command will fail.
Instead of copying all snapshots, you can specify a set of snapshots to copy by giving the `-r` options. The *copy* command preserves the revision numbers, so if a revision number already exists on the destination storage the command will fail.
If no -from option is given, the snapshots from the default storage will be copied. The -to option specifies the
destination storage and is required.
If no `-from` option is given, the snapshots from the default storage will be copied. The `-to` option specifies the destination storage and is required.
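
Because the *copy* command preserves revision numbers, a conflict check of the following kind is implied by the description above. This is a hedged sketch only: `plan_copy` and its arguments are hypothetical names, and the real command operates on snapshots in storages rather than on plain sets of integers.

```python
def plan_copy(source_revisions, destination_revisions, selected=None):
    """Return the revisions to copy, or raise if any already exists at the destination.

    selected plays the role of the -r options; None means "copy all snapshots".
    """
    wanted = sorted(source_revisions if selected is None else selected)
    conflicts = [r for r in wanted if r in destination_revisions]
    if conflicts:
        # Revision numbers are preserved, so an existing number cannot be reused.
        raise ValueError(f"revisions already exist on destination: {conflicts}")
    return wanted


print(plan_copy({1, 2, 3}, {1}, selected=[2, 3]))
```

Selecting only revisions that are absent from the destination avoids the failure in this model.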
## Include/Exclude Patterns
@@ -521,7 +466,7 @@ Duplicacy maintains a local cache under the `.duplicacy/cache` folder in the rep
At the end of a backup operation, Duplicacy will clean up the local cache in such a way that only chunks composing the snapshot file from the last backup will stay in the cache. All other chunks will be removed from the cache. However, if the *prune* command has been run before (which will leave a `.duplicacy/collection` folder in the repository), then the *backup* command won't perform any cache cleanup and will instead defer that to the *prune* command.
At the end of a prune operation, Duplicacy will remove all chunks from the local cache except those composing the snapshot file from the last backup (those that would be kept by the *backup* command), as well as chunks that contain information about chunks referenced by *all* backups from *all* repositories connected to the same storage url.
At the end of a prune operation, Duplicacy will remove all chunks from the local cache except those composing the snapshot file from the last backup (those that would be kept by the *backup* command), as well as chunks that contain information about chunks referenced by *all* backups from *all* repositories connected to the same storage url.
Other commands, such as *list* and *check*, do not clean up the local cache at all, so the local cache may keep growing if many of these commands run consecutively. However, once a *backup* or a *prune* command is invoked, the local cache should shrink to its normal size.
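
The backup-side cleanup rule described in this section can be modeled in a few lines. The function name, the use of sets of chunk ids, and the `prune_has_run` flag (standing in for the presence of the `.duplicacy/collection` folder) are assumptions made for illustration, not Duplicacy's actual code.

```python
def cache_after_backup(cached_chunks, last_snapshot_chunks, prune_has_run):
    """Return the chunk ids that remain in the local cache after a backup."""
    if prune_has_run:
        # A collection folder exists: backup skips cleanup and defers it to prune.
        return set(cached_chunks)
    # Otherwise keep only the chunks composing the last backup's snapshot file.
    return set(cached_chunks) & set(last_snapshot_chunks)


cached = {"a", "b", "c"}
print(cache_after_backup(cached, {"a", "d"}, prune_has_run=False))
```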


@@ -8,10 +8,10 @@ There is a special edition of Duplicacy developed for VMware vSphere (ESXi) name
## Features
Duplicacy currently supports major cloud storage providers (Amazon S3, Google Cloud Storage, Microsoft Azure, Dropbox, Backblaze, Google Drive, Microsoft OneDrive, and Hubic) and offers all essential features of a modern backup tool:
Duplicacy currently supports major cloud storage providers (Amazon S3, Google Cloud Storage, Microsoft Azure, Dropbox, Backblaze B2, Google Drive, Microsoft OneDrive, and Hubic) and offers all essential features of a modern backup tool:
* Incremental backup: only back up what has been changed
* Full snapshot : although each backup is incremental, it must behave like a full snapshot for easy restore and deletion
* Full snapshot: although each backup is incremental, it must behave like a full snapshot for easy restore and deletion
* Deduplication: identical files must be stored as one copy (file-level deduplication), and identical parts from different files must be stored as one copy (block-level deduplication)
* Encryption: encrypt not only file contents but also file paths, sizes, times, etc.
* Deletion: every backup can be deleted independently without affecting others
@@ -133,7 +133,7 @@ Storage URL: /path/to/storage (on Linux or Mac OS X)
```
</details>
<details> <summary>SFTP</summary>
<details> <summary>SFTP</summary>
```
Storage URL: sftp://username@server/path/to/storage (path relative to the home directory)
@@ -200,7 +200,7 @@ Storage URL: gcs://bucket/path/to/storage
```
Starting from version 2.0.0, a new Google Cloud Storage backend is added which is implemented using the [official Google client library](https://godoc.org/cloud.google.com/go/storage). You must first obtain a credential file by [authorizing](https://duplicacy.com/gcp_start) Duplicacy to access your Google Cloud Storage account or by [downloading](https://console.cloud.google.com/projectselector/iam-admin/serviceaccounts) a service account credential file.
You can also use the s3 protocol to access Google Cloud Storage. To do this, you must enable the [s3 interoperability](https://cloud.google.com/storage/docs/migrating#migration-simple) in your Google Cloud Storage settings and set the storage url as `s3://storage.googleapis.com/bucket/path/to/storage`.
</details>
@@ -233,8 +233,7 @@ Backblaze's B2 storage is one of the least expensive (at 0.5 cent per GB per mon
Storage URL: gcd://path/to/storage
```
To use Google Drive as the storage, you first need to download a token file from https://duplicacy.com/gcd_start by
authorizing Duplicacy to access your Google Drive, and then enter the path to this token file to Duplicacy when prompted.
To use Google Drive as the storage, you first need to download a token file from https://duplicacy.com/gcd_start by authorizing Duplicacy to access your Google Drive, and then enter the path to this token file to Duplicacy when prompted.
</details>
@@ -244,8 +243,7 @@ authorizing Duplicacy to access your Google Drive, and then enter the path to th
Storage URL: one://path/to/storage
```
To use Microsoft OneDrive as the storage, you first need to download a token file from https://duplicacy.com/one_start by
authorizing Duplicacy to access your OneDrive, and then enter the path to this token file to Duplicacy when prompted.
To use Microsoft OneDrive as the storage, you first need to download a token file from https://duplicacy.com/one_start by authorizing Duplicacy to access your OneDrive, and then enter the path to this token file to Duplicacy when prompted.
</details>
@@ -255,8 +253,7 @@ authorizing Duplicacy to access your OneDrive, and then enter the path to this t
Storage URL: hubic://path/to/storage
```
To use Hubic as the storage, you first need to download a token file from https://duplicacy.com/hubic_start by
authorizing Duplicacy to access your Hubic drive, and then enter the path to this token file to Duplicacy when prompted.
To use Hubic as the storage, you first need to download a token file from https://duplicacy.com/hubic_start by authorizing Duplicacy to access your Hubic drive, and then enter the path to this token file to Duplicacy when prompted.
Hubic offers the most free space (25GB) of all major cloud providers and there is no bandwidth charge (same as Google Drive and OneDrive), so it may be worth a try.
@@ -275,18 +272,18 @@ Deletion of old backups is possible, but no cloud storages are supported.
Multiple clients can back up to the same storage, but only sequential access is granted by the [locking on-disk data structures](http://obnam.org/locking/).
It is unclear if the lack of cloud backends is due to difficulties in porting the locking data structures to cloud storage APIs.
[Attic](https://attic-backup.org) has been acclaimed by some as the [Holy Grail of backups](https://www.stavros.io/posts/holy-grail-backups). It follows the same incremental backup model as Obnam, but embraces the variable-size chunk algorithm for better performance and better deduplication. Deletion of old backups is also supported. However, no cloud backends are implemented, as in Obnam. Although concurrent backups from multiple clients to the same storage are in theory possible by the use of locking, it is
[not recommended](http://librelist.com/browser//attic/2014/11/11/backing-up-multiple-servers-into-a-single-repository/#e96345aa5a3469a87786675d65da492b) by the developer due to chunk indices being kept in a local cache.
[Attic](https://attic-backup.org) has been acclaimed by some as the [Holy Grail of backups](https://www.stavros.io/posts/holy-grail-backups). It follows the same incremental backup model as Obnam, but embraces the variable-size chunk algorithm for better performance and better deduplication. Deletion of old backups is also supported. However, no cloud backends are implemented, as in Obnam. Although concurrent backups from multiple clients to the same storage are in theory possible by the use of locking, it is
[not recommended](http://librelist.com/browser//attic/2014/11/11/backing-up-multiple-servers-into-a-single-repository/#e96345aa5a3469a87786675d65da492b) by the developer due to chunk indices being kept in a local cache.
Concurrent access is not only a convenience; it is a necessity for better deduplication. For instance, if multiple machines with the same OS installed can back up their entire drives to the same storage, only one copy of the system files needs to be stored, greatly reducing the storage space regardless of the number of machines. Attic still adopts the traditional approach of using a centralized indexing database to manage chunks, and relies heavily on caching to improve performance. The presence of exclusive locking makes it hard to be adapted for cloud storage APIs and reduces the level of deduplication.
[restic](https://restic.github.io) is a more recent addition. It is worth mentioning here because, like Duplicacy, it is written in Go. It uses a format similar to the git packfile format. Multiple clients backing up to the same storage are still guarded by
[restic](https://restic.github.io) is a more recent addition. It is worth mentioning here because, like Duplicacy, it is written in Go. It uses a format similar to the git packfile format. Multiple clients backing up to the same storage are still guarded by
[locks](https://github.com/restic/restic/blob/master/doc/Design.md#locks). A prune operation will therefore completely block all other clients connected to the storage from doing their regular backups. Moreover, since most cloud storage services do not provide a locking service, the best effort is to use some basic file operations to simulate a lock, but distributed locking is known to be a hard problem and it is unclear how reliable restic's lock implementation is. A faulty implementation may cause a prune operation to accidentally delete data still in use, resulting in unrecoverable data loss. This is the exact problem that we avoided by taking the lock-free approach.
The following table compares the feature lists of all these backup tools:
| Feature/Tool | duplicity | bup | Obnam | Attic | restic | **Duplicacy** |
| Feature/Tool | duplicity | bup | Obnam | Attic | restic | **Duplicacy** |
|:------------------:|:---------:|:---:|:-----------------:|:---------------:|:-----------------:|:-------------:|
| Incremental Backup | Yes | Yes | Yes | Yes | Yes | **Yes** |
| Full Snapshot | No | Yes | Yes | Yes | Yes | **Yes** |
@@ -303,20 +300,20 @@ The following table compares the feature lists of all these backup tools:
Duplicacy is not only more feature-rich but also faster than other backup tools. The following table lists the running times in seconds of backing up the [Linux code base](https://github.com/torvalds/linux) using Duplicacy and 3 other tools. Clearly Duplicacy is the fastest by a significant margin.
| | Duplicacy | restic | Attic | duplicity |
| | Duplicacy | restic | Attic | duplicity |
|:------------------:|:----------------:|:----------:|:----------:|:-----------:|
| Initial backup | 13.7 | 20.7 | 26.9 | 44.2 |
| 2nd backup | 4.8 | 8.0 | 15.4 | 19.5 |
| 3rd backup | 6.9 | 11.9 | 19.6 | 29.8 |
| 4th backup | 3.3 | 7.0 | 13.7 | 18.6 |
| 5th backup | 9.9 | 11.4 | 19.9 | 28.0 |
| 6th backup | 3.8 | 8.0 | 16.8 | 22.0 |
| 7th backup | 5.1 | 7.8 | 14.3 | 21.6 |
| 8th backup | 9.5 | 13.5 | 18.3 | 35.0 |
| 9th backup | 4.3 | 9.0 | 15.7 | 24.9 |
| 10th backup | 7.9 | 20.2 | 32.2 | 35.0 |
| 11th backup | 4.6 | 9.1 | 16.8 | 28.1 |
| 12th backup | 7.4 | 12.0 | 21.7 | 37.4 |
| Initial backup | 13.7 | 20.7 | 26.9 | 44.2 |
| 2nd backup | 4.8 | 8.0 | 15.4 | 19.5 |
| 3rd backup | 6.9 | 11.9 | 19.6 | 29.8 |
| 4th backup | 3.3 | 7.0 | 13.7 | 18.6 |
| 5th backup | 9.9 | 11.4 | 19.9 | 28.0 |
| 6th backup | 3.8 | 8.0 | 16.8 | 22.0 |
| 7th backup | 5.1 | 7.8 | 14.3 | 21.6 |
| 8th backup | 9.5 | 13.5 | 18.3 | 35.0 |
| 9th backup | 4.3 | 9.0 | 15.7 | 24.9 |
| 10th backup | 7.9 | 20.2 | 32.2 | 35.0 |
| 11th backup | 4.6 | 9.1 | 16.8 | 28.1 |
| 12th backup | 7.4 | 12.0 | 21.7 | 37.4 |
For more details and other speed comparison results, please visit https://github.com/gilbertchen/benchmarking. There you can also find test scripts that you can use to run your own experiments.