Mirror of https://github.com/gilbertchen/duplicacy, synced 2025-12-06. Compare view: 39 commits.

What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing chunks and for determining which chunks are no longer needed. Instead, to check whether a chunk has already been uploaded, one can simply perform a file lookup via the file storage API, using the file name derived from the hash of the chunk. This effectively turns a cloud storage offering only a very limited set of basic file operations into a powerful modern backup backend capable of both block-level and file-level deduplication. More importantly, the absence of a centralized indexing database means that there is no need to implement a distributed locking mechanism on top of the file storage.

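A minimal Go sketch of this check, assuming a hypothetical Storage interface and naming scheme; Duplicacy's actual hash function and storage layout may differ:

```go
package chunks

import (
	"crypto/sha256"
	"encoding/hex"
)

// Storage is a hypothetical abstraction over a cloud file API offering
// only basic operations (lookup and upload).
type Storage interface {
	Exists(name string) (bool, error)
	Upload(name string, data []byte) error
}

// SaveChunk uploads a chunk only if no file named after its hash already
// exists, so concurrent clients deduplicate without locks or a central index.
func SaveChunk(storage Storage, chunk []byte) error {
	hash := sha256.Sum256(chunk)
	name := "chunks/" + hex.EncodeToString(hash[:])
	exists, err := storage.Exists(name)
	if err != nil || exists {
		return err // already uploaded, possibly by another client
	}
	return storage.Upload(name, chunk)
}
```
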
By eliminating the chunk indexing database, lock-free deduplication not only reduces code complexity but also makes deduplication less error-prone. Each chunk is saved individually in its own file and, once saved, never needs modification; data corruption is therefore less likely to occur because of the immutability of chunk files. Another benefit that comes naturally from lock-free deduplication is that when one client creates a new chunk, other clients that happen to have the same original file will notice that the chunk already exists and therefore will not upload the same chunk again. This pushes deduplication to its highest level: clients without knowledge of each other can share identical chunks with no extra effort.

If a chunk boundary hasn't been found, the next file, if there is one, will be read in and the chunking continues, as if all files were packed into a big tar file which is then split into chunks.

The *content* field of a file indicates the indexes of starting and ending chunks and the corresponding offsets. For instance, *file1* starts at chunk 0 offset 0 and ends at chunk 2 offset 6108, immediately followed by *file2*.

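A hypothetical Go sketch of what such an entry could look like; the field names are made up, and file2's end position is invented for illustration:

```go
// FileEntry illustrates the *content* field: a file's bytes span from
// (StartChunk, StartOffset) to (EndChunk, EndOffset) in the chunk sequence.
type FileEntry struct {
	Path        string
	StartChunk  int // index of the first chunk containing this file
	StartOffset int // byte offset within the first chunk
	EndChunk    int // index of the last chunk
	EndOffset   int // byte offset within the last chunk
}

// file1 occupies chunk 0 offset 0 through chunk 2 offset 6108; file2 then
// begins exactly at chunk 2 offset 6108 (its end values here are made up).
var entries = []FileEntry{
	{Path: "file1", StartChunk: 0, StartOffset: 0, EndChunk: 2, EndOffset: 6108},
	{Path: "file2", StartChunk: 2, StartOffset: 6108, EndChunk: 4, EndOffset: 1024},
}
```
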
The backup procedure can run in one of two modes. In the default quick mode, only modified or new files are scanned. Chunks referenced only by old files that have been modified are removed from the chunk sequence, and then chunks referenced by new files are appended.

Chunk content is encrypted by AES-GCM, with an encryption key that is the HMAC-SHA256 of the chunk hash with the *Chunk Key* as the secret key.

The snapshot is encrypted by AES-GCM too, using an encryption key that is the HMAC-SHA256 of the file path with the *File Key* as the secret key.

These four random keys are saved in a file named 'config' in the storage, encrypted with a master key derived from the PBKDF2 function on the storage password chosen by the user.

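A minimal Go sketch of these key derivations, assuming the golang.org/x/crypto/pbkdf2 package; the salt and iteration count are illustrative, and the AES-GCM encryption step itself is omitted:

```go
package keys

import (
	"crypto/hmac"
	"crypto/sha256"

	"golang.org/x/crypto/pbkdf2"
)

// deriveKey computes HMAC-SHA256(message) keyed with secret: with the
// *Chunk Key* as secret and a chunk hash as message it yields that chunk's
// encryption key; with the *File Key* and a file path, a snapshot key.
func deriveKey(secret, message []byte) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write(message)
	return mac.Sum(nil)
}

// masterKey derives the key protecting the 'config' file from the storage
// password via PBKDF2 (salt and iteration count here are illustrative).
func masterKey(password, salt []byte, iterations int) []byte {
	return pbkdf2.Key(password, salt, iterations, 32, sha256.New)
}
```
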
## GUIDE.md (47 lines changed)

```
OPTIONS:
  -hash                      detect file differences by hash (rather than size and timestamp)
  -t <tag>                   assign a tag to the backup
  -stats                     show statistics during and after backup
  -threads <n>               number of uploading threads (Backblaze only)
  -limit-rate <kB/s>         the maximum upload rate (in kilobytes/sec)
  -vss                       enable the Volume Shadow Copy service (Windows only)
  -storage <storage name>    backup to the specified storage instead of the default one
```

Otherwise, every file is scanned to detect changes.

You can assign a tag to the snapshot so that later you can refer to it by tag in other commands.

If the -stats option is specified, statistical information such as the transfer speed and the number of chunks will be displayed throughout the backup procedure.

The -threads option can be used to specify more than one thread to upload chunks. Currently this option is available only when the Backblaze B2 storage is selected.

The -limit-rate option sets a cap on the maximum upload rate.

The -vss option works on Windows only to turn on the Volume Shadow Copy service so that files opened by other processes with exclusive locks can be read as usual.

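For instance, these options can be combined; the thread count, rate, and tag below are illustrative:

```
$ duplicacy backup -stats -threads 4 -limit-rate 1024 -t nightly
```
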
```
OPTIONS:
  -overwrite                 overwrite existing files in the repository
  -delete                    delete files not in the snapshot
  -stats                     show statistics during and after restore
  -limit-rate <kB/s>         the maximum download rate (in kilobytes/sec)
  -storage <storage name>    restore from the specified storage instead of the default one
```

The -delete option indicates that files not in the snapshot will be removed.

If the -stats option is specified, statistical information such as the transfer speed and the number of chunks will be displayed throughout the restore procedure.

The -limit-rate option sets a cap on the maximum download rate.

When the repository has multiple storages (added by the *add* command), you can select the storage to restore from by specifying the storage name.

Unlike the *backup* procedure, which reads the include/exclude patterns from a file, the *restore* procedure reads them from the command line arguments.

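As a hypothetical illustration (the -r option for selecting a revision is assumed from the full guide; quoting keeps the shell from expanding the wildcards), this restores revision 1 while keeping only files under foo/:

```
$ duplicacy restore -r 1 -overwrite -stats '+foo/*' '-*'
```
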
```
OPTIONS:
  -fossils                   search fossils if a chunk can't be found
  -resurrect                 turn referenced fossils back into chunks
  -files                     verify the integrity of every file
  -stats                     show deduplication statistics (implies -all and all revisions)
  -storage <storage name>    retrieve snapshots from the specified storage
```

The *check* command checks, for each specified snapshot, that all referenced chunks exist in the storage.

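For example, to verify every file and show deduplication statistics across all revisions (illustrative):

```
$ duplicacy check -files -stats
```
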
## Include/Exclude Patterns

An include pattern starts with +, and an exclude pattern starts with -. Patterns may contain wildcard characters such as * and ? with their normal meaning.

When matching a path against a list of patterns, the path is compared with the part after + or -, one pattern at a time. Therefore, the order of the patterns is significant. If a match with an include pattern is found, the path is said to be included without further comparisons. If a match with an exclude pattern is found, the path is said to be excluded without further comparisons. If no match is found, the path will be excluded if all patterns are include patterns, but included otherwise.

Patterns ending with a / apply to directories only, and patterns not ending with a / apply to files only. When a directory is excluded, all files and subdirectories under it will also be excluded. Note that the path separator is always /, even on Windows.

The following pattern list includes only files under the directory foo/ but not files under the subdirectory foo/bar:

```
-foo/bar/
+foo/*
-*
```

For the *backup* command, the include/exclude patterns are read from a file named *filters* under the *.duplicacy* directory.

For the *restore* command, the include/exclude patterns are specified as command line arguments.

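A minimal Go sketch of the ordering and default rules, under assumed wildcard semantics where * may cross path separators; the directory rule (an excluded directory prunes everything beneath it during traversal) is not modeled here, and Duplicacy's actual matcher may differ:

```go
package filters

// match reports whether pattern (with * and ?) matches the whole path;
// in this sketch, * may match any run of characters, including '/'.
func match(pattern, path string) bool {
	if pattern == "" {
		return path == ""
	}
	switch pattern[0] {
	case '*':
		for i := 0; i <= len(path); i++ {
			if match(pattern[1:], path[i:]) {
				return true
			}
		}
		return false
	case '?':
		return path != "" && match(pattern[1:], path[1:])
	default:
		return path != "" && path[0] == pattern[0] && match(pattern[1:], path[1:])
	}
}

// IsIncluded tries patterns in order and lets the first match decide.
// With no match at all, the path is excluded only when every pattern is
// an include pattern.
func IsIncluded(path string, patterns []string) bool {
	allInclude := true
	for _, p := range patterns {
		include := p[0] == '+'
		if !include {
			allInclude = false
		}
		if match(p[1:], path) {
			return include
		}
	}
	return !allInclude
}
```
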
## Managing Passwords

Duplicacy will attempt to retrieve the storage password and the storage-specific access tokens/keys in three ways:

* If a secret vault service is available, Duplicacy will store passwords/keys entered by the user in such a secret vault and later retrieve them when needed. On Mac OS X it is Keychain, and on Linux it is gnome-keyring. On Windows the passwords/keys are encrypted and decrypted by the Data Protection API, and the encrypted passwords/keys are stored in the file *.duplicacy/keyring*. However, if the -no-save-password option is specified for the storage, then Duplicacy will not save passwords this way.
* If an environment variable for a password is provided, Duplicacy will always take it. The table below shows the name of the environment variable for each kind of password. Note that if the storage is not the default one, the storage name will be included in the name of the environment variable.
* If a matching key and its value are saved to the preference file (.duplicacy/preferences) by the *set* command, the value will be used as the password. The last column in the table below lists the name of the preference key for each type of password.

| password type | environment variable (default storage) | environment variable (non-default storage) | key in preferences |
|:----------------:|:----------------:|:----------------:|:----------------:|
| storage password | DUPLICACY_PASSWORD | DUPLICACY_<STORAGENAME>_PASSWORD | password |
| sftp password | DUPLICACY_SSH_PASSWORD | DUPLICACY_<STORAGENAME>_SSH_PASSWORD | ssh_password |
| sftp key file | DUPLICACY_SSH_KEY_FILE | DUPLICACY_<STORAGENAME>_SSH_KEY_FILE | ssh_keyfile |
| Dropbox Token | DUPLICACY_DROPBOX_TOKEN | DUPLICACY_<STORAGENAME>_DROPBOX_TOKEN | dropbox_token |
| S3 Access ID | DUPLICACY_S3_ID | DUPLICACY_<STORAGENAME>_S3_ID | s3_id |
| S3 Secret Key | DUPLICACY_S3_SECRET | DUPLICACY_<STORAGENAME>_S3_SECRET | s3_secret |
| Backblaze Account ID | DUPLICACY_B2_ID | DUPLICACY_<STORAGENAME>_B2_ID | b2_id |
| Backblaze Application Key | DUPLICACY_B2_KEY | DUPLICACY_<STORAGENAME>_B2_KEY | b2_key |
| Azure Access Key | DUPLICACY_AZURE_KEY | DUPLICACY_<STORAGENAME>_AZURE_KEY | azure_key |
| Google Drive Token File | DUPLICACY_GCD_TOKEN | DUPLICACY_<STORAGENAME>_GCD_TOKEN | gcd_token |
| Microsoft OneDrive Token File | DUPLICACY_ONE_TOKEN | DUPLICACY_<STORAGENAME>_ONE_TOKEN | one_token |
| Hubic Token File | DUPLICACY_HUBIC_TOKEN | DUPLICACY_<STORAGENAME>_HUBIC_TOKEN | hubic_token |

Note that passwords stored in environment variables and the preferences file must be in plaintext and are thus insecure; avoid them whenever possible.

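For example, to supply the storage password for a non-default storage named s3 without an interactive prompt (the value below is illustrative):

```
$ export DUPLICACY_S3_PASSWORD='my storage password'
$ duplicacy backup -storage s3
```
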
## README.md (81 lines changed)

# Duplicacy: A lock-free deduplication cloud backup tool

Duplicacy is a new generation cross-platform cloud backup tool based on the idea of [Lock-Free Deduplication](https://github.com/gilbertchen/duplicacy-beta/blob/master/DESIGN.md). It is the only cloud backup tool that allows multiple computers to back up to the same storage simultaneously without using any locks (thus readily amenable to various cloud storage services).

## Overview

The repository hosts design documents as well as binary releases of the command line version. There is also a Duplicacy GUI frontend built for Windows and Mac OS X, downloadable from https://duplicacy.com. The source code of the command line version is available to commercial users of the Duplicacy GUI version upon request.

## Features

Duplicacy currently supports major cloud storage providers (Amazon S3, Google Cloud Storage, Microsoft Azure, Dropbox, Backblaze, Google Drive, Microsoft OneDrive, and Hubic) and offers all essential features of a modern backup tool:

* Incremental backup: only back up what has been changed
* Full snapshot: although each backup is incremental, it must behave like a full snapshot for easy restore and deletion
* Deduplication: identical files must be stored as one copy (file-level deduplication), and identical parts from different files must be stored as one copy (block-level deduplication)
* Encryption: encrypt not only file contents but also file paths, sizes, times, etc.
* Deletion: every backup can be deleted independently without affecting others
* Concurrent access: multiple clients can back up to the same storage at the same time
* Snapshot migration: all or selected snapshots can be migrated from one storage to another

The key idea of **Lock-Free Deduplication** can be summarized as follows:

* Use a variable-size chunking algorithm to split files into chunks (see the sketch below)
* Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without using a centralized indexing database

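To illustrate the first bullet, here is a generic content-defined chunker using a gear-style rolling hash; this sketches the technique, not Duplicacy's actual algorithm, and the mask and size parameters are assumptions:

```go
package chunking

import "math/rand"

// gear maps each byte value to a pseudo-random 64-bit constant; old bytes
// age out of the hash after 64 shifts, so no explicit window is needed.
var gear [256]uint64

func init() {
	rng := rand.New(rand.NewSource(1)) // fixed seed: the table must be stable
	for i := range gear {
		gear[i] = rng.Uint64()
	}
}

// Split cuts data at content-defined boundaries, so inserting or deleting
// bytes only shifts chunk boundaries locally instead of rechunking everything.
func Split(data []byte) [][]byte {
	const (
		mask    = uint64(1)<<20 - 1 // ~1 MiB average chunk size (assumed)
		minSize = 64 * 1024         // no boundaries inside tiny chunks (assumed)
	)
	var chunks [][]byte
	start := 0
	var h uint64
	for i := 0; i < len(data); i++ {
		h = h<<1 + gear[data[i]]
		if i+1-start >= minSize && h&mask == 0 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}
```
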
```
$ duplicacy copy -r 1 -to s3 # Copy snapshot at revision 1 to the s3 storage
$ duplicacy copy -to s3      # Copy every snapshot to the s3 storage
```

The [User Guide](https://github.com/gilbertchen/duplicacy-beta/blob/master/GUIDE.md) contains a complete reference to all commands and other features of Duplicacy.

## Storages

#### SFTP

```
Storage URL: sftp://username@server/path/to/storage
```

Login methods include password authentication and public key authentication. Due to a limitation of the underlying Go SSH library, the key pair for public key authentication must be generated without a passphrase. To work with a key that has a passphrase, you can set up SSH agent forwarding, which is also supported by Duplicacy.

#### Dropbox

For Duplicacy to access your Dropbox storage, you must provide an access token that can be obtained in one of two ways:

* Or authorize Duplicacy to access its app folder inside your Dropbox (following [this link](https://dl.dropboxusercontent.com/u/95866350/start_dropbox_token.html)), and Dropbox will generate the access token (which is not visible to us, as the redirect page showing the token is merely a static html page hosted by Dropbox)

Dropbox has two advantages over other cloud providers. First, if you are already a paid user, then using the unused space as backup storage is basically free. Second, unlike other providers, Dropbox does not charge bandwidth or API usage fees.

#### Amazon S3

#### Microsoft Azure

```
Storage URL: azure://account/container
```

You'll need to input the access key when prompted.

#### Backblaze

```
Storage URL: b2://bucket
```

You'll need to input the account id and application key.

Backblaze's B2 storage is not only the least expensive (at 0.5 cents per GB per month) but also the fastest. We have been working closely with their developers to leverage the full potential of the B2 API in order to maximize the transfer speed. As a result, the B2 storage is the only one to support the multi-threading option, which can easily max out your upload link.

#### Google Drive

```
Storage URL: gcd://path/to/storage
```

To use Google Drive as the storage, you first need to download a token file from https://duplicacy.com/gcd_start by authorizing Duplicacy to access your Google Drive, and then enter the path to this token file to Duplicacy when prompted.

#### Microsoft OneDrive

```
Storage URL: one://path/to/storage
```

To use Microsoft OneDrive as the storage, you first need to download a token file from https://duplicacy.com/one_start by authorizing Duplicacy to access your OneDrive, and then enter the path to this token file to Duplicacy when prompted.

#### Hubic

```
Storage URL: hubic://path/to/storage
```

To use Hubic as the storage, you first need to download a token file from https://duplicacy.com/hubic_start by authorizing Duplicacy to access your Hubic drive, and then enter the path to this token file to Duplicacy when prompted.

Hubic offers the most free space (25GB) of all major cloud providers and there is no bandwidth charge (same as Google Drive and OneDrive), so it may be worth a try.

## Comparison with Other Backup Tools

A command to delete old backups is in the developer's plan.

The following table compares the feature lists of all these backup tools:

| Feature/Tool | duplicity | bup | Obnam | Attic | restic | **Duplicacy** |
|:------------------:|:---------:|:---:|:-----------------:|:---------------:|:-----------------:|:-------------:|
| Incremental Backup | Yes | Yes | Yes | Yes | Yes | **Yes** |
| Full Snapshot | No | Yes | Yes | Yes | Yes | **Yes** |
| Deduplication | Weak | Yes | Weak | Yes | Yes | **Yes** |
| Encryption | Yes | Yes | Yes | Yes | Yes | **Yes** |
| Deletion | No | No | Yes | Yes | No | **Yes** |
| Concurrent Access | No | No | Exclusive locking | Not recommended | Exclusive locking | **Lock-free** |
| Cloud Support | Extensive | No | No | No | S3 only | **S3, GCS, Azure, Dropbox, Backblaze, Google Drive, OneDrive, and Hubic** |
| Snapshot Migration | No | No | No | No | No | **Yes** |

## License

Duplicacy CLI is free for personal use without restrictions.

For commercial use, a valid [commercial license](https://duplicacy.com/buy.html) is required for each computer on which backups will be created. There are no restrictions if Duplicacy CLI is used to restore files from backups or check the integrity of backups.