Create DESIGN.md

2025-12-06 00:03:38 +00:00 · 2016-02-23 12:19:43 -05:00
parent 73e5b398a4
commit 9f816547b4
1 changed files with 89 additions and 0 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -0,0 +1,89 @@
+## Lock-free deduplication
+
+## Snapshot Format
+
+A snapshot file is a file that the backup procedure uploads to the file storage after it finishes breaking files into
+chunks and uploading all new chunks. It mainly contains metadata for the backup overall, metadata for all the files,
+and chunk references for each file. Here is an example snapshot file for a repository containing 3 files (file1, file2,
+and dir1/file3):
+
+```json
+{
+  "id": "host1",
+  "revision": 1,
+  "tag": "first",
+  "start_time": 1455590487,
+  "end_time": 1455590487,
+  "files": [
+    {
+      "path": "file1",
+      "content": "0:0:2:6108",
+      "hash": "a533c0398194f93b90bd945381ea4f2adb0ad50bd99fd3585b9ec809da395b51",
+      "size": 151901,
+      "time": 1455590487,
+      "mode": 420
+    },
+    {
+      "path": "file2",
+      "content": "2:6108:3:7586",
+      "hash": "f6111c1562fde4df9c0bafe2cf665778c6e25b49bcab5fec63675571293ed644",
+      "size": 172071,
+      "time": 1455590487,
+      "mode": 420
+    },
+    {
+      "path": "dir1/",
+      "size": 102,
+      "time": 1455590487,
+      "mode": 2147484096
+    },
+    {
+      "path": "dir1/file3",
+      "content": "3:7586:4:1734",
+      "hash": "6bf9150424169006388146908d83d07de413de05d1809884c38011b2a74d9d3f",
+      "size": 118457,
+      "time": 1455590487,
+      "mode": 420
+    }
+  ],
+  "chunks": [
+    "9f25db00881a10a8e7bcaa5a12b2659c2358a579118ea45a73c2582681f12919",
+    "6e903aace6cd05e26212fcec1939bb951611c4179c926351f3b20365ef2c212f",
+    "4b0d017bce5491dbb0558c518734429ec19b8a0d7c616f68ddf1b477916621f7",
+    "41841c98800d3b9faa01b1007d1afaf702000da182df89793c327f88a9aba698",
+    "7c11ee13ea32e9bb21a694c5418658b39e8894bbfecd9344927020a9e3129718"
+  ],
+  "lengths": [
+    64638,
+    81155,
+    170593,
+    124309,
+    1734
+  ]
+}
+```
+
+When Duplicacy splits a file into chunks, if the end of the file is reached before the boundary marker that terminates a
+chunk has been found, the next file, if there is one, is read in and the chunking algorithm continues. It is as if all
+files were packed into one big zip file that is then split into chunks.
+
+The *content* field of a file gives the indexes of its starting and ending chunks and the corresponding offsets. For
+instance, *file1* starts at chunk 0 offset 0 and ends at chunk 2 offset 6108, immediately followed by *file2*.
+
+The backup procedure can run in one of two modes. In the quick mode, only modified or new files are scanned. Chunks
+referenced only by the old versions of modified files are removed from the chunk sequence, and chunks referenced by new
+files are then appended. Because this shifts chunk positions, the chunk indexes of unchanged files must be updated too.
+
+In the safe mode, all files are scanned and the chunk sequence is regenerated.
+
+The length sequence stores the lengths of all chunks, which are needed when calculating statistics such as the total
+length of chunks. For a repository containing a large number of files, the snapshot file can be very large.
+To make matters worse, a big snapshot file must be uploaded every time, even if only a few files have changed since
+the last backup. To save space, the variable-size chunking algorithm is also applied to the three dynamic fields of a
+snapshot file: *files*, *chunks*, and *lengths*.
+
+Chunks produced during this step are deduplicated and uploaded in the same way as regular file chunks. The final snapshot
+file contains sequences of chunk hashes along with the other, fixed-size fields:
+
+```json
+{
+  "id": "host1",
+  "revision": 1,
+  "start_time": 1455590487,
+  "tag": "first",
+  "end_time": 1455590487,
+  "file_sequence": [
+    "21e4c69f3832e32349f653f31f13cefc7c52d52f5f3417ae21f2ef5a479c3437"
+  ],
+  "chunk_sequence": [
+    "8a36ffb8f4959394fd39bba4f4a464545ff3dd6eed642ad4ccaa522253f2d5d6"
+  ],
+  "length_sequence": [
+    "fc2758ae60a441c244dae05f035136e6dd33d3f3a0c5eb4b9025a9bed1d0c328"
+  ]
+}
+```
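
As a check on the *content* field described under Snapshot Format, the sketch below recomputes each file's size from its content reference and the *lengths* array of the example snapshot. It assumes the four colon-separated numbers are start chunk, start offset, end chunk, and end offset, and that the end offset is exclusive. The function names `parse_content` and `file_size` are made up for this illustration and are not taken from Duplicacy.

```python
def parse_content(content):
    """Split "startChunk:startOffset:endChunk:endOffset" into four ints.

    The field order is an assumption inferred from the DESIGN.md example.
    """
    start_chunk, start_offset, end_chunk, end_offset = (
        int(part) for part in content.split(":")
    )
    return start_chunk, start_offset, end_chunk, end_offset


def file_size(content, lengths):
    """Return the number of bytes a file occupies in the chunk stream.

    Assumes the end offset is exclusive, i.e. it counts the bytes the
    file uses in its last chunk.
    """
    start_chunk, start_offset, end_chunk, end_offset = parse_content(content)
    if start_chunk == end_chunk:
        # The file begins and ends inside the same chunk.
        return end_offset - start_offset
    size = lengths[start_chunk] - start_offset        # tail of the first chunk
    size += sum(lengths[start_chunk + 1:end_chunk])   # whole chunks in between
    size += end_offset                                # head of the last chunk
    return size


# Values copied from the example snapshot.
lengths = [64638, 81155, 170593, 124309, 1734]
files = {
    "file1": ("0:0:2:6108", 151901),
    "file2": ("2:6108:3:7586", 172071),
    "dir1/file3": ("3:7586:4:1734", 118457),
}

for path, (content, recorded_size) in files.items():
    assert file_size(content, lengths) == recorded_size, path
```

Under these assumptions the computed sizes agree with the `size` fields recorded for file1, file2, and dir1/file3 in the example, which supports reading the end offset as exclusive.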