diff --git a/Big-syncs-with-millions-of-files.md b/Big-syncs-with-millions-of-files.md
new file mode 100644
index 0000000..374cc4e
--- /dev/null
+++ b/Big-syncs-with-millions-of-files.md
@@ -0,0 +1,51 @@
+# The problem
+
+Rclone syncs on a directory by directory basis. If you have 10,000,000 directories with 1,000 files in each, it will sync fine, but if you have a single directory with 100,000,000 files in it, you will need a lot of RAM to process it.
+
+The log then fills with entries like this:
+```
+2023/07/06 15:30:35 INFO :
+Transferred: 0 B / 0 B, -, 0 B/s, ETA -
+Elapsed time: 1m0.0s
+```
+
+Although HTTP requests are made and answered with HTTP 200 responses (visible with the `--dump-headers` option), nothing is copied.
+
+This problem exists up to at least rclone v1.64.0-beta.7132.f1a842081.
+
+# Workaround
+
+We can get around the problem as follows.
+
+- First, list the file or object names on both sides
+
+```
+rclone lsf --files-only -R src:bucket | sort > src
+rclone lsf --files-only -R dst:bucket | sort > dst
+```
+
+- Now use `comm` to find which files/objects need to be transferred and which need to be deleted
+
+```
+comm -23 src dst > need-to-transfer
+comm -13 src dst > need-to-delete
+```
+
+You now have a list of files you need to transfer from src to dst, and another list of files that are in dst but not in src and so should likely be deleted.
+
+Then break the need-to-transfer file up into chunks of (say) 10,000 lines with something like `split -l 10000 need-to-transfer need-to-transfer-` and run the following on each chunk to transfer 10,000 files at a time. The `--files-from` and `--no-traverse` flags mean that rclone won't list the source or the destination, which avoids using too much memory.
+
+```
+rclone copy src:bucket dst:bucket --files-from need-to-transfer-aa --no-traverse
+```
+
+Deletion works the same way, using the need-to-delete list.
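+
+If you need to sync changes, you can include the hash and/or size in the listing (in the format string, `p` is the path and `h` the hash; add `s` for the size):
+
+```
+rclone lsf --files-only --format "ph" -R src:bucket | sort -t';' -k1 > src
+rclone lsf --files-only --format "ph" -R dst:bucket | sort -t';' -k1 > dst
+```
+
+The `comm` tool then compares each line as a whole, so the path and hash are matched together and a file whose hash differs shows up in the output. Strip the extra field (for example with `cut -d';' -f1`) before passing a list to `--files-from`, and note that a changed file appears in both lists, so remove transferred paths from the deletion list.
\ No newline at end of file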
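The chunked transfer described in the workaround can be scripted. The following is a minimal sketch rather than part of the page itself: the function name `run_in_chunks` and the `RCLONE` override variable are hypothetical, and it assumes that `rclone delete` honours `--files-from` and `--no-traverse` in the same way as `rclone copy`, as the text suggests for deletion.

```shell
#!/bin/sh
# Sketch: process a list of paths in fixed-size chunks so that rclone never
# has to hold a full directory listing in memory.
# RCLONE is a hypothetical hook naming the command to run; setting
# RCLONE=echo prints the commands instead of executing them.
RCLONE="${RCLONE:-rclone}"

# run_in_chunks LIST LINES_PER_CHUNK RCLONE_ARGS...
# Splits LIST into chunks of LINES_PER_CHUNK lines in a temporary directory
# and runs rclone once per chunk, appending
# "--files-from CHUNK --no-traverse" to RCLONE_ARGS.
run_in_chunks() {
    list=$1
    lines=$2
    shift 2
    tmpdir=$(mktemp -d) || return 1
    split -l "$lines" "$list" "$tmpdir/chunk-" || { rm -rf "$tmpdir"; return 1; }
    status=0
    for chunk in "$tmpdir"/chunk-*; do
        [ -f "$chunk" ] || continue   # empty list: the glob matched nothing
        # Stop at the first failing chunk so the error is not hidden.
        "$RCLONE" "$@" --files-from "$chunk" --no-traverse || { status=1; break; }
    done
    rm -rf "$tmpdir"
    return "$status"
}

# Example calls (left commented out so that sourcing this file runs nothing):
#   run_in_chunks need-to-transfer 10000 copy src:bucket dst:bucket
#   run_in_chunks need-to-delete 10000 delete dst:bucket
```

Running it first with `RCLONE=echo` gives a preview of the exact commands. Stopping at the first failed chunk is a design choice here: since `rclone copy` skips files that are already identical on the destination, rerunning the whole list after a failure should be cheap.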