# The problem
Rclone syncs on a directory by directory basis. If you have 10,000,000 directories with 1,000 files in each, it will sync fine, but if you have a single directory with 100,000,000 files in it, you will need a lot of RAM to process it.
The log then fills with entries like:
```
2023/07/06 15:30:35 INFO :
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Elapsed time: 1m0.0s
```
Although HTTP requests are made and 200 responses come back (visible with the `--dump-headers` option), no copy takes place.
This problem is still present in at least rclone v1.64.0-beta.7132.f1a842081.
# Workaround
We can get around the problem as follows.
- First, list the file or object names:
```
rclone lsf --files-only -R src:bucket | sort > src
rclone lsf --files-only -R dst:bucket | sort > dst
```
- Now use `comm` to find which files/objects need to be transferred and which need to be deleted:
```
comm -23 src dst > need-to-transfer
comm -13 src dst > need-to-delete
```
You now have a list of files you need to transfer from src to dst, and another list of files in dst that aren't in src and so should likely be deleted.
Then break the need-to-transfer file up into chunks of (say) 10,000 lines with something like `split -l 10000 need-to-transfer need-to-transfer-` and run the copy below on each chunk to transfer 10,000 files at a time. Using `--files-from` together with `--no-traverse` means rclone won't list the source or the destination, so it avoids using too much memory.
```
rclone copy src:bucket dst:bucket --files-from need-to-transfer-aa --no-traverse
```
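For instance, a minimal shell sketch that loops over all the chunks produced by `split` (assuming the `need-to-transfer-` prefix used above):
```
# Transfer each 10,000-line chunk in turn; --no-traverse avoids listing the remotes.
for chunk in need-to-transfer-*; do
    rclone copy src:bucket dst:bucket --files-from "$chunk" --no-traverse
done
```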
The same approach works for deletion.
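A minimal sketch of the deletion side, assuming `rclone delete` on the destination with `--files-from` as the filter and the same chunking:
```
# Split the deletion list into chunks and remove the listed objects from the destination.
split -l 10000 need-to-delete need-to-delete-
rclone delete dst:bucket --files-from need-to-delete-aa
```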
If you also need to sync changed files, you can include the hash and/or size in the listings:
```
rclone lsf --files-only --format "ph" -R src:bucket | sort -t';' -k1 > src
rclone lsf --files-only --format "ph" -R dst:bucket | sort -t';' -k1 > dst
```
Because each line now contains both the path and the hash, `comm` compares the two fields as one, so files whose contents have changed also end up in the need-to-transfer list.
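Note that the resulting lists now contain `path;hash` pairs, so a sketch like the following (assuming the default `;` separator from `rclone lsf`) strips the hash back off before the paths are fed to `--files-from`:
```
# Compare path+hash as a single key, then keep only the path column for rclone.
comm -23 src dst | cut -d';' -f1 > need-to-transfer
comm -13 src dst | cut -d';' -f1 > need-to-delete
```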