# The problem
Rclone syncs on a directory by directory basis. If you have 10,000,000 directories with 1,000 files in each, it will sync fine, but if you have a single directory with 100,000,000 files in it, you will need a lot of RAM to process it.
The log then fills with entries like:
```
2023/07/06 15:30:35 INFO :
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Elapsed time: 1m0.0s
```
Although HTTP requests are made and 200 responses come back (visible with the `--dump-headers` option), no copy takes place.
This problem is still present in at least rclone v1.64.0-beta.7132.f1a842081.
# Workaround
We can get around the problem as follows.
- First, list the file or object names:
```
rclone lsf --files-only -R src:bucket | sort > src
rclone lsf --files-only -R dst:bucket | sort > dst
```
- Now use `comm` to find which files/objects need to be transferred and which need to be deleted:
```
comm -23 src dst > need-to-transfer
comm -13 src dst > need-to-delete
```
You now have a list of files you need to transfer from src to dst, and another list of files in dst that aren't in src and so should likely be deleted.
Then break the need-to-transfer file up into chunks of (say) 10,000 lines with something like `split -l 10000 need-to-transfer need-to-transfer-` and run the copy below on each chunk to transfer 10,000 files at a time. Using `--files-from` together with `--no-traverse` means rclone won't list the source or the destination, so it avoids using too much memory.
```
rclone copy src:bucket dst:bucket --files-from need-to-transfer-aa --no-traverse
```
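For instance, a minimal shell sketch that loops over all the chunks produced by `split` (assuming the `need-to-transfer-` prefix used above):
```
# Transfer each 10,000-line chunk in turn; --no-traverse avoids listing the remotes.
for chunk in need-to-transfer-*; do
    rclone copy src:bucket dst:bucket --files-from "$chunk" --no-traverse
done
```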
The same approach works for deletion.
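A minimal sketch of the deletion side, assuming `rclone delete` on the destination with `--files-from` as the filter and the same chunking:
```
# Split the deletion list into chunks and remove the listed objects from the destination.
split -l 10000 need-to-delete need-to-delete-
rclone delete dst:bucket --files-from need-to-delete-aa
```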
If you also need to sync changed files, you can include the hash and/or size in the listings:
```
rclone lsf --files-only --format "ph" -R src:bucket | sort -t';' -k1 > src
rclone lsf --files-only --format "ph" -R dst:bucket | sort -t';' -k1 > dst
```
Because each line now contains both the path and the hash, `comm` compares the two fields as one, so files whose contents have changed also end up in the need-to-transfer list.
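Note that the resulting lists now contain `path;hash` pairs, so a sketch like the following (assuming the default `;` separator from `rclone lsf`) strips the hash back off before the paths are fed to `--files-from`:
```
# Compare path+hash as a single key, then keep only the path column for rclone.
comm -23 src dst | cut -d';' -f1 > need-to-transfer
comm -13 src dst | cut -d';' -f1 > need-to-delete
```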