The missed update can cause incorrect before-cleaning cache stats
and a pre-mature condition broadcast in purgeOld before the cache
space use is reduced below the quota.
Add an exponentially increasing delay during retries up ENOSPC error
to avoid exhausting the 10 retries too soon when the cache space
recovery from item resets is not available from the file system yet
or consumed by other large cache writes.
Item reset is invoked by cache cleaner for synchronous recovery
from ENOSPC errors. The reset operation removes the cache file and
closes/reopens the downloaders. Although most parts of reset and
other item operations are done with the item mutex held, the mutex
is released during fd.WriteAt and downloaders calls. We used preAccess
and postAccess calls to serialize Reset, ReadAt, and Open, but missed
some other item operations. The patch adds preAccess/postAccess
calls in Sync, Truncate, Close, WriteAt, and rename.
A failed item reset is saved in the errItems for retryFailedResets
to process. If the item gets closed before the retry, the item may
have been removed from the c.item array. Previous code did not
account for this condition. This patch adds the check for the
exitence of the retry items in retryFailedResets.
The downloaders.Close() call acquires the downloaders' mutex before
calling the wait group wait and the main downloaders thread has a
periodical (5 seconds interval) call to kick its waiters and the
waiter dispatch function tries to get the mutex. So a deadlock can
occur if the Close() call starts, gets the mutex, while the main
downloader thread already got the timer's tick and proceeded to
call kickWaiters. The deadlock happens when the Close call gets
the mutex between the timer's kick and the main downloader thread
gets the mutex first. So it's a pretty short period of time and
it probably explains why the problem has not surfaced, maybe
something like tens of nanoseconds out of 5 seconds (~10^^-8).
It took 5 days of continued stressing the Close calls for the
deadlock to appear.
Before this change if a file was removed from the cache while rclone
is running then rclone would not notice and proceed to re-create it
full of zeros.
This change notices files that we expect to have data in going missing
and if they do logs an ERROR recovers.
It isn't recommended deleting files from the cache manually with
rclone running!
See: https://forum.rclone.org/t/corrupted-data-streaming-after-vfs-meta-files-removed/18997Fixes#4602
Before this change the error message was produced for every file which
was confusing users.
After this change we check for EOF and return from ReadAt at that
point.
See: https://forum.rclone.org/t/rclone-1-53-release/18880/10
This patch provides the support of synchronous cache space recovery
to allow read threads to recover from ENOSPC errors when cache space
can be recovered from cache items that are not in use or safe to be
reset/emptied .
The patch complements the existing cache cleaning process in two ways.
Firstly, the existing cache cleaning process is time-driven that runs
periodically. The cache space can run out while the cache cleaner
thread is still waiting for its next scheduled run. The io threads
encountering ENOSPC return an internal error to the applications
in this case even when cache space can be recovered to avoid this
error. This patch addresses this problem by having the read threads
kick the cache cleaner thread in this condition to recover cache
space preventing unnecessary ENOSPC errors from being seen by the
applications.
Secondly, this patch enhances the cache cleaner to support cache
item reset. Currently the cache purge process removes cache
items that are not in use. This may not be sufficient when the
total size of the working set exceeds the cache directory's
capacity. Like in the current code, this patch starts the purge
process by removing cache files that are not in use. Cache items
whose access times are older than vfs-cache-max-age are removed first.
After that, other not-in-use items are removed in LRU order until
vfs-cache-max-size is reached. If the vfs-cache-max-size (the quota)
is still not reached at this time, this patch adds a cache reset
step to reset/empty cache files that are still in use but not
dirtied. This enables application processes to continue without
seeing an error even when the working set depletes the cache space
as long as there is not a large write working set hoarding the
entire cache space.
By design this patch does not add ENOSPC error recovery for write
IOs. Rclone does not empty a write cache item until the file data
is written back to the backend upon close. Allowing more cache
space to be consumed by dirty cache items when the cache space is
already running low would increase the risk of exhausting the cache
space in a way that the vfs mount becomes unreadable.
Before this change we set the modtime of the cache file when all
writers had finished.
This has the unfortunate effect that the file is uploaded with the
wrong modtime which means on backends which can't set modtimes except
when uploading files it is wrong.
This change sets the modtime of the cache file immediately in the
cache and in turn sets the modtime in the file info.
Before this fix, download threads would fill up the buffer and then
timeout even though data was still being read from them. If the client
was streaming slower than network speed this caused the downloader to
stop and be restarted continuously. This caused more potential for
skips in the download and unecessary network transactions.
This patch fixes that behaviour - as long as a downloader is being
read from more often than once every 5 seconds, it won't timeout.
This was done by:
- kicking the downloader whenever ensureDownloader is called
- making the downloader loop if it has already downloaded past the maxOffset
- making setRange() always kick the downloader
This is preparation for getting the Accounting to check the context,
buf first we need to get it in place. Since this is one of those
changes that makes lots of noise, this is in a seperate commit.
Previous to the fix, if an item was being uploaded and it was renamed,
the upload would fail with missing checksum errors.
This change cancels any uploads in progress if the file is renamed.
This was caused by the signal to stop buffering being ignored when
there was no buffer!
This is fixed by explicitly checking for no buffering and stopping.
Before this change, if we restarted an upload after a restart then the
file would get uploaded but never added to the directory listings.
This change makes sure we add virtual items to the directory cache
when reloading the cache so that they show up properly.
- fix deadlock when cancelling upload
- fix double upload and panic after cancelled upload
- fix cancelation strategy of uploading files
- don't cancel uploads if we don't modify the file
- cancel uploads if we do modify the file
- fix deadlock between Item and writeback
- fix confusion about whether writeback item was being uploaded
- fix cornercases in cancelling uploads and removing files
Item
- Remove unused method getName
- Fix Truncate on unopened file
- Fix bug when downloading segments to fill out file on close
- Fix bug when WriteAt extends the file and we don't mark space as used
downloader
- Retry failed waiters every 5 seconds
- Download to multiple places at once in the stream
- Restart as necessary
- Timeout unused downloaders
- Close reader if too much data skipped
- Only use one file handle as use item for writing
- Implement --vfs-read-chunk-size and --vfs-read-chunk-limit
- fix deadlock between asyncbuffer and vfs cache downloader
- fix display of stream abandoned error which should be ignored
On file Remove
- cancel any writebacks in progress
- ignore error message deleting non existent file if file was in the
process of being uploaded
Writeback
- Don't transfer the file if it has disappeared in the meantime
- Take our own copy of the file name to avoid deadlocks
- Fix delayed retry logic
- Wait for upload to finish when cancelling upload
Fix race condition in item saving
Fix race condition in vfscache test
Make sure we delete the file on the error path - this makes cascading
failures much less likely
This allows reads to only read part of the file and it keeps on disk a
cache of what parts of each file have been loaded.
File data itself is kept in sparse files.