Resolve "Duplicate user file" using a table lock
Closes #504 (closed)
What went wrong
There was be an edge case, and the happened a couple of times in production, where a user somehow uploads the same file simultaneous. If the backend has multiple workers, it could happen that those upload requests are processed by two different workers. That could result in that the same file is uploaded twice, and would result in a duplicate entry. Because it is expected that the user file with the same hash and original name only has one entry in the core_userfile
table, it could go wrong in some places. This was for me, the only way to replicate the problem.
The fix
We acquire a SHARE UPDATE EXCLUSIVE
lock just before the user file is created. This will prevent the other worker from inserting a new entry. The lock ends when the transaction is committed. During the lock, it is possible to fetch all the other entries in the table. So it does not block user file related operations.
How tot test
Checkout the develop
branch, replace the upload_user_file
method with the code below. This introduces two sleep moments before the check if the user file already exists and during the rendering of the thumbnails. Now start the development backend server with multiple workers gunicorn --workers=5 -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker baserow.config.asgi:application
. When everything runs you can open two browser windows, and try to upload the same not existing file simultaneously. When both uploads finished, you will see that two entries of the same file have been created in the core_userfile
table. If you checkout the branch related to this merge request, add two sleeps in the same position and you will see that you upload the same not existing file simultaneously again, that it will create just one entry.
def upload_user_file(self, user, file_name, stream, storage=None):
"""
Saves the provided uploaded file in the provided storage. If no storage is
provided the default_storage will be used. An entry into the user file table
is also created.
:param user: The user on whose behalf the file is uploaded.
:type user: User
:param file_name: The provided file name when the file was uploaded.
:type file_name: str
:param stream: An IO stream containing the uploaded file.
:type stream: IOBase
:param storage: The storage where the file must be saved to.
:type storage: Storage
:raises InvalidFileStreamError: If the provided stream is invalid.
:raises FileSizeToLargeError: If the provided content is too large.
:return: The newly created user file.
:rtype: UserFile
"""
if not hasattr(stream, "read"):
raise InvalidFileStreamError("The provided stream is not readable.")
size = stream_size(stream)
if size > settings.USER_FILE_SIZE_LIMIT:
raise FileSizeTooLargeError(
settings.USER_FILE_SIZE_LIMIT, "The provided file is too large."
)
storage = storage or default_storage
hash = sha256_hash(stream)
file_name = truncate_middle(file_name, 64)
from time import sleep
sleep(5)
existing_user_file = UserFile.objects.filter(
original_name=file_name, sha256_hash=hash
).first()
if existing_user_file:
return existing_user_file
extension = pathlib.Path(file_name).suffix[1:].lower()
mime_type = mimetypes.guess_type(file_name)[0] or ""
unique = self.generate_unique(hash, extension)
# By default the provided file is not an image.
image = None
is_image = False
image_width = None
image_height = None
# Try to open the image with Pillow. If that succeeds we know the file is an
# image.
try:
image = Image.open(stream)
is_image = True
image_width = image.width
image_height = image.height
except IOError:
pass
user_file = UserFile.objects.create(
original_name=file_name,
original_extension=extension,
size=size,
mime_type=mime_type,
unique=unique,
uploaded_by=user,
sha256_hash=hash,
is_image=is_image,
image_width=image_width,
image_height=image_height,
)
from time import sleep
sleep(5)
# If the uploaded file is an image we need to generate the configurable
# thumbnails for it. We want to generate them before the file is saved to the
# storage because some storages close the stream after saving.
if image:
self.generate_and_save_image_thumbnails(image, user_file, storage=storage)
# When all the thumbnails have been generated, the image can be deleted
# from memory.
del image
# Save the file to the storage.
full_path = self.user_file_path(user_file)
storage.save(full_path, stream)
# Close the stream because we don't need it anymore.
stream.close()
return user_file
Feedback
All feedback or ideas are welcome here. If you think this could have impact on other operations, then I would love to hear about it.