Skip to content

Resolve "Duplicate user file" using a table lock

Bram Wiepjes requested to merge 504-duplicate-user-file into develop

Closes #504 (closed)

What went wrong

There was be an edge case, and the happened a couple of times in production, where a user somehow uploads the same file simultaneous. If the backend has multiple workers, it could happen that those upload requests are processed by two different workers. That could result in that the same file is uploaded twice, and would result in a duplicate entry. Because it is expected that the user file with the same hash and original name only has one entry in the core_userfile table, it could go wrong in some places. This was for me, the only way to replicate the problem.

The fix

We acquire a SHARE UPDATE EXCLUSIVE lock just before the user file is created. This will prevent the other worker from inserting a new entry. The lock ends when the transaction is committed. During the lock, it is possible to fetch all the other entries in the table. So it does not block user file related operations.

How tot test

Checkout the develop branch, replace the upload_user_file method with the code below. This introduces two sleep moments before the check if the user file already exists and during the rendering of the thumbnails. Now start the development backend server with multiple workers gunicorn --workers=5 -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker baserow.config.asgi:application. When everything runs you can open two browser windows, and try to upload the same not existing file simultaneously. When both uploads finished, you will see that two entries of the same file have been created in the core_userfile table. If you checkout the branch related to this merge request, add two sleeps in the same position and you will see that you upload the same not existing file simultaneously again, that it will create just one entry.

    def upload_user_file(self, user, file_name, stream, storage=None):
        """
        Saves the provided uploaded file in the provided storage. If no storage is
        provided the default_storage will be used. An entry into the user file table
        is also created.

        :param user: The user on whose behalf the file is uploaded.
        :type user: User
        :param file_name: The provided file name when the file was uploaded.
        :type file_name: str
        :param stream: An IO stream containing the uploaded file.
        :type stream: IOBase
        :param storage: The storage where the file must be saved to.
        :type storage: Storage
        :raises InvalidFileStreamError: If the provided stream is invalid.
        :raises FileSizeToLargeError: If the provided content is too large.
        :return: The newly created user file.
        :rtype: UserFile
        """

        if not hasattr(stream, "read"):
            raise InvalidFileStreamError("The provided stream is not readable.")

        size = stream_size(stream)

        if size > settings.USER_FILE_SIZE_LIMIT:
            raise FileSizeTooLargeError(
                settings.USER_FILE_SIZE_LIMIT, "The provided file is too large."
            )

        storage = storage or default_storage
        hash = sha256_hash(stream)
        file_name = truncate_middle(file_name, 64)

        from time import sleep
        sleep(5)

        existing_user_file = UserFile.objects.filter(
            original_name=file_name, sha256_hash=hash
        ).first()

        if existing_user_file:
            return existing_user_file

        extension = pathlib.Path(file_name).suffix[1:].lower()
        mime_type = mimetypes.guess_type(file_name)[0] or ""
        unique = self.generate_unique(hash, extension)

        # By default the provided file is not an image.
        image = None
        is_image = False
        image_width = None
        image_height = None

        # Try to open the image with Pillow. If that succeeds we know the file is an
        # image.
        try:
            image = Image.open(stream)
            is_image = True
            image_width = image.width
            image_height = image.height
        except IOError:
            pass

        user_file = UserFile.objects.create(
            original_name=file_name,
            original_extension=extension,
            size=size,
            mime_type=mime_type,
            unique=unique,
            uploaded_by=user,
            sha256_hash=hash,
            is_image=is_image,
            image_width=image_width,
            image_height=image_height,
        )

        from time import sleep
        sleep(5)

        # If the uploaded file is an image we need to generate the configurable
        # thumbnails for it. We want to generate them before the file is saved to the
        # storage because some storages close the stream after saving.
        if image:
            self.generate_and_save_image_thumbnails(image, user_file, storage=storage)

            # When all the thumbnails have been generated, the image can be deleted
            # from memory.
            del image

        # Save the file to the storage.
        full_path = self.user_file_path(user_file)
        storage.save(full_path, stream)

        # Close the stream because we don't need it anymore.
        stream.close()

        return user_file

Feedback

All feedback or ideas are welcome here. If you think this could have impact on other operations, then I would love to hear about it.

Edited by Bram Wiepjes

Merge request reports