Import data asynchronously via a Celery task when creating a table, instead of via the web-frontend

What do we want?

Be able to import large CSV/JSON/... files painlessly while creating a new table. For example, a CSV file of 400 columns x 100,000 rows should import without browser freezes or backend timeouts, and without making the user feel like something is about to break.

How is this working now?

  • You can open a modal to create a new table and configure the import: which kind of input file and how to import it.
  • The data are processed by the frontend to display a preview of the import
  • When the user validates the import configuration, the data are transformed into a generic JSON pivot format and then sent to the backend to create the table
  • The maximum file size you can import this way is 15MB
  • You can only import data into a new table

What are the issues?

For small files this is OK, but for bigger ones:

  • The import can take a very long time (7 minutes for a ~500MB CSV file, excluding upload time), which means:
    • A gunicorn worker is blocked during the whole processing -> fewer workers to respond to other requests and a risk of timeouts
    • The user can't use Baserow in the meantime
    • The user has no feedback on the progress of the import
  • The browser appears frozen during many steps of the process
  • If one row fails, the whole import fails, which can be frustrating when you want to import many rows

What is going on?

  • The frontend isn't actually that slow at processing the data, but some spinners are missing and UI state is updated too late
  • The transferred JSON format is verbose because the field name is repeated for every value of every row, so the upload can take longer than expected (e.g. a ~500MB CSV file becomes a ~800MB JSON payload)
  • On the backend, for 7 minutes of processing:
    • ~3min are spent on data validation by a DRF serializer
    • ~30s to create the models for the bulk update
    • ~3min for the bulk update itself

What can we functionally improve?

  • Show to the user that something is processing behind the scene [MAJOR]
  • Avoid the "stuck browser" effect [MAJOR]
  • Make the preview appear faster [MEDIUM]
  • Show a progress bar during the upload [MAJOR]
  • Improve upload transfer time [MINOR]
  • Show another progress bar during the backend processing like export or Airtable import [MAJOR]
  • Allow the user to close the import modal and do something else during the processing [MEDIUM]
  • Don't fail the whole import on row import error but import what is possible and show a report at the end [MAJOR]

Technical solutions

  • Frontend

    • Improve the UI by showing spinners where necessary (while creating/updating the preview, for example)
    • Change UI state BEFORE doing tasks that can freeze the browser
    • Don't parse the whole file to display the preview but just the first X lines
    • Show a progress bar while the data are uploading
    • Show a progress bar during the import processing
    • Add an interface to display currently running import (and other tasks)
    • We might be able to use web workers (not sure we'd gain much here, because we'd still need to transfer the data to the web worker)
  • Backend

    • Delay the import in a Celery task
    • Be more relaxed on data validation -> don't use a DRF serializer and allow individual rows to fail. Don't validate the data beforehand; if a row fails, let it fail and report the error
    • Send a report (by email? In the import modal? in a specific page?) to the user at the end of the processing
    • Do the import in batches of 100 rows and update the progress indicator
  • Data transfer

    • Change the JSON format to use arrays of values instead of arrays of objects
    • Transform the JSON to protobuf before sending the data
    • Gzip the data (if that's not already the case)
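The batched backend import described above can be sketched as follows. This is only an illustration: `import_rows`, `validate`, and the two callbacks are hypothetical names, not Baserow's actual API. In practice the loop would run inside the delayed Celery task and `insert_batch` would wrap a bulk insert on the generated table model.

```python
# Illustrative sketch of the batched import loop that would run inside the
# delayed Celery task. All names here are hypothetical.

BATCH_SIZE = 100

def validate(row):
    # Minimal stand-in for per-row validation: require a non-empty "name".
    if not row.get("name"):
        raise ValueError("missing name")
    return row

def import_rows(rows, insert_batch, report_progress):
    """Import rows in batches, collecting per-row failures for the final
    report instead of aborting the whole import, and reporting progress
    after each batch."""
    failed = []  # (row index, error message) pairs for the end-of-import report
    total = len(rows)
    for start in range(0, total, BATCH_SIZE):
        valid = []
        for offset, row in enumerate(rows[start:start + BATCH_SIZE]):
            try:
                valid.append(validate(row))
            except ValueError as exc:
                failed.append((start + offset, str(exc)))
        insert_batch(valid)  # e.g. a bulk_create on the generated model
        report_progress(min(start + BATCH_SIZE, total) / total)
    return failed
```

This shape directly gives the three behaviours listed above: no upfront whole-file validation, a progress update every 100 rows, and a failure report instead of an all-or-nothing import.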
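To illustrate the array-of-values idea: sending the field names once and each row as a plain array is strictly smaller than repeating the keys in every row object. The dataset below is a toy example; the ~500MB -> ~800MB figure mentioned earlier comes from real files.

```python
import gzip
import json

fields = ["name", "color", "price"]
rows = [["apple", "red", 1], ["pear", "green", 2]] * 1000

# Verbose layout: field names repeated in every row object.
as_objects = json.dumps([dict(zip(fields, r)) for r in rows])

# Compact layout: field names sent once, rows as plain arrays of values.
as_arrays = json.dumps({"fields": fields, "rows": rows})

assert len(as_arrays) < len(as_objects)

# Gzip removes much of the key repetition in both layouts, but the compact
# layout still compresses smaller and is cheaper to parse on the backend.
assert len(gzip.compress(as_arrays.encode())) < len(gzip.compress(as_objects.encode()))
```

The same reasoning is why gzipping the transfer helps even if the format change is postponed: the repeated field names compress very well.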

A few warning points

  • What about undo/redo?
  • Realtime update of the table?
  • Hide the table while importing <- use the trash flag?

Other concerns

  • We want to be able to import data into an existing table at some point -> use the batch create/update API
  • At some point we want the import to be able to guess the field types?

Questions

  • How do we store the sent data: in a file? in a database Blob?

Plans

  • v1 (with feature flag):
    • Add a long-running job manager and move the main backend processing into a job
    • Refactor Airtable job
    • No report yet
    • Improve UI responsiveness and feeling (Spinner, updates, preview, ...) [AUTONOMOUS]
  • v2 (without feature flag):
    • Send the import report to the user at the end of the processing, by email or inside the import modal
    • Show a progress bar during the upload [AUTONOMOUS]
    • Show a progress bar during the processing
    • The two progress bars must form one continuous indicator: the upload represents ~20% of the total and the server processing ~80%.
  • v3:
    • Import a file to add new / replace rows.
    • Change JSON format to use Array of values [AUTONOMOUS] [If we have time]
    • Refactor export jobs? [If we have time]
  • VNext
    • Add a UI to show which tasks are running and their statuses
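The 20%/80% weighting from the v2 plan maps the two phases onto a single continuous bar. A minimal sketch (the helper name and the weight constant are illustrative, not existing code):

```python
# Hypothetical helper mapping both phases onto one continuous progress bar:
# the upload covers the first 20%, backend processing the remaining 80%.

UPLOAD_WEIGHT = 0.2

def combined_progress(upload_fraction, processing_fraction):
    """Both fractions are in [0, 1]; returns the overall progress in [0, 1]."""
    return (UPLOAD_WEIGHT * upload_fraction
            + (1 - UPLOAD_WEIGHT) * processing_fraction)
```

With this mapping, a finished upload with no backend processing done yet shows ~20%, and halfway through the backend processing shows ~60%, so the bar never jumps backwards when the second phase starts.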
Edited by Jérémie Pardou