Import data asynchronously via a Celery task when creating a table, instead of via the web-frontend

What do we want?

Be able to import large CSV/JSON/... files painlessly while creating a new table. For example, a CSV file of 400 columns x 100,000 rows should import without browser freezes or backend timeouts, and without making the user feel like something is about to break.

How is this working now?

  • You can open a modal to create a new table and configure the import: which kind of input file and how to import it.
  • The data are processed by the frontend to display a preview of the import
  • When the user validates the import configuration, the data are transformed into a generic JSON pivot format and then sent to the backend to create the table
  • The maximum file size you can import this way is 15MB
  • You can only import data into a new table

What are the issues?

For small files this is OK, but for bigger ones:

  • The import can take a very long time (7 minutes for a ~500MB CSV file, excluding upload time), which means:
    • A gunicorn worker is blocked during the whole processing -> fewer workers to respond to other requests and a risk of timeouts
    • The user can't use Baserow in the meantime
    • The user has no feedback on the progress of the import
  • The browser appears frozen during many steps of the process
  • If one row fails, the whole import fails, which can be frustrating when you want to import many rows

What is going on?

  • The frontend isn't actually that slow at processing the data, but some spinners are missing and UI state is updated too late
  • The transferred JSON format is verbose because the field name is repeated for every value of every row, so the upload can take longer than expected (e.g. a ~500MB CSV file becomes a ~800MB JSON payload)
  • On the backend, for 7 minutes of processing:
    • ~3min are spent on data validation by a DRF serializer
    • ~30s to create the models for the bulk update
    • ~3min for the bulk update itself

What can we functionally improve?

  • Show to the user that something is processing behind the scene [MAJOR]
  • Avoid the "stuck browser" effect [MAJOR]
  • Make the preview appear faster [MEDIUM]
  • Show a progress bar during the upload [MAJOR]
  • Improve upload transfer time [MINOR]
  • Show another progress bar during the backend processing like export or Airtable import [MAJOR]
  • Allow the user to close the import modal and do something else during the processing [MEDIUM]
  • Don't fail the whole import on row import error but import what is possible and show a report at the end [MAJOR]

Technical solutions

  • Frontend

    • Improve the UI by showing spinners where necessary (while creating/updating the preview, for example)
    • Change UI state BEFORE doing tasks that can freeze the browser
    • Don't parse the whole file to display the preview but just the first X lines
    • Show a progress bar while the data are uploading
    • Show a progress bar during the import processing
    • Add an interface to display currently running import (and other tasks)
    • We might be able to use web workers (not sure we'd gain much here, because we'd still need to transfer the data to the web worker)
  • Backend

    • Delay the import in a Celery task
    • Be more relaxed on data validation -> don't use a DRF serializer and allow individual rows to fail. Don't validate the data beforehand; if a row fails, let it fail and report the error
    • Send a report (by email? In the import modal? in a specific page?) to the user at the end of the processing
    • Do the import in batches of 100 rows and update the progress indicator
  • Data transfer

    • Change the JSON format to use arrays of values instead of arrays of objects
    • Transform the JSON to protobuf before sending the data
    • Gzip the data (if that's not already the case)
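The batched backend import described above can be sketched as follows. This is only an illustration: `import_rows`, `validate`, and the two callbacks are hypothetical names, not Baserow's actual API. In practice the loop would run inside the delayed Celery task and `insert_batch` would wrap a bulk insert on the generated table model.

```python
# Illustrative sketch of the batched import loop that would run inside the
# delayed Celery task. All names here are hypothetical.

BATCH_SIZE = 100

def validate(row):
    # Minimal stand-in for per-row validation: require a non-empty "name".
    if not row.get("name"):
        raise ValueError("missing name")
    return row

def import_rows(rows, insert_batch, report_progress):
    """Import rows in batches, collecting per-row failures for the final
    report instead of aborting the whole import, and reporting progress
    after each batch."""
    failed = []  # (row index, error message) pairs for the end-of-import report
    total = len(rows)
    for start in range(0, total, BATCH_SIZE):
        valid = []
        for offset, row in enumerate(rows[start:start + BATCH_SIZE]):
            try:
                valid.append(validate(row))
            except ValueError as exc:
                failed.append((start + offset, str(exc)))
        insert_batch(valid)  # e.g. a bulk_create on the generated model
        report_progress(min(start + BATCH_SIZE, total) / total)
    return failed
```

This shape directly gives the three behaviours listed above: no upfront whole-file validation, a progress update every 100 rows, and a failure report instead of an all-or-nothing import.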
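To illustrate the array-of-values idea: sending the field names once and each row as a plain array is strictly smaller than repeating the keys in every row object. The dataset below is a toy example; the ~500MB -> ~800MB figure mentioned earlier comes from real files.

```python
import gzip
import json

fields = ["name", "color", "price"]
rows = [["apple", "red", 1], ["pear", "green", 2]] * 1000

# Verbose layout: field names repeated in every row object.
as_objects = json.dumps([dict(zip(fields, r)) for r in rows])

# Compact layout: field names sent once, rows as plain arrays of values.
as_arrays = json.dumps({"fields": fields, "rows": rows})

assert len(as_arrays) < len(as_objects)

# Gzip removes much of the key repetition in both layouts, but the compact
# layout still compresses smaller and is cheaper to parse on the backend.
assert len(gzip.compress(as_arrays.encode())) < len(gzip.compress(as_objects.encode()))
```

The same reasoning is why gzipping the transfer helps even if the format change is postponed: the repeated field names compress very well.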

A few warning points

  • What about undo/redo?
  • Realtime update of the table?
  • Hide the table while importing <- use the trash flag?

Other concerns

  • We want to be able to import data into an existing table at some point -> use the batch create/update API
  • At some point we want the import to be able to guess the field types?

Questions

  • How do we store the sent data: in a file? in a database Blob?

Plans

  • v1 (with feature flag):
    • Add a long-running job manager and move the main backend processing into a job
    • Refactor Airtable job
    • No report yet
    • Improve UI responsiveness and feeling (Spinner, updates, preview, ...) [AUTONOMOUS]
  • v2 (without feature flag):
    • Send the import report to the user at the end of the processing, by email or inside the import modal
    • Show a progress bar during the upload [AUTONOMOUS]
    • Show a progress bar during the processing
    • The two progress bars must form one continuous indicator: the upload represents ~20% of the total and the server processing ~80%.
  • v3:
    • Import a file to add new / replace rows.
    • Change JSON format to use Array of values [AUTONOMOUS] [If we have time]
    • Refactor export jobs? [If we have time]
  • VNext
    • Add a UI to show which tasks are running and their statuses
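The 20%/80% weighting from the v2 plan maps the two phases onto a single continuous bar. A minimal sketch (the helper name and the weight constant are illustrative, not existing code):

```python
# Hypothetical helper mapping both phases onto one continuous progress bar:
# the upload covers the first 20%, backend processing the remaining 80%.

UPLOAD_WEIGHT = 0.2

def combined_progress(upload_fraction, processing_fraction):
    """Both fractions are in [0, 1]; returns the overall progress in [0, 1]."""
    return (UPLOAD_WEIGHT * upload_fraction
            + (1 - UPLOAD_WEIGHT) * processing_fraction)
```

With this mapping, a finished upload with no backend processing done yet shows ~20%, and halfway through the backend processing shows ~60%, so the bar never jumps backwards when the second phase starts.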
Edited by Jérémie Pardou