Import data when creating a table asynchronously via a Celery task instead of via the web-frontend
What do we want?
Be able to import large CSV/JSON/... files painlessly while creating a new table. For example, a CSV file of 400 columns x 100,000 rows should work without browser freezes or backend timeouts, and without the feeling that something is about to break.
How is this working now?
- You can open a modal to create a new table and configure the import: which kind of input file and how to import it.
- The data are processed by the frontend to display a preview of the import
- When the user validates the import configuration, the data are transformed to a generic JSON pivot format and then sent to the backend to create the table
- The maximum file size you can import this way is 15MB
- You can only import data into a new table
What are the issues?
For small files this is OK, but for bigger ones:
- The import can take a very long time (7 minutes for a ~500MB CSV file, excluding upload time), which means:
- A gunicorn worker is blocked during the whole processing -> fewer workers to respond to other requests and a risk of timeouts
- The user can't use Baserow in the meantime
- The user has no feedback on the progress of the import
- The browser seems frozen during many steps of the process
- If one row fails, the whole import fails, which can be frustrating when you want to import many rows
What is going on?
- Actually the frontend isn't that slow to process the data, but some spinners are missing and states are changed too late
- The transferred JSON format is verbose because we repeat the field name for each value of each row, so the upload can take longer than expected (e.g. a ~500MB CSV file becomes ~800MB of JSON)
- On the backend, for 7 minutes of processing:
- ~3min are spent on the data validation done by a DRF serializer
- ~30s to create the models for the bulk update
- ~3min for the bulk update itself
What can we functionally improve?
- Show the user that something is processing behind the scenes [MAJOR]
- Avoid the "stuck browser" effect [MAJOR]
- Make the preview appear faster [MEDIUM]
- Show a progress bar during the upload [MAJOR]
- Improve upload transfer time [MINOR]
- Show another progress bar during the backend processing, like the export or Airtable import [MAJOR]
- Allow the user to close the import modal and do something else during the processing [MEDIUM]
- Don't fail the whole import on a row import error; import what is possible and show a report at the end [MAJOR]
Technical solutions
Frontend
- Improve the UI by showing spinners where necessary (while creating/updating the preview, for example)
- Change UI state BEFORE doing tasks that can freeze the browser
- Don't parse the whole file to display the preview but just the first X lines
- Show a progress bar while the data are uploading
- Show a progress bar during the import processing
- Add an interface to display currently running imports (and other tasks)
- We might be able to use web workers (not sure we'd gain much here because we'd need to transfer the data to the web worker)
Backend
- Delay the import in a Celery task
- Be more relaxed on data validation -> don't use a DRF serializer and allow individual rows to fail. Don't validate the data beforehand; if a row fails, let it fail and report the error
- Send a report (by email? in the import modal? on a specific page?) to the user at the end of the processing
- Do the import in batches of 100 rows and update the progress indicator
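The batched, failure-tolerant loop described above could look roughly like this. This is a sketch with hypothetical names (`import_rows`, `prepare_row`, `insert_batch`), not actual Baserow or Celery API: in Baserow, `insert_batch` would be something like `Model.objects.bulk_create`, and the whole function would run inside a Celery task.

```python
BATCH_SIZE = 100


def prepare_row(row):
    """Relaxed per-row validation: only this row fails, not the whole import."""
    if any(cell is None for cell in row):
        raise ValueError("missing value")
    return row


def import_rows(rows, insert_batch, progress_callback=None):
    """Import `rows` in batches, collecting per-row errors into a report."""
    report = {"imported": 0, "errors": []}  # errors: (row index, message)
    total = len(rows)
    for start in range(0, total, BATCH_SIZE):
        valid = []
        for offset, row in enumerate(rows[start:start + BATCH_SIZE]):
            try:
                valid.append(prepare_row(row))
            except ValueError as exc:
                # Record the failure and keep going instead of aborting.
                report["errors"].append((start + offset, str(exc)))
        insert_batch(valid)
        report["imported"] += len(valid)
        if progress_callback:
            progress_callback(min(start + BATCH_SIZE, total) / total)
    return report
```

A run over `[[1, 2], [None, 3], [4, 5]]` would import two rows and report the failing one, instead of failing the whole import.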
Data transfer
- Change the JSON format to use an Array of values instead of an Array of objects
- Transform the JSON to protobuf before sending the data
- Gzip the data (if that's not already the case)
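To illustrate the array-of-values idea, here is the same (made-up) data in both wire formats. The field names and rows are purely illustrative; the saving grows with the row count, since field names are no longer repeated in every row.

```python
import json

# Made-up example data, not from a real import.
fields = ["name", "color", "size"]
rows = [
    ["apple", "red", 3],
    ["pear", "green", 4],
    ["plum", "purple", 5],
]

# Current verbose format: one object per row, keys repeated in every row.
verbose = json.dumps([dict(zip(fields, row)) for row in rows])

# Proposed format: field names sent once, rows as plain arrays of values.
compact = json.dumps({"fields": fields, "rows": rows})

print(len(verbose), len(compact))  # the gap widens as the row count grows
```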
A few warning points
- What about undo/redo?
- Realtime updates of the table?
- Hide the table while importing <- use the trash flag?
Other concerns
- We want to be able to import data into an existing table at some point -> use the batch create/update API
- At some point we want the import to be able to guess the field types?
Questions
- How do we store the sent data: in a file? In a database blob?
Plans
- v1 (with feature flag):
- Add a long-running job manager and move the main backend processing to a job
- Refactor the Airtable job
- No report yet
- Improve UI responsiveness and feeling (Spinner, updates, preview, ...) [AUTONOMOUS]
- v2 (without feature flag):
- Show a progress bar during the upload [AUTONOMOUS]
- Show a progress bar during the processing
- The two progress bars must join at the end: the upload represents ~20% and the server processing ~80%
- Send the import report to the user at the end of the processing, by email or inside the import modal
- v3:
- Import a file to add new/replace rows
- Change the JSON format to use an Array of values [AUTONOMOUS] [If we have time]
- Refactor export jobs? [If we have time]
- VNext
- Add a UI to show which tasks are running and their statuses
Edited by Jérémie Pardou