Implement crash detection (#479) · Issues · Whitebox / Whitebox

Implement crash detection

# Description

Context: https://gitlab.com/whitebox-aero/whitebox/-/issues/395#note_3024312906

We need to implement a way to detect crashes. The proposal from the context above, pasted:

## 1. Did crash happen

Create an agent process (a separate docker service based on `backend`) that manages a `BootHistory` (suggest a name :D) model. On start it:

- checks if there is an existing entry without `shutdown_time`
  - if so, populates it with the shutdown time and marks the entry as `status=CRASHED`
- creates a new entry with the current timestamp
- waits for a termination signal, writes the timestamp to the current entry, shuts down

This would however require the agent process to start AND wait a little bit until other services start. This can be done simply by opening a random port as a simple socket server on a separate thread; the health check would then probe whether the port is open.

## 2. When did the crash happen

Linux by default doesn't have something directly useful here. Our best bet is to check for the newest `something` (`somewhere`) and rely on that info. So far, what comes to mind:

- `journalctl` entries
  - irregular on a setup like ours (and I don't feel like spamming it with garbage either)
  - `journalctl -b -1 -r | head -1` to get the latest event from the previous boot
  - can we interface with the host `journalctl` from inside the container?
- find the latest file change timestamp in `/var/log` prior to the current boot time
  - we should check how reliable this is
  - something may be constantly writing, but that can be overwritten on boot before our script kicks in
- have a heartbeat file, writing to it every ~1s, but that means that we have to make sure stuff is actually flushed to the SD card
  - there is a `sync` command that flushes all write operations to a disk
  - This is probably the best method to give us an actually accurate timestamp, but:
    - Do we need this precision?
    - We might need to get acquainted with FS operations so that we don't degrade/kill the SD card prematurely (?)

## 3. Allowing the system to react to crashes

Kernel would always be able to access everything, but I'd discourage plugins from interacting with that model altogether. To prevent future havoc from the system's distributed nature, we could expose e.g. the last 3/10/whatever boot infos in `whitebox.{current,last}_boot` as some `dict`/`namedtuple` dump of the model object (and later `whitebox.boot_history`, in case history is useful), and plugins would be able to check it on load or whenever.

# Scope

- Implement crash detection agent
- Make it modular so that `journalctl` or anything else can easily be replaced with a different mechanism
- Track crashes in database
- Expose boot history info in `whitebox` object for development convenience

# MRs

- https://gitlab.com/whitebox-aero/whitebox/-/merge_requests/225
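
To make the "did crash happen" flow from section 1 concrete, here is a minimal sketch of the bookkeeping. It is not the real implementation: it uses plain `sqlite3` instead of the actual `BootHistory` model, and the table layout, the `RUNNING`/`CLEAN` status values, and the function names `record_boot`/`record_shutdown` are all made up for illustration (only `shutdown_time` and `status=CRASHED` come from the proposal). When no crash time estimate is available, it falls back to the current boot time as an upper bound, which is also an assumption.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Hypothetical table standing in for the BootHistory model.
SCHEMA = """
CREATE TABLE IF NOT EXISTS boot_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    boot_time TEXT NOT NULL,
    shutdown_time TEXT,
    status TEXT NOT NULL DEFAULT 'RUNNING'
)
"""


def utc_now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def record_boot(
    conn: sqlite3.Connection,
    now: Optional[str] = None,
    crash_time_estimate: Optional[str] = None,
) -> int:
    """Close any dangling entry as CRASHED, then open a new entry.

    Returns the id of the entry created for the current boot.
    """
    now = now or utc_now()
    conn.execute(SCHEMA)
    # An entry without shutdown_time means the previous run never
    # reached its termination handler, so treat it as a crash.
    conn.execute(
        "UPDATE boot_history SET shutdown_time = ?, status = 'CRASHED' "
        "WHERE shutdown_time IS NULL",
        (crash_time_estimate or now,),
    )
    cur = conn.execute("INSERT INTO boot_history (boot_time) VALUES (?)", (now,))
    conn.commit()
    return cur.lastrowid


def record_shutdown(
    conn: sqlite3.Connection, boot_id: int, now: Optional[str] = None
) -> None:
    """Write the shutdown timestamp to the current entry (clean shutdown)."""
    conn.execute(
        "UPDATE boot_history SET shutdown_time = ?, status = 'CLEAN' WHERE id = ?",
        (now or utc_now(), boot_id),
    )
    conn.commit()
```

In the agent, `record_boot` would run once on start, and `record_shutdown` would be called from a handler for the termination signal (`docker stop` sends `SIGTERM` by default). The `crash_time_estimate` argument is where whichever mechanism from section 2 ends up being used would plug in.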

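For the heartbeat file option from section 2, a rough sketch of what writing and reading it could look like. Assumptions: the helper names `write_heartbeat`/`read_last_heartbeat` and the file format (a single Unix timestamp) are invented here, and it calls `os.fsync` on the one file instead of running the system-wide `sync` command mentioned above. It does not address SD card wear, and it does not fsync the containing directory, so whether this is durable enough on the target hardware would still need checking.

```python
import os
import time
from pathlib import Path
from typing import Optional


def write_heartbeat(path: Path, timestamp: Optional[float] = None) -> float:
    """Write the given (or current) Unix timestamp and flush it to disk."""
    ts = time.time() if timestamp is None else timestamp
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w") as f:
        f.write(f"{ts:.3f}\n")
        f.flush()
        # Ask the OS to push this file's data to the storage device.
        os.fsync(f.fileno())
    # Swap the new file in with a rename so a crash mid-write
    # cannot leave a half-written heartbeat behind.
    os.replace(tmp, path)
    return ts


def read_last_heartbeat(path: Path) -> Optional[float]:
    """Return the last recorded timestamp, or None if missing or unreadable."""
    try:
        return float(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
```

The agent would call `write_heartbeat` in a loop (every ~1s per the proposal) and, on the next boot, use `read_last_heartbeat` as the estimated crash time for the dangling entry.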