Implement crash detection

Description

We need to implement a way to detect crashes. The proposal below is pasted from the context above:

1. Did a crash happen

Create an agent process (separate docker service based on backend) that manages a BootHistory (suggest a name :D) model. On start it:

checks whether there is an existing entry without shutdown_time
- if so, populates its shutdown time and marks the entry as status=CRASHED
creates a new entry with the current timestamp
waits for a termination signal, writes the timestamp to the current entry, and shuts down
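
The startup steps above could be sketched roughly like this. It uses plain sqlite3 so it runs standalone; the table name, the column names other than shutdown_time, and the RUNNING/CLEAN statuses are assumptions for illustration (only CRASHED comes from the proposal), and the real thing would presumably be a backend model.

```python
import signal
import sqlite3
import threading
from datetime import datetime, timezone


def utc_now():
    return datetime.now(timezone.utc).isoformat()


def init_db(conn):
    # Hypothetical schema; only shutdown_time and CRASHED come from the proposal.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS boot_history ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "boot_time TEXT NOT NULL, "
        "shutdown_time TEXT, "
        "status TEXT NOT NULL DEFAULT 'RUNNING')"
    )


def record_boot(conn, estimated_crash_time=None):
    """Mark any entry without shutdown_time as CRASHED, then open a new entry."""
    now = utc_now()
    # Without a better estimate, fall back to "now" as the shutdown time.
    conn.execute(
        "UPDATE boot_history SET shutdown_time = ?, status = 'CRASHED' "
        "WHERE shutdown_time IS NULL",
        (estimated_crash_time or now,),
    )
    cur = conn.execute("INSERT INTO boot_history (boot_time) VALUES (?)", (now,))
    conn.commit()
    return cur.lastrowid


def record_shutdown(conn, entry_id):
    """Write the shutdown timestamp to the current entry."""
    conn.execute(
        "UPDATE boot_history SET shutdown_time = ?, status = 'CLEAN' WHERE id = ?",
        (utc_now(), entry_id),
    )
    conn.commit()


def run_agent(conn):
    """Record the boot, block until SIGTERM/SIGINT, then record the shutdown."""
    init_db(conn)
    entry_id = record_boot(conn)
    stop = threading.Event()
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, lambda *_: stop.set())
    stop.wait()
    record_shutdown(conn, entry_id)
```

An entry that never gets its shutdown_time written (power loss, kill -9) is what the next start picks up and marks as CRASHED.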

This would, however, require the agent process to start AND wait a little bit until the other services start. This can be done simply by opening a random port as a simple socket server on a separate thread, with the health check probing whether the port is open.
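
A minimal sketch of the port-based readiness idea, assuming plain TCP on localhost. The function names are made up for illustration, and port=0 simply lets the OS pick a free port; the health check would still need to learn which port that is (or we use a fixed one).

```python
import socket
import threading


def start_readiness_server(host="127.0.0.1", port=0):
    """Open a listening socket served from a daemon thread.

    Returns (server_socket, bound_port).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()

    def accept_loop():
        # Accept and immediately drop connections; an open port is the signal.
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return
            conn.close()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def is_port_open(host, port, timeout=1.0):
    """What a health check would do: try to connect and report success."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The agent would call start_readiness_server() only after the boot entry has been written, so an open port implies that the bookkeeping is done.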

2. When did the crash happen

Linux doesn't offer anything directly useful here by default. Our best bet is to look for the newest timestamp we can find somewhere and rely on that info. So far, what comes to mind:

journalctl entries - irregular on a setup like ours (and I don't feel like spamming it with garbage either)
- journalctl -b -1 -r | head -1 to get the latest event from previous boot
- can we interface with host journalctl from inside container?
find latest file change timestamp in /var/log prior to current boot time
- we should check how reliable this is - something may be constantly writing, but that can be overwritten on boot before our script kicks in
have a heartbeat file and write to it every ~1s, but that means we have to make sure the data is actually flushed to the SD card
- there is a sync command that flushes all write operations to a disk
- This is probably the best method to give us an actually accurate timestamp, but:
  - Do we need this precision?
  - We might need to get acquainted with FS operations so that we don't degrade/kill the SD card prematurely (?)
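
For the heartbeat option, a rough sketch of what the writer could look like. It uses os.fsync on the one file (and its directory) rather than a system-wide sync; the path, the file format and the function names are assumptions, and the SD card wear question is not addressed here.

```python
import os
import time


def write_heartbeat(path):
    """Write the current Unix timestamp to `path` and force it to disk."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w") as fh:
        fh.write(f"{time.time():.3f}\n")
        fh.flush()
        os.fsync(fh.fileno())  # flush this file's data to the storage device
    # Rename over the old file so a reader never sees a half-written file.
    os.replace(tmp_path, path)
    # fsync the containing directory so the rename itself is persisted (Linux).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def read_heartbeat(path):
    """Return the last recorded timestamp, or None if missing or unreadable."""
    try:
        with open(path) as fh:
            return float(fh.read().strip())
    except (OSError, ValueError):
        return None


def heartbeat_loop(path, stop_event, interval=1.0):
    """Write a heartbeat every `interval` seconds until stop_event is set."""
    while not stop_event.wait(interval):
        write_heartbeat(path)
```

On the next start, read_heartbeat() would give the crash time to within roughly one interval, provided the storage honours the flush.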

3. Allowing the system to react to crashes

Kernel would always be able to access everything, but I'd discourage plugins from interacting with that model altogether.

To prevent future havoc from the system's distributed nature, we could expose e.g. the last 3/10/whatever boot infos in whitebox.{current,last}_boot as some dict/namedtuple dump of the model object (and later whitebox.boot_history, in case history is useful), and plugins would be able to check it on load or whenever.
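
To make the idea concrete, here is one possible shape for the read-only dump. BootInfo, build_boot_view and the field names are hypothetical, and a SimpleNamespace stands in for the real whitebox object.

```python
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical read-only snapshot of one BootHistory entry.
BootInfo = namedtuple("BootInfo", ["boot_time", "shutdown_time", "status"])


def build_boot_view(rows, limit=10):
    """Build the attributes to expose from rows ordered newest first.

    Each row is a (boot_time, shutdown_time, status) tuple.
    """
    history = tuple(BootInfo(*row) for row in list(rows)[:limit])
    return SimpleNamespace(
        current_boot=history[0] if history else None,
        last_boot=history[1] if len(history) > 1 else None,
        boot_history=history,
    )


# Stand-in for the real whitebox object, with made-up sample data.
whitebox = build_boot_view(
    [
        ("2026-03-08T10:00:00Z", None, "RUNNING"),
        ("2026-03-07T09:00:00Z", "2026-03-07T21:13:00Z", "CRASHED"),
    ]
)

# What a plugin might do on load:
previous_boot_crashed = (
    whitebox.last_boot is not None and whitebox.last_boot.status == "CRASHED"
)
```

Since namedtuples are immutable, plugins get a snapshot they can inspect but not write back to the model.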

Scope

Implement crash detection agent
- Make it modular so that journalctl (or whatever else we end up using) can easily be replaced with a different mechanism
Track crashes in database
Expose boot history info in whitebox object for development convenience
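
One way the modularity could look: a small interface for "estimate when the previous boot ended", with the /var/log mtime idea as one interchangeable implementation. All class and function names here are assumptions; a journalctl- or heartbeat-based estimator would implement the same method.

```python
import os
from typing import Iterable, Optional, Protocol


class CrashTimeEstimator(Protocol):
    """Anything that can guess when the previous boot ended (Unix timestamp)."""

    def estimate(self, boot_time: float) -> Optional[float]: ...


class LatestMtimeEstimator:
    """Newest file modification time under `root` that precedes the current boot."""

    def __init__(self, root: str = "/var/log"):
        self.root = root

    def estimate(self, boot_time: float) -> Optional[float]:
        latest = None
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                try:
                    mtime = os.stat(os.path.join(dirpath, name)).st_mtime
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if mtime < boot_time and (latest is None or mtime > latest):
                    latest = mtime
        return latest


def estimate_crash_time(
    estimators: Iterable[CrashTimeEstimator], boot_time: float
) -> Optional[float]:
    """Return the first non-None estimate, trying the estimators in order."""
    for estimator in estimators:
        result = estimator.estimate(boot_time)
        if result is not None:
            return result
    return None
```

The agent would then only depend on estimate_crash_time(), and swapping mechanisms becomes a matter of changing the list passed in.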

MRs

!225 (merged)

Edited Mar 08, 2026 by Milos