Implement crash detection
Description
Context: #395 (comment 3024312906)
We need to implement a way to detect crashes. The proposal from the context above, pasted:
1. Did a crash happen
Create an agent process (separate docker service based on backend) that manages a BootHistory (suggest a name :D) model. On start it:
- checks if there is an existing entry without `shutdown_time`
  - if so, populates it with shutdown time
  - marks entry as `status=CRASHED`
- creates a new entry with current timestamp
- waits for a termination signal, writes timestamp to current entry, shuts down
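A minimal sketch of the startup sequence above, using stdlib `sqlite3` purely for illustration (the real thing would presumably be a backend model). The table layout, the `RUNNING`/`CLEAN` status values, and all function names are assumptions; only `shutdown_time` and `status=CRASHED` come from the proposal.

```python
import signal
import sqlite3
import threading
from datetime import datetime, timezone

# Hypothetical schema; the real BootHistory model may look different.
SCHEMA = """
CREATE TABLE IF NOT EXISTS boot_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    boot_time TEXT NOT NULL,
    shutdown_time TEXT,
    status TEXT NOT NULL DEFAULT 'RUNNING'
)
"""


def now_iso():
    return datetime.now(timezone.utc).isoformat()


def on_start(conn, estimated_crash_time=None):
    """Close any dangling entry as CRASHED, then open a new one. Returns its id."""
    conn.execute(SCHEMA)
    # Assumption: fall back to "now" when no better crash-time estimate exists.
    crash_time = estimated_crash_time or now_iso()
    conn.execute(
        "UPDATE boot_history SET shutdown_time = ?, status = 'CRASHED' "
        "WHERE shutdown_time IS NULL",
        (crash_time,),
    )
    cur = conn.execute("INSERT INTO boot_history (boot_time) VALUES (?)", (now_iso(),))
    conn.commit()
    return cur.lastrowid


def on_terminate(conn, entry_id):
    """Record a clean shutdown for the current entry."""
    conn.execute(
        "UPDATE boot_history SET shutdown_time = ?, status = 'CLEAN' WHERE id = ?",
        (now_iso(), entry_id),
    )
    conn.commit()


def run(db_path):
    """Agent main loop: register the boot, wait for SIGTERM/SIGINT, record shutdown."""
    conn = sqlite3.connect(db_path)
    entry_id = on_start(conn)
    stop = threading.Event()
    signal.signal(signal.SIGTERM, lambda *_: stop.set())
    signal.signal(signal.SIGINT, lambda *_: stop.set())
    stop.wait()
    on_terminate(conn, entry_id)
```

`run()` has to be called from the main thread, since Python only allows installing signal handlers there.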
This would, however, require the agent process to start AND wait a little bit until other services start. This can be done simply by opening a random port with a simple socket server on a separate thread; the health check would then probe whether the port is open.
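A rough sketch of the "open port as readiness signal" idea, stdlib only. The function names are made up for illustration. `port=0` lets the OS pick a free port, which matches the "random port" wording, but an actual docker health check would need to know the port, so a fixed one is probably more practical.

```python
import socket
import threading


def start_readiness_server(host="127.0.0.1", port=0):
    """Open a listening socket served by a background thread; return (socket, port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))  # port=0 asks the OS for any free port
    server.listen()

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # raised if the socket has been closed
            conn.close()  # the probe only needs the connection to succeed

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def is_ready(host, port, timeout=1.0):
    """Health-check side: True if a TCP connection to the port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The agent would call `start_readiness_server()` only after the boot entry has been written, so an open port implies the crash check is done.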
2. When did the crash happen
Linux by default doesn't have anything directly useful here. Our best bet is to check for the newest something (somewhere) and rely on that info. So far, what comes to mind:
- `journalctl` entries - irregular on a setup like ours (and I don't feel like spamming it with garbage either)
  - `journalctl -b -1 -r | head -1` to get the latest event from previous boot - can we interface with host journalctl from inside container?
- find latest file change timestamp in `/var/log` prior to current boot time
  - we should check how reliable this is - something may be constantly writing, but that can be overwritten on boot before our script kicks in
- have a heartbeat file, writing to it every ~1s, but that means that we have to make sure stuff is actually flushed to the SD card
  - there is a `sync` command that flushes all write operations to a disk - This is probably the best method to give us an actually accurate timestamp, but:
    - Do we need this precision?
    - We might need to get acquainted with FS operations so that we don't degrade/kill the SD card prematurely (?)
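To make the heartbeat option more concrete, here is a small sketch. Note that it swaps the `sync` command for `os.fsync()`, which flushes only the heartbeat file instead of every pending write on the system; that might be gentler on the SD card, but this is an assumption that would need checking. Function names and the file format (a single float timestamp) are hypothetical.

```python
import os
import time


def write_heartbeat(path, timestamp=None):
    """Write a timestamp to `path` and ask the OS to push it to the storage device."""
    value = time.time() if timestamp is None else timestamp
    with open(path, "w") as f:
        f.write(f"{value}\n")
        f.flush()             # move data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush this file to the device


def read_heartbeat(path):
    """Return the last recorded timestamp, or None if missing or unreadable."""
    try:
        with open(path) as f:
            return float(f.read().strip())
    except (FileNotFoundError, ValueError):
        # ValueError covers a torn or empty file left behind by a crash mid-write.
        return None
```

On the next boot, the agent would call `read_heartbeat()` before writing any new heartbeat and use the result as the crash time estimate, accurate to roughly the write interval.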
3. Allowing the system to react to crashes
Kernel would always be able to access everything, but I'd discourage plugins from interacting with that model altogether.
To prevent future havoc from the system's distributed nature, we could expose e.g. the last 3/10/whatever boot infos in `whitebox.{current,last}_boot` as some dict/namedtuple dump of the model object (and later `whitebox.boot_history`, in case history is useful), and plugins would be able to check it on load or whenever.
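One possible shape for that read-only view, as a sketch. `BootInfo`, `build_boot_view`, and the field names are assumptions, and `SimpleNamespace` merely stands in for the real `whitebox` object. Plugins would get immutable namedtuples rather than the model itself.

```python
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical field set; mirrors whatever the BootHistory model ends up storing.
BootInfo = namedtuple("BootInfo", ["boot_time", "shutdown_time", "status"])


def build_boot_view(entries, history_size=3):
    """Build a read-only view from rows ordered oldest to newest.

    Each row is a (boot_time, shutdown_time, status) tuple.
    """
    infos = tuple(BootInfo(*row) for row in entries[-history_size:])
    return SimpleNamespace(
        current_boot=infos[-1] if infos else None,
        last_boot=infos[-2] if len(infos) >= 2 else None,
        boot_history=infos,
    )
```

A plugin could then do something like `if view.last_boot and view.last_boot.status == "CRASHED": ...` on load, without ever touching the database.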
Scope
- Implement crash detection agent
  - Make modular so that `journalctl` (or whichever mechanism we pick) can easily be replaced with a different one
- Track crashes in database
- Expose boot history info in `whitebox` object for development convenience
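For the "make modular" item, one way it could look: each mechanism implements the same small interface, and the agent just asks all configured sources. Everything here (`CrashTimeSource`, `LogMtimeSource`, `HeartbeatFileSource`, `estimate_crash_time`) is a hypothetical naming sketch, not an agreed design.

```python
import os
from typing import Optional, Protocol


class CrashTimeSource(Protocol):
    """Anything that can say when the system was last seen alive before `boot_time`."""

    def last_seen_alive(self, boot_time: float) -> Optional[float]: ...


class LogMtimeSource:
    """Newest file modification time under a directory that predates the current boot."""

    def __init__(self, root="/var/log"):
        self.root = root

    def last_seen_alive(self, boot_time):
        newest = None
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                try:
                    mtime = os.stat(os.path.join(dirpath, name)).st_mtime
                except OSError:
                    continue  # unreadable or vanished file, skip it
                if mtime < boot_time and (newest is None or mtime > newest):
                    newest = mtime
        return newest


class HeartbeatFileSource:
    """Timestamp stored in a heartbeat file (a single float, seconds since epoch)."""

    def __init__(self, path):
        self.path = path

    def last_seen_alive(self, boot_time):
        try:
            with open(self.path) as f:
                value = float(f.read().strip())
        except (OSError, ValueError):
            return None
        return value if value < boot_time else None


def estimate_crash_time(sources, boot_time):
    """Return the latest timestamp reported by any source, or None if none report."""
    candidates = []
    for source in sources:
        seen = source.last_seen_alive(boot_time)
        if seen is not None:
            candidates.append(seen)
    return max(candidates, default=None)
```

A `journalctl`-based source would fit the same interface, so swapping mechanisms comes down to changing the list passed to `estimate_crash_time()`.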