bpo.repo.symlink.create running while apkindex sign job is running results in deleted packages
This morning, the following happened for master/armv7:
- a package build job completed and triggered the api callback code on bpo
- bpo.repo.build() -> bpo.repo.symlink.create() ran
- apkindex sign job started
- bpo.repo.build() -> bpo.repo.symlink.create() ran again, started by the images build timer (runs every hour)
- apkindex sign job finished, callback code ran
- the symlink repo was almost empty
- packages that were not in the symlink repo were deleted from the final repo (pretty much all of them)
Other arches and branches are not affected.
I've been analyzing logs and code for some hours and couldn't find the exact line of code that triggers this. There is a shutil.rmtree running on the symlink repo at the start of bpo.repo.symlink.create(), after which the symlink repo gets re-created. This looks like the cause, but according to the log, that code had already run through by the time the callback from the sign job came in.
Regardless, the proper fix should be (mid-term):
- with each bpo.repo.symlink.create(), use a different, new temporary directory for the symlink repo
- send the ID of that temp dir along with the job
- remove a symlink repo dir only after the job with its ID has returned and the dir is no longer needed
- do the same for the apkindex file generated from the symlink repo: instead of overwriting the last one, create a new one with a unique name and delete it when it is no longer used
- (clean up temp files on restart)
- create regular backups of the binary packages on the mirror server, so we could recover from such bugs quickly if necessary: #99
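
A minimal sketch of the "one temp dir per symlink repo, identified by an ID" idea from the list above. This is not bpo's actual code: the class name `SymlinkRepoRegistry`, its methods, and the directory layout are all hypothetical, and it assumes the ID can simply be passed to the sign job and sent back with the callback.

```python
import shutil
import uuid
from pathlib import Path


class SymlinkRepoRegistry:
    """Hypothetical helper: one temporary symlink repo dir per job ID."""

    def __init__(self, temp_root):
        self.temp_root = Path(temp_root)
        self.temp_root.mkdir(parents=True, exist_ok=True)

    def create(self, package_paths):
        """Create a fresh symlink repo and return (repo_id, repo_dir).

        Nothing is deleted here, so a second call cannot disturb a
        symlink repo that a running sign job still depends on.
        """
        repo_id = uuid.uuid4().hex
        repo_dir = self.temp_root / repo_id
        repo_dir.mkdir()
        for path in package_paths:
            path = Path(path)
            (repo_dir / path.name).symlink_to(path.resolve())
        return repo_id, repo_dir

    def path(self, repo_id):
        """Look up the dir for an ID that came back with a job callback."""
        # IDs are generated as hex strings; reject anything else so an
        # ID from a callback cannot point outside temp_root.
        if not repo_id.isalnum():
            raise KeyError(f"invalid symlink repo id: {repo_id!r}")
        repo_dir = self.temp_root / repo_id
        if not repo_dir.is_dir():
            raise KeyError(f"unknown symlink repo id: {repo_id!r}")
        return repo_dir

    def remove(self, repo_id):
        """Remove exactly one symlink repo, once its job is done with it."""
        # rmtree unlinks the symlinks, it does not follow them, so the
        # packages in the final repo are left untouched.
        shutil.rmtree(self.path(repo_id))

    def cleanup_all(self):
        """Remove all leftover symlink repos, e.g. on restart."""
        for child in self.temp_root.iterdir():
            if child.is_dir():
                shutil.rmtree(child)
```

With something like this, the sign job callback would compare the final repo against `registry.path(repo_id)` for the ID it was started with, and call `registry.remove(repo_id)` afterwards. A later run of the symlink creation would get its own directory instead of wiping the one still in use. The same scheme could apply to the apkindex file by giving it a name derived from the ID.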
Implementing this will take time, and I'm already behind with backlog on a lot of things, so I won't be able to do it now.
Now I'll do the following, as a short-term solution:
- we have a backup of the repository from January, from which we copied master/armv7 to the repository
- I've just created a new manual backup of the entire package repository in case this should happen again.
- I'll add a patch that prevents bpo.repo.symlink.create from running when called through the images build timer. This should prevent this bug from occurring in practice (in theory, other code paths are possible, which is why I want to implement the proper fix above).
- when bpo restarts, it will remove outdated packages from that January backup and start building missing packages
EDIT: using the backup we had brought its own problems, and most packages needed to be rebuilt anyway, so I'll just let it rebuild all of them.