Investigate: Pod failures and PVCs

During testing, we frequently brought down a Redis deployment in Kubernetes and started up a new one. During this procedure, we noticed that the volume holding the data assigned to a given Pod was NOT destroyed. This raises some interesting questions about operating Redis in production inside Kubernetes. Use this issue to continue learning how Persistent Volumes work inside Kubernetes, with a focus on Redis.

  1. What does Redis do if it comes back online with a PV that was not destroyed when the Pod was destroyed?
  2. If the data is old, does Redis know to properly re-sync the data?
  3. What impact does this have on a Redis deployment and on troubleshooting it?
  4. What monitoring should/can we add to validate that a volume is not the root cause of an issue?
  5. What runbook additions should be considered when it comes to maintenance procedures and changes to a Redis deployment?
  6. When testing things, are there things we should document to ensure that we are testing items concisely?
  7. Can/Should we ever attempt to remove a PV along with a Pod for any reason?
  8. ...
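
As a starting point for questions 2 and 4, here is a minimal sketch for reading what a Redis node reports about its own replication state. It parses the text returned by `redis-cli INFO replication` and checks the `role`, `master_link_status`, and `master_sync_in_progress` fields. The function names are hypothetical, and this only shows what Redis reports; it does not prove that stale data on a reused volume was re-synced correctly.

```python
from typing import Optional


def parse_redis_info(raw: str) -> dict[str, str]:
    """Parse the key:value text returned by the Redis INFO command."""
    fields: dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def replica_link_ok(info: dict[str, str]) -> Optional[bool]:
    """Return None for a non-replica, else whether the link to the primary is up and idle."""
    if info.get("role") != "slave":
        return None
    return (
        info.get("master_link_status") == "up"
        and info.get("master_sync_in_progress") == "0"
    )
```

For example, save the output of `redis-cli INFO replication` from a Pod that restarted on an existing volume, pass it to `parse_redis_info`, and compare the result of `replica_link_ok` before and after the restart.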

Milestones

  • Runbooks updated to account for any procedural items
  • Readiness review is prepped with information regarding this investigation as we see fit
  • Monitoring, where possible, is updated to assist with troubleshooting
  • Knowledge from this investigation is shared

Helm explicitly does not remove PVCs from a deployment: https://github.com/helm/helm/issues/5156
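
Because PVCs can outlive the Pods that used them, it may help to list claims that no Pod currently mounts. Below is a minimal sketch, assuming the inputs are the parsed JSON from `kubectl get pvc -o json` and `kubectl get pods -o json`. The function name is hypothetical; the fields it reads (`metadata.name`, `metadata.namespace`, `spec.volumes[].persistentVolumeClaim.claimName`) are the standard Kubernetes object fields.

```python
from typing import Any


def unmounted_pvcs(pvc_list: dict[str, Any], pod_list: dict[str, Any]) -> list[str]:
    """Return "namespace/name" for every PVC that no Pod in pod_list mounts."""
    mounted: set[tuple[str, str]] = set()
    for pod in pod_list.get("items", []):
        namespace = pod["metadata"].get("namespace", "default")
        for volume in pod.get("spec", {}).get("volumes", []):
            # Only volumes backed by a claim carry this key.
            claim = volume.get("persistentVolumeClaim")
            if claim:
                mounted.add((namespace, claim["claimName"]))

    orphans = []
    for pvc in pvc_list.get("items", []):
        meta = pvc["metadata"]
        key = (meta.get("namespace", "default"), meta["name"])
        if key not in mounted:
            orphans.append(f"{key[0]}/{key[1]}")
    return sorted(orphans)
```

To use it, load both JSON documents with `json.load` and pass them in. Any name it returns is a claim left behind with no Pod attached, which is worth checking before starting a new Redis deployment.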

Reference: Kubernetes Persistent Volumes: https://kubernetes.io/docs/concepts/storage/persistent-volumes/

Reference: Slack thread that initiated this conversation: https://gitlab.slack.com/archives/C02GKLY9XGF/p1643830364771829

Edited by John Skarbek