Docker Image 24.04 rebase - non-root user issue.
# Root Cause Analysis: Ubuntu 24.04 Base Image Causing Boot Loop in Existing Containers

## Overview

When updating our Docker base image from Ubuntu 22.04 to Ubuntu 24.04, existing containers failed to start and got stuck in a boot loop. This issue was traced to a conflict caused by the presence of a default `ubuntu` user in the new 24.04 image. Because of how user IDs (UIDs) were assigned, our own non-root user could no longer access mounted files.

## Affected Versions

- **4.4.4**: Unaffected
- **4.4.5 – 4.4.6**: Affected by the user-permissions conflict
- **4.4.7**: Will include the fix to remove the shipped `ubuntu` user and restore normal operation

## Timeline

1. **Initial Update (Version 4.4.5)**
   - A pull request changing the base image from `ubuntu:22.04` to `ubuntu:24.04` was merged.
   - Local tests on fresh containers passed without issues.
2. **Deployment (Version 4.4.6)**
   - The updated Docker image was rolled out to production.
   - Existing containers attempted to restart and entered a boot loop.
3. **Investigation**
   - Discovered that `ubuntu:24.04` images ship with a default `ubuntu` user (UID 1000).
   - Our container Dockerfile created a new non-root user (assumed UID 1000), causing a UID conflict.
   - Mismatched UIDs prevented file access on mounted volumes.
4. **Resolution & Testing**
   - Explored two options:
     1. Use the shipped `ubuntu` user and remove the custom user.
     2. Remove the shipped `ubuntu` user and continue creating our custom user.
   - Chose option 2 to maintain better control over user privileges (especially `sudo` usage).
   - A fix was implemented and tested; new builds confirmed that both fresh and existing containers now operate correctly.

## Root Cause

1. **User ID Conflict**
   - Ubuntu 24.04 ships with a default `ubuntu` user with UID 1000.
   - Our non-root user creation expected UID 1000, causing mismatched permissions and locked file access.
2. **Inherited Permissions**
   - Mounted volumes rely on consistent UID mappings.
   - Any conflict in UID ownership leads to immediate permission errors.
3. **Inconsistent Testing**
   - Fresh-container testing did not reveal the conflict.
   - The issue appeared only when existing containers with persistent volumes were restarted.

## Impact

- **Production Outage**
  - Affected versions (4.4.5 – 4.4.6) caused existing Docker containers to go into a boot loop, resulting in downtime or service disruption.
- **Time & Effort**
  - Engineering time was spent diagnosing and resolving the UID and permission mismatch.

## Resolution

### High-Level Fix

1. **Remove the shipped `ubuntu` user**

   ```dockerfile
   # Create the mail spool first so `userdel -r` finds it when removing the user's files
   RUN touch /var/mail/ubuntu \
       && chown ubuntu /var/mail/ubuntu \
       && userdel -r ubuntu
   ```

   - Ensures UID 1000 is freed up for our custom user.
2. **Continue to create our own non-root user**

   ```dockerfile
   # Create the custom user (primary group root, no home directory) and its data directory
   RUN useradd -g root -M crafty \
       && mkdir /crafty \
       && chown -R crafty:root /crafty
   ```

   - We maintain one non-root user (`crafty`) with the appropriate permissions and ownership.
3. **Retain sudo Usage**
   - Removing the default `ubuntu` user prevents unauthorized or duplicate sudo usage.
   - Our custom user is not a sudoer. We use sudo only during container init, to step down safely to the custom user as required by our workflow.
4. **Target Release**
   - This fix will be included in version **4.4.7**, ensuring no further conflicts for users upgrading from unaffected versions (4.4.4) or already affected versions (4.4.5 – 4.4.6).
5. **Users deploying after 4.4.4**
   - Users who deployed a fresh instance of an affected version (4.4.5 – 4.4.6) will need to repair permissions after upgrading to 4.4.7. To do so, place a file (an empty text file will do) in the `import/` mount and restart the container. The file can be removed **after Crafty has fully booted**.

### Validation

- [x] **Fresh Build Testing**
  - Verified that containers built from scratch function correctly and that our custom user is assigned UID 1000 as intended.
- [x] **Upgrade Testing**
  - Tested existing containers with persistent volumes to confirm they start without permission issues under the updated image.

## Preventive Measures

1. **Regular Image Audits**
   - Prior to upgrading base images, review release notes (especially new Ubuntu LTS versions) to identify default user or permission changes.
2. **Automated Testing for Existing Containers**
   - Expand CI/CD pipelines to include restarting containers with existing volumes to quickly catch user-permission conflicts.
3. **Version Documentation**
   - [x] Clearly document which versions are affected and note the resolved version (4.4.7).
   - Keep a changelog that highlights significant changes to base images or user-management practices.
4. **Upstream Communication**
   - Continue monitoring Ubuntu release bugs (e.g., [Launchpad #2005129](https://bugs.launchpad.net/cloud-images/+bug/2005129)) and Docker community channels for changes in default user configurations.

## Conclusion

Switching to `ubuntu:24.04` in MR !812 ([Release v4.4.5](https://gitlab.com/crafty-controller/crafty-4/-/releases/v4.4.5)) introduced a default `ubuntu` user with UID 1000, conflicting with our own non-root user creation process. This conflict primarily affected existing containers with persistent volumes in versions **4.4.5 – 4.4.6**. By removing the default user and relying on our custom user, we resolved the UID conflict. This change is tested, verified, and slated for release in **version 4.4.7**, ensuring continuity and stability for our deployments.
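
As a rough illustration of the image audit suggested under Preventive Measures, the following is a minimal shell sketch (not part of the actual fix) that checks whether a UID is already taken in a passwd-format file. The function names `uid_owner` and `check_uid_free` are hypothetical, and the sketch assumes the base image's `/etc/passwd` has been dumped to a local file beforehand.

```shell
#!/bin/sh
# Hypothetical helpers for auditing a base image's user database.

# Print the name of the account that owns the given UID in a
# passwd-format file; prints nothing if the UID is unassigned.
uid_owner() {
    uid="$1"
    passwd_file="${2:-/etc/passwd}"
    # passwd fields are colon-separated: name:password:UID:GID:gecos:home:shell
    awk -F: -v uid="$uid" '$3 == uid { print $1 }' "$passwd_file"
}

# Return non-zero when the UID is already owned by an account
# other than the expected one.
check_uid_free() {
    uid="$1"
    expected="$2"
    passwd_file="${3:-/etc/passwd}"
    owner="$(uid_owner "$uid" "$passwd_file")"
    if [ -n "$owner" ] && [ "$owner" != "$expected" ]; then
        echo "UID $uid is already owned by '$owner' (expected '$expected' or unassigned)" >&2
        return 1
    fi
    return 0
}
```

For example, after saving the output of `docker run --rm ubuntu:24.04 cat /etc/passwd` to `passwd.txt`, running `check_uid_free 1000 crafty passwd.txt` would be expected to fail on an image that ships the `ubuntu` user at UID 1000. A CI job could use that exit status to flag the conflict before the base image is bumped.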