Docker Image 24.04 rebase - non-root user issue.
# Root Cause Analysis: Ubuntu 24.04 Base Image Causing Boot Loop in Existing Containers
## Overview
When we updated our Docker base image from Ubuntu 22.04 to Ubuntu 24.04, existing containers failed to start and got stuck in a boot loop. We traced the issue to the default `ubuntu` user that ships in the new 24.04 image. That user already holds UID 1000, so our own non-root user no longer received the UID it expected and could not access mounted files.
## Affected Versions
- **4.4.4**: Unaffected
- **4.4.5 – 4.4.6**: Affected by the user-permissions conflict
- **4.4.7**: Will include the fix to remove the shipped `ubuntu` user and restore normal operation
## Timeline
1. **Initial Update (Version 4.4.5)**
- A pull request changing the base image from `ubuntu:22.04` to `ubuntu:24.04` was merged.
- Local tests on fresh containers passed without issues.
2. **Deployment (Version 4.4.6)**
- The updated Docker image was rolled out to production.
- Existing containers attempted to restart and entered a boot loop.
3. **Investigation**
- Discovered that `ubuntu:24.04` images ship with a default `ubuntu` user (UID 1000).
- Our Dockerfile created a new non-root user on the assumption that it would receive UID 1000. With that UID already taken by `ubuntu`, the two conflicted.
- Mismatched UIDs prevented file access on mounted volumes.
4. **Resolution & Testing**
- Explored two options:
1. Use the shipped `ubuntu` user and remove the custom user.
2. Remove the shipped `ubuntu` user and continue creating our custom user.
- Chose option 2 to maintain better control over user privileges (especially `sudo` usage).
- A fix was implemented and tested; new builds confirmed that both fresh and existing containers now operate correctly.
## Root Cause
1. **User ID Conflict**
- Ubuntu 24.04 ships with a default `ubuntu` user with UID 1000.
- Our non-root user creation expected to receive UID 1000. With that UID already taken, the user's UID no longer matched the ownership of existing files, which blocked file access.
2. **Inherited Permissions**
- Mounted volumes rely on consistent UID mappings.
- Any conflict in UID ownership leads to immediate permission errors.
3. **Inconsistent Testing**
- Fresh-container testing did not reveal the conflict.
- The issue appeared only when existing containers with persistent volumes were restarted.
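To check which account holds a given UID in an image, a small lookup against a passwd-format file is enough. The following is a minimal sketch; the function name `uid_owner` is hypothetical and not part of our tooling.

```shell
#!/bin/sh
# Print the name of the account that holds a given UID in a passwd-format file.
# Usage: uid_owner <uid> [passwd_file]   (defaults to /etc/passwd)
# Note: uid_owner is a hypothetical helper name used for illustration.
uid_owner() {
    # Field 3 of each passwd entry is the UID, field 1 is the account name.
    awk -F: -v uid="$1" '$3 == uid { print $1 }' "${2:-/etc/passwd}"
}
```

Given the findings above, running `uid_owner 1000` against the passwd file of `ubuntu:24.04` should report `ubuntu`, while an empty result means the UID is still free.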
## Impact
- **Production Outage**
- Affected versions (4.4.5 – 4.4.6) caused existing Docker containers to go into a boot loop, resulting in downtime or service disruption.
- **Time & Effort**
- Engineering time was spent diagnosing and resolving the UID and permission mismatch.
## Resolution
### High-Level Fix
1. **Remove the shipped `ubuntu` user**
```dockerfile
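# Create the mail spool first so that `userdel -r` has one to remove,
# then delete the shipped user together with its home directory.
RUN touch /var/mail/ubuntu \
&& chown ubuntu /var/mail/ubuntu \
&& userdel -r ubuntu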
```
- Ensures UID 1000 is freed up for our custom user.
2. **Continue to create our own non-root user**
```dockerfile
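# Create the non-root user in the root group without a home directory (-M),
# then create the application directory and hand it to that user.
RUN useradd -g root -M crafty \
&& mkdir /crafty \
&& chown -R crafty:root /crafty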
```
- We maintain one non-root user (`crafty`) with the appropriate permissions and ownership.
3. **Retain sudo Usage**
- Removing the default `ubuntu` user prevents unauthorized or duplicate sudo usage.
- Our custom user is not a sudoer; sudo is used during container init to step down safely, as our workflow requires.
4. **Target Release**
- This fix will be included in version **4.4.7**, ensuring no further conflicts for users upgrading from unaffected versions (4.4.4) or already affected versions (4.4.5 – 4.4.6).
5. **Users who deployed fresh on 4.4.5 – 4.4.6**
- Users who deployed a fresh instance of an affected version (4.4.5 – 4.4.6) will need to repair permissions after upgrading to 4.4.7. To do so, place a file (an empty text file will do) in the `import/` mount and restart the container. The file can be removed **after crafty has fully booted**.
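The permission repair for fresh 4.4.5 – 4.4.6 deployments can be scripted on the Docker host. The sketch below is one possible approach, not an official tool: the function name, the marker file name, and the container name passed in are all hypothetical, and the import directory must be the host path that is mounted as `import/`.

```shell
#!/bin/sh
# Sketch: place a file in the import/ mount and restart the container.
# Usage: trigger_permission_repair <host_import_dir> <container_name>
# Assumptions: function name, marker file name and container name are
# hypothetical. Set DOCKER=echo for a dry run that only prints the command.
trigger_permission_repair() {
    import_dir=$1
    container=$2
    marker="$import_dir/repair-permissions.txt"   # hypothetical name, any file will do

    if [ ! -d "$import_dir" ]; then
        echo "import directory not found: $import_dir" >&2
        return 1
    fi

    # An empty file is sufficient.
    touch "$marker" || return 1

    # Restart the container so it picks up the file on boot.
    ${DOCKER:-docker} restart "$container" || return 1

    echo "Remove $marker after crafty has fully booted."
}
```

A dry run such as `DOCKER=echo trigger_permission_repair ./docker/import my_container` creates the file and prints the restart command instead of executing it; both arguments here are placeholders.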
### Validation
- [x] **Fresh Build Testing**
- Verified that containers built from scratch function correctly and that our custom user is assigned UID 1000 as intended.
- [x] **Upgrade Testing**
- Tested existing containers with persistent volumes to confirm they start without permission issues under the updated image.
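The fresh-build check that the custom user receives UID 1000 could be automated with a small guard run inside the built image. This is a minimal sketch under that assumption; `assert_uid` is a hypothetical helper, and how it is wired into the build or CI job is left open.

```shell
#!/bin/sh
# Fail unless the named user exists and has the expected UID.
# Usage: assert_uid <user> <expected_uid>
# Note: assert_uid is a hypothetical helper name used for illustration.
assert_uid() {
    # `id -u <user>` prints the numeric UID, or fails if the user is unknown.
    actual=$(id -u "$1" 2>/dev/null) || {
        echo "user '$1' does not exist" >&2
        return 1
    }
    if [ "$actual" != "$2" ]; then
        echo "user '$1' has UID $actual, expected $2" >&2
        return 1
    fi
}
```

Inside the image, `assert_uid crafty 1000` would then stop the pipeline with a readable message if a future base image claims UID 1000 again.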
## Preventive Measures
1. **Regular Image Audits**
- Prior to upgrading base images, review release notes (especially new Ubuntu LTS versions) to identify default user or permission changes.
2. **Automated Testing for Existing Containers**
- Expand CI/CD pipelines to include restarting containers with existing volumes to quickly catch user-permission conflicts.
3. **Version Documentation**
- [x] Clearly document which versions are affected and note the resolved version (4.4.7).
- Keep a changelog that highlights significant changes to base images or user-management practices.
4. **Upstream Communication**
- Continue monitoring Ubuntu release bugs (e.g., [Launchpad #2005129](https://bugs.launchpad.net/cloud-images/+bug/2005129)) and Docker community channels for changes in default user configurations.
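For the automated test with existing volumes (item 2 above), the core assertion is that everything on the volume is still owned by the UID the container runs as. The sketch below covers only that assertion; starting a container on the old image, upgrading, and restarting it are left out. `check_volume_owner` is a hypothetical helper, and it relies on the `-uid` test available in GNU and BSD `find`.

```shell
#!/bin/sh
# List entries under a directory that are NOT owned by the expected UID.
# Returns 0 and prints nothing when ownership is consistent,
# 1 when mismatches are found, 2 when find itself fails.
# Usage: check_volume_owner <dir> <expected_uid>
# Note: check_volume_owner is a hypothetical helper name used for illustration.
check_volume_owner() {
    mismatched=$(find "$1" ! -uid "$2" -print) || return 2
    if [ -n "$mismatched" ]; then
        echo "entries not owned by UID $2:" >&2
        echo "$mismatched" >&2
        return 1
    fi
}
```

Run against the host path of a persistent volume after the restart step, a non-zero exit status flags the kind of UID mismatch that caused this incident.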
## Conclusion
Switching to `ubuntu:24.04` in MR !812 ([Release v4.4.5](https://gitlab.com/crafty-controller/crafty-4/-/releases/v4.4.5)) introduced a default `ubuntu` user with UID 1000, which conflicted with our own non-root user creation process. This conflict primarily affected existing containers with persistent volumes in versions **4.4.5 – 4.4.6**. By removing the default user and relying on our custom user, we resolved the UID conflict. The change is tested, verified, and slated for release in **version 4.4.7**, ensuring continuity and stability for our deployments.