More information will be added as we investigate the issue.
Customers who believe they are affected by this incident should subscribe to this issue or monitor our status page for further updates.
Summary for CMOC notice / Exec summary:
Customer Impact: Human-friendly 1-sentence statement on impacted
Service Impact: service:: labels of services impacted by this incident
Impact Duration: start time UTC - end time UTC ( duration in minutes )
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Security Note:
If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
This issue now has the CorrectiveActionsNeeded label. This label will be removed automatically when there is
at least one related issue that is labeled with corrective action or ~"infradev".
Having an issue related with these labels helps to ensure a similar incident doesn't happen again.
If you are certain that this incident doesn't require any corrective actions, add the
CorrectiveActionsNotNeeded label to this issue with a note explaining why.
Thanks for taking part in this incident! It looks like this incident needs
an async Incident Review issue. Please use the Incident Review link in
the incident's description to create one.
We're posting this message because this issue meets the following criteria:
If you are certain that this incident doesn't require an incident review, add the
IncidentReviewNotNeeded label to this issue with a note explaining why.
Many Ci::BuildFinishedWorker jobs are failing with PG::CheckViolation: ERROR: no partition of relation "p_ci_finished_build_ch_sync_events" found for row DETAIL: Partition key of the failing row contains (partition) = (168).
An example of the error from PG logs:
2024-04-28 09:48:07.877 GMT,"gitlab","gitlabhq_production",1427433,"10.217.20.9:17624",662e1aa6.15c7e9,260,"INSERT",2024-04-28 09:45:10 GMT,177/1101055628,0,ERROR,23514,"no partition of relation ""p_ci_finished_build_ch_sync_events"" found for row","Partition key of the failing row contains (partition) = (168).",,,,,"/*application:sidekiq,correlation_id:b8aae849a078cb16cdf3a1de58085dc4,jid:41e3c37f5395ba2259b7ef90,endpoint_id:Ci::BuildFinishedWorker,db_config_name:ci*/ INSERT INTO ""p_ci_finished_build_ch_sync_events"" (""build_id"",""build_finished_at"") VALUES (6733307398, '2024-04-28 09:46:19.044123') ON CONFLICT (""build_id"",""partition"") DO UPDATE SET ""build_finished_at""=excluded.""build_finished_at"" RETURNING ""build_id"",""partition""",,,"","client backend",,129464415442806422
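To pull the failing relation and partition key out of entries like the one above, the line can be parsed as PostgreSQL csvlog output. The following is a minimal sketch, assuming the standard csvlog column order (error severity, SQLSTATE, message, and detail at 0-based positions 11 to 14). The helper name `missing_partition` and the shortened sample line are illustrative and not part of any existing tooling.

```python
import csv
import re

# 0-based column positions, assuming PostgreSQL's standard csvlog layout.
SEVERITY, SQLSTATE, MESSAGE, DETAIL = 11, 12, 13, 14


# Hypothetical helper name; not taken from GitLab's codebase.
def missing_partition(log_line):
    """Return (relation, partition_key) for a 'no partition' error, else None."""
    fields = next(csv.reader([log_line]))
    # 23514 is the SQLSTATE for check_violation, as seen in the log entry.
    if len(fields) <= DETAIL or fields[SQLSTATE] != "23514":
        return None
    relation = re.search(r'no partition of relation "([^"]+)"', fields[MESSAGE])
    key = re.search(r"= \((\d+)\)", fields[DETAIL])
    if not (relation and key):
        return None
    return relation.group(1), int(key.group(1))


# Shortened copy of the log entry above (trailing columns omitted).
sample = (
    '2024-04-28 09:48:07.877 GMT,"gitlab","gitlabhq_production",1427433,'
    '"10.217.20.9:17624",662e1aa6.15c7e9,260,"INSERT",2024-04-28 09:45:10 GMT,'
    '177/1101055628,0,ERROR,23514,'
    '"no partition of relation ""p_ci_finished_build_ch_sync_events"" found for row",'
    '"Partition key of the failing row contains (partition) = (168).",,,,,'
)

print(missing_partition(sample))
```

Running this over a full log file would show whether every failure refers to partition 168 or whether other partition values were affected as well.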
A relevant log entry shows the column default being changed to 168:
2024-04-28 09:12:49.607 GMT,"gitlab","gitlabhq_production",1297861,"10.218.7.2:47108",662e1302.13cdc5,4,"idle",2024-04-28 09:12:34 GMT,140/1340927854,0,LOG,00000,"statement: /*application:web,db_config_name:ci*/ ALTER TABLE ""p_ci_finished_build_ch_sync_events"" ALTER COLUMN ""partition"" SET DEFAULT 168",,,,,,,,,"","client backend",,0
I'm not a database expert, but it looks like Postgres failed to create that partition (for unknown reasons) but still updated the default value of the partition column. This table uses the sliding list partition strategy, so a new partition is created dynamically. We implemented partition management in Rails (here and here). The default update is in the same DB transaction as the partition creation. Could there be a bug that makes partition creation fail silently?
Either way, we should raise and investigate this outside the scope of this incident.
@mbobin @pedropombeiro Were we able to find a root cause for this incident? Do we need more information, so that we can capture it if the incident recurs?
@ahmadsherif I was out and didn't get a chance to dig deeper into this. From the error message, though, I'd say that we changed the default value for the partition column before the partition was actually created. I don't know whether we store any long-term data about partition management that could confirm this hypothesis.
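To illustrate the hypothesis above, here is a toy model of a list-partitioned table whose partition column default is moved to 168 before partition 168 exists. It is only a sketch under stated assumptions: the class and method names are hypothetical, the previous partition value 167 is assumed for illustration, and none of this reflects how the Rails partition manager is actually implemented.

```python
class NoPartitionError(Exception):
    """Stands in for PG::CheckViolation ('no partition of relation ... found for row')."""


class ListPartitionedTable:
    """Toy model of a list-partitioned table; not GitLab's partition manager."""

    def __init__(self, partitions, default):
        self.partitions = {p: [] for p in partitions}
        self.default = default  # default value of the partition column

    def create_partition(self, value):
        self.partitions.setdefault(value, [])

    def set_default(self, value):
        # Like ALTER COLUMN ... SET DEFAULT, this does not check that a
        # partition accepting the new value exists.
        self.default = value

    def insert(self, row):
        # Rows that do not set the partition column are routed by the default.
        if self.default not in self.partitions:
            raise NoPartitionError(
                f"partition key ({self.default}) has no partition"
            )
        self.partitions[self.default].append(row)


# Assumed starting state: only partition 167 exists (illustrative value).
table = ListPartitionedTable(partitions=[167], default=167)

# Hypothesised ordering: the default moves to 168 before the partition exists.
table.set_default(168)
try:
    table.insert({"build_id": 6733307398})
    failed = False
except NoPartitionError:
    failed = True

# Once the partition is created, the same insert succeeds.
table.create_partition(168)
table.insert({"build_id": 6733307398})
```

In this model every insert fails between `set_default(168)` and `create_partition(168)`, which matches the shape of the errors seen from Ci::BuildFinishedWorker. It does not explain why the real partition creation would be skipped or fail.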
First of all, thank you for taking part in this incident.
We're posting this message because this issue meets the following criteria:
This incident is open
No activity in the past 3 days (since 2024-05-13T09:23:06.950Z)
We'd like to ask you to help us out and determine how we should act on this issue.
Incident issues in IncidentMitigated state should only remain open for ongoing incidents.
If there is a reason that it should remain open, please add a note explaining
the lack of activity. Otherwise, please consider closing it and starting a separate follow-up investigation issue if needed.