```
[gprd]production> Ci::Build.where(status: 'running').where('ci_builds.updated_at < ?', 1.hour.ago).count
D, [2021-05-04T21:32:17.430351 #24432] DEBUG -- : (15017.8ms) /*application:console*/ SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."status" = 'running' AND (ci_builds.updated_at < '2021-05-04 20:32:02.411723')
```
I don't see any indexes on updated_at, so it's likely this query is expensive because we have to look at all running builds and iterate through them.
We do have an index on created_at, which we could use to limit the search space. However, we can see that even this query appears to require a lot of heap fetches:
@abrandl @stanhu @allison.browne it took longer than expected, but the new machine (with PG12) is up and catching up now – I think it will be available on May 13 and we'll switch Joe bot / #database-lab Slack channel to work with it.
@allison.browne Do you suggest adding a new composite index with created_at? I'm not sure whether it's acceptable to add another index to this table, but perhaps it's fine; we would need to ask @abrandl.
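For context, a minimal sketch of what such a migration might look like, assuming a partial composite index on (status, created_at) scoped to Ci::Build rows (the columns, WHERE clause, and index name are my assumptions, not a settled design):

```ruby
# frozen_string_literal: true

# Hypothetical sketch only: a partial composite index to support the stuck-build
# lookups. Column choice, WHERE clause, and name are assumptions for illustration.
class AddStuckBuildsIndexToCiBuilds < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  disable_ddl_transaction!

  INDEX_NAME = 'index_ci_builds_on_status_and_created_at_for_stuck_builds'

  def up
    add_concurrent_index :ci_builds, [:status, :created_at],
      where: "type = 'Ci::Build'",
      name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :ci_builds, INDEX_NAME
  end
end
```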
That being said, I'm working on &5909 (closed). Once this is done we will be able to use the ci_pending_builds table to find builds we should remove from the queue. We could benefit from fixing StuckCiJobsWorker before shipping ci_pending_builds, but if this would require adding another index, then perhaps it would be better to ship ci_pending_builds first (after performing a manual cleanup on production) 🤔
@cheryl.li - I am moving out a few items assigned to %14.0 to make room for this item. Can we get this item weighted please? I am also adding this to the 14.0 planning issue.
If we want to avoid adding new indexes to this large table, I wonder if we should use the created_at index as part of the query, since we know that updated_at >= created_at. That way this job can do its work by scanning in increments of time using created_at, and then filtering by updated_at.
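A rough sketch of that approach, assuming a daily window size and an illustrative lower bound (none of this is the actual worker code):

```ruby
# Rough sketch (assumptions: window size, lower bound, drop reason). Walk
# ci_builds in created_at windows so the created_at index bounds each scan,
# then apply the real "stuck" filter on updated_at.
window = 1.day
cursor = 90.days.ago.beginning_of_day # assumed lower bound for the scan

while cursor < Time.current
  Ci::Build
    .where(status: 'running')
    .where(created_at: cursor...(cursor + window)) # bounded by the created_at index
    .where('updated_at < ?', 1.hour.ago)           # the actual staleness condition
    .find_each { |build| build.drop(:stuck_or_timeout_failure) } # assumed drop reason

  cursor += window
end
```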
@allison.browne Can I get you to weight this infradev issue please? This just came in as a p1 and we are looking to prioritize next milestone. Thanks 🙏
I'm going to bump severity to severity1 because of our findings described in #331417 (comment 579597803). This bug has a direct impact on availability and reliability.
Some of the StuckCiJobsWorker transactions are taking nearly 10 minutes before failing. In addition to the application problems that the failures are causing, these long-running transactions could be a risk to database availability.
@andrewn, I've optimized the StuckCiJobsWorker queries for pending builds, but we hit a roadblock for running builds. For running builds, adding created_at to the query still leads to a timeout on the production replica, and we don't want to add any more indexes to this table, so we may have to re-architect the tables or the way we are deleting stuck/timed-out jobs.
In the meantime, would moving these SELECT queries to the replicas rather than the primary mitigate some of the risks to database availability?
@andrewn, unfortunately, it kind of depends. What we've seen for both the new queries and the existing queries is that they time out on the prod replica at high traffic times of day but sometimes look as if they are performing acceptably in database-lab or the archive replica.
For instance, the optimized version looks like it takes 5.155 ms on database-lab !63314 (closed), but it still times out on the production replica sometimes: !63314 (comment 594170188).
In the archive replica, I'm seeing 143 seconds for the existing query ATM.
I created a follow-up to drop, via a migration, any builds that may have been missed due to the timeouts and that won't be picked up by the queries in the final solution (which include a lookback window).
For the first MR, I've resolved all Sentry errors related to queries for pending builds that have timed out, and I haven't seen any more come in so far. 🎊
I'm currently somewhat blocked on the second MR (related to running builds). Marius attempted to run the optimization on the production replica and it timed out, but it does not time out on the production archive. ATM, I only have access to the archive. I've filed an access request. I'm not sure exactly how I will optimize it from here once I get access.
@cheryl.li, I'm also blocked because I don't have a clear path forward to optimize this query ATM, although I'm actively discussing with team members and working on some database-lab experiments.
@jreporter @cheryl.li, I've added issues for 2 more strategies we might use to mitigate the timeouts and make this worker more resilient to them. I did start to move forward with #334399 (closed), given this is P1. However, I just wanted to confirm that I should still be focusing on this, considering the solution scope has changed.
I'm also still exploring whether to create an issue for this piece:
Potentially a new worker that gets scheduled when builds start running. (This depends on whether we are completely blocked on optimizing the existing queries/indexes, and I'm currently investigating some index changes, so I'll wait to open a new issue; the sketch below shows the rough shape of the idea.)
The indexes take almost a full workday to add, so it's slow to investigate possibilities here.
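For what it's worth, here's the rough shape I have in mind for the per-build worker idea mentioned above (purely hypothetical; the class name, threshold, and drop reason are placeholders):

```ruby
# Purely hypothetical sketch: schedule a check for each build when it starts
# running, instead of sweeping the whole ci_builds table from a single worker.
class DropStuckRunningBuildWorker # hypothetical name
  include ApplicationWorker

  def perform(build_id)
    build = Ci::Build.find_by(id: build_id)
    return unless build&.running?

    # assumed staleness threshold and drop reason
    build.drop(:stuck_or_timeout_failure) if build.updated_at < 1.hour.ago
  end
end

# e.g. enqueued wherever the build transitions to running:
# DropStuckRunningBuildWorker.perform_in(1.hour, build.id)
```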
@allison.browne - if it's reasonable, I'd like to focus on another P1 and get back to this one after that P1 is delivered. If you need help choosing one, let me know! Cc @cheryl.li
@andrewn I'm wondering if we can get your opinion on how we can move forward with this infradev severity1 issue. @allison.browne highlighted a few other approaches she can take #291010 (comment 607031753), but at this point I'd say there's not a clear path forward on how to best mitigate the remaining issues. We do have a few MRs merged in %14.0, but this is an issue that will take a number of iterations and MRs to collectively address the problem. It has not been prioritized in %14.2 given other priorities, but I'd also like your feedback on whether this should still be considered an S2.
@jreporter No I'd say the way forward is still not entirely clear. Let me ping a few folks to see if we feel like the severity1 is still accurate here.
@clefelhocz1 Allison is on PTO and the related MR is in review (workflow::in review). We have a few other pieces of work finishing up, and another engineer can pick this up next (although the ramp-up time might be significant, and she will be back on Monday).
Yeah @cheryl.li that makes perfect sense. #334399 (closed) represents the work for a good idea that came out of the first refactoring !68891 (merged), so we should get it into production and then revisit this issue to see how the timings are.
@cheryl.li, I wanted to note here that our efforts so far have already led to some improvements. In the last 8 days, the query for running builds has timed out 9 times:
This is compared to the original Kibana logs posted, which showed the queries timing out 72 times in a 24-hour period. The job ran once an hour and was retried up to 3 times.
So the issue title 'StuckCiJobsWorker always times out' is no longer accurate, but there are still areas for improvement, as it still sometimes times out. The queries behave erratically depending on the state of Postgres vacuuming: we sometimes see sub-300ms performance for the same queries that, on rare occasions, take many seconds and cause a PG::QueryCanceled.
@cheryl.li @allison.browne It seems that we managed to fix the problem with canceling outdated pending builds with StuckCiJobsWorker! This is really good; it has a significant positive impact on our production systems, especially the new pending builds queuing system.
I suggest that we close this issue (Ci::StuckJobsWorker has been fixed) and open a new issue about Ci::StuckBuilds::DropRunningWorker timing out. The scope of the new issue will be much smaller and the severity is also lower. /cc @clefelhocz1
Async update: I still saw some timeouts with the new query.
So I opened a new MR, which is merged, and I have been rolling that out. So far so good when it comes to the rollout, but I'd like to wait a week to check performance since the queries behave erratically.
I have a hypothesis here that the updates triggered by the BuildFinishedWorker may be what causes the high number of dead tuples and dirty rows that lead to longer execution times and timeouts. I've noticed that the timeouts tend to start around the time period when we execute the most BuildFinishedWorker jobs, which is 12:00 UTC on weekdays.
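If it helps to sanity-check that hypothesis, a quick console query against pg_stat_user_tables (just a suggestion, not something we log today) would show whether dead tuples on ci_builds spike around that window:

```ruby
# Hypothetical console check: dead-tuple counts and last autovacuum for ci_builds,
# to compare against the 12:00 UTC weekday timeout window.
ActiveRecord::Base.connection.execute(<<~SQL).to_a
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
  FROM pg_stat_user_tables
  WHERE relname = 'ci_builds'
SQL
```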
@allison.browne So it looks like we've solidly narrowed the problem to the DropRunningService, the others look great. What do you think of:
- Moving the read-query to a replica? The majority of the timeout failures are happening on the first (read) query. #332617
- Batching the read? If we can set up the worker to at least make progress on every run, we'll at least get some amount of work done and maybe eventually catch up. Not sure how this would fit into the cache-warming idea we've talked about, but it might.
@drew, yeah, this morning I've been seeing some issues with SELECT queries in the DropRunningService still despite the optimizations. It looked like there were some vacuuming issues caused by the incident this morning, but I'm not sure how related that is.
> Moving the read-query to a replica? The majority of the timeout failures are happening on the first (read) query.
Sounds good. I know we were waiting on support for idempotent jobs, but we could also remove the idempotent designation for now, just to see how read-only performs.
> Batching the read? If we can set up the worker to at least make progress on every run, we'll at least get some amount of work done and maybe eventually catch up. Not sure how this would fit into the cache-warming idea we've talked about, but it might.
Can you elaborate on this one? I would say we are currently 'batching the read', but we could change our approach to how we 'batch the read'. Right now we select 100 in one query, update those builds with 100+ queries, and then select 100 more until they are all dropped and none can be selected.
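In other words, something like this simplified sketch (not the exact service code; the limit and drop reason mirror the description above):

```ruby
# Simplified sketch of the current approach as described above: select up to 100
# stuck running builds, drop each one (an UPDATE plus callbacks per build), then
# select again until nothing matches.
loop do
  builds = Ci::Build
    .where(status: 'running')
    .where('updated_at < ?', 1.hour.ago)
    .limit(100)
    .to_a

  break if builds.empty?

  builds.each { |build| build.drop(:stuck_or_timeout_failure) } # assumed drop reason
end
```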
I had started an MR to use EachBatch or keyset pagination instead of the current approach, which relies on the last 100 jobs having been updated and therefore no longer being part of the result set for the next SELECT query with a limit of 100. It's possible that could improve things, although it would be an experiment.
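A minimal sketch of the EachBatch variant, assuming the same filters (an illustration of the idea, not the MR's code):

```ruby
# Minimal EachBatch sketch (assumption: same filters as above). Batches are keyed
# by id, so the next SELECT doesn't depend on the previous batch having been
# updated out of the result set.
Ci::Build
  .where(status: 'running')
  .where('updated_at < ?', 1.hour.ago)
  .each_batch(of: 100) do |batch|
    batch.each { |build| build.drop(:stuck_or_timeout_failure) } # assumed drop reason
  end
```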
I have noticed that the job rarely gets through the first `SELECT ... LIMIT 100` statement when it fails (based on json.db_primary_count in Kibana), which makes me think that the updates causing vacuuming to be out of date are not the updates in the job itself, but the updates happening outside the stuck-job workers.
Since the first query is the same with EachBatch, and the first query is typically the one that fails, I hypothesize that switching to EachBatch won't improve things much.
I do wonder if knocking down the batch size may help, but I also wonder if we've done all we can in the worker itself to optimize this and the problems lie outside the worker. For instance, large table size -> needs partitioning.
> Sounds good. I know we were waiting on support for idempotent jobs, but we could also remove the idempotent designation for now, just to see how read-only performs.
Ah right. I think both of us just wrote reminders into that issue (now linked above).
> I would say we are currently 'batching the read', but we could change our approach to how we 'batch the read'. Right now we select 100 in one query, update those builds with 100+ queries, and then select 100 more until they are all dropped and none can be selected.
Where do we do this? I'm looking at the current state of DropRunningService and it looks like we have a go at the whole table at once.