Geo: Be more descriptive about why a job has been skipped
Description
The geo.log
on my local machine contained the following:
{"severity":"INFO","time":"2018-12-04T04:56:16.083Z","class":"Geo::RepositoryVerification::Secondary::ShardWorker","message":"Started scheduler","job_id":"a8d311b897cb55b127c702d3","shard":"default"}
{"severity":"INFO","time":"2018-12-04T04:56:16.088Z","class":"Geo::RepositoryVerification::Secondary::ShardWorker","message":"Finished scheduler","job_id":"a8d311b897cb55b127c702d3","shard":"default","total_loops":0,"duration":0.007761,"reason":"skipped"}
With the most interesting part being "reason":"skipped"
. Now, looking at how reason
is populated in Geo::Scheduler::SchedulerWorker#perform()
, we have the following:
reason = loop do
break :node_disabled unless node_enabled?
break :skipped if should_be_skipped?
update_jobs_in_progress
update_pending_resources
break :over_time if over_time?
break :complete unless resources_remain?
# If we're still under the limit after refreshing from the DB, we
# can end after scheduling the remaining transfers.
last_batch = reload_queue?
schedule_jobs
@loops += 1
break :last_batch if last_batch
break :lease_lost unless renew_lease!
sleep(1)
end
Which is more a reason type as it's essentially an enum of reason types. Looking at the code above, it's not clear why the job has been skipped, only that is has been skipped because should_be_skipped?()
has returned true
. The actual reason in my case was the GeoBackoffDelay.should_be_skipped?()
returned true.
Proposal
I propose we also include more detail as to why it's been skipped to aid in debugging, perhaps a reason_detail
, which could look like:
{"severity":"INFO","time":"2018-12-04T04:56:16.088Z","class":"Geo::RepositoryVerification::Secondary::ShardWorker","message":"Finished scheduler","job_id":"a8d311b897cb55b127c702d3","shard":"default","total_loops":0,"duration":0.007761,"reason":"skipped","reason_detail":"GeoBackoffDelay: Backoff of 5 mins in affect"}